«Bible et Informatique: méthodes, outils, résultats»,
Jérusalem, 9-13 Juin 1988 [[319]]
COMPUTER ASSISTED IDENTIFICATION AND RECONSTRUCTION OF FRAGMENTARY MANUSCRIPTS
(Papyri, Leather, Paper): CHESTER BEATTY GREEK PAPYRUS 5 (Genesis) = RAHLFS
962
Reported by Robert Kraft (University of Pennsylvania) on behalf of his
Graduate Seminar Research Team (T. Bergren, N. Hubler, A. Humm, R. Kraemer,
D. Louder, K. Onesti, T. Smith, J. Treat, B. Wright)
Numerous unidentified fragments of ancient writings in various languages
are preserved in the storerooms and collections of our Libraries and Museums,
to mention only the most obvious places. Some of these fragments come from
hitherto unknown or lost writings, while others are pieces of known literature
but defy exact identification for various reasons: too few letters are legible,
the fragmentary text varies significantly from the more fully known version,
adequate indices of known works are not available through which to identify
the fragment, etc. Usually such fragments remain unpublished and relatively
unknown except to curators or to other experts whose
opinions might be solicited. Occasionally the scholarly world, or even a
wider world of potentially interested readers, is provided a glimpse of these
frustrating and mysterious treasures -- usually through the publication of
photographs.
The advent of computer technology is providing new tools and new possibilities
for working with such materials. The trained computer can serve as a super
index for searching texts that have been put into computer accessible form.
The computer can be told to ignore such potentially misleading matters as
upper or lower case letters, diacritics, spacing between words, or even
anticipated spelling variants/errors. Fragments containing only a few letters
on two or more consecutive lines, or on front and
back of a page, may provide the tireless computer with enough information
to identify where such combinations exist at the expected intervals from
each other in extant texts. The formats and line structures of fragmentary
pages or sections can be plotted on computer and manipulated to determine
the most likely reconstruction. Ancient calligraphy can be imitated and
reproduced to show approximately how the reconstructed portion may have
looked. And, of course, interim worksheets as well as final results can
be printed up for more traditional examination and distribution.
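By way of illustration only, the following short sketch (in a modern scripting language; the seminar's own programs ran on the machines described below) shows one simple way such a search can work: the known text is normalized by dropping case, diacritics, and spacing, and the surviving letters of two consecutive fragment lines are then sought at roughly the expected interval. All function names, parameters, and sample values here are hypothetical, not taken from the project's actual software.

    import unicodedata

    def normalize(text):
        # Fold case and strip diacritics, spaces, and punctuation, so that
        # only the bare letters remain for position-by-position comparison.
        decomposed = unicodedata.normalize("NFD", text.lower())
        return "".join(c for c in decomposed
                       if c.isalpha() and not unicodedata.combining(c))

    def find_fragment(known_text, frag_line1, frag_line2, line_length, slack=2):
        # Report every offset in the known text where the letters of the first
        # fragment line occur and the letters of the second fragment line
        # follow at about one reconstructed line-length's distance.
        text = normalize(known_text)
        l1, l2 = normalize(frag_line1), normalize(frag_line2)
        hits, start = [], text.find(l1)
        while start != -1:
            window = text[start + line_length - slack:
                          start + line_length + slack + len(l2)]
            if l2 in window:
                hits.append(start)
            start = text.find(l1, start + 1)
        return hits

Each reported hit is only a candidate; the front/back test of a codex leaf and the expected page format then narrow the possibilities further.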
The following descriptions and discussions are intended both to illustrate
the processes by which computer assisted research can be applied to the
study of hitherto unidentified fragments of ancient literature, and to provide
significant results for the selected area of scholarly research on ancient
Greek biblical texts. We have chosen as the primary focus Chester Beatty
Papyrus 5, a Codex of Genesis (Rahlfs 962), which in many ways provided an
ideal basis from which to operate. We realize that this is an
exceptional case -- not many ancient manuscripts will be as accessible
or as extensive -- but we also believe that the [[320]]
techniques described here are, with appropriate adjustments, widely applicable
to the study of other such fragmentary materials.
Getting Started: Computer Assisted Reconstruction of MS 962
The Chester Beatty 5 Codex fragments provided an obvious candidate for
developing and testing computer approaches to fragment identification. This
was clear to Kraft already in the early 1980s, when he made a few probes
into the subject -- see Discover: The Newsmagazine of Science (Feb
1984), p.81. Photographs and transcriptions of the major portions of the preserved
materials had been published by F. C. Kenyon in 1936, and A. Pietersma corrected
and updated Kenyon's work in 1977 in connection with his study of the text-critical
significance of the manuscript. Furthermore, Pietersma published photographic
reproductions not only of the additional fragments he was able to identify
in the Chester Beatty archives, but also of all the remaining unidentified
fragments of which he was aware that seemed to come from the same manuscript.
The possibilities were exciting. On the basis of the identified portions,
the approximate page and line formats of the codex could be determined and
replicated to some extent. Since the edition of Greek Jewish Scriptures
by A. Rahlfs was already available on computer from the Thesaurus Linguae
Graecae Project [TLG], it was a small step to reformat Genesis into the expected
line and page lengths, without word division or diacritic markings. Then
this rough replica of MS 962 could be searched and examined for probable
locations of small fragments, without needing to worry about where word
divisions might fall within the preserved letters. Letters from adjacent
lines would appear on the screen in their approximate original locations.
Front and back of a reconstructed page could be examined easily for appropriate
matches. Fragments from the beginning of lines, or from the ends, could often
be found in their expected locations on the computer replica pages. Problems
were anticipated -- textual variations between the fragments and the Rahlfs
text, abbreviated words, spaces left between letters in the original text
of 962 -- but they were often predictable (orthographical itacisms, abbreviations
of numbers and nomina sacra).
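A minimal sketch of that reformatting step, assuming purely illustrative figures of about 28 letters per line and 32 lines per page (the real dimensions were of course taken from the identified leaves of 962), simply cuts the continuous, normalized text stream into replica lines and pages:

    def make_replica(normalized_text, letters_per_line=28, lines_per_page=32):
        # Cut a continuous text stream (already stripped of word division and
        # diacritics, as in the earlier sketch) into replica lines, and the
        # lines into replica pages, so that candidate fragment positions can
        # be viewed in their approximate original layout.
        lines = [normalized_text[i:i + letters_per_line]
                 for i in range(0, len(normalized_text), letters_per_line)]
        return [lines[i:i + lines_per_page]
                for i in range(0, len(lines), lines_per_page)]

Displaying a replica page together with the page standing on the other side of the same leaf makes the front/back comparison described above straightforward.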
Overview of the Experiment and its Results
In the spring term of 1988, Kraft's advanced graduate seminar undertook
a systematic computer assisted study of the 962 materials, with the primary
goal of identifying as many of the fragments as possible. Appropriate files
and programs were set up on three different computer systems: the IBYCUS
mini computer, the Apple Macintosh, and the IBM/DOS type computer. Various
participants took responsibility for various special aspects of the study
-- paleography, special markings in the MS, computer programs, etc. Once
the files were set up and appropriate programs were in place, a number of
identifications were made [[321]] with relative
ease. For the remaining unidentified fragments, it was assumed that special
problems (especially variant texts) might exist, and thus more careful examination
was made, based partly on the known textual tendencies in the identified
portions of the manuscript. Ultimately, we were able to identify with relative
confidence all but 12 fragments of the 43 unidentified scraps of 962 published
in photographic plates by Pietersma. We suspect that four of those 12 are
not from 962 after all, leaving 8 still to be located. Such a success rate
cannot, of course, be expected with all such projects -- scroll fragments,
for example, are more difficult to locate since there is no possibility
of using the writing on the back side of the page for further clues. But
the value of the computer as a tool in this type of work was demonstrated
spectacularly in this experiment, and we look forward to applying these methods
to other fragments in the near future.
--
«Bible et Informatique: méthodes, outils, résultats», Jérusalem, 9-13
Juin 1988 [[322]]
CREATING AND USING A CD-ROM (LASER DISK) DATA BANK FOR BIBLICAL AND
OTHER TEXTUAL STUDIES
by Robert Kraft (University of Pennsylvania)
Compact disk read-only memory (CD-ROM) technology offers many advantages
for computer assisted study of textual material, as is amply illustrated
in the demonstrations of the IBYCUS Scholarly computer and its CD-ROM access
capabilities. Massive amounts of textual data (550 megabytes) can
be stored and distributed easily and relatively inexpensively on one small
(five inch) disk. The contents of such a laser disk cannot be modified on
that disk, thus providing a secure check-point for further correction and
development -- a fact that is especially comforting to librarians and archivists.
Assuming that the disk has internally consistent coding and formatting, the
same software can be used to access all its materials.
There are also disadvantages. Errors cannot be corrected until a new
disk is mastered (or, in some instances, worked around through the accessing software),
and mastering a CD-ROM is not a simple or inexpensive process (typically
US$3000). Hardware and software to access the laser disk, and subscription
or purchase of the disk itself, also require significant funding (typically,
about US$1000). Adequate accessing software may not be readily available
for any particular machine.
Since early in 1986, and with the support of the David and Lucile Packard
Foundation, the Center for Computer Analysis of Texts (CCAT) at the University
of Pennsylvania has been gathering and processing biblical and other text
data for inclusion on an experimental CD-ROM. The first fruits of this
labor are now available, including samples and contributions from a number
of projects, in a disk produced in cooperation with the Packard Humanities
Institute (PHI). The CCAT portion of the PHI/CCAT Demonstration CD-ROM focuses
on biblical materials and includes full texts, in ASCII coding, of the following
Bibles: Hebrew, Old Greek and NT, Latin Vulgate, Authorized (King James)
English with Apocrypha, and Revised Standard English with Apocrypha. It also
has sample portions of the Syriac, Sahidic Coptic, and Armenian, as well
as the Aramaic Pentateuch Targums Neofiti and Ps-Jonathan, and the pre-Qumran
Targum of Job. Other related tools include the parallel aligned Hebrew and
Greek Jewish scriptures (part 1), the morphologically analyzed Old Greek
and NT, samples of the Old Greek textual variant files, a short dictionary
of NT Greek, and a lengthy Latin word list. The PHI portion of the disk
includes more than 40 classical Latin texts and some Greek inscriptions.
CCAT has also appended a number of miscellaneous texts in various languages
(Arabic, French, Danish, Italian, English, Sanskrit, Tibetan) and formats
to encourage wide experimentation. [[323]]
Since many of the texts were contributed by other projects or individuals,
and were originally produced in a variety of codings, the task of imposing
a consistent scheme for the location markers (called "ID codes," to identify
where things are located in any body of texts -- e.g. title, page and line,
chapter and verse, etc.), internal formats, and foreign character representation
in these texts was formidable. Thus three separable sets of problems needed
to be faced in attempting to facilitate use of all these texts on a single
CD-ROM:
(1) Ideally -- to facilitate searching and manipulating the texts as
well as generating appropriate fonts for screen and printer -- each
text in a given language (e.g. Sanskrit) should contain the same transcriptional
coding, even though different systems may have been used by different contributors.
This ideal was only partly realized in the present PHI/CCAT disk, where
different coding schemes are still present for some of the Sanskrit and Arabic
texts, although for each of the other languages, including the remaining
Semitic languages as a group (Hebrew, Aramaic, Syriac), consistent coding
has been imposed.
(2) With regard to internal format of the various texts and related
materials (columns, paragraphs, poetic structures, language shifts, titles
and headings, textual notations, etc.), a degree of consistency was achieved
in the CCAT materials, but again, the ideal was not always reached, for
a variety of reasons. A conscious attempt was made to include a number of
visually different formats to encourage experimentation and to test the
value of each. Thus the CCAT materials include examples of each of the following:
simple flat files containing consecutive text in a single language; similar
multilingual consecutive files that switch in and out of the appropriate
languages; simple multilingual vertical parallel files in which each language
or type of analysis is contained in a specific column (e.g. the parallel
Hebrew and Greek text, the morphologically analyzed Greek, 5 Ezra); simple
multilingual horizontal parallel files in which each language is contained
on its own line, under or above the parallel material (e.g. 3 Corinthians);
complex multilingual parallel files in which the horizontal and vertical
formats are mixed (e.g. Origen's Homily on Jeremiah 2.21f, where
Greek and Latin texts are aligned horizontally while the respective English
translations are in parallel vertical columns to the left and right). The
aim was to work with "flat" files (not fragmented data-base management type
files), for easy access and editing, while at the same time building in certain
relational features (parallel blocks, notes, etc.); and also to experiment
with various ways of marking shifts from one language or type of material
to another in such flat files. The general principle was to let the textual
data itself drive the software development, rather than constricting the
possibilities in advance by adopting a particular software approach. [[324]]
(3) At the level of "ID coding," the PHI/CCAT CD-ROM is internally
consistent, and follows the system developed by the Thesaurus Linguae Graecae
(TLG) project, which has encoded almost all Greek literature to about 500
CE and has made it available on CD-ROM as well. In this and all other important
formatting aspects, the PHI/CCAT CD-ROM is compatible with the TLG disk.
The decision to follow TLG was both arbitrary and practical. The problems
presented by this approach seemed relatively insignificant in relation to
the benefits gained -- this did not seem to be the appropriate time to tamper
with what was already in place and to introduce yet another ID scheme! Thus
the TLG codes were successfully adapted to various types of material, including
the Index to the Journal of Biblical Literature, a Greek Dictionary
to the New Testament, commentaries on Dante's Divine Comedy, and my great-grandmother's
1880 diary. The resulting compatibility takes a large step
in the direction of simplifying the tasks of software developers in relation
to these texts, since even if one chooses not to retain the TLG ID coding,
a single program can be applied to all these texts to change them into
the desired form. In fact, CCAT provides just such a program (called CONVERT,
by Jay Treat in its most recent release [1988]), which adds explicit location
indicators to each line (as sketched below) so that the TLG and/or PHI/CCAT texts can be searched
and browsed in any available word-processing (or similar) program, at the
user's discretion.
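The CONVERT program itself is not reproduced here, but the general operation it performs can be sketched roughly as follows, assuming a purely hypothetical marker syntax (a line such as ~c3v14 signaling a change of chapter and verse) rather than the actual TLG/CCAT encoding: the current location is carried forward and written explicitly at the head of every text line.

    import re

    # Hypothetical marker syntax: "~c3v14" sets the current chapter and verse;
    # every other line is text belonging to that location.
    MARKER = re.compile(r"^~c(\d+)v(\d+)$")

    def add_explicit_ids(lines, book="Gen"):
        chapter, verse = 1, 1
        labeled = []
        for line in lines:
            m = MARKER.match(line.strip())
            if m:
                chapter, verse = int(m.group(1)), int(m.group(2))
            else:
                # Prefix the text line with an explicit, human-readable citation.
                labeled.append(f"{book} {chapter}:{verse}\t{line}")
        return labeled

The labeled output can then be searched or browsed in any ordinary word-processing or text-handling program, which is the effect CONVERT is designed to achieve.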
At present, the TLG and PHI/CCAT CD-ROMs cannot be used effectively
on machines other than the IBYCUS Scholarly Computer, but CCAT is developing
software for offloading, browsing and searching the files on various other
microcomputers. An OFFLOAD program is available for IBM/DOS machines to
permit the user to transfer texts from the CD-ROM to other media, and thus
to manipulate the data with programs of the user's choice. It is hoped that
software to search and browse the CD-ROM directly (as with IBYCUS) will be
available in the next few months.
Future developments of great potential significance include indexing
the texts for more efficient use and creating integrating software to permit
easy, linked access to the data bank, for "hypertextual" studies. If the
"experimental" PHI/CCAT CD-ROM proves to be as useful as we anticipate, and
if the relevant projects continue to support this effort by permitting their
data to be included, CCAT will commit itself to updating and expanding the
collection in a series of future CD-ROMs to facilitate progress in biblical
and related studies.
[Added Note: The author also conducted a number of demonstrations at
the conference, introducing CD-ROM technology and showing its use on the
IBYCUS Scholarly Computer and on the IBM PC (through the OFFLOAD program)
with the TLG and the PHI/CCAT disks.]
//end; edited 11 October 2013 by RAK; scanned by Jeremy Fedus//