«Bible et Informatique: méthodes, outils, résultats», Jérusalem, 9-13 Juin 1988    [[319]]


Reported by Robert Kraft (University of Pennsylvania) on behalf of his Graduate Seminar Research Team (T. Bergren, N. Hubler, A. Humm, R. Kraemer, D. Louder, K. Onesti, T. Smith, J. Treat, B. Wright)

Numerous unidentified fragments of ancient writings in various languages are preserved in the storerooms and collections of our Libraries and Museums, to mention only the most obvious places. Some of these fragments come from hitherto unknown or lost writings, while others are pieces of known literature but defy exact identification for various reasons: too few letters are legible, the fragmentary text varies significantly from the more fully known version, adequate indices of known works are not available through which to identify the fragment, etc. Usually such fragments remain unpublished and relatively unknown except to curators or to other experts whose opinions might be solicited. Occasionally the scholarly world, or even a wider world of potentially interested readers, is provided a glimpse of these frustrating and mysterious treasures -- usually through the publication of photographs.

The advent of computer technology is providing new tools and new possibilities for working with such materials. The trained computer can serve as a super index for searching texts that have been put into computer accessible form. The computer can be told to ignore such potentially misleading matters as upper or lower case letters, diacritics, spacing between words, or even anticipated spelling variants/errors. Fragments containing only a few letters on two or more consecutive lines, or on front and
back of a page, may provide the tireless computer with enough information to identify where such combinations exist at the expected intervals from each other in extant texts. The formats and line structures of fragmentary pages or sections can be plotted on computer and manipulated to determine the most likely reconstruction. Ancient calligraphy can be imitated and reproduced to show approximately how the reconstructed portion may have looked. And, of course, interim worksheets as well as final results can be printed up for more traditional examination and distribution.
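The matching strategy just described can be sketched in modern terms (Python here; the seminar's actual programs, which ran on IBYCUS, Macintosh, and DOS machines, are not reproduced). The function names and the tolerance parameter are illustrative assumptions, not part of the original work:

```python
import unicodedata

def normalize(text):
    """Reduce text to its bare letter sequence: strip diacritics,
    word spacing, punctuation, and case distinctions."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed
                   if unicodedata.category(c).startswith("L")).lower()

def find_fragment(text, snippets, interval, tol=3):
    """Locate places where the legible letters of successive fragment
    lines (`snippets`) occur about `interval` letters apart -- one
    physical line of the original page -- in the normalized text."""
    norm = normalize(text)
    hits = []
    pos = norm.find(snippets[0])
    while pos != -1:
        # check that each later snippet falls near its expected offset
        if all(snip in norm[max(pos + k * interval - tol, 0):
                            pos + k * interval + tol + len(snip)]
               for k, snip in enumerate(snippets[1:], 1)):
            hits.append(pos)
        pos = norm.find(snippets[0], pos + 1)
    return hits
```

Because the search runs on the normalized letter stream, accents, word division, and capitalization in the reference text cannot hide a match; anticipated spelling variants would need an extra substitution step.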

The following descriptions and discussions are intended both to illustrate the processes by which computer assisted research can be applied to the study of hitherto unidentified fragments of ancient literature, and to provide significant results for the selected area of scholarly research on ancient Greek biblical texts. We have chosen as the primary focus Chester Beatty Papyrus 5, a Codex of Genesis (Rahlfs 962), which in many ways provided an ideal basis from which to operate. We realize that this is an
exceptional case -- not many ancient manuscripts will be as accessible or as extensive -- but we also believe that the [[320]] techniques described here are, with appropriate adjustments, widely applicable to the study of other such fragmentary materials.

Getting Started: Computer Assisted Reconstruction of MS 962

The Chester Beatty 5 Codex fragments provided an obvious candidate for developing and testing computer approaches to fragment identification. This was clear to Kraft already in the early 1980s, when he made a few probes into the subject -- see Discover: The Newsmagazine of Science (Feb 1984), p.81. Photographs and transcriptions of the major portions of the preserved materials had been published by F. C. Kenyon in 1936, and A. Pietersma corrected and updated Kenyon's work in 1977 in connection with his study of the text-critical significance of the manuscript. Furthermore, Pietersma published photographic reproductions not only of the additional fragments he was able to identify in the Chester Beatty archives, but also of all the remaining unidentified fragments of which he was aware that seemed to come from the same manuscript.

The possibilities were exciting. On the basis of the identified portions, the approximate page and line formats of the codex could be determined and replicated to some extent. Since the edition of Greek Jewish Scriptures by A. Rahlfs was already available on computer from the Thesaurus Linguae Graecae Project [TLG], it was a small step to reformat Genesis into the expected line and page lengths, without word division or diacritic markings. Then this rough replica of MS 962 could be searched and examined for probable locations of small fragments, without needing to worry about where word divisions might fall within the preserved letters. Letters from adjacent lines would appear on the screen in their approximate original locations. Front and back of a reconstructed page could be examined easily for appropriate matches. Fragments from the beginning of lines, or from the ends, could often be found in their expected locations on the computer replica pages. Problems were anticipated -- textual variations between the fragments and the Rahlfs text, abbreviated words, spaces left between letters in the original text of 962 -- but they were often predictable (orthographical itacisms, abbreviations of numbers and nomina sacra).
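A rough replica of this kind is easy to mock up. The sketch below (Python; the line and page dimensions are placeholders, not the measured format of MS 962) reflows a continuous text into fixed-width lines and fixed-length pages, discarding word division and diacritics first:

```python
def make_replica(text, letters_per_line=22, lines_per_page=27):
    """Reflow a text into pages of fixed line and page length,
    imitating the layout of a codex. Only bare lowercase letters
    are kept, so preserved fragment letters can be matched without
    regard to word division."""
    stream = "".join(c for c in text.lower() if c.isalpha())
    lines = [stream[i:i + letters_per_line]
             for i in range(0, len(stream), letters_per_line)]
    return [lines[i:i + lines_per_page]
            for i in range(0, len(lines), lines_per_page)]
```

In such a replica the front and back of a leaf are consecutive pages, so a fragment tentatively placed on one side can be checked at once against the corresponding position on the other.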

Overview of the Experiment and its Results

In the spring term of 1988, Kraft's advanced graduate seminar undertook a systematic computer assisted study of the 962 materials, with the primary goal of identifying as many of the fragments as possible. Appropriate files and programs were set up on three different computer systems: the IBYCUS mini computer, the Apple Macintosh, and the IBM/DOS type computer. Various participants took responsibility for various special aspects of the study -- paleography, special markings in the MS, computer programs, etc. Once the files were set up and appropriate programs were in place, a number of identifications were made [[321]] with relative ease. For the remaining unidentified fragments, it was assumed that special problems (especially variant texts) might exist, and thus more careful examination was made, based partly on the known textual tendencies in the identified portions of the manuscript. Ultimately, we were able to identify with relative confidence all but 12 fragments of the 43 unidentified scraps of 962 published in photographic plates by Pietersma. We suspect that four of those 12 are not from 962 after all, leaving 8 still to be located. Such a success rate cannot, of course, be expected with all such projects -- scroll fragments, for example, are more difficult to locate since there is no possibility of using the writing on the back side of the page for further clues. But the value of the computer as a tool in this type of work was demonstrated spectacularly in this experiment, and we look forward to applying these methods to other fragments in the near future.


«Bible et Informatique: méthodes, outils, résultats», Jérusalem, 9-13 Juin 1988 [[322]]


by Robert Kraft (University of Pennsylvania)

Compact disk read-only memory (CD-ROM) technology offers many advantages for computer assisted study of textual material, as is amply illustrated in the demonstrations of the IBYCUS Scholarly Computer and its CD-ROM access capabilities. Massive amounts of textual data (550 megabytes) can be stored and distributed easily and relatively inexpensively on one small (five-inch) disk. The contents of such a laser disk cannot be modified on that disk, thus providing a secure checkpoint for further correction and development -- a fact that is especially comforting to librarians and archivists. Assuming that the disk has internally consistent coding and formatting, the same software can be used to access all its materials.

There are also disadvantages. Errors cannot be corrected until a new disk is mastered (or, in some instances, worked around through the accessing software), and mastering a CD-ROM is neither simple nor inexpensive (typically US$3000). Hardware and software to access the laser disk, and subscription to or purchase of the disk itself, also require significant funding (typically about US$1000). Adequate accessing software may not be readily available for any particular machine.

Since early in 1986, and with support of the David and Lucile Packard Foundation, the Center for Computer Analysis of Texts (CCAT) at the University of Pennsylvania has been gathering and processing biblical and other text data for inclusion on an experimental CD-ROM. The first fruits of this labor are now available, including samples and contributions from a number of projects, in a disk produced in cooperation with the Packard Humanities Institute (PHI). The CCAT portion of the PHI/CCAT Demonstration CD-ROM focuses on biblical materials and includes full texts, in ASCII coding, of the following Bibles: Hebrew, Old Greek and NT, Latin Vulgate, Authorized (King James) English with Apocrypha, and Revised Standard English with Apocrypha. It also has sample portions of the Syriac, Sahidic Coptic, and Armenian, as well as the Aramaic Pentateuch Targums Neofiti and Ps-Jonathan, and the pre-Qumran Targum of Job. Other related tools include the parallel aligned Hebrew and Greek Jewish scriptures (part 1), the morphologically analyzed Old Greek and NT, samples of the Old Greek textual variant files, a short dictionary of NT Greek, and a lengthy Latin word list. The PHI portion of the disk includes more than 40 classical Latin texts and some Greek inscriptions. CCAT has also appended a number of miscellaneous texts in various languages (Arabic, French, Danish, Italian, English, Sanskrit, Tibetan) and formats to encourage wide experimentation. [[323]]

Since many of the texts were contributed by other projects or individuals, and were originally produced in a variety of codings, the task of imposing a consistent scheme for the location markers (called "ID codes," to identify where things are located in any body of texts -- e.g. title, page and line, chapter and verse, etc.), internal formats, and foreign character representation in these texts was formidable. Thus three separable sets of problems needed to be faced in attempting to facilitate use of all these texts on a single CD-ROM:

(1) Ideally -- to facilitate searching and manipulating the texts as well as generating appropriate fonts for screen and printer -- each text in a given language (e.g. Sanskrit) should contain the same transcriptional coding, even though different systems may have been used by different contributors. This ideal was only partly realized in the present PHI/CCAT disk, where different coding schemes are still present for some of the Sanskrit and Arabic texts, although for each of the other languages, including the remaining Semitic languages as a group (Hebrew, Aramaic, Syriac), consistent coding has been imposed.

(2) With regard to internal format of the various texts and related materials (columns, paragraphs, poetic structures, language shifts, titles and headings, textual notations, etc.), a degree of consistency was achieved in the CCAT materials, but again, the ideal was not always reached, for a variety of reasons. A conscious attempt was made to include a number of visually different formats to encourage experimentation and to test the value of each. Thus the CCAT materials include examples of each of the following: simple flat files containing consecutive text in a single language; similar multilingual consecutive files that switch in and out of the appropriate languages; simple multilingual vertical parallel files in which each language or type of analysis is contained in a specific column (e.g. the parallel Hebrew and Greek text, the morphologically analyzed Greek, 5 Ezra); simple multilingual horizontal parallel files in which each language is contained on its own line, under or above the parallel material (e.g. 3 Corinthians); and complex multilingual parallel files in which the horizontal and vertical formats are mixed (e.g. Origen's Homily on Jeremiah 2.21f, where Greek and Latin texts are aligned horizontally while the respective English translations are in parallel vertical columns to the left and right). The aim was to work with "flat" files (not fragmented data-base management type files), for easy access and editing, while at the same time building in certain relational features (parallel blocks, notes, etc.), and also to experiment with various ways of marking shifts from one language or type of material to another in such flat files. The general principle was to let the textual data itself drive the software development, rather than constricting the possibilities in advance by adopting a particular software approach. [[324]]
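The vertical-parallel flat-file idea can be illustrated with a small sketch (Python; the tab separator, column order, and sample transliterations are assumptions for illustration, not the actual CCAT layout):

```python
def split_vertical_parallel(lines, languages=("hebrew", "greek")):
    """Split a flat vertical-parallel file, with one tab-separated
    column per language, back into per-language lists of entries.
    Because the file is flat, it can also be edited or searched
    directly with ordinary text tools."""
    streams = {lang: [] for lang in languages}
    for line in lines:
        for lang, col in zip(languages, line.rstrip("\n").split("\t")):
            streams[lang].append(col)
    return streams
```

The same flat record can thus serve both column-oriented display and per-language processing without a database layer, which is the design principle described above.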

(3) At the level of "ID coding," the PHI/CCAT CD-ROM is internally consistent, and follows the system developed by the Thesaurus Linguae Graecae (TLG) project, which has encoded almost all Greek literature to about 500 CE and has made it available on CD-ROM as well. In this and all other important formatting aspects, the PHI/CCAT CD-ROM is compatible with the TLG disk. The decision to follow TLG was both arbitrary and practical. The problems presented by this approach seemed relatively insignificant in relation to the benefits gained -- this did not seem to be the appropriate time to tamper with what was already in place and to introduce yet another ID scheme! Thus the TLG codes were successfully adapted to various types of material, including the Index to the Journal of Biblical Literature, a Greek Dictionary to the New Testament, commentaries to Dante's Divine Comedy, and my great grandmother's 1880 diary. The resulting compatibility takes a large step toward simplifying the tasks of software developers in relation to these texts, since even if one chooses not to retain the TLG ID coding, a single program can be applied to all these texts to change them into the desired form. In fact, CCAT provides just such a program (called CONVERT, written by Jay Treat in its most recent release [1988]), which adds explicit location indicators to each line so that the TLG and/or PHI/CCAT texts can be searched and browsed in any available word-processing (or similar) program, at the user's discretion.
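The effect of a CONVERT-style expansion can be shown with a toy marker scheme (Python; the `~a{...}` markers are a simplified stand-in for the TLG citation system, not its actual format, and this sketch is not Treat's program):

```python
import re

def add_explicit_ids(lines):
    """Expand incremental citation markers into a full, explicit
    location stamp on every text line. Marker lines set one level
    of the citation (a = book, b = chapter, c = verse); setting a
    higher level resets the levels below it."""
    book = chapter = verse = ""
    out = []
    for line in lines:
        m = re.match(r"~([abc])\{([^}]*)\}$", line)
        if m:
            level, value = m.groups()
            if level == "a":
                book, chapter, verse = value, "", ""
            elif level == "b":
                chapter, verse = value, ""
            else:
                verse = value
        else:
            out.append(f"{book} {chapter}:{verse}\t{line}")
    return out
```

After expansion, every line carries its own citation, so a plain word processor or search tool needs no knowledge of the incremental scheme.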

At present, the TLG and PHI/CCAT CD-ROMs cannot be used effectively on machines other than the IBYCUS Scholarly Computer, but CCAT is developing software for offloading, browsing and searching the files on various other microcomputers. An OFFLOAD program is available for IBM/DOS machines to permit the user to transfer texts from the CD-ROM to other media, and thus to manipulate the data with programs of the user's choice. It is hoped that software to search and browse the CD-ROM directly (as with IBYCUS) will be available in the next few months.

Future developments of great potential significance include indexing the texts for more efficient use and creating integrating software to permit easy, linked access to the data bank, for "hypertextual" studies. If the "experimental" PHI/CCAT CD-ROM proves to be as useful as we anticipate, and if the relevant projects continue to support this effort by permitting their data to be included, CCAT will commit itself to updating and expanding the collection in a series of future CD-ROMs to facilitate progress in biblical and related studies.

[Added Note: The author also conducted a number of demonstrations at the conference, introducing CD-ROM technology and showing its use on the IBYCUS Scholarly Computer and on the IBM PC (through the OFFLOAD program) with the TLG and the PHI/CCAT disks.]

//end; edited 11oc2013 by RAK; scanned by Jeremy Fedus//