TextServer: Toward a Protocol for Describing Libraries
Introduction: definitions and initial considerations
The presentations at the Technology and Classics meeting in June of 2003 illustrated many benefits of semantic markup for a variety of scholarly projects. It is striking how many of these projects can be based on document type definitions conforming to the guidelines of the Text Encoding Initiative (TEI). 1 In this paper, I want to consider a question that follows directly from these observations: how can projects with similar material distributed across the internet interoperate?
We could rephrase this question less technically as, “How do we take advantage of a growing number of electronic publications to create a distributed electronic library?” but we need to define our terms carefully. The phrases “electronic publication” and “electronic library” are widely used in common speech to mean little more than “electronic dissemination of information” and “a collection of digital data,” respectively. Because these senses of the terms subordinate or miss altogether the essential characteristics of scholarly publications and libraries, I want to stress that when I use these terms, I mean specifically “scholarly electronic publication” and “scholarly electronic library.”
In the scholarly community, publication plays a special role that results in three defining characteristics. Publication is one of several means of disseminating information and ideas, but is distinguished from other scholarly forms of discussion (such as seminars, conference presentations, etc.) because it becomes part of a permanent record that can be referred to and cited in future discussion. This presumably leads to a thoroughly reviewed piece of work that is more carefully thought through and more finished in presentation than less formal exchanges of ideas, but in any case requires the first characteristic we must demand of a publication (whether digital or not): it must be citable. That means that any electronic publication must have a fixed and explicitly identified version (or edition), and must have an explicitly identified citation scheme.
Along with explicitly identified editions and citation schemes, scholarly publications possess a second vital characteristic: they can be replicated identically, so that a citation is valid for any copy of the publication. So long as we are referring to the same edition, there is no need for different scholars to consult the same physical copy of a publication. While permanent citability sets publications apart from less formal communication, the replicability of a publication distinguishes it from archival scholarly resources. The redundancy resulting from replication provides a vital check on the integrity of the scholarly record, both pragmatically (if your library's copy is lost, you can go to another library or request a copy via interlibrary loan), and ethically (false claims about the contents of a publication would easily be exposed by consulting any other copy of the publication).
The publication's role as a permanent, citable record also leads to a third characteristic: the publication is irrevocably alienated from the author. In a very real sense, a publication, by its publication, becomes a possession of the scholarly community. An author may discover that a work contains egregious errors, but the author's only recourse is to publish a new, corrected version: the original erroneous edition cannot be recalled.
By “electronic publication,” therefore, I mean “a publication in digital form, with an explicitly identified edition and an explicitly identified citation scheme, that can be irrevocably and perfectly replicated.” The crucial interaction among publications centers on their citability. We must make it possible to identify publications and their citation schemes, and to find passages by canonical reference.
This normally happens in the library, by which I mean simply a collection of these publications. (I refer to collections of unpublished scholarly material as “archives.”) 2 In a universal library containing every known publication, every citation of another publication would be resolvable: a reader could move directly from a reference to the material referred to. For scholarly research, the dilemma of scholar and librarian alike is that a universal library is an impossibility. We might hope to approximate the dream of a comprehensive library for a specific domain, however, and scholars of the classical world are in a particularly inviting situation: vital resources such as the extant corpus of ancient Greek are finite in scope, and essentially static. 3 We might reasonably imagine, for example, that nearly all of extant ancient Greek literature exists in published editions somewhere. If no library of print publications contains every work, what would be required to create a virtual library of electronic publications assembling a substantial corpus of ancient Greek literature?
Notes to this section
2. Unpublished archival material is also an important part of our scholarly infrastructure, but I will not discuss it further here. For some observations about how digital dissemination does and does not change the relation of archival and published material in the context of archaeological fieldwork, see my chapter, “The Hacimusalar Information Technology Initiative” [add full ref: publication forthcoming in B.A.R.]
3. The continuing discussion of the new Posidippus epigrams illustrates how exciting, and how exceptional, an addition to the literary corpus is today.
Anyone familiar with the last quarter century of classical scholarship will realize that classicists should be particularly interested in this problem because of the similar questions already addressed by previous work. In particular, the rich resources and experience our discipline has gained from projects like the Thesaurus Linguae Graecae 4 and the Perseus Digital Library 5 provide the context for understanding our situation today. At the risk of oversimplifying two enormous, extended undertakings, I would suggest that the TLG is fundamentally about what kinds of scholarship are possible when a comprehensive collection of Greek texts is assembled in one place, and Perseus is fundamentally about how our understanding of our sources for a culture like ancient Greece can change when many different kinds of source material are structured to interact with each other. At a technical level, the TLG, founded in the 1970s, is a response to the potential of mass storage and rapid retrieval in digital media. The Perseus project, founded in the 1980s, is a response to richer media and higher-level data structures. The TLG project recognizes how information technology can help overcome limits of scale in our work; the Perseus project recognizes how information technology can help overcome limits of disciplinary boundaries and canons of material.
More than a decade after the creation of the World Wide Web, and seven years since the first specification of XML, the presentations at this conference contrast with the centralized collections of the TLG and Perseus in two important respects. First, at a technical level, the production of digital scholarly materials, which a decade ago was possible only at projects with a high level of technical expertise and support, is now trickling down to the individual department and even to the individual scholar. Second, at a scholarly level, the focus on semantic markup at the CHS's conference in June 2003 shows that many scholars will not be satisfied with predefined, predigested versions of scholarly materials, but will, justifiably, insist on access to the same kind of fully structured representation of a publication that they themselves are creating and using.
With these considerations in mind, let us return to the initial question: what kind of infrastructure do we need to allow interaction among projects distributed across the internet? We need to support scholarly publication with its requirements for explicit citation of replicable fixed editions, but instead of depending on a collection of publications stored in one place, we can take advantage of the activities of individuals as well as large projects, wherever they may be on the internet. The scholar's distributed digital library will function as much like the distributed domain name services that look up numeric addresses for computer names, or like the indexing of distributed Web pages in Google, as it does like Perseus or the TLG. But in contrast to all of these similes, it will work directly with the semantic structures that projects like those described at this conference are already using.
A TextServer architecture
Identifying and retrieving information across the internet is, fortunately, a generic problem. Just as classicists can profit from generic technologies surrounding markup languages like XML, we can profit from generic technologies to extend our reach to marked-up information distributed across the internet. Our desire to take advantage of information in TEI-conformant XML from scholars at a number of institutions is exactly the kind of problem that many businesses and government organizations, as well as academic institutions, are energetically working on. Conceptually, it corresponds closely to what Tim Berners-Lee, the creator of the World Wide Web, and the World Wide Web Consortium call “the semantic Web.” In contrast to the vast quantities of relatively unstructured information on the WWW in HTML, Berners-Lee emphasizes the kinds of interactions that are possible when highly structured information is accessible to computer programs using the minimal communications protocols of the WWW. Simple programs can easily be written to retrieve information in XML. By separating out the questions of how to find and retrieve XML information conforming to a known structure, we make the same information available to an unlimited number of potential applications. The architecture we need to make our distributed library possible can therefore be an example of a Web service: that is, a program providing structured information over the WWW to another program. In our case, we need to provide the services necessary to support our definition of a scholarly electronic publication. I will next describe a set of conventions for providing these services. I will refer to the conventions themselves as “the TextServer conventions” and will call a program implementing these conventions a TextServer.
My goal in defining formal conventions for a TextServer is to meet the absolute minimum requirements of citing, retrieving and replicating an electronic publication. These requirements are not unique: digital libraries must generically provide some way to identify publications, to discover their citation schemes, and to retrieve pieces of on-line publications using those citation schemes. It is not surprising that projects like the Open Archives Initiative 6 have chosen exactly the kind of architecture described here: they define protocols allowing programs to exchange structured information.
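The three minimal services named above (identifying publications, discovering their citation schemes, and retrieving passages by those schemes) can be illustrated with a small sketch. It is only an in-memory illustration under stated assumptions, not the TextServer conventions themselves: the class names, method names, identifier format, and placeholder passage strings are all hypothetical, and a real service would answer requests over the WWW with TEI-conformant XML.

```python
from dataclasses import dataclass, field


@dataclass
class Publication:
    """One fixed edition with an explicit citation scheme (hypothetical model)."""
    identifier: str          # invented identifier format, for illustration only
    edition: str             # label of the fixed edition
    citation_scheme: tuple   # names of the citation levels, outermost first
    passages: dict = field(default_factory=dict)  # canonical reference -> text


class TextServerSketch:
    """In-memory stand-in for the three minimal services (names are assumptions)."""

    def __init__(self, publications):
        self._by_id = {p.identifier: p for p in publications}

    def list_publications(self):
        # Service 1: identify the publications this server holds.
        return sorted(self._by_id)

    def citation_scheme(self, identifier):
        # Service 2: discover how a given publication is cited.
        return self._by_id[identifier].citation_scheme

    def retrieve(self, identifier, reference):
        # Service 3: retrieve one passage by canonical reference.
        passages = self._by_id[identifier].passages
        if reference not in passages:
            raise KeyError(f"{identifier} has no passage {reference!r}")
        return passages[reference]


# Placeholder strings stand in for real text; nothing here is quoted from Ptolemy.
demo = TextServerSketch([
    Publication(
        identifier="ptolemy.geography.nobbe",
        edition="Nobbe",
        citation_scheme=("book", "section"),
        passages={"1.1": "[placeholder for 1.1]", "1.2": "[placeholder for 1.2]"},
    )
])
```

A client program would first ask for the list of publications, then for the citation scheme of the one it wants, and only then request a passage such as "1.2". Keeping the three steps separate is what lets many different applications share the same server.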
While it might therefore seem that classicists could directly use the protocols of a project like the Open Archives Initiative, I believe that on closer examination current efforts to define protocols for digital libraries fall short of our needs because their notion of citation focuses on what I would term documents, rather than texts in the sense that classicists often use that word. Focusing on documents is a legitimate design choice, but we as classicists need to be aware of its implications.
When we refer to classical texts, we most often use a canonical reference system describing a logical hierarchical organization of the text independent of any specific physical version. This remarkable practice is so familiar that we often fail to recognize its consequences. Notably, it allows us to discuss a notional text at many different levels. We can equally easily refer to “Ptolemy, Geography book 1, section 2” in contexts that mean “this passage of Ptolemy—whatever edition or translation you happen to be using” or “this passage of Ptolemy in the translation by Berggren and Jones” or “this passage of Ptolemy in Nobbe's Greek edition” or even “this passage of Ptolemy in the copy on my shelf with the ink smudge in the margin.”
In our distributed electronic library, we want the fundamental scholarly activity of citing a work to take account of this hierarchical organization of our notional text. A citation should be able to point either to specific versions (e.g., to contrast Nobbe and Müller's readings for this passage), or to refer only to a notional text of Ptolemy that one reader might prefer to look up in English translation, another in German and a third in a Greek edition. We expect a citation form like “Ptol. Geo. 1.2” to be valid at any of these levels.
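One way to picture a citation that is valid at more than one level is a reference whose version is optional. The sketch below is an illustration only: the "@" notation for naming a version, the Citation class, and both function names are invented for this example and are not part of any convention described in this paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Citation:
    """A canonical reference; version is None when the notional text is meant."""
    work: str
    passage: tuple
    version: Optional[str] = None


def parse_citation(text):
    """Parse 'Work 1.2' or 'Work 1.2 @ Version' (the '@' notation is an assumption)."""
    head, _, version = text.partition("@")
    work, _, passage = head.strip().rpartition(" ")
    if not work:
        raise ValueError(f"no work abbreviation in {text!r}")
    levels = tuple(int(n) for n in passage.split("."))
    return Citation(work, levels, version.strip() or None)


def candidate_versions(citation, available):
    """List the versions that satisfy a citation.

    A citation of the notional text is satisfied by every available version;
    a citation naming a version is satisfied only by that version.
    """
    if citation.version is None:
        return list(available)
    return [v for v in available if v == citation.version]


available = ["Müller", "Nobbe", "Berggren and Jones"]
notional = parse_citation("Ptol. Geo. 1.2")
specific = parse_citation("Ptol. Geo. 1.2 @ Nobbe")
```

Here the same passage reference, book 1, section 2, is carried by both citations. Only the optional version decides whether a reader may be handed any edition or translation, or exactly one.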
Contrast the organization this practice implies with the organization of the most thorough inventory of Greek literary texts our discipline has ever produced: the Canon of the TLG project. In the TLG Canon, works are grouped by “authors” (Ptolemy, for example), but within that conventional category individual works are defined by the bibliographic source used by the TLG. Ptolemy's Geography, for example, is not one but two works: books 1-3 are the Geography as edited by Müller, books 4-8 the Geography as edited by Nobbe. Müller died after completing four books of his edition, which is an improvement in every way on Nobbe's edition, but the second (physical) volume of Nobbe's edition begins with book 4, so, apparently for this reason alone, the TLG Canon includes a second entry for Geography books 4-8.
One requirement of our TextServer then will be an inventory of works that allows citation of our notional text to operate within a hierarchy of works containing specific versions. To stay with our example, we want a representation of Ptolemy's Geography that could contain information about the separate editions of Müller and Nobbe. This inverts the document-centered organization of other digital library projects that organize works first by physical instances or editions, then index those to associate more than one edition with a single work or “author.” In this way, we will be able more directly and simply to support the habit of citation by canonical reference to a notional text.
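The work-first hierarchy described here can be sketched as a small data structure in which a single work holds any number of versions. This is a minimal illustration under stated assumptions: the class names, the flat record layout, and the build_inventory function are hypothetical, and the three version labels are simply the editions and translation of Ptolemy's Geography mentioned earlier in this paper.

```python
from dataclasses import dataclass, field


@dataclass
class Version:
    """A specific edition or translation of a work (hypothetical model)."""
    label: str
    kind: str  # "edition" or "translation"


@dataclass
class Work:
    """The notional text, which contains its versions rather than the reverse."""
    title: str
    versions: list = field(default_factory=list)


@dataclass
class AuthorGroup:
    """Conventional grouping of works under an author's name."""
    name: str
    works: dict = field(default_factory=dict)  # title -> Work


def build_inventory(records):
    """Regroup flat (author, title, version label, kind) records work-first."""
    inventory = {}
    for author, title, label, kind in records:
        group = inventory.setdefault(author, AuthorGroup(author))
        work = group.works.setdefault(title, Work(title))
        work.versions.append(Version(label, kind))
    return inventory


# Edition-first records, one per physical version, as a document-centered
# catalog might list them.
records = [
    ("Ptolemy", "Geography", "Müller", "edition"),
    ("Ptolemy", "Geography", "Nobbe", "edition"),
    ("Ptolemy", "Geography", "Berggren and Jones", "translation"),
]
inventory = build_inventory(records)
geography = inventory["Ptolemy"].works["Geography"]
```

In this arrangement the Geography appears once, however many editions exist, so a canonical reference can be attached to the work and then narrowed to one of its versions when a specific reading is wanted.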
To refer to this please cite it in this way:
Neel Smith, “TextServer: Toward a Protocol for Describing Libraries,” C. Blackwell, R. Scaife, edd., Classics@ volume 2: C. Dué & M. Ebbott, executive editors, The Center for Hellenic Studies of Harvard University, edition of April 3, 2004.