Neel Smith


TextServer: Toward a Protocol for Describing Libraries

Article Contents
  Introduction: definitions and initial considerations  
  A TextServer architecture


Introduction: definitions and initial considerations

The presentations at the Technology and Classics meeting in June of 2003 illustrated many benefits of semantic markup for a variety of scholarly projects. It is striking how many of these projects can be based on document type definitions conforming to the guidelines of the Text Encoding Initiative (TEI). 1  In this paper, I want to consider a question that follows directly from these  observations: how can projects with similar material distributed across the internet interoperate?



We could rephrase this question less technically as, “How do we take  advantage of a growing number of electronic publications to create a distributed electronic library?” but we need to define our terms carefully. The  phrases “electronic publication” and “electronic library”  are widely used in common speech to mean little more than “electronic dissemination of information” and “a collection of digital data,” respectively. Because these senses of the terms subordinate or miss altogether  the essential characteristics of scholarly publications and libraries, I want  to stress that when I use these terms, I mean specifically “scholarly  electronic publication” and “scholarly electronic library.”  


In the scholarly community, publication plays a special role that results  in three defining characteristics. Publication is one of several means of  disseminating information and ideas, but is distinguished from other scholarly  forms of discussion (such as seminars, conference presentations, etc.) because  it becomes part of a permanent record that can be referred to and cited in  future discussion. This presumably leads to a thoroughly reviewed piece of  work that is more carefully thought through and more finished in presentation  than less formal exchanges of ideas, but in any case requires the first characteristic  we must demand of a publication (whether digital or not): it must be citable. That means that any electronic publication must have a fixed and explicitly  identified version (or edition), and must have an explicitly identified citation scheme.


Along with explicitly identified editions and citation schemes, scholarly  publications possess a second vital characteristic: they can be replicated  identically, so that a citation is valid for any copy of the publication.  So long as we are referring to the same edition, there is no need for different  scholars to consult the same physical copy of a publication. While permanent  citability sets publications apart from less formal communication, the replicability  of a publication distinguishes it from archival scholarly resources. The redundancy  resulting from replication provides a vital check on the integrity of the  scholarly record, both pragmatically (if your library's copy is lost, you can go to another library or request a copy via interlibrary loan), and ethically (false claims about the contents of a publication would easily be exposed by consulting any other copy of the publication).


The publication's role as a permanent, citable record also leads to a third characteristic: the publication is irrevocably alienated from the author.  In a very real sense, a publication, by its publication, becomes a possession  of the scholarly community. An author may discover that a work contains egregious  errors, but the author's only recourse is to publish a new, corrected version:  the original erroneous edition cannot be recalled.


By “electronic publication,” therefore, I mean “a publication in digital form with an explicitly identified edition and an explicitly identified citation scheme, that can be irrevocably and perfectly replicated”; the crucial interaction among publications centers on their citability. We must make it possible to identify publications and their citation schemes, and to find passages by canonical reference.


This normally happens in the library, by which I mean simply a collection of these publications. (I refer to collections of unpublished scholarly material as “archives.”) 2 In a universal library containing every known publication, every citation of another publication would be resolvable: a reader could move directly from a reference to the material referred to. For scholarly research, the dilemma of scholar and librarian alike is that a universal library is an impossibility. We might hope to approximate the dream of a comprehensive library for a specific domain, however, and scholars of the classical world are in a particularly inviting situation: vital resources such as the extant corpus of ancient Greek are finite in scope, and essentially static. 3 We might reasonably imagine, for example, that nearly all of extant ancient Greek literature exists in published editions, somewhere. If no library of print publications contains every work, what would be required to create a virtual library of electronic publications assembling a substantial corpus of ancient Greek literature?


Notes to this section


2. Unpublished archival material is also an important part of our scholarly infrastructure, but I will not discuss it further here. For some observations about how digital dissemination does and does not change the relation of archival and published material in the context of archaeological fieldwork, see my chapter, “The Hacimusalar Information Technology Initiative” [add full ref: publication forthcoming in B.A.R.]


3. The continuing discussion of the new Posidippus epigrams illustrates how exciting, and exceptional, an addition to the literary corpus is today.


Anyone familiar with the last quarter century of classical scholarship will  realize that classicists should be particularly interested in this problem because of the similar questions already addressed by previous work. In particular,  the rich resources and experience our discipline has gained from projects  like the Thesaurus Linguae Graecae 4 and the Perseus Digital Library 5 provide the context for understanding our situation today. At the risk of   oversimplifying two enormous, extended undertakings, I would suggest that the TLG is fundamentally about what kinds of scholarship are possible when a comprehensive collection of Greek texts is assembled in one place, and Perseus is fundamentally about how our understanding of our sources   for a culture like ancient Greece can change when many different kinds of   source material are structured to interact with each other. At a technical level, the TLG, founded in the 1970s, is a response to the potential   of mass storage and rapid retrieval in digital media. The Perseus project, founded in the 1980s, is a response to richer media and higher-level  data structures. The TLG project recognizes how information technology  can help overcome limits of scale in our work; the Perseus project  recognizes how information technology can help overcome limits of disciplinary  boundaries and canons of material.



More than a decade after the creation of the World Wide Web, and seven years since the first specification of XML, the presentations at this conference contrast with the centralized collections of the TLG and Perseus in two important respects. First, it is quite obvious that, at a technical level, the production of digital scholarly materials that a decade ago could  only be found at projects with a high level of technical expertise and support  is now trickling down to the individual department and even to the individual  scholar. Second, at a scholarly level, the focus on semantic markup at the CHS's conference in June, 2003, shows that many scholars will not be satisfied with predefined, predigested versions of scholarly materials, but will, justifiably,  insist on access to the same kind of fully structured representation of a  publication that they themselves are creating and using.


With these considerations in mind, let us return to the initial question: what kind of infrastructure do we need to allow interaction among projects distributed across the internet? We need to support scholarly publication with its requirements for explicit citation of replicable fixed editions, but instead of depending on a collection of publications stored in one place, we can take advantage of the activities of individuals as well as large projects, wherever they may be on the internet. The scholar's distributed digital library will function as much like the distributed domain name services that look up numeric addresses for computer names, or like the indexing of distributed Web pages in Google, as it does like Perseus or the TLG. But in contrast to all of these similes, it will work directly with the semantic structures that projects like those described at this conference are already using.


A TextServer architecture

Identifying and retrieving information across the internet is, fortunately, a generic problem. Just as classicists can profit from generic technologies surrounding markup languages like XML, we can profit from generic technologies to extend our reach to marked-up information distributed across the internet. Our desire to take advantage of information in TEI-conformant XML from scholars at a number of institutions is exactly the kind of problem that many businesses and government organizations, as well as academic institutions, are energetically working on. Conceptually, it corresponds closely to what Tim Berners-Lee, the creator of the World Wide Web, and the World Wide Web Consortium call “the semantic Web.” In contrast to the vast quantities of relatively unstructured information on the WWW in HTML, Berners-Lee emphasizes the kinds of interactions that are possible when highly structured information is accessible to computer programs using the minimal communications protocols of the WWW. Simple programs can easily be written to retrieve information in XML. By separating out the questions of how to find and retrieve XML information conforming to a known structure, we make the same information available to an unlimited number of potential applications. The architecture we need to make our distributed library possible can therefore be an example of a Web service, that is, a program providing structured information over the WWW to another program. In our case, we need to provide the services necessary to support our definition of a scholarly electronic publication. I will next describe a set of conventions for providing these services. I will refer to the conventions themselves as “the TextServer conventions” and will call a program implementing these conventions a TextServer.
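As a minimal sketch of how little code such a program needs, the following Python fragment parses a small XML document and pulls out one passage by its position in the logical hierarchy. The element names, the `type` and `n` attributes, and the sample text are assumptions made for illustration, loosely modelled on TEI practice; they are not taken from any particular project's markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified TEI-like markup. Real TEI documents
# carry a header and a much richer structure.
SAMPLE = """
<text>
  <body>
    <div type="book" n="1">
      <div type="section" n="1"><p>Placeholder for book 1, section 1.</p></div>
      <div type="section" n="2"><p>Placeholder for book 1, section 2.</p></div>
    </div>
  </body>
</text>
"""

def find_passage(xml_source, book, section):
    """Return the text of one section, or None if it is not present."""
    root = ET.fromstring(xml_source)
    path = (
        f".//div[@type='book'][@n='{book}']"
        f"/div[@type='section'][@n='{section}']"
    )
    node = root.find(path)
    if node is None:
        return None
    # itertext() yields all descendant text; join it and normalise whitespace.
    return " ".join("".join(node.itertext()).split())

print(find_passage(SAMPLE, "1", "2"))
```

Because the lookup depends only on the known structure of the markup, the same function would serve any application that needs the passage, whatever it then does with the text.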


My goal in defining formal conventions for a TextServer is to meet the absolute minimum requirements of citing, retrieving and replicating an electronic publication. These requirements are not unique: digital libraries must generically provide  some way to identify publications, to discover their citation schemes, and  to retrieve pieces of on-line publications using those citation schemes. It  is not surprising that projects like the Open Archives Initiative 6  have chosen exactly the kind of architecture described here: they define protocols  allowing programs to exchange structured information.
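The three requirements can be pictured as three requests that any conforming program would have to answer. The sketch below is an in-memory stand-in rather than a network service, and every name in it (`MiniTextServer`, the method names, the identifier `ptolemy-geography`, the placeholder passage text) is hypothetical; it illustrates the shape of the problem, not the conventions themselves.

```python
from dataclasses import dataclass, field

@dataclass
class Publication:
    identifier: str          # hypothetical identifier for one fixed edition
    citation_scheme: tuple   # names of the hierarchy levels, outermost first
    passages: dict = field(default_factory=dict)  # canonical reference -> text

class MiniTextServer:
    """In-memory illustration of the three minimum services."""

    def __init__(self, publications):
        self._by_id = {p.identifier: p for p in publications}

    def list_publications(self):
        """Requirement 1: identify the publications on offer."""
        return sorted(self._by_id)

    def citation_scheme(self, identifier):
        """Requirement 2: discover how a publication is cited."""
        return self._by_id[identifier].citation_scheme

    def get_passage(self, identifier, reference):
        """Requirement 3: retrieve a passage by canonical reference."""
        return self._by_id[identifier].passages[reference]

server = MiniTextServer([
    Publication(
        "ptolemy-geography",
        ("book", "section"),
        {"1.2": "Placeholder text for 1.2."},
    ),
])

print(server.list_publications())
print(server.citation_scheme("ptolemy-geography"))
print(server.get_passage("ptolemy-geography", "1.2"))
```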



While it might therefore seem that classicists could directly use the protocols of a project like the Open Archives Initiative, I believe that on closer examination current efforts to define protocols for digital libraries fall short of our needs because their notion of citation focuses on what I would term documents, rather than texts in the sense that classicists often use that word. Focusing on documents is a legitimate design choice, but we as classicists need to be aware of its implications.


When we refer to classical texts, we most often use a canonical reference  system describing a logical hierarchical organization of the text independent  of any specific physical version. This remarkable practice is so familiar  that we often fail to recognize its consequences. Notably, it allows us to  discuss a notional text at many different levels. We can equally   easily refer to “Ptolemy, Geography book 1, section 2”  in contexts that mean “this passage of Ptolemy—whatever edition  or translation you happen to be using” or “this passage of Ptolemy in the translation by Berggren and Jones” or “this passage of Ptolemy in Nobbe's Greek edition” or even “this passage of Ptolemy  in the copy on my shelf with the ink smudge in the margin.”


In our distributed electronic library, we want the fundamental scholarly activity of citing a work to take account of this hierarchical organization of our notional text. A citation should be able either to point to specific versions (e.g., to contrast Nobbe's and Müller's readings for this passage) or to refer only to a notional text of Ptolemy that one reader might prefer to look up in English translation, another in German, and a third in a Greek edition. We expect a citation form like “Ptol. Geo. 1.2” to be valid at any of these levels.
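One way to picture a citation that is valid at more than one level is a record in which the version is optional. The following Python sketch is only an illustration: the `Citation` type, the `parse_citation` function, and especially the ` @ version` suffix are notations invented here, not part of any existing citation practice.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Citation:
    work: str                      # abbreviated work, e.g. "Ptol. Geo."
    passage: tuple                 # canonical reference, e.g. ("1", "2")
    version: Optional[str] = None  # None means "the notional text"

def parse_citation(text):
    """Parse 'Ptol. Geo. 1.2' or 'Ptol. Geo. 1.2 @ nobbe'.

    The ' @ version' suffix is a notation made up for this sketch.
    """
    ref, sep, version = text.partition(" @ ")
    # The passage is the last whitespace-separated token of the reference.
    work, _, passage = ref.strip().rpartition(" ")
    return Citation(
        work,
        tuple(passage.split(".")),
        version.strip() if sep else None,
    )

notional = parse_citation("Ptol. Geo. 1.2")
specific = parse_citation("Ptol. Geo. 1.2 @ nobbe")

# Both citations name the same work and passage; only the level differs.
print(notional)
print(specific)
print(notional.passage == specific.passage)
```

The design choice worth noting is that the passage reference is identical in both cases, so a reader's software could substitute any available version when none is specified.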


Contrast the organization this practice implies with the organization of the most thorough inventory of Greek literary texts our discipline has ever produced: the Canon of the TLG project. In the TLG Canon, works are grouped by “authors” (Ptolemy, for example), but within that conventional category individual works are defined by the bibliographic source used by the TLG. Ptolemy's Geography, for example, is not one but two works: books 1-3 are the Geography as edited by Müller, books 4-8 the Geography as edited by Nobbe. Müller died after completing four books of his edition, which is an improvement in every way on Nobbe's edition, but the second (physical) volume of Nobbe's edition begins with book 4, so, apparently for this reason alone, the TLG Canon includes a second entry, Geography books 4-8.


One requirement of our TextServer then will be an inventory of works that  allows citation of our notional text to operate within a hierarchy of works  containing specific versions. To stay with our example, we want a representation  of Ptolemy's Geography that could contain information about the separate   editions of Müller and Nobbe. This inverts the document-centered organization of other digital library projects that organize works first by physical instances   or editions, then index those to associate more than one edition with a single work or “author.” In this way, we will be able more directly and   simply to support the habit of citation by canonical reference to a notional   text.
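The inversion can be sketched as a regrouping of flat, document-first records under the notional work they represent. In the Python fragment below, the record fields and the function names `build_inventory` and `versions_of` are assumptions made for illustration; the three versions are the ones named in this discussion, and nothing here reflects the actual data model of any existing catalogue.

```python
from collections import defaultdict

# Hypothetical document-first records, one per edition or translation.
RECORDS = [
    {"version": "Nobbe", "kind": "edition",
     "work": "Geography", "author": "Ptolemy"},
    {"version": "Müller", "kind": "edition",
     "work": "Geography", "author": "Ptolemy"},
    {"version": "Berggren and Jones", "kind": "translation",
     "work": "Geography", "author": "Ptolemy"},
]

def build_inventory(records):
    """Group versions under author, then under the notional work."""
    inventory = defaultdict(lambda: defaultdict(list))
    for rec in records:
        entry = (rec["version"], rec["kind"])
        inventory[rec["author"]][rec["work"]].append(entry)
    return inventory

def versions_of(inventory, author, work, kind=None):
    """List the versions of a work, optionally restricted to one kind."""
    return [
        version
        for version, version_kind in inventory[author][work]
        if kind is None or version_kind == kind
    ]

inventory = build_inventory(RECORDS)
print(versions_of(inventory, "Ptolemy", "Geography"))
print(versions_of(inventory, "Ptolemy", "Geography", kind="edition"))
```

With the work as the primary key, a citation of the notional text resolves to the single work entry, and the choice of edition or translation becomes a second, optional step.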

To refer to this please cite it in this way:

Neel Smith, “TextServer: Toward a Protocol for Describing Libraries,”   C. Blackwell, R. Scaife, edd., Classics@ volume 2: C. Dué  & M. Ebbott, executive editors, The Center for Hellenic Studies of Harvard  University, edition of April 3, 2004.