Hugh A. Cayless
Directory Services for Classical Informatics
This article describes the need and outlines a proposal for an infrastructure that will manage registries of uniquely identified entities, allowing scholarly projects to preserve, share, and link information with as little human intervention as possible.
Introduction: Background and Issues
Modern humanities scholarship is at a crossroads. There is increasing emphasis on traditional research and publication as the path to tenure and advancement, but at the same time there are growing pressures, both external, in the form of what has been called the “crisis in scholarly publishing,” and internal, from the sheer volume of secondary scholarship that is produced every year. It is becoming harder and harder to publish monographs for economic reasons, while at the same time more and more articles are published, as the requirements for gaining tenure become more and more stringent. As this goes on, libraries are being forced to cut back on journal subscriptions, again for economic reasons. The amount of information is increasing, while obtaining it and producing it becomes more difficult. The commonly accepted model of humanities scholarship, with its emphasis on individual research and publication, is becoming untenable as the sole path to an academic career. Yet, with few exceptions, this is the only path available. For more detail on this issue, see the remarks of John Unsworth at the 2003 Annual Meeting of the American Council of Learned Societies.
As a final ingredient to the soup, advances in information technology have made possible a vast array of new scholarly projects. But these typically do not fit into the standard model: frequently they are collaborative in nature, do not have a fixed publication date, do not have a fixed form, and often rely on the skillful management and arrangement of information. Many of these are “foundational” projects, such as electronic corpora and registries, which expose source material in new ways. These projects are exciting and useful, and they are here to stay. They are, however, frequently not recognized as valid scholarly output by such institutions as tenure committees, which means that scholars must either engage in this type of research outside their regular duties, or must wait until they are tenured and somewhat immune from criticism for devoting energy to activities not considered by their colleagues to be research. Neither solution is particularly satisfactory.
Despite the braking effect on digital scholarship exerted by these issues, there have been and continue to be many interesting developments in the field. The necessity for managing scholarly information is greater than it was a century ago, yet it is now harder to publish tools for managing this growing complexity, such as annotated bibliographies, catalogs of useful data, or indeed anything that will not sell well as a monograph and is too large to be an article. More and more often, digital technology presents the only or the best avenue for producing and publishing research. Perhaps in part as a response to these pressures, “Humanities Computing” as a separate discipline, and one intimately concerned with digital publication, is a growing field. Funding agencies frequently are interested in digital publication projects, even if their home departments are not.
Additional issues emerge from an examination of the state of digital publishing technologies. Understanding how properly to use new media is something that takes both time and experience. The World Wide Web has been in existence now for a little over a decade, and it is only now that we are beginning to understand what it means for scholarly communication and publication. The vast majority of the scholarly projects hitherto undertaken have involved mere transferrals of the monograph and article to the new medium, without any change in their essence. Monolithic online collections of these materials are produced, which are analogous to the print books or series in which they would have been published in the past. Experience with the production of these materials has led to increasing sophistication, however, and the gradual realization that the internet is not just a new kind of library into which new books and articles can be placed, but is also more basically a medium for transferring packets of any sort of information. As such, it does not place any firm constraints on the format, size, or structure of the information it conveys. Nodes in this network may provide services as well as information, such as transforming data from one format to another, extracting information from data sent to them, or providing interfaces to distributed collections of information. In other words, the traditional physical conventions for delivering scholarly communication need not apply. At the same time, the medium is inherently less stable than the physical library: servers break down, or are renamed; websites cease to be maintained. The phenomenon of stale links is a real problem confronting scholars who wish to cite digital materials.1
Notes to this Section:
1. See, for example, Carlson (2004) on the impermanence of digital scholarly resources and Dellavalle et al. (2003) on the problem of inactive links in online scholarly articles.
The discipline of Information Science has recognized these problems for some time and has evolved a number of schemes to solve them. Persistent URLs (PURLs) involve one or more institutions that assume the responsibility of maintaining active links to resources and provide links to them that are guaranteed not to change when the location of the resource moves.2 They are, in other words, resolution services. Digital Object Identifiers (DOIs) provide a way of assigning unique identifying strings to digital resources, analogous to the ISBN for printed books.3 So-called Handle services are then able to provide resolution services for resources with DOIs that are registered with them.4 This is the same idea as the PURL, but more robust, since it allows for identifiers to be tightly bound to digital objects, and perhaps for the automated detection of resolvable resources. These types of service do not provide for the resolution of portions of a resource, since they are agnostic about the nature or format of the resource.
Notes to this Section
4. The DOI is an implementation of the Corporation for National Research Initiatives (CNRI) Handle System. A “handle” in computer programming argot is a means of retrieving and manipulating an in-memory object, such as a window.
In the discipline of Classics there are many resources, such as corpora, lexica, concordances, and the like, that are standard references in the field and which could benefit from a digital environment, but which have not yet been digitized, and may not be for some time. Dealing with this kind of mixed environment, in which some resources are physical and some digital, presents its own challenges. Moreover, with these kinds of resource, it is typically not the entire object that a potential user wants, but some information contained within the object. In a typical web environment, the workflow for a scholar writing an article that requires a citation from, e.g., a corpus would be to look up the address of the resource, navigate to its interface, and query it for the information desired. One of the goals of this project is to enable this process to take place entirely within the scholar’s working environment.
One effect of the digital realm is that issues which are not immediately apparent in the physical world are suddenly magnified. A link in printed scholarship cannot be immediately dereferenced. Work must be done in order for a work referenced in a citation to be retrieved. If the citation is wrong, frequently just a little more work will allow the link to be made. Misspellings, incorrect page references and issue numbers are common, but usually are easy problems to solve. If your own library does not have the work, it can be retrieved through interlibrary loan. If a hyperlink is misspelled, however, or its target removed, the problem is immediately obvious to a reader, and perhaps impossible to resolve. In printed works, tasks like the disambiguation of people with identical names are accomplished via hints in the text or are left as an exercise for the reader. In the digital realm, disambiguation (if it is to be handled by machine processing) must be explicit, because a computer program is likely to have a hard time inferring which referent is meant by a word or phrase from its context. A printed work cannot link to scholarship published after a given article or monograph, unless that information is already in the publishing pipeline (forthcoming). But links do not need to be explicitly encoded in a digital document. They may be generated and regenerated, so a later work may be “referenced” by an earlier work. Ongoing maintenance work must be done in the digital environment, however, to enable such things as consistent links, disambiguation, and the generation of links between resources, and this is the kind of time investment that the current academic atmosphere prevents.
I have briefly sketched a constellation of issues, some having to do with institutional attitudes and some with the developing technologies available to academics. I wish to propose a mechanism for addressing some of these problems. To summarize: economic pressures and the increased size and complexity of the information available are pushing scholars in the direction of digital publication, while attitudes towards the nature of scholarship in the Academy prevent most of its members from giving this avenue of scholarly communication their full attention. At the same time, problems and limitations inherent in the nature of the simplest electronic publication mechanisms (e.g. HTML and PDF) will inhibit everyone’s ability to manage the information that is placed online. What is needed is an infrastructure that makes it easy for scholars publishing their work digitally to include the “hooks” that will make managing their information feasible. Some criteria for this system are:
- It must support digital scholarship and publication as it happens.
- It must be able to uniquely identify features, preferably in terms of some standard published reference work.
- It must be able to resolve links between resources based on those identifiers.
- It must support the preservation of digital publications in ways that are not dependent on the scholars themselves.
Some solutions meeting these criteria for Classics are being implemented by participants in the 2003 CHS summer workshop on technology. The remainder of this article will describe one ingredient of the proposed infrastructure and its relationships to other components.
A Proposal for Directory Services
This proposal makes a number of assumptions that must be stated at the outset.
- Features in an electronic document may be identified in terms of some standard reference. By “feature,” I am referring to things like people, places, and texts, which have been collected and indexed by one or more registries (such as gazetteers, prosopographies, or corpora). It is common practice for such registries to assign identifying numbers or strings to features in order to facilitate discovering and referring to them. The central idea in this proposal is that if identifiers are applied to the features in a document, then explicit links to associated information are no longer necessary. Documents that also refer to those features can be linked automatically, on the basis of the identifying number or string. Since the identifiers are derived from a standard reference work, they will not become stale, in the way that an explicit link might, and since they do not point at a specific resource, the appearance of additional resources can be accounted for with ease.
- Instances of documents, registries, etc. should not be hosted at one location only. It is a generally accepted fact that a document which is copied more and has copies at more locations has a greater chance of long term survival. The history of the transmission of ancient documents makes this perfectly clear. What is true in the physical domain is equally true in the digital: more copies equals a greater chance of survival. From the point of view of the internet, of course, identical documents hosted in different places have different URLs, and are therefore not identical. But if standard identifiers are applied to documents, as discussed in point #1, then identity can be established.
- Simple electronic publication mechanisms, such as HTML and PDF, are likely not robust enough to support the kinds of tasks envisioned here. Scholarly communication will be better served by standards like the Text Encoding Initiative and XML, which provide mechanisms for the fine-grained tagging of semantic and structural features, as well as metadata that are not available in presentation-oriented formats.
Given these assumptions, the proposed infrastructure should support the discovery and retrieval of resources by identifying number or string, independent of their actual location or (preferably) locations, over standard communication protocols like HTTP. Given the current state of technology, these are not difficult problems to solve. It is a simple matter for what I am calling a directory service to resolve an identifier passed to it to an actual URL. If multiple copies of the resource are known, it can simply select an appropriate one from the list. All the users of such a service would need to know is how to construct an identifier that the service will recognize, and where the directory service is located. Under the hood, there will be a “stack” of protocols, of which the directory service is the top layer.5
The Classical Informatics Protocol Stack
The figure illustrates the three layers of the stack, with example service nodes included and an example request and response path.
The directory service is responsible for routing requests to the right electronic registry, which in turn collects the identifiers for features it knows about. A registry service, given an identifier, can resolve it to a resource, which may be a document, or a piece of information, such as a bibliographic reference or the coordinates for a place.
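The routing step can be illustrated with a minimal sketch in Python. It is not part of the proposal itself: the registry key "gaz", the example.org addresses, and the function names are all hypothetical, and the availability check is passed in by the caller so that the example needs no network access.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class RegistryEntry:
    """What the directory knows about one registry (all values hypothetical)."""
    title: str
    mirrors: List[str] = field(default_factory=list)  # base URLs of known copies


def pick_mirror(entry: RegistryEntry, is_up: Callable[[str], bool]) -> Optional[str]:
    """Return the first mirror that the caller-supplied check reports as usable."""
    for url in entry.mirrors:
        if is_up(url):
            return url
    return None


def route(directory: Dict[str, RegistryEntry], key: str,
          is_up: Callable[[str], bool]) -> Optional[str]:
    """Directory layer: map a registry key to one usable registry service URL."""
    entry = directory.get(key)
    if entry is None:
        return None  # the directory knows no registry under this key
    return pick_mirror(entry, is_up)


directory = {
    "gaz": RegistryEntry(
        title="Example Gazetteer",
        mirrors=["http://a.example.org/gaz", "http://b.example.org/gaz"],
    )
}

# Pretend the first mirror is down; the directory falls back to the second.
print(route(directory, "gaz", lambda url: "b.example" in url))
```

In this sketch the directory only chooses where a request should go. Resolving a feature identifier to a document or a piece of information remains the job of the registry service at the chosen address.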
Notes to this Section
5. The system as described here could fit into larger, similar initiatives, such as the CNRI Handle System.
The identifiers for features would consist of a prefix identifying the registry from which they were drawn and a number or string assigned by that registry. The prefix must be detachable from the rest of the identifier, so it should be separated by a character to be agreed upon, such as a colon. The directory service would enable the discovery of registry prefixes by type (gazetteer, corpus, etc.) and by searching on bibliographical information such as title and author. The registry services in turn would enable the discovery of actual feature identifiers via whatever searching or browsing mechanisms they support. As new registry services become available, whether they are originals or mirrors of existing services, they can be registered with the directory service.
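As an illustration only, the identifier scheme and the discovery of prefixes might be sketched in Python as follows. The colon separator is the one suggested above, but it is still "to be agreed upon"; the prefixes "gaz" and "corp", the registry records, and the function names are hypothetical.

```python
from typing import List, NamedTuple, Optional

SEPARATOR = ":"  # assumed; the separator character is still to be agreed upon


class FeatureId(NamedTuple):
    prefix: str  # identifies the registry
    local: str   # number or string assigned by that registry


def parse_feature_id(identifier: str) -> FeatureId:
    """Split on the first separator only, so the local part may contain one too."""
    prefix, sep, local = identifier.partition(SEPARATOR)
    if not sep or not prefix or not local:
        raise ValueError(f"not a prefixed identifier: {identifier!r}")
    return FeatureId(prefix, local)


# Hypothetical directory records used for discovery by type or by title keyword.
REGISTRIES = [
    {"prefix": "gaz", "type": "gazetteer", "title": "Example Gazetteer"},
    {"prefix": "corp", "type": "corpus", "title": "Example Corpus of Inscriptions"},
]


def discover(registry_type: Optional[str] = None,
             keyword: Optional[str] = None) -> List[str]:
    """Return the prefixes of registries matching every criterion that was given."""
    return [
        r["prefix"]
        for r in REGISTRIES
        if (registry_type is None or r["type"] == registry_type)
        and (keyword is None or keyword.lower() in r["title"].lower())
    ]


print(parse_feature_id("corp:6.1234"))
print(discover(registry_type="gazetteer"))
```

Splitting on the first separator only is a design choice of this sketch: it keeps the prefix detachable while leaving each registry free to use the same character inside its own numbering.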
An interesting feature of this design is that it would be able to support registries that are not online, or are only partially online. If a scholar wishes to reference an inscription in CIL, for example, the directory service would be able to provide her with the prefix to use, and bibliographic information about the corpus itself. This would enable her to identify the inscription in a canonical fashion, so that when the inscription or information about it becomes available online, the service will be able to retrieve it, without the document’s author needing to do any further work. The work necessary to maintain linkages between resources essentially disappears, with a small front-end investment in properly identifying features that could or should be linked.
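The following Python sketch shows one way this behavior could work, under stated assumptions: the prefix "printcorp", the citation string, the service address, and the "id" query parameter are all hypothetical and are not drawn from any existing service.

```python
from typing import Dict, Optional
from urllib.parse import quote

# Hypothetical directory records. A service of None marks a registry
# that exists only in print or is not yet online.
DIRECTORY: Dict[str, Dict[str, Optional[str]]] = {
    "printcorp": {"citation": "An Example Printed Corpus", "service": None},
}


def resolve(identifier: str) -> Dict[str, Optional[str]]:
    """Return a URL when the registry is online, else its bibliographic record."""
    prefix, _, local = identifier.partition(":")
    record = DIRECTORY.get(prefix)
    if record is None:
        return {"status": "unknown-registry", "citation": None, "url": None}
    service = record["service"]
    if service is None:
        # Still useful: the reader learns which reference work is meant.
        return {"status": "offline", "citation": record["citation"], "url": None}
    return {"status": "online", "citation": record["citation"],
            "url": service + "?id=" + quote(local, safe="")}


before = resolve("printcorp:123")
# Later the registry comes online; only the directory record changes.
DIRECTORY["printcorp"]["service"] = "http://corp.example.org/lookup"
after = resolve("printcorp:123")

print(before["status"], before["citation"])
print(after["status"], after["url"])
```

The identifier written into the scholar's document is the same in both calls; only the directory's record for the registry has been updated.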
The directory service will be accessible via a web-based API, in which commands and requests for information are issued in the form of HTTP GETs. Responses will be in the form of XML documents containing the requested information or the status of the command’s execution. The proposed methods for the DirectoryServer protocol may be grouped into three areas: discovery, resolution, and administration. Some possible methods for the API are outlined below:
- Registry Discovery
  - Discover registries via keyword search.
  - Discover registries via category search.
  - Discover registries via available protocols.
  - Discover registry based on some combination of the above.
  - List all registries.
- Registry Resolution
  - Resolve registry via its identifier.
- Server Administration
  - Authenticate administrator.
  - Add new registry and acquire new identifier.
  - Edit existing registry entry.
  - Merge two DirectoryServers.
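To give a sense of what a client of such an API might look like, here is a minimal Python sketch. The proposal does not yet fix any request or response syntax, so the endpoint address, the "verb" and "category" parameters, and the XML element names below are all assumptions made for illustration. The example builds request URLs and parses a canned response, so it runs without network access.

```python
from typing import List, Optional, Tuple
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE_URL = "http://directory.example.org/api"  # hypothetical endpoint


def build_request(verb: str, **params: str) -> str:
    """Encode one API call as the URL of an HTTP GET request."""
    query = urlencode({"verb": verb, **params})
    return f"{BASE_URL}?{query}"


def parse_registries(xml_text: str) -> List[Tuple[Optional[str], Optional[str]]]:
    """Extract (identifier, title) pairs from a response document."""
    root = ET.fromstring(xml_text)
    return [(r.get("id"), r.findtext("title")) for r in root.iter("registry")]


# A made-up response in the assumed XML vocabulary.
SAMPLE_RESPONSE = """\
<response status="ok">
  <registry id="gaz"><title>Example Gazetteer</title></registry>
  <registry id="corp"><title>Example Corpus</title></registry>
</response>"""

print(build_request("DiscoverRegistries", category="gazetteer"))
print(parse_registries(SAMPLE_RESPONSE))
```

A real client would send the constructed URL with an HTTP library and pass the response body to the parser; administrative methods would additionally need the authentication step listed above.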
This architecture differs from systems like the CNRI HandleServer or the OpenURL framework in its implementation of the protocol stack.6 The directory service is not simply a system for resource resolution; rather, it is a system for routing requests to the appropriate web services. It provides a means not just for resource discovery given an identifying handle, but also for distributed service usage. Moreover, it accomplishes this task in such a way that linkages created by text editors can be useful and readable even in the event of service interruptions or disappearance, and it will provide a framework for the creation and usage of mirrored services to avoid such eventualities.
Notes to this Section:
The implementation of the directory layer of the protocol will not be a particularly difficult task. In itself, it is just a resolution service. It will work in tandem with registry and text servers currently under development. The more complex problem is a political one: finding institutions willing to host the directories. What will be required is a commitment from one or more institutions whose foreseeable futures are secure to host and maintain a DirectoryServer instance. In the environment described above, registries and services may come and go, but the directory must remain stable. It is to be hoped that a forward-thinking institution (such as the CHS) will assume that responsibility.
The potential benefits from the adoption of a system like the one described here to the community of Classical scholars (and beyond) are great. Resources developed within this framework will be able to scale to handle linkages to new resources and services as they become available without having to be re-edited. This will mean that scholars can handle the tasks of electronic publication without altering their workflow in ways that detract from their research.
Sam Sun, Larry Lannom, Brian Boesch, Handle System Overview - RFC 3650 (http://www.ietf.org/rfc/rfc3650.txt, November 2003)
John M. Unsworth, “The Crisis in Scholarly Publishing in the Humanities”. ARL Bimonthly Report issue 228 (http://www.arl.org/newsltr/228/crisis.html, June 2003)
Robert P. Dellavalle, Eric J. Hester, Lauren F. Heilig, Amanda L. Drake, Jeff W. Kuntzmann, Marla Graber, Lisa M. Schilling, “Going, Going, Gone: Lost Internet References”. Science Magazine volume 302, number 5646, issue 31, pages 787-788 (31 October 2003)
The DOI Handbook edition 3.3.0 (http://www.doi.org/hb.html, November 2003)
Thanks must go to Gregory Nagy for inviting me to participate in the CHS Summer Workshop on Technology, to Ross Scaife, Neel Smith, and Chris Blackwell, for pushing ahead on this initiative, and especially to my wife, Jennifer, for putting up with me while I spend my precious free time working on such things.
To refer to this please cite it in this way: Hugh A. Cayless, “Directory Services for Classical Informatics,” C. Blackwell, R. Scaife, edd., Classics@ volume 2: C. Dué & M. Ebbott, executive editors, The Center for Hellenic Studies of Harvard University, edition of April 21, 2004.