Demos: Challenges and Lessons
Christopher W. Blackwell
Down the Pipe
What You Get for Free
The PDFs and the Economics of Transformation
The Challenge of Canonical Citation, or, Don’t Mess with the DTD
A Good Use of Standoff Markup
The Future: Inscriptions, Ancient Works, and TextServices.
Dēmos: Classical Athenian Democracy (www.stoa.org/projects/demos/home) is a medium-sized digital library of texts that invites non-specialist readers to engage in a critical reading of primary and secondary sources for this ancient historical topic. This article will not describe the contents of the site at length (for which see the site itself or the description of the project forthcoming in the New England Classical Journal) but will focus on the site as an object lesson in the benefits, and potential pitfalls, of building an XML-based collection of humanist texts.
Dēmos: Classical Athenian Democracy (www.stoa.org/projects/demos/home) is an online collection of articles about Athenian democracy in the 5th and 4th centuries BCE. As of this writing, Dēmos consists of 113 articles, totalling approximately 400,000 words of content, by at least 13 different authors. The 26 major articles are available as PDF files, and these alone total 785 pages of content. So the site is a medium-sized digital library, if we define small as a few documents and large as many thousands of documents.
From the outset, Dēmos was intended to bring together secondary materials (from relatively straightforward descriptions of institutions and biographies of historical figures to scholarly argumentative essays) with easy and meaningful access to the primary sources. Easy access was not hard to envision: direct hyperlinks to the primary sources in the original language or in translation whenever those sources were online. To make this access meaningful, however, we wanted to provide contextual information about those sources, so readers would be able not only to read, say, a passage from Demosthenes, but also to understand who Demosthenes was, the nature (and limits) of his speeches as evidence, and the details of whatever speech is being cited. We wanted to deliver similar contextual information for place names, personal names, and other elements that might not be familiar to a general readership.
For such a site to work, and to continue to work as its contents grew and as technology and available resources changed, the site had to be dynamic and had to adhere to a separation of concerns. The content of each article had to deal with Athenian democracy and its sources, in some form that was both human-readable and machine-readable but that did not presuppose any particular operating system or medium of publication. Some other mechanism had to take care of cross-referencing and linking, and had to reflect automatically, to the extent possible, the current state of resources in the site locally and those resources available online at other sites. And yet another mechanism had to take care of the ultimate display of the linked and cross-referenced data.
In early 2002, just as we had accumulated enough content for Dēmos to consider publishing its first edition, a combination of open standards for encoding and manipulating electronic texts, and freely available software for working with those texts, reached a critical point of stability, maturity, and simplicity. What had once required programmers of the highest degree of experience was now within the grasp of classicists with mere enthusiasm and technological competence.
The first of the standards in question was the set of Document Type Definitions for XML files developed by the Text Encoding Initiative (the TEI-DTDs). The second was Extensible Stylesheet Language Transformations (XSLT), which describes how to transform one XML document into another. The third was Cascading Style Sheets (CSS), which allows a site to impose formatting on HTML documents by means of stylesheets that are external to the HTML. The software in question was the combination of Jakarta Tomcat and Cocoon, two Java-based, open-source applications that, together, allow (among other things) XML documents to be delivered to a web browser in HTML, having undergone various transformations along the way.
So the content of Dēmos consists of TEI-conformant XML documents. These are dissected, merged, and transformed in various ways by a set of custom XSLT stylesheets, with Tomcat/Cocoon doing the work, before being delivered as HTML to a reader’s browser, which formats them according to custom CSS stylesheets for attractive and intuitive reading and interaction.
The work of assembling this system took place between January and June of 2003. In this article, I want to describe some of the technological decisions, innovations (sometimes mine, sometimes borrowed), and mistakes (all mine) that went into publication of Dēmos, the debt that this project owes to other scholars working on similar projects, and some challenges and possibilities I foresee for the future of the site.
Down the Pipe
When a user visits an article in Dēmos, her browser contacts the site, sending a request that carries three pieces of information: the name of the requested article, the section of the requested article (by number or name), and the preferred way of presenting Greek text.
This request goes to Cocoon, which matches the “article” request to a pipeline. The pipeline is a defined process that begins by reading one or more XML files, transforms them in various ways, and ends by outputting an XML file. In the case of a request for a section of an article, the pipeline works like this:
Cocoon finds a file whose title matches the requested article and reads it.
Cocoon then applies an XSLT stylesheet that prepares any Greek text present for subsequent treatment. By the end of this pipeline, Greek text should appear in the user’s preferred encoding and should be linked to morphological and lexical tools. To do that, it is first necessary to mark each individual Greek word explicitly, using the “w” tag (for “word”) as defined in the TEI Guidelines (http://www.tei-c.org/P4X/AI.html#AILC). Greek text appears in the XML files in Beta Code, and even if the user wants to see Greek transliterated or in Unicode, that Beta Code information must be preserved, to be passed on later to morphological parsers. So this first transformation includes it as an attribute of the “w” tag.
So each of the four words in “o( a)/nqrwpos sofo/s e)stin.” in the original file would be wrapped in its own “w” element, with the original Beta Code recorded as an attribute of that element.
Latin words are handled similarly.
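The word-marking step can be sketched in a few lines of Python. This is only an illustration, not the XSLT stylesheet that the site uses: the enclosing "foreign" element and the attribute name "betacode" are assumptions made for the example, since the actual names are not given here.

```python
import xml.etree.ElementTree as ET

# Beta Code punctuation that may trail a word. Diacritic marks such as ) ( / \ =
# belong to the word itself and are left in place.
TRAILING_PUNCTUATION = ".,;:"

def wrap_greek_words(beta_code_text):
    """Wrap each whitespace-separated Beta Code word in its own <w> element."""
    # Hypothetical container; "lang" = "grc" marks the content as Greek.
    container = ET.Element("foreign", {"lang": "grc"})
    for token in beta_code_text.split():
        word = token.rstrip(TRAILING_PUNCTUATION)
        # "betacode" is an assumed attribute name that preserves the original form.
        w = ET.SubElement(container, "w", {"betacode": word})
        w.text = word
        # Punctuation stays outside the word element, followed by a space.
        w.tail = token[len(word):] + " "
    return container

phrase = wrap_greek_words("o( a)/nqrwpos sofo/s e)stin.")
print(ET.tostring(phrase, encoding="unicode"))
```

Because the Beta Code survives as an attribute, a later stage can replace the text of each word with a transliteration or Unicode without losing the form that the morphological tools expect.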
Cocoon notes the user’s preferred method of encoding Greek: Latin transliteration, Beta Code, or Unicode. Using that information, the application processes the XML file with the Transcoder, a Java function written by Hugh Cayless. The Transcoder looks for any elements in the XML that have a “lang” attribute of “grc” (for “Greek”). It then assumes that the text of that element is in Beta Code, and converts that text into the preferred encoding. Now all the Greek words are encoded according to the user’s preference, with their original Beta Code preserved.
Cocoon then applies the “article” XSLT stylesheet. This stylesheet begins the process of transforming the XML into XHTML for display to the user. The first thing it does is note the requested section, find that section in the XML file, and work only on that (so rather than extracting the requested section from the file, the stylesheet in effect discards all the other sections from the file).
This stylesheet actually calls many other stylesheets that perform specific transformations—building the table of contents and adding navigational elements, wrapping text marked as quotations with typographic quotation marks, italicizing emphasized text or text in Latin, and so on.
It also marks text that will eventually be linked. This includes phrases explicitly marked as cross-references, but also personal names, place names, and (most importantly) citations to sources. For now, these are simply marked with tags, since Cocoon does not yet have the information necessary to make the links specific.
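The "discard everything but the requested section" approach taken by the “article” stylesheet can be illustrated with a short Python sketch. The element name "div" and the attributes "n" and "id" are assumptions chosen for the example; the real stylesheet is written in XSLT and its selection logic is not reproduced here.

```python
import xml.etree.ElementTree as ET

def keep_only_section(article, wanted):
    """Remove every section except the one requested by number or by name."""
    for section in list(article.findall("div")):
        # A section survives only if its number ("n") or name ("id") matches.
        if wanted not in (section.get("n"), section.get("id")):
            article.remove(section)
    return article

# Invented sample article with three sections.
article = ET.fromstring(
    '<body>'
    '<div n="1" id="intro"><p>First.</p></div>'
    '<div n="2" id="sources"><p>Second.</p></div>'
    '<div n="3" id="summary"><p>Third.</p></div>'
    '</body>'
)
keep_only_section(article, "sources")
remaining = [d.get("id") for d in article.findall("div")]
print(remaining)
```

The same call with "2" instead of "sources" selects the same section, mirroring a request made by number rather than by name.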
The next stage of the pipeline takes the almost fully transformed document and merges it with four other XML files:
“descriptions_available” - This is an XML list of all the articles currently in Dēmos that describe ancient authors or works, or genres of evidence. Because this file is generated dynamically whenever it is requested, it will always be up-to-date.
“perseus_available” - This is an XML list of all ancient works known to be available in the Perseus Digital Library (http://www.perseus.tufts.edu). At the moment, Dēmos relies on Perseus for almost all of its linking to the texts and translations of primary sources.
“demos_available” - This is an XML list of all articles currently in Dēmos. It is also generated dynamically whenever requested. (That process, by the way, takes place in another Cocoon pipeline; pipelines can call other pipelines, which is part of their strength.)
“lookup” - This file is discussed at length below. Briefly, it allows editorial intervention in the process of automatically generated linking.
With all of the information from the above four files now available, one last XSLT stylesheet fills in the targets for all links. If a citation points to an ancient work that is available in Perseus, the stylesheet links to that work; if a cross-reference points to an article that exists in Dēmos, then the stylesheet links to that. If there is a citation to a work for which there is a descriptive article, or whose author is the subject of a descriptive article, the stylesheet will generate a link.
This transformation also cleans up, removing the extra data and lists that have been added along the way. The result is a document in XHTML, a format that follows the rules of XML but is recognizable to web browsers.
The final stage of the pipeline is the “serializer,” which sends the completed XHTML file back to the user’s browser, which will format it according to the Cascading Style Sheets and display it on the screen.
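The link-filling stage described above amounts to checking each marked citation against the lists of available resources. The following Python sketch shows the idea under stated assumptions: the "cite" element, its "author" and "work" attributes, the output attribute names, and the placeholder link targets are all hypothetical, and the two sets stand in for the merged “perseus_available” and “descriptions_available” lists.

```python
import xml.etree.ElementTree as ET

# Stand-ins for the merged availability lists (invented sample contents).
PERSEUS_AVAILABLE = {("Dem.", "18")}   # (author, work) pairs known to be online
DESCRIPTIONS_AVAILABLE = {"Dem."}      # authors that have a descriptive article

def fill_links(document):
    """Add link targets only where a matching resource is known to exist."""
    for cite in document.iter("cite"):
        author, work = cite.get("author"), cite.get("work")
        if (author, work) in PERSEUS_AVAILABLE:
            # Placeholder target, not a real Perseus address.
            cite.set("text-link", f"perseus:{author} {work}")
        if author in DESCRIPTIONS_AVAILABLE:
            # Placeholder target for a contextual article about the author.
            cite.set("evidence-link", f"demos:about/{author}")
    return document

page = ET.fromstring(
    '<div>'
    '<cite author="Dem." work="18">Dem. 18.123</cite>'
    '<cite author="Aristot." work="Ath. Pol.">Aristot. Ath. Pol. 12.3</cite>'
    '</div>'
)
fill_links(page)
linked, unlinked = page.findall("cite")
print(linked.attrib)
print(unlinked.attrib)
```

A citation with no match in either list simply passes through unchanged, which is why the availability lists can grow or shrink without any edits to the articles themselves.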
What You Get for Free
With its ability to perform a series of transformations on an XML file, Cocoon delivers a number of benefits that a site can exploit with very little effort. The most noteworthy is indexing.
One of the common weaknesses of traditional print articles is the lack of indexing, which is too laborious and costly for a serial publication. But with XML content, indexing is almost automatic.
Just as the Cocoon pipeline that delivers an article to the reader takes note of which section the reader wants and works only with that section, indexing using XML and XSLT is not so much a matter of generating an index, but of identifying things to be indexed and throwing away anything else.
For example, if we want an index of personal names, the XSLT stylesheet need include only instructions for dealing with the elements that mark personal names. That stylesheet will ignore anything in the XML that is not a personal name. From there it is only a matter of sorting and formatting.
Dēmos’ articles offer indices of personal names, place names, names of archaeological artifacts, deme names, and tribe names. Each article also includes a double index locorum, “double” because the index has two sections that display the same data in two different ways. The first section is sorted by citation, and can thus answer the question, “Is Aristot. Ath. Pol. 12.3 cited, and if so, where?” The second section of the index is sorted according to section of the article, thus answering the question, “What are the sources for Payment for Participation in the Council?”
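The "identify what to index and throw away the rest" idea behind the double index can be sketched in Python. The markup here is invented for illustration: "div" with a "head" attribute, "cite", and "persName" are assumed element names, and the pairing of citations with sections is sample data, not the content of any real article.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented sample article; only the <cite> elements matter to this index.
ARTICLE = ET.fromstring(
    '<body>'
    '<div head="Payment for Participation in the Council">'
    '<p><persName>Demosthenes</persName> says so at <cite>Dem. 18.123</cite>; '
    'compare <cite>Aristot. Ath. Pol. 12.3</cite>.</p>'
    '</div>'
    '<div head="A Second Sample Section">'
    '<p>See again <cite>Aristot. Ath. Pol. 12.3</cite>.</p>'
    '</div>'
    '</body>'
)

def double_index(article):
    """Return (citation -> sections, section -> citations), ignoring all other markup."""
    by_citation = defaultdict(list)
    by_section = {}
    for section in article.findall("div"):
        title = section.get("head")
        citations = [cite.text for cite in section.iter("cite")]
        by_section[title] = citations
        for citation in citations:
            by_citation[citation].append(title)
    # Plain string sorting; a real index would sort by author, work, and passage.
    return dict(sorted(by_citation.items())), by_section

by_citation, by_section = double_index(ARTICLE)
print(by_citation)
print(by_section)
```

Both views come from a single pass over the same elements, which is the sense in which the two halves of the index "display the same data in two different ways."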
The PDFs and the Economics of Transformation
For the 26 major articles currently in Dēmos, the site offers PDF versions for download. With the combination of Cocoon and XSLT, and the addition of another standard, XSL-FO (eXtensible Stylesheet Language Formatting Objects), it is possible to transform XML into PDF dynamically. Dēmos’ PDFs, however, are not generated dynamically, but through a process that combines XSLT transformation and hand-editing in a page-layout program. The reason for this follows.
Every view of a Dēmos article includes a link that will allow the reader to see the whole article as one long page in the browser. This view can be printed, and offers sufficient clarity and organization to make for relatively satisfactory reading. So even without PDFs, users can print out Dēmos’ articles for offline reading.
When contemplating making PDFs available, then, and after examining the possibility of using XSL-FO for the job, I decided that the economics of generating PDFs dynamically were wrong. While XSL-FO is powerful, it does not yet offer the level of control over how a text appears on the page that an editor can have using a modern page-layout application such as Adobe’s InDesign.
In order to make Dēmos’ PDFs represent a significant improvement over simply printing from the browser, I wrote a basic stylesheet that does 90% of the formatting of the XML and delivers it to the browser. From there, I can copy and paste that text into an InDesign template and polish the layout. The resulting PDFs are formatted in the Adobe Minion Pro font, an OpenType font that contains a full complement of polytonic Greek glyphs; the articles include marginal notes, fully kerned text, ligatures, and other typographic features that are expected of a printed text, but are still very difficult if not impossible to accomplish purely automatically.
The Challenge of Canonical Citation, or, Don’t Mess with the DTD
One of the most important features of Dēmos is the contextual information offered in addition to traditional scholarly citation. A general readership does not necessarily know what “Dem. 18.123” means, and will scarcely be more enlightened having followed a hyperlink to the middle of some paragraph in “On the Crown.” To be critical interpreters of ancient history, readers need to understand the nature of the evidence. So, whenever possible, Dēmos supplements a citation like “Dem. 18.123” with a marginal note entitled “Read About the Evidence,” including a link to contextual information about Demosthenes and contextual information about “Dem. 18”: that it is an example of oratory, a genre with certain problems and potentials, the historical circumstances of the speech, and so forth.
So the publishing system needs to know, for each citation, to what author and work the citation refers. This is not as easy as it sounds. An experienced reader can look at “Dem. 18.123” and discern that “Dem.” is the author and “18” is the work. And it would be possible to extract this information programmatically. But the same algorithm that could dissect “Dem. 18.123” would be misled by “Hdt. 4.123.” Likewise, one that could handle “Aristot. Pol. ####” would be confused by “Dion. Hal. ####.”
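A small Python sketch makes the difficulty concrete. The parser below is a deliberately naive illustration (it is not anything the site uses): it assumes the first token is the author and the number before the first period is the work. That assumption holds for “Dem. 18.123” but fails for the other citation forms mentioned in this article.

```python
def naive_parse(citation):
    """Assume 'Author Work.Passage': first token is the author, then work, then passage."""
    author, rest = citation.split(" ", 1)
    work, _, passage = rest.partition(".")
    return {"author": author, "work": work, "passage": passage}

# Correct: speech 18 of Demosthenes, section 123.
dem = naive_parse("Dem. 18.123")

# Misleading: "4" is a book within a single work, not a separate work.
hdt = naive_parse("Hdt. 4.123")

# Broken: the work's abbreviation ("Ath. Pol.") itself contains periods and spaces.
ath_pol = naive_parse("Aristot. Ath. Pol. 12.3")

for result in (dem, hdt, ath_pol):
    print(result)
```

The three inputs have nearly the same surface shape, yet only the first comes out right, which is why some explicit statement of author and work has to come from outside the citation string.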
I solved this problem by adding to the TEI-DTD, thus creating a special “DemosTEI-DTD.” This new DTD added three attributes to the element that marks a citation: “author,” “work,” and “primary.” The first two are self-explanatory; the third, “primary,” could take “true” or “false” as values, thus allowing the site to distinguish between primary and secondary sources for purposes of indexing.
This works pretty well in practice. A citation such as “Dem. 18.123” carries its author and work in those attributes, so the stylesheets can easily extract the author’s and work’s names and cross-reference them to contextual articles, generating marginal notes and links.
But it was a mistake, nevertheless. By using a non-standard DTD, even one that differs from the TEI’s DTD in such a small way, I made Dēmos a much less friendly citizen in the world of digital libraries, since its files will not validate against the standard DTD. It has made managing the site more difficult, since the “DemosTEI” DTD has to be on any server that hosts the project, which adds another burden to the server’s administrators. And it is untidy: the XML files should not be cluttered with redundant information, since the citations themselves serve as unique identifiers.
The answer to this problem—which I will put in place during the summer of 2004—is “standoff markup,” the practice of using separate XML documents to supplement the information in the TEI-conformant ones.
For example, the articles themselves should contain nothing but the canonical citation in their markup. A separate file should contain the information identifying the author and work for citations following a certain pattern.
The XSLT stylesheets would read a citation from the article, find its match in the external file, and collect information about the author and work accordingly. Had I employed a system like this from the outset, Dēmos would be better for it today. Fortunately, because the articles are valid according to a DTD, it will be relatively easy to make sweeping, site-wide changes using simple XSLT scripts.
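One possible shape for such a standoff file, and for the lookup against it, is sketched below in Python. Everything about the file format is an assumption made for illustration: the element name "pattern", the attribute names, and the use of regular expressions are not taken from the project. Where a pattern gives no fixed work, the first captured number is treated as the work.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical standoff file: one pattern per citation scheme.
STANDOFF = ET.fromstring(r"""
<citations>
  <pattern match="Dem\. (\d+)\.\d+" author="Dem."/>
  <pattern match="Hdt\. \d+\.\d+" author="Hdt." work="Histories"/>
  <pattern match="Aristot\. Ath\. Pol\. \d+\.\d+" author="Aristot." work="Ath. Pol."/>
</citations>
""".strip())

def identify(citation, standoff=STANDOFF):
    """Return (author, work) for a canonical citation, or None if no pattern matches."""
    for pattern in standoff.iter("pattern"):
        matched = re.fullmatch(pattern.get("match"), citation)
        if matched:
            # A fixed "work" attribute wins; otherwise the captured number is the work.
            work = pattern.get("work") or matched.group(1)
            return pattern.get("author"), work
    return None

for sample in ("Dem. 18.123", "Hdt. 4.123", "Aristot. Ath. Pol. 12.3"):
    print(sample, "->", identify(sample))
```

With this arrangement the articles hold only the citation text, and a new citation scheme means one new line in the standoff file rather than a change to the DTD.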
A Good Use of Standoff Markup
Dēmos does take advantage of standoff markup, in the form of a file named “lookup.xml” that the site’s stylesheets consult when building links between resources.
When the stylesheets encounter certain kinds of elements—personal names, place names, names of ancient authors or works—they first see if the element is present as a “keyword” element in the file lookup.xml.
If the element is not, then the stylesheets automatically generate an appropriate link. If the element in question is the name of an ancient work, the stylesheets will look for a Dēmos article describing that work and, if one is present, link to it; otherwise, the stylesheets will generate no link. If the element is a personal name or place name, the stylesheet will generate a link that looks up that element by name in the Perseus Encyclopedia.
But if the element is present as a keyword in lookup.xml, the stylesheet can read from that file one or more targets for linking, and will generate a link to a menu of appropriate resources.
For example, many technical terms are marked as elements in the articles, including the term “jury.” While there is no single article dedicated to the topic of juries under the Athenian democracy, there are a number of articles or sections of articles in Dēmos, and a number of external resources, that would be helpful to the reader who wants to learn more about juries.
So, in lookup.xml, there is an element whose key attribute is “lawcourt”; under this element there is a “keyword” element with “jury” as its value. Also under that element are other elements that point to resources relevant to lawcourts.
When the stylesheets encounter “jury” as an element in a Dēmos article, they use the information in lookup.xml to generate a page that will invite readers to read the following resources:
- The article entitled “An Introduction to the Legal System,” by Victor Bers and Adriaan Lanni, included in Dēmos.
- The article entitled “Punishment in Athenian Law,” by Danielle Allen, included in Dēmos.
- The entry on “dikē” in the Law Glossary, included in Dēmos.
- The entry on “klepsudra” in the Law Glossary, included in Dēmos.
- The Perseus Encyclopedia entry on the Heliaea.
- The sections on the lawcourt and the Heliaea from the Athenian Agora Excavations’ site.
- Epsilon 2505, an entry in the Suda Online, describing the Heliaea.
So this instance of standoff markup allows editorial intervention in the automated process of linking. It also allows one-to-many linking, when many resources shed light on a single entity.
It also allows many-to-many linking. Each keyed element in lookup.xml can have many “keyword” elements. So “jury,” “juror,” “lawcourt,” “Heliaea,” and “court” will all point to this set of resources having to do with courts, juries, and justice under the Athenian Democracy.
And, of course, this system preserves the “Separation of Concerns.” An individual article need contain information only about its own topic. Its markup does not need to be “aware” of any other articles in Dēmos or elsewhere. Each article can contain only information about content. The Cocoon pipeline is the place where the “concern” of Function belongs, and so the pipeline handles the linking by bringing in an extra XML file.
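The many-to-many lookup can be sketched in Python as follows. Only the “key” attribute and the “keyword” element are named in this article; the wrapper element "entry", the "target" element, and the abbreviated list of targets are assumptions for the sake of the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical shape for lookup.xml; "entry" and "target" are assumed names.
LOOKUP = ET.fromstring("""
<lookup>
  <entry key="lawcourt">
    <keyword>jury</keyword>
    <keyword>juror</keyword>
    <keyword>lawcourt</keyword>
    <keyword>Heliaea</keyword>
    <keyword>court</keyword>
    <target>An Introduction to the Legal System</target>
    <target>Punishment in Athenian Law</target>
  </entry>
</lookup>
""".strip())

def targets_for(term, lookup=LOOKUP):
    """Return the editor-chosen targets for a term, or None to fall back to automatic linking."""
    for entry in lookup.iter("entry"):
        keywords = [keyword.text for keyword in entry.findall("keyword")]
        if term in keywords:
            return [target.text for target in entry.findall("target")]
    return None

print(targets_for("jury"))
print(targets_for("Heliaea"))
print(targets_for("Pericles"))
```

A result of None corresponds to the case where the term is absent from lookup.xml and the stylesheets generate their default link instead.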
And the standoff markup, being itself a well-formed XML file, could be used in other ways, in other projects, since it is also independent of the stylesheets that manipulate it under the current implementation.
The Future: Inscriptions, Ancient Works, and TextServices.
In the near future, we hope to add to the growing Dēmos library two important collections. First, Michael Arnush has been editing a collection of inscriptions, with notes and translations, fundamental to our understanding of Athenian democracy. We will mark these up according to the TEI-DTD, following the conventions of the EpiDoc initiative, which is working to standardize how the TEI tagset should be applied to epigraphic documents. These texts will require a new set of stylesheets to bring them to a wide audience in a useful way: hiding some of the more esoteric conventions of epigraphy from casual readers while making them available to scholars, providing a convenient and intuitive reading environment, and associating those sources with other articles already in place.
We also hope to build a library of ancient literary texts and translations, with rich internal markup, that could be fully integrated into the site. Some of these will be texts not currently available online, such as certain works of Plutarch; others will be new editions of texts now available elsewhere.
This project will stand on the TextServices protocol, described elsewhere in this issue of Classics@, and so the Dēmos Ancient Texts collection will be fully integrated into a distributed digital library of humanist texts, with easy and fully automated sharing of data and metadata.
In fact, all of the workings of Dēmos are going to be rebuilt in accordance with the TextServices protocol. This will make the site itself more orderly, expandable, and compliant with standards, and will allow any other site to incorporate the texts that make up Dēmos into its own collection, for its own purposes.
Some Conclusions
Our discipline is still very new to electronic publishing, its potentials, and its pitfalls, and Dēmos: Classical Athenian Democracy has certainly not been exempt from problems. All contributors to the project have been, to a greater or lesser extent, learning on the job.
But the virtue of working in an electronic medium is that dead-end streets are not one-way; as long as the content remains intact and untouched, there is no real danger of “breaking” anything. And the virtue of working with open standards such as XML, XSLT, and CSS, and working with open-source software such as Tomcat and Cocoon, lies in the opportunities to profit from the experience and talents of others.
Hugh Cayless’ contribution of the Transcoder has been a sine qua non for Dēmos and dozens of other ongoing projects. Bruce Robertson’s deep knowledge of Java-based server software and his patience in sharing it with the less-enlightened have been invaluable. Their work, and the work and insights of Anne Mahoney, Ross Scaife, Neel Smith, Michael Jones, and others, were what allowed this project to come together, and will certainly contribute to any future improvements.
And just as the potential of electronic publication has engendered an active and collegial scholarly community, the technologies and techniques at our disposal here at the beginning of the millennium promise to bring our texts (both our ancient texts and those texts we create as we try to understand them) into fruitful new relationships.