Classics@18: Galli and Nieddu

In Codice Ratio: Using VREs in the Study of the Medieval Vatican Registers

Francesca Galli and Elena Nieddu

1. In Codice Ratio

In Codice Ratio is a research project that aims to develop novel methods for supporting content analysis and knowledge discovery in large collections of historical documents. The goal is to provide humanities scholars with new tools to conduct data-driven studies of large historical sources. We are currently working on the collection of the medieval Vatican registers, focusing in particular on documents and letters drawn up under the pontificate of Honorius III (1216–1227). [1]
As explained below, our project is designed to involve non-experts in tasks as challenging as the recognition of characters and symbols used in handwritten Latin texts whose graphic form (medieval script, abbreviations, etc.) is quite unfamiliar to modern readers. In this regard, namely in employing and promoting crowdsourcing in the humanities, In Codice Ratio may be compared to papyrological projects such as Ancient Lives or Scribes of the Cairo Geniza. [2] At this stage we are engaging high school students who receive specific training (albeit limited in the time and effort required) rather than the general public, yet in the future those interested in participating in transcription activities will likely be involved at various levels, similarly to the aforementioned projects. In turn, the positive feedback and results we have obtained by collaborating with schools could lead others to follow the same path.

2. The Vatican Registers of Pope Honorius III as a case study

Honorius’s registers (Reg. Vat. 9–13) are well suited to the proposed research for various reasons. First, we have digital images (see Figure 1 on the poster), yet many pages have never been transcribed, despite their relevance to medieval history and the history of the Church. To mention one well-known (and already published) example, in Reg. Vat. 12, f. 155r we can read the copy of the first lines of the bull Solet annuere, by means of which the Regula bullata of the Friars Minor was solemnly confirmed in November 1223. Furthermore, the form of handwriting used in the Papal chancery during the thirteenth century was widespread at the time and remains rather legible today, even for those without strong skills in palaeography.
However, there are some drawbacks and limitations. These texts include a large number of complex abbreviations — e.g. uni(versita)te(m) v(estram) ro(gamus), mo(nemus) et hor(tamur) at(ten)te p(er) ap(ostoli)ca v(obis) s(cripta) man(dan)tes q(ua)t(enus) — and many names of places, people, institutions, etc., that are not easily handled by an automatic system. Further, all of these documents are protected by copyright, and several restrictions apply to their reproduction and dissemination.

3. Challenges in automatic transcription

Since, for now, most of these historical documents exist only as images, a necessary first step in any form of data-driven content analysis is transcription of the manuscripts. The problem is challenging because traditional Optical Character Recognition (OCR) systems cannot be employed here: irregularities of writing, ligatures, and abbreviations make standard character segmentation ineffective. To overcome these challenges, state-of-the-art Handwritten Text Recognition (HTR) systems aim to transcribe entire words or lines of text (e.g. Transkribus). [3] They are generally trained on line-level transcriptions produced by human annotators; for medieval manuscripts, this often means involving scholars with strong knowledge of ancient languages and palaeography.

4. Scalable dataset collection via non-expert crowdsourcing

To build our own dataset, by contrast, we conducted an extensive experiment involving a non-expert pool of more than 700 high school students, testing the feasibility of our crowdsourcing approach. Our dataset is composed of 32 symbols: the minuscule characters of the Latin alphabet and a set of special symbols representing the most frequent abbreviations. Through a simple and effective user interface, we provide non-expert users with positive and negative examples and ask them to label symbols that match the sample characters. The resulting task is more akin to pattern matching (or solving captchas) than to actual text transcription, as illustrated in Figure 4.
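Since each symbol image is labeled by several non-expert annotators, the redundant crowd answers must be reconciled before training. One common way to do this is a simple majority vote; the sketch below illustrates the idea, where the data layout and the agreement threshold are illustrative assumptions, not the project's actual aggregation rule:

```python
from collections import Counter

def aggregate_labels(votes, min_agreement=0.6):
    """Aggregate non-expert crowd votes for one symbol image.

    votes: list of labels, one per annotator, e.g. ["a", "a", "d"].
    Returns the majority label if enough annotators agree,
    otherwise None (the sample is discarded as ambiguous).
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None

print(aggregate_labels(["a", "a", "d"]))  # 'a' (2/3 agreement)
print(aggregate_labels(["a", "d"]))       # None (tie, below threshold)
```

Samples whose votes fall below the threshold are simply dropped, which trades dataset size for label quality.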
The described VRE allows us to easily collect a significant number of samples for each symbol and to train a character-level (rather than line-level) recognition model based on a convolutional neural network (CNN). Moreover, it provides an opportunity to train students who are not yet attending university in both information technology and the humanities.
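As an illustration of what such a character-level classifier might look like, here is a minimal CNN over the 32 symbol classes, written in PyTorch. The architecture, layer sizes, and input resolution are assumptions made for the sketch, not the project's actual network:

```python
import torch
import torch.nn as nn

class SymbolCNN(nn.Module):
    """Minimal CNN classifying 32 character/abbreviation symbols.

    Input: 1x56x56 grayscale crops of segmented symbols (the crop size
    is an illustrative assumption).
    """
    def __init__(self, num_classes=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 56 -> 28
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 28 -> 14
        )
        self.classifier = nn.Linear(32 * 14 * 14, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = SymbolCNN()
logits = model(torch.randn(8, 1, 56, 56))  # a batch of 8 symbol crops
print(logits.shape)  # torch.Size([8, 32])
```

Each output row holds one score per symbol class; the highest-scoring class is the predicted transcription for that crop.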
However, non-expert crowdsourcing of symbols can result in annotation sparseness: only about 50% of the words we submitted to the crowd had every symbol covered, whereas the remaining ones showed evident gaps in the annotation, thus preventing the generation of a fully crowdsourced line-by-line transcription.

5. Automatic transcription pipeline

Our approach to automatic transcription can be summarized in four main steps (Figure 3). It is worth noting that, because we rely on symbol-level recognition, we are able to resolve abbreviations only when their transliteration is unique, meaning that our approach generates diplomatic transcriptions for the most part.
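The abbreviation-handling rule above can be made concrete with a short sketch: a recognized symbol is expanded only when it admits exactly one transliteration, and is otherwise kept as-is, yielding a diplomatic transcription. The mapping entries below are hypothetical examples, not the project's actual lexicon:

```python
# Hypothetical mapping from recognized abbreviation symbols to
# candidate expansions (NOT the project's actual lexicon).
ABBREVIATIONS = {
    "p(er)": ["per"],              # unique: safe to expand
    "q(ua)t(enus)": ["quatenus"],  # unique: safe to expand
    "hor()": ["hortamur", "horum"],  # ambiguous: kept as-is
}

def expand(symbol):
    """Expand an abbreviation only when its transliteration is unique."""
    candidates = ABBREVIATIONS.get(symbol, [])
    return candidates[0] if len(candidates) == 1 else symbol

print(expand("p(er)"))  # 'per'
print(expand("hor()"))  # 'hor()' (ambiguous, left unexpanded)
```

Ambiguous marks would require context (e.g. a language model over whole words) to resolve, which is why the output remains largely diplomatic.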

6. Results

We tested our system on 39 high-resolution original manuscript pages from the same Vatican Register (Reg. Vat. 12) used for crowdsourcing, but spanning different writers; these pages were transcribed in full by volunteer palaeographers. We evaluated performance using the Character Error Rate (CER) metric on the diplomatic transcription of each page.
To assess our pipeline’s effectiveness in low-resource settings, we trained it on four subsets of the total crowdsourced annotations, comprising 2, 5, 10, and 20 pages, achieving CERs of 25%, 23%, 21%, and 19% respectively. We also compared our system with the Tesseract OCR engine v4.0: [4] using the same training data, we outperform it by up to 10% CER.
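For reference, CER is the Levenshtein edit distance between the system output and the reference transcription, normalized by the reference length. A minimal self-contained implementation:

```python
def cer(reference, hypothesis):
    """Character Error Rate: edit distance (insertions, deletions,
    substitutions) between hypothesis and reference, divided by the
    reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (0 if match)
            ))
        prev = curr
    return prev[-1] / len(reference)

print(cer("universitatem", "universitatem"))  # 0.0
print(cer("rogamus", "rogamos"))              # ~0.14 (1 substitution / 7 chars)
```

A CER of 19% thus means roughly one character-level edit for every five characters of the reference text.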
The system’s coverage and accuracy could be improved by crowdsourcing new symbols from the Registers, such as abbreviations that have not yet been included, additional capital letters, and syllables that are particularly frequent and/or challenging to segment, thus shifting the chance of error from segmentation to classification, whose accuracy is much higher.

7. Ongoing and future work

The pipeline’s accuracy is already high enough for the transcriptions to be used in a range of information extraction tasks, such as full-text search and named entity recognition (Figure 5). To this end, we are developing further VREs, such as a web application able to integrate and display human-made and machine-made transcriptions and annotations over the original texts, using the IIIF (International Image Interoperability Framework) standard. All transcriptions and annotations will be fully searchable through keywords, patterns, and fuzzy matching.
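Fuzzy matching of this kind can be sketched with the Python standard library: approximate string comparison lets a query tolerate the residual recognition errors in machine-made transcriptions. The tokens below are hypothetical, and the similarity cutoff is an assumption:

```python
import difflib

def fuzzy_search(query, tokens, cutoff=0.75):
    """Return transcription tokens that approximately match the query,
    tolerating small character-level recognition errors."""
    return difflib.get_close_matches(query, tokens, n=5, cutoff=cutoff)

# Hypothetical tokens from a machine-made transcription,
# one of which contains a recognition error ('rogamos').
tokens = ["universitatem", "vestram", "rogamos", "monemus"]
print(fuzzy_search("rogamus", tokens))  # ['rogamos']
```

A search for the correct Latin form thus still retrieves the slightly mis-recognized token, which matters when the CER is non-negligible.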
Furthermore, the transcription system itself could be improved in several ways: by extending the types of writing it is able to recognize, by introducing new symbols, or by using the existing transcriptions to train a line-level, segmentation-free recognition model, once again with no or minimal expert involvement, in a distant supervision framework. As mentioned in the first paragraph, we are considering making use of crowdsourcing platforms as a means of involving a wider audience in the various phases of the transcription task. A number of research papers on crowdsourcing in the Digital Humanities highlight the benefits of this approach, both in terms of the results achieved and as a way of encouraging shared responsibility towards cultural heritage. [5] Nevertheless, such a step requires a feasibility study at several levels (e.g. the properties and design of the task, an assessment of the resources required in terms of time and costs, etc.) and a careful evaluation of all the “uncertainties” that crowdsourcing tools necessarily entail. [6]

Bibliography

Ammirati, S., D. Firmani, M. Maiorino, P. Merialdo, and E. Nieddu. 2019. “In Codice Ratio: Machine Transcription of Medieval Manuscripts.” In Digital Libraries: Supporting Open Science, ed. P. Manghi, L. Candela, and G. Silvello, 185–192. Cham.
Boyle, L. E. 1972. A Survey of the Vatican Archives and Its Medieval Holdings. Toronto.
Law, E., K. Z. Gajos, A. Wiggins, M. L. Gray, and A. Williams. 2017. “Crowdsourcing as a Tool for Research: Implications of Uncertainty.” In CSCW ’17: Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, ed. C. P. Lee, 1544–1561. New York.
Sánchez, J. A., V. Bosch, V. Romero, K. Depuydt, and J. De Does. 2014. “Handwritten Text Recognition for Historical Documents in the Transcriptorium Project.” In Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, 111–117. New York.
Smith, R. 2016. “Tesseract Blends Old and New OCR Technology.” https://github.com/tesseract-ocr/docs/tree/master/das_tutorial2016.
Terras, M. 2016. “Crowdsourcing in the Digital Humanities.” In A New Companion to Digital Humanities, ed. S. Schreibman, R. Siemens, and J. Unsworth, 420–439. Chichester.
Vannini, L. 2019. “Trends in Digital Humanities: Insights from Digital Resources for the Study of Papyri.” In Proceedings of the Digital Humanities Congress 2018, ed. L. Pitcher and M. Pidd. Sheffield. https://www.dhi.ac.uk/openbook/chapter/dhc2018-vannini.
Scribes of the Cairo Geniza. https://www.scribesofthecairogeniza.org/.

Footnotes

1. For a general introduction, see at least Boyle 1972.
2. See Vannini 2019; see also Scribes of the Cairo Geniza.
3. Sánchez et al. 2014. See also https://readcoop.eu/transkribus/.
4. Smith 2016. See also https://tesseract-ocr.github.io/docs/.
5. See, for example, Terras 2016.
6. See Law et al. 2017.