Monday, 13 January 2014

From Bordeaux to Jerusalem and Back Again: Introducing Recogito (Pt. 1)

Welcome back to another update from our Infrastructure Workpackage 2 - "Annotation Toolkit", affectionately known as IWP2. In our previous IWP2 post, we talked a little bit about the basics of annotating place references in early geospatial documents. We also presented a first sample dataset based on the Vicarello Beakers. What we have not yet talked about, however, is how we actually annotate our documents in the first place.

The general plan behind the Pelagios annotation workflow is this:

  1. We use Named Entity Recognition (NER) to identify a first batch of place names automatically in our source texts. This step is also called "geo-parsing", and tells us which toponyms there are in our text, and where in the text they occur. We implemented NER using the open source Stanford NLP Toolkit, and presently restrict this step to English translations of our documents. In a later project phase, we intend to cross-match the data gathered from the English translations to the original-language versions, which is likely to be more feasible within the lifetime of the project than attempting Latin-language NER.
  2. NER gives us the toponyms. What it does not tell us anything about, however, is which places they represent, or where these places are located. Next, we therefore look up the toponyms in our gazetteer, and determine the most plausible match. This step is called "geo-resolution", and - like NER - is also fully automated.
  3. Naturally, neither geo-parsing nor geo-resolution works perfectly. Therefore, we need to manually verify the results of our automatic processes, correct erroneous NER or geo-resolution matches, and fill gaps where NER or geo-resolution has failed to produce a result at all. This is where our new tool, Recogito, comes in.
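
To make step 2 a little more concrete, here is a minimal sketch of what a geo-resolution lookup could look like. It is not the code we run in Pelagios: the tiny gazetteer, its placeholder IDs, the string-similarity measure and the threshold are all assumptions chosen purely for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical mini-gazetteer. The IDs are placeholders, not real
# gazetteer identifiers; each entry lists a few name variants.
GAZETTEER = {
    "gaz:0001": ["Burdigala", "Bordeaux"],
    "gaz:0002": ["Hierosolyma", "Jerusalem"],
    "gaz:0003": ["Mediolanum", "Milan"],
}


def similarity(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve(toponym, threshold=0.8):
    """Return (gazetteer_id, score) for the most plausible match.

    Returns None when no name variant reaches the (assumed) threshold,
    i.e. when the toponym is left for manual resolution.
    """
    best_id, best_score = None, 0.0
    for gaz_id, names in GAZETTEER.items():
        score = max(similarity(toponym, name) for name in names)
        if score > best_score:
            best_id, best_score = gaz_id, score
    if best_score >= threshold:
        return best_id, best_score
    return None


if __name__ == "__main__":
    for toponym in ["Bordeaux", "Jerusalm", "Xyzzy"]:
        print(toponym, "->", resolve(toponym))
```

Toponyms for which `resolve` returns `None`, as well as those it matches to the wrong place, are exactly the cases that step 3 has to catch by hand.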

Fig. 1: data from the Bordeaux Itinerary in Recogito (interactive version in Latin and English).

The Itinerarium Burdigalense

The first document we've tackled entirely in Recogito is the Itinerarium Burdigalense (or Bordeaux Itinerary), a travel document that records a pilgrim route between the cities of Bordeaux and Jerusalem. It is considered the oldest Christian pilgrimage document, dated to 333 AD - just 20 years after the Edict of Milan of 313, in which the Emperor Constantine granted religious liberty to Christians (and to other religions). Formally, this document is in some respects very similar to the Itinerarium Provinciarum Antonini Augusti: both are compiled as a list of places with the distances between them. Additionally, the Itinerarium Burdigalense marks all the places as mutatio, mansio or civitas (change, halt or city), in a similar way to the Peutinger Table. The format of the document changes when the journey reaches Judea, where it offers detailed descriptions of places important to Christian pilgrims. So we can consider it an itinerarium in the tradition of Greek and Roman writing, except for its Christian emphasis. (We've compiled a detailed bibliography for the Itinerarium Burdigalense here. The text of an English translation can be found, for example, on this Website.)

Annotating the Bordeaux Itinerary with Recogito

Recogito presents the results of our automatic processing steps in two flavours: in a text-based user interface, which is primarily designed to inspect and correct what the geoparser has done; and in a map-based interface, which is used to work with the results from the geo-resolution step. A screenshot of the latter is shown in Fig. 2, and we will explore it in more detail below. The former interface (which benefits from a little prior knowledge of the map-based interface) we will discuss in a separate blog post.

Geo-Resolution Verification & Correction

The map-based interface separates the screen into a table listing the toponyms, and a map that shows how they are mapped to places. The primary work area for us in this interface is the table: here, we can scroll through all the toponyms and quickly check the gazetteer IDs they were mapped to. As a matter of policy, we want to explicitly keep track of which toponyms have been looked at by someone, and which haven't. To that end, each entry in the table can be 'signed off' as either a verified gazetteer match, an unknown place, or a false NER detection. (In addition, there is also a generic 'ignore' flag, for toponyms that may be correctly identified in a technical sense, but which we don't want to appear in the map for whatever reason.)
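
The sign-off states described above could be modelled roughly as follows. This is only a sketch: the class, field and status names are made up for this post and are not Recogito's actual data model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Status(Enum):
    """Hypothetical sign-off states for a toponym in the table."""
    NOT_VERIFIED = "not verified"        # nobody has looked at it yet
    VERIFIED = "verified"                # gazetteer match confirmed
    UNKNOWN_PLACE = "unknown place"      # real toponym, no known place
    FALSE_DETECTION = "false detection"  # NER flagged a non-toponym
    IGNORE = "ignore"                    # valid, but kept off the map


@dataclass
class ToponymEntry:
    toponym: str
    gazetteer_id: Optional[str] = None   # placeholder ID, may be missing
    status: Status = Status.NOT_VERIFIED


def pending(entries: List[ToponymEntry]) -> List[ToponymEntry]:
    """Entries that have not been signed off by anyone yet."""
    return [e for e in entries if e.status is Status.NOT_VERIFIED]


if __name__ == "__main__":
    table = [
        ToponymEntry("Bordeaux", "gaz:0001", Status.VERIFIED),
        ToponymEntry("Jerusalem", "gaz:0002"),
        ToponymEntry("Pilgrim", None, Status.FALSE_DETECTION),
    ]
    print([e.toponym for e in pending(table)])
```

Keeping "not yet looked at" as an explicit state of its own, rather than inferring it from a missing gazetteer ID, is what lets us tell unreviewed entries apart from reviewed ones that simply have no match.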

Fig. 2: Recogito map-based geo-resolution correction interface.

Double-clicking an entry in the table opens a window with details for the toponym (Fig. 3): the window shows the previous automatic gazetteer match (if any), the latest manual correction, and a text snippet showing the toponym in context. A list of suggestions for other potential gazetteer matches, along with a small search widget, allows us to quickly re-assign the gazetteer match in case it is incorrect. The change history for each toponym is recorded, so we know who has changed what (and when), and whether some places see substantially more edits than others in the long run. Furthermore, manual changes are recorded separately from the initial automatic results. This way we will be able to benchmark the performance of NER and automatic geo-resolution later on. Detailed figures for the Bordeaux Itinerary are not yet out - but our initial figures suggest that NER has caught about 2/3 of all toponyms, and that approx. 80% of NER results were correct detections. The automatic geo-resolution correctly resolved between 30% and 40% of the toponyms.
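
Because automatic results and manual corrections are kept apart, benchmark figures of this kind can be derived with very little arithmetic. The sketch below shows one way to do it, reading "caught" as recall and "correct detections" as precision; the function names are hypothetical, and the counts in the example are invented to match the rough proportions quoted above, not actual project data.

```python
def ner_precision_recall(correct, false_detections, missed):
    """Precision and recall of the NER step.

    correct          -- NER detections confirmed during verification
    false_detections -- NER detections signed off as false detections
    missed           -- toponyms that had to be added by hand
    """
    precision = correct / (correct + false_detections)
    recall = correct / (correct + missed)
    return precision, recall


def resolution_accuracy(pairs):
    """Share of toponyms whose automatic match survived verification.

    pairs -- list of (automatic_id, verified_id) tuples; automatic_id
             is None when geo-resolution produced no result at all.
    """
    hits = sum(1 for auto, verified in pairs
               if auto is not None and auto == verified)
    return hits / len(pairs)


if __name__ == "__main__":
    # Invented counts, for illustration only.
    p, r = ner_precision_recall(correct=80, false_detections=20, missed=40)
    print(f"precision={p:.2f} recall={r:.2f}")

    # Placeholder IDs, again purely illustrative.
    pairs = [("gaz:0001", "gaz:0001"), (None, "gaz:0002"),
             ("gaz:0003", "gaz:0004")]
    print(f"resolution accuracy={resolution_accuracy(pairs):.2f}")
```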

Fig. 3: toponym details.

While Recogito is still under heavy construction, Pau is already deeply buried in the next document - which we will present in one of our next blogposts, together with an overview of the text-based interface.
