Friday 15 April 2011

Tagging Places on Old Maps: The DME Scenario

Following the productive Pelagios workshop at KCL, we (DME) have been busy tweaking our infrastructure to interoperate according to the "Pelagios Principles".

For DME, the situation is slightly different from that of Pelagios' other data partners: First, our existing data model is already based on OAC, which probably makes our transition somewhat easier. Second, instead of owning an extensive existing data set, our "asset" is an end-user toolkit for manual annotation and semantic tagging of multimedia content (affectionately referred to as YUMA - the YUMA Universal Multimedia Annotator).

Tagging Maps with Pleiades References

In other words: instead of mapping existing place references in our data to URIs in the Pleiades namespace, our primary task was to modify YUMA in such a way that users can, from now on, tag media items with references to Pleiades places!

This has turned out to be quite straightforward: The simplest way by which users can add semantic tags to their annotations in YUMA is through an auto-completion textbox: As the user starts typing, available tags will appear in a drop-down box underneath the typed text. (The screenshot above shows this for the 'Corsica' example mentioned in an earlier post by Mathieu.)
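
As a rough illustration of the lookup behind such a textbox, here is a minimal Python sketch of prefix-based tag suggestion. It is not YUMA's actual code (which is not shown in this post): the tag list, the example.org URIs and the `suggest` function are all hypothetical.

```python
from bisect import bisect_left

# Hypothetical tag vocabulary: (label, URI) pairs. The URIs are placeholders,
# not real Pleiades identifiers.
TAGS = [
    ("Corsica", "http://example.org/places/1"),
    ("Corinth", "http://example.org/places/2"),
    ("Carthage", "http://example.org/places/3"),
    ("Crete", "http://example.org/places/4"),
]

# Sort once by lower-cased label so that all labels sharing a prefix
# sit next to each other.
INDEX = sorted((label.lower(), label, uri) for label, uri in TAGS)
KEYS = [key for key, _, _ in INDEX]


def suggest(prefix, limit=10):
    """Return up to `limit` (label, uri) pairs whose label starts with `prefix`."""
    needle = prefix.lower()
    if not needle:
        return []
    # Binary search for the first label that could match the prefix.
    start = bisect_left(KEYS, needle)
    matches = []
    for key, label, uri in INDEX[start:]:
        if not key.startswith(needle):
            break
        matches.append((label, uri))
        if len(matches) == limit:
            break
    return matches
```

Calling `suggest("cor")` on this toy vocabulary returns the entries for Corinth and Corsica, which is the kind of list a drop-down box would display.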

Making Pleiades place references available as tags in YUMA involved two steps:
  • First, we needed to incorporate a list of Pleiades place names (and their URIs) into our Tag Server. The Tag Server is the component which hosts tag vocabularies and provides the tag suggestions for the auto-completion hints. We got a Pleiades CSV data dump, from which we built an index using Apache Lucene, and set up the index as an additional tag source.
  • Second, we needed to provide a 'Pelagios' view on our internal data to the outside world. The reason for this is that our existing RDF representation (although based on OAC) is close to, but not identical to, the Pelagios model. We therefore set up an alternative RDF output channel, mapped to a .pelagios suffix. Appending this suffix to the annotation URI will return Pelagios-compatible annotations in RDF/XML, N3 or Turtle notation, depending on content negotiation.
The 'Corsica' annotation, for example, is available at this URI:

It will resolve to the following RDF (abbreviated for better readability, key parts red):

@prefix oac: <> .

   a  oac:Target .

   a  oac:Annotation ;
      "rsimon" ;
      "Corsica" ;
      "2011-04-11 11:12:34.295" ;
      "2011-04-11 11:12:55.747" ;
      <> ;
      <> .
   a  oac:ConstrainedTarget ;
      <> ;
      <> .
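
The content negotiation mentioned above can be sketched roughly as follows. This is a simplified Python illustration, not the code running on our server: the media types are the ones commonly used for the three notations, while the `pick_format` function, the format labels and the RDF/XML default are assumptions made for the example.

```python
# Assumed media types for the three serializations named in the post.
FORMATS = {
    "application/rdf+xml": "rdfxml",
    "text/n3": "n3",
    "text/turtle": "turtle",
}
DEFAULT = "rdfxml"  # assumed fallback when nothing in the header matches


def pick_format(accept_header):
    """Pick a serialization from an HTTP Accept header, honouring q-values."""
    best, best_q = DEFAULT, 0.0
    for part in accept_header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        q = 1.0  # a media type without an explicit q-value has weight 1
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        # Keep the supported type with the highest weight; first one wins ties.
        if media_type in FORMATS and q > best_q:
            best, best_q = FORMATS[media_type], q
    return best
```

The sketch ignores wildcards such as `*/*`, which a real server would also need to handle.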

Lessons Learned

Two concluding remarks: first, I would eventually like to see our 'YUMA OAC representation' unified with the Pelagios one. This is currently hindered a bit by the way the OAC model is organized (i.e. with only one body per annotation allowed, and no official specification on structured bodies). Right now, the difference is therefore that our oac:Body is a separate node that includes all elements that comprise the user's annotation (title, text body, multiple tags, etc.), whereas in Pelagios we consider the Pleiades reference to be the oac:Body directly. (For comparison, the YUMA OAC representation of the example above is here.)

Second, one of the benefits that we gain for YUMA from linking annotations to Linked Data resources is that we can get extra context information by dereferencing them. For example: if we link to DBpedia we can get things such as labels in different languages or alternative spellings for person names. This, in turn, provides us with a very lightweight mechanism for including (at least limited) multilingual and synonym search functionality in our system. (There's a video screencast demonstrating this here.) With our Pleiades integration, we don't quite get this yet: What's still missing is a mapping between Pleiades URIs and matching resources from e.g. Geonames or DBpedia where possible. (However, Pleiades+ might be just the way to address this I guess!)

Wednesday 13 April 2011

To ASCII or not to ASCII

So, you've got your locations all beautifully geoparsed, but how is anyone going to search for Liévin, let alone Łódź or Uherské Hradiště?

In the old days of the web, this wasn't a problem. They couldn't. At least, not without a huge amount of learning and expense on everyone's part. In the English-speaking world, text on the web meant ASCII. This restricted you, basically, to the characters on a standard US keyboard: 128 characters, about a quarter of which were things like 'tab' and 'line feed'. If you wanted your readers to read 'é', you sent them something called an entity (&eacute; in this case). If you wanted your readers to read 'Ł', you might even have to send them a little image of the letter.

This was stupid, obviously. And, thank god, we now have browsers, databases, and programming languages that speak a different text 'language': UTF-8 (Unicode). This gives us room for over a million characters, and means that if you want people to read 'š', or even 'אַ', you can send them plain text and not have to worry too much about operating systems, browser support, or reproducing 2000 .gif files whenever you change your site design (been there, done that, please don't judge if you haven't supported Internet Explorer 4).

But this only solves half of the problem: reading, but not writing. If you have to type 'Ł', and you're using a standard US/UK keyboard, we haven't come very far. The existing methods are either clumsy, slow, or require a lot of memorisation (or all three). It's all very well expecting admins to take a bit of extra time to enter the correct name - although a little bit of AJAX and GeoNames means that they don't have to. But, as this rather forthright blog points out, is it realistic to expect this from our users?

To give you an example of how this applies to the Pleiades data, take the case of Mérida, in Spain. Mérida (Roman Emerita Augusta, Pleiades 256155, GeoNames 2513917) is quite an important location for the APGRD (Archive of Performances of Greek and Roman Drama), as its wonderfully preserved Roman theatre is the venue for many new versions of classical plays - last year's Lysistrata, for example (warning, very slightly NSFW!).

Because the APGRD has no time frame - we're interested in every single performance, from antiquity onwards - many of our users won't know that Mérida used to be Emerita Augusta, so I can't rely on them searching for that. However, a search on the Pleiades dataset for Merida (no accent) returns no results because for the computer, 'Merida' and 'Mérida' are two different values.

I imagine those of a non-English-speaking background will be tempted to say "so what, we've been typing non-ASCII characters all our lives - you do know our keyboards look different, don't you?". Likewise, those who've put in the effort to learn Beta Code or Unicode key combinations to get the best out of TLG and Perseus may be less forgiving.

However, I must admit that my own sympathies lean somewhat toward the author of that post. My own experience, and what I've learned from geoparsing our database's locations (entered over 10 years by a variety of people) is that:
  1. people don't necessarily know that a location has non-ASCII characters, or what the right ones are (pop quiz, no googling: how do you (properly) spell Liege?). They'll get no results and not realise that they've entered the wrong search string.
  2. a substantial proportion of people, even academics, have no idea how to enter non-ASCII characters - and even those who can are generally only good at doing it in languages/character sets with which they are familiar.
  3. of these, how many are going to cut-and-paste text from elsewhere (which does not always give you the characters you want)?
So, I'm sure of the problem, but not necessarily of the solution. It seems to me that the best way to do this might be to convert all search strings to ASCII before matching them against ASCII-ised locations (at least, for user search). But does this lead to a horrible loss of precision? Are there better ways?
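
To make the idea concrete, here is a minimal Python sketch of such ASCII folding, using only the standard library. It is one possible approach rather than a recommendation: the `fold` and `search` functions and the `EXTRA` table are made up for this example, and the table of letters without a Unicode decomposition is illustrative, not exhaustive.

```python
import unicodedata

# Letters that have no Unicode decomposition and so need an explicit mapping.
# This table is illustrative, not exhaustive.
EXTRA = str.maketrans({
    "Ł": "L", "ł": "l", "Ø": "O", "ø": "o",
    "ß": "ss", "Æ": "AE", "æ": "ae",
})


def fold(text):
    """Reduce a string to a lower-case ASCII search key."""
    text = text.translate(EXTRA)
    # NFKD splits 'é' into 'e' plus a combining accent, which is then dropped.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII is discarded.
    return stripped.encode("ascii", "ignore").decode("ascii").lower()


def search(query, places):
    """Return the original (accented) names whose folded form matches the query."""
    key = fold(query)
    return [place for place in places if fold(place) == key]
```

With this, `search("Merida", ["Mérida", "Liège"])` finds 'Mérida', and `fold("Łódź")` gives 'lodz'. The loss of precision is real, though: two distinct names that differ only in their diacritics fold to the same key, which is why the sketch returns the original names so that the user can choose between them. Names in non-Latin scripts are simply emptied by the final ASCII step, so this only makes sense for Latin-alphabet data.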

And I have it easy - all my locations are at least in the Latin alphabet!

Tuesday 5 April 2011

A newcomer to Pelagios: the Ptolemy Machine

In our extant literature from the Greco-Roman world, Claudius Ptolemy's Geography is a unique treatise on mathematical cartography. Its relevance for Pelagios is in part that it includes the largest scientific data set to survive from antiquity: more than 6000 named points are located with longitude-latitude coordinates in books 2-7, and more than 350 key cities are again located in an alternate astronomical coordinate system in book 8. The text of book 1 provides a systematic introduction, including two methods for drawing world maps. Books 7 and 8 include instructions for other kinds of visualizations.

With its combination of methods and data, Ptolemy's text is a kind of software for GIS. Ptolemy includes specific suggestions about how to realize his plan in hardware: he goes beyond the mathematical theory to discuss machinery and physical procedures for constructing maps. Scribes manually copying the text of the Geography applied these procedures directly to create maps in their manuscripts, but print editions have obscured the nature of Ptolemy's project as a dynamic (if not rapid) GIS.

To understand Ptolemy's work today, his software should be implemented in software. I have recently completed an initial digital edition of the Geography that is currently accessible via the Canonical Text Services protocol. From this edition, I can automatically extract all the named points with their lon-lat coordinates and several other attributes that Ptolemy assigns them in his text (e.g., type of feature, political unit, larger physical aggregate it belongs to...), and can serialize this information in a variety of formats for use in a GIS, including KML for direct use in Google Earth or Google Maps. (Above, Ptolemy's points in Google Earth colored by Roman province or foreign "satrapy.")
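
As a rough idea of what the KML serialization step can look like, here is a minimal Python sketch using the standard library's ElementTree. It is not the code used for the edition: the `to_kml` function, the tuple layout and the sample records are hypothetical placeholders, not values taken from Ptolemy's text.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def to_kml(points):
    """Serialize (name, lon, lat, province) tuples as a KML document string."""
    # Emit KML elements in the default namespace, without a prefix.
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    doc = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for name, lon, lat, province in points:
        placemark = ET.SubElement(doc, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
        ET.SubElement(placemark, f"{{{KML_NS}}}description").text = province
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML expects longitude first, then latitude.
        ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")


# Placeholder records for illustration only.
sample = [
    ("Place A", 10.0, 40.0, "Province X"),
    ("Place B", 12.5, 41.25, "Province X"),
]
kml_text = to_kml(sample)
```

The resulting string can be saved with a .kml extension and opened in Google Earth. Coloring points by province, as in the image, would additionally need KML style elements, which the sketch leaves out.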

I am currently applying a variety of automated tests to the source text. In addition to traditional Greek "spell checking," I am analyzing Ptolemy's data for the spatial consistency of his attributes. (When all but one of the points assigned to a given Roman province fall uniquely within the convex hull of that group, but one point falls within the cluster of a different province, the most probable cause is a textual error. I'll blog a couple of telling examples in a later post.)
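
A simplified version of this kind of consistency test can be sketched in Python as follows. It is only an approximation of the test described above, under stated assumptions: coordinates are treated as points on a flat plane, the function names are made up, and a point is flagged whenever it falls inside the convex hull of another group's points. The sample coordinates are invented for the example.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def inside(hull, p):
    """True if p lies inside or on the boundary of a counter-clockwise hull."""
    if len(hull) < 3:
        return False
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


def suspects(groups):
    """List (point, own_group, other_group) for points inside another group's hull."""
    hulls = {name: convex_hull(pts) for name, pts in groups.items()}
    found = []
    for own, pts in groups.items():
        for p in pts:
            for other, hull in hulls.items():
                if other != own and inside(hull, p):
                    found.append((p, own, other))
    return found


# Invented (lon, lat) pairs: one point labelled "X" sits among the "Y" points.
groups = {
    "X": [(0, 0), (4, 0), (4, 4), (0, 4), (12, 11)],
    "Y": [(10, 10), (14, 10), (14, 14), (10, 14)],
}
flagged = suspects(groups)
```

Here `flagged` contains the single stray point (12, 11), labelled "X" but lying inside the hull of "Y". A flagged point is only a candidate for a textual error and still needs to be checked against the text by hand.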

During the period of the Pelagios project, I'm planning to complement the text-view of the Geography with a GIS view of Ptolemy's text exposed through one or more network services. (I am still experimenting with a variety of options; I'm a newcomer to Google's Fusion Tables, but am impressed with their easy access to people who want to mash up geographic features, in the spirit of Pelagios.) I'm looking forward to seeing what others might do with this material.

Neel Smith
College of the Holy Cross
Worcester, MA