Friday, 20 July 2012

How does CLAROS make its annotations for Pelagios?

The CLAROS data aggregation setup has been through several iterations since we first got involved with Pelagios in 2010, and work on OAC annotations has proceeded fitfully. However, we are now generating triples in the proper form as part of our regular data build, which makes it an appropriate time to report on what we have achieved.

CLAROS currently aggregates data from 12 partners, most of whose material relates to the ancient world. The input is RDF/XML expressed against the CIDOC CRM, largely describing objects:

ID             Partner                                           Records
arachne        Arachne                                           185119 objects
ashmol         Jameel Collection, Ashmolean                      2316 objects
beazley        Beazley Archive                                   130960 objects
bsa            British School at Athens                          (pending)
bsr            British School at Rome, photographs and plans     16043 objects
creswell       Creswell Photographic Archive, Ashmolean          6521 objects
cycladic       Cycladic Museum, Athens                           348 objects
lgpn           Lexicon of Greek Personal Names                   251821 people
limc           LIMC Paris                                        4724 objects
limcbasel      LIMC Basel                                        55852 objects
metamorphoses  Gazetteer                                         9396 places (6325 geolocated)
waa            World of Ancient Art                              406 places

For the purposes of Pelagios, the more interesting figures relate to our coverage of places:
  • c.9300 places known
  • c.6200 places geolocated
  • c.1500 places linked to Pleiades
  • c.4330 places linked to geonames
Overall, CLAROS gives access to c.125000 objects via their geolocated find spot (out of c.400000 in total), and to c.161000 people via a "birth" place (out of c.250000). We therefore still have a considerable amount of work to do in mapping place names together and in relating named places to geolocations. Bear in mind, though, that many places will never (?) be in Pleiades (e.g. those in Japan and China), so the daunting-looking discrepancy of 1500 Pleiades links vs 4300 geonames links is not as bad as it first appears.

The majority of the data reaching CLAROS uses a simple place name, so the main work of our ingest procedure is to attempt to map that name to a known place (and thence to Pleiades). The procedure may be of interest:
  1. Does the <E53_Place> in the RDF already have a geolocation? OK
  2. Normalize place name. Translate space to -, lower-case, etc
  3. Does the name match an entry in our mapping table?
      from="academy" to="athens-academy"
      from="aegypten" to="egypt"
      from="agios-ioannis" to="athens-agios-ioannis"
      from="agli" to="aglie"
      from="agrigento" to="sicily-agrigento"
      from="aidinjik" to="edincik"
    if so, use the canonical form
  4. Does name of place match a known place? link to that place
  5. Does name of place partially match a place? create an <E53_Place> which has a <P89_falls_within> linking to the half-match. Example "athens-kerameikos"
  6. Does <E53_Place> have a geonames link? get lat/long from geonames
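
Steps 2 to 5 can be sketched roughly in Python. This is an illustration rather than the actual CLAROS ingest code: the function names, the set of known places, and the rule used for a "partial" match (the longest known name that is a leading hyphen-separated prefix) are all assumptions made for the example. Only the mapping entries come from the list above.

```python
import re

# Excerpt of the mapping table shown in step 3.
MAPPING = {
    "academy": "athens-academy",
    "aegypten": "egypt",
    "agios-ioannis": "athens-agios-ioannis",
    "agli": "aglie",
    "agrigento": "sicily-agrigento",
    "aidinjik": "edincik",
}

# Hypothetical set of canonical place names already known to the system.
KNOWN_PLACES = {"athens", "egypt", "athens-academy", "sicily-agrigento"}


def normalize(name):
    """Step 2: trim, lower-case, and turn runs of whitespace into '-'."""
    return re.sub(r"\s+", "-", name.strip().lower())


def resolve(name, mapping=MAPPING, known=KNOWN_PLACES):
    """Return (kind, place, parent) for a raw place name.

    kind is "exact", "partial" or "unknown"; parent is only set for a
    partial match, where it plays the role of the P89_falls_within target.
    """
    key = normalize(name)
    key = mapping.get(key, key)      # step 3: use the canonical form if mapped
    if key in known:                 # step 4: full match on a known place
        return ("exact", key, None)
    # Step 5 (assumed rule): look for the longest known leading prefix.
    parts = key.split("-")
    for i in range(len(parts) - 1, 0, -1):
        prefix = "-".join(parts[:i])
        if prefix in known:
            return ("partial", key, prefix)
    return ("unknown", key, None)
```

With these sample tables, "Aegypten" resolves to the known place "egypt", while "Athens Kerameikos" becomes a new place "athens-kerameikos" that falls within "athens".
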
Flaky though this can sound, it largely works. More or less all places which have a hundred or more objects attested from them are safely identified. The CIDOC CRM RDF which identifies a place can look something like this:

   <E53_Place rdf:about="">
      <rdfs:label>[GR] Astypalaia</rdfs:label>
         <E48_Place_Name rdf:about="">
         <E47_Place_Spatial_Coordinates rdf:about="">
               <geo:Point xmlns:geo="">
      <skos:closeMatch rdf:resource=""/>
      <skos:closeMatch rdf:resource=""/>
      <P89_falls_within rdf:resource=""/>

Once we have the hundreds of thousands of objects and people duly linked to a place, it is easy to associate them with Pleiades, via the <skos:closeMatch> shown in the example. The data is loaded into an RDF triple store (Jena), and then we can run the following SPARQL query to generate a new set of triples containing the needed OAC annotations:

  # prefix declarations not shown
  CONSTRUCT {
    ?anno a oac:Annotation ;
      dcterms:conformsTo <> ;
      oac:hasTarget ?object ;
      oac:hasBody ?pleiades .
    ?object a oac:Target, crm:E22_Man-Made_Object ;
      rdfs:label ?label .
  }
  WHERE {
    ?object crm:P16i_was_used_for [
        crm:P2_has_type <> ;
        crm:P7_took_place_at ?place
      ] ;
      rdfs:label ?label .
    ?place skos:closeMatch* ?pleiades .
    FILTER (regex(str(?pleiades), "pleiades")) .
    BIND (uri(concat("", sha1(str(?object)))) as ?anno) .
  }
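
For readers less used to SPARQL, the last three lines of the query can be mimicked in a few lines of Python. This is only a sketch of the logic, not what the triple store does internally: the dictionary of closeMatch links, the function names, and the base URI "http://example.org/anno/" are made up for the example (the real base URI is not reproduced here).

```python
import hashlib


def reachable(start, close_match):
    """Nodes reachable from start via zero or more closeMatch links,
    which is what the property path skos:closeMatch* expresses."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in close_match.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def annotations(object_uri, place_uri, close_match,
                base="http://example.org/anno/"):  # hypothetical base URI
    """Return (annotation, target, body) tuples for one object."""
    # BIND: the annotation URI is the base plus the SHA-1 hex digest
    # of the object URI, so it is stable from one data build to the next.
    anno = base + hashlib.sha1(object_uri.encode("utf-8")).hexdigest()
    # FILTER: keep only the matches whose URI mentions "pleiades".
    bodies = sorted(u for u in reachable(place_uri, close_match)
                    if "pleiades" in u)
    return [(anno, object_uri, body) for body in bodies]
```

Because the hash is taken over the object URI alone, an object whose place reaches two Pleiades URIs would, in this sketch, yield a single annotation with two bodies.
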

The resulting triples are loaded into a new graph called "pelagios" in the triple store. Finally, we point the Pelagios folk at the resulting data and at the corresponding VoID description, and results start to appear in Pelagios clients.

So far, so good. But there remain two problems, one practical and one theoretical.

Firstly, the CLAROS collection includes 180000 objects from Arachne, but Arachne is a Pelagios contributor in its own right. This means that the existence of a gold ring from Athens will be reported twice in Pelagios. To solve this, we need to adjust the SPARQL query above to run separately against each of the partner data collections and generate discrete sets of OAC triples. Pelagios can then avoid harvesting Arachne from CLAROS, on the assumption that the data is better taken from the source.

Secondly, some of the relationships in CLAROS start to strain the notion of an annotation. When a person called Alexandros comes from a place called Athens, is it really sensible to say that the person "annotates" the place? It could equally be argued that the place annotates the person. In some ways this does not matter, so long as all the data contributors follow the same conventions, but eventually consumers will come across our data sets in isolation and may find them quite confusing. Other similar projects using the same technology may make quite different choices.

The Pelagios idea of using OAC as its structure was a good one, and has let the project proceed fast and efficiently. Whether it can, or should, be maintained as the ancient world semantic web builds up is debatable.
