Tuesday 13 December 2011

Converting the Ure Museum data

The Ure Museum of Classical Archaeology in Reading is one of the most recent Pelagios partners. I have just started work on converting the collection data into a Pelagios-compliant format with the help of the curator, Amy Smith.

The main task is to work out, for each item in the collection, whether any places in Pleiades are associated with it. Once we have done this, it should hopefully be straightforward to turn this data into OAC annotations for Pelagios.

You can browse the Ure Museum database online here. There are about 3000 objects in the collection. Any information about places associated with an object is generally under either the 'Fabric' or 'Provenance' listed for the object. The fabric is usually an adjective describing where the item was thought to have been made e.g. Boeotian, Etruscan, Daunian. The provenance is generally less structured. Here are some examples of the contents of this field for a selection of different objects:
  • Probably made in Cyprus (Stubbings)
  • Found on Mount Helicon with an arrowhead, 26.7.13
  • Northern Boeotia (?), provenience unknown
  • From a burial somewhere in the Argolid.
  • Thought to be from Cyprus: T.146.II. From Poli? Cf. JHS 1890.
  • Unknown, similar to Larnaca, Kamelarga finds
  • From Carthage (or other North African site)
  • Central Italian, possibly from the vicinity of Rome.
  • Cast from an original in the Acropolis Museum, Athens
  • Said by vendor to have come from between Thebes and Chalcis
I have been given all the data as an XML dump and want to write a script to match any places in Pleiades with this information from the 'Fabric' and 'Provenance' fields in the data. I also have a copy of Pleiades+ which provides toponyms from GeoNames for the places in Pleiades. You can read more about Pleiades+ here and here.

The rough approach I have taken is to go through each item in the collection and then for each item, go through all the places in Pleiades+ and see if any match with anything in the Fabric or Provenance field.
I am hopefully not far off getting all the special cases sorted out and should have this completed in the early new year.
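
To make the rough approach concrete, here is a minimal sketch of the nested loop. It is not the actual script: the XML element names (object, fabric, provenance), the object ids and the place identifiers are all assumptions made up for illustration, and the real dump and Pleiades+ files will look different.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout; the real Ure Museum dump may use different element names.
SAMPLE_XML = """
<collection>
  <object id="obj-1">
    <fabric>Boeotian</fabric>
    <provenance>Said by vendor to have come from between Thebes and Chalcis</provenance>
  </object>
  <object id="obj-2">
    <fabric></fabric>
    <provenance>Probably made in Cyprus (Stubbings)</provenance>
  </object>
</collection>
"""

# Hypothetical slice of Pleiades+: toponym -> placeholder identifier
# (these are NOT real Pleiades URIs).
TOPONYMS = {
    "Thebes": "place:thebes",
    "Chalcis": "place:chalcis",
    "Cyprus": "place:cyprus",
}


def match_collection(xml_text, toponyms):
    """Return {object id: sorted list of matched place identifiers}."""
    root = ET.fromstring(xml_text)
    matches = {}
    for obj in root.iter("object"):
        # Combine the two fields that usually carry place information.
        text = " ".join(
            (obj.findtext(field) or "") for field in ("fabric", "provenance")
        )
        # Naive substring test: every toponym is tried against every object.
        found = {place for name, place in toponyms.items() if name in text}
        matches[obj.get("id")] = sorted(found)
    return matches


results = match_collection(SAMPLE_XML, TOPONYMS)
for object_id, places in results.items():
    print(object_id, places)
```

Note that 'Boeotian' in the first object is not picked up at all here, and the plain substring test is far too generous on real data. Both problems come up in the list of issues below.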

Here are a few of the challenges and issues that I have encountered so far.

1) Uncertainty in the data

This was one of the first things that concerned me. As you can see from the examples above there is often a large degree of uncertainty about where items are from. In addition, items may not have been found in a spot that even has a name, and may have more than one place associated with them if they have moved locations. However, as I was reminded by Leif, these are annotations we are providing. You can provide multiple annotations for an object, it's perfectly fine to annotate an object with any location that it is remotely associated with, and an annotation does not indicate an object's definite origin.

2) Location-based adjectives

Most of the fabric information is given as adjectives rather than as place names e.g. as Corinthian rather than Corinth. Even in the Provenance data there are still lots of adjectives. Adjectives associated with a place are outside the scope of Pleiades+, so I have compiled a list of how adjectives map to places with the help of Amy. It is relatively limited because it's restricted to adjectives used in the Ure Museum database as it stands, but would it be useful for us to share this list and allow other people to add to it in some way?

I should point out that there are still some question marks even with this approach e.g. would you want a reference to Roman Britain to map to Rome? However, I suspect the number of controversial mappings is going to be small.
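
As a rough illustration of what such a list looks like in a script, here is a small sketch of an adjective lookup. The four entries are examples I have chosen for illustration, not the list compiled with Amy, and the function name is hypothetical.

```python
import re

# Illustrative entries only; the real list is larger and was compiled with the curator.
ADJECTIVE_TO_PLACE = {
    "Boeotian": "Boeotia",
    "Corinthian": "Corinth",
    "Etruscan": "Etruria",
    "Daunian": "Daunia",
}


def places_from_adjectives(text, mapping=ADJECTIVE_TO_PLACE):
    """Return the place names for every known adjective found in the text."""
    words = re.findall(r"[A-Za-z]+", text)
    return [mapping[word] for word in words if word in mapping]


print(places_from_adjectives("Corinthian"))
print(places_from_adjectives("Northern Boeotia (?), provenience unknown"))
```

The place names returned would then be looked up in Pleiades+ in the same way as any other toponym. Keeping the mapping as plain data rather than code would also make it easy to share.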

3) Disambiguating places

Sometimes there are multiple places with the same name. For example, there is more than one place called Salamis. How do we make sure that if we know we want the Salamis in Cyprus then it matches to http://pleiades.stoa.org/places/707617/ rather than say this Salamis? Again this is where the 'connections with' information in Pleiades would help in theory. However in practice, it looks likely that we are going to have to deal with these ambiguities as special cases in the script.
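
One way such a special case might look is sketched below. The Cyprus URI is the one quoted above. The second entry is a placeholder standing in for the other Salamis, and the hint words are illustrative guesses rather than anything taken from Pleiades.

```python
SALAMIS_CYPRUS = "http://pleiades.stoa.org/places/707617/"  # URI quoted in the post
SALAMIS_OTHER = "placeholder:other-salamis"  # hypothetical stand-in, not a real URI

# Ambiguous toponym -> candidate places, each with context words that suggest it.
# The hint words are assumptions chosen for illustration.
CANDIDATES = {
    "Salamis": [
        {"uri": SALAMIS_CYPRUS, "hints": ("Cyprus", "Cypriot")},
        {"uri": SALAMIS_OTHER, "hints": ("Attica",)},
    ],
}


def resolve(name, text, candidates=CANDIDATES):
    """Return a URI if exactly one candidate has a hint in the text, else None."""
    supported = [
        candidate["uri"]
        for candidate in candidates.get(name, [])
        if any(hint in text for hint in candidate["hints"])
    ]
    return supported[0] if len(supported) == 1 else None


print(resolve("Salamis", "Thought to be from Salamis, Cyprus"))
print(resolve("Salamis", "From Salamis"))
```

Returning None when the context does not settle the question means those objects can be listed for a human to check rather than silently given the wrong place.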

4) Granularity of annotations

If we have an object from Salamis in Cyprus, do we annotate it with both Salamis and with Cyprus or just with the more precise location, Salamis? You wouldn't necessarily expect every item from Rome to also be annotated with Italy so using the more precise location feels sensible. On the other hand it may not do any harm to annotate with both and if we do have two places associated with an object, how do we tell that one is contained within another? Pleiades has information about which places 'connect with' other places and according to Sean Gillies of Pleiades, 'you'd almost never go wrong in Pleiades by inferring containment between a precisely located place of small extent and a much more extensive place' if you used this data. However, there is a great deal of connection information missing from Pleiades, so in practice this approach is unlikely to work well.

5) Pleiades locations enclosed in text not related to the location

If you just go through the Pleiades+ data and search for each place in turn in the text associated with the object, you get lots of false hits, partly because there is some slightly odd data in Pleiades such as http://pleiades.stoa.org/places/324652/. Most of these you can rule out by assuming that place names will be capitalised in the Ure Museum data and by insisting on whole word matches. However there are still occasional problems. For example the Pleiades place Artemis matches 'Sanctuary of Artemis Orthia, Sparta' and you may also want to rule out locations of museums mentioned. I have been writing special cases in my script for these. I can do this because the collection isn't too large, but I can see that with a larger collection you could easily miss some instances like these. I have wondered if the GeoParser used for GAP might help with dealing with this type of unstructured data.
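
The two rules mentioned above (capitalised names only, whole word matches only) plus a list of phrases to ignore can be sketched like this. The function and parameter names are hypothetical, the toponym list is a made-up sample, and the blocked phrases stand in for the special cases in my script.

```python
import re


def find_toponyms(text, toponyms, blocked_phrases=()):
    """Return the toponyms that occur in the text as capitalised whole words."""
    # Blank out phrases known to produce false hits before matching.
    for phrase in blocked_phrases:
        text = text.replace(phrase, " ")
    hits = []
    for name in toponyms:
        # Skip toponyms that do not start with a capital letter.
        if not name[:1].isupper():
            continue
        # \b gives whole-word matching; the search is case-sensitive.
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            hits.append(name)
    return hits


sample = "Sanctuary of Artemis Orthia, Sparta"
names = ["Artemis", "Sparta", "Art", "of"]  # made-up sample of toponyms

print(find_toponyms(sample, names))
print(find_toponyms(sample, names, blocked_phrases=("Sanctuary of Artemis Orthia",)))
```

The first call still reports Artemis, which is exactly the kind of false hit described above. The second call suppresses it, but only because someone noticed the problem and added the phrase by hand.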

6) Alternative toponyms not in Pleiades+

Pleiades+ doesn't claim to be comprehensive and, again with Amy's help, I have come across a fair number of alternative toponyms that are not in Pleiades+. I have been writing these into my script as special cases too. Some of these are from the Barrington Atlas Notes in Pleiades but there are others as well. As with the adjectives, I'm wondering if there is some way of sensibly sharing alternative toponyms that we have found so as to prevent other people having to duplicate our work.

7) Vague geographical data

There are quite a few provenance entries which include locations like 'South Italy' or 'Greek Islands'. There is no way of specifying these that I have found in terms of Pleiades locations, so I have had to resort to annotating them just with 'Italy' or 'Greece', losing some of the information. Objects are also often described as being found in modern countries or places that don't always have a clear equivalent in Pleiades.

8) The historical scope of Pelagios

The Ure Museum contains objects from a wide range of periods. Pleiades focuses on the Greek and Roman world and Pleiades in a sense defines the scope of Pelagios. However, should I still annotate a Neolithic object for example with the larger region from which it comes even if the precise location is not in Pleiades?

9) Spelling mistakes in the data

There aren't too many of these, but I have also had to include some special cases for spelling mistakes (as well as for alternate transliterations of place names). Obviously the ideal solution is to get the spelling mistakes fixed in the database itself and then get a new download of the data, but I thought I should highlight this as a potential issue. If the data has previously only been read by humans, who unlike a computer can easily understand what is intended, it is easy for these typos to slip through.
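
My script handles these with hand-written special cases. As an alternative for spotting likely typos in the first place, something along the following lines could be tried. It uses difflib from the Python standard library; the word list, the misspelling and the cutoff value are all made up for illustration.

```python
import difflib

# Made-up sample of known names; a real list would come from Pleiades+
# and the adjective list.
KNOWN_NAMES = ["Boeotian", "Corinthian", "Etruscan", "Cyprus"]


def suggest_correction(word, known=KNOWN_NAMES, cutoff=0.8):
    """Return the closest known name for a word, or None if nothing is close enough."""
    if word in known:
        return None  # already spelled in a known way, nothing to suggest
    close = difflib.get_close_matches(word, known, n=1, cutoff=cutoff)
    return close[0] if close else None


print(suggest_correction("Boetian"))
print(suggest_correction("Unknown"))
```

Suggestions like these would still need checking by a person before being fed back into the database, since a close match can also be a different, genuine place name.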

10) Dealing with updates to the data

It is likely that more data is going to be added to the Ure Museum database as time goes on. It would of course be possible to rerun my script, but there are enough special cases that it would be hard to guarantee that any new results would be comprehensive and accurate.

Next stages

Overall this is proving a really interesting exercise and good introduction to the world of Pelagios.
Once I have finished the special cases, the next stage will be to turn the data into OAC annotations and arrange where the data is going to be hosted. In the meantime, I'm off for the next few weeks seeing what my one-year-old makes of Christmas!

Friday 9 December 2011

Pelagios Phase 2: Project Plan

Phase two of Pelagios looks to build on our lightweight framework, based on the concept of place (a Pleiades URI) and a standard ontology (Open Annotation Collaboration), by publishing the Pelagios Toolkit: a set of services and documentation that will assist people in annotating, discovering and visualizing references to places in open online ancient world resources.

In all, there are four Work-Packages:
§ WP1 casts the net beyond the existing partners in order to allow anyone to publish their data in a way that maximizes its discoverability. This webcrawling and indexing service will find material and, based on the Pelagios framework and semantic sitemaps, aggregate place metadata in order to create value for the holders of that data.
§ WP2 aims to explore further ways of exploiting the concept of place. The place/space-based APIs and contextualisation service will help other users and data-providers discover relevant data and do interesting things with them.
§ WP3 tackles end-user engagement: i.e. subject specialists who lack the technical coding expertise to use the data underlying what is seen on the screen. The visualization service will explore ways of allowing these users to get to grips with the data both in a single Pelagios interface and as embedded widgets hosted on each partner’s site.
§ WP4 distils the guidelines into a cookbook providing explicit recipes for producing, finding and making use of geoannotations for the community as a whole. In short, you won’t need to be a Pelagios partner to be able to join in making your data discoverable and usable.

The evolving nature of the Pelagios collective reflects the shift towards community engagement. While partners from the original Pelagios proof-of-concept project will continue to be involved, the main work for phase two of Pelagios will be carried out by: Arachne, CLAROS, DME, Fasti-online, GAP, IET (the Open University), Nomisma, Southampton, SPQR, the Ure Museum.

The outcomes, in more detail, are as follows:

D 1.1: Web Crawling and Indexing Prototype. This infrastructure component traverses resource sets on the Web (registered manually or discovered using semantic search engines like Sindice) and catalogues their place metadata. Place metadata encompasses geographical coordinates as well as Pleiades and Geonames URIs.
D 1.2: Pelagios 2 Graph API. This deliverable is an HTTP API that allows querying of the aggregate data graph generated by the Indexing Prototype. The API will provide responses in JSON and RDF format; and possibly in additional formats (e.g. KML or GeoRSS) if the need is identified in WP3. The initial range of possible queries is based on the outcome of the Pelagios project. The exact scope and structure of the final API will be driven by the requirements identified in WP3.
D 1.3: API Statistics and Reporting Interface. This deliverable will extend the Pelagios 2 Indexing Prototype with means to extract statistics and reports on the use of the API. Data partners can use this interface to gain insight into how their data is being discovered, queried and re-used within the larger online community.

D 2.1: Place-based API. This deliverable will extend the Pelagios 2 API with queries that return resources relevant to specific places or those with mereological (part-whole) relationships.
D 2.2: Space-based API. This deliverable will extend the Pelagios 2 API with queries that permit searches based on geographic scope, e.g. within a certain geographic buffer around a given location set.
D 2.3: Contextualisation Prototype. This deliverable is a service that provides ranked, relevant materials for a certain place or particular Named Entities. Results will be enriched with additional data from sources such as GeoNames, DBpedia and Freebase.

D 3.1: Evaluation of User Needs. This deliverable will report on the results of a formal evaluation of user needs regarding data visualization. The evaluation will be conducted in conjunction with project partners, and will inform the design of a set of online visualization widgets. This deliverable will have the form of a series of blog posts.
D 3.2: Widget Suite, Alpha version. This deliverable encompasses the first (alpha) version of the visualization widgets.
D 3.3: Evaluation of Widget Design. This deliverable will report on the results of observational and participatory design studies. The studies will be conducted on the Widgets as they are continuously and iteratively being developed from alpha state to final (beta) prototype. This deliverable will have the form of a series of blog posts.
D 3.4: Widget Suite, Beta version. This deliverable encompasses the final (beta) version of the visualization widgets.

D 4: Pelagios 2 Cookbook. Content Partners will produce regular documentation on data preparation, practices, tool use, etc. in the form of blog posts. The PI, assisted by the Co-Is, will distil this information into a “cookbook” which will make it easier for anyone with Ancient World content to publish their data online in conformance with the Pelagios 2 common open standards.

Saturday 3 December 2011

Welcome to Pelagios - Phase 2

Pelagios is a growing collective of ancient world projects who are linking together their data so that scholars and members of the public are able to discover all different kinds of stuff about ancient places.

Phase 1 has been the proof of concept. In this stage we have linked some core ancient world projects to each other through the concept of place (a Pleiades URI) and a baseline ontology (Open Annotation). The value of those linkages is demonstrated in the Pelagios Explorer, which allows users to discover and investigate the data from those different projects in a handy search interface.

The second phase of Pelagios is to formalize that process by which anyone can join or enjoy the fruits of the Pelagios superhighway. We will provide a ‘digital toolkit’ for anyone producing material about the ancient world (not just universities but also museums, libraries, etc.), so that their data will be more discoverable and usable. We will also be experimenting further with methods of visualizing that data so that subject specialist users and the general public can discover information about places that interest them, without having the technical expertise to do the digging themselves.

The Pelagios kick-off meeting in Greenwich: (back row) Andy Meadows (Nomisma), Sebastian Rahtz (CLAROS), Liz Fitzgerald (IET), Amy Smith (Ure Museum), Elton Barker (OU), Rainer Simon (DME), Alex Dutton (CLAROS); (front row) Leif Isaksen (Southampton), Simon Hohl & Rasmus Krempel (Arachne), Juliette Culver (IET)

The photo was taken by a plaque reading "Greenwich: still the centre of space and time"