Tuesday, 13 December 2011

Converting the Ure Museum data

The Ure Museum of Classical Archaeology in Reading is one of the most recent Pelagios partners. I have just started work on converting the collection data into a Pelagios-compliant format with the help of the curator, Amy Smith.

The main task is to work out, for each item in the collection, whether any places in Pleiades are associated with it. Once we have done this, it should hopefully be straightforward to turn this data into OAC annotations for Pelagios.

You can browse the Ure Museum database online here. There are about 3000 objects in the collection. Any information about places associated with an object is generally under either the 'Fabric' or 'Provenance' listed for the object. The fabric is usually an adjective describing where the item was thought to have been made e.g. Boeotian, Etruscan, Daunian. The provenance is generally less structured. Here are some examples of the contents of this field for a selection of different objects:
  • Probably made in Cyprus (Stubbings)
  • Found on Mount Helicon with an arrowhead, 26.7.13
  • Northern Boeotia (?), provenience unknown
  • From a burial somewhere in the Argolid.
  • Thought to be from Cyprus: T.146.II. From Poli? Cf. JHS 1890.
  • Unknown, similar to Larnaca, Kamelarga finds
  • From Carthage (or other North African site)
  • Central Italian, possibly from the vicinity of Rome.
  • Cast from an original in the Acropolis Museum, Athens
  • Said by vendor to have come from between Thebes and Chalcis
I have been given all the data as an XML dump and want to write a script to match any places in Pleiades with this information from the 'Fabric' and 'Provenance' fields in the data. I also have a copy of Pleiades+ which provides toponyms from GeoNames for the places in Pleiades. You can read more about Pleiades+ here and here.

The rough approach I have taken is to go through each item in the collection and then for each item, go through all the places in Pleiades+ and see if any match with anything in the Fabric or Provenance field.
I am hopefully not far off getting all the special cases sorted out and should have this completed in the early new year.
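As a rough illustration of that nested loop, here is a minimal Python sketch. The record layout, the field names and the placeholder identifiers are all assumptions made up for the example; the real XML dump and the Pleiades+ file are structured differently, and the plain substring test is deliberately naive.

```python
# Hypothetical, simplified records; the real XML dump is not shown in this post.
items = [
    {"id": "obj-1", "fabric": "Boeotian",
     "provenance": "Found on Mount Helicon with an arrowhead, 26.7.13"},
    {"id": "obj-2", "fabric": "",
     "provenance": "From Carthage (or other North African site)"},
]

# Toponym -> placeholder identifier (NOT real Pleiades URIs).
pleiades_plus = {"Helicon": "pleiades-id-A", "Carthage": "pleiades-id-B"}


def match_places(item, gazetteer):
    """Return the identifiers of every toponym found in the Fabric or Provenance text."""
    text = " ".join([item.get("fabric", ""), item.get("provenance", "")])
    # Naive substring test; real matching needs stricter rules.
    return sorted({pid for name, pid in gazetteer.items() if name in text})


# For each item, go through all the places and collect any matches.
matches = {item["id"]: match_places(item, pleiades_plus) for item in items}
```

With only 3000 or so objects, a brute-force pass like this is cheap enough that no indexing is needed.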

Here are a few of the challenges and issues that I have encountered so far.

1) Uncertainty in the data

This was one of the first things that concerned me. As you can see from the examples above, there is often a large degree of uncertainty about where items are from. In addition, items may not have been found in a spot that even has a name, and may have more than one place associated with them if they have moved locations. However, as I was reminded by Leif, these are annotations we are providing. So you can provide multiple annotations for an object; it's perfectly fine to annotate an object with any location that it is remotely associated with, and an annotation does not indicate an object's definite origin.

2) Location-based adjectives

Most of the fabric information is given as adjectives rather than as place names, e.g. as Corinthian rather than Corinth. Even in the Provenance data there are still lots of adjectives. Adjectives associated with a place are outside the scope of Pleiades+, so I have compiled a list of how adjectives map to places with the help of Amy. It is relatively limited because it's restricted to adjectives used in the Ure Museum database as it stands, but would it be useful for us to share this list and allow other people to add to it in some way?

I should point out that there are still some question marks even with this approach e.g. would you want a reference to Roman Britain to map to Rome? However, I suspect the number of controversial mappings is going to be small.
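To make the idea concrete, here is a small Python sketch of an adjective lookup. The three entries and the blocked phrase are illustrative assumptions, not the list compiled with Amy, and whether 'Roman Britain' should be blocked at all is exactly the open question above.

```python
import re

# Illustrative entries only; the real list is longer and curated by hand.
ADJECTIVE_TO_PLACE = {
    "Corinthian": "Corinth",
    "Boeotian": "Boeotia",
    "Roman": "Rome",
}

# Phrases in which the adjective should NOT be mapped (an assumption for this sketch).
BLOCKED_PHRASES = ["Roman Britain"]


def places_from_adjectives(text):
    """Return the place names implied by any known adjectives in the text."""
    for phrase in BLOCKED_PHRASES:
        text = text.replace(phrase, " ")
    words = re.findall(r"[A-Za-z]+", text)
    return [ADJECTIVE_TO_PLACE[w] for w in words if w in ADJECTIVE_TO_PLACE]
```

Keeping the mapping in a plain table like this, rather than buried in the script's logic, would also make it easier to share and extend.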

3) Disambiguating places

Sometimes there are multiple places with the same name. For example, there is more than one place called Salamis. How do we make sure that if we know we want the Salamis in Cyprus then it matches to http://pleiades.stoa.org/places/707617/ rather than say this Salamis? Again this is where the 'connections with' information in Pleiades would help in theory. However in practice, it looks likely that we are going to have to deal with these ambiguities as special cases in the script.
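One way such a special case could look is sketched below in Python. The Cyprus URI is the one quoted above; the second identifier is a placeholder because I am not reproducing the other Salamis here, and the cue-word approach is an assumption for illustration rather than what the script necessarily does.

```python
SALAMIS_CYPRUS = "http://pleiades.stoa.org/places/707617/"  # URI quoted in this post
SALAMIS_OTHER = "PLACEHOLDER-other-Salamis"  # placeholder, not a real identifier

# For each ambiguous name: candidate places and the cue words that select them.
CANDIDATES = {
    "Salamis": [
        (SALAMIS_CYPRUS, {"Cyprus"}),
        (SALAMIS_OTHER, set()),
    ],
}


def disambiguate(name, text):
    """Pick the candidates whose cue words appear in the text.

    If no cue word is present, return every candidate so the ambiguity
    stays visible and can be checked by hand.
    """
    candidates = CANDIDATES.get(name, [])
    hits = [uri for uri, cues in candidates if any(cue in text for cue in cues)]
    return hits if hits else [uri for uri, _ in candidates]
```

Returning all candidates when nothing decides between them is a deliberate choice here: it fits the view that an annotation does not have to indicate a definite origin.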

4) Granularity of annotations

If we have an object from Salamis in Cyprus, do we annotate it with both Salamis and with Cyprus or just with the more precise location, Salamis? You wouldn't necessarily expect every item from Rome to also be annotated with Italy so using the more precise location feels sensible. On the other hand it may not do any harm to annotate with both and if we do have two places associated with an object, how do we tell that one is contained within another? Pleiades has information about which places 'connect with' other places and according to Sean Gillies of Pleiades, 'you'd almost never go wrong in Pleiades by inferring containment between a precisely located place of small extent and a much more extensive place' if you used this data. However, there is a great deal of connection information missing from Pleiades, so in practice this approach is unlikely to work well.
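If we did decide to keep only the more precise location, one fallback would be a small hand-maintained containment table, sketched here in Python. The table is an assumption for the example; it does not come from the Pleiades 'connects with' data.

```python
# Hand-maintained containment pairs (hypothetical; not derived from Pleiades).
CONTAINED_IN = {"Salamis": "Cyprus", "Rome": "Italy"}


def most_precise(places):
    """Drop any place that contains another place already in the list."""
    containers = {CONTAINED_IN[p] for p in places if p in CONTAINED_IN}
    return [p for p in places if p not in containers]
```

For example, an object matched to both Salamis and Cyprus would keep only Salamis, while an object matched to Cyprus alone would keep Cyprus.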

5) Pleiades locations enclosed in text not related to the location

If you just go through the Pleiades+ data and search for each place in turn in the text associated with the object, you get lots of false hits, partly because there is some slightly odd data in Pleiades such as http://pleiades.stoa.org/places/324652/. Most of these you can rule out by assuming that place names will be capitalised in the Ure Museum data and by insisting on whole word matches. However, there are still occasional problems. For example, the Pleiades place Artemis matches 'Sanctuary of Artemis Orthia, Sparta', and you may also want to rule out locations of museums mentioned. I have been writing special cases in my script for these. I can do this because the collection isn't too large, but I can see that with a larger collection you could easily miss some instances like these. I have wondered if the GeoParser used for GAP might help with dealing with this type of unstructured data.
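The capitalised, whole-word test plus a couple of exclusions can be sketched in Python as follows. The two exclusion patterns are assumptions chosen to cover the examples mentioned in this post; they are not the actual special cases in my script.

```python
import re

# Text that should be ignored before matching (illustrative patterns only).
EXCLUDE_PATTERNS = [
    r"Sanctuary of Artemis",  # a deity's name, not the place Artemis
    r"\bMuseum, \w+",         # the city in which a museum is located
]


def find_place(name, text):
    """Case-sensitive, whole-word search for a place name in the text."""
    cleaned = text
    for pattern in EXCLUDE_PATTERNS:
        cleaned = re.sub(pattern, " ", cleaned)
    # \b on both sides enforces a whole-word match; no IGNORECASE flag,
    # so the capitalisation has to agree as well.
    return re.search(r"\b" + re.escape(name) + r"\b", cleaned) is not None
```

So 'Sparta' is still found in 'Sanctuary of Artemis Orthia, Sparta' but 'Artemis' is not, and 'Athens' is skipped in 'Cast from an original in the Acropolis Museum, Athens'.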

6) Alternative toponyms not in Pleiades+

Pleiades+ doesn't claim to be comprehensive and, again with Amy's help, I have come across a fair number of alternative toponyms that are not in Pleiades+. I am also writing these into my script as special cases. Some of these are from the Barrington Atlas Notes in Pleiades but there are others as well. As with the adjectives, I'm wondering if there is some way of sensibly sharing the alternative toponyms that we have found, so as to prevent other people having to duplicate our work.

7) Vague geographical data

There are quite a few provenance entries which include locations like 'South Italy' or 'Greek Islands'. There is no way of specifying these that I have found in terms of Pleiades locations, so I have had to resort to annotating them just with 'Italy' or 'Greece', losing some of the information. Objects are also often described as being found in modern countries or places that don't always have a clear equivalent in Pleiades.

8) The historical scope of Pelagios

The Ure Museum contains objects from a wide range of periods. Pleiades focuses on the Greek and Roman world and Pleiades in a sense defines the scope of Pelagios. However, should I still annotate a Neolithic object for example with the larger region from which it comes even if the precise location is not in Pleiades?

9) Spelling mistakes in the data

There aren't too many of these, but I have also had to include some special cases for spelling mistakes (as well as for alternate transliterations of place names). Obviously the ideal solution is to get the spelling mistakes fixed in the database itself and then get a new download of the data, but I thought I should highlight this as a potential issue. If the data has previously only been read by humans, who unlike a computer can easily understand what is intended, it is easy for these typos to slip through.
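Until the database itself is corrected, a correction table applied before matching is one option. The Python sketch below uses two made-up entries purely for illustration; they are not actual errors from the Ure Museum data.

```python
import re

# Made-up examples of a typo and an alternate transliteration.
CORRECTIONS = {"Boetia": "Boeotia", "Korinth": "Corinth"}

# One pattern matching any known variant as a whole word.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, CORRECTIONS)) + r")\b")


def normalise(text):
    """Replace known misspellings and variants with the preferred spelling."""
    return _PATTERN.sub(lambda m: CORRECTIONS[m.group(1)], text)
```

Because the table is separate from the matching logic, it doubles as a list of fixes to send back to the database maintainers.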

10) Dealing with updates to the data

It is likely that more data will be added to the Ure Museum database as time goes on. It would be possible to rerun my script, but there are enough special cases that it would be hard to guarantee that any new results would be comprehensive and accurate.

Next stages

Overall this is proving a really interesting exercise and good introduction to the world of Pelagios.
Once I have finished the special cases, the next stage will be to turn the data into OAC annotations and arrange where the data is going to be hosted. In the meantime, I'm off for the next few weeks seeing what my one-year-old makes of Christmas!
