Tuesday, 1 May 2012

Ure Museum data available


Back in December, I blogged about converting the Ure Museum data into Pelagios-compliant format. I'm pleased to say that this data is now available here as an N3 file released under a CC-BY licence. Thank you to the curator, Amy Smith, for all her help with this!

The conversion process

I thought it might be useful for other people undertaking a conversion of data into a Pelagios-compliant format to outline the rough steps involved.

The overall goal was to find any matches between places mentioned in the museum database and places mentioned in the Pleiades gazetteer and then put these into an RDF file in the format required by Pelagios. That format is not yet documented in a formal way but there are lots of examples and various posts on this blog about the topic.

Here are the steps which I went through:

1) Obtain an XML file of the museum data and identify fields that could contain place data.

This stage was essentially getting a feel for the data available and possible issues. My original blog post discussed these in more detail.

2) Write a script to extract useful information from the XML file, identify candidate places and put the information into a spreadsheet.

I assumed that any capitalised word in either the 'Fabric' or 'Provenance' field could be a place. It is possible that we missed a few places that had not been capitalised in the database or that were mentioned in other fields, but based on looking through the data in the previous step, I was fairly confident that we were unlikely to lose more than a handful this way. The role of the spreadsheet was purely to make it easier to check the data manually.
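
To give a flavour of this step, here is a minimal sketch of the extraction script. The 'Fabric' and 'Provenance' fields are real, but the 'record' element, the 'Accession_Number' tag and the filenames are assumptions for illustration; the real export's schema may differ.

    import csv
    import re
    import xml.etree.ElementTree as ET

    tree = ET.parse("ure_museum.xml")  # hypothetical filename

    with open("candidates.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["accession", "field", "candidate"])
        for record in tree.getroot().iter("record"):  # assumed element name
            accession = record.findtext("Accession_Number", default="")
            for field in ("Fabric", "Provenance"):
                text = record.findtext(field, default="")
                # Treat any capitalised word as a candidate place name.
                for word in re.findall(r"\b[A-Z][a-z]+\b", text):
                    writer.writerow([accession, field, word])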

3) Check all the possible places against the Pleiades+ file for matches and add these to the spreadsheet. Create a list of missed possible matches and false matches by manually checking the spreadsheet.

Looking through the results of this matching process, it was clear that various places mentioned in the data had not been caught and that there were also some incorrect matches.
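
The matching itself can be done with a simple lookup table built from Pleiades+. The sketch below assumes Pleiades+ is a two-column CSV of Pleiades URI and toponym; the real file has a different layout, so treat this purely as an illustration of the approach.

    import csv
    from collections import defaultdict

    # Build a toponym -> set-of-URIs index from Pleiades+.
    index = defaultdict(set)
    with open("pleiades_plus.csv") as f:
        for uri, toponym in csv.reader(f):
            index[toponym.lower()].add(uri)

    # Annotate the candidates spreadsheet with any matching URIs.
    with open("candidates.csv") as f, open("matched.csv", "w", newline="") as out:
        reader, writer = csv.reader(f), csv.writer(out)
        writer.writerow(next(reader) + ["pleiades_uris"])
        for accession, field, candidate in reader:
            uris = index.get(candidate.lower(), set())
            writer.writerow([accession, field, candidate, " ".join(sorted(uris))])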

4) Add in special cases to the script, with expert help. 

This is where Amy and I painstakingly went through the list from the previous step to improve the data. The special cases fell into the following main categories (see the sketch after this list):
  • Adjectival references to places, e.g. Spartan rather than Sparta
  • Toponyms not in Pleiades+ - many of these were in the Barrington Atlas Notes but not in Pleiades+, or were alternative transliterations, but there were also others (e.g. Zoan is a toponym for Tanis).
  • Places that did not exist at all in Pleiades, such as some Neolithic sites - for these we either included just the region or omitted them completely
  • A few typos - I was still working with the original XML file, so even with corrections to the main database I needed to handle these in the script, although I could have corrected the XML file instead.
  • False matches based on names of museums or temples, e.g. we didn't want 'Temple of Artemis Orthia' to match Artemis or 'Acropolis Museum, Athens' to automatically match Athens. I also learned that East Greece is not in fact the eastern part of Greece!
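
In the script, these special cases boiled down to a few lookup tables applied before the Pleiades+ lookup. The sketch below is illustrative only: the table entries, the placeholder URI and the idea of keying the blacklist on whole phrases from the source field are assumptions, not the actual lists we used.

    # Adjectival forms mapped back to the underlying toponym.
    ADJECTIVES = {"Spartan": "Sparta", "Athenian": "Athens"}

    # Toponyms missing from Pleiades+, mapped straight to a Pleiades URI
    # (the URI below is a placeholder, not the real identifier).
    EXTRA_TOPONYMS = {"Zoan": "http://pleiades.stoa.org/places/EXAMPLE"}

    # Phrases whose embedded names should never count as a match.
    BLACKLIST = ("Temple of Artemis Orthia", "Acropolis Museum", "East Greece")

    def normalise(candidate, source_text):
        """Apply the special-case tables before looking up Pleiades+."""
        if any(phrase in source_text for phrase in BLACKLIST):
            return None  # skip: the name appears inside a blacklisted phrase
        candidate = ADJECTIVES.get(candidate, candidate)
        return EXTRA_TOPONYMS.get(candidate, candidate)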

5) Add in special cases to disambiguate Pleiades places with identical names, with expert help.

As well as missed and false matches, there were over 40 different place names with multiple matches, i.e. where there was more than one place in Pleiades with the same name. For example, there are a lot of cities called Alexandria besides the famous one. Some of these were easier than others but, again, this was a rather painstaking process: with Amy's help I went through the Pleiades gazetteer, identifying which place was the correct match for the item in question. I also had to decide, for example, when a city had the same name as the island where it was located, whether to match the city or the island, and in other cases whether to match a geographical or an administrative region.
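
In code terms, the disambiguation ended up as another hand-maintained table, pinning each ambiguous name to the Pleiades URI we had chosen. The example below is a sketch: the Alexandria entry is shown for illustration only and the table contents stand in for the real list.

    # Hand-picked URIs for names matching more than one Pleiades place.
    DISAMBIGUATED = {
        "Alexandria": "http://pleiades.stoa.org/places/727070",  # the Egyptian city
    }

    def choose_uri(name, uris):
        if len(uris) == 1:
            return next(iter(uris))  # unambiguous: use the only match
        return DISAMBIGUATED.get(name)  # None means: flag for manual review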

6) Write the final data as RDF.

This last stage was in fact one of the easiest, despite my initial fears about the lack of documentation, especially as there are now lots of examples of Pelagios-compliant data files, and other project members, especially Rainer Simon, were very helpful in checking the format of my data. One interesting question was what dcterms:title to give each annotation; different partners seem to have taken different approaches here. I decided to include a title for the object based on the Accession Number, Shape and Period, rather than treating this as a title of the annotation itself, partly because I knew this would feed into the information I was using in the widgets, but it is certainly possible to argue that this is technically incorrect.
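
For anyone curious what the serialisation looked like in outline, here is a condensed sketch using rdflib, following the OAC-based annotation pattern visible in other Pelagios partner files at the time. The base annotation URI, object URI and title values are placeholders, not the real ones.

    from rdflib import Graph, Namespace, URIRef, Literal
    from rdflib.namespace import RDF

    OAC = Namespace("http://www.openannotation.org/ns/")
    DCTERMS = Namespace("http://purl.org/dc/terms/")

    g = Graph()
    g.bind("oac", OAC)
    g.bind("dcterms", DCTERMS)

    def add_annotation(n, object_uri, pleiades_uri, title):
        ann = URIRef("http://example.org/ure/annotations/%d" % n)  # placeholder base
        g.add((ann, RDF.type, OAC.Annotation))
        g.add((ann, OAC.hasBody, URIRef(pleiades_uri)))   # the Pleiades place
        g.add((ann, OAC.hasTarget, URIRef(object_uri)))   # the museum object
        # Title built from Accession Number, Shape and Period, as described above.
        g.add((ann, DCTERMS.title, Literal(title)))

    add_annotation(1, "http://example.org/ure/object/1",  # placeholder object URI
                   "http://pleiades.stoa.org/places/727070",
                   "Accession 00.0.0; Amphora; Archaic")  # hypothetical title
    g.serialize(destination="ure.n3", format="n3")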

I also created a VoID file for the data. We discussed splitting the data into various subdatasets but for the time being, since the dataset is relatively small by Pelagios standards (just over 2000 annotations), I kept it as a single dataset for the sake of simplicity.  
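
A minimal VoID description for a single dataset like this can also be generated with rdflib; the sketch below uses placeholder URIs and literals rather than the real ones from my file.

    from rdflib import Graph, Namespace, URIRef, Literal
    from rdflib.namespace import RDF

    VOID = Namespace("http://rdfs.org/ns/void#")
    DCTERMS = Namespace("http://purl.org/dc/terms/")

    g = Graph()
    ds = URIRef("http://example.org/ure/void#dataset")  # placeholder dataset URI
    g.add((ds, RDF.type, VOID.Dataset))
    g.add((ds, DCTERMS.title, Literal("Ure Museum Pelagios annotations")))
    g.add((ds, DCTERMS.license, URIRef("http://creativecommons.org/licenses/by/3.0/")))  # CC-BY; version is a placeholder
    g.add((ds, VOID.dataDump, URIRef("http://example.org/ure/ure.n3")))  # placeholder dump URL
    g.serialize(destination="ure_void.ttl", format="turtle")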

The future

There is an interesting question of what to do about updates to the collection database. Because of the number of special cases, I would be a bit nervous about simply re-running the script without checking things through again. We may have to play it by ear here!

For me the interesting questions that have come up are whether we can expand Pleiades+ in some useful way based on the types of special cases that we came across, and whether there is any way that the process of disambiguating places in Pleiades could have been made simpler. I think these are both tasks that are likely to be encountered by other institutions trying to convert their data into Pelagios-compliant format, especially those whose place data is in free-text fields, so it would be great if we could do something to make these processes easier for them.
