Monday, 1 July 2013

A SQL version of the Pleiades dataset



For those of you who use the Pleiades data in your applications, I’ve created a .sql file for their newest data dump, which you can find here.  I used the Pleiades data dump from June 27, 2013.  What’s nice about the Pleiades people is that when they say a file is comma-separated, it really is.  I downloaded this file, extracted it into the promised .csv, and then imported it directly into Excel, where I formatted it into a set of .sql INSERT statements.  The columns seem to have changed somewhat from the previous versions; there are fewer name columns.  As a result I reformatted my Pleiades data table in SquinchPix’s database so that it now looks like this:

CREATE TABLE `PlacRefer` (
  `bbox` varchar(100) default NULL COMMENT 'bounding box string',
  `description` varchar(512) default NULL COMMENT 'Free-form text description',
  `id` varchar(100) default NULL COMMENT 'Pleiades ID number',
  `max_date` varchar(255) default NULL COMMENT 'last date',
  `min_date` varchar(100) default NULL COMMENT 'earliest date',
  `reportLat` float default NULL COMMENT 'Latitude',
  `reportLon` float default NULL COMMENT 'Longitude',
  `Era` varchar(24) default NULL COMMENT 'Characters indicating the era',
  `place_name` varchar(100) default NULL COMMENT 'Name string for the place'
);

I formatted the bounding box as a comma-separated varchar.  I do this because the bounding box requires special treatment; it might be missing altogether, or it might contain fewer than four points, so if you’re working with it, just get it into a string and split the string on commas.  Then you’ll have an array of items that you can treat as floats.  I finally got it through my thick skull that the description field can be parsed into keywords, so I’ll be using that more in the future.  The ‘id’ field is the regular Pleiades ID.  Is it my imagination or did the Pleiades people suddenly get a large dump of data from the Near East?  The number of items in the file is now 34,000+, which looks like a big increase.  The max_date and min_date fields give the terminus ante quem and terminus post quem, respectively, for any human settlement of the place in question.  The reportLat and reportLon fields haven’t changed.  The ‘Era’ field gives zero or more characters that indicate the period of existence of the site: ‘R’ for ‘Roman’, ‘H’ for ‘Hellenistic’, etc.  I included it because it might be handy for your chronological interpretation.  The ‘place_name’ field is the only name field in the current setup.
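Splitting the bbox string as described above only takes a few lines. Here is a minimal sketch in PHP (the function name is mine, not part of SquinchPix’s code); remember to check the length of the result, since the column may be NULL or hold fewer than four values:

```php
<?php
// Hypothetical helper: parse the comma-separated bbox varchar from PlacRefer.
// Returns an array of floats; an empty array means the bbox was missing.
function parse_bbox($bbox) {
    if ($bbox === null || trim($bbox) === '') {
        return array();                       // bounding box missing altogether
    }
    // Split on commas, then treat each piece as a float.
    return array_map('floatval', explode(',', $bbox));
}
```

If the returned array has fewer than four elements, fall back to the reportLat/reportLon pair instead of treating it as a full bounding box.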

If this table layout is satisfactory for you then you can get all the SQL to create and populate the table with the newest Pleiades data from Google Drive here. Be careful: this new .sql file deletes the PlacRefer table first.

I modified my Regnum Francorum Online parser to use this renewed table. The relevant code looks like this:

 $place_no = $l5[0];   // $l5[0] is the fragment of the input record
                       // that contains the Pleiades ID
 unset($lat);          // we test for unset later
 unset($lon);

 // id is a varchar column, so quote the value in the query
 $querygeo  = "select a.reportLat, a.reportLon from PlacRefer a where a.id = '$place_no'";
 $resultgeo = mysql_query($querygeo);

 if ($resultgeo && ($rowgeo = mysql_fetch_array($resultgeo))) {
     $lat = $rowgeo[0];
     $lon = $rowgeo[1];
 }

This is how you’ll probably use it most of the time – using the Pleiades ID to retrieve the lat/lon pair.  I was pleasantly surprised at how much the data has improved.   I redid all the Regnum Francorum Online records with the new data and it looks a lot better.  So congratulations to the Pleiades guys!  Although they should double check the exact location of Nördlingen.  Here's how the first 500 Regnum Francorum Online records look on a map.

First 500 Regnum Francorum Online records displayed on SquinchPix using new Pleiades data.
A big improvement over the previous version which you can see here.

If you want to do this yourself from the original Pleiades data dump then be sure to convert all double quote characters to &quot;, left single quotes to &lsquo;, and right single quotes to &rsquo;.  The data has elaborate description fields which have been formatted with lots of portions quoted in various ways by various workers.  Also, many place names in the Near East, and many French names, contain embedded single quotes that must be changed to &lsquo; or &rsquo; or the equivalent.  If you need a guide go here.

Get this right first: if you’re not absolutely sure that you’ve taken care of all the pesky quotes, the SQL import won’t run.
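As a sketch of that cleanup step, assuming the goal is to swap raw quote characters for HTML entities before building the INSERT statements (the function name and the choice to map plain apostrophes to &rsquo; are my own):

```php
<?php
// Hypothetical cleanup pass over a description or place-name field.
// Replaces raw quote characters with HTML entities so they cannot
// terminate the SQL string literal in the generated INSERT statement.
function normalize_quotes($text) {
    return str_replace(
        array('"',       '‘',        '’',        "'"),
        array('&quot;',  '&lsquo;',  '&rsquo;',  '&rsquo;'),
        $text
    );
}
```

Run every text column through a pass like this before emitting the INSERT lines; one stray apostrophe in a French or Near Eastern place name is enough to break the whole import.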

But you can avoid all that hassle by just downloading my .sql file from Google Drive and importing it to your DB.  Have fun!

Robert Consoli
Cross-posted from Squinches.

Thursday, 27 June 2013

A New Dimension for Pelagios

For a number of reasons this is merely a News Flash, but we are extremely pleased to report that thanks to the generosity of the Andrew W. Mellon Foundation, Pelagios will shortly be exploring some new directions. Needless to say, we will still maintain our role as a connecting medium for Ancient World resources, but we think the coming year will also unlock a wide range of additional opportunities in historical Linked Open Data as well. To find out more, tune in on September 1st. Until then, we'll continue to keep you updated on partner activities, for which you could do no better than to check out the exciting stuff Bob Consoli has been up to over at Squinchpix...  


Thursday, 25 April 2013

How Ancient History Encyclopedia linked up with Pelagios

We're back for some information on how we linked Ancient History Encyclopedia to Pelagios. I hope that this can be of help for future websites that join this excellent project.

First of all, we need to explain how AHE works. The website is entirely based on tags / keywords. Each tag has one (and only one) definition associated with it, and many possible articles, illustrations, or timeline events. It is possible --and indeed necessary for the website to work properly-- that articles, illustrations, and timeline events are linked to many tags. An article on "Trade in Ancient Greece" would be tagged with "Greece", "Economy", "Trade", and "Colonization", and it would subsequently be listed under all those tags' pages.

Now the initial idea was easy: Let's link up every geographical tag of ours (cities, countries, regions) to its equivalent location in Pleiades. We've got 2,400 tags, and we expect to have many more in the future, so we didn't want to do this all by hand. Instead, we wanted something future-proof, that would notify us automatically of possible matches between tags and Pleiades locations.

Every day, a cron job automatically imports the Pleiades database of names, with their respective location IDs and coordinates, and mirrors it in our database. We wrote a nifty little PHP function that converts the Pleiades data to a PHP array -- feel free to use it.
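In outline, such an import function might look like this (a simplified sketch, not our actual code; the column positions are assumptions and would need to match the real dump layout):

```php
<?php
// Hypothetical sketch: read a Pleiades CSV dump into an array keyed by
// Pleiades ID. Column indices are assumed: 0 = ID, 1 = name, 2 = lat, 3 = lon.
function pleiades_csv_to_array($filename) {
    $places = array();
    $handle = fopen($filename, 'r');
    if ($handle === false) {
        return $places;                        // could not open the dump
    }
    fgetcsv($handle);                          // skip the header row
    while (($row = fgetcsv($handle)) !== false) {
        $places[$row[0]] = array(
            'name' => $row[1],
            'lat'  => (float) $row[2],
            'lon'  => (float) $row[3],
        );
    }
    fclose($handle);
    return $places;
}
```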

In our editorial team's interface we have a page that automatically tries to find possible matches between Pleiades place names and tags on AHE. For links, we only look at those tags which have a definition -- after all, we only want to link up content that is of use to potential readers, not empty tags. Editors can then review the link suggestions and either approve or reject them. That way, we found most of the links between our datasets.
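The core of such a matching step can be as simple as a case-insensitive name comparison. A simplified sketch, assuming the tags and the mirrored Pleiades names are already loaded into arrays (the function and field names here are illustrative, not our production code):

```php
<?php
// Hypothetical matching step: compare AHE tag titles against the mirrored
// Pleiades name list, case-insensitively, and collect candidate links for
// editorial review (editors still approve or reject each suggestion).
function suggest_links($tags, $pleiades) {
    $suggestions = array();
    foreach ($tags as $tag) {
        foreach ($pleiades as $place) {
            if (strcasecmp($tag['title'], $place['name']) === 0) {
                $suggestions[] = array(
                    'tag_id'      => $tag['id'],
                    'pleiades_id' => $place['id'],
                );
            }
        }
    }
    return $suggestions;
}
```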

Suggestion from the automatic linking script

Then there is the problem of links that aren't found by our automatic matching script. For example, on AHE the tag is called "Greece", whereas on Pleiades it's known as "Hellas". Another example: "Mediterranean" on AHE is known as "Internum Mare" on Pleiades. No script can figure that out!

For those cases, we added another feature to our tag editor form: our editorial team can search the Pleiades DB mirrored on our server for each tag. An editor could, for example, see the tag "Greece", notice that it's not linked to Pleiades, open the linking form for the tag, and manually search for "Hellas".
Tag listing for editors (2nd last column is the Pleiades link)
The search will give exactly the same type of results as the automatic linking does above, with a map to help the decision-making.

When a tag is linked, we write the Pleiades ID into a newly-created field in that tag's entry in our database (hoping that Pleiades will never change tag IDs).

Now it's time to deliver all this data in a format that Pelagios can understand. We have another script that goes through all the linked tags and fetches their respective definitions, as well as all articles and illustrations that are linked to them in our database. Then we output each tag definition as Turtle/RDF in the Pelagios format, linked to a specific Pleiades ID. All articles and images associated with that tag are also output for that Pleiades ID. The final result looks like this. Notice that while each definition only occurs once (one definition per tag), articles and images can appear multiple times, linked to multiple tags (as one article or image is linked to many tags).
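For illustration, a Pelagios-style annotation in Turtle might look roughly like this, linking one of our pages to a Pleiades URI via the Open Annotation vocabulary (a hand-written sketch, not our actual output; the exact predicates Pelagios expects may differ):

```turtle
@prefix oa: <http://www.w3.org/ns/oa#> .

# Hypothetical example: one annotation tying an AHE page to the
# Pleiades place it describes.
<#annotation1> a oa:Annotation ;
    oa:hasBody   <http://pleiades.stoa.org/places/579885> ;
    oa:hasTarget <http://www.ancient.eu.com/athens/> .
```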

Personally, I find that Turtle/RDF is somewhat mindboggling and not exactly easy to understand (I'm not a professional programmer), but with the excellent help of Simon Rainer, Elton Barker, and Leif Isaksen we managed to make it work and validate. Thanks a lot, guys... we couldn't have done it without you!

We then submit the generated file to Pelagios (in the next version of Pelagios it'll be imported automatically on a regular basis).

I hope that this was helpful or at the very least interesting to anyone who is looking to link up with Pelagios. If your site is similar to ours, do feel free to drop us a line on {editor AT ancient.eu.com}! We're always happy to help!