Thursday 21 June 2012

Improving the Arachne-Pleiades matching

In this blog post we describe how we improved the accuracy of the process by which we aligned Arachne to Pleiades. The first Arachne-Pleiades matching was strictly string-based, which brought several problems with it. (See previous posts 1 and 2.)

In a place-matching process, every usable context can reduce the risk of errors, especially when the granularity of the data differs between the two sides and one-to-many matchings are possible. At the same time, every detail used to refine the matching reduces the number of hits you can get. As far as Pelagios compliance is concerned, one simple fact reduces the likelihood of error from the start: Pelagios focuses on ancient world data, primarily material from the ancient Mediterranean. Thus the confusion of Memphis (Tennessee) in the US with Memphis in Egypt is avoided, even when both places are associated with a (or the) "King"... The main effort in the new process was to use more contexts from Arachne in order to make the matching more accurate.

The new matching

To help with matching places, we selected the context that had the most reasonable chance of enhancing the process: the geographical coordinate.

But there are problems: the point coordinates recorded for a place can differ from its exact position, and definitions of places are rarely as exact as WGS 84 coordinates, which can be exact to within a few centimetres. In addition, the geo-coordinates on both sides are rather roughly chosen.

Arachne internally uses three matchings:

  1. exact places in Arachne;
  2. topographical units in Arachne;
  3. the ancient landscapes in which at least those topographical units that are smaller than the landscapes are located. (A topographical unit in Arachne can also be bigger than a single landscape, so the size of these units is very flexible in the Arachne data model.)

The first and second matchings use the rough coordinates that the Arachne team has found for places.

As the first indicator for linking, we use distance-based matching. This uses the distance from a given Arachne place to each Pleiades place as an indication of whether the two places could match. If the distance lies below a certain threshold, the Pleiades place is then matched by the string-based matching that we described in our earlier blog post.
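
The two-step idea described above (distance filter first, then name comparison) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used for Arachne: the function names, the record layout, the 50 km threshold and the toy coordinates are all hypothetical, and a case-insensitive exact name comparison stands in for the string-based matching from the earlier post.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS 84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def match_place(arachne_place, pleiades_places, threshold_km=50.0):
    """Return (pleiades_id, distance_km) pairs, nearest first.

    threshold_km is a hypothetical value; the post does not state the
    threshold that was actually used.
    """
    wanted = {name.lower() for name in arachne_place["names"]}
    matches = []
    for candidate in pleiades_places:
        distance = haversine_km(arachne_place["lat"], arachne_place["lon"],
                                candidate["lat"], candidate["lon"])
        if distance > threshold_km:
            continue  # too far away: the name is not even compared
        # Stand-in for the string-based matching: exact, case-insensitive.
        if wanted & {name.lower() for name in candidate["names"]}:
            matches.append((candidate["id"], distance))
    return sorted(matches, key=lambda m: m[1])


# Toy data with made-up ids and approximate coordinates.
arachne_memphis = {"names": ["Memphis"], "lat": 29.85, "lon": 31.25}
candidates = [
    {"id": "memphis-near", "names": ["Memphis"], "lat": 29.85, "lon": 31.26},
    {"id": "memphis-far", "names": ["Memphis"], "lat": 35.15, "lon": -90.05},
]
result = match_place(arachne_memphis, candidates)
# Only the nearby candidate survives the distance filter.
```

Filtering by distance before comparing names keeps identically named but distant places apart, at the cost of missing true matches whose rough coordinates lie further apart than the threshold.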

Objects and other sources are linked to places and topographies, making it easy to connect them to the Pleiades places.

Where to advance the matching?

After producing a whole lot of links you get into trouble if you have to review them all: it is simply beyond the scope of a single person to look at each linked entity. So we need a faster way to review large quantities of links.
Description of the construction of the co-occurrence networks (created with yEd)


The complexity of handling ancient place-names emerges when you look at the following visualizations, created with Gephi, which show the co-occurrence networks of the Arachne-Pleiades matching: the Pleiades entities are shown as nodes; the edges or links between them are created when they are referenced by the same Arachne place. As we have described previously, an Arachne place carries more features than just one place name: it also keeps the country, city, etc. in hierarchical order.

These visualizations reveal just some of the problems you can expect when you use anything other than a 1-to-1 matching.
For example, the co-occurrence links should reveal a match between the city level and the ancient province. The main result of the place matching should also follow a more or less tree-like structure, because there are places that are part of other places or regions. So one place in a dataset can be matched to two places in the other place system.
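
To make the construction concrete, here is a minimal sketch of how co-occurrence edges could be derived from a list of links. It assumes the links are available as (Arachne id, Pleiades id) pairs; the function name and the input layout are our own illustration, and the id "arachne-x" is made up. The other ids are the Acropolis, Athenae and Achaea entities discussed later in this post.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_edges(links):
    """Build co-occurrence edges between Pleiades places.

    links: iterable of (arachne_id, pleiades_id) pairs.
    Returns a dict mapping a sorted pair of Pleiades ids to the set of
    Arachne ids that reference both of them.
    """
    pleiades_by_arachne = defaultdict(set)
    for arachne_id, pleiades_id in links:
        pleiades_by_arachne[arachne_id].add(pleiades_id)

    edges = defaultdict(set)
    for arachne_id, pleiades_ids in pleiades_by_arachne.items():
        # Every pair of Pleiades places sharing this Arachne place gets an edge.
        for u, v in combinations(sorted(pleiades_ids), 2):
            edges[(u, v)].add(arachne_id)
    return dict(edges)


links = [
    ("1207983", "579885"),    # Acropolis -> Athenae
    ("1207983", "579888"),    # Acropolis -> Achaea
    ("arachne-x", "579885"),  # made-up 1-to-1 link: produces no edge
]
edges = cooccurrence_edges(links)
```

Note that an Arachne place linked to only one Pleiades place contributes no edge at all, which is why 1-to-1 links are invisible in this kind of network.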



The old Arachne matching:
Geo-layout overview of the Pleiades places co-occurring with Arachne places using the old matching
Visualization explanation
  • Nodes: Entities from the Pleiades dataset that occur in the linkage
  • Edges: Entities from Arachne which link to at least two Pleiades places. An edge means that two Pleiades places are referenced by the same Arachne place
  • Node colour: Latitude (south-north), from blue (south) to grey (centre) to green (north)
  • Node size: Longitude (east-west), from small (west) to big (east)
This overview helps us to find out, at the level of individual places, how accurate the place matching is.

This visualization shows that a matching which mainly uses place-names produces a co-occurrence network with long edges. The fact that Arachne places (which are represented by the edges) link many Pleiades places with very different coordinates shows the level of label-based mismatches. This was a clear hint that there were a lot of errors in our matching that should be removed.
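
The same check can be done without a picture by measuring each co-occurrence edge and listing the longest ones for review. The following is a rough sketch, not part of the original workflow: the function names, the 500 km cutoff, the node labels and the toy coordinates are assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def great_circle_km(point_a, point_b):
    """Haversine distance in km between two (lat, lon) tuples in degrees."""
    lat1, lon1 = map(math.radians, point_a)
    lat2, lon2 = map(math.radians, point_b)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def long_edges(edges, coords, cutoff_km=500.0):
    """Return (u, v, length_km) for edges longer than cutoff_km, longest first.

    edges: iterable of (u, v) pairs of Pleiades ids.
    coords: dict mapping a Pleiades id to its (lat, lon).
    cutoff_km: hypothetical review cutoff, to be tuned per dataset.
    """
    flagged = []
    for u, v in edges:
        length = great_circle_km(coords[u], coords[v])
        if length > cutoff_km:
            flagged.append((u, v, length))
    return sorted(flagged, key=lambda e: e[2], reverse=True)


# Toy nodes with made-up labels and approximate coordinates.
coords = {"A": (38.0, 23.7), "B": (37.9, 23.8), "C": (30.0, 31.2)}
suspicious = long_edges([("A", "B"), ("A", "C")], coords)
```

A long edge is only a hint, not proof of an error: a place and the province that contains it can legitimately lie some distance apart, so the cutoff decides how many links end up in manual review.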

Bad links, good links

A good link, as shown in the example picture above, would be produced by linking the Acropolis in Athens in Arachne (http://arachne.uni-koeln.de/entity/1207983) to both Achaea (the Roman province, http://pleiades.stoa.org/places/579888) and Athenae (http://pleiades.stoa.org/places/579885) in Pleiades. The Acropolis is in Athenae, and it is also in Achaea (at another level of granularity). So in the visualisation the Acropolis would produce a link between Achaea and Athenae.

A bad link example would be the linking of Istanbul (http://arachne.uni-koeln.de/entity/887) to both "Byzantion?" (http://pleiades.stoa.org/places/49929) and "Byzantium" (http://pleiades.stoa.org/places/520985). The problem here is that the two places, Byzantium and Byzantion (both used as synonyms in the matching), are far away from each other. But we could expect that the name similarity has the same kind of origin as in the Memphis example from the introduction. This would also create a very long (very bad) edge from modern Istanbul to India.

The nodes in this visualization use the coordinates from the Pleiades dataset for positioning, which is achieved using a Gephi plugin. Even if the coordinates are rough, they can give a hint about the distance between the places they describe. This positioning is independent of the coordinates in Arachne, so it would also be applicable to datasets that do not have geo-data for their own places.



The new Arachne matching:
Geo-layout overview of the Pleiades places co-occurring with Arachne places using the new matching
In this graph there are far fewer long links between places, and many of the co-occurrence links are covered by the nodes. The picture is far clearer than the first graph because so few co-occurrence links of the "long" sort remain. Such long links indicate that two places which are not in a similar location are somehow connected.

One should keep in mind that fewer co-occurrence links does not mean fewer links: there could even be more links that connect just one Arachne place to one Pleiades place (more 1-to-1 relations), but they won't be shown in the visualization. We can say this because of the closed context of the Mediterranean area! If we were to try to extrapolate these results to the whole world, it wouldn't work, because the links on the borders of the projection would go from one end of the projection to the other, even though the places on the border would be close to each other.


Overviews

The following visualizations depict the complete view of the matchings. In these graphs, we have changed the method of node positioning from coordinate based to force based.

In the following examples, it is important to interpret the size and colour of the nodes. The size of a node depends on the longitude (east-west axis) of the Pleiades place it represents. The latitude (south-north axis) is shown by the node colour, which runs from green to grey to blue. For example, a place is green and small because it lies in the north-west. Another place is big and blue because it lies in the south-east. The places in the central region of ancient Greece are medium-sized and grey. (overview)
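
For readers who want to reproduce this encoding outside Gephi, the mapping can be sketched as two small functions. The names, the size range, the RGB values and the bounding box below are our own assumptions; the graphs in this post were styled in Gephi itself.

```python
BLUE, GREY, GREEN = (0, 0, 255), (128, 128, 128), (0, 128, 0)


def scale(value, low, high):
    """Map value from [low, high] to [0, 1], clamping values outside the range."""
    if high == low:
        return 0.5
    return min(1.0, max(0.0, (value - low) / (high - low)))


def node_size(lon, lon_min, lon_max, smallest=5.0, largest=30.0):
    """Small nodes in the west, big nodes in the east."""
    return smallest + scale(lon, lon_min, lon_max) * (largest - smallest)


def node_colour(lat, lat_min, lat_max):
    """Blue in the south, grey in the centre, green in the north (RGB tuple)."""
    t = scale(lat, lat_min, lat_max)
    if t < 0.5:
        start, end, local = BLUE, GREY, t / 0.5
    else:
        start, end, local = GREY, GREEN, (t - 0.5) / 0.5
    return tuple(round(s + (e - s) * local) for s, e in zip(start, end))


# Hypothetical bounding box roughly covering the Mediterranean.
LAT_MIN, LAT_MAX, LON_MIN, LON_MAX = 25.0, 50.0, -10.0, 45.0
south_east = (node_size(45.0, LON_MIN, LON_MAX), node_colour(25.0, LAT_MIN, LAT_MAX))
north_west = (node_size(-10.0, LON_MIN, LON_MAX), node_colour(50.0, LAT_MIN, LAT_MAX))
```

Because size and colour carry the coordinates, the geographic spread of a cluster stays readable even when a force-based layout discards the map positions.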

Here the length of edges or links is not that clear. The force-based layout shows where Pleiades places cluster by referencing the same Arachne Place(s).

The old matching again:

Force-based layout overview of the Pleiades places co-occurring with Arachne places using the old matching
Here we notice a mixture of node sizes and colours. This is another indicator of errors in the matching.


The new matching again:
Force-based layout overview of the Pleiades places co-occurring with Arachne places using the new matching

This overview is much clearer: there are many nearly tree-like structures, as we would expect.

An example from the old matching:

Detail that shows the places called Alexandria in the old matching
The obvious Alexandria-disambiguation-example is old hat (there are many Alexandrias in the ancient world!), but it acts as a useful example for showing how to read the visualization. The different Alexandrias are all connected - one placename matches them all - but they differ in their locations, as shown by colour and size. 

What we have learned:

An important lesson has been: don't use modern country names and don't use continent names! This is especially important because the ancient definition of Asia (http://pleiades.stoa.org/places/991380) and Europa (the German spelling of Europe) (http://pleiades.stoa.org/places/991376) differ markedly from their modern definitions. These errors produce a very visible effect, as you can see below:


Detail that shows the influence of Europa in co-occurrence of the old matching
Here Europe links to nearly everything, because the Arachne places contain the information that most places in Arachne lie within the modern definition of Europe! The connection between Asia Minor and Europe comes from the fact that there are places with the same name in modern Asia and modern Europe, so they will also co-occur.

The way in which the enhanced matching works is shown by the term "Asklepieion". Asklepieion is especially interesting because it denotes a god as well as whole sanctuaries or specific temple complexes that can be inside or outside of larger sanctuaries, while those temple complexes or sanctuaries can in turn be inside or outside of cities. In the old matching, all the "Asklepieion" occurrences were matched together and composed a large co-occurrence cluster:

Cluster that shows the co-occurrence of Asklepieion in the old matching
The new matching separates the Asklepieions from each other and shows how they appear in other contexts:
First occurrence of Asklepieion in the new matching with all connected places highlighted
Second occurrence of Asklepieion in the new matching with all connected places highlighted
Third occurrence of Asklepieion in the new matching with all connected places highlighted

What still needs to be done

The new matching does not solve all the problems shown here: for example, there are many different "Asias" that describe different Roman provinces at different times and at different stages of expansion. This is not yet taken into account; we need a much more time-place-interactive Mediterranean gazetteer. Without the time context, these problems are hard to resolve because the places have nearly the same point coordinates. So there is also the conflict about which side should enhance and refine its data.

Reinhard Förtsch, Rasmus Krempel
Arachne Database, CoDArchLab University of Cologne
