Monday, 23 April 2012

Expressing uncertain and vague relationships


As part of our work on CLAROS we're again coming up against the problem of modelling uncertainty and vagueness of relationships using the CIDOC Conceptual Reference Model (CRM).

Say a database has a "find location" column that stores free text. In CLAROS we model this using the following pattern:
:object a crm:E22_Man-Made_Object ;
 crm:P16i_was_used_for :finding-object .
:finding-object a crm:E7_Activity ;
 crm:P2_has_type claros:FindActivity ;
 crm:P7_took_place_at :place .

Real-world data

However, we don't always know the find location with certainty. For example, in a database you might expect to see entries like:
  • Ankara
  • Ankara(?)
  • near Ankara
  • between Ankara and Peçenek
How should we model these? First, we should consider our requirements:
  • the model should be as simple as possible
  • the model shouldn't lead to unwarranted interpretations by naïve agents (the open world assumption meets progressive enhancement)
  • where possible we should re-use existing approaches
The first entry, "Ankara", is straightforward: we can say unequivocally that the activity took place at Ankara. The others will require more thought.

Vague relationships

The third and fourth entries are vague: “how near?”, “how far from the midline between the two places counts as 'between'?”. We can probably assume that if X is between Y and Z, then X is near both Y and Z. We can also assume that if X is near Y, then X is almost certainly contained within Y's containing area. For example:
  • “P near Ankara”, “Ankara Province contains Ankara” ⊢ “Ankara Province contains P”
  • “Q between Ankara and Peçenek” ⊢ “Q near Ankara”, “Q near Peçenek”
    “Q near Ankara”, “Ankara Province contains Ankara” ⊢ “Ankara Province contains Q”
We can model “near” using FOAF's foaf:based_near predicate, which maintains the sense of vagueness. A place that is between two other places can be thought of as being near to both of them. Containment can be modelled using CRM's P89 “falls within” property. By combining both approaches we satisfy CRM-only consumers (who discover an imprecise geographic location), while preserving the original relationship to the nearby place for those that care to look for it.
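As a rough illustration, the entailments above can be sketched in a few lines of Python. This is only a toy: places are plain strings rather than CRM or FOAF resources, and the function names between and infer_containment are made up for this example.

```python
# Toy sketch: places are plain strings, not real CRM/FOAF resources.
# The function names are hypothetical and exist only for this example.

def between(x, a, b):
    """Treat 'x between a and b' as 'x near a' and 'x near b'."""
    return {(x, a), (x, b)}

def infer_containment(near, contains):
    """If a thing is near a place, assume it shares that place's container.

    near:     set of (thing, place) pairs
    contains: set of (container, place) pairs
    returns:  set of derived (container, thing) pairs
    """
    return {
        (container, thing)
        for (thing, place) in near
        for (container, contained) in contains
        if contained == place
    }

near = {("P", "Ankara")} | between("Q", "Ankara", "Peçenek")
contains = {("Ankara Province", "Ankara")}

derived = infer_containment(near, contains)
for container, thing in sorted(derived):
    print(f"{container} contains {thing}")
```

Nothing is derived from “Q near Peçenek” here, simply because the sample data records no containing area for Peçenek.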
If we don't trust the assertion that being near somewhere means you share a containing place, we can attempt to encode some uncertainty, as we will see next.

Uncertain relationships

Finally, we have the uncertainty encoded by the “(?)” in the second example, “Ankara(?)”.
We interpret '(?)' to mean that there is some uncertainty as to whether the object was actually found in Ankara. Because of this, it would be misleading to say simply that the activity took place at Ankara.

I posed this problem to the crm-sig mailing list, and received a response suggesting that the RDF reification vocabulary would be a good fit. The RDF semantics document says that “a reification of a triple does not entail the triple, and is not entailed by it”. This might seem a bit counter-intuitive, but it makes sense when you consider that the reified statement might be attested more than once, and that it might not necessarily be true. We could thus use a reified statement and attach properties to say “this statement is possibly true”. However, this is rather verbose and produces patterns that are very unlike un-reified RDF.
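To give a feel for that verbosity, here is a minimal sketch that builds the reified form of a single triple as plain Python tuples. The rdf:Statement, rdf:subject, rdf:predicate and rdf:object terms are the standard reification vocabulary; :claim1, :ankara, ex:certainty and ex:Possible are names invented for this example.

```python
# Prefixed names are kept as plain strings. ":claim1", ":ankara",
# "ex:certainty" and "ex:Possible" are hypothetical names for illustration.

def reify(statement, subject, predicate, obj):
    """Return the four reification-vocabulary triples describing one statement."""
    return [
        (statement, "rdf:type", "rdf:Statement"),
        (statement, "rdf:subject", subject),
        (statement, "rdf:predicate", predicate),
        (statement, "rdf:object", obj),
    ]

# The plain, certain assertion is a single triple.
plain = (":finding-object", "crm:P7_took_place_at", ":ankara")

# The uncertain version: four reification triples plus one annotation.
uncertain = reify(":claim1", *plain)
uncertain.append((":claim1", "ex:certainty", "ex:Possible"))

print(len(uncertain), "triples instead of 1")
print("plain triple asserted:", plain in uncertain)
```

The plain triple does not appear among the reified ones, which mirrors the point that a reification does not assert the statement it describes.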
Martin Doerr suggested defining a superproperty for each relevant CRM property to produce uncertain variants. Hence P7_took_place_at becomes a subproperty of P7_possibly_took_place_at. This allows agents to either query for the former (finding certain relationships) or query for the latter and its subproperties. If a query engine supports basic inference, then the latter comes for free; otherwise one would turn a query like:
SELECT * WHERE {
 ?event crm:P7_took_place_at ?place
}
into:
SELECT * WHERE {
 ?took_place_at rdfs:subPropertyOf? ex:P7_possibly_took_place_at .
 ?event ?took_place_at ?place
}
Here, the question mark after rdfs:subPropertyOf denotes a property path (a route from one thing to another through the RDF graph) consisting of zero or one instances of the given property. Thus, ?took_place_at is bound in turn to ex:P7_possibly_took_place_at and all its immediate sub-properties. Assuming that there is a well-understood property linking a property to its uncertain variant, ex:uncertainVariantOf, then we could also formulate this as:
SELECT * WHERE {
 ?took_place_at ex:uncertainVariantOf? crm:P7_took_place_at .
 ?event ?took_place_at ?place
}
This is (in my view) rather cleaner than the reification approach. It should be possible to programmatically generate uncertain variants of vocabularies from the original vocabulary definition, and to transform SPARQL queries to take into account uncertain relationships.
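A minimal sketch of what such generation and query rewriting might look like follows. It assumes the naming convention used above (inserting "possibly" after the property number), an ex: namespace for the variants, and the ex:uncertainVariantOf linking property. The rewriting is deliberately naive: it only handles one triple pattern per line and is not a real SPARQL parser.

```python
# Sketch only: the "ex:" namespace, the "_possibly_" naming convention and
# ex:uncertainVariantOf are assumptions, not an existing vocabulary.

def uncertain_variant(prop):
    """crm:P7_took_place_at -> ex:P7_possibly_took_place_at"""
    _prefix, local = prop.split(":", 1)
    code, rest = local.split("_", 1)
    return f"ex:{code}_possibly_{rest}"

def vocabulary_triples(props):
    """Turtle-style lines declaring each variant and linking it to its original."""
    lines = []
    for prop in props:
        variant = uncertain_variant(prop)
        lines.append(f"{prop} rdfs:subPropertyOf {variant} .")
        lines.append(f"{variant} ex:uncertainVariantOf {prop} .")
    return lines

def rewrite_query(query, props):
    """Replace each 'subject property object' line that uses a known property
    with a pattern that also matches the property's uncertain variant."""
    out = []
    counter = 0
    for line in query.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[1] in props:
            var = f"?p{counter}"
            counter += 1
            out.append(f" {var} ex:uncertainVariantOf? {parts[1]} .")
            out.append(f" {parts[0]} {var} {' '.join(parts[2:])}")
        else:
            out.append(line)
    return "\n".join(out)

props = ["crm:P7_took_place_at"]
query = "SELECT * WHERE {\n ?event crm:P7_took_place_at ?place\n}"

print("\n".join(vocabulary_triples(props)))
print(rewrite_query(query, props))
```

A production version would work on a parsed query algebra rather than on text, but the transformation itself is the same.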
So, does defining a way to create uncertain vocabularies linked to their certain counterparts sound like a good idea?

Tuesday, 27 March 2012

What do end users want from Pelagios widgets?

'Ancient mashup' (thanks to flunitrazepam)
We weren't 100% sure so we asked a few (and some of them weren't sure either, but that's understandable and useful to know). The 23 people we asked were suggested by Elton and other Pelagios partners, and 12 folk responded. This post describes:
  1. Who the respondents are in terms of their roles (Who?)
  2. What sorts of things they would like Pelagios widgets to do (What?)
  3. Contexts - the sorts of activities respondents undertake related to ancient places and history, concerns they have and so on
  4. Next steps
Thanks to all those who responded: you've provided lots of useful information!


Who?

The first couple of questions we asked gave us information about the respondents' activities and roles with respect to ancient history and ancient places. To give you an idea of who our respondents are, they told us that they used the internet for ancient history/ancient place related activities:
  • as students,
  • as researchers,
  • as teachers or lecturers,
  • as bloggers,
  • for communications and marketing,
  • for hobbies and leisure activities,
  • and for researching and writing historical fiction.
6 of the 12 used the internet to pursue more than one of these activities, the rest for one activity alone. The most popular categories were 'Teacher/lecturer' and 'Researcher'.

What?

We asked them what sorts of things they would like Pelagios widgets to do and got some thought-provoking answers, including:
  • Embed a map so it works in a wordpress.com blog.
  • Display places and movements represented in specific texts.
  • 'Compare the geographical relationships (and names) represented in ancient texts with historical and modern representations'.
  • Serve archaeology/art/museums and go beyond the 'classical world'.
  • 'My students want quick lookup tools that are linked to authoritative information'.
  • 'Link ancient places with ancient sources, but also with general knowledge about the place, author and work (as many people without university studies about this field might not know who is Herodotus or why he is famous for). I would also love to browse ancient places through geolocalization mobile apps I already use: Google Maps / Google Places, Foursquare, etc.'
  • 'Would love to put them in our own VLE'; 'I'd like stuff that was directly related to current course content on GCSE, AS and A2'.
However, several respondents remarked that they didn't know, because they felt they did not have a good idea of what the options and possibilities are. That's an understandable reaction, and something we will try to address by making initial versions of the widgets available for user testing as soon as they are ready, in April or May 2012.
Overall, responses to other questions indicate an interest that could be satisfied by Pelagios widgets even when the respondent was not clear about precise possibilities and options. For example: 'my interest in places is because some geographical etc. information will illuminate the text being read'.

Contexts


Web sites

Respondents were asked to tell us their favourite ancient history websites, or ones which they visit most often. Answers included:
These suggested websites are also potential target hosts for Pelagios widgets.

We also listed 12 Pelagios partner websites and asked whether the respondents had visited them in the last 3 months. The percentage of respondents who had visited each site is shown in the table below.

Name                         URL                                            Visited in last 3 months (%)
Perseus Digital Library      http://www.perseus.tufts.edu                   83.3
Google Ancient Places        http://googleancientplaces.wordpress.com/      50.0
Fasti Online                 http://www.fastionline.org/                    33.3
Pleiades                     http://pleiades.stoa.org                       33.3
Arachne                      http://www.arachne.uni-koeln.de                16.7
Pelagios Graph Explorer      http://pelagios.dme.ait.ac.at/graph-explorer/  16.7
SPQR                         http://spqr.cerch.kcl.ac.uk/                   16.7
CLAROS                       http://explore.clarosnet.org                   8.3
American Numismatic Society  http://nomisma.org                             8.3

Respondents also told us about the websites and blogs that they themselves create, edit or manage. These include: rogueclassicism.com; sententiae antiquae (original translations of quotations from Ancient Greek and Roman literature); http://flavias.blogspot.co.uk/ (facts, research, news and topics linked to the children's books The Roman Mysteries and The Western Mysteries); The world of ancient art (in development, working in tandem with www.clarosnet.org); http://blogs.sapiens.cat/picantpedra (mainly about Roman archaeology, but also related to ancient places); and www.theclassicslibrary.com (a website for teachers of Classics).

Activities and reasons for using the web in relation to ancient history and ancient places

Teachers, students and researchers reported using the internet to research ancient sites and museums prior to visiting them. Students also reported using the web for accessing teaching resources, and for reading journal articles and ebooks.
Researchers and teachers use the internet to access and search classical texts and primary sources in the original language and in translation, and to use tools such as Latin and Ancient Greek dictionaries, lexicons, language parsing tools and calendars.
Students, leisure/hobby folk, researchers and teachers all expressed a general interest in being able to "see what's there" and in filling in "holes in background knowledge".

Concerns about interacting with data related to ancient places online

When asked whether they would have any concerns about interacting with data related to ancient places, more than half of the respondents said that they would not. Those who did raised issues about the tracking that could result from it, and about the quality control and accuracy of the data.

Downloading data relating to ancient places

There was interest in the output of the widgets being rendered as JPEG images or PDF files for use in teaching or publication, and respondents asked whether the copyright of materials generated in this way would permit such use.
There was also interest in the data licence permitting students to work with the data, e.g. in collaborative projects guided by mentors.

What's next?

Work on the widgets is underway, and there are several iterations of both graphic and functional development scheduled between now and the end of June 2012. The next iteration is due at the end of April 2012.




Monday, 26 March 2012

Assembling the Pelagios 2 Infrastructure

A short update from Pelagios WP1, which is about assembling the data infrastructure behind our project. To some extent this builds directly on the Graph Explorer - our proof-of-concept visualization interface from Pelagios 1. But, while we were able to reuse the things we learned from the first stage of the project, we've also come to realize how the things we had previously implemented - features, data model, speed and scalability - have already been surpassed. WP1 therefore starts off with a complete reappraisal and re-write of the central functionality. Details follow below, along with a rough outline of what's been done so far and what the next steps will be.

Consolidation


Given the need in Pelagios 1 to get a demonstration up and running quickly, the Graph Explorer ended up being a bit of a monolith. A key goal of the re-write has therefore been to modularize the codebase better, and to consolidate the core functionality into one software library that can be more easily re-used elsewhere. We're still working on finalizing and testing this library as partners deliver updates to their data, but the essentials are finished. There's a convenient programming API for working with Pelagios's core model primitives - Datasets, GeoAnnotations and Places - in your own software. Bindings to store Pelagios data in a graph database are included, but without the hard-wired dependency that existed in the Graph Explorer. In this regard the Tinkerpop graph database abstraction framework has greatly helped us to achieve good decoupling between data model and implementation classes, to reduce code size by eliminating much of the boilerplate code, and to keep things generic: the bindings should now be re-usable with a variety of graph database brands (although some of the more advanced I/O and query functionality remains specific to Neo4j - our DB of choice for Pelagios).

Less Speed, More Memory Consumption

Or was this the other way round? Regarding our toolset to read Pelagios data into the system, we switched our underlying RDF parser from Jena to the OpenRDF Rio parser framework. This allows us to more directly hook into the RDF parsing lifecycle, and avoids the need to construct full RDF graphs in memory before we can actually work with the data. As a result, parsing is now faster and less memory intensive. (Credit goes to Arachne for letting us learn the hard way that datasets can be... LARGE.)
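The difference between streaming and building a full graph can be illustrated with a toy example. The sketch below is in Python rather than our actual Scala/Rio code, handles only simple one-line "subject predicate object ." statements, and uses made-up example.org IRIs. The point is just that each triple is handed to the consumer as soon as it is read, and nothing is kept in memory afterwards.

```python
from collections import Counter

# Toy illustration of streaming parsing. This is NOT the Rio API and not a
# complete N-Triples parser; the example.org IRIs are invented.

def stream_triples(lines):
    """Yield one (subject, predicate, object) tuple per line, one at a time."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        subject, predicate, obj = line.rstrip(".").strip().split(None, 2)
        yield subject, predicate, obj

def count_predicates(lines):
    """Consume the stream, keeping only a running tally per predicate."""
    counts = Counter()
    for _subject, predicate, _obj in stream_triples(lines):
        counts[predicate] += 1
    return counts

sample = [
    "# two annotations pointing at places",
    "<http://example.org/a1> <http://example.org/hasTarget> <http://example.org/place/1> .",
    "<http://example.org/a2> <http://example.org/hasTarget> <http://example.org/place/2> .",
    "<http://example.org/a1> <http://example.org/label> \"An inscription\" .",
]

counts = count_predicates(sample)
print(counts)
```

In real use, lines would be an open file handle, so memory use stays roughly constant however large the dataset is.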

Getting our Feet Wet with Scala

As with Pelagios 1, the technological basis for our server-side components is still the Java Virtual Machine. This time, however, we chose to go with Scala:

  • Scala's syntax is, in general, more compact than that of Java.
  • Scala's functional aspects and comprehensive features for dealing with collections and lists are a very good fit with the things we frequently do when handling Pelagios data. Scala almost always eliminates the need for iterations and loops in those cases, and often achieves the same result with a single line of code.
  • Pattern matching has been another nice feature to make our parser classes (in particular) much more concise.
  • Last but not least: someone once suggested that as a developer, you should learn at least one new programming language every year. Although I find that advice a little fierce, new languages definitely encourage you to think about the same problems in different ways!

Next Steps

With our core library in place, we are now almost ready to replace the old Graph Explorer. While we are busy wrapping an HTTP frontend around our core library, our partners are already starting to make the most of our all-new, third Pelagios Principle: "Expose metadata about your dataset using the VoID vocabulary". But that's for another post!