Monday, 23 April 2012

Expressing uncertain and vague relationships


As part of our work on CLAROS we're again coming up against the problem of modelling uncertainty and vagueness of relationships using the CIDOC Conceptual Reference Model (CRM).

Say a database has a "find location" column that stores free text. In CLAROS we model this with the following pattern:
:object a crm:E22_Man-Made_Object ;
 crm:P16i_was_used_for :finding-object .
:finding-object a crm:E7_Activity ;
 crm:P2_has_type claros:FindActivity ;
 crm:P7_took_place_at :place .

Real-world data

However, we don't always know the find location with certainty. For example, in a database you might expect to see entries like:
  • Ankara
  • Ankara(?)
  • near Ankara
  • between Ankara and Peçenek
How should we model these? First, we should consider our requirements:
  • the model should be as simple as possible
  • the model shouldn't lead to unwarranted interpretations by naïve agents (the open world assumption meets progressive enhancement)
  • where possible we should re-use existing approaches
The first example is straightforward: we can say unequivocally that the activity took place at Ankara. The others require more thought.

Vague relationships

The third and fourth examples are vague: “how near?”, “how far from the midline between the two places counts as 'between'?”. We can probably assume that if X is between Y and Z, then X is near both Y and Z. We can also assume that if X is near Y, then X is almost certainly contained within Y's containing area. For example:
  • “P near Ankara”, “Ankara Province contains Ankara” ⊢ “Ankara Province contains P”
  • “Q between Ankara and Peçenek” ⊢ “Q near Ankara”, “Q near Peçenek”
    “Q near Ankara”, “Ankara Province contains Ankara” ⊢ “Ankara Province contains Q”
We can model “near” using FOAF's foaf:based_near predicate, which maintains the sense of vagueness. A place that is between two other places can be thought of as being near to both of them. Containment can be modelled using CRM's P89 “falls within” property. By combining both approaches we satisfy CRM-only consumers (who discover an imprecise geographic location), while preserving the original relationship to the nearby place for those that care to look for it.
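As a rough sketch of that combined pattern, the “near Ankara” and “between Ankara and Peçenek” entries might be encoded as below. The crm: namespace shown is an assumption (substitute whichever CRM encoding you use), the default namespace and all of the resource names (:place-1, :ankara, :ankara-province and so on) are made up for illustration, and “falls within” is written as P89, following the published CRM numbering.

```turtle
@prefix crm:  <http://erlangen-crm.org/current/> .  # assumed CRM namespace
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix :     <http://example.org/data/> .           # hypothetical data namespace

# "near Ankara": the find place is a place in its own right, not Ankara itself
:finding-object-1 a crm:E7_Activity ;
    crm:P7_took_place_at :place-1 .

:place-1 a crm:E53_Place ;
    foaf:based_near :ankara ;                   # keeps the vague "near" relationship
    crm:P89_falls_within :ankara-province .     # coarser location for CRM-only consumers

# "between Ankara and Peçenek": near both places
:finding-object-2 a crm:E7_Activity ;
    crm:P7_took_place_at :place-2 .

:place-2 a crm:E53_Place ;
    foaf:based_near :ankara , :pecenek ;
    crm:P89_falls_within :ankara-province .     # inferred from "near Ankara"
```

A consumer that only understands CRM sees a find place within Ankara Province; one that also understands FOAF can recover the original “near” statements.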
If we don't trust the assertion that being near somewhere means you share a containing place, we can attempt to encode some uncertainty, as we will see next.

Uncertain relationships

Finally, we have the uncertainty encoded by the “(?)” in the second example. We interpret “(?)” to mean that there is some uncertainty as to whether the object was actually found in Ankara. Because of this it would be misleading to say simply that the activity took place at Ankara.

I posed this problem to the crm-sig mailing list, and received a response suggesting that the RDF reification vocabulary would be a good fit. The RDF semantics document says that “a reification of a triple does not entail the triple, and is not entailed by it”. This might seem a bit counter-intuitive, but it makes sense when you consider that the reified statement might be attested more than once, and that it might not necessarily be true. We could thus use a reified statement and attach properties to it to say “this statement is possibly true”. However, this is rather verbose and produces patterns that are very unlike un-reified RDF.
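To give a feel for that verbosity, here is a minimal sketch of what a reified “Ankara(?)” might look like. The rdf: terms are the standard reification vocabulary; ex:certainty and ex:Possible are hypothetical names invented for this example, and the crm: namespace is again an assumption.

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix crm: <http://erlangen-crm.org/current/> .  # assumed CRM namespace
@prefix ex:  <http://example.org/vocab#> .         # hypothetical vocabulary
@prefix :    <http://example.org/data/> .          # hypothetical data namespace

:finding-object a crm:E7_Activity .

# The statement is described rather than asserted, so an agent cannot
# conclude from this graph that the activity took place at Ankara.
:claim-1 a rdf:Statement ;
    rdf:subject   :finding-object ;
    rdf:predicate crm:P7_took_place_at ;
    rdf:object    :ankara ;
    ex:certainty  ex:Possible .   # hypothetical "possibly true" marker
```

Note that one triple has become five, and a query for crm:P7_took_place_at no longer matches anything.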
Martin Doerr suggested defining super-properties of the relevant CRM properties to produce uncertain variants. Hence P7_took_place_at is a subproperty of P7_possibly_took_place_at. This allows agents to query either for the former (finding only certain relationships) or for the latter and its subproperties (finding both certain and uncertain ones). If a query engine supports basic inference, then the latter comes for free; otherwise one would turn a query like:
SELECT * WHERE {
 ?event crm:P7_took_place_at ?place
}
into:
SELECT * WHERE {
 ?took_place_at rdfs:subPropertyOf? ex:P7_possibly_took_place_at .
 ?event ?took_place_at ?place
}
Here, the question mark after rdfs:subPropertyOf denotes a property path (a route from one thing to another through the RDF graph) consisting of zero or one instances of the given property. Thus, ?took_place_at is bound in turn to ex:P7_possibly_took_place_at and all its immediate sub-properties. Assuming that there is a well-understood property linking a property to its uncertain variant, ex:uncertainVariantOf, then we could also formulate this as:
SELECT * WHERE {
 ?took_place_at ex:uncertainVariantOf? crm:P7_took_place_at .
 ?event ?took_place_at ?place
}
This is (in my view) rather cleaner than the reification approach. It should be possible to programmatically generate uncertain variants of vocabularies from the original vocabulary definition, and to transform SPARQL queries to take into account uncertain relationships.
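For concreteness, the declarations that the queries above rely on might look something like the following sketch. The ex: namespace is a placeholder, ex:uncertainVariantOf is the hypothetical linking property discussed above, and the crm: namespace is an assumption.

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix crm:  <http://erlangen-crm.org/current/> .  # assumed CRM namespace
@prefix ex:   <http://example.org/vocab#> .         # hypothetical vocabulary
@prefix :     <http://example.org/data/> .          # hypothetical data namespace

# The uncertain variant, linked back to its certain counterpart
ex:P7_possibly_took_place_at a rdf:Property ;
    rdfs:label "possibly took place at" ;
    ex:uncertainVariantOf crm:P7_took_place_at .

# Anything that certainly took place somewhere also possibly took place there
crm:P7_took_place_at rdfs:subPropertyOf ex:P7_possibly_took_place_at .

# "Ankara(?)" then needs only a single, ordinary-looking triple
:finding-object a crm:E7_Activity ;
    ex:P7_possibly_took_place_at :ankara .
```

A generator for such a vocabulary would only need to emit the first two statements for each property in the source vocabulary.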
So, does defining a way to create uncertain vocabularies linked to their certain counterparts sound like a good idea?