SPARQL in RDFLib

Introduction

This is a short overview of the query facilities added on top of RDFLib, a Python toolkit to manage RDF triples. They are based on the October 2004 version of the SPARQL draft being worked on at the W3C. SPARQL builds on the RDQL facilities offered by more powerful RDF toolkits like Jena; however, RDQL is not available (to my knowledge) on top of RDFLib.

The two files necessary for running it are rdflibUtils.py and sparql.py. rdflibUtils.py defines a subclass of RDFLib's TripleStore that includes, among other things, the SPARQL facilities. (The SPARQL implementation is in sparql.py; it has been kept in a separate file for purely practical purposes, essentially for easier maintenance.)

Disclaimer: I do not think any of the methods, specifications, or algorithms in this implementation are new; they have probably been found and implemented a million times by others. I just came up with something that worked, without making an extensive literature search...

Disclaimer II: the SPARQL draft describes its facilities in terms of a query language. A full-blown SPARQL implementation would (I believe) include a parser of that language that would map onto the underlying implementation. This implementation does not include such a parser, only the underlying SPARQL engine. The description below shows how the mapping works, but users of rdflibUtils should call the corresponding methods directly.

Basic SPARQL

The basic SPARQL construct is as follows (using the query-like syntax of the draft):

SELECT ?a,?b,?c
WHERE (?a P ?x)
      (Q ?b ?a)
      (?x R ?c)
       ...

The meaning of this construction is simple: the '?a', '?b', etc., symbols (referred to as 'unbound' symbols) are queried with the constraint that the tuples listed in the WHERE clause are 'true', i.e., part of the triple store. This functionality is translated into a Python method as:

select = ("?a","?b","?c")
where  = [("?a",P,"?x"),(Q,"?b","?a"),("?x",R,"?c")]
result = myTripleStore.sparqlQuery(select,where)

where result is a list of tuples, each giving a possible binding combination for "?a", "?b", and "?c", respectively. P, Q, R, etc., must be the rdflib incarnations of RDF resources, i.e., URIRef-s, Literals, etc. As a convenience to the user, if one of the entries in a tuple is a simple string (i.e., not a query string starting with '?'), it is converted to a Literal on the fly. This allows coding in the form:

select = ("?a","?b","?c")
where  = [("?a",P,"?x"),(Q,"?b","?a"),("?x",R,"?c"),("?x",S,"Some Literal Here")]
result = myTripleStore.sparqlQuery(select,where)

Note that the current SPARQL draft does not include this facility, but it does include a similar facility for integers. It also includes a remark whereby more such types will be defined in the future (and I would expect strings to be one of them). On the other hand, rdflib does not yet include full control of datatypes (e.g., datatypes are not taken into account when comparing literals), so I have not implemented other datatypes yet.

As a further convenience to the user, if select consists of one single entry, it is not necessary to use a tuple; giving the string value alone will do. Similarly, if where consists of one single tuple, the list construction may be skipped, and the single tuple is accepted as an argument. Finally, if select consists of one entry, result is a list of values rather than a list of (single-element) tuples.

The rdflibUtils.SPARQLError exception is raised (beyond the possible exceptions raised by rdflib) if there are inconsistencies in the select or where clauses (e.g., an unbound variable in select does not appear in the where clause).

Constraining Values

SPARQL makes it possible to constrain values through operators, like:

SELECT ?a,?b,?c
WHERE (?a P ?x)
      (Q ?b ?a)
      (?x R ?c)
AND   ?x < 10
      ...

The draft also notes that application-specific functions can be used in the 'AND' part. There are two ways to translate this feature into Python (see below for a further discussion): as a global constraint and as a per-pattern constraint.

Global Constraint

This version is based on constraints that refer to the whole binding of the pattern and is therefore executed against the full binding. Here is how it looks in Python:

select = ("?a","?b","?c")
where  = [("?a",P,"?x"),(Q,"?b","?a"),("?x",R,"?c"),("?x",S,"Some Literal Here")]
funcs  = [func1,func2,...]
result = myTripleStore.sparqlQuery(select,where,constraints=funcs)

where constraints is a list of functions (note the keyword argument). Each function is of the form:

def func1(bindingDir):
    ....
    return True # or False

where bindingDir is a dictionary of a possible binding, i.e., of the form {"?a" : Resource1, "?b" : Resource2, ...}.
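For example, a global constraint corresponding to the ?x &lt; 10 condition above might look like this (a sketch using plain Python values in place of rdflib Literals; the function name is illustrative):

```python
# Illustrative global constraint: accept a binding only if the value bound
# to "?x" is an integer below 10. bindingDir maps variable names to terms.
def xBelowTen(bindingDir):
    try:
        return int(bindingDir["?x"]) < 10
    except (KeyError, ValueError, TypeError):
        # reject bindings where "?x" is missing or not a number
        return False
```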

Just as in the case of the select parameter, if only one constraint function is used, the list may be bypassed, i.e.:

select = ("?a","?b","?c")
where  = [("?a",P,"?x"),(Q,"?b","?a"),("?x",R,"?c"),("?x",S,"Some Literal Here")]
result = myTripleStore.sparqlQuery(select,where,constraints=func)

is also valid.

Per-Pattern Constraint

This version is based on a constraint that can be imposed on one specific (bound) pattern only. In Python, this is expressed by adding a fourth element to the tuple representing the pattern, e.g.:

select = ("?a","?b","?c")
where  = [("?a",P,"?x",func),(Q,"?b","?a"),("?x",R,"?c"),("?x",S,"Some Literal Here")]
result = myTripleStore.sparqlQuery(select,where)

where func is a function with three arguments (the bound versions of "?a", P, and "?x" in the example).

Why Two Constraints?

The global constraint is clearly a superset of the per-pattern constraint. The original comparison example can therefore be expressed in two different ways:

select = ("?a","?b","?c")
where  = [("?a",P,"?x"),(Q,"?b","?a"),("?x",R,"?c"),("?x",S,"Some Literal Here")]
result = myTripleStore.sparqlQuery(select,where, constraints=lambda binding: int(binding["?x"]) < 10)

or:

select = ("?a","?b","?c")
where  = [("?a",P,"?x",lambda a,P,x: int(x) < 10),
          (Q,"?b","?a"),("?x",R,"?c"),("?x",S,"Some Literal Here")]
result = myTripleStore.sparqlQuery(select,where)

However, the second version may be much more efficient. The search is 'cut' early by the constraint, i.e., the binding tree is not (unnecessarily) expanded, whereas the full binding tree must be generated for a global constraint (see the notes on the implementation below). For large triple stores and/or large patterns this may make a significant difference.

Which of the two versions is really required is not clear from the SPARQL draft. It may well be that only the per-pattern constraint will be retained, in which case the global constraint is an add-on feature. If a global constraint is required, a parser of the SPARQL notation may optimize by calling the per-pattern constraint instead. (But not in all cases: although this is doable for obvious operators, the draft also allows general application functions as constraints, in which case this optimization is not possible.)

Nested Patterns

A slight variation of the basic scheme could be described as:

SELECT ?a,?b,?c
WHERE (?a P ?x)     
      {(Q ?b ?a) (S ?b ?a)}     
      (?x R ?c)
      ...

meaning a logical 'or' on one of the clauses. This construction is, in fact, just a convenience; semantically, it is the concatenation of two queries:

SELECT ?a,?b,?c
WHERE (?a P ?x)
      (Q ?b ?a)
      (?x R ?c)
      ...

and

SELECT ?a,?b,?c
WHERE (?a P ?x)
      (S ?b ?a)
      (?x R ?c)
      ...

i.e., it is just syntactic sugar, albeit a very convenient one. This is translated into Python as follows:

select = ("?a","?b","?c")
where  = [("?a",P,"?x"),[(Q,"?b","?a"),(S,"?b","?a")],("?x",R,"?c")]
result = myTripleStore.sparqlQuery(select,where)

It is not clear from the SPARQL draft whether nesting is meant to be of arbitrary depth. This implementation supports one level of nesting only.
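The "concatenation of queries" semantics above can be sketched as expanding a one-level nested where clause into its flat alternatives (a hypothetical helper, not the actual implementation, which works directly on the expansion tree):

```python
from itertools import product

# Expand a one-level nested 'where' list into the equivalent flat queries,
# whose results would simply be concatenated. Each entry is either a single
# pattern tuple or a list of alternative pattern tuples.
def expandNested(where):
    choices = [entry if isinstance(entry, list) else [entry] for entry in where]
    return [list(combo) for combo in product(*choices)]
```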

Optional Matching

Another variation on the basic query is the usage of 'optional' clauses:

SELECT ?a,?b,?c,?d
WHERE    (?a P ?x)      
         (Q ?b ?a)      
         (?x R ?c)      
OPTIONAL (?x S ?d)...

What this means is that if the fourth tuple (with ?x already bound) is not in the triple store, that should not invalidate the possible bindings of '?a', '?b', and '?c'; instead, the unbound variable '?d' should be set to a null value, while the remaining bindings are still returned. Semantically, this can be considered a special join; first a traditional query is performed:

SELECT ?a,?b,?c
WHERE (?a P ?x)      
      (Q ?b ?a)      
      (?x R ?c)

then, for each possible binding, a second query is performed:

SELECT ?d
WHERE (X S ?d)

where X stands for a possible binding of ?x. The Python translation of this facility is simple:

select = ("?a","?b","?c","?d")
where  = [("?a",P,"?x"),(Q,"?b","?a"),("?x",R,"?c")]
opt    = [("?x",S,"?d")]
result = myTripleStore.sparqlQuery(select,where,optional=opt)

and the (possibly) unbound "?d" is set to None in the return value. The optional list can be nested just like the where pattern.

The draft actually allows for multiple OPTIONAL blocks. This aspect has not been implemented (yet).
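The join-with-None semantics described above can be sketched in a few lines of plain Python (hypothetical helper names; optionalQuery stands for running the second query against one binding of the first):

```python
# Sketch of the 'optional' join: for each binding of the regular query, run
# the optional query; if it yields nothing, pad the result with None instead
# of discarding the binding.
def joinOptional(regularBindings, optionalQuery):
    results = []
    for binding in regularBindings:
        extra = optionalQuery(binding)
        if extra:
            for e in extra:
                results.append(binding + (e,))
        else:
            results.append(binding + (None,))
    return results
```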

Query Forms

The SPARQL draft includes several 'Query forms', a term covering how the query results are returned to the caller. In the case of rdflibUtils this is implemented via a separate Python class, called SPARQL_Results. All query results yield, in fact, an instance of that class, and various methods are defined corresponding to the SPARQL Query Forms. The myTripleStore.sparqlObject method can be invoked instead of myTripleStore.sparqlQuery to return an instance of such an object. (In fact, the myTripleStore.sparqlQuery method, used in all previous examples, is simply a convenient shorthand, see below.)

SELECT Forms

The SELECT SPARQL query forms are used to retrieve the query results. Corresponding to the draft, SPARQL_Results has a select method with two keyword arguments: distinct (with possible values True and False) and limit (either a positive integer or None). For example:

select = ("?a","?b","?c","?d")
where  = [("?a",P,"?x"),(Q,"?b","?a"),("?x",R,"?c")]
opt    = [("?x",S,"?d")]
resultObject = myTripleStore.sparqlObject(select,where,optional=opt)
result       = resultObject.select(distinct=True,limit=5)

returns the first 5 query results, all distinct from one another. The default for distinct is True, and the default for limit is None. I.e., myTripleStore.sparqlQuery is, in fact, a shorthand for myTripleStore.sparqlObject(select,where,...).select() (and that is exactly how it is implemented in myTripleStore; it is probably the most widespread use of select).
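The effect of the two keyword arguments can be sketched as a simple post-processing of the raw result list (an illustrative helper, not the actual implementation):

```python
# Illustrative post-processing corresponding to select(distinct=..., limit=...):
# filter out duplicates first, then truncate to at most 'limit' entries.
def applySelectOptions(results, distinct=True, limit=None):
    if distinct:
        seen = []
        for r in results:
            if r not in seen:
                seen.append(r)
        results = seen
    if limit is not None:
        results = results[:limit]
    return results
```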

CONSTRUCT Forms

The construct method returns a new instance of myTripleStore that includes all triples obtained by substituting the variable bindings into the graph template. (The best way to describe this is to refer to an old note of Guha & Co.) Note that it makes sense to define a query with an empty select; this form yields an appropriate subgraph.

I am a little bit confused by what the draft really means by this form, so my interpretation might not be correct... The SPARQL draft refers to a CONSTRUCT clause in the query. It looks to me as if it were identical to adding this pattern to the WHERE clause, defining a select with the corresponding variables, and doing a traditional query. The CONSTRUCT * case seems to correspond to the whole process running without a SELECT clause.

DESCRIBE Forms

The current draft is pretty vague as to what this facility is (and leaves it to the implementor). What myTripleStore implements is a form of clustering. The SPARQL_Results.describe() method has two keyword arguments, forward and backward, each a boolean: if forward is set, the triples in which the result resources appear as subjects (and, transitively, the triples reachable from their objects) are included in the returned graph; if backward is set, the triples in which they appear as objects are collected similarly, in the opposite direction.
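As a minimal self-contained sketch of such clustering (plain Python tuples instead of rdflib resources; the function name is illustrative, not the actual describe() implementation):

```python
# Transitive clustering around a seed resource: 'forward' follows triples
# where the current node is the subject, 'backward' follows triples where
# it is the object.
def cluster(triples, seed, forward=True, backward=False):
    result, queue, seen = set(), [seed], set()
    while queue:
        node = queue.pop()
        if node in seen:
            continue
        seen.add(node)
        for (s, p, o) in triples:
            if forward and s == node:
                result.add((s, p, o))
                queue.append(o)
            if backward and o == node:
                result.add((s, p, o))
                queue.append(s)
    return result
```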

Asking 'yes' or 'no'

Just for the sake of completeness: the SPARQL draft refers to an ASK query form. This is part of the basic functionality of the RDFLib TripleStore, so it has not been reproduced here. (At some point a unified interface through RDFLib might be necessary, though.)

Implementation

The implementation of SPARQL is based on an expansion tree. Each layer in the tree takes care of one statement in the WHERE clause: the first statement for the first layer, the second statement for the second layer, etc. Once the full expansion is done, the results for SELECT are collected by visiting the leaves. In more detail:

The root of the tree is created by handing over the full list of statements and a dictionary with the variable bindings. Initially, this dictionary looks like { "?x": None, "?a" : None, ... }. The node picks the first tuple in the where clause, replaces all unbound variables by None, and makes a traditional query to the triple store. The results are all triples in the triple store that conform to the pattern expressed by the first where tuple. For each of those a child node is created, handing over the rest of the where clause and a binding in which some of the None values are replaced by 'real' RDF resources. The children follow the same process recursively. There are several ways for the recursion to stop:

  1. Though there is still a where pattern to handle, no triples are found in the triple store in the process. This means that the corresponding branch does not produce valid results. (In the implementation, such a node is marked as 'clashed'.) The same happens if, though a triple is found, it is rejected by the constraint function assigned to that pattern (the "per-pattern" constraint).
  2. Though there are no statements left to process in the where clause, there are still unbound variables.
  3. All variables are bound and there are no more patterns to process. Unless one of the global constraints rejects the binding, this yields 'successful' leaves.

The third type of leaf contains a valid, possible query result for the unbound variables. Once the expansion is complete, collecting the results becomes obvious: successful leaves are visited to return their results as bindings for the variables appearing in the select clause; the non-leaf nodes simply collect and combine the results of their children.
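The expansion described above can be condensed into a small self-contained sketch, matching the where patterns one per recursion level against a plain set of triples (illustrative code under simplifying assumptions: strings starting with '?' are variables, constraints and the triple-store query are omitted):

```python
# Recursive expansion: each call handles the first remaining pattern, creating
# one 'child' per matching triple with an extended binding; exhausting the
# patterns yields a successful leaf, a mismatch prunes the branch.
def expand(triples, patterns, binding):
    if not patterns:
        return [binding]            # successful leaf: all patterns matched
    head, rest = patterns[0], patterns[1:]
    results = []
    for triple in triples:
        newBinding = dict(binding)
        for pat, val in zip(head, triple):
            if isinstance(pat, str) and pat.startswith("?"):
                if newBinding.get(pat) is None:
                    newBinding[pat] = val      # bind the free variable
                elif newBinding[pat] != val:
                    break                      # clash with earlier binding
            elif pat != val:
                break                          # constant does not match
        else:
            results.extend(expand(triples, rest, newBinding))
    return results
```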

The implementation of the 'optional' feature follows the semantic description. A pre-processing step separates the 'regular' and 'optional' select and where clauses. Then a regular expansion is done; finally, a separate expansion is attached to each successful leaf node (obviously, binding all variables that can be bound at that level). The collection of the results follows the same mechanism, except that if the optional expansion tree yields no results, the real result tuples are padded with the necessary number of None-s.


Iván Herman, $Date: 2004/10/23 16:48:36 $, ivan@ivan-herman.net

Creative Commons License
This work is licensed under a Creative Commons License.