This is a short overview of the query facilities added on top of RDFLib, a Python toolkit to manage RDF triples. These are based on the October 2004 version of the SPARQL draft worked on at the W3C. SPARQL is based on the RDQL facilities offered by more powerful RDF toolkits like Jena; however, RDQL is not available (to my knowledge) on top of RDFLib.
The two files necessary for running it are rdflibUtils.py and sparql.py. rdflibUtils.py defines a subclass of RDFLib's TripleStore that includes, among some other things, the SPARQL facilities. (The SPARQL implementation itself is in sparql.py; it has been put into a separate file for purely practical purposes, essentially for easier maintenance.)
Disclaimer: I do not think any of the methods, specifications, or algorithms in this implementation are new; they have probably been found and implemented a million times by others. I just came up with something that worked without making an extensive literature search...
Disclaimer II: the SPARQL draft describes its facilities in terms of a query language. A full-blown SPARQL implementation would (I believe) include a parser of that language that would map onto the underlying implementation. This implementation does not include such a parser, only the underlying SPARQL engine. The description below shows how the mapping works, but users of rdflibUtils should call the corresponding methods directly.
The basic SPARQL construct is as follows (using the query-like syntax of the draft):
SELECT ?a,?b,?c WHERE (?a P ?x) (Q ?b ?a) (?x R ?c) ...
The meaning of this construction is simple: the '?a', '?b', etc. symbols (referred to as 'unbound' symbols) are queried with the constraint that the tuples listed in the WHERE clause are 'true', i.e., part of the triple store. This functionality is translated into a Python method as:
select = ("?a","?b","?c") where = [("?a",P,"?x"),(Q,"?b","?a"),("?x",R,"?c")] result = myTripleStore.sparqlQuery(select,where)
where result is a list of tuples, each giving possible binding combinations for "?a", "?b", and "?c", respectively. P, Q, R, etc., must be the rdflib incarnations of RDF resources, i.e., URIRef-s, Literal-s, etc. As a convenience to the user, if one of these entries in the tuples is a simple string (i.e., not a query string starting with '?'), it will be converted to a Literal on the fly. This allows coding in the form:
select = ("?a","?b","?c") where = [("?a",P,"?x"),(Q,"?b","?a"),("?x",R,"?c"),("?x",S,"Some Literal Here")] result = myTripleStore.sparqlQuery(select,where)
Note that the current SPARQL draft does not include this facility, but it does include a similar facility for integers. It also includes a remark whereby more such types will be defined in the future (and I would expect strings to be one of them). On the other hand, rdflib does not yet offer full control of datatypes (e.g., datatypes are not taken into account when comparing literals), so I have not implemented other datatypes yet.
As a further convenience to the user, if select consists of one single entry, it is not necessary to use a tuple; just giving the string value will do. Similarly, if where consists of one single tuple, the array construction may be skipped, and the single tuple is accepted as an argument. Finally, if select consists of one entry, result is a list of the values rather than tuples of (single) values.
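These conveniences amount to a small input-normalization step. The sketch below is hypothetical (the helper names are mine, not part of rdflibUtils), but it illustrates the accepted input shapes:

```python
def normalizeSelect(select):
    # a single "?x" string is accepted instead of a one-element tuple
    if isinstance(select, str):
        return (select,)
    return tuple(select)

def normalizeWhere(where):
    # a single pattern tuple is accepted instead of a list of tuples
    if isinstance(where, tuple):
        return [where]
    return list(where)
```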
The rdflibUtils.SPARQLError exception is raised (beyond the possible exceptions raised by rdflib) if there are inconsistencies in the select or where clauses (e.g., an unbound variable in select does not appear in the where clause).
SPARQL also makes it possible to constrain values through operators, like:
SELECT ?a,?b,?c WHERE (?a P ?x) (Q ?b ?a) (?x R ?c) AND ?x < 10 ...
The draft also refers to the fact that application-specific functions can be used in the 'AND' part. There are two ways to translate this feature into Python (see below for a further discussion): into a global and into a per-pattern constraint.
This version is based on constraints that refer to the whole binding of the pattern and is therefore executed against the full binding. Here is how it looks in Python:
select = ("?a","?b","?c") where = [("?a",P,"?x"),(Q,"?b","?a"),("?x",R,"?c"),("?x",S,"Some Literal Here")] funcs = [func1,func2,...] result = myTripleStore.sparqlQuery(select,where,constraints=funcs)
where constraints is a list of functions (note the keyword argument). Each function is of the form:
def func1(bindingDir):
    ....
    return True # or False
where bindingDir is a dictionary of the possible binding, i.e., of the form {"?a": Resource1, "?b": Resource2, ...}.
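For instance, a global constraint that keeps only those bindings where the value bound to "?x" is below 10 might look like this (a sketch: for illustration, the binding values here are plain strings rather than rdflib resources):

```python
def func1(bindingDir):
    # bindingDir maps variable names to their bound values,
    # e.g. {"?a": ..., "?x": ...}; accept the binding only if ?x < 10
    return int(bindingDir["?x"]) < 10
```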
Just as in the case of the select parameter, if only one constraint method is used, the usage of a list may be bypassed, i.e.:
select = ("?a","?b","?c") where = [("?a",P,"?x"),(Q,"?b","?a"),("?x",R,"?c"),("?x",S,"Some Literal Here")] result = myTripleStore.sparqlQuery(select,where,constraints=func)
is also valid.
This version is based on a constraint that can be imposed on one specific (bound) pattern only. In Python, this is translated by adding a fourth element to the tuple representing the pattern, e.g.:
select = ("?a","?b","?c") where = [("?a",P,"?x",func),(Q,"?b","?a"),("?x",R,"?c"),("?x",S,"Some Literal Here")] result = myTripleStore.sparqlQuery(select,where)
where func is a function with three arguments (the bound versions of "?a", P, and "?x" in the example).
The global constraint is clearly a superset of the per-pattern constraint. The original comparison method can therefore be expressed in two different ways:
select = ("?a","?b","?c") where = [("?a",P,"?x"),(Q,"?b","?a"),("?x",R,"?c"),("?x",S,"Some Literal Here")] result = myTripleStore.sparqlQuery(select,where, constraints=lambda dir: int(dir["?x"]) < 10)
or:
select = ("?a","?b","?c") where = [("?a",P,"?x",lambda a,P,x: int(x) < 10), (Q,"?b","?a"),("?x",R,"?c"),("?x",S,"Some Literal Here")] result = myTripleStore.sparqlQuery(select,where)
However, the second version may be much more efficient. The search is 'cut' by the constraint, i.e., the binding tree is not (unnecessarily) expanded, whereas the full binding tree must be generated for a global constraint (see the notes on the implementation below). For large triple stores and/or large patterns this may make a significant difference.
Which of the two versions is really required is not entirely clear from the SPARQL draft. It may well be that only the per-pattern constraint will be retained, in which case the global constraint is an add-on feature. If a global constraint is required, a parser of the SPARQL notation may optimize by calling the per-pattern constraint. (But not in all cases: although it is doable for obvious operators, the draft refers to general application functions as constraints, too, in which case this optimization is not possible.)
A slight variation of the basic scheme could be described as:
SELECT ?a,?b,?c WHERE (?a P ?x) {(Q ?b ?a) (S ?b ?a)} (?x R ?c) ...
meaning a logical 'or' on one of the clauses. This construction is, in fact, just a convenience; semantically, it is the concatenation of two queries:
SELECT ?a,?b,?c WHERE (?a P ?x) (Q ?b ?a) (?x R ?c) ...
and
SELECT ?a,?b,?c WHERE (?a P ?x) (S ?b ?a) (?x R ?c) ...
i.e., it is just syntactic sugar, albeit a very convenient one. This is translated into Python as follows:
select = ("?a","?b","?c") where = [("?a",P,"?x"),[(Q,"?b","?a"),(S,"?b","?a")],("?x",R,"?c")] result = myTripleStore.sparqlQuery(select,where)
It is not clear from the SPARQL draft whether nesting is meant to be of arbitrary depth. This implementation implements one level of nesting only.
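Since the nested list is just syntactic sugar for concatenated queries, the expansion can be sketched as follows (this is my illustration of the semantics, not the actual library code):

```python
def expandAlternatives(where):
    # turn a where clause with one level of 'or' lists into the
    # equivalent list of flat where clauses
    queries = [[]]
    for item in where:
        # a nested list stands for a set of alternative patterns
        alternatives = item if isinstance(item, list) else [item]
        queries = [q + [alt] for q in queries for alt in alternatives]
    return queries
```

Running each resulting flat where clause and concatenating the results yields the same bindings as the nested form.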
Another variation on the basic query is the usage of 'optional' clauses:
SELECT ?a,?b,?c,?d WHERE (?a P ?x) (Q ?b ?a) (?x R ?c) OPTIONAL (?x S ?d)...
What this means is that if the fourth tuple (with ?x already bound) is not in the triple store, that should not invalidate the possible bindings of '?a', '?b', and '?c'; instead, the '?d' unbound variable should be set to a null value, but the remaining bindings should be returned. Semantically, this can be considered as a special join; first a traditional query is performed:
SELECT ?a,?b,?c WHERE (?a P ?x) (Q ?b ?a) (?x R ?c)
then, for each possible binding, a second query is done:
SELECT ?d WHERE (X S ?d)
where X stands for a possible binding of ?x. The Python translation of this facility is simple:
select = ("?a","?b","?c","?d") where = [("?a",P,"?x"),[(Q,"?b","?a"),("?x",R,"?c")] opt = [("?x",S,"?d")] result = myTripleStore.sparqlQuery(select,where,optional=opt)
and the (possibly) unbound "?d" is set to None in the return value. The optional array can be nested just like the where pattern.
The draft, actually, allows for multiple OPTIONAL blocks. This aspect has not been implemented (yet).
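The join-then-pad semantics described above can be sketched in plain Python (a hypothetical helper; optionalMatches stands for the secondary query run with the already-bound values):

```python
def joinOptional(bindings, optionalMatches, optionalVars):
    # bindings: results of the regular query, as binding dictionaries
    # optionalMatches(b): bindings for the optional variables, given b
    results = []
    for b in bindings:
        extensions = optionalMatches(b)
        if extensions:
            for e in extensions:
                results.append({**b, **e})
        else:
            # no optional match: pad the optional variables with None
            results.append({**b, **{v: None for v in optionalVars}})
    return results
```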
The SPARQL draft includes several 'Query forms', a term used to control how the query results are returned to the caller. In the case of rdflibUtils this is implemented via a separate Python class, called SPARQL_Results. All query results yield, in fact, an instance of that class, and various methods are defined corresponding to the SPARQL Query Forms.
The myTripleStore.sparqlObject method can be invoked instead of myTripleStore.sparqlQuery to return an instance of such an object. (In fact, the myTripleStore.sparqlQuery method, used in all previous examples, is simply a convenient shorthand; see below.)
The SELECT SPARQL query forms are used to retrieve the query results. Corresponding to the draft, SPARQL_Results has a select method, with two (keyword) arguments: distinct (with possible values True and False) and limit (which is either a positive integer or None). For example:
select = ("?a","?b","?c","?d") where = [("?a",P,"?x"),[(Q,"?b","?a"),("?x",R,"?c")] opt = [("?x",S,"?d")] resultObject = myTripleStore.sparqlObject(select,where,optional=opt) result = resultObject.select(distinct=True,limit=5)
returns the first 5 query results, all distinct from one another. The default for distinct is True and the default for limit is None. I.e., myTripleStore.sparqlQuery is, in fact, a shorthand for myTripleStore.sparqlObject(select,where,...).select() (and that is exactly how it is implemented in myTripleStore; it is probably the most widespread use of select).
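The behaviour of the two keyword arguments can be sketched as follows (an illustration of the semantics only, not the library's code):

```python
def selectResults(rows, distinct=True, limit=None):
    # rows: the raw query results, e.g. a list of binding tuples
    if distinct:
        seen, unique = set(), []
        for row in rows:
            if row not in seen:
                seen.add(row)
                unique.append(row)
        rows = unique
    # limit=None means 'return everything'
    return rows if limit is None else rows[:limit]
```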
The construct method returns a new instance of myTripleStore that includes all the triples obtained by substituting the variable bindings of the query into the graph template. (The best way to describe this is to refer to an old note of Guha & Co.) Note that it makes sense to define a query with an empty select; using this form one yields an appropriate subgraph. I am a little bit confused by what the draft really means for this form; my interpretation might not be correct...
The SPARQL draft refers to a CONSTRUCT clause in the query. It looks to me that this is identical to adding the pattern to the WHERE clause, defining a select with the corresponding variables, and doing a traditional query. The CONSTRUCT * case seems to correspond to the whole process running without a SELECT clause.
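My reading of the substitution step can be sketched like this (a hypothetical helper: a 'template' is a list of pattern tuples, and each binding dictionary is applied to it; the real method builds a triple store rather than a plain set):

```python
def constructTriples(template, bindings):
    # substitute every binding into the graph template; entries that
    # are not query variables are copied verbatim
    def substitute(term, binding):
        if isinstance(term, str) and term.startswith("?"):
            return binding.get(term)
        return term
    triples = set()
    for binding in bindings:
        for (s, p, o) in template:
            triples.add((substitute(s, binding),
                         substitute(p, binding),
                         substitute(o, binding)))
    return triples
```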
The current draft is pretty vague as to what this facility is (and leaves it to the implementor). What myTripleStore implements is a form of clustering. The SPARQL_Results.describe() method has two keyword arguments, forward and backward, each a boolean. What they mean:
- forward=True and backward=False: generates a triple store with a transitive closure for each result of the query: take, recursively, all the properties and objects that start at a specific resource.
- forward=False and backward=True: the same as forward but in the 'other direction'.
- forward=True and backward=True: combines the two into one triple store.
- forward=False and backward=False: returns an empty triple store.

Just for the sake of completeness: the SPARQL draft also refers to an ASK query form. This is part of the basic functionalities of the RDFLib TripleStore, so it has not been reproduced here. (At some point a unified interface through RDFLib might be necessary, though.)
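The forward/backward closures can be sketched with a simple worklist over the triples (my illustration of the clustering, not the actual implementation):

```python
def describeCluster(triples, seeds, forward=True, backward=False):
    # collect, recursively, the triples reachable from the seed
    # resources: subject-to-object when forward, object-to-subject
    # when backward; both flags False yields an empty store
    result, frontier = set(), set(seeds)
    while frontier:
        node = frontier.pop()
        for (s, p, o) in triples:
            if (s, p, o) in result:
                continue
            if forward and s == node:
                result.add((s, p, o))
                frontier.add(o)
            if backward and o == node:
                result.add((s, p, o))
                frontier.add(s)
    return result
```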
The implementation of SPARQL is based on an expansion tree. Each layer in the tree takes care of one statement in the WHERE clause: the first layer handles the first statement, the second layer the second statement, etc. Once the full expansion is done, the results for SELECT are collected by visiting the leaves. In more detail:
The root of the tree is created by handing over the full list of statements and a dictionary with the variable bindings. Initially, this dictionary looks like { "?x": None, "?a": None, ... }. The node picks the first tuple in the where, replaces all unbound variables by None, and makes a traditional query to the triple store. The results are all the tuples in the triple store that conform to the pattern expressed by the first where pattern tuple. For each of those a child node is created, by handing over the rest of the tuples in the where clause and a binding where some of the None values are replaced by 'real' RDF resources. The children follow the same process recursively. There are several ways for the recursion to stop:
- Though there are still where patterns to handle, no tuples are found in the triple store in the process. This means that the corresponding branch does not produce valid results. (In the implementation, such a node is marked as 'clashed'.) The same happens if, though a tuple is found, that tuple is rejected by the constraint function assigned to this tuple (the "per-tuple" constraint).
- Though there are no more patterns left in the where clause, there are still unbound variables.
The third type of leaf contains a valid, possible query result for the unbound variables. Once the expansion is complete, the collection of the results becomes obvious: successful leaves are visited to return their results as the bindings for the variables appearing in the select clause; the non-leaf nodes simply collect and combine the results of their children.
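The expansion can be illustrated by a small self-contained matcher (a simplification of the scheme described above: the real implementation queries the triple store and builds explicit tree nodes, whereas this sketch just recurses over a list of triples):

```python
def expand(triples, where, binding):
    # each recursion level handles one where pattern; reaching the end
    # of the pattern list yields one valid set of variable bindings
    if not where:
        return [binding]
    pattern, rest = where[0], where[1:]
    results = []
    for triple in triples:
        newBinding = dict(binding)
        for patternTerm, value in zip(pattern, triple):
            if isinstance(patternTerm, str) and patternTerm.startswith("?"):
                if newBinding[patternTerm] is None:
                    newBinding[patternTerm] = value   # bind the variable
                elif newBinding[patternTerm] != value:
                    break                             # clash with earlier binding
            elif patternTerm != value:
                break                                 # constant terms differ
        else:
            results.extend(expand(triples, rest, newBinding))
    return results
```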
The implementation of the 'optional' feature follows the semantic description. A pre-processing step separates the 'regular' and 'optional' select and where clauses. Then a regular expansion is done; finally, a separate expansion is attached to each successful leaf node (obviously, by binding all variables that can be bound at that level). The collection of the results follows the same mechanism, except that if the optional expansion tree yields no results, the real result tuples are padded by the necessary number of None-s.