Latest version: ilrt.org/discovery/2002/04/query/
Slides: ilrt.org/discovery/2002/04/query/Overview.html
RSS demo: sw1.ilrt.org/discovery/2002/04/rss/
The SquishQL/Inkling query work was part-funded by the Harmony project
Introduction
This paper is for the 'NetLab and friends' [NETLAB] conference, in the section on interoperability. 'NetLab and friends' is a celebration of ten years of NetLab, and of NetLab's contribution to technology on the web and in particular to the development of Digital Libraries. This paper is about RDF query, and specifically about how simple RDF query languages can help members of the Digital Libraries community use RDF data right now.
RDF is structured data. Instead of putting information in a Word document or a simple HTML file, or in 'vanilla' XML, you store your information in XML documents using a particular set of conventions for writing that XML, or in such a way that you can export the data to XML documents using those conventions (for example from a database or a spreadsheet, or even from certain forms of XHTML).
The important thing about RDF is the information model, which is a directed labelled graph: objects in the world are represented by nodes, and their properties by arcs that link the nodes together. Here is an example of the RDF for this paper, expressed as a graph:
Using the RDF set of conventions for exporting structured data has as its goal data-interoperability. RDF's node-arc-node model provides a minimal structure for interoperability, because it encodes directly information which is often implicit in other formats, such as 'vanilla' XML documents or IAFA templates.
The first hurdle with RDF is the question: why would you want to use RDF as your structured data format for your particular project? After all, there are a number of widely used, well tested and mature protocols and query and data formats in use in the Digital Library community already.
The answer is that if you have control and will always have control over your data you don't need RDF. You can use any data format and protocol you like: an XML format of your own choice, or a particular profile of Z39.50 or Whois++, for example. This applies whether the data is private to your organisation (such as internal company records or databases), or whether you can control the data that is provided for you (for example with the normalisation of data for the Renardus project). There is no one best way of storing and retrieving structured data.
However, in many cases you will not control the data you are working with. Distributed data is difficult to control, because you have to rely on others to provide it. Data is also difficult to control over time, as requirements for the data change over time. Another example is data from unexpected sources, which might be reusable, but probably isn't in your preferred format.
If you do not control the data you have to deal with, because you don't own it, or because it may change over time, or because you don't know what you might want to combine it with next, then using RDF and associated modeling strategies and tools ('Semantic Web technologies') is useful now. Part of the reason for this is that the RDF model uses certain principles for modelling data which help with interoperability. Three of these are discussed in the next section, section 1. Section 2 briefly describes a simple RDF query language. In Section 3 we describe two different possibilities for combining data from different sources using RDF and RDF query.
RDF is defined by its model, not its syntax. Using the RDF model for your data from the start can help save time if you ever want to combine the data with any other data, whether you decide to use the RDF syntax and an RDF database and query language or whether you use a relational database or Z39.50 to store and serve up the data. There are three basic modeling principles that are used to create RDF data:
The RDF model is a good sanity check for interoperable data. RDF uses nodes and connecting arcs to talk about objects and their properties, so that you start thinking in terms of objects and their properties rather than in terms of documents and their syntax. This improves extensibility and may help modelling style.
For example, suppose you have a ROADS SERVICE file format, which has the field 'Record-Created-Email'. An addition to the template requires a (small) change to the parser, so the format is not very easily extensible. Moreover, the underlying model is implicit in the format. 'Record-Created-Email' is a shortcut from the record to the person that created the record, and then to the email address of that person. This is just a different modelling style, but it also obscures some of the structure of the record that might help interoperability. 'Persons' often crop up in different sorts of records, but here the fact that a person created the record isn't clear. This is also a common problem with 'vanilla' XML: the emphasis is on the syntax of the document, not the underlying model. There is nothing wrong per se with this type of modelling, but it can limit extensibility and interoperability.
The RDF/XML syntax has been much criticised, but it is highly expressive and extensible, and there are now many tools that process it [JENA], [SIRPAC]. There are also some tools which allow you to create arbitrary RDF using a visual editor [RDFAUTHOR], [ISAVIZ]. You do not have to use RDF/XML syntax as the primary store of your data, but as an interchange format, it is very useful.
Or rather, use well-known, public URIs where possible: don't make them up if you can avoid it. In RDF nodes are either 'blank nodes' or are identified using URIs. RDF processors automatically assume that objects with the same URI are the same object.
In the case of people and certain documents it is not good modelling style to conflate the URI with the thing. A person is not their email address, and a document might have several URLs. In this case it is useful to use indirection. If the document does not have a single URL, then it can still have a dc:identifier pointing to a URL. A person can have a foaf:mbox pointing to their email address. RDF processors will not merge nodes identified indirectly, although schema annotations can be used to do this [SMUSH]. Sometimes it may seem simpler to forgo modelling accuracy for simplicity, but be careful - things can go wrong because RDF will assume that things with the same URI are the same object. For example, characteristics of a document might get confused with the characteristics of a particular instance of the document retrievable from a URL, if you name the document with the URL of a retrievable instance of it.
For interoperability purposes, it is most helpful to use well-known, public identifiers, otherwise you'll have to keep stating that 'urn:ilrtperson2356' and 'urn:rdfwebperson347687' are the same individual, and processing information from different sources will be slow and cumbersome.
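To make the 'smushing' idea above concrete, here is a toy sketch in Python of merging nodes that share the value of an identifying property such as foaf::mbox. This is illustrative only, not a real tool's API; triples are written (subject, predicate, object), and the data reuses the made-up URNs from the paragraph above.

```python
# A sketch (illustrative, not a real tool's API) of 'smushing': rewriting
# node identifiers so that nodes sharing the value of an identifying
# property such as foaf::mbox collapse into one node.

def smush(triples, key_predicate='foaf::mbox'):
    canonical = {}   # key property value -> canonical node id
    rename = {}      # node id -> canonical node id
    for s, p, o in triples:
        if p == key_predicate:
            rename[s] = canonical.setdefault(o, s)
    # Rewrite every subject and object through the rename table.
    return [(rename.get(s, s), p, rename.get(o, o)) for s, p, o in triples]

triples = [
    ('urn:ilrtperson2356', 'foaf::mbox', 'mailto:libby.miller@bristol.ac.uk'),
    ('urn:ilrtperson2356', 'foaf::name', 'Libby Miller'),
    ('urn:rdfwebperson347687', 'foaf::mbox', 'mailto:libby.miller@bristol.ac.uk'),
    ('_:paper', 'dc::creator', 'urn:rdfwebperson347687'),
]

merged = smush(triples)
# The paper's creator now points at the same node that carries the name.
print(('_:paper', 'dc::creator', 'urn:ilrtperson2356') in merged)
```

With well-known public identifiers this rewriting step is unnecessary, which is exactly the point.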
RDF Schemas describe the broad structure of types of RDF documents, such as what types of arcs can link what types of nodes. They can also include information about class hierarchies: you can build elaborate links describing subclasses of objects.
So for example you could use Dublin Core to represent information about webpages and other documents; the experimental foaf vocabulary or vCard in RDF for people, addresses, relationships. Reusing vocabularies can be difficult if parts of a schema are similar to what you want, but do not quite represent it exactly. RDF has subClassOf and subPropertyOf relationships to accommodate these similarities interoperably. These are not currently processed in very many tools but are useful for fixing the meaning of classes and properties precisely, and will likely be used in the future.
RDF does not require schemas: you can do a great deal without using them at all. However for interoperability it is helpful to describe what you meant your RDF to represent in machine-readable and human-readable form. This enables people to reinterpret your data more accurately.
Once you have used these methods in your modeling, you have various options for actually storing your data. One option is to use RDF/XML documents as the primary store of your data and then use an RDF processor and a pure RDF database (one designed to store RDF). However, you do not have to use a 'pure' RDF database for storing the information, and in fact there are significant overheads to doing so, in particular the immaturity of the technology and the specialist knowledge that will probably be required. Another option is to store your data in (say) a relational database and export it to RDF/XML files.
Having said that, RDF databases are extremely flexible, and if your data structure changes rapidly you may find them very useful. For similar reasons, an RDF database can provide an interim solution for experimenting with combining different types of data until you settle on a good optimised structure in a well-understood database. The rest of this paper provides some examples of using mixed data with an RDF database that understands an RDF query language.
If you do choose an RDF database, you need a way of accessing information that mirrors the flexibility of the RDF information model. RDF query languages such as SquishQL [INKLING] (which I've been working on) all query the RDF information model directly, and do not care about how the information is stored.
For example, the query below says:
"find me the name of the person whose email address is libby.miller@bristol.ac.uk, and also find me the title and identifier of anything that she has created"
select ?name, ?title, ?identifier
where
  (dc::title ?paper ?title)
  (dc::creator ?paper ?creator)
  (dc::identifier ?paper ?identifier)
  (foaf::name ?creator ?name)
  (foaf::mbox ?creator mailto:libby.miller@bristol.ac.uk)
using
  dc for purl.org/dc/elements/1.1/
  foaf for xmlns.com/foaf/0.1/
Simple RDF queries such as this one simply try to match parts of the RDF graph held in the database. Because of this, you can describe the query itself as a graph, as below:
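The matching this involves can be sketched in a few lines of Python. This is a toy illustration, not the Inkling implementation; the data and names are made up, and triples are written (predicate, subject, object) to mirror the SquishQL patterns:

```python
# A toy sketch of SquishQL-style triple pattern matching (not the Inkling
# implementation). Triples and patterns are (predicate, subject, object)
# tuples; strings starting with '?' are variables.

def match(patterns, triples, binding=None):
    """Yield variable bindings that satisfy every pattern."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        new = dict(binding)
        ok = True
        for pat, val in zip(first, triple):
            if pat.startswith('?'):            # variable: bind, or check old binding
                if new.get(pat, val) != val:
                    ok = False
                    break
                new[pat] = val
            elif pat != val:                   # constant: must match exactly
                ok = False
                break
        if ok:
            yield from match(rest, triples, new)

triples = [
    ('dc::title', '_:paper', 'RDF Query by Example'),
    ('dc::creator', '_:paper', '_:libby'),
    ('foaf::name', '_:libby', 'Libby Miller'),
    ('foaf::mbox', '_:libby', 'mailto:libby.miller@bristol.ac.uk'),
]

query = [
    ('dc::title', '?paper', '?title'),
    ('dc::creator', '?paper', '?creator'),
    ('foaf::name', '?creator', '?name'),
    ('foaf::mbox', '?creator', 'mailto:libby.miller@bristol.ac.uk'),
]

for b in match(query, triples):
    print(b['?name'], '-', b['?title'])   # Libby Miller - RDF Query by Example
```

Each pattern either narrows the set of candidate bindings or extends it with new variables; the query succeeds wherever all the patterns can be matched at once.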
The query describes the pattern of the information we want to match, not the way it is stored in the database. If we find some more data about reviews of papers, we can add this to our database and query it without redesigning the database structure, and by rewriting our query slightly:
select ?name, ?title, ?identifier, ?content
where
  (dc::title ?paper ?title)
  (dc::creator ?paper ?creator)
  (dc::identifier ?paper ?identifier)
  (foaf::name ?creator ?name)
  (foaf::mbox ?creator mailto:libby.miller@bristol.ac.uk)
  (foaf::review ?paper ?review)
  (foaf::content ?review ?content)
using
  dc for purl.org/dc/elements/1.1/
  foaf for xmlns.com/foaf/0.1/
This differs from the relational model where you have to know the structure of tables before you can make the query. The relational model ties the data strongly to the way it is stored, which means that changes to the structure of the data require changes to the structure of the database, as well as changes to the query.
In contrast when new information is added to a pure RDF database, the structure of the database does not change in a relational sense. RDF data is semi-structured and does not rely on the presence of a schema for storage or query. This makes pure RDF databases very useful for prototyping and for other fast-moving data environments.
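This difference can be seen even inside a relational engine. A sketch (using sqlite3 purely for illustration; table and data names are made up): one generic (subject, predicate, object) table holds any RDF statement, so the review data is just extra rows, with no schema migration.

```python
# A sketch of why new kinds of data need no schema change in a triple
# store: one generic (subject, predicate, object) table holds any
# statement. sqlite3 is used purely for illustration.
import sqlite3

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)')

# The original paper description...
db.executemany('INSERT INTO triples VALUES (?, ?, ?)', [
    ('_:paper', 'dc::title', 'RDF Query by Example'),
    ('_:paper', 'dc::creator', '_:libby'),
])

# ...and review data that arrived later: no ALTER TABLE required, unlike
# a dedicated 'papers' table that lacked a review column.
db.executemany('INSERT INTO triples VALUES (?, ?, ?)', [
    ('_:paper', 'foaf::review', '_:review1'),
    ('_:review1', 'foaf::content', 'A fine paper'),
])

# Self-joins on the triples table play the role of the extra query patterns.
rows = db.execute("""
    SELECT t.object, c.object
      FROM triples t
      JOIN triples r ON r.subject = t.subject AND r.predicate = 'foaf::review'
      JOIN triples c ON c.subject = r.object AND c.predicate = 'foaf::content'
     WHERE t.predicate = 'dc::title'
""").fetchall()
print(rows)   # [('RDF Query by Example', 'A fine paper')]
```

The cost of this flexibility is the growing pile of self-joins, which is one reason pure RDF databases suit prototyping better than heavily optimised production queries.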
RDF gets you thinking in terms of combining information, and how to make it (relatively) easy. You might start thinking...
So, let's find out more about the editors at SOSIG. You can already get to a brief description of the editor of a section: why not combine this information with papers they have authored, and with their current SOSIG-related bookmarks list? Technically this is fairly simple.
Using Dublin Core and a simple bookmarks schema all these formats can be converted to RDF/XML files. They can then be harvested into an RDF database and queried (see below for various options for storing and querying RDF data). An example query might look something like this:
"find me all the bookmarks from the bookmarks file of the person with email address 'emma.place@bristol.ac.uk' which are dated more recently than 1st April 2002".
select ?bookmark, ?title, ?date
where
  (foaf::mbox ?person mailto:emma.place@bristol.ac.uk)
  (bm::bookmarkFile ?person ?file)
  (bm::bookmark ?file ?bookmark)
  (dc::title ?bookmark ?title)
  (dc::date ?bookmark ?date)
and ?date > 2002-04-01
using
  dc for purl.org/dc/elements/1.1/
  foaf for xmlns.com/foaf/0.1/
  bm for example.com/bookmarks/
The results of such a query might look something like this:
bookmark                   | title           | date
www.psychology.ltsn.ac.uk/ | LTSN Psychology | 2002-04-02
www.bids.ac.uk/            | BIDS            | 2002-04-04
The social constraints on using these pieces of information together might be more limiting than the technical aspects. Although many people make their bookmarks available publicly on the web, some would not dream of doing so. People might object to such a swathe of information being made available about them.
What benefit might something like this give SOSIG? It gets people thinking about different kinds of information that might improve the service. Let's say that the editors aren't happy with everyone being able to see their bookmarks, but would be happy for other editors to see them. Then they could see what other subject editors were working on, check for duplicates, and get useful ideas for where to go next. Or suppose they're not happy for people to see their bookmarks, but would find a blogging tool helpful in the cataloging process, the output of which they would be happy to share.
The user of the service now has an interesting way of checking the credentials of the subject editors. The service does not just say: "trust us because we are trustworthy, non-profit making and have a long history of providing good resources", but "trust us because of the credentials of the named individuals who find and catalogue resources for us."
Finally, combining this information together might make technical people think about how they might generalise this cataloging model to a more inclusive annotating model, and how this might be managed, for example:
"select annotations by people that Emma knows of professionally"
select ?annotation, ?content
where
  (ann::annotates ?annotation ?file)
  (ann::content ?annotation ?content)
  (dc::creator ?annotation ?knownPerson)
  (foaf::mbox ?person mailto:emma.place@bristol.ac.uk)
  (foaf::knowsOfProfessionally ?person ?knownPerson)
using
  dc for purl.org/dc/elements/1.1/
  foaf for xmlns.com/foaf/0.1/
  ann for example.com/annotations/
For interoperability between organisations, simple is often better, especially if there are many organisations involved.
RSS [RSS] is a well-known syndication format expressed in RDF/XML, originally designed to syndicate news stories. It's very simple indeed, consisting essentially of a list of links with titles and descriptions, and a container to hold them. As of RSS 1.0, simple RSS files can be extended with modules for a particular purpose, for example with a set of Dublin Core elements to describe webpages in more detail.
At LTSN Economics [LTSN-ECON], Martin Poulter has been experimenting with the RSS 1.0 events module [RSS-EVENTS] to describe conference information [LTSN-EVENTS]. The events module adds a start date, an end date, a location and an organiser to the standard RSS item. Other LTSN centres have also been producing ordinary RSS 1.0 feeds [LTSN-FEEDS], as have various organisations in the subject gateway community, spurred on by UKOLN's RSSXpress [RSSXPRESS].
SOSIG's Grapevine service [GRAPE] has a personalisation feature that can display feeds described as RSS 1.0. But we can do even more interesting things using query. Let's suppose we have a list of feeds such as that on UKOLN's RSSXpress page. Then we produce a piece of RDF describing those feeds, for example by classifying them according to their subject within the SOSIG classification system. Then we can load all the feeds into an RDF database, load in our little RDF description file about the feeds, and with just a couple of queries we have a 'portal'.
First we ask: what feed would you like?
select ?feedUrl, ?title
where
  (dc::subject ?feedUrl ?subject)
  (rss::title ?feedUrl ?title)
and ?subject ~ "economics"
using
  rss for purl.org/rss/1.0/
  dc for purl.org/dc/elements/1.1/
results
feedUrl                                     | title
chewbacca.ilrt.bris.ac.uk/events/events.xml | LTSN Economics events
www.cepr.org/aboutcepr/cepr.rss             | Centre for Economic Policy Research
www.bized.ac.uk/homeinfo/whatsnew.htm       | Biz/ed What's New
Then we create each feed using another query, which asks for the items, their URLs and titles:
select ?item, ?ti, ?li
where
  (rss::items chewbacca.ilrt.bris.ac.uk/events/events.xml ?seq)
  (?contains ?seq ?item)
  (rss::title ?item ?ti)
  (rss::link ?item ?li)
using
  rss for purl.org/rss/1.0/
And with a little html formatting, we have a portal!
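That 'little html formatting' can be as simple as a loop over the query results. A sketch in Python (the result rows here are illustrative, standing in for real query output):

```python
# A sketch of the formatting step: turning (item, title, link) query
# results into an HTML list for the portal page. The rows are
# illustrative, standing in for real query output.
import html

results = [
    ('item1', 'Bridging the Divide - Strategies for Change',
     'crm.hct.ac.ae/tend2002/'),
    ('item2', 'SEA Annual Conference', 'www.scoteconsoc.org/ses2002.html'),
]

items = '\n'.join(
    '  <li><a href="http://%s">%s</a></li>' % (link, html.escape(title))
    for _, title, link in results
)
page = '<ul>\n%s\n</ul>' % items
print(page)
```

Escaping the titles matters, since feed data arrives from sources we do not control.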
If we have a channel picker like that on SOSIG Grapevine, we could store people's selections of feeds as an RDF file, and then pull them out again next time using a query like this one:
select ?feedUrl, ?title
where
  (sosig::profile ?person ?profile)
  (sosig::channel ?profile ?feedUrl)
  (rss::title ?feedUrl ?title)
  (foaf::mbox ?person mailto:libby.miller@bristol.ac.uk)
using
  rss for purl.org/rss/1.0/
  foaf for xmlns.com/foaf/0.1/
  sosig for www.sosig.ac.uk/schemas/profiles/
Something as simple as a Perl regex can be used to parse RSS files with great effect. The advantage of using RDF query to do it is that it makes the query and display of RSS extensions or modules very simple. For example, a query of the basic RSS form of the LTSN event feed looks just like the query for any other feed:
select ?item, ?title, ?link
where
  (rss::items chewbacca.ilrt.bris.ac.uk/events/events.xml ?seq)
  (?anyPredicate ?seq ?item)
  (rss::title ?item ?title)
  (rss::link ?item ?link)
using
  rss for purl.org/rss/1.0/
Adding the events information makes the query look like this:
select ?title, ?link, ?start, ?end, ?location
where
  (rss::items chewbacca.ilrt.bris.ac.uk/events/events.xml ?seq)
  (?anyPredicate ?seq ?item)
  (rss::title ?item ?title)
  (rss::link ?item ?link)
  (ev::startdate ?item ?start)
  (ev::enddate ?item ?end)
  (ev::location ?item ?location)
using
  rss for purl.org/rss/1.0/
  ev for purl.org/rss/1.0/modules/event/
giving us results that look like this:
link                             | title                                       | start      | end        | location
crm.hct.ac.ae/tend2002/          | Bridging the Divide - Strategies for Change | 2002-04-07 | 2002-04-09 | Dubai, United Arab Emirates
www.scoteconsoc.org/ses2002.html | SEA Annual Conference                       | 2002-04-11 | 2002-04-12 | Dundee, UK
We could even start limiting the scope of our searches and combining various feeds, for example:
"find me all the events starting in April 2002 from all feeds which can be picked out using the search term economics"
select ?item, ?title, ?link, ?start, ?end, ?location
where
  (dc::subject ?feedUrl ?subject)
  (rss::items ?feedUrl ?seq)
  (?anyPredicate ?seq ?item)
  (rss::title ?item ?title)
  (rss::link ?item ?link)
  (ev::startdate ?item ?start)
  (ev::enddate ?item ?end)
  (ev::location ?item ?location)
and ?subject ~ "economics"
and ?start ~ "2002-04"
using
  rss for purl.org/rss/1.0/
  ev for purl.org/rss/1.0/modules/event/
  dc for purl.org/dc/elements/1.1/
Simple RDF queries like the SquishQL examples here are not particularly powerful: they can't do 'OR' queries, for example. But they can query flexible data in a fairly easy-to-understand fashion.
You may want to optimise your database for certain queries, once you have experimented with combining RDF data. One way of doing this is to collect RDF information in one RDF database, and then use RDF query to pick out the parts you want to optimise for.
For example, suppose we had a number of interesting economics RSS events feeds available. A query such as
select ?item, ?title, ?link, ?start, ?end, ?location
where
  (dc::subject ?feedUrl ?subject)
  (rss::items ?feedUrl ?seq)
  (?anyPredicate ?seq ?item)
  (rss::title ?item ?title)
  (rss::link ?item ?link)
  (ev::startdate ?item ?start)
  (ev::enddate ?item ?end)
  (ev::location ?item ?location)
and ?subject ~ "economics"
and ?start ~ "2002-04"
using
  rss for purl.org/rss/1.0/
  ev for purl.org/rss/1.0/modules/event/
  dc for purl.org/dc/elements/1.1/
(repeated from above) gives us the raw ingredients of a simple flat relational database structure, with one table:
id | link | title | start | end | location |
which we can then query using SQL to create a nice html list of conferences for our economics users (or perhaps our own RSS feed). In this way, RDF databases and query can be an intermediate step, helping us to gather and organise diverse data before optimising.
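As a sketch of this gather-then-optimise step (using sqlite3 purely for illustration; the rows are the sample event results from earlier, and the column names follow the flat table above):

```python
# A sketch of the 'intermediate step' described above: flattening RDF
# query results into a one-table relational structure, then querying it
# with plain SQL. sqlite3 is used purely for illustration.
import sqlite3

db = sqlite3.connect(':memory:')
# "start" and "end" are SQL keywords, so they are quoted as identifiers.
db.execute('''CREATE TABLE events
              (id TEXT, link TEXT, title TEXT,
               "start" TEXT, "end" TEXT, location TEXT)''')

rows = [  # one row per binding returned by the RDF query
    ('e1', 'crm.hct.ac.ae/tend2002/',
     'Bridging the Divide - Strategies for Change',
     '2002-04-07', '2002-04-09', 'Dubai, United Arab Emirates'),
    ('e2', 'www.scoteconsoc.org/ses2002.html',
     'SEA Annual Conference',
     '2002-04-11', '2002-04-12', 'Dundee, UK'),
]
db.executemany('INSERT INTO events VALUES (?, ?, ?, ?, ?, ?)', rows)

# Fast, conventional SQL over the optimised flat table; ISO dates sort
# correctly as text.
upcoming = db.execute(
    'SELECT title, "start" FROM events ORDER BY "start"').fetchall()
for title, start in upcoming:
    print(start, title)
```

Once the data has settled into a shape like this, the well-understood relational machinery (indexes, reports, feeds) takes over.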
My aim has been to show how RDF tools can be useful to the Digital Libraries community now. I've suggested that while RDF tools may not be as fast and well-understood as more conventional databases and protocols, they can nevertheless be used to combine information from multiple sources in interesting and practical ways that can extend the functionality of services.
Tools you might like to try include Jena [JENA], and the SquishQL implementations Inkling [INKLING] and Ruby-RDF [RUBY-RDF], which store and query RDF in SQL databases. RDFStore [RDFSTORE] includes a Perl implementation; RDFdb [RDFDB] has a similar query language, on which SquishQL was based. Redland [REDLAND] is a fast RDF database. Many more RDF query languages and database systems are available - see Dave Beckett's RDF resource guide [BECKETT].
Thanks to Dan Brickley and Damian Steer for helpful comments and discussion.
[NETLAB] www.lub.lu.se/netlab/conf/
[JENA] www.hpl.hp.com/semweb/jena-top.html
[SIRPAC] www-db.stanford.edu/~melnik/rdf/api.html
[RDFAUTHOR] rdfweb.org/people/damian/RDFAuthor
[ISAVIZ] www.w3.org/2001/11/IsaViz/
[SMUSH] rdfweb.org/2001/01/design/smush.html
[INKLING] swordfish.rdfweb.org/rdfquery/
[ROADS] www.ukoln.ac.uk/metadata/roads/templates/
[SOSIG] www.sosig.ac.uk/
[EDITORS] for example www.sosig.ac.uk/profiles/econ_management.html
[RSS] www.purl.org/rss/1.0/
[LTSN-ECON] www.economics.ltsn.ac.uk/
[RSS-EVENTS] groups.yahoo.com/group/rss-dev/files/Modules/Proposed/mod_event.html
[LTSN-EVENTS] chewbacca.ilrt.bris.ac.uk/events/events.xml
[LTSN-FEEDS] www.ltsneng.ac.uk/rssfeeds/rsseg.asp
[RSSXPRESS] rssxpress.ukoln.ac.uk/
[GRAPE] www.sosig.ac.uk/gv/
[RUBY-RDF] www.w3.org/2001/12/rubyrdf/
[RDFSTORE] rdfstore.sourceforge.net/
[RDFDB] web1.guha.com/rdfdb/
[REDLAND] www.redland.opensource.ac.uk/
[BECKETT] www.ilrt.bris.ac.uk/discovery/rdf/resources/