bobdc.blog: Bob DuCharme's weblog, mostly on technology for representing and linking information.

"Learning SPARQL" now available (Bob DuCharme, 2011-07-27)

In print and ebook formats.


I'm very happy to announce that the ebook and print editions of Learning SPARQL are now available from O'Reilly. Print editions are also available from amazon.com, amazon.co.uk, maybe some more Amazons, and Barnes and Noble. (Borders says that it's on backorder, but I wouldn't hold my breath for that.) You can read more about how I came to write the book in an earlier blog posting.

Right now it's the only complete book on the W3C standard query language for linked data and the semantic web, and as far as I know the only book at all that covers the full range of SPARQL 1.1 features such as the ability to update data. The book steps you through simple examples that can all be performed with free software, and all sample queries, data, and output are available on the book's website. In the words of Priscilla Walmsley, "It's excellent—very well organized and written, a completely painless read. I not only feel like I understand SPARQL now, but I have a much better idea why RDF is useful (I was a little skeptical before!)"

I will continue to post news about the book and about SPARQL on the book's twitter account at @LearningSPARQL. I'm not starting a separate blog for the book, so I will continue to blog about SPARQL here.

Linking linked data to U.S. law (Bob DuCharme, 2011-07-08)

Automating conversion of citations into URLs.

At a recent W3C Government Linked Data Working Group meeting, I started thinking more about the role in linked data of laws that are published online. To summarize, you don't want to publish the laws themselves as triples, because they're a bad fit for the triples data model, but as online documents relevant to a lot of issues out there, they make an excellent set of resources to point to, although you may not always get the granularity you want.

Plenty of government data references laws and related materials.

I'm discussing U.S. Federal law here, but similar principles should apply both in individual states and in other countries. The main sets of laws here are legislation, code, regulations, and court decisions. ("Code" refers to laws passed by legislation, arranged by topic; for example, laws passed about taxes are gathered into the Internal Revenue Code.) If you really want to learn about the various forms of legal material and their relationship, I highly recommend the book Finding the Law, which I found indispensable when I worked at LexisNexis.

Most law consists of narrative sentences arranged as paragraphs, often with metadata assigned to certain blocks of it. It's such a good fit for XML that legal publishers were among the first users of XML's predecessor, SGML. (Their use of XML and SGML accounts for a large chunk of my career, and I know that some old XML friends like Sean McGrath and Dale Waldt continue to make great contributions in this area.) So, while you wouldn't get much benefit from splitting these sentences and paragraphs into subjects, predicates, and objects and publishing them as triples, plenty of government data references laws and related materials, and it's more helpful if they can reference them with URLs that lead to the actual laws. To add these URLs with any kind of scalability, you need to find out the common format for citing a document (or, if possible, a point within a document) and an online source of those legal documents whose URLs can be built from that citation format with a regular expression or some other automated tool.

When creating links to any specific bits of U.S. law, the most valuable book is The Bluebook: A Uniform System of Citation. As the subtitle implies, the book describes the normalized way to refer to legal documents and their components. Once you know these, a regular expression can often turn them into a URL that leads a browser right to the part you want. For example, while people often refer to the Supreme Court case outlawing school segregation as "Brown v. Board of Education", its official name is "347 U.S. 483", which means "the case beginning on page 483 of volume 347 of the official publication of U.S. Supreme Court decisions".

While there are several sites hosting Supreme Court decisions out there, notably Cornell Law School's Legal Information Institute, the one whose URLs are easiest to construct from a proper Supreme Court citation is justia.com, where the URL for Brown v. Board of Education is supreme.justia.com/us/347/483/case.html. (See also my favorite case, Campbell aka Skyywalker et al v. Acuff Rose Music, Inc. at supreme.justia.com/us/510/569/case.html. Make sure to listen to the relevant work on YouTube while you review it.) If you're really interested in linked data and U.S. Supreme Court cases, DBpedia has lots of great metadata for many important cases, as I wrote about in Court decision metadata and DBpedia.
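As a rough sketch of that regular-expression approach, here's what the citation-to-URL step might look like in Python, assuming the supreme.justia.com pattern shown above; real citation text has more variations than this handles, and the function name is just illustrative.

import re

def scotus_citation_to_url(text):
    """Turn a U.S. Reports citation such as "347 U.S. 483" into a
    supreme.justia.com URL built from its volume and page numbers."""
    match = re.search(r"(\d+)\s+U\.S\.\s+(\d+)", text)
    if match is None:
        return None
    volume, page = match.groups()
    return "http://supreme.justia.com/us/%s/%s/case.html" % (volume, page)

# "Brown v. Board of Education, 347 U.S. 483" becomes
# http://supreme.justia.com/us/347/483/case.html
print(scotus_citation_to_url("Brown v. Board of Education, 347 U.S. 483"))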

To create a URL for other U.S. court systems, you'll have to look up the proper way to cite them in a resource like the Bluebook and then look for versions of that court's cases online with URLs that reflect the citation in a manner that lets you automate the creation of the URL. This is a theme for linking to any kind of law on the web, and you can be sure that developers at the Legal Information Institute, LexisNexis, WestLaw, and other legal publishers have put plenty of time into developing regular expressions to make this happen so that they can turn plain text citations into hypertext links. (It would be great if the LII made their regular expressions public. LexisNexis and WestLaw never would, although they're more interested in keeping such proprietary work away from each other than from us.)

Legislation can be more complicated, but two excellent resources make it remarkably simple: the Library of Congress's THOMAS system lets you create persistent URLs for legislation using the handle system (see also its inventor's web page on it), which I hadn't heard of before the Government Linked Data meeting. The Law Librarian Blog has a nice entry showing examples of how to use it. LegisLink is another way to link to legislation, and looks simpler to me. A Legal Information Institute blog entry has a good explanation of this, and LegisLink provides an excellent form to construct the URLs. These even let you construct links to a specific section of a piece of legislation.

Granularity is an even bigger issue when linking to code and regulations, which are often broken down into numbered and lettered pieces of pieces of pieces. Ever since I worked at the grandly named Research Institute of America (a publisher of hyperlinked U.S. tax law and related information), it's always irked me to see people refer to a pension plan as a 401K, because as subsection k of section 401 of the U.S. Tax Code (title 26 of the U.S. Code), it's more properly written 401(k), or, to use its full name, 26 USC 401(k). The Government Printing Office lets you link directly to section 401, if not subsection k, with the URL frwebgate.access.gpo.gov/cgi-bin/getdoc.cgi?dbname=browse_usc&docid=Cite:+26USC401, and the LII lets you link to it with www.law.cornell.edu/uscode/26/usc_sec_26_00000401----000-.html.

That's the US Code, which arranges the laws by topic. Regulations are arranged by topic in the CFR, or Code of Federal Regulations. For example, the legal definition of bourbon is in title 27 of the CFR (Alcohol, Tobacco Products and Firearms), Part 5 (Labeling and Advertising of Distilled Spirits), section 22 (The standards of identity), subsection b (Class 2; whisky), subsubsubsection (1)(i). The full citation would be 27 CFR 5.22(b)(1)(i), but I know of no way to link to anything more specific than 27 CFR 5.22: edocket.access.gpo.gov/cfr_2010/aprqtr/27cfr5.22.htm. (Bookmark that on your phone's browser and then bet a Maker's Mark with the next barroom loudmouth that you hear insisting that bourbon must legally be made in Bourbon County, Kentucky. He's wrong. It can be made anywhere in the United States.)
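The same regular-expression trick covers these section-level URLs. A minimal sketch, assuming the simple "26 USC 401" and "27 CFR 5.22" citation forms and the GPO URL patterns quoted in the last two paragraphs; the cfr_2010/aprqtr path segment is hardcoded from the example above, so a real script would have to pick the right edition.

import re

def usc_url(citation):
    """Build a GPO URL for a U.S. Code citation like "26 USC 401",
    down to the section (subsections such as (k) are ignored)."""
    match = re.search(r"(\d+)\s+USC\s+(\d+)", citation)
    if match is None:
        return None
    title, section = match.groups()
    return ("http://frwebgate.access.gpo.gov/cgi-bin/getdoc.cgi"
            "?dbname=browse_usc&docid=Cite:+%sUSC%s" % (title, section))

def cfr_url(citation):
    """Build a GPO URL for a CFR citation like "27 CFR 5.22"."""
    match = re.search(r"(\d+)\s+CFR\s+(\d+\.\d+)", citation)
    if match is None:
        return None
    title, section = match.groups()
    return "http://edocket.access.gpo.gov/cfr_2010/aprqtr/%scfr%s.htm" % (title, section)

print(usc_url("26 USC 401(k)"))        # links to section 401, not subsection k
print(cfr_url("27 CFR 5.22(b)(1)(i)")) # links to 27 CFR 5.22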

As you can see, there's some work involved in creating URLs for links to laws, but research for this blog entry led me to new resources like LegisLink that I hadn't heard of before, so I encourage you to let me know if there's anything important that I'm missing.

It was also interesting to see that the LII is involved in efforts to create an international standard for legal document URIs proposed by some Italian legal researchers. (This is particularly interesting when you consider that Italian legal researchers basically invented the concept of linking 900 years ago.)


A comment from Frank Bennett of Nagoya University's Faculty of Law:

These are indeed important developments. The systematic linking of case law and statutory data promises to have a large and positive impact on our access to legal resources. The only point I would take issue with is the reliance on Bluebook citation forms as the Rosetta Stone for identifying resources. Parsing cites out of plain text is a necessary kludge, given the general absence of meaningful structured metadata from online legal resources (thank you Lexis, thank you WestLaw), but it should be recognized as a kludge.

To get a lively set of service layers running on top of legal data, the metadata contained in or relevant to a particular case, statutory provision or regulatory provision needs to be readily accessible to calling applications. While it is true that string parsing machinery can be written to a good standard, assuming perfectly regular citation forms and uniform document formats, neither of those constraints applies in the wild. The Bluebook shares the field in North America with the ALWD and the McGill Guide. To make matters worse, the Bluebook specifies citation forms for some foreign legal resources that vary significantly from the native citation forms of the target jurisdictions. Document formats vary as well, so getting an accurate string parse may require special-purpose serialization of the document before applying a string parser to the text -- which may be hundreds of pages in length. Although certainly better than nothing, string parsing is a fragile strategy that would be very cumbersome to standardize and does not scale well.

Matching rendered cites to URLs is an important prospect, but we won't see significant progress at the application level until the intervening step of producing true structured metadata -- and embedding it in our online resources -- is covered.


A comment from Augusto Herrmann:

I just read your interesting article entitled "Linking linked data to U.S. law". I'd like to point you to a quite successful government project that uses URNs for Brazilian legislation. The portal where you can search for legislation is at www.lexml.gov.br and information about the project can be found on projeto.lexml.gov.br . There you can find the document "Parte 2: LEXML URN" which describes the rules to construct official URNs for legislation and court decisions (it's in Portuguese, though). The project started circa 2004 and closely followed the footsteps of the Italian Norme in Rete project. If you aren't yet familiar with it, it's worth a look (see also akomantoso.org and metalex.eu).


(Note on comments: after turning off comments on this blog for a few days because of comment spam, turning them back on seems to have no effect. If you send me an email about what I've written at snee.com (bob), I'll add it and any response here.)

My upcoming O'Reilly book: "Learning SPARQL" (Bob DuCharme, 2011-06-01)

Querying and Updating with SPARQL 1.1.


51 weeks ago at last year's semtech I couldn't believe that there was still no book about SPARQL available. I had accumulated notes for such a book, and by that point I'd learned enough about SPARQL as a TopQuadrant employee that I decided to start studying the specifications (and especially the 1.1 update) more systematically and write the book myself. (This explains why I've been writing less on my blog in the last year and writing about SPARQL more when I do.)

I'm proud to announce that I'm publishing the book with O'Reilly. Print and electronic versions will be available in July at the latest, and we're already planning on releasing an expanded edition with additional new material and any necessary updates once SPARQL 1.1 becomes a Recommendation. Anyone who buys the ebook version of the first edition will get the expanded edition on SPARQL 1.1 at no extra cost.

As you can tell from the book's cover on the right, the O'Reilly animal for this one is the anglerfish—the one with the light that hangs off the front of its head, for the pun on "sparkle". (I should really pick up the nightlight version of this lovely fish.)

From what I've seen so far, the only coverage of SPARQL in any existing books is a chapter or two in more general books on the semantic web, and I haven't seen any coverage of SPARQL 1.1 in those books just yet. (The second edition of Dean Allemang and Jim Hendler's Semantic Web for the Working Ontologist, which is available on Amazon today, covers some SPARQL 1.1 query features, but not SPARQL Update.) "Learning SPARQL" is the first complete book on SPARQL, and covers both 1.0 and 1.1—including SPARQL Update—with working sample queries and data that you can try yourself with free software.

I parked the domain name learningsparql.com some time ago, and now there's a full web site about the book there. For up-to-date information about the book's availability and SPARQL news in general, subscribe to the twitter feed @LearningSPARQL.

Semantic web technology at NASA: lower costs and greater productivity (Bob DuCharme, 2011-05-27)

An inspiring story.

Ian Jacobs's recent interview with NASA's Jeanne Holm on the W3C website is an excellent case study of semantic web technology. It's not a long article, so I recommend that you read the whole thing. Here are a few points that caught my eye:

  • She gives nice hard numbers about money spent and money saved, and notes a downward trend in the costs.

  • They used publication data to infer social networks and shared expertise and found other related ways to reduce the need for staff data entry.

  • The use of service agreements encouraged people to share data more easily.

  • This sharing led to demonstrated serendipitous reuse of data.

  • They plan to network the vocabularies (she doesn't use this term literally—I know it from a TopQuadrant context—but she's clearly talking about the same thing).

It was nice to see the credit that she gave to Kendall Clark. With my TopQuadrant hat on, I wish she'd mentioned some of the extensive work that Ralph Hodgson has done there, but NASA is a big organization.

After reading Danny Ayers' Smell the coffee blog post this morning, which wasn't very hopeful about recent progress in the semantic web, I hoped that Ian's interview with Jeanne would cheer him up.

Using SPARQL to find the right DBpedia URI (Bob DuCharme, 2011-05-17)

Even with the wrong name.


In Pulling SKOS prefLabel and altLabel values out of DBpedia, I described how Wikipedia and DBpedia store useful data about alternative names for resources described on Wikipedia, and I showed how you can use these to populate a SKOS dataset's alternative and preferred label properties. Today I want to show how to use these as part of an application that lets you retrieve data even when you don't necessarily have the right name for something—for example, retrieving a picture of Bob Marley using the misspelled version of his name "Bob Marly".

The DBpedia page for Bob Marley shows that dbpedia:Bob_Marly is one of the dbpedia-owl:wikiPageRedirects values of dbpedia.org/page/Bob_Marley. This means that if you send your browser to en.wikipedia.org/wiki/Bob_Marly, you'll end up on en.wikipedia.org/wiki/Bob_Marley.

What that page doesn't show is that this redirect URI has the rdfs:label value "Bob Marly"@en associated with it, and this is the really handy part for retrieving data based on not-quite-right values. Because of this, the following SPARQL query will return the URI dbpedia.org/resource/Bob_Marley whether the quoted literal value is "Bob Marly" or "Bob Marley":

# First two PREFIX declarations unnecessary on SNORQL
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?s WHERE {
  {
    ?s rdfs:label "Bob Marly"@en ;
       a owl:Thing .
  }
  UNION
  {
    ?altName rdfs:label "Bob Marly"@en ;
             dbo:wikiPageRedirects ?s .
  }
}

The graph pattern before the UNION keyword checks whether there is an actual Wikipedia page for the quoted value, and the part after checks whether it's a redirect of something else. Effectively, it will be one or the other; there are only about a dozen labels in DBpedia that can be both.

To use this in a simple application, I created a form that, after you enter a name on it, attempts to display a picture of what you entered. Because the redirect data includes common misspellings as well as nicknames, entering "Bob Marly" will get you a picture of Marley and the URL of the actual resource, as shown below the picture above. Other interesting nicknames and misspellings to try are Bob Dillan, Mary Casat, Prince Billy, Big Blue, and Proctor and Gamble. (Warning: DBpedia image data is incorrect for some very well-known people, like Abraham Lincoln and Barack Obama, even when the Wikipedia page has a picture, so you may see the symbol for a broken image link. I had hoped that the picture above would have a title of "Abe Lincon".)

Because the output creates a specialized web page, I used the technique I described in Build Wikipedia query forms with semantic technology (which can be used with any SPARQL endpoint, not just DBpedia): a CGI Python script stores a SPARQL query, replaces a string in that query with whatever was entered in the form, sends the query off to the endpoint, and then sends HTML based on the result back to the browser. You can see the source here.
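If you'd rather not click through, here's a stripped-down sketch of that pattern rather than the actual script: it assumes the form submits a field named "name" via GET (an assumption on my part) and DBpedia's public endpoint, and it just echoes the raw results instead of building a picture page.

#!/usr/bin/env python
"""Minimal CGI sketch: substitute a submitted name into a stored SPARQL
query, send it to DBpedia, and relay the results to the browser."""

import os
import urllib.parse
import urllib.request

QUERY_TEMPLATE = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?s WHERE {
  { ?s rdfs:label "NAME"@en ; a owl:Thing . }
  UNION
  { ?altName rdfs:label "NAME"@en ; dbo:wikiPageRedirects ?s . }
}
"""

# Pull the submitted value out of the query string ("name" is an assumed
# field name) and substitute it into the stored query.
form = urllib.parse.parse_qs(os.environ.get("QUERY_STRING", ""))
name = form.get("name", ["Bob Marly"])[0]
query = QUERY_TEMPLATE.replace("NAME", name)

# Send the query to the endpoint and pass the response back to the browser.
url = "http://dbpedia.org/sparql?" + urllib.parse.urlencode(
    {"query": query, "format": "application/sparql-results+json"})
with urllib.request.urlopen(url) as response:
    results = response.read().decode("utf-8")

print("Content-Type: text/plain")
print()
print(results)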

It's safe to say that this ability to find the right information based on a nickname or common misspelling could add a lot to many applications. Once again, while the most important part of the semantic web is the data—in this case, DBpedia's wikiPageRedirects values—and not the standards and technologies used to get at the data, the existence of so much useful SPARQL-accessible data should make the SPARQL query language look more and more appealing to people who might have doubted before.

SKOS overview article on IBM developerWorks (Bob DuCharme, 2011-05-11)

SKOS, vocabulary management, the semantic web, and more


I've been interested in the SKOS standard for vocabulary management for several years (and written about it here several times), but since we at TopQuadrant first began planning out the Enterprise Vocabulary Net product, I've learned a lot more about the theory and practice of using SKOS. I've recently written up an overview of SKOS and where it fits into vocabulary management and the semantic web, and IBM developerWorks has just published this as Improve your taxonomy management using the W3C SKOS standard. I hope it proves useful to people who want to learn more about SKOS.

Quick and dirty linked data content negotiation (Bob DuCharme, 2011-05-09)

Not even that dirty.

I've managed to fill a key gap in the world's supply of Linked Open Data by publishing triples that connect Mad Magazine film parody titles to the DBpedia URIs of the actual films. For example:

<http://dbpedia.org/resource/Judge_Dredd_%28film%29>
      mad:FilmParody
              [ prism:CoverDate "1995-08-00" ;
                prism:issueIdentifier
                        "338" ;
                dc:title "Judge Dreck"
              ] .

<http://dbpedia.org/resource/2001:_A_Space_Odyssey_%28film%29>
      mad:FilmParody
              [ prism:CoverDate "1969-03-00" ;
                prism:issueIdentifier "125" ;
                dc:title "201 Minutes of a Space Idiocy"
              ] .

(To prepare the data, I scraped a Wikipedia list, tested the URIs, then hand-corrected a few.) To really make this serious RESTful linked open data, I wanted to make it available as both RDF/XML and Turtle depending on the Accept value in the header of the HTTP request. All this took was a few lines in the .htaccess file (which I've been learning more about lately) in the directory storing the RDF/XML and Turtle versions of the data.

For example, either of the following two commands retrieves the Turtle version:

wget --header="Accept: text/turtle" http://www.rdfdata.org/dat/MadFilmParodies/
curl --header "Accept: text/turtle" -L http://www.rdfdata.org/dat/MadFilmParodies/

Substituting application/rdf+xml for text/turtle in either command gets you the RDF/XML version, and omitting the --header parameter altogether gets you an HTML version.

Here's the complete .htaccess file:

RewriteEngine on

RewriteCond %{HTTP_ACCEPT} ^.*text/turtle.*
RewriteRule ^index.html$ http://www.rdfdata.org/dat/MadFilmParodies/MadFilmParodies.ttl [L]
# no luck:
#RewriteRule ^index.html$ http://www.rdfdata.org/dat/MadFilmParodies/MadFilmParodies.ttl [R=303,L]

RewriteCond %{HTTP_ACCEPT} ^.*application/rdf\+xml.*
RewriteRule ^index.html$ http://www.rdfdata.org/dat/MadFilmParodies/MadFilmParodies.rdf [L]

RewriteRule ^index.html$ http://en.wikipedia.org/wiki/List_of_Mad's_movie_spoofs

The Apache web server where I have this hosted is configured to look for an index.html file in a directory if the requested URL doesn't mention a specific filename, so the three rules here each modify that "request" to look for something else, depending on what the RewriteCond line finds in the HTTP_ACCEPT value. If it finds "text/turtle", it sends the Turtle version of my data, and the L directive tells the Apache mod_rewrite module that is processing these instructions not to look at any more of them.

The next rule performs the corresponding HTTP_ACCEPT check and file delivery for an RDF/XML request, and the default behavior if neither of those happen is to deliver an HTML version of the data. (I took the lazy way out and just redirected to the appropriate Wikipedia page instead of creating a new HTML file.) As you can see from the two commented-out lines, I had the impression that adding R=303 in the brackets with the L would send an HTTP return code of 303 back to the requester, overriding the default code of 302, but never got that to work. If anyone has any suggestions about how to fix this, or whether 303 is even the most appropriate return code, please let me know.

From what I've read on how the syntax of these instructions works, I shouldn't have needed the full URLs for the Turtle and RDF/XML versions of the Mad Film Parody data, because they were in the same directory as the .htaccess file, but that was the only way I could get this to work.

Now that I know how to do this, I can do it again for other resources pretty quickly. It took me about five minutes to do it for the little www.snee.com/ns/madMag/MadFilmParody ontology that the data points to. I consider this solution quick and a bit dirty because it requires the maintenance of two copies of the data, but the XML guy in me knows that it would be wrong to perform parallel edits on the two copies, and that I should instead pick one as a master, edit it when necessary, and generate the other from it. If I had to do this on a larger scale, I learned from Brian Sletten at last year's semtech that I should look into NetKernel, but it was a good exercise to do it this way to learn what was really going on.
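As for regenerating one copy from the master, that step can be a very short script. A rough sketch using the Python rdflib library (my choice for illustration; any toolkit that reads Turtle and writes RDF/XML would do the same job): treat the Turtle file as the master and rebuild the RDF/XML whenever the Turtle changes.

from rdflib import Graph

# Parse the hand-maintained Turtle master...
g = Graph()
g.parse("MadFilmParodies.ttl", format="turtle")

# ...and regenerate the RDF/XML copy that the .htaccess rules serve
# for application/rdf+xml requests.
g.serialize(destination="MadFilmParodies.rdf", format="xml")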

I'm going to try to get into the habit of doing this for data and ontologies that I create, so I'd appreciate any suggestions about tweaking details before any suboptimal aspects of this become habits.

Data providers (Bob DuCharme, 2011-05-02)

RDF or otherwise.

While beta testing Talis's Kasabi, I got to wondering about the data publishing market: who out there is hosting raw data, potentially charging for it and passing money along to the data's providers? Poking around, I learned who the key names are. (Corrections welcome.) I accidentally stumbled across a few more when I followed a tweet from @xmlgrrl (a.k.a. Eve Maler, a friend of mine in the XML world since it was the SGML world) and started looking at her husband Eli's blog. His posting Ten services to get your cloud startup off the ground now mentioned a few more companies that provide raw data—one that even provides free RDF. I tagged a few with a delicious.com bookmark, but wanted to write out notes about a few here in order of how interesting they are to a semantic web geek.

Some general notes:

  • The more I studied, the more I found, but I didn't want to spend more than an afternoon on this.

  • These sites all let you download data directly. I didn't include sites like Data.gov that function more as directories that link to data sources on other sites.

  • Most of these providers have boosted their numbers of available datasets by including small datasets with as few as 100 records, and by hosting copies of data from the well-known names in the Linked Data Cloud. The advertised added value is typically the ease of programmatic access to that data.

  • Despite the title of this blog entry (I was tempted to call it "Data resellers", but many make the data available for free), I focused on a narrower class of data providers: the redistributors that gather data from specific, identified places and then make it available publicly with attribution, not the actual data sources themselves such as government agencies, university projects, media making their metadata available, and various other circles on the Linked Data Cloud diagram.

  • If I've quoted some companies' websites more than others, it's because they had "About" and "FAQ" pages that were easy to find and actually answered the questions I was wondering about.

The most interesting thing about Kasabi in this field is their commitment to providing data according to Linked Data principles, giving you SPARQL endpoints for data sources and the ability to define new APIs around each data source. The current data selection is interesting, considering that Kasabi is still in beta. For now it all looks like data that is freely available elsewhere, but the advantages of retrieving it from them go beyond the ability to use the SPARQL query language. For example, with BestBuy's RDFa spread out across many different dynamically generated pages on bestbuy.com, querying this data from BestBuy's server has a lot of limitations. Kasabi seems to have the BestBuy data aggregated so that their customers have more flexibility in how they query it.

While disintermediation was a big buzzword of the dot com boom, intermediation is now getting bigger.

I list Socrata right after Kasabi because RDF is one of their export formats, along with XML, JSON, CSV, XLS, and more. In a business that depends on finding both data providers and data users, their home page makes the clearest case about why someone should work with them as a data provider: they're clearly targeting government agencies who need to fulfill data transparency mandates. (Other providers are certainly targeting this market; just not as clearly.) The company info page calls them "The Leader in Open Data Services for Government". Another paragraph on the homepage makes a nice case for why developers should be interested in their data, and upcoming webinar titles of "Launch your own Data.Gov" and "Open Data as a Service Delivery Platform" are also pretty catchy to someone interested in this market.

Factual targets data users more than data providers on their current home page, telling developers "Access great data for your web and mobile apps". The only download format I could find was CSV, but with their emphasis on helping developers build apps, they focus more on data delivery through their RESTful API. According to their FAQ, "Factual, Inc. is an open data platform for application developers that leverages large scale aggregation and community exchange... Factual's hosted data comes from our community of users, developers and partners, and from our powerful data mining tools... Factual offers several hundred thousand datasets across a variety of topics (with a deep focus in Local) aggregated from multiple sources, made easily accessible for developers to build web and mobile apps... Our APIs are free to everyone—if you want SLAs or have certain performance requirements, we would charge you a fee based on usage volume. Our downloads are free for smaller developers". A press release on Semantifi's web site shows that some big names and big money are behind Factual.

Infochimps seems to be one of the more well-known (and memorable) names in the field. From their FAQ: "Infochimps is a place for people to find, share and sell formatted data. Both users and Infochimps employees scrape, parse and format data so that it's easily accessible to you. We take the chimp work out of working with data so you can literally start building cool stuff in minutes... There is no sign up fee to use Infochimps. Some of the data sets available on our site are free. Some require attribution, and others are available for purchase. The first 100,000 data API calls are free. We offer subscriptions if you would like to use more... The data sets available through our API are 1.) hosted for you and 2.) scraped on a regular basis. ... Most of our data comes in tsv, csv or yaml format". The part about users scraping, parsing, and formatting highlights another aspect of the business model of some of these companies: crowd-sourcing the labor whenever possible.

AggData sells CSV files, typically of locations of all the stores in a particular chain. For example, a complete list of Cinnabon locations, with 454 records, costs $29. The description page for each data set lists the fields and lets you download a sample. Prices that I saw ranged from $9 to $49. According to their FAQ, you order a dataset, and when payment is confirmed they email you a URL for the data that is good for 5 downloads or 120 hours. Being founded in 2006 and therefore the oldest of these companies, AggData is the most low-tech (no APIs here) but it's a lot easier to look at their lists of franchise locations and churches and imagine that data being useful to someone than it is for many of the other data providers. Infochimps lists AggData as a "featured data provider", but lists the same prices for the same datasets, so I'm not sure whether they're just routing you to the same batches of data or making it available through their own APIs. (I got an Infochimps ID, clicked through for an AggData dataset until it asked me for credit card information, and stopped there.)

According to their About page,