
Back Online after the Spam-fest

August 15th, 2011 by Tom Heath

Just a quick post now that this blog is back online after being badly compromised by spammers. I took everything down and let the links 404 for a while, in the hope that this would encourage search engines to clear out their indexes. The search engine referrals seem to be getting cleaner now, which is a relief. May this be the last of it.

Filed under: Uncategorized · Comments closed

The Linked Data Book: Draft Table of Contents

January 26th, 2011 by Tom Heath

Update 2011-02-25: the book is now published and available for download and in hard copy:

  • www.morganclaypool.com/doi/abs/10.2200/S00334ED1V01Y201102WBE001 (PDF download)
  • secure.aidcvt.com/mcp/ProdDetails.asp?ID=9781608454303&PG=1&Type=BL&PCS=MCP (hard copy)
  • www.amazon.com/Linked-Data-Synthesis-Lectures-Semantic/dp/1608454304/ (hard copy)

Original Post

Chris Bizer and I have been working over the last few months on a book capturing the state of the art in Linked Data. The book will be published shortly as an e-book and in hard copy by Morgan & Claypool, as part of the series Synthesis Lectures in Web Engineering, edited by Jim Hendler and Frank van Harmelen. There will also be an HTML version available free of charge on the Web.

I’ve been asked about the contents, so thought I’d reproduce the table of contents here. This is the structure as we sent it to the publisher — the final structure may vary a little, but any changes will likely be superficial. Register at Amazon to receive an update when the book is released.

  • Overview
  • Contents
  • List of Figures
  • Acknowledgements
  • Introduction
    • The Data Deluge
    • The Rationale for Linked Data
      • Structure Enables Sophisticated Processing
      • Hyperlinks Connect Distributed Data
    • From Data Islands to a Global Data Space
    • Structure of this book
    • Intended Audience
    • Introducing Big Lynx Productions
  • Principles of Linked Data
    • The Principles in a Nutshell
    • Naming Things with URIs
    • Making URIs Dereferenceable
      • 303 URIs
      • Hash URIs
      • Hash versus 303 URIs
    • Providing Useful RDF Information
      • The RDF Data Model
        • Benefits of using the RDF Data Model in the Linked Data Context
        • RDF Features Best Avoided in the Linked Data Context
      • RDF Serialization Formats
        • RDF/XML
        • RDFa
        • Turtle
        • N-Triples
        • RDF/JSON
    • Including Links to other Things
      • Relationship Links
      • Identity Links
      • Vocabulary Links
    • Conclusions
  • The Web of Data
    • Bootstrapping the Web of Data
    • Topology of the Web of Data
      • Cross-Domain Data
      • Geographic Data
      • Media
      • Government Data
      • Libraries and Education
      • Life Sciences
      • Retail and Commerce
      • User Generated Content and Social Media
    • Conclusions
  • Linked Data Design Considerations
    • Using URIs as Names for Things
      • Minting HTTP URIs
      • Guidelines for Creating Cool URIs
        • Keep out of namespaces you do not control
        • Abstract away from implementation details
        • Use Natural Keys within URIs
      • Example URIs
    • Describing Things with RDF
      • Literal Triples and Outgoing Links
      • Incoming Links
      • Triples that Describe Related Resources
      • Triples that Describe the Description
    • Publishing Data about Data
      • Describing a Data Set
        • Semantic Sitemaps
        • voiD Descriptions
      • Provenance Metadata
      • Licenses, Waivers and Norms for Data
        • Licenses vs. Waivers
        • Applying Licenses to Copyrightable Material
        • Non-copyrightable Material
    • Choosing and Using Vocabularies
      • SKOS, RDFS and OWL
      • RDFS Basics
        • Annotations in RDFS
        • Relating Classes and Properties
      • A Little OWL
      • Reusing Existing Terms
      • Selecting Vocabularies
      • Defining Terms
    • Making Links with RDF
      • Making Links within a Data Set
        • Publishing Incoming and Outgoing Links
      • Making Links with External Data Sources
        • Choosing External Linking Targets
        • Choosing Predicates for Linking
      • Setting RDF Links Manually
      • Auto-generating RDF Links
        • Key-based Approaches
        • Similarity-based Approaches
  • Recipes for Publishing Linked Data
    • Linked Data Publishing Patterns
      • Patterns in a Nutshell
        • From Queryable Structured Data to Linked Data
        • From Static Structured Data to Linked Data
        • From Text Documents to Linked Data
      • Additional Considerations
        • Data Volume: How much data needs to be served?
        • Data Dynamism: How often does the data change?
    • The Recipes
      • Serving Linked Data as Static RDF/XML Files
        • Hosting and Naming Static RDF Files
        • Server-Side Configuration: MIME Types
        • Making RDF Discoverable from HTML
      • Serving Linked Data as RDF Embedded in HTML Files
      • Serving RDF and HTML with Custom Server-Side Scripts
      • Serving Linked Data from Relational Databases
      • Serving Linked Data from RDF Triple Stores
      • Serving RDF by Wrapping Existing Application or Web APIs
    • Additional Approaches to Publishing Linked Data
    • Testing and Debugging Linked Data
    • Linked Data Publishing Checklist
  • Consuming Linked Data
    • Deployed Linked Data Applications
      • Generic Applications
        • Linked Data Browsers
        • Linked Data Search Engines
      • Domain-specific Applications
    • Developing a Linked Data Mashup
      • Software Requirements
      • Accessing Linked Data URIs
      • Representing Data Locally using Named Graphs
      • Querying local Data with SPARQL
    • Architecture of Linked Data Applications
      • Accessing the Web of Data
      • Vocabulary Mapping
      • Identity Resolution
      • Provenance Tracking
      • Data Quality Assessment
      • Caching Web Data Locally
      • Using Web Data in the Application Context
    • Effort Distribution between Publishers, Consumers and Third Parties
  • Summary and Outlook
  • Bibliography
Filed under: Linked Data, Semantic Web, Writing · 7 Comments

Arguments about HTTP 303 Considered Harmful

November 10th, 2010 by Tom Heath

Ian recently published a blog post that he’d finally got around to writing, several months after a fierce internal debate at Talis about whether the Web of Data needs HTTP 303 redirects. I can top that. Ian’s post unleashed a flood of anti-303 sentiment that has prompted me to finish a blog post I started in February 2008.

Picture the scene: six geeks sit around a table in the bar of a Holiday Inn, somewhere in West London. It’s late, we’re drunk, and debating 303 redirects and the distinction between information and non-information resources. Three of the geeks exit stage left, leaving me to thrash it out with Dan and Damian. Some time shortly afterwards Dan calls me a “303 fascist”, presumably for advocating the use of HTTP 303 redirects when serving Linked Data, as per the W3C TAG’s finding on httpRange-14.

I never got to the bottom of Dan’s objection (technical? philosophical? historical?), but there is seemingly no end to the hand-wringing that we as a community are willing to engage in over this issue.

Ian’s post lists nine objections to the 303 redirect pattern, most of which don’t stand up to closer scrutiny. Let’s take them one at a time:

1. it requires an extra round-trip to the server for every request

For whom is this an issue? Users? Data publishers? Both?

If it’s the former then the argument doesn’t wash. Consider a typical request for a Web page. The browser requests the page, and the server sends the HTML document in response. (Wait, should that be “an HTML representation of the resource denoted by the URI”, or whatever? If we want to get picky about the terminology of Web architecture then let’s start with the resource/representation minefield. I would bet hard cash that the typical Web user or developer is better able to understand the distinction between information resources and non-information resources than that between resources and representations of resources.)

Anyway, back to our typical request for a Web page… The browser parses the HTML document, finds references to images and stylesheets hosted on the same server, and quite likely some JavaScript hosted elsewhere. Each of the former requires another request to the original server, while the latter triggers requests to other domains. In the worst-case scenario these other domains aren’t in the client’s (or their ISP’s) DNS cache, requiring a DNS lookup on each hostname and increasing the overall time cost of the request.

In this context, is a 303 redirect and the resulting round-trip really an issue for users of the HTML interfaces to Linked Data applications? I doubt it.
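To put that extra round-trip in perspective, here’s a minimal sketch of what it looks like on the wire, using Python’s requests library. The toucan URI is illustrative; DBpedia has historically served 303 redirects of exactly this shape.

    import requests

    # Illustrative URI: DBpedia has historically answered requests for
    # /resource/... with a 303 redirect to a document describing the thing.
    uri = "http://dbpedia.org/resource/Toucan"

    # requests follows redirects by default and records each hop in .history
    resp = requests.get(uri, headers={"Accept": "application/rdf+xml"})

    for hop in resp.history:
        print(hop.status_code, "->", hop.headers.get("Location"))
    print(resp.status_code, resp.url)  # 200 at the URI of the describing document

One extra hop, set against the dozens of requests a typical HTML page triggers.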

Perhaps it’s an issue for data publishers. Perhaps those serving (or planning to serve) Linked Data are worried about whether their Web servers can handle the extra requests/second that 303s entail. If that’s the case, presumably the same data publishers insist that their development teams chuck all their CSS into a single stylesheet, in order to prevent any unnecessary requests stemming from using multiple stylesheets per HTML document. I doubt it.

My take-home message is this: in the grand scheme of things, the extra round-trip stemming from a 303 redirect is of very little significance to users or data publishers. Eyal Oren raised the very valid question some time ago of whether 303 responses should be cacheable. Redefining this in the HTTP spec seems eminently sensible, so why hasn’t it happened? If just a fraction of the time spent debating 303s and IR vs. NIR had been spent lobbying for that change, we would have some progress to report. Instead we just have hand-wringing and FUD for potential Linked Data adopters.

2. only one description can be linked from the toucan’s URI

Do people actually want to link to more than one description of a resource? Perhaps there are multiple documents on a site that describe the same thing, and it would be useful to point to them both. (Wait, we have a mechanism for that! It’s called a hypertext/hyperdata link.) But maybe someone has two static files on the same server that are both equally valid descriptions of the same resource. Yes, in that case it would be useful to point to both; so just create an RDF document that sits behind a 303 redirect and contains some rdfs:seeAlso statements pointing to the more extensive descriptions, or serve your data from an RDF store that can pull out all the statements describing the resource and return them as one document.
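As a concrete sketch of that first option (all example.org URIs are hypothetical; rdflib is used here only to check that the Turtle parses):

    from rdflib import Graph

    # A minimal description document of the kind served behind a 303
    # redirect, using rdfs:seeAlso to point at further descriptions.
    doc = """
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    <http://example.org/id/toucan>
        rdfs:label "Toucan" ;
        rdfs:seeAlso <http://example.org/doc/toucan-extended> ,
                     <http://example.org/doc/toucan-habitat> .
    """

    g = Graph()
    g.parse(data=doc, format="turtle")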

I don’t buy the idea that people actually want to point to multiple descriptions anywhere other than in the data itself. If there are other equivalent resources out there on the Web then state their equivalence; don’t just link to their descriptions. There may be 10 or 100 or 1000 equivalent resources referenced in an RDF document, but 303 redirects make it very clear which is the authoritative description of a specific resource.

3. the user enters one URI into their browser and ends up at a different one, causing confusion when they want to reuse the URI of the toucan. Often they use the document URI by mistake.

OK, let’s break this issue down into two distinct scenarios: Joe Public, who wants to bookmark something, and Joe Developer, who wants to hand-craft some RDF (using the correct URI to identify the toucan).

Again, I would bet hard cash that Joe Public doesn’t want to reuse the URI of the toucan in his bookmarks, emails, tweets etc. I would bet that he wants to reuse the URI of the document describing the toucan. No one sends emails saying “hey, read this toucan”. People say “hey, read this document about a toucan”. In this case it doesn’t matter one bit that the document URI is being used.

Things can get a bit more complicated in the Joe Developer scenario, and the awful URI pattern used in DBpedia, where it’s visually hard to notice the change from /resource to /data or /page, doesn’t help at all. So change it. Or agree to never use that pattern again. If documents describing things in DBpedia ended .rdf or .html would we even be having this debate?

Joe Developer also has to take a bit of responsibility for writing sensible RDF statements. Unfortunately, people like Ed, who seems to conflate himself with his homepage (and his router with its admin console), don’t help with the general level of understanding. I’ve tried many times to explain to someone that I am not my homepage, and as far as I know I’ve never failed. In all this frantic debate about the 303 mechanism, let’s not abandon certain basic principles that just make sense.

I don’t think Ian was suggesting in his posts that he is his homepage, so let’s be very, very explicit about what we’re debating here — 303 redirects — and not muddy the waters by bringing other topics into the discussion.

4. it’s non-trivial to configure a web server to issue the correct redirect and only to do so for the things that are not information resources.

Ian claims this is non-trivial. Neither is running a Drupal installation; I know, because one powers linkeddata.org, and maintaining it is a PITA. That doesn’t stop thousands of people doing it. Let’s be honest, very little in Web technology is trivial. Running a Web server in your basement isn’t trivial – that’s why people created wordpress.com, Flickr, MySpace, etc., bringing Web publishing to the masses, and why most of us would rather pay Web hosting companies to do the hard stuff for us. If people really see this configuration issue as a barrier then they should get on with implementing a server that makes it trivial, or teach people how to make the necessary configuration changes.
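To show just how little code such a server needs, here’s a hypothetical sketch in Python with Flask; the /id/ and /doc/ URI layout and the example.org URIs are my own illustrative choices, not a standard:

    from flask import Flask, Response, redirect, request

    app = Flask(__name__)

    # Sketch of the 303 pattern: a request for the thing's URI is redirected
    # to a document describing it, chosen by content negotiation.
    @app.route("/id/<name>")
    def thing(name):
        accept = request.headers.get("Accept", "")
        if "text/turtle" in accept or "application/rdf+xml" in accept:
            return redirect(f"/doc/{name}.ttl", code=303)
        return redirect(f"/doc/{name}.html", code=303)

    @app.route("/doc/<name>.ttl")
    def rdf_description(name):
        # A one-triple description, just to make the round-trip visible
        body = (f'<http://example.org/id/{name}> '
                f'<http://www.w3.org/2000/01/rdf-schema#label> "{name}" .')
        return Response(body, mimetype="text/turtle")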

5. the server operator has to decide which resources are information resources and which are not without any precise guidance on how to distinguish the two (the official definition speaks of things whose “essential characteristics can be conveyed in a message”). I enumerate some examples here but it’s easy to get to the absurd.

The original guidance from the TAG stated that a 200 indicated an information resource, whereas a 303 could indicate any type of resource. If in doubt, use a 303 and redirect to a description of the resource. Simple.

6. it cannot be implemented using a static web server setup, i.e. one that serves static RDF documents

In this case hash URIs are more suitable anyway, and this has always been the case: the fragment part of a hash URI is stripped off by the client before the HTTP request is made, so the server simply returns the whole document with a 200 and no redirect is needed.

7. it mixes layers of responsibility – there is information a user cannot know without making a network request and inspecting the metadata about the response to that request. When the web server ceases to exist then that information is lost.

Can’t this be resolved by adding triples to the document that describes the resource, stating the relationship between the resource and its description?
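For instance, here’s a sketch of that approach using FOAF’s primaryTopic/isPrimaryTopicOf pair (the URIs are hypothetical; rdflib is used only to check the data parses):

    from rdflib import Graph

    # Sketch: the description document states its own relationship to the
    # thing it describes, so the thing/document link survives in the data
    # even if the server (and its 303 responses) disappears.
    data = """
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .

    <http://example.org/doc/toucan> foaf:primaryTopic <http://example.org/id/toucan> .
    <http://example.org/id/toucan> foaf:isPrimaryTopicOf <http://example.org/doc/toucan> .
    """

    g = Graph()
    g.parse(data=data, format="turtle")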

8. the 303 response can really only be used with things that aren’t information resources. You can’t serve up an information resource (such as a spreadsheet) and 303 redirect to metadata about the spreadsheet at the same time.

Metadata about an RDF document can be included in the document itself. Perhaps a more Web-friendly alternative to Excel could allow for richer embeddable metadata.

9. having to explain the reasoning behind using 303 redirects to mainstream web developers simply reinforces the perception that the semantic web is baroque and irrelevant to their needs.

I fail to see how Ian’s proposal, when taken as a whole package, is any less confusing.

~~~

Having written this post I’m wondering whether the time would have been better spent on something more productive, which is precisely how I feel about the topic in general. As geeks I think we love obsessing about getting things “right”, but at what cost? Ian’s main objection seems to be about the barriers we put in the way of Linked Data adoption. From my own experience there is no better barrier than uncertainty. Arguments about HTTP 303s are far more harmful than 303s themselves. Let’s put the niggles aside and get on with making Linked Data the great success we all want it to be.

Tags: http, http 303, httpRange-14, Linked Data, rdf, Semantic Web. Filed under: Linked Data, Semantic Web · 10 Comments

Why Carry the Cost of Linked Data?

June 16th, 2010 by Tom Heath

In his ongoing series of niggles about Linked Data, Rob McKinnon claims that “mandating RDF [for publication of government data] may be premature and costly”. The claim is made in reference to Francis Maude’s parliamentary answer to a question from Tom Watson. Personally, I see nothing in the statement from Francis Maude that implies mandating RDF or Linked Data, only that “Where possible we will use recognised open standards including Linked Data standards”. Note the “where possible”. However, that’s not the point of this post.

There’s nothing premature about publishing government data as Linked Data – it’s happening on a large scale in the UK, US and elsewhere. Where I do agree with Rob (perhaps for the first time) is that it comes at a cost. However, this isn’t the interesting question, as the same applies to any investment in a nation’s infrastructure. The interesting questions are who bears that cost, and who benefits?

Let’s make a direct comparison between publishing a data set in raw CSV format (probably exported from a database or spreadsheet) and making the extra effort to publish it in RDF according to the Linked Data principles.

Assuming that your spreadsheet doesn’t contain formulas or merged cells that would make the data irregularly shaped, or that you can create a nice database view that denormalises your relational database tables into one, the cost of publishing data in CSV basically amounts to running the appropriate export and hosting the static file somewhere on the Web. Dead cheap, right?

Oh wait, you’ll need to write some documentation explaining what each of the columns in the CSV file means, and what types of data people should expect to find in each of them. You’ll also need to create and maintain some kind of directory so people can discover your data in the crazy haystack that is the Web. Not quite so cheap after all.

So what are the comparable processes and costs in the RDF and Linked Data scenario? One option is to use a tool like D2R Server to expose data from your relational database to the Web as RDF, but let’s stick with the CSV example to demonstrate the lo-fi approach.

This is not the place to reproduce an entire guide to publishing Linked Data, but in a nutshell, you’ll need to decide on the format of the URIs you’ll assign to the things described in your data set, select one or more RDF schemata with which to describe your data (analogous to defining what the columns in your CSV file mean and how their contents relate to each other), and then write some code to convert the data in your CSV file to RDF, according to your URI format and the chosen schemata. Last of all, for it to be proper Linked Data, you’ll need to find a related Linked Data set on the Web and create some RDF that links (some of) the things in your data set to things in the other. Just as with conventional Web sites, if people find your data useful or interesting they’ll create some RDF that links the things in their data to the things in yours, gradually creating an unbounded Web of data.
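To give a flavour of the conversion step, here’s a lo-fi sketch in Python using rdflib; the CSV layout, the URI pattern and the choice of FOAF terms are all illustrative assumptions:

    import csv
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import FOAF, RDF

    # Hypothetical URI pattern for the things described in the data set
    EX = Namespace("http://example.org/id/")

    g = Graph()
    g.bind("foaf", FOAF)

    # Assumes a people.csv with columns: id, name, homepage
    with open("people.csv", newline="") as f:
        for row in csv.DictReader(f):
            person = EX[row["id"]]
            g.add((person, RDF.type, FOAF.Person))
            g.add((person, FOAF.name, Literal(row["name"])))
            g.add((person, FOAF.homepage, URIRef(row["homepage"])))

    g.serialize(destination="people.ttl", format="turtle")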

Clearly these extra steps come at a cost compared to publishing raw CSV files. So why bear these costs?

There are two main reasons: discoverability and reusability.

Anyone (deliberately) publishing data on the Web presumably does so because they want other people to be able to find and reuse that data. The beauty of Linked Data is that discoverability is baked in to the combination of RDF and the Linked Data principles. Incoming links to an RDF data set put that data set “into the Web” and outgoing links increase the interconnectivity further.

Yes, you can create an HTML link to a CSV file, but you can’t link to specific things described in the data or say how they relate to each other. Linked Data enables this. Yes, you can publish some documentation alongside a CSV file explaining what each of the columns means, but that documentation can’t be interlinked with the data itself to make it self-describing. Linked Data does this. Yes, you can include URIs in the data itself, but CSV provides no mechanism for indicating that the content of a particular cell is a link to be followed. Linked Data does this. Yes, you can create directories or catalogues that describe the data sets available from a particular publisher, but this doesn’t scale to the Web. Remember what the arrival of Google did to the Yahoo! directory? What we need is a mechanism that supports arbitrary discovery of data sets by bots roaming the Web and building searchable indices of the data they find. Linked Data enables this.

Assuming that a particular data set has been discovered, what is the cost of any one party using that data in a new application? Perhaps this application only needs one data set, in which case all the developer must do is read the documentation to understand the structure of the data and get on with writing code. A much more likely scenario is that the application requires integration of two or more data sets. If each of these data sets is just a CSV file then every application developer must incur the cost of integrating them, i.e. linking together the elements common to both data sets, and must do this for every new data set they want to use in their application. In this scenario the integration cost of using these data sets is proportional to their use. There are no economies of scale. It always costs the same amount, to every consumer.

Not so with Linked Data, which enables the data publisher to identify links between their data and third-party data sets, and to make these links available to every consumer by publishing them as RDF along with the data itself. Yes, there is a one-off cost to the publisher in creating the links most likely to be useful to data consumers, but that’s a one-off: it doesn’t increase every time a developer uses the data set, and no individual developer has to pay it again for each data set they use.
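As a sketch of what that one-off linking step amounts to (the local identifiers and their DBpedia counterparts below are hypothetical examples), the link set is just more RDF, published alongside the data:

    from rdflib import Graph, URIRef
    from rdflib.namespace import OWL

    # The publisher's one-off step: assert owl:sameAs links from local
    # identifiers to a third-party data set, and publish them as RDF.
    mapping = {
        "http://example.org/id/london": "http://dbpedia.org/resource/London",
        "http://example.org/id/leeds": "http://dbpedia.org/resource/Leeds",
    }

    links = Graph()
    for local, external in mapping.items():
        links.add((URIRef(local), OWL.sameAs, URIRef(external)))

    links.serialize(destination="links.ttl", format="turtle")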

If data publishers are seriously interested in promoting the use of their data then this is a cost worth bearing. Why constantly reinvent the wheel by creating new sets of links for every application that uses a certain combination of data sets? Certainly, as a UK taxpayer, I would rather the UK Government made this one-off investment in publishing and linking RDF data, thereby lowering the cost for everyone who wants to use the data. This is the way to build a vibrant economy around open data.

Filed under: Linked Data, Semantic Web · 11 Comments

The demise of community.linkeddata.org

March 26th, 2010 by Tom Heath

The issue of what happened to the community.linkeddata.org site came up in this thread on the public-lod mailing list. In the name of the public record I’m posting some of the messages I have related to this issue. I’ll try to fill in any gaps in due course (let me know if there are specific gaps of interest to you); in the meantime I’m keen to get the key bits online.

Some background is here:
lists.w3.org/Archives/Public/public-lod/2008Apr/0096.html



from    Michael Hausenblas <michael.hausenblas@d...>
to    Ted Thibodeau Jr <tthibodeau@o...>
cc    Kingsley Idehen <kidehen@o...>,Tom Heath <tom.heath@t...>
date    9 February 2009 18:27
subject    Re: "powered by" logos on linkeddata.org MediaWiki

MacTed,

I'll likely not invest time anymore in the Wiki [the MediaWiki instance at community.linkeddata.org - TH]. The plan is to transfer everything to Drupal. We had a lot of hassle with the Wiki configuration and community contribution was rather low. After the spam attack we decided to close it. It only contains few valuable things (glossary and iM maybe) ..

Do you have an account at linkeddata.org Drupal, yet? Otherwise, Tom, would you please be so kind?

Again, sorry for the delay ... it's LDOW-paper-write-up time

Cheers,
Michael


--
Dr. Michael Hausenblas
DERI - Digital Enterprise Research Institute
National University of Ireland, Lower Dangan,
Galway, Ireland, Europe
Tel. +353 91 495730



> From: Ted Thibodeau Jr <tthibodeau@o...>
> Date: Fri, 6 Feb 2009 16:22:31 -0500
> To: Michael Hausenblas <michael.hausenblas@d...>
> Cc: Kingsley Idehen <kidehen@o...>
> Subject: "powered by" logos on linkeddata.org MediaWiki
>
> Hi, Michael --
>
> re: <community.linkeddata.org/MediaWiki/index.php?Main_Page>
>
> It appears that the "Powered by Virtuoso" logo that was once alongside
> the
> "Powered by Mediawiki" logo (lower right of every page) has disappeared
> from the main page boilerplate.  Can that get re-added, please?
>
> Please use this logo --
>
>
> <boards.openlinksw.com/support/styles/prosilver/theme/images/virt_power
> _no_border.png
>>
>
> -- and make it href link to --
>
>     <virtuoso.openlinksw.com/>
>
> Please let me know if there's any difficulty with this.
>
> Thanks,
>
> Ted
>
>
> --
> A: Yes.                      www.guckes.net/faq/attribution.html
> | Q: Are you sure?
> | | A: Because it reverses the logical flow of conversation.
> | | | Q: Why is top posting frowned upon?
>
> Ted Thibodeau, Jr.           //               voice +1-781-273-0900 x32
> Evangelism & Support         //        mailto:tthibodeau@o...
> OpenLink Software, Inc.      //              www.openlinksw.com/
>                                   www.openlinksw.com/weblogs/uda/
> OpenLink Blogs              www.openlinksw.com/weblogs/virtuoso/
>                                 www.openlinksw.com/blog/~kidehen/
>      Universal Data Access and Virtual Database Technology Providers




from Tom Heath
to Michael Hausenblas
date 9 March 2009 13:49
subject Re: linkeddata.org/domains?
mailed-by talisplatform.com

Hey Michael,

Re 2. great! I've created this node
and put it near the top of the primary navigation. You should be able
to write to that at will