bobdc.blog: Bob DuCharme's weblog, mostly on technology for representing and linking information.

"Learning SPARQL" now available (Bob DuCharme, 2011-07-27)

In print and ebook formats.


I'm very happy to announce that the ebook and print editions of Learning SPARQL are now available from O'Reilly. Print editions are also available from amazon.com, amazon.co.uk, maybe some more Amazons, and Barnes and Noble. (Borders says that it's on backorder, but I wouldn't hold my breath for that.) You can read more about how I came to write the book in an earlier blog posting.

Right now it's the only complete book on the W3C standard query language for linked data and the semantic web, and as far as I know the only book at all that covers the full range of SPARQL 1.1 features such as the ability to update data. The book steps you through simple examples that can all be performed with free software, and all sample queries, data, and output are available on the book's website. In the words of Priscilla Walmsley, "It's excellent—very well organized and written, a completely painless read. I not only feel like I understand SPARQL now, but I have a much better idea why RDF is useful (I was a little skeptical before!)"

I will continue to post news about the book and about SPARQL on the book's twitter account at @LearningSPARQL. I'm not starting a separate blog for the book, so I will continue to blog about SPARQL here.

Linking linked data to U.S. law (Bob DuCharme, 2011-07-08)

Automating conversion of citations into URLs.

At a recent W3C Government Linked Data Working Group meeting, I started thinking more about the role in linked data of laws that are published online. To summarize, you don't want to publish the laws themselves as triples, because they're a bad fit for the triples data model, but as online documents relevant to a lot of issues out there, they make an excellent set of resources to point to, although you may not always get the granularity you want.

Plenty of government data references laws and related materials.

I'm discussing U.S. Federal law here, but similar principles should apply both in individual states and in other countries. The main sets of laws here are legislation, code, regulations, and court decisions. ("Code" refers to laws passed by legislation, arranged by topic; for example, laws passed about taxes are gathered into the Internal Revenue Code.) If you really want to learn about the various forms of legal material and their relationship, I highly recommend the book Finding the Law, which I found indispensable when I worked at LexisNexis.

Most law consists of narrative sentences arranged as paragraphs, often with metadata assigned to certain blocks of it. It's such a good fit for XML that legal publishers were among the first users of XML's predecessor, SGML. (Their use of XML and SGML accounts for a large chunk of my career, and I know that some old XML friends like Sean McGrath and Dale Waldt continue to make great contributions in this area.) So, while you wouldn't get much benefit from splitting these sentences and paragraphs into subjects, predicates, and objects and publishing them as triples, plenty of government data references laws and related materials, and it's more helpful if they can reference them with URLs that lead to the actual laws. To add these URLs with any kind of scalability, you need to find out the common format for citing a document (or, if possible, a point within a document) and an online source of those legal documents whose URLs can be built from that citation format with a regular expression or some other automated tool.

When creating links to any specific bits of U.S. law, the most valuable book is The Bluebook: A Uniform System of Citation. As the subtitle implies, the book describes the normalized way to refer to legal documents and their components. Once you know these, a regular expression can often turn them into a URL that leads a browser right to the part you want. For example, while people often refer to the Supreme Court case outlawing school segregation as "Brown v. Board of Education", its official name is "347 U.S. 483", which means "the case beginning on page 483 of volume 347 of the official publication of U.S. Supreme Court decisions".

While there are several sites hosting Supreme Court decisions out there, notably Cornell Law School's Legal Information Institute, the one whose URLs are easiest to construct from a proper Supreme Court citation is justia.com, where the URL for Brown v. Board of Education is supreme.justia.com/us/347/483/case.html. (See also my favorite case, Campbell aka Skyywalker et al v. Acuff Rose Music, Inc. at supreme.justia.com/us/510/569/case.html. Make sure to listen to the relevant work on YouTube while you review it.) If you're really interested in linked data and U.S. Supreme Court cases, DBpedia has lots of great metadata for many important cases, as I wrote about in Court decision metadata and DBpedia.
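As a rough sketch of that regular-expression approach, here's what the citation-to-URL step might look like in Python, assuming the supreme.justia.com pattern shown above; real citation text has more variations than this handles, and the function name is just illustrative.

import re

def scotus_citation_to_url(text):
    """Turn a U.S. Reports citation such as "347 U.S. 483" into a
    supreme.justia.com URL built from its volume and page numbers."""
    match = re.search(r"(\d+)\s+U\.S\.\s+(\d+)", text)
    if match is None:
        return None
    volume, page = match.groups()
    return "http://supreme.justia.com/us/%s/%s/case.html" % (volume, page)

# "Brown v. Board of Education, 347 U.S. 483" becomes
# http://supreme.justia.com/us/347/483/case.html
print(scotus_citation_to_url("Brown v. Board of Education, 347 U.S. 483"))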

To create a URL for other U.S. court systems, you'll have to look up the proper way to cite them in a resource like the Bluebook and then look for versions of that court's cases online with URLs that reflect the citation in a manner that lets you automate the creation of the URL. This is a theme for linking to any kind of law on the web, and you can be sure that developers at the Legal Information Institute, LexisNexis, WestLaw, and other legal publishers have put plenty of time into developing regular expressions to make this happen so that they can turn plain text citations into hypertext links. (It would be great if the LII made their regular expressions public. LexisNexis and WestLaw never would, although they're more interested in keeping such proprietary work away from each other than from us.)

Legislation can be more complicated, but two excellent resources make it remarkably simple: the Library of Congress's THOMAS system lets you create persistent URLs for legislation using the handle system (see also its inventor's web page on it), which I hadn't heard of before the Government Linked Data meeting. The Law Librarian Blog has a nice entry showing examples of how to use it. LegisLink is another way to link to legislation, and looks simpler to me. A Legal Information Institute blog entry has a good explanation of this, and LegisLink provides an excellent form to construct the URLs. These even let you construct links to a specific section of a piece of legislation.

Granularity is an even bigger issue when linking to code and regulations, which are often broken down into numbered and lettered pieces of pieces of pieces. Ever since I worked at the grandly named Research Institute of America (a publisher of hyperlinked U.S. tax law and related information), it's always irked me to see people refer to a pension plan as a 401K, because as subsection k of section 401 of the U.S. Tax Code (title 26 of the U.S. Code), it's more properly written 401(k), or, to use its full name, 26 USC 401(k). The Government Printing Office lets you link directly to section 401, if not subsection k, with the URL frwebgate.access.gpo.gov/cgi-bin/getdoc.cgi?dbname=browse_usc&docid=Cite:+26USC401, and the LII lets you link to it with www.law.cornell.edu/uscode/26/usc_sec_26_00000401----000-.html.

That's the US Code, which arranges the laws by topic. Regulations are arranged by topic in the CFR, or Code of Federal Regulations. For example, the legal definition of bourbon is in title 27 of the CFR (Alcohol, Tobacco Products and Firearms), Part 5 (Labeling and Advertising of Distilled Spirits), section 22 (The standards of identity), subsection b (Class 2; whisky), subsubsubsection (1)(i). The full citation would be 27 CFR 5.22(b)(1)(i), but I know of no way to link to anything more specific than 27 CFR 5.22: edocket.access.gpo.gov/cfr_2010/aprqtr/27cfr5.22.htm. (Bookmark that on your phone's browser and then bet a Maker's Mark with the next barroom loudmouth that you hear insisting that bourbon must legally be made in Bourbon County, Kentucky. He's wrong. It can be made anywhere in the United States.)
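The same regular-expression trick covers these section-level URLs. A minimal sketch, assuming the simple "26 USC 401" and "27 CFR 5.22" citation forms and the GPO URL patterns quoted in the last two paragraphs; the cfr_2010/aprqtr path segment is hardcoded from the example above, so a real script would have to pick the right edition.

import re

def usc_url(citation):
    """Build a GPO URL for a U.S. Code citation like "26 USC 401",
    down to the section (subsections such as (k) are ignored)."""
    match = re.search(r"(\d+)\s+USC\s+(\d+)", citation)
    if match is None:
        return None
    title, section = match.groups()
    return ("http://frwebgate.access.gpo.gov/cgi-bin/getdoc.cgi"
            "?dbname=browse_usc&docid=Cite:+%sUSC%s" % (title, section))

def cfr_url(citation):
    """Build a GPO URL for a CFR citation like "27 CFR 5.22"."""
    match = re.search(r"(\d+)\s+CFR\s+(\d+\.\d+)", citation)
    if match is None:
        return None
    title, section = match.groups()
    return "http://edocket.access.gpo.gov/cfr_2010/aprqtr/%scfr%s.htm" % (title, section)

print(usc_url("26 USC 401(k)"))        # links to section 401, not subsection k
print(cfr_url("27 CFR 5.22(b)(1)(i)")) # links to 27 CFR 5.22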

As you can see, there's some work involved in creating URLs for links to laws, but research for this blog entry led me to new resources like LegisLink that I hadn't heard of before, so I encourage you to let me know if there's anything important that I'm missing.

It was also interesting to see that the LII is involved in efforts to create an international standard for legal document URIs proposed by some Italian legal researchers. (This is particularly interesting when you consider that Italian legal researchers basically invented the concept of linking 900 years ago.)


A comment from Frank Bennett of Nagoya University's Faculty of Law:

These are indeed important developments. The systematic linking of case law and statutory data promises to have a large and positive impact on our access to legal resources. The only point I would take issue with is the reliance on Bluebook citation forms as the Rosetta Stone for identifying resources. Parsing cites out of plain text is a necessary kludge, given the general absence of meaningful structured metadata from online legal resources (thank you Lexis, thank you WestLaw), but it should be recognized as a kludge.

To get a lively set of service layers running on top of legal data, the metadata contained in or relevant to a particular case, statutory provision or regulatory provision needs to be readily accessible to calling applications. While it is true that string parsing machinery can be written to a good standard, assuming perfectly regular citation forms and uniform document formats, neither of those constraints applies in the wild. The Bluebook shares the field in North America with the ALWD and the McGill Guide. To make matters worse, the Bluebook specifies citation forms for some foreign legal resources that vary significantly from the native citation forms of the target jurisdictions. Document formats vary as well, so getting an accurate string parse may require special-purpose serialization of the document before applying a string parser to the text -- which may be hundreds of pages in length. Although certainly better than nothing, string parsing is a fragile strategy that would be very cumbersome to standardize and does not scale well.

Matching rendered cites to URLs is an important prospect, but we won't see significant progress at the application level until the intervening step of producing true structured metadata -- and embedding it in our online resources -- is covered.


A comment from Augusto Herrmann:

I just read your interesting article entitled "Linking linked data to U.S. law". I'd like to point you to a quite successful government project that uses URNs for Brazilian legislation. The portal where you can search for legislation is at www.lexml.gov.br and information about the project can be found on projeto.lexml.gov.br . There you can find the document "Parte 2: LEXML URN" which describes the rules to construct official URNs for legislation and court decisions (it's in Portuguese, though). The project started circa 2004 and closely followed the footsteps of the Italian Norme in Rete project. If you aren't yet familiar with it, it's worth a look (see also akomantoso.org and metalex.eu).


(Note on comments: after turning off comments on this blog for a few days because of comment spam, turning them back on seems to have no effect. If you send me an email about what I've written at snee.com (bob), I'll add it and any response here.)

My upcoming O'Reilly book: "Learning SPARQL" (Bob DuCharme, 2011-06-01)

Querying and Updating with SPARQL 1.1.


51 weeks ago at last year's semtech I couldn't believe that there was still no book about SPARQL available. I had accumulated notes for such a book, and by that point I'd learned enough about SPARQL as a TopQuadrant employee that I decided to start studying the specifications (and especially the 1.1 update) more systematically and write the book myself. (This explains why I've been writing less on my blog in the last year and writing about SPARQL more when I do.)

I'm proud to announce that I'm publishing the book with O'Reilly. Print and electronic versions will be available in July at the latest, and we're already planning on releasing an expanded edition with additional new material and any necessary updates once SPARQL 1.1 becomes a Recommendation. Anyone who buys the ebook version of the first edition will get the expanded edition on SPARQL 1.1 at no extra cost.

As you can tell from the book's cover on the right, the O'Reilly animal for this one is the anglerfish—the one with the light that hangs off the front of its head, for the pun on "sparkle". (I should really pick up the nightlight version of this lovely fish.)

From what I've seen so far, the only coverage of SPARQL in any existing books is a chapter or two in more general books on the semantic web, and I haven't seen any coverage of SPARQL 1.1 in those books just yet. (The second edition of Dean Allemang and Jim Hendler's Semantic Web for the Working Ontologist, which is available on Amazon today, covers some SPARQL 1.1 query features, but not SPARQL Update.) "Learning SPARQL" is the first complete book on SPARQL, and covers both 1.0 and 1.1—including SPARQL Update—with working sample queries and data that you can try yourself with free software.

I parked the domain name learningsparql.com some time ago, and now there's a full web site about the book there. For up-to-date information about the book's availability and SPARQL news in general, subscribe to the twitter feed @LearningSPARQL.

Semantic web technology at NASA: lower costs and greater productivity (Bob DuCharme, 2011-05-27)

An inspiring story.

Ian Jacobs's recent interview with NASA's Jeanne Holm on the W3C website is an excellent case study of semantic web technology. It's not a long article, so I recommend that you read the whole thing. Here are a few points that caught my eye:

  • She gives nice hard numbers about money spent and money saved, and notes a downward trend in the costs.

  • They used publication data to infer social networks and shared expertise and found other related ways to reduce the need for staff data entry.

  • The use of service agreements encouraged people to share data more easily.

  • This sharing led to demonstrated serendipitous reuse of data.

  • They plan to network the vocabularies (she doesn't use this term literally—I know it from a TopQuadrant context—but she's clearly talking about the same thing).

It was nice to see the credit that she gave to Kendall Clark. With my TopQuadrant hat on, I wish she'd mentioned some of the extensive work that Ralph Hodgson has done there, but NASA is a big organization.

After reading Danny Ayers' Smell the coffee blog post this morning, which wasn't very hopeful about recent progress in the semantic web, I hoped that Ian's interview with Jeanne would cheer him up.

Using SPARQL to find the right DBpedia URI (Bob DuCharme, 2011-05-17)

Even with the wrong name.


In Pulling SKOS prefLabel and altLabel values out of DBpedia, I described how Wikipedia and DBpedia store useful data about alternative names for resources described on Wikipedia, and I showed how you can use these to populate a SKOS dataset's alternative and preferred label properties. Today I want to show how to use these as part of an application that lets you retrieve data even when you don't necessarily have the right name for something—for example, retrieving a picture of Bob Marley using the misspelled version of his name "Bob Marly".

The DBpedia page for Bob Marley shows that dbpedia:Bob_Marly is one of the dbpedia-owl:wikiPageRedirects values of dbpedia.org/page/Bob_Marley. This means that if you send your browser to en.wikipedia.org/wiki/Bob_Marly, you'll end up on en.wikipedia.org/wiki/Bob_Marley.

What that page doesn't show is that this redirect URI has the rdfs:label value "Bob Marly"@en associated with it, and this is the really handy part for retrieving data based on not-quite-right values. Because of this, the following SPARQL query will return the URI dbpedia.org/resource/Bob_Marley whether the quoted literal value is "Bob Marly" or "Bob Marley":

# First two PREFIX declarations unnecessary on SNORQL
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?s WHERE {
  {
    ?s rdfs:label "Bob Marly"@en ;
       a owl:Thing .
  }
  UNION
  {
    ?altName rdfs:label "Bob Marly"@en ;
             dbo:wikiPageRedirects ?s .
  }
}

The graph pattern before the UNION keyword checks whether there is an actual Wikipedia page for the quoted value, and the part after checks whether it's a redirect of something else. Effectively, it will be one or the other; there are only about a dozen labels in DBpedia that can be both.

To use this in a simple application, I created a form that, after you enter a name on it, attempts to display a picture of what you entered. Because the redirect data includes common misspellings as well as nicknames, entering "Bob Marly" will get you a picture of Marley and the URL of the actual resource, as shown below the picture above. Other interesting nicknames and misspellings to try are Bob Dillan, Mary Casat, Prince Billy, Big Blue, and Proctor and Gamble. (Warning: DBpedia image data is incorrect for some very well-known people, like Abraham Lincoln and Barack Obama, even when the Wikipedia page has a picture, so you may see the symbol for a broken image link. I had hoped that the picture above would have a title of "Abe Lincon".)

Because the output creates a specialized web page, I used the technique I described in Build Wikipedia query forms with semantic technology (which can be used with any SPARQL endpoint, not just DBpedia): a CGI Python script stores a SPARQL query, replaces a string in that query with whatever was entered in the form, sends the query off to the endpoint, and then sends HTML based on the result back to the browser. You can see the source here.
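If you'd rather not click through, here's a stripped-down sketch of that pattern rather than the actual script: it assumes the form submits a field named "name" via GET (an assumption on my part) and DBpedia's public endpoint, and it just echoes the raw results instead of building a picture page.

#!/usr/bin/env python
"""Minimal CGI sketch: substitute a submitted name into a stored SPARQL
query, send it to DBpedia, and relay the results to the browser."""

import os
import urllib.parse
import urllib.request

QUERY_TEMPLATE = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?s WHERE {
  { ?s rdfs:label "NAME"@en ; a owl:Thing . }
  UNION
  { ?altName rdfs:label "NAME"@en ; dbo:wikiPageRedirects ?s . }
}
"""

# Pull the submitted value out of the query string ("name" is an assumed
# field name) and substitute it into the stored query.
form = urllib.parse.parse_qs(os.environ.get("QUERY_STRING", ""))
name = form.get("name", ["Bob Marly"])[0]
query = QUERY_TEMPLATE.replace("NAME", name)

# Send the query to the endpoint and pass the response back to the browser.
url = "http://dbpedia.org/sparql?" + urllib.parse.urlencode(
    {"query": query, "format": "application/sparql-results+json"})
with urllib.request.urlopen(url) as response:
    results = response.read().decode("utf-8")

print("Content-Type: text/plain")
print()
print(results)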

It's safe to say that this ability to find the right information based on a nickname or common misspelling could add a lot to many applications. Once again, while the most important part of the semantic web is the data—in this case, DBpedia's wikiPageRedirects values—and not the standards and technologies used to get at the data, the existence of so much useful SPARQL-accessible data should make the SPARQL query language look more and more appealing to people who might have doubted before.

SKOS overview article on IBM developerWorks (Bob DuCharme, 2011-05-11)

SKOS, vocabulary management, the semantic web, and more


I've been interested in the SKOS standard for vocabulary management for several years (and written about it here several times), but since we at TopQuadrant first began planning out the Enterprise Vocabulary Net product, I've learned a lot more about the theory and practice of using SKOS. I've recently written up an overview of SKOS and where it fits into vocabulary management and the semantic web, and IBM developerWorks has just published this as Improve your taxonomy management using the W3C SKOS standard. I hope it proves useful to people who want to learn more about SKOS.

Quick and dirty linked data content negotiation (Bob DuCharme, 2011-05-09)

Not even that dirty.

I've managed to fill a key gap in the world's supply of Linked Open Data by publishing triples that connect Mad Magazine film parody titles to the DBpedia URIs of the actual films. For example:

<http://dbpedia.org/resource/Judge_Dredd_%28film%29>
      mad:FilmParody
              [ prism:CoverDate "1995-08-00" ;
                prism:issueIdentifier
                        "338" ;
                dc:title "Judge Dreck"
              ] .

<http://dbpedia.org/resource/2001:_A_Space_Odyssey_%28film%29>
      mad:FilmParody
              [ prism:CoverDate "1969-03-00" ;
                prism:issueIdentifier "125" ;
                dc:title "201 Minutes of a Space Idiocy"
              ] .

(To prepare the data, I scraped a Wikipedia list, tested the URIs, then hand-corrected a few.) To really make this serious RESTful linked open data, I wanted to make it available as both RDF/XML and Turtle depending on the Accept value in the header of the HTTP request. All this took was a few lines in the .htaccess file (which I've been learning more about lately) in the directory storing the RDF/XML and Turtle versions of the data.

For example, either of the following two commands retrieves the Turtle version:

wget --header="Accept: text/turtle" http://www.rdfdata.org/dat/MadFilmParodies/
curl --header "Accept: text/turtle" -L http://www.rdfdata.org/dat/MadFilmParodies/

Substituting application/rdf+xml for text/turtle in either command gets you the RDF/XML version, and omitting the --header parameter altogether gets you an HTML version.

Here's the complete .htaccess file:

RewriteEngine on

RewriteCond %{HTTP_ACCEPT} ^.*text/turtle.*
RewriteRule ^index.html$ http://www.rdfdata.org/dat/MadFilmParodies/MadFilmParodies.ttl [L]
# no luck:
#RewriteRule ^index.html$ http://www.rdfdata.org/dat/MadFilmParodies/MadFilmParodies.ttl [R=303,L]

RewriteCond %{HTTP_ACCEPT} ^.*application/rdf\+xml.*
RewriteRule ^index.html$ http://www.rdfdata.org/dat/MadFilmParodies/MadFilmParodies.rdf [L]

RewriteRule ^index.html$ http://en.wikipedia.org/wiki/List_of_Mad's_movie_spoofs

The Apache web server where I have this hosted is configured to look for an index.html file in a directory if the requested URL doesn't mention a specific filename, so the three rules here each modify that "request" to look for something else, depending on what the RewriteCond line finds in the HTTP_ACCEPT value. If it finds "text/turtle", it sends the Turtle version of my data, and the L directive tells the Apache mod_rewrite module that is processing these instructions not to look at any more of them.

The next rule performs the corresponding HTTP_ACCEPT check and file delivery for an RDF/XML request, and the default behavior if neither of those happen is to deliver an HTML version of the data. (I took the lazy way out and just redirected to the appropriate Wikipedia page instead of creating a new HTML file.) As you can see from the two commented-out lines, I had the impression that adding R=303 in the brackets with the L would send an HTTP return code of 303 back to the requester, overriding the default code of 302, but never got that to work. If anyone has any suggestions about how to fix this, or whether 303 is even the most appropriate return code, please let me know.

From what I've read on how the syntax of these instructions works, I shouldn't have needed the full URLs for the Turtle and RDF/XML versions of the Mad Film Parody data, because they were in the same directory as the .htaccess file, but that was the only way I could get this to work.

Now that I know how to do this, I can do it again for other resources pretty quickly. It took me about five minutes to do it for the little www.snee.com/ns/madMag/MadFilmParody ontology that the data points to. I consider this solution quick and a bit dirty because it requires the maintenance of two copies of the data, but the XML guy in me knows that it would be wrong to perform parallel edits on the two copies, and that I should instead pick one as a master, edit it when necessary, and generate the other from it. If I had to do this on a larger scale, I learned from Brian Sletten at last year's semtech that I should look into NetKernel, but it was a good exercise to do it this way to learn what was really going on.
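As for regenerating one copy from the master, that step can be a very short script. A rough sketch using the Python rdflib library (my choice for illustration; any toolkit that reads Turtle and writes RDF/XML would do the same job): treat the Turtle file as the master and rebuild the RDF/XML whenever the Turtle changes.

from rdflib import Graph

# Parse the hand-maintained Turtle master...
g = Graph()
g.parse("MadFilmParodies.ttl", format="turtle")

# ...and regenerate the RDF/XML copy that the .htaccess rules serve
# for application/rdf+xml requests.
g.serialize(destination="MadFilmParodies.rdf", format="xml")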

I'm going to try to get into the habit of doing this for data and ontologies that I create, so I'd appreciate any suggestions about tweaking details before any suboptimal aspects of this become habits.

Data providers (Bob DuCharme, 2011-05-02)

RDF or otherwise.

While beta testing Talis's Kasabi, I got to wondering about the data publishing market: who out there is hosting raw data, potentially charging for it and passing money along to the data's providers? Poking around, I learned who the key names are. (Corrections welcome.) I accidentally stumbled across a few more when I followed a tweet from @xmlgrrl (a.k.a. Eve Maler, a friend of mine in the XML world since it was the SGML world) and started looking at her husband Eli's blog. His posting Ten services to get your cloud startup off the ground now mentioned a few more companies that provide raw data—one that even provides free RDF. I tagged a few with a delicious.com bookmark, but wanted to write out notes about a few here in order of how interesting they are to a semantic web geek.

Some general notes:

  • The more I studied, the more I found, but I didn't want to spend more than an afternoon on this.

  • These sites all let you download data directly. I didn't include sites like Data.gov that function more as directories that link to data sources on other sites.

  • Most of these providers have boosted their numbers of available datasets by including small datasets with as few as 100 records, and by hosting copies of data from the well-known names in the Linked Data Cloud. The advertised added value is typically the ease of programmatic access to that data.

  • Despite the title of this blog entry (I was tempted to call it "Data resellers", but many make the data available for free), I focused on a narrower class of data providers: the redistributors that gather data from specific, identified places and then make it available publicly with attribution, not the actual data sources themselves such as government agencies, university projects, media making their metadata available, and various other circles on the Linked Data Cloud diagram.

  • If I've quoted some companies' websites more than others, it's because they had "About" and "FAQ" pages that were easy to find and actually answered the questions I was wondering about.

The most interesting thing about Kasabi in this field is their commitment to providing data according to Linked Data principles, giving you SPARQL endpoints for data sources and the ability to define new APIs around each data source. The current data selection is interesting, considering that Kasabi is still in beta. For now it all looks like data that is freely available elsewhere, but the advantages of retrieving it from them go beyond the ability to use the SPARQL query language. For example, with BestBuy's RDFa spread out across many different dynamically generated pages on bestbuy.com, querying this data from BestBuy's server has a lot of limitations. Kasabi seems to have the BestBuy data aggregated so that their customers have more flexibility in how they query it.

While disintermediation was a big buzzword of the dot com boom, intermediation is now getting bigger.

I list Socrata right after Kasabi because RDF is one of their export formats, along with XML, JSON, CSV, XLS, and more. In a business that depends on finding both data providers and data users, their home page makes the clearest case about why someone should work with them as a data provider: they're clearly targeting government agencies who need to fulfill data transparency mandates. (Other providers are certainly targeting this market; just not as clearly.) The company info page calls them "The Leader in Open Data Services for Government". Another paragraph on the homepage makes a nice case for why developers should be interested in their data, and upcoming webinar titles of "Launch your own Data.Gov" and "Open Data as a Service Delivery Platform" are also pretty catchy to someone interested in this market.

Factual targets data users more than data providers on their current home page, telling developers "Access great data for your web and mobile apps". The only download format I could find was CSV, but with their emphasis on helping developers build apps, they focus more on data delivery through their RESTful API. According to their FAQ, "Factual, Inc. is an open data platform for application developers that leverages large scale aggregation and community exchange... Factual's hosted data comes from our community of users, developers and partners, and from our powerful data mining tools... Factual offers several hundred thousand datasets across a variety of topics (with a deep focus in Local) aggregated from multiple sources, made easily accessible for developers to build web and mobile apps... Our APIs are free to everyone—if you want SLAs or have certain performance requirements, we would charge you a fee based on usage volume. Our downloads are free for smaller developers". A press release on Semantifi's web site shows that some big names and big money are behind Factual.

Infochimps seems to be one of the more well-known (and memorable) names in the field. From their FAQ: "Infochimps is a place for people to find, share and sell formatted data. Both users and Infochimps employees scrape, parse and format data so that it's easily accessible to you. We take the chimp work out of working with data so you can literally start building cool stuff in minutes... There is no sign up fee to use Infochimps. Some of the data sets available on our site are free. Some require attribution, and others are available for purchase. The first 100,000 data API calls are free. We offer subscriptions if you would like to use more... The data sets available through our API are 1.) hosted for you and 2.) scraped on a regular basis. ... Most of our data comes in tsv, csv or yaml format". The part about users scraping, parsing, and formatting highlights another aspect of the business model of some of these companies: crowd-sourcing the labor whenever possible.

AggData sells CSV files, typically of locations of all the stores in a particular chain. For example, a complete list of Cinnabon locations, with 454 records, costs $29. The description page for each data set lists the fields and lets you download a sample. Prices that I saw ranged from $9 to $49. According to their FAQ, you order a dataset, and when payment is confirmed they email you a URL for the data that is good for 5 downloads or 120 hours. Being founded in 2006 and therefore the oldest of these companies, AggData is the most low-tech (no APIs here) but it's a lot easier to look at their lists of franchise locations and churches and imagine that data being useful to someone than it is for many of the other data providers. Infochimps lists AggData as a "featured data provider", but lists the same prices for the same datasets, so I'm not sure whether they're just routing you to the same batches of data or making it available through their own APIs. (I got an Infochimps ID, clicked through for an AggData dataset until it asked me for credit card information, and stopped there.)

According to their About page,