Billion Triple Challenge 2012 Dataset

The BTC 2012 dataset serves as the basis for submissions to the Billion Triples Track of the Semantic Web Challenge.

Description

The dataset was crawled during May/June 2012. Unlike previous years (in which we started from random URI samples), we used several seed sets collected from multiple sources.

We rewrote blank node identifiers to include the data source, so that blank nodes are unique per data source, and appended the data source to the output file. The data is encoded in N-Quads format.
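
As a rough illustration of working with the files, the following Python sketch streams a gzipped N-Quads file and counts statements per context (the fourth element of each quad). It assumes one statement per line and skips lines that do not parse, which suits noisy web data but is not a full N-Quads parser. The function names are hypothetical and not part of the dataset tooling.

```python
import gzip
import re
from collections import Counter

# One RDF term: an IRI, a blank node, or a literal with optional datatype/language tag.
TERM = r'(<[^>]*>|_:[^\s]+|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?)'
# A quad: subject, predicate, object, context, then a terminating dot.
QUAD = re.compile(
    r'^\s*' + TERM + r'\s+' + TERM + r'\s+' + TERM + r'\s+' + TERM + r'\s*\.\s*$'
)


def parse_quad(line):
    """Return (subject, predicate, object, context), or None if the line does not match."""
    match = QUAD.match(line)
    return match.groups() if match else None


def iter_quads(path):
    """Yield parsed quads from a gzipped file, silently skipping malformed lines."""
    # errors="replace" keeps the reader going past broken character encodings.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            quad = parse_quad(line)
            if quad is not None:
                yield quad


def count_by_context(quads):
    """Count how many quads each context (data source) contributes."""
    return Counter(context for _, _, _, context in quads)
```

For example, `count_by_context(iter_quads("timbl/data-0.nq.gz"))` would give the number of statements per data source in the first Timbl round, assuming the file has been downloaded to that relative path.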

The individual crawls contain a file "redirects.nx.gz" which consists of "source target ." tuples derived from 302 and 303 redirect HTTP response codes. In addition, we provide "access.log.gz" files in Squid access.log format.
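
The redirect tuples can be used to map a requested URI to the URI that was finally retrieved. The sketch below is one possible way to do this in Python. It assumes each line has the form `<source> <target> .` with URIs in angle brackets, which is an assumption about the file layout, and the function names and the `max_hops` limit are hypothetical.

```python
import gzip


def parse_redirect(line):
    """Parse one 'source target .' line; return (source, target) or None."""
    parts = line.split()
    if len(parts) != 3 or parts[2] != ".":
        return None
    # Assumption: URIs are wrapped in angle brackets, as in N-Triples terms.
    return parts[0].strip("<>"), parts[1].strip("<>")


def load_redirects(lines):
    """Build a source -> target mapping from an iterable of lines."""
    redirects = {}
    for line in lines:
        pair = parse_redirect(line)
        if pair is not None:
            source, target = pair
            redirects[source] = target
    return redirects


def load_redirects_file(path):
    """Read a gzipped redirects file into a mapping."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return load_redirects(handle)


def resolve(uri, redirects, max_hops=10):
    """Follow redirects from uri, stopping at a loop or after max_hops steps."""
    seen = {uri}
    for _ in range(max_hops):
        target = redirects.get(uri)
        if target is None or target in seen:
            break
        seen.add(target)
        uri = target
    return uri
```

The loop guard matters because redirect data collected from the web can contain cycles.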

Please note that the BTC dataset is collected from the web and as such is of varying quality. Dealing with noisy data is part of the fun you'll have when working with web data.

Crawls

We started the crawls with seed sets collected from several sources. DBpedia and Freebase URIs were crawled separately and thus excluded from the Datahub, Rest and Timbl datasets.

We performed the crawling in rounds. For each round we provide data-{round}.nq.gz, redirects-{round}.nx.gz and access-{round}.log.gz files.
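
The naming pattern can be expanded programmatically. The following Python sketch lists the expected file names per crawl and round. The round counts are read off the data files in the table below, and placing the redirects and access files in the same crawl directory as the data files is an assumption. `ROUNDS`, `round_files` and `all_files` are hypothetical names.

```python
# Number of rounds per crawl, taken from the data files listed in the table.
ROUNDS = {"datahub": 5, "dbpedia": 1, "freebase": 1, "rest": 3, "timbl": 7}


def round_files(crawl, round_number):
    """Return the three file names provided for one round of one crawl."""
    # Assumption: all three files sit in the crawl's own directory.
    return [
        f"{crawl}/data-{round_number}.nq.gz",
        f"{crawl}/redirects-{round_number}.nx.gz",
        f"{crawl}/access-{round_number}.log.gz",
    ]


def all_files(rounds=ROUNDS):
    """Return the file names for every round of every crawl."""
    return [
        path
        for crawl, count in rounds.items()
        for round_number in range(count)
        for path in round_files(crawl, round_number)
    ]
```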

File                      Quads         Size (gz)
datahub/data-0.nq.gz      45595         450K
datahub/data-1.nq.gz      804375        7.5M
datahub/data-2.nq.gz      19655239      165M
datahub/data-3.nq.gz      80596583      1010M
datahub/data-4.nq.gz      808977190     7.1G
dbpedia/data-0.nq.gz      198090024     4.5G
freebase/data-0.nq.gz     101241556     981M
rest/data-0.nq.gz         1967224       32M
rest/data-1.nq.gz         6617276       74M
rest/data-2.nq.gz         13743742      164M
timbl/data-0.nq.gz        89            2.5K
timbl/data-1.nq.gz        16516         293K
timbl/data-2.nq.gz        87250         1.2M
timbl/data-3.nq.gz        388412        5.1M
timbl/data-4.nq.gz        9405528       113M
timbl/data-5.nq.gz        93898523      1017M
timbl/data-6.nq.gz        101010423     1.2G
Total                     1436545545    17G

Datahub

The seed set for the Datahub crawl contained all example URIs marked example/* where the "*" is an RDF serialisation (thanks to Pablo Mendes for providing the URIs).

The crawl was breadth-first with hop 4 expansion. You can find the Datahub files at datahub/.

DBpedia

The seed set for the DBpedia crawl contained all DBpedia URIs from the DBpedia 3.7 dump.

No links were expanded. You can find the DBpedia files at dbpedia/.

Freebase

The seed set for the Freebase crawl contained all Freebase URIs involved in an owl:sameAs relation in the DBpedia 3.7 dump.

No links were expanded. You can find the Freebase files at freebase/. Note that, due to call limits, some lookups resulted in a 403 Forbidden response.

Rest

The seed set for the Rest crawl contained all other URIs involved in an owl:sameAs relation in the DBpedia 3.7 dump.

The crawl was breadth-first with hop 2 expansion. You can find the Rest files at rest/.

Timbl

The seed set for the Timbl crawl consisted of Tim Berners-Lee's FOAF file (www.w3.org/People/Berners-Lee/card.rdf).

The crawl was breadth-first with hop 6 expansion. You can find the Timbl files at timbl/.

Download

To fetch the content of the entire directory, download the 000-CONTENTS file and run

    $ wget -x -nH -i 000-CONTENTS

which downloads the files while preserving the directory structure.
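
Where wget is not available, something similar can be approximated in Python. The sketch below assumes that 000-CONTENTS lists one absolute URL per line, and it derives each local path from the URL path alone, without a host directory, which is roughly what the -x and -nH options do. It does not reproduce wget's handling of query strings or directory URLs. The function names are hypothetical, and the URL in the usage note is a placeholder.

```python
import os
import urllib.request
from urllib.parse import urlsplit


def read_contents(lines):
    """Return the non-empty, stripped lines of a URL list."""
    return [line.strip() for line in lines if line.strip()]


def local_path(url):
    """Map a URL to a relative local path that mirrors the URL path."""
    path = urlsplit(url).path.lstrip("/")
    return os.path.join(*path.split("/"))


def download_all(contents_file):
    """Download every URL listed in contents_file, preserving directories."""
    with open(contents_file, encoding="utf-8") as handle:
        urls = read_contents(handle)
    for url in urls:
        target = local_path(url)
        # Create the parent directory if the path has one.
        os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
        urllib.request.urlretrieve(url, target)
```

With a placeholder entry such as http://example.org/btc/timbl/data-0.nq.gz, `local_path` yields btc/timbl/data-0.nq.gz, and `download_all("000-CONTENTS")` would fetch every listed file into the current directory.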

Contact

For questions about data format, server issues or download problems contact harth@kit.edu.

Previous BTC Datasets

Acknowledgement

We acknowledge the support of the Steinbuch Centre for Computing (SCC) and the European Community's Seventh Framework Programme FP7/2007-2013 (PlanetData, Grant 257641).


History

2012-11-08
Jesse Weaver's blog post on the validation of the BTC 2012 dataset
2012-11-07
Gunnar Grimnes' excellent visualisation of links and some other BTC 2012 stats
2012-09-02
Second attempt at fixing bnodes, now include unique context string
2012-08-24
Fixed bnode syntax issues, character encoding issues
2012-07-01
Dataset posted

Andreas Harth