Citing Articles Within Articles

When writing a hypertext document, you might reference another work, linking to it using either descriptive text or the title of the work:

This sentence refers to the book <a href="">Moby Dick</a>, by Herman Melville.

When including a quote from another work, the quote should be marked up using the <blockquote> tag (with a "cite" attribute to say where the quote came from), and optionally a <cite> tag added to show the source of the quote:

<blockquote cite="">The liquor soon mounted into their heads, as it generally does even with the arrantest topers newly landed from sea, and they began capering about most obstreperously.</blockquote>
<cite><a href="">Moby Dick</a></cite>

This complies with the HTML specification for the <cite> element, which says that it should contain the title of the work being cited.

Inline citations

When writing a scholarly article, though, citations (as support for a statement) have historically been added as footnotes, linked at the end of a statement either in "Author, Year" format:

Theropod dinosaurs such as Tyrannosaurus rex attained masses of 7 or even 10 tonnes (<a href="#ref-1">Hutchinson et al., 2011</a>).

or as small superscript numbers:

Theropod dinosaurs such as Tyrannosaurus rex attained masses of 7 or even 10 tonnes<a style="vertical-align:super;font-size:smaller;" href="#ref-1">1</a>.

The footnote then contained the bibliographic information for the item being cited, in a reference list or bibliography:

<ul id="references">
    <li id="ref-1">Hutchinson JR, Bates KT, Molnar J, Allen V, Makovicky PJ (2011) A computational analysis of limb and body dimensions in Tyrannosaurus rex with implications for locomotion, ontogeny, and growth. PLoS ONE 6(10):e26037</li>
</ul>

Online, though, marking up citations this way doesn't make so much sense. As shown in the first examples, you really want to link directly from the text to the document being cited:

<p>Theropod dinosaurs such as Tyrannosaurus rex attained masses of 7 or even 10 tonnes (<a href="">Hutchinson et al., 2011</a>).</p>

Now we need a) something to say that this particular link is an inline citation (that the work is being referenced in support of the preceding statement), and b) something to associate the inline citation with the bibliographic annotation, so that the bibliographic information can be displayed in a popover and the reader can see in advance what's being cited.

You might think that the <cite> tag would be perfect for (a), but no! The <cite> tag is only to be used to mark up the title of the cited work, so we can only use it in the bibliographic information:

<li id="ref-1">Hutchinson JR, Bates KT, Molnar J, Allen V, Makovicky PJ (2011) <cite><a href="">A computational analysis of limb and body dimensions in Tyrannosaurus rex with implications for locomotion, ontogeny, and growth</a></cite>. PLoS ONE 6(10):e26037</li>

There was a draft of HTML3 which suggested using rev="citation" to denote inline citations, but the "rev" attribute is no longer in the HTML specification and "citation" is not a recognised link relation. That's reasonable, as those semantics wouldn't have been correct anyway (the current document is not a "citation" of the linked document, though the anchor itself could be), but it shows that this use case was being considered back then. There was also some discussion on the microformats wiki about using rel="cite" for this purpose.


Instead, let's add some metadata (as microdata attributes) to say that this is an article, and that the bibliographic information is a citation:

<article itemscope itemtype="">
    <h1 itemprop="name">Example Article</h1>

    <p>Theropod dinosaurs such as Tyrannosaurus rex attained masses of 7 or even 10 tonnes (<a href="" itemprop="citation" itemscope itemtype="" itemref="ref-1">Hutchinson et al., 2011</a>).</p>

    <ul id="references">
        <li id="ref-1">Hutchinson JR, Bates KT, Molnar J, Allen V, Makovicky PJ (2011) <cite itemprop="name"><a href="" itemprop="url">A computational analysis of limb and body dimensions in Tyrannosaurus rex with implications for locomotion, ontogeny, and growth</a></cite>. PLoS ONE 6(10):e26037</li>
    </ul>
</article>

Now a machine would be able to read this document and understand that it's an Article with one citation. It can also understand that the cited work is an Article, with title "A computational analysis of limb and body…" and URL <>.

The machine can also see the inline context in which the document was cited, enabling it to display that snippet to someone viewing the cited document (this is what ReadCube does, in fact).
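For example, a microdata parser reading the markup above might extract an item tree along these lines (shown as JSON; the item types and URL are left blank here, as they are in the markup):

```json
{
  "type": "Article",
  "properties": {
    "name": "Example Article",
    "citation": {
      "type": "Article",
      "properties": {
        "name": "A computational analysis of limb and body dimensions in Tyrannosaurus rex with implications for locomotion, ontogeny, and growth",
        "url": ""
      }
    }
  }
}
```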

Citing specific parts of an article

There is still a problem, though: there's no indication of which part of the cited document was cited. If the citing URL had a fragment on the end, e.g. "", which corresponded to the id of an element in the target document, that would be helpful. There have also been experiments with using XPointer in URLs, or non-standard fragment formats to address specific parts of the target document; neither of these has good cross-browser support, so they depend on the target site to handle them appropriately using Javascript.

Even better would be to include a piece of text from the cited document in the citation, so that the relevant part of the cited document can be detected regardless of format (a PDF, for example). In fact, just a unique snippet of text (even an image of that text) is enough to create a citation, if you're using Google Goggles' OCR and the Google Books database to identify the section being cited.
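As a sketch of that idea, here's how locating a quoted snippet in the text of a cited document might work, ignoring differences in whitespace (the function names are made up for illustration; a robust implementation would also need to normalise punctuation, hyphenation and ligatures):

```javascript
// Collapse all runs of whitespace, so line breaks and indentation in
// either the snippet or the document don't prevent a match.
function normalize(text) {
  return text.replace(/\s+/g, ' ').trim();
}

// Return the position of the snippet in the normalised document text,
// or null if it can't be found.
function locateSnippet(documentText, snippet) {
  var haystack = normalize(documentText);
  var needle = normalize(snippet);
  var index = haystack.indexOf(needle);
  return index === -1 ? null : { start: index, end: index + needle.length };
}
```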


Now we're getting into annotation territory. What we're really doing is creating an annotation that says "Document A (with metadata X), at position 1 (identified by anchor element A), links to Document B (with metadata Y), at position 2 (identified by text snippet B), and the type of that link is 'citation'". I'm hoping that will make these kinds of annotations easy to create.
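Such an annotation might be serialised along these lines (a loose sketch only; the property names are illustrative and not taken from any finished annotation vocabulary):

```json
{
  "type": "citation",
  "source": {
    "document": "",
    "metadata": { "name": "Document A" },
    "position": { "anchor": "#citation-1" }
  },
  "target": {
    "document": "",
    "metadata": { "name": "Document B" },
    "position": { "textSnippet": "a unique snippet of text from the cited section" }
  }
}
```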

Once you have all that information, instead of just being transferred from one document to the next via hypertext links, you can start to create a display that builds itself up by transcluding the relevant section of the target document into the source document, wherever it's referenced. You can also, in reverse, transclude sections of documents that cited the current document and - with a bit more citation typing - you can also know whether the citing document supported, refuted, used data from, or made some other reference to the target document.


Most of the elements and attributes needed for marking up citations are available in HTML5, but a few things are still missing:

  • A standard attribute to say that an anchor or span element marks a citation of another work.
  • A standard attribute that connects a "citation" anchor or span to the metadata describing the cited work, somewhere else in the current document.
  • A standard element or attribute that contains information about which part of the cited document is being referenced. See, for example, how the Google Drive API supports annotation targets for different types of media.
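If those existed, the markup might look something like this (entirely hypothetical - neither rel="citation" nor a data-cited-part attribute is part of HTML5):

```html
<p>… attained masses of 7 or even 10 tonnes
  (<a href="#ref-1" rel="citation" data-cited-part="#results">Hutchinson et al., 2011</a>).</p>
```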

Open, Social, Academic Bookmarking: Save to

I'm a big fan of Pinboard - the posting interface is wonderful, and it's even easier with a Chrome extension and keyboard shortcut. Your bookmarks are public (unless you choose otherwise), and anyone can subscribe to your feed.

There's no place in Pinboard, though, for metadata other than title, URL and description, and you can't attach snapshots or files to your bookmarks (Pinboard stores an HTML archive of each bookmark, for subscribers, but its crawler might not always get the same view of the page as you, or any attachments).

Zotero and Mendeley both have APIs which let you post metadata and attach files, but they're limited in various ways (they both have a restricted set of metadata fields allowed for each type of object; Zotero's API is nicely engineered but uses XML; neither supports the client-side OAuth2 flow, which means browser extensions have to either communicate with the API through a server-side proxy or store their secret key in the extension). Also, neither Zotero nor Mendeley allows you to keep your library completely open for anyone to read.

ReadCube has a lovely reading and outward-linking interface, but not much of a public API; Google Drive and Dropbox are both excellent file stores and have well-built APIs, but neither lets you attach extensive metadata to each object.

Basically, I've been searching for a long time for a cloud-based object store with a decent API, the ability to store items with arbitrary JSON metadata/annotations, multiple file attachments, user-pays storage costs, client-side OAuth2 authentication, cross-origin request headers and (as a special bonus) a web-based social layer for communication.

Save to

Then, last week, Files were added to the existing API, allowing files to be attached to posts.

In the same way as you can post to Twitter and attach photos, now you can create a post, attach any metadata you like as annotations, and attach arbitrary files to it.

In my case, those files are PDFs.

As a starting point, I made a Chrome extension that reads metadata from the current page, fetches more metadata from various services using any identifier it can find, then posts the item. If the original HTML file points to an associated PDF file, it also fetches that file and attaches it to the post.

When I'm browsing the web reading articles, I just have to press a single button and the item is stored in my feed.

Anyone subscribed to the RSS feed of my posts gets the item title, a link to the original page, and the PDF file as an enclosure.
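An entry in that feed would look roughly like this (the URLs and length are placeholders):

```xml
<item>
  <title>Example article title</title>
  <link></link>
  <enclosure url="" length="1234567" type="application/pdf"/>
</item>
```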

Read from

As a reader, I need to be able to access all the stored metadata for each item, so I built a client-side web interface to my library. After authenticating, it fetches my latest posts and displays those that have file attachments. Then, when I select an article, it displays the PDF using PDF.js.

So: a cloud-based file system with arbitrary metadata, a browser extension for creating items and attaching files, a web-based reading interface, the ability to subscribe to other people's posts... what next? A local, synced IndexedDB database and file store like in Metatato, for better lookup, browsing and offline reading? A separate service that converts each PDF to XML using pdfx and attaches the XML file back to your post? An application that reads your feed and recommends similar articles? Text mining to suggest connections between articles? Plugins to display attention metrics for each article? Combine the feeds of everyone you follow to create an über-library?

If you have more ideas, here's a home page, of sorts, which has a link to all the source code.

HTML metadata for journal articles

You’d think it would be easy to pin down an ontology for journal articles. There are basically just these properties:

  • title
  • authors[]
  • datePublished
  • abstract

But… some of those are shared with more generic classes higher up the tree, so abstract becomes description, title becomes name, author becomes creator. Each author can be a string or an object. Each author has one or more affiliations, which have addresses. The authors are in a specific order, and some of them have certain roles. There are several different dates: creation, review, update, publication.

  • author: { name: { displayName, familyName, initials, middleName, lastName }, role, affiliation }

Then the big one - each article (a "Work") is expressed in various different forms ("Instances", or publication events). It might be published in one or more collections/periodicals, and not just in a “journal”, but on a certain page of an issue which is part of a volume which is part of a periodical, which has an ISSN (and an eISSN, and an ISSNL) and a title (in multiple forms of abbreviation):

  • journalName
  • issn
  • issue
  • volume

The work itself can have identifiers (DOI, PMID, arXiv, etc) which may or may not be URLs and may or may not be applicable to each publication event. The work may also be rendered in different formats (HTML, PDF) and languages, each of which has its own URL and metadata.

  • htmlURL
  • pdfURL
  • DOI
  • PMID
  • arxiv
  • language
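
Pulling those properties together, a flattened record for the T. rex article cited earlier might look like this (identifiers and URLs are left blank rather than invented):

```json
{
  "name": "A computational analysis of limb and body dimensions in Tyrannosaurus rex with implications for locomotion, ontogeny, and growth",
  "author": ["Hutchinson JR", "Bates KT", "Molnar J", "Allen V", "Makovicky PJ"],
  "datePublished": "2011",
  "journalName": "PLoS ONE",
  "volume": "6",
  "issue": "10",
  "startPage": "e26037",
  "issn": "",
  "DOI": "",
  "PMID": "",
  "language": "en",
  "htmlURL": "",
  "pdfURL": ""
}
```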

Thus, while has a ScholarlyArticle class (and even a MedicalScholarlyArticle class), it’s quite incomplete and doesn’t even cover a lot of the citation_* tags that Google Scholar indexes.

There’s a W3C working group - “Schema Bib Extend” - trying to extend the schema for bibliographic markup (along with similar efforts in other working groups for comics and other serials/periodicals).

OCLC have made their own extension to to add classes for things like Periodicals and Newspapers.

FreeBase has an extensive set of types and properties around Scholarly Works and Citations, including Journal Article.

BIBO is an existing bibliographic ontology, which is similar to Zotero’s field definitions, and there are similar attempts in FaBiO, BIRO and BibJSON.

There's the MODS XML schema, which I rather like. MODS has proven itself as an intermediary format in bibutils, and converts quite cleanly to JSON.

The newest entrant is BIBFRAME from the Library of Congress: yet to release an ontology, but with a clear overview defining Work, Instance, Authority and Annotation superclasses, where a Work is published as one or more Instances.

The nested/graph approach is pleasing, theoretically: Article (Work) -> hasInstance -> Article (PDF) -> isPartOf -> Issue -> isPartOf -> Volume -> isPartOf -> Journal -> hasISSN -> ISSN. On the other hand, is looking for simple key/value pairs attached to an object, and in practice it seems to work OK (in Zotero and Mendeley, at least) for the article to have “issn”, “startPage”, “endPage”, “volume”, “issue” etc attached to it directly, rather than to one or more associated “isPartOf” entities.

When you come to add markup to HTML to describe these properties (it will be great when articles are just published as HTML with metadata embedded, rather than having to generate XML in multiple formats for archiving and submission to various systems), there are several ways to add this metadata. Links to alternate formats fit nicely as rel="alternate" links, while either HTML5 microdata or RDFa Lite can be used for adding key-value properties to an object; the two are essentially equivalent, except that RDFa Lite has simpler attribute names while microdata has a defined DOM API and a redundant “itemscope” attribute.
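For comparison, the same name property in each syntax (vocabulary URLs omitted here, as in the earlier examples):

```html
<!-- HTML5 microdata -->
<article itemscope itemtype="">
  <h1 itemprop="name">Example Article</h1>
</article>

<!-- RDFa Lite -->
<article vocab="" typeof="ScholarlyArticle">
  <h1 property="name">Example Article</h1>
</article>
```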

The main aim of adding this markup, currently, is so that when someone bookmarks/shares the page (either privately or in public), the information that’s displayed/saved to their collection is easily readable and correct. On the other side of things, having a standard set of metadata that can be passed between various services is also useful, when you want to use the same metadata about an article to look it up in multiple services.

A search engine like Google Scholar probably only really needs a few fields to identify and describe a Work: title, authors, publication date, abstract/description and URL. For locating, filtering, or referring to a specific instance of a work, though, the other fields become useful.

I’ve added the basic microdata and RDFa Lite to an HTML rendering of PubMed articles at{pmid}. If you have an application that allows people to bookmark/share PubMed articles, that might be a good URL to use.

Ten years of HubMed

It's been 10 years, last weekend, since HubMed first went online. Inspired by TouchGraph's Google Browser, I learned a bit of Perl (mostly because PubCrawler was the closest example code I could find) and had a go at making something similar using PubMed's "Related Articles" API.

Inspired by Mark Pilgrim, I made a script to convert PubMed's EUtils output to RSS, and later Atom. It was great to be able to have new papers arrive in a feed aggregator, and a couple of years later it became possible to do so using PubMed itself. In the meantime I played around with the web interface to HubMed, adding features - some of which are still there, some of which have decayed. Gunther Eysenbach/JMIR kindly supported some part-time work on the site in 2005-6, and I wrote a paper (which - like most papers - obviously wanted to be a blog post) about some of HubMed's features.

So, 10 years later, it's time for a new version of HubMed. It's linked from the front page, but hasn't replaced the old version yet.

There have been some changes to the web over the last 10 years, mostly coming from innovations in web standards and browser support for those standards. These days, there's no need for any server-side scripting to run HubMed - all the data comes direct from the EUtils server (as allowed by CORS), gets turned into Javascript objects, then is rendered as views using Backbone. Attention metrics for each article get pulled in - client-side - from various sources, particularly Altmetric. You can bookmark articles in Mendeley, or download them directly as RIS or BibTeX thanks to bibutils.
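For example, the kind of request HubMed can now build entirely client-side: an EUtils ESummary URL for a set of PubMed IDs (the helper functions are illustrative; the endpoint and the db, id and retmode parameters are NCBI's):

```javascript
// Serialise a params object into a URL query string.
function encodeParams(params) {
  return Object.keys(params).map(function (key) {
    return encodeURIComponent(key) + '=' + encodeURIComponent(params[key]);
  }).join('&');
}

// Build an ESummary request URL for one or more PubMed IDs,
// asking for the JSON response format.
function esummaryURL(pmids) {
  return ''
    + '?' + encodeParams({ db: 'pubmed', id: pmids.join(','), retmode: 'json' });
}
```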

Many of these new features are still a work-in-progress, which is exciting: browsers are adding support for native search inputs; back buttons and list offsets are still an unsolved problem when using infinite scroll; there's still a need for an extensible way to ask the browser how it would prefer to handle various actions (bookmark, save, etc; Web Intents is trying to solve this); it's still too difficult to get a URL for the PDF of each article, and still too expensive to read those articles (let alone delegate the reading to software). Clicking an author's name in HubMed shows you other articles they wrote, but also articles by people of the same name - a problem which ORCID is trying to solve by assigning each researcher a unique identifier.

As for features that still survive, one of my favourite things that databases can do is "More Like These", and HubMed now makes that even easier: hold down Ctrl/Cmd while clicking subsequent "Related" links, and the related articles will be merged; the search gets more and more focused, and hopefully achieves an equivalent goal to the original PubMed TouchGraph, even if it doesn't visualise all of the connections between similar articles.

There's also an update to HubMed's Citation Finder, though it's still missing a bulk export option, and I think I can make finding articles even easier now that free text citation parsing is working again in EUtils.

Most importantly, a next step: each researcher's collection of saved articles needs to move out of silos and become available to the web. When searching in Metatato, the list of search results knows which articles you already have saved, as it can query your local database, and can add articles directly; I think this needs to be expanded so that any database or search index can be overlaid with information about your existing collection. It may be synced and stored on your local computer, or it may be somewhere else behind an API, but allowing third-party tools to query and analyse that collection as you build it will hopefully turn out to be really useful for searching and filtering.

There's one more thing related to search and discovery, which is your social graph: the sources that you have chosen as reliable. Information filters through them to you, and search results in Google and Twitter already incorporate social cues to highlight useful information. It should get easier to follow not only the research produced by the most knowledgeable people in any particular area, but also the research that they find the most interesting.

Publishing a podcast using Google Drive (in theory)

  1. Create a folder in Google Drive, and set the sharing settings to Public.
  2. Open the public folder URL, and upload the files you want to publish [example].
  3. Edit the title and description of each item.
  4. Create a spreadsheet in the same folder [example].
  5. Attach a script to the spreadsheet that finds all the files in this folder and adds each one as a new row to the spreadsheet (with columns "Name", "Date", "Size", "URL", "Description", etc) [example].
  6. Publish the spreadsheet and copy the URL for the text (TSV) version [example].
  7. Create a Yahoo Pipe that fetches the TSV file and turns each row into an entry in an RSS feed (each with an enclosure property) [example].
  8. Subscribe to the RSS feed from that Yahoo Pipe [example].
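For what it's worth, the transformation in step 7 is simple enough to do without Yahoo Pipes. Here's a sketch in plain JavaScript that turns one spreadsheet row (using the column names from step 5) into an RSS item with an enclosure; the escaping is deliberately minimal, and the example URL in the usage below is a placeholder:

```javascript
// Escape the characters that would break XML markup or attribute values.
function escapeXML(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/"/g, '&quot;');
}

// Turn one spreadsheet row into an RSS <item> with an <enclosure>,
// assuming the files are MP3s.
function rowToItem(row) {
  return [
    '<item>',
    '  <title>' + escapeXML(row.Name) + '</title>',
    '  <pubDate>' + escapeXML(row.Date) + '</pubDate>',
    '  <description>' + escapeXML(row.Description) + '</description>',
    '  <enclosure url="' + escapeXML(row.URL) + '" length="' + escapeXML(row.Size) + '" type="audio/mpeg"/>',
    '</item>'
  ].join('\n');
}
```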

The only remaining hurdle is that Google Drive's download links no longer allow you to bypass the antivirus warning page, which means that it's not possible to get a direct download link for each MP3 file.

Publishing Articles Using Gists

  1. Not only does GitHub provide standard Git repositories, it also provides Gists: pasting text into an input box creates a file in a new Git repository, which can then be cloned and updated like any other repository, edited locally or in an online editor, with revision tracking. It's meant for small, self-contained items.

  2. Mike Bostock's is built on top of Gists. It takes the HTML, CSS and JS files from a Gist, displays the highlighted source code, and runs the code above it. It's meant for publishing visualisations built with D3 (and similar libraries).

  3. Inspired by, I realised it was possible to publish articles as XML in Gists. I created a viewer (code in GitHub) which fetches a Gist, looks for a file called article.xml in the JATS/NLM XML format, converts it to HTML and displays the article.

Here's an example Gist and the corresponding HTML article.

Note: this needs some testing, so it might not work in browsers other than Chrome - I think IE9 uses a different method for transforming XML documents.

Music Seeds and More Like These

When you want to listen to music, but you don't know which music, where do you go? Recommendations takes the things you've listened to recently and recommends a range of similar music, based on what other people have been listening to. You can see the recommendations on the web, through the API, or in the Spotify app.

This Is My Jam

Alternatively, find people with good music taste and see what they recommend. Follow people on This Is My Jam and get a feed of their chosen tracks, either on the web or in the Spotify app.

Spotify Social

Add Spotify users to your "People" list, and it shows what they've been listening to - click on any of the tracks they've played to start listening.

BBC Radio

Not just listening to live radio, or listening-again to recorded radio, but generating playlists from the latest broadcasts. Subscribe to playlists of your favourite shows in Tomahawk or in Spotify via Britify and have the latest tracks available at all times. Alternatively, subscribe to a radio station and use their playlist as a seed for new exploration.


Mixcloud

Find some broadcasters that you like on Mixcloud (e.g. The Quietus, or many of the Resonance FM shows are archived here), and follow them to get their latest mixes. Each mix has an accompanying tracklist, and it's fairly easy to link each artist to Spotify/Tomahawk for further exploration.

Album Reviews

There are Spotify apps for The Guardian, Pitchfork and the NME - among others - which list the albums that they've reviewed. Even better, set your own filtering criteria in Spotimy or Biblify for albums reviewed by a whole range of sources.

Location and Venues

Songkick maintains lists of all the concerts in a particular area. If there are some venues that you know usually book good musicians, you can listen to tracks by everyone who's going to be playing at those venues, either in the Spotify app or on the web.

More Like These

These Spotify apps let you create playlists of whichever view they're currently showing - other people's recommendations, reviews, tracks you've liked previously, etc - but there's one more killer feature:

Playlist Radio

Essentially, this takes the current playlist and performs a "More Like These" query (via The Echo Nest), generating a new playlist of similar tracks.

Thus, music discovery goes from an initial seed source (people, robots, or a playlist in a certain style), narrows down for more detail (selecting a specific artist or album to hear more of), or widens out (related artists, playlist radio) for exploration. Sooner or later you'll hear something that sounds just right, and you can feed it back for other people to enjoy.

Querying Data Sets using Google BigQuery

Google BigQuery is a "cloud SQL service built on Dremel". It can quickly run SQL queries across large data sets.


Say you've downloaded a relatively small set of TSV files (around 100MB), unpacked them locally, and converted them to UTF-8:

wget --continue '' --output-document='knowledgebase.tar.gz'
mkdir knowledgebase
tar -xvzf knowledgebase.tar.gz --directory knowledgebase
rm knowledgebase/update.xml
find knowledgebase/ -type f -exec iconv -f iso-8859-1 -t utf-8 "{}" -o "{}.tsv" \;

Create a new project in Google API Console, enable the BigQuery API service for that project (requires enabling billing), and install bq (command line interface to the BigQuery API):

sudo easy_install bigquery
bq init

Create a new dataset in the BigQuery browser, and set it as your default dataset:

echo dataset_id=cufts >> ~/.bigqueryrc

Multiple tables with different schema

Upload all the TSV files to your BigQuery dataset:

for FILE in `ls knowledgebase/*.tsv`; do
    echo $FILE
    BASENAME=$(basename $FILE .tsv)
    echo $BASENAME
    SCHEMA=`head --lines=1 $FILE | sed -e "s/\t/, /g"`
    echo $SCHEMA
    bq load --encoding="UTF-8" --field_delimiter="\t" --skip_leading_rows=1 --max_bad_records=100 "$BASENAME" "$FILE" "$SCHEMA"
done

Now they can be queried in the BigQuery browser:

SELECT title FROM cufts.aaas, cufts.acs, cufts.doaj WHERE issn = '17605776'

The trouble is that a) you have to specify each table individually in the query, and b) each table has a different schema, so you get an error querying for any field that isn't present in all tables.

A single table with normalised schema

To combine all the tables into one normalised file, create a new project in Google Refine, add all the TSV files, then export the data to a single TSV file. As this file is quite big, store it in Google Cloud Storage so it's available for re-use later:

In the Google API Console, enable the Google Cloud Storage service for your project. Create a new bucket (called "cufts" in this example). Install the Cloud Storage command line interface gsutil, and upload the combined TSV file (using the "-z" option to enable compression):

gsutil cp -z tsv -a public-read cufts.tsv gs://cufts/

Once it's been uploaded to Cloud Storage, import the file to a new BigQuery table:

SCHEMA=`head --lines=1 cufts.tsv | sed -e "s/\t/, /g"`
echo $SCHEMA
bq load --encoding="UTF-8" --field_delimiter="\t" --skip_leading_rows=1 --max_bad_records=100 knowledgebase "gs://cufts/cufts.tsv" "$SCHEMA"

Now you can run queries against the combined data set:

SELECT file, title, e_issn FROM cufts.knowledgebase WHERE issn = '01617761' OR e_issn = '01617761'

Queries can take a few seconds to run, but - as long as they return results before timing out - you can use Javascript to access the query API.

These queries require OAuth 2.0 authentication and a project ID, as queries count towards the quota/billing for the project doing the querying, so it's not possible to allow public queries of the dataset in this way; you'd have to provide an API yourself and handle authentication on the server.

Using Google Fusion Tables to provide an API to data files

This example imports a large TSV file (all the data from the CUFTS Knowledgebase) into a Google Fusion Table, then uses the Fusion Table API to provide a JSON API for querying.

Here's an example table, and an example interface.

  1. Download the latest CUFTS knowledgebase data file:
    wget --continue '' --output-document='cufts.tar.gz'
  2. Extract the data and remove any unwanted files:
    mkdir cufts
    tar -xvzf cufts.tar.gz --directory cufts
    rm cufts/update.xml
  3. Convert the TSV files to UTF-8 encoding:
    find cufts/ -type f -exec iconv -f iso-8859-1 -t utf-8 "{}" -o "{}-utf8.tsv" \;
  4. To merge multiple files together, create a new project in Google Refine and add all the UTF-8 TSV files, then export the data from Google Refine to a single TSV file. If this exported file is over 100MB, split it into smaller pieces:
    tail -n +2 cufts.tsv | split -l 500000 - cufts-split-
    for file in cufts-split-*; do
      head -n 1 cufts.tsv > tmp_file
      cat $file >> tmp_file
      mv -f tmp_file $file
    done

    Alternatively, add each file to the Fusion Table individually, either manually or using the API.

  5. In Google Drive choose Create > Table and import the TSV file to create a new Fusion Table. To add more data, choose File > Import more rows.
  6. Edit the Fusion Table's sharing settings to make it public.
  7. Create a new project in Google APIs Console, enable Fusion Tables API in the "Services" section, and copy the API key from the "API Access" section, for use when querying.
  8. Clone this GitHub Pages repository, and use it to create your own interface to the Fusion Table data. Basically the queries are SQL, e.g.
    SELECT file, title FROM abcdefghijk WHERE issn = '1234-567X'
    and querying in jQuery is simple (the responses have a CORS header so can be accessed cross-domain):
      $.ajax({
        url: "",
        data: {
          key: "YOUR API KEY",
          sql: "YOUR SQL QUERY"
        }
      });

I have noticed that queries can be (variably) quite slow on large tables like this one, but hopefully that's temporary…

Resourceful Web Interfaces

I gave this talk a couple of months ago, and think it does a good job of describing the way I've been thinking about "resourceful" web development: data modelled as collections of resources; views entirely separate from the data.

This also fits nicely with the way that Backbone works, and Google is perhaps moving in the same direction with App Engine (data) and App Scripts (views).