The two meanings of semantics in HTML5

There is a lot of confusion around HTML5 and RDFa, both in the Drupal community and outside of it. That’s why I decided to redevelop my site in Drupal 7 with HTML5, using a base theme available on Drupal.org, to see for myself what it's like using HTML5 and RDFa together.

HTML5 is the incontestable future of the Web, and it is becoming more and more clear that inline structured data is also going to be a fundamental part of the future Web... and, with the current core support for RDFa and the future core support for HTML5, Drupal has an interest in both.

The capabilities offered by HTML5—and the other standards that sometimes get conflated with it, such as CSS3—blow my mind.

I’ve also been impressed by a number of the people involved in developing and evangelizing HTML5, particularly the emphasis I’ve seen on the social aspects of standards adoption/compliance and the concern for the cognitive load placed on the developer. I didn’t know much about that side until Jeremy Keith talked about design priorities at DrupalCon Copenhagen... you should watch that.

What more could you ask for?

In spite of this HTML awesomeness, there are still some things that HTML just doesn’t do... at least, not without a little help.

What I want that plain old HTML doesn’t give me

I want to add content to one site and then be able to target specific pieces of that content to display on another site...
using whatever platform I want as my backend for those sites (or even no platform at all)...
without having to worry when I change the document structure of the source site (like removing classes or adding more divs) that I might break things in applications other people have built...
and if one thing or person is talked about on multiple sites, I want to be able to retrieve and remix information from all of those sites without having to hand-edit the files to align them.

And I want to do all this without having to learn any proprietary APIs (e.g., the Twitter API, the Flickr API, the GoodReads API, etc.), without having to feed my data into Google's giant data pools, and without having to code custom tools for each set of data.

You can’t do all of this at the same time just using plain old HTML... however, you can when you place a few machine-readable labels on your content. Once you do, you’ve added semantics to your data and opened it up for reuse. Now applications can come to your site and use the HTML page like a database.
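To make "use the HTML page like a database" a bit more concrete, here is a minimal sketch in Python. It is not a real RDFa processor; it only reads the `about` and `property` attributes from a made-up page fragment. The `ex:` property names and the `#heron` identifier are assumptions for illustration, not something from a real vocabulary.

```python
from html.parser import HTMLParser

# Hypothetical page fragment. The "ex:" names are invented for this example.
PAGE = """
<div about="#heron" typeof="ex:Animal">
  <span property="ex:commonName">Great Blue Heron</span>
  <span property="ex:speciesName">Ardea herodias</span>
</div>
"""


class PropertyExtractor(HTMLParser):
    """Collects (subject, property, text) for elements with a `property` attribute.

    This is a deliberate simplification: it ignores prefixes, nesting rules,
    and everything else a conforming RDFa parser has to handle.
    """

    def __init__(self):
        super().__init__()
        self.subject = None   # value of the most recent `about` attribute
        self.current = None   # property whose text we are waiting for
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "about" in attrs:
            self.subject = attrs["about"]
        if "property" in attrs:
            self.current = attrs["property"]

    def handle_data(self, data):
        text = data.strip()
        if self.current and text:
            self.triples.append((self.subject, self.current, text))
            self.current = None


def extract(html):
    """Return the labelled values found in an HTML string."""
    parser = PropertyExtractor()
    parser.feed(html)
    return parser.triples


for triple in extract(PAGE):
    print(triple)
```

The point is that the extraction keys off the labels, not the document structure, so wrapping the spans in extra divs or changing classes would not affect the result.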

But wait, isn’t HTML5 all about semantics, too?

The difference between HTML semantics and the other kind of semantics

At DrupalCon, questions always come up about HTML5’s semantic markup vs. RDFa. As I always try to explain, it isn’t a ‘versus’ kind of situation.

The trouble is the word semantics. The word means “meaning”. Unfortunately, in the way it is used, its own meaning is often unclear.

HTML5 adds a whole bunch of new semantic elements. These are meant to help browsers understand how to present your content better, whether it is being presented on a screen or by some other kind of device, such as a screenreader. Example elements are video, nav for menus, and mark for highlighted text.

That is the meaning of your document structure, not the meaning of your content itself.

For instance, say I have two articles, one about Ireland and the other about the Blue Heron that wanders around my house. And let’s say I have an aside within each article.

For the Ireland article, the aside will contain information about Ireland, such as the gross domestic product (GDP), broken down by sector. For the article about the blue heron, the aside will contain information about blue herons, such as its species name and links to other herons in the same genus.

The semantics of the document structure of the two pages is the same, an <article> with an <aside>.

However, the semantics of the content are different. While the relationship between the subjects and their images is the same, the relationship between a country and its GDP is different than the relationship between an animal and its species name.
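A small Python sketch can show the two kinds of semantics side by side. The markup for both articles is hypothetical, and the `ex:` property names are assumptions made up for this example. The script lists the structural elements and the content labels separately for each page.

```python
from html.parser import HTMLParser

# Hypothetical markup: the "ex:" property names are made up for illustration.
IRELAND = """
<article>
  <h1>Ireland</h1>
  <aside about="#ireland">
    <span property="ex:grossDomesticProduct">GDP by sector</span>
  </aside>
</article>
"""

HERON = """
<article>
  <h1>The blue heron</h1>
  <aside about="#heron">
    <span property="ex:speciesName">species name</span>
  </aside>
</article>
"""


class Outline(HTMLParser):
    """Records structural elements and content labels in document order."""

    def __init__(self):
        super().__init__()
        self.structure = []   # document semantics: article, aside
        self.properties = []  # content semantics: values of `property`

    def handle_starttag(self, tag, attrs):
        if tag in ("article", "aside"):
            self.structure.append(tag)
        prop = dict(attrs).get("property")
        if prop:
            self.properties.append(prop)


def describe(html):
    """Return (structure, properties) for an HTML string."""
    parser = Outline()
    parser.feed(html)
    return parser.structure, parser.properties


ireland_structure, ireland_props = describe(IRELAND)
heron_structure, heron_props = describe(HERON)

print(ireland_structure == heron_structure)  # same document structure
print(ireland_props == heron_props)          # different content labels
```

The structure lists come out identical for both pages, while the property lists differ, which is the distinction the two articles are meant to illustrate.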

It’s unfortunate that the language for describing these two different kinds of semantics is so fuzzy. It is quite easy to understand why people confuse semantic HTML markup and semantically marked-up content. But the distinction is important to any discussion about HTML and RDFa. As a layer of extra meaning on top of HTML, RDFa is comparable to microdata, another W3C working draft, in the way it allows developers to add structure to data.

Before leaving this example, I would just like to note that because both the GDP information and the species information are available in RDF datasets that have SPARQL endpoints to access that RDF, you could actually pull all of the content for the aside directly from the source using a tool like SPARQL Views, which means you don’t have to enter and maintain the info yourself :)
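For readers who haven't seen a SPARQL endpoint before, the sketch below shows roughly what a request to one looks like. This is not how SPARQL Views works internally; it is just a plain Python illustration. The DBpedia endpoint and the resource URI are assumptions chosen for the example, and the code only builds the request URL without sending it.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumption: a public endpoint and resource URI picked for illustration only.
ENDPOINT = "https://dbpedia.org/sparql"
RESOURCE = "http://dbpedia.org/resource/Ireland"


def build_query(resource_uri, limit=10):
    """Build a generic query listing properties and values of one resource."""
    return (
        "SELECT ?property ?value "
        f"WHERE {{ <{resource_uri}> ?property ?value }} "
        f"LIMIT {limit}"
    )


def build_request_url(endpoint, query):
    """Encode the query as the `query` parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query})


def read_back_query(url):
    """Decode the `query` parameter again, to check the round trip."""
    return parse_qs(urlparse(url).query)["query"][0]


query = build_query(RESOURCE)
url = build_request_url(ENDPOINT, query)
print(query)
print(url)
```

Fetching that URL with an HTTP client and an appropriate Accept header would return the matching rows, which a tool can then place in the aside.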

The real discussion: microdata and RDFa

Like RDFa, microdata is a way of adding structure to your data within the markup, and does it in a more generalized way than microformats (at least, the original microformats).

In the talk I gave at DrupalCon Chicago, I mentioned that sometimes when people ask me about microdata and RDFa, it’s in these slightly hushed, excited tones... as though they were asking me about a fight at the prom last night. And it is really a shame that this is the case.

It seems like a lot of the reason for past animosity between the supporters of the two specs is political and philosophical, not so much technical. The truth is microdata and RDFa are similar in a number of ways. There are many people like myself who just want to do cool things with data and need an easy way to extract data from a page and sometimes combine it with data from other sites, and for these people, both specs move us further than we are today.
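To give a rough sense of how similar the two can look in practice, here is a hedged Python sketch with the same fact marked up once in RDFa and once in microdata. The vocabulary URL `http://example.org/vocab#` and the `speciesName` property are invented for this example, and the extractor is a toy that only reads one attribute; it is not a conforming parser for either spec.

```python
from html.parser import HTMLParser

# Hypothetical fragments: the example.org vocabulary is an assumption.
RDFA = """
<div vocab="http://example.org/vocab#" typeof="Animal">
  <span property="speciesName">Ardea herodias</span>
</div>
"""

MICRODATA = """
<div itemscope itemtype="http://example.org/vocab#Animal">
  <span itemprop="speciesName">Ardea herodias</span>
</div>
"""


class AttrValueExtractor(HTMLParser):
    """Collects (name, text) pairs for elements carrying the given attribute."""

    def __init__(self, attr_name):
        super().__init__()
        self.attr_name = attr_name  # "property" for RDFa, "itemprop" for microdata
        self.pending = None         # label whose text we are waiting for
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        name = dict(attrs).get(self.attr_name)
        if name:
            self.pending = name

    def handle_data(self, data):
        text = data.strip()
        if self.pending and text:
            self.pairs.append((self.pending, text))
            self.pending = None


def extract_pairs(html, attr_name):
    """Return the labelled values found under one attribute name."""
    parser = AttrValueExtractor(attr_name)
    parser.feed(html)
    return parser.pairs


print(extract_pairs(RDFA, "property"))
print(extract_pairs(MICRODATA, "itemprop"))
```

For a simple case like this one, both fragments yield the same label and value; the differences between the specs show up in areas this toy ignores.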

But there are some real differences between the two specs. As a community, I think we need to engage with these differences and really try these things out... because at this point in time, while the standards clearly state the need for implementation feedback, we can still file bug reports and have the standardizers welcome them.

To learn more

There are others who have written about microdata and RDFa, and some have examined the differences. These are some of the ones that I like. Please add any other recommendations in the comments:

Chrome Web standards hacker and active #whatwg participant Tab Atkins has a good introduction to microdata.
Manu Sporny, chair of the W3C's RDFa working group, provided a very helpful comparison of RDFa, microdata, and microformats.
Jeni Tennison, a very insightful developer who’s worked on big Linked Data projects, has given her comments in multiple posts.
Benjamin Nowack, developer of the ARC2 RDF library for PHP that we use in Drupal, has talked about microdata as semantic markup for both RDFers and non-RDFers.

html5
microdata
rdfa
drupal
drupal-planet
html-data

Comments

Extracting document semantics

Lin,

Nice post :)

A couple of things that sprang to mind while reading your post. First is to point you at the post Philip Jägenstedt (@foolip) wrote a while ago to add to your 'To learn more' section:

blog.foolip.org/2009/08/23/microformats-vs-rdfa-vs-microdata/

Second is to point out the way in which the RDF mapping algorithm in microdata extracts some of the document-level semantics as well as explicit content-level semantics included by the publisher. For example, it uses the <title> element to create a dct:title statement about the document, rel attributes to create statements linking the document to others with custom relationships, <blockquote> and <cite> elements to create dct:source statements and so on. There are various other things that aren't pulled out, but could be, such as the publication date of an article, or containment structures of sections within the document. It's hard to know where to draw the line on those, I think.

Extracting document semantics like this means it's very easy to get ambiguity in URIs, because the HTML document's URI is in these cases very much taken as being a document's URI rather than being a URI for the things that the document is about.

Jeni

I'm afraid you may be adding to the confusion here.

I think discussing HTML semantics as something distinct from content semantics is misleading for most people. You're suggesting HTML doesn't include content semantics. This is clearly untrue, as there's a whole language, GRDDL, dedicated to parsing content semantics in HTML. The problem with HTML isn't that content semantics aren't there; it's that those content semantics aren't readable by machines without the aid of something like GRDDL.

RDFa cuts out that middle step, but it's not an entirely new concept. HTML content semantics like or are easily understood by most people, just not machines. I'm worried you may be making semantics seem less approachable than it really is by suggesting such HTML semantics doesn't include content semantics. Content semantics machines can't understand are far less useful, but they're still content semantics, and still a valuable entry point in understanding the concept.

Unless an HTML author is

Unless an HTML author is using microformats (which isn't part of HTML, but rather is an extension like RDFa or microdata), then I don't think they generally add content semantics to the classes.

If any of the markup marines or other front end folks say that it is widespread practice, then I will stand corrected. However, from the HTML working group's research, it seems that most people use classes to define what an element's purpose is in the document, not what the content inside is. That's the justification they give for promoting those things to elements. When I was doing more front-end dev, my class names (and the class names in the base template we used) matched the working group's findings.

I'm not saying that no one should use classes to add their own, non-microformat semantics that are not machine readable... totally fine if someone wants to do this. But the thing about RDFa, microdata, and microformats is that because they are machine readable, they enable you to build applications. Otherwise, you're just using the class as a kind of a code comment for other developers to understand your meaning. That's fine, might even be something really useful for code maintenance... but code commenting is not really a part of this discussion. Something can have its own semantics without being a part of *this* semantics discussion.

You're right that

You're right that microformats vocabularies are not part of HTML, but microformats syntax is entirely HTML. That same HTML syntax is used outside microformats by many, many HTML authors to describe content semantics. In the survey you linked, copyright (#9) is definitely content semantics, and title (#3) would be in some contexts. On this page, maps directly to dc:created and maps directly to foaf:name. Your dismissal of these widely used and -- more importantly -- widely understood HTML content semantics as "code commenting" is disappointing. It's hard to get more people to see the value in this stuff when we make it seem so esoteric. It's really not.

I'm not dismissing it, but if

I'm not dismissing it, but if it is only for human consumption and not for machine consumption, then it is by definition commenting. Comments are the part of code that is only for human consumption. If you are using microformats, then you aren't using straight HTML, you are using HTML+microformats, which is adding the little bit of content semantics that I'm talking about.

Anyways, this is a sidetrack that I don't think is relevant to the discussion, so I would prefer to focus on the issues raised in the discussion.

Lin Clark
