
Silvia Pfeiffer : What is “interoperable TTML”?


I’ve just tried to come to terms with the latest state of TTML, the Timed Text Markup Language.

TTML has been specified by the W3C Timed Text Working Group and released as a Recommendation v1.0 in November 2010. Since then, several organisations have tried to adopt it as their caption file format. This includes the SMPTE, the EBU (European Broadcasting Union), and Microsoft.

Both Microsoft and the EBU actually looked at TTML in detail and decided that, in order to make it usable for their use cases, a restriction of its functionality is needed.

EBU-TT

The EBU released EBU-TT, which restricts the set of valid attributes and features. “The EBU-TT format is intended to constrain the features provided by TTML, especially to make EBU-TT more suitable for the use with broadcast video and web video applications.” (see EBU-TT).

In addition, EBU-specific namespaces were introduced to extend TTML with EBU-specific data types, e.g. ebuttdt:frameRateMultiplierType or ebuttdt:smpteTimingType. Similarly, a bunch of metadata elements were introduced, e.g. ebuttm:documentMetadata, ebuttm:documentEbuttVersion, or ebuttm:documentIdentifier.

The use of namespaces as an extensibility mechanism ensures that EBU-TT files remain valid TTML files. However, any vanilla TTML parser will not know what to do with these custom extensions and will drop them on the floor.

Simple Delivery Profile

With the intention to make TTML ready for “internet delivery of Captions originated in the United States”, Microsoft proposed a “Simple Delivery Profile for Closed Captions (US)” (see Simple Profile). The Simple Profile is also a restriction of TTML.

Unfortunately, the Microsoft profile is not the same as the EBU-TT profile: for example, it contains the “set” element, which is not conformant in EBU-TT. Similarly, the supported style features are different, e.g. Simple Profile supports “display-region”, while EBU-TT does not. On the other hand, EBU-TT supports monospace, sans-serif and serif fonts, while the Simple profile does not.

Thus files created for the Simple Delivery Profile will not work on players that expect EBU-TT, and vice versa.

Fortunately, the Simple Delivery Profile does not introduce any new namespaces or features, so at least it is a true subset of TTML and not both a restriction and an extension like EBU-TT.

SMPTE-TT

SMPTE also created a version of the TTML standard called SMPTE-TT. SMPTE did not decide on a subset of TTML for their purposes – it was simply adopted as a complete set. “This Standard provides a framework for timed text to be supported for content delivered via broadband means,…” (see SMPTE-TT).

However, SMPTE extended TTML in SMPTE-TT with an ability to store a binary blob with captions in another format. This allows using SMPTE-TT as a transport format for any caption format and is deemed to help with “backwards compatibility”.

Now, instead of specifying a profile, SMPTE decided to define how to convert CEA-608 captions to SMPTE-TT. Even if it’s not called a “profile”, that’s actually what it is. It even has its own namespace: “m608:”.

Conclusion

With all these different versions of TTML, I ask myself what a video player that claims support for TTML will do to get something working. The only chance it has is to implement all the extensions defined in all the different profiles. I pity the player that has to deal with a SMPTE-TT file that has a binary blob in it and is expected to be able to decode this.

Now, what is a caption author supposed to do when creating TTML? They obviously cannot expect all players to be able to play back all TTML versions. Should they create different files depending on what platform they are targeting, i.e. an EBU-TT version, a SMPTE-TT version, a vanilla TTML version, and a Simple Delivery Profile version? Should they throw all the features of all the versions into one TTML file and hope that the players will pick out the right things that they require and drop the rest on the floor?

Maybe the best way to progress would be to make a list of the “safe” features: those features that every TTML profile supports. That may be the best way to get an “interoperable TTML” file. Here’s me hoping that this minimal set of features doesn’t just end up being the usual (starttime, endtime, text) triple.

UPDATE:

I just found out that UltraViolet have their own profile of SMPTE-TT called CFF-TT (see UltraViolet FAQ and spec). They are making some SMPTE-TT fields optional, but introduce a new @forcedDisplayMode attribute under their own namespace “cff:”.

by silvia at September 19, 2012 11:01 AM

Jean-Marc Valin : Opus is out, it rocks, and it's a standard


We finally made it! Opus is now standardized by the IETF as RFC 6716. See the Mozilla hacks post and the Xiph.Org press release for more details. Of course, feel free to help spread the word around.

We're also releasing both version 1.0.0, which is the same code as the RFC, and version 1.0.1, which is a minor update on that code (mainly with the build system). As usual, you can get those from opus-codec.org/

Thanks to everyone who contributed by fixing bugs, reporting issues, implementing Opus support, testing, advocating, ... It was a lot of work, but it was worth it.

September 11, 2012 07:20 PM

Silvia Pfeiffer : Why I became a HTML5 co-editor


A few weeks ago, I had the honor to be appointed as part of the editorial team of the W3C HTML5 specification.

Since Ian Hickson had recently decided to focus solely on editing the WHATWG HTML living standard specification, the W3C started looking for other editors to take the existing HTML5 specification to REC level. REC level is what other standards organizations call a “ratified standard”.

But what does REC level really mean for HTML?

In my probably somewhat subjective view, recommendation level means that a snapshot is taken of the continuously evolving HTML spec which has a comprehensive feature set, is implemented in a cross-browser interoperable way, has a complete test set for the features, and has received wide review. The latter implies that other groups in the W3C have had a chance to look at the specification and make sure it satisfies their basic requirements, which include e.g. applicability to all users (accessibility, internationalization), platforms, and devices (mobile, TV).

Basically it means that we stop for a “moment”, take a deep breath, polish the feature set that we’ve been working on this far, and make sure we all agree on it, before we get back to changing the world with cool new stuff. In a software project we would call it a release branch with feature freeze.

Now, as productive as that may sound for software – it’s not actually that exciting for a specification. Firstly, the most exciting things happen when writing new features. Secondly, development of browsers doesn’t just magically stop to let the release (REC) happen. And lastly, if we’ve done our specification work well, there should be little work left to do. Basically, it’s the thankless work of tidying up that we’re looking at here.

So, why am I doing it? I am not doing this for money – I’m currently part-time contracting to Google’s accessibility team working on video accessibility and this editor work is not covered by my contract. It wasn’t possible to reconcile polishing work on a specification with the goals of my contract, which include pushing new accessibility features forward. Therefore, when invited, I decided to offer my spare time to the W3C.

I’m giving this time under the condition that I’d only be looking at accessibility and video related sections. This is where my interest and expertise lie, and where I’m passionate about getting things right. I want to make sure that we create accessibility features that will be implemented and that we polish existing video features. I want to make sure we don’t diverge from implementations, which continue to get updated and may follow the WHATWG spec or HTML.next or other needs.

I am not yet completely sure what the editorship will entail. Will we look at tests, too? Will we get involved in HTML.next? This far we’ve been preparing for our work by setting up adequate version control repositories, building a spec creation process, discussing how to bridge to the WHATWG commits, and analysing the long list of bugs to see how to cope with them. There’s plenty of actual text editing work ahead and the team is shaping up well! I look forward to the new experiences.

by silvia at August 15, 2012 01:25 PM

Jean-Marc Valin : Opus will be mandatory to implement for WebRTC


I just got back from the 84th IETF meeting in Vancouver. The most interesting part (as far as I was concerned anyway) was the rtcweb working group meeting. One of the topics was selecting the mandatory-to-implement (MTI) codecs. For audio, we proposed having both Opus and G.711 as MTI codecs. Much to our surprise, most of the following discussion was over whether G.711 was a good idea. In the end, there was strong consensus (the IETF believes in "rough consensus and running code") in favor of Opus+G.711, so that's what's going to be in rtcweb. Of course, implementers will probably ship with a bunch of other codecs for legacy compatibility purposes.

The video codec discussion was far less successful. Not only is there still no consensus over which codec to use (VP8 vs H.264), but there's also been no significant progress towards a consensus. Personally, I can't see how anyone could possibly consider H.264 a viable option. Not only is it incompatible with open source, it's also like signing a blank check: nobody knows how much MPEG-LA will decide to charge for it in the coming years, especially for the encoder, which is currently not an issue for HTML5 (which only requires a decoder). The main argument I have heard against VP8 is "we don't know if there are patents". While this is true in some sense, the problem is much worse for H.264: not only are there tons of known patents for which we only know the licensing fees in the short term, but there's still at least as much risk when it comes to unlicensed patents (see the current Motorola v. Microsoft case).

August 05, 2012 07:24 PM

Jean-Marc Valin : Opus approved by the IETF


Three years after we first tried convincing the IETF to standardize an audio codec, Opus has finally been approved by the IETF. The only remaining step until it's officially an RFC is the RFC editor (fixing last minor issues, typos, ...). That should take in the order of 6-8 weeks (variable), at which point we'll have the RFC and the 1.0 release. Thanks to everyone who helped developing, testing, supporting or advocating Opus.

July 03, 2012 02:59 PM

Ben Schwartz : It’s Google


I’m normally reticent to talk about the future; most of my posts are in the past tense. But now the plane tickets are purchased, apartment booked, and my room is gradually emptying itself of my furniture and belongings. The point of no return is long past.

A few days after Independence Day, I’ll be flying to Mountain View for a week at the Googleplex, and from there to Seattle (or Kirkland), to start work as a software engineer on Google’s WebRTC team, within the larger Chromium development effort. The exact project I’ll be working on initially isn’t yet decided, but a few very exciting ideas have floated by since I was offered the position in March.

Last summer I told a friend that I had no idea where I would be in a year’s time, and when I listed places I might be — Boston, Madrid, San Francisco, Schenectady — Seattle wasn’t even on the list. It still wasn’t in March, when I was offered this position in the Cambridge (MA) office. It was an unfortunate coincidence that the team I’d planned to join was relocated to Seattle shortly after I’d accepted the offer.

My recruiters and managers were helpful and gracious in two key ways. First, they arranged for me to meet with ~5 different leaders in the Cambridge office whose teams I might be able to join instead of moving. Second, they flew me out to Seattle (I’d never been to the city, nor the state, nor any of the states or provinces that it borders) and arranged for meetings with various managers and developers in the Kirkland office, just so I could learn more about the office and the city. I spent the afternoon wandering the city and (with help from a friend of a friend) looking at as many districts as I could squeeze between lunch and sleep.

The visit made all the difference. It made the city real to me … and it seemed like a place that I could live. It also confirmed an impressive pattern: every single Google employee I met, at whichever office, seemed like someone I would be happy to work alongside.

When I returned there were yet more meetings scheduled, but I began to perceive that the move was essentially inevitable. The hiring committee had done their job well, and assigned me to the best fitting position. Everything else was second best at best.

It’s been an up and down experience, with the drudgery of packing and schlepping an unwelcome reminder of the feeling of loss that accompanies leaving history, family, and friends behind. I am learning in the process that, having never really moved, I have no idea how to move.

But there’s also sometimes a sense of joy in it. I am going to be an independent, free adult, in a way that cannot be achieved by even the happiest potted plant.

After signing the same lease on the same student apartment for the seventh time, I worried about getting stuck, in some metaphysical sense, about failure to launch from my too-comfortable cocoon. It was time for a grand adventure.

This is it.

by Ben at June 29, 2012 05:10 PM

David Schleef : GStreamer Streaming Server Library


Introducing the GStreamer Streaming Server Library, or GSS for short.

This post was originally intended to be a release announcement, but I started to wander off to work on other projects before the release was 100% complete.  So perhaps this is a pre-announcement.  Or it’s merely an informational piece with the bonus that the source code repository is in a pretty stable and bug-free state at the moment.  I tagged it with "gss-0.5.0".

What it is

GSS is a standalone HTTP server implemented as a library.  Its special focus is to serve live video streams to thousands of clients, mainly for use inside an HTML5 video tag.  It’s based on GStreamer, libsoup, and json-glib, and uses Bootstrap and BrowserID in the user interface.

GSS comes with a streaming server application that is essentially a small wrapper around the library.  This application is referred to as the Entropy Wave Streaming Server (ew-stream-server); the code that is now GSS was originally split out of this application.  The app can be found in the tools/ directory in the source tree.

Features

  • Streaming formats: WebM, Ogg, MPEG-TS.  (FLV support is waiting for a flvparse element in GStreamer.)
  • Streams in different formats/sizes/bitrates are bundled into a single “program”.
  • Streaming to Flash via HTTP.
  • Authentication using BrowserID.
  • Automatic conversion from properly formed MPEG-TS to HTTP Live Streaming.
  • Automatic conversion to RTP/RTSP (Experimental; works for Ogg/Theora/Vorbis only).
  • Stream upload via HTTP PUT (3 different varieties), Icecast, raw TCP socket.
  • Stream pull from another HTTP streaming server.
  • Content protection via automatic one-time URLs.
  • (Experimental) Video-on-Demand stream types.
  • Per-stream, per-program, and server metrics.
  • An HTTP configuration interface and REST API are used to control the server, allowing standalone operation and easy integration with other web servers.

What’s not there?

  • Other types of authentication, LDAP or other authorization.
  • RTMP support.  (Maybe some day, but there are several good open-source Flash servers out there already.)
  • Support for upload using HTTP PUT with no 100-Continue header.  Several HTTP libraries do this.
  • Decent VOD support, with rate-controlled streaming, burst start, and seeking.

The details

  • Main web page
  • Code repository
  • License: LGPL (same as GStreamer)

 

by David Schleef at June 14, 2012 06:21 PM

Silvia Pfeiffer : Video Conferencing in HTML5: WebRTC via Web Sockets


A bit over a week ago I gave a presentation at Web Directions Code 2012 in Melbourne. Maxine and John asked me to speak about something related to HTML5 video, so I went for the new shiny: WebRTC – real-time communication in the browser.

Presentation slides

I only had 20 min, so I had to make it tight. I wanted to show off video conferencing without special plugins in Google Chrome in just a few lines of code, as is the promise of WebRTC. To a large extent, I achieved this. But I made some interesting discoveries along the way. Demos are in the slide deck.

UPDATE: Opera 12 has been released with WebRTC support.

Housekeeping: if you want to replicate what I have done, you need to install a Google Chrome Web Browser 19+. Then make sure you go to chrome://flags and activate the MediaStream and PeerConnection experiment(s). Restart your browser and now you can experiment with this feature. Big warning up-front: it’s not production-ready, since there are still changes happening to the spec and there is no compatible implementation by another browser yet.

Here is a brief summary of the steps involved to set up video conferencing in your browser:

  1. Set up a video element each for the local and the remote video stream.
  2. Grab the local camera and stream it to the first video element.
  3. (*) Establish a connection to another person running the same Web page.
  4. Send the local camera stream on that peer connection.
  5. Accept the remote camera stream into the second video element.
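Steps 1 and 2 only take a few lines of JavaScript. Here is a minimal sketch using the unprefixed API (the 2012 implementation used the callback-based navigator.webkitGetUserMedia instead); the element id is just a placeholder:

const localVideo = document.getElementById('localVideo');  // placeholder <video autoplay muted> element

// Step 2: grab the local camera and microphone and show them in the first video element.
navigator.mediaDevices.getUserMedia({ video: true, audio: true })
  .then(stream => {
    localVideo.srcObject = stream;
    // Step 4 would later add this stream to the peer connection, e.g.
    // stream.getTracks().forEach(track => pc.addTrack(track, stream));
  })
  .catch(err => console.error('getUserMedia failed:', err));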

Now, the most difficult part of all of this – believe it or not – is the signalling part that is required to build the peer connection (marked with (*)). Initially I wanted to run completely without a server and just enter the remote’s IP address to establish the connection. This is, however, not a functionality that the PeerConnection object provides [might this be something to add to the spec?].

So, you need a server known to both parties that can provide for the handshake to set up the connection. All the examples that I have seen, such as https://apprtc.appspot.com/, use a channel management server on Google’s App Engine. I wanted it all working with HTML5 technology, so I decided to use a Web Socket server instead.

I implemented my Web Socket server using node.js (code of websocket server). The video conferencing demo is embedded in the slide deck – you can also use the stand-alone HTML page. Works like a treat.
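The linked code has all the details; as a rough idea of how little a signalling relay needs, here is a sketch (not my actual server code) using the node.js “ws” module, with a made-up port number:

const WebSocket = require('ws');
const wss = new WebSocket.Server({ port: 8080 });  // made-up port

// Forward every signalling message verbatim to all other connected clients –
// that is all the SDP handshake described below needs.
wss.on('connection', ws => {
  ws.on('message', message => {
    wss.clients.forEach(client => {
      if (client !== ws && client.readyState === WebSocket.OPEN) {
        client.send(message.toString());
      }
    });
  });
});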

While it is still using Google’s STUN server to get through NAT, the messaging for setting up the connection is running completely through the Web Socket server. The messages that get exchanged are plain SDP message packets with a session ID. There are OFFER, ANSWER, and OK packets exchanged for each streaming direction. You can see some of it in the below image:

[Image: excerpt of the SDP OFFER/ANSWER/OK message exchange]

I’m not running a public WebSocket server, so you won’t be able to see this part of the presentation working. But the local loopback video should work.

At the conference, it all went without a hitch (while the wireless played along). I believe you have to host the WebSocket server on the same machine as the Web page, otherwise it won’t work for security reasons.

A whole new world of opportunities lies out there when we get the ability to set up video conferencing on every Web page – scary and exciting at the same time!

by silvia at June 14, 2012 07:43 AM

Silvia Pfeiffer : HTML5 multi-track audio or video


In the last months, we’ve been working hard at the WHATWG and W3C to spec out new HTML markup and a JavaScript interface for dealing with audio or video content that has more than just one audio and video track.

This is particularly relevant when a Web page author wants to add a sign language track to a video or audio resource for deaf people, or an audio description track (i.e. a sound track in which a speaker explains the key things that can be seen on screen) for blind people. It is also relevant when a Web page author wants to publish a video with multiple audio tracks that are each a different language dub of the video, and for less common cases such as a director’s commentary track or making different camera angles of an event available.

Just to be clear: this is not a means to introduce video editing functionality into the Web browser. If you want to do edits, you’re better off with an application that will eventually render a new piece of content and includes fancy transitions etc. Similarly, this is not a means to introduce mixing functionality (as in what DJs do when they play with multiple audio recordings). You’re better off with an actual audio mixing or DJ application that will provide you all sorts of amazing effects and filters.

So, multi-track is squarely focused on synchronizing alternative or additional tracks to a single resource with a single timeline to which all tracks are slaved.

Two means of publishing such multi-track media content are possible:

  • In-band multi-track
  • Synchronized resources

1. In-band multi-track

In in-band multi-track, there is a single file that has all the tracks inside it. For this single file, there is now an API in HTML5 that allows addressing and controlling these tracks.

Of the video file formats that Web browsers support, WebM is currently not defined to contain more than one audio or video track. However, since WebM is using the Matroska container format, which supports multi-track, it is possible to extend WebM for multi-track resources. I have seen multitrack Ogg, MP4 and Matroska files in the wild and most media players support their display.

The specification that has gone into HTML5 to support in-band multi-track looks as follows:

interface HTMLMediaElement : HTMLElement {
  [...]
  // tracks
  readonly attribute MultipleTrackList audioTracks;
  readonly attribute ExclusiveTrackList videoTracks;
};

interface TrackList {
  readonly attribute unsigned long length;
  DOMString getID(in unsigned long index);
  DOMString getKind(in unsigned long index);
  DOMString getLabel(in unsigned long index);
  DOMString getLanguage(in unsigned long index);

           attribute Function onchange;
};

interface MultipleTrackList : TrackList {
  boolean isEnabled(in unsigned long index);
  void enable(in unsigned long index);
  void disable(in unsigned long index);
};

interface ExclusiveTrackList : TrackList {
  readonly attribute unsigned long selectedIndex;
  void select(in unsigned long index);
};

You will notice that every audio and video track gets an index to address them. You can enable() and disable() individual audio tracks and you can select() a single video track for display. This means that one or more audio tracks can be active at the same time (e.g. main audio and audio description), but only one video track will be active at a time (e.g. main video or sign language).

Through the getID(), getKind(), getLabel() and getLanguage() functions you can find out more about what actual content is available in the individual tracks so as to activate/deactivate them correctly and display the right information about them.

getKind() identifies the type of content that the track exposes such as “description” (for audio description), “sign” (for sign language), “main” (for the default displayed track), “translation” (for a dubbed audio track), and “alternative” (for an alternative to the default track).

getLabel() provides a human-readable string that describes the content of the track and is intended for use in a menu.

getID() provides a short machine-readable string that can be used to construct a media fragment URI for the track. The use case for this will be discussed later.

getLanguage() provides a machine-readable language code to identify which language is spoken or signed in an audio or sign language video track.
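To illustrate how these functions work together, here is a small sketch (the element id is hypothetical) that lists all in-band tracks on the console – the kind of information a track selection menu would be built from:

const v = document.getElementById('v1');  // hypothetical multi-track media element

// Walk the audio and video track lists using the interfaces quoted above.
for (let i = 0; i < v.audioTracks.length; i++) {
  console.log('audio', i, v.audioTracks.getID(i), v.audioTracks.getKind(i),
              v.audioTracks.getLabel(i), v.audioTracks.getLanguage(i));
}
for (let i = 0; i < v.videoTracks.length; i++) {
  console.log('video', i, v.videoTracks.getID(i), v.videoTracks.getKind(i),
              v.videoTracks.getLabel(i), v.videoTracks.getLanguage(i));
}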

Example 1:

The following uses a video file that has a main video track, a main audio track in English and French, and an audio description track in English and French. (It likely also has caption tracks, but we will ignore text tracks for now.) This code sample switches the French audio tracks on and all other audio tracks off.

<video id="v1" poster="video.png" controls>
 <source src="…">
 …
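The switching logic that this example describes boils down to a loop over the audio track list; a sketch using the MultipleTrackList interface from above (assuming the French tracks are labelled with the language code “fr”):

const v1 = document.getElementById('v1');

// Enable every French audio track (main audio and audio description),
// disable all the others.
for (let i = 0; i < v1.audioTracks.length; i++) {
  if (v1.audioTracks.getLanguage(i) === 'fr') {
    v1.audioTracks.enable(i);
  } else {
    v1.audioTracks.disable(i);
  }
}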

Example 2:

The following uses an audio file that has a main audio track in English, no main video track, but sign language video tracks in ASL (American Sign Language), BSL (British Sign Language), and ASF (Australian Sign Language). This code sample switches the Australian sign language track on and all other video tracks off.

<video id="a1" controls>
 <source src="…">
 …
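Since videoTracks is an ExclusiveTrackList, a single select() call activates the wanted track and implicitly deactivates all other video tracks; a sketch, where the “sign” kind and the “asf” language tag are assumptions about how the tracks are marked up:

const a1 = document.getElementById('a1');

for (let i = 0; i < a1.videoTracks.length; i++) {
  // assuming the Auslan track is marked with kind "sign" and language "asf"
  if (a1.videoTracks.getKind(i) === 'sign' &&
      a1.videoTracks.getLanguage(i) === 'asf') {
    a1.videoTracks.select(i);
    break;
  }
}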

If you have more tracks in both examples that conflict with your intentions, you may need to further filter your activation / deactivation code using the getKind() function.

2. Synchronized resources

Sometimes the production process of media creates not a single resource with multiple contained tracks, but multiple resources that all share the same timeline. This is particularly useful for the Web, because it means the user can download only the required resources, typically saving a substantial amount of bandwidth.

For this situation, an attribute called @mediagroup can be added in markup to slave multiple media elements together. This is administered in the JavaScript API through a MediaController object, which provides events and attributes for the combined multi-track object.

The new IDL interfaces for HTMLMediaElement are as follows:

interface HTMLMediaElement : HTMLElement {
  [...]
  // media controller
           attribute DOMString mediaGroup;
           attribute MediaController controller;
};

interface MediaController {
  readonly attribute TimeRanges buffered;
  readonly attribute TimeRanges seekable;
  readonly attribute double duration;
           attribute double currentTime;

  readonly attribute boolean paused;
  readonly attribute TimeRanges played;
  void play();
  void pause();

           attribute double defaultPlaybackRate;
           attribute double playbackRate;

           attribute double volume;
           attribute boolean muted;

           attribute Function onemptied;
           attribute Function onloadedmetadata;
           attribute Function onloadeddata;
           attribute Function oncanplay;
           attribute Function oncanplaythrough;
           attribute Function onplaying;
           attribute Function onwaiting;
           attribute Function ondurationchange;
           attribute Function ontimeupdate;
           attribute Function onplay;
           attribute Function onpause;
           attribute Function onratechange;
           attribute Function onvolumechange;
};

You will notice that the MediaController replicates some of the states and events of the slave media elements. In general, the read-only attributes represent the summary state across all the elements, and the writable attributes, when set, are handed through to all the slave elements.

Importantly, if the individual media elements have @controls activated, then the displayed controls interact with the MediaController thus allowing synchronized playback and interaction with the combined multi-track object.

Example 3:

The following uses a video file that has a main video track and a main audio track in English. There is another video file with the ASL sign language for the video, and an audio file with the audio description in English. This code sample creates controls on the first file, which then also control the audio description and the sign language video, neither of which has controls. Since the audio description doesn’t have controls, it doesn’t get visually displayed. The sign language video will just sit next to the main video without controls.

<video id="v1" poster="video.png" controls mediagroup="a11y_vid">
 <source src="…">
 …
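The combined object can also be driven from script through the controller attribute of any of the slaved elements; a minimal sketch assuming the markup above:

const controller = document.getElementById('v1').controller;

// These calls act on all media elements in the "a11y_vid" media group at once.
controller.pause();
controller.currentTime = 0;  // seek everything back to the start
controller.play();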

Example 4:

We now accompany a main video with three sign language video tracks in ASL, BSL and ASF. We could just do this in JavaScript and replace the currentSrc of a second video element with the links to BSL and ASF as required, but then we need to run our own media controls to list the available tracks. So, instead, we create a video element for each one of the tracks and use CSS to remove the inactive ones from the page layout. The code sample activates the ASF track and deactivates the other sign language tracks.

<style>
  video.inactive { display: none; }
</style>

<video id="v1" poster="video.png" controls mediagroup="a11y_vid">
 <source src="…">
 …
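The activation code then only needs to move the inactive class around; a sketch assuming the three sign language videos carry the hypothetical ids v_asl, v_bsl and v_asf:

['v_asl', 'v_bsl', 'v_asf'].forEach(id => {
  const el = document.getElementById(id);
  if (id === 'v_asf') {
    el.classList.remove('inactive');  // show the Auslan video
  } else {
    el.classList.add('inactive');     // hide the other sign language videos
  }
});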

Example 5:

In this final example we look at what to do when we have an in-band multi-track resource with multiple video tracks that should all be displayed on screen. This is not a simple problem to solve, because a video element is only allowed to display a single video track at a time. Therefore, for this problem you need to use both approaches: in-band and synchronized resources.

We take an in-band multitrack resource with a main video and audio track and three sign language tracks in ASL, BSL and ASF. The second resource will be made up from the URI of the first resource with a media fragment address of the sign language tracks. (If required, these can be discovered using the getID() function on the first resource.) The markup will look as follows:

<video id="v1" poster="video.png" controls mediagroup="a11y_vid">
 <source src="…">
 …
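The src of the second element can then be constructed from the first resource’s URI plus a Media Fragment track address; a sketch in which the element ids and the way the sign language track is identified are assumptions:

const main = document.getElementById('v1');  // in-band multi-track resource
const sign = document.getElementById('v2');  // second element in the same media group

// Pick a sign language track from the in-band resource and address it
// through a #track media fragment on the same URI.
for (let i = 0; i < main.videoTracks.length; i++) {
  if (main.videoTracks.getKind(i) === 'sign') {
    sign.src = main.currentSrc + '#track=' + main.videoTracks.getID(i);
    break;
  }
}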

Note that with multiple video elements you can always style them in the way that you want them displayed on screen. E.g. if you want a picture-in-picture display, you scale the second video down and absolutely position it on top of the first one in the appropriate location. You can even grab the second video into a canvas, chroma-key your sign language speaker on a green or blue screen and remove that background through some canvas processing before popping it on top of the video.

The world is all yours!

HOWEVER: There is one big caveat on all these specs – while they have all found entry into the HTML5 specification, it would be expecting a bit much to have browser support already.

by silvia at June 02, 2012 09:36 AM

David Schleef : GStreamer backend for video in Firefox


Good news to hear that the GStreamer backend for video playback in Firefox has landed, due to a flurry of work by Alessandro Decina in the last few months.  Of course, this isn’t part of the standard Firefox build (but maybe some day?), but it’s very useful for putting Firefox on mobile and embedded platforms, since GStreamer has a well-established ecosystem of vendor-provided plugins for hardware decoding.

by David Schleef at April 29, 2012 10:19 PM

David Schleef : OggStreamer: audio capture and streaming device


Recently learned about a cool new
