Anne van Kesteren

Captioning Markup

Now that HTML5 media elements have somewhat reasonable implementations in browsers, the time has come to take the next step. You may not be aware of this, but lots of features in HTML5 have been developed incrementally within HTML5 itself. For example, the 2D graphics API for the canvas element has gained several new features over time as older features became more widely implemented: transformations, pixel manipulation, drawing video elements, and focus management were all added later. This was, and still is, done this way so that implementations mature in a similar way and more attention is paid to the individual features, which hopefully leads to better and more consistent implementations.

To this end, two proposals have been put forward: one for media element track API extensions and one for associating synchronized text with media elements. The latter proposal suggests two formats for captioning, SRT and TTML. Both have their issues. SRT is not really documented at all, and while TTML has a specification, it is somewhat at odds with the rest of the web platform. (I think I will decapitalize web from now on unless it is part of a name. And why is decapitalize not a word?) Philip Jägenstedt, a colleague of mine implementing video in Opera, wrote on the WHATWG discussion list:

For the record, I am also not enthusiastic about TTML, specifically the styling mechanism which even makes creative use of XML namespaces. An example for those that haven't seen it before:

<region xml:id="r1">
  <style tts:extent="306px 114px"/>
  <style tts:backgroundColor="red"/>
  <style tts:color="white"/>
  <style tts:displayAlign="after"/>
  <style tts:padding="3px 40px"/>
</region>
<p region="r1" tts:backgroundColor="purple" tts:textAlign="center">
  Twinkle, twinkle, little bat!<br/>
  How <span tts:backgroundColor="green">I wonder</span> where you're at!
</p>

While I don't have any suggestions about what to use instead, I'd much prefer something which just uses CSS with the same syntax we're all used to.

Jonas Sicking, developer for Mozilla, wrote:

The mere fact that the proposal suggests "just start by implementing a subset" indicates to me that things have gotten too complicated. I think unless we can say "here is what you should implement, go forth and do it", we've constructed something too complex.

And as Maciej points out, it does not appear that TTML uses CSS for styling, which is the technology used for styling on the web today. This both means extra work for authors familiar with web technologies today, and extra work for implementations which currently support CSS.

Maciej Stachowiak, lead of the WebKit team at Apple, wrote:

I'm especially concerned that TTML presentation is formally defined in terms of XSL-FO, itself an extremely complicated spec that is in many ways at odds with the CSS formatting model in browser engines. I am not at all enthusiastic about implementing a second layout engine just for captions.

While some have claimed that it's probably possible to translate TTML presentation requirements to CSS, I don't really buy this without seeing a normative specification for how to do so.

On the HTML WG mailing list Henri Sivonen suggests creating an HTML version of SRT. That would at least be much more web-like and would not create the need for a new layout engine, which I hope most people would agree is somewhat overkill for a captioning language.
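For readers who have not seen SRT, here is a minimal sketch of how its cues could be read. Since SRT has no formal specification, the layout assumed here (a counter line, a `start --> end` timing line, then the caption text, with blank lines between cues) is simply the one commonly seen in the wild. The names `Cue`, `parse_srt`, and `active_cues` are made up for this illustration and do not come from either proposal.

```python
import re
from dataclasses import dataclass
from typing import List

# Timing line as commonly seen in SRT files: 00:00:01,000 --> 00:00:04,000
# (assumption: there is no formal grammar to check this against)
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


@dataclass
class Cue:
    start: float  # seconds from the start of the media
    end: float    # seconds from the start of the media
    text: str


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    """Convert the four captured timestamp fields to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(data: str) -> List[Cue]:
    """Parse SRT-like text into cues; blocks without a timing line are skipped."""
    cues = []
    normalized = data.replace("\r\n", "\n").strip()
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                g = match.groups()
                # Everything after the timing line is caption text;
                # the counter line before it is ignored.
                text = "\n".join(lines[i + 1:])
                cues.append(Cue(_seconds(*g[:4]), _seconds(*g[4:]), text))
                break
    return cues


def active_cues(cues: List[Cue], t: float) -> List[Cue]:
    """Return the cues that should be visible at playback time t (seconds)."""
    return [c for c in cues if c.start <= t < c.end]


SAMPLE = """1
00:00:01,000 --> 00:00:04,000
Twinkle, twinkle, little bat!

2
00:00:04,500 --> 00:00:08,000
How I wonder where you're at!
"""
```

Calling `active_cues(parse_srt(SAMPLE), 5.0)` returns only the second cue. Note that nothing in this sketch deals with styling or positioning, which is exactly where the two proposed formats differ most.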


  1. Are you familiar with the Advanced SubStation Alpha format?

    I'm not sure if it is entirely appropriate for the web, but it is:

    1. Pretty sophisticated in what it can achieve
    2. Already has plenty of open source tools that can work with it, so it might be more practical to implement than some standard that is cooked up out of thin air.

    People use this with anime to create pretty sophisticated subtitling effects.

    Posted by Brendan Miller at

  2. Also... be aware that the SubStation format was invented because SRTs are just not good enough in terms of styling.

    So, if you do something based on SRT, a lot of people just won't use it, but instead will burn subtitles into the video, which makes it impossible to localise for more than one language. So non-English speakers get screwed.

    Posted by Brendan Miller at

  3. style tts:displayAlign="after" ...wtfits?

    I'd rather quickly (possibly) and easily (possibly) slap together some JavaScript to "associate synchronized text with media elements" than bother with that thing.

    SRT +1. In some cases it may mean you can use existing subtitle files from

    Posted by thinsoldier at