Copyright © 2019 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and permissive document license rules apply.
This document collects use cases and requirements for improved support for timed events related to audio or video media on the web, where synchronization to a playing audio or video media stream is needed, and makes recommendations for new or changed web APIs to realize these requirements. The goal is to extend the existing support in HTML for text track cue events to add support for dynamic content replacement cues and generic metadata events that drive synchronized interactive media experiences, and improve the timing accuracy of rendering of web content intended to be synchronized with audio or video media playback.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.
This document was published by the Media & Entertainment Interest Group as an Interest Group Note.
GitHub Issues are preferred for discussion of this specification.
Publication as an Interest Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
The disclosure obligations of the Participants of this group are described in the charter.
This document is governed by the 1 February 2018 W3C Process Document.
There is a need in the media industry for an API to support metadata events synchronized to audio or video media, specifically for both out-of-band event streams and in-band discrete events (for example, MPD and emsg events in MPEG-DASH). These media timed events can be used to support use cases such as dynamic content replacement, ad insertion, or presentation of supplemental content alongside the audio or video, or more generally, making changes to a web page, or executing application code triggered from JavaScript events, at specific points on the media timeline of an audio or video media stream.
The following terms are used in this document:
The following terms are defined in [HTML]:
Media timed events carry metadata that is related to points in time, or regions of time on the media timeline, which can be used to trigger retrieval and/or rendering of web resources synchronized with media playback. Such resources can be used to enhance user experience in the context of media that is being rendered. Some examples include display of social media feeds corresponding to a live video stream such as a sporting event, banner advertisements for sponsored content, accessibility-related assets such as large print rendering of captions, and display of track titles or images alongside an audio stream.
The following sections describe a few use cases in more detail.
A media content provider wants to allow insertion of content, such as personalised video, local news, or advertisements, into a video media stream that contains the main program content. To achieve this, media timed events are used to describe the points on the media timeline, known as splice points, where switching playback to inserted content is possible.
The Society for Cable and Television Engineers (SCTE) specification "Digital Program Insertion Cueing for Cable" [SCTE35] defines a data cue format for describing such insertion points. Use of these cues in MPEG-DASH and HLS streams is described in [SCTE35], sections 12.1 and 12.2.
A media content provider wants to provide visual information alongside an audio stream, such as an image of the artist and title of the current playing track, to give users live information about the content they are listening to.
HLS timed metadata [HLS-TIMED-METADATA] uses in-band ID3 metadata to carry the image content. RadioVIS in DVB ([DVB-DASH], section 9.1.7) defines in-band event messages that contain image URLs and text messages to be displayed, with information about when the content should be displayed in relation to the media timeline.
A media streaming server uses media timed events to send control messages to a media client library, such as dash.js. Typically, segmented streaming protocols such as HLS and MPEG-DASH make use of a manifest document that informs the client of the available encodings of a media stream, e.g., the Media Presentation Description (MPD) document in [MPEGDASH].
Should any of the content in the manifest document need to change, the client should refresh it by requesting an updated copy from the server. Section 5.10.4 of [MPEGDASH] describes an MPEG-DASH specific event that is used to notify a client application. An in-band emsg event is used as an alternative to setting a cache duration in the response to the HTTP request for the manifest, so the client can refresh the MPD when it actually changes, as opposed to waiting for a cache duration expiry period to elapse. This also has the benefit of reducing the load on HTTP servers caused by frequent manifest requests.
Reference: M&E IG call 1 Feb 2018: Minutes, [DASH-EVENTING].
See also this issue against the [WEB-MEDIA-GUIDELINES]. TODO: Add detail here.
A subtitle or caption author wants to ensure that subtitle changes are aligned as closely as possible to shot changes in the video. The BBC Subtitle Guidelines [BBC-SUBTITLE] describe authoring best practices. In particular, in section 6.1 authors are advised that "it is likely to be less tiring for the viewer if shot changes and subtitle changes occur at the same time. Many subtitles therefore start on the first frame of the shot and end on the last frame."
A user records footage with metadata, including geolocation, on a mobile video device, e.g., drone or dashcam, to share on the web alongside a map, e.g., OpenStreetMap.
[WEBVMT] is an open format for metadata cues, synchronized with a timed media file, that can be used to drive an online map rendered in a separate HTML element alongside the media element on the web page. The media playhead position controls presentation and animation of the map, e.g., pan and zoom, and allows annotations to be added and removed, e.g., markers, at specified times during media playback. Control can also be overridden by the user with the usual interactive features of the map at any time, e.g., zoom. Concrete examples are provided by the tech demos at the WebVMT website.
Reference: M&E IG TF call 17 Sept 2018: Minutes.
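As an illustration only, the following sketch shows how application-created metadata cues could drive a map alongside the video. The map object and its panTo() and addMarker() methods are hypothetical stand-ins for a mapping library, and the JSON cue payload is an assumed format rather than the [WEBVMT] cue syntax.

```js
// Sketch only: drives a hypothetical map object from metadata cues created by
// the application. createMap(), map.panTo() and map.addMarker() are assumed
// methods, and the JSON payload format is illustrative, not WebVMT syntax.
const video = document.querySelector('video');
const map = createMap('map'); // hypothetical map factory provided by the page
const track = video.addTextTrack('metadata', 'geolocation');
track.mode = 'hidden';

// Between 5 s and 10 s on the media timeline, pan the map and show a marker.
const cue = new VTTCue(5, 10, JSON.stringify({
  pan: { lat: 51.0, lng: -1.8, zoom: 14 },
  marker: { lat: 51.0, lng: -1.8, label: 'Start' }
}));
cue.onenter = () => {
  const cmd = JSON.parse(cue.text);
  if (cmd.pan) map.panTo(cmd.pan.lat, cmd.pan.lng, cmd.pan.zoom);
  if (cmd.marker) map.addMarker(cmd.marker);
};
track.addCue(cue);
```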
A video image analysis system processes a media stream to detect and recognize objects shown in the video. This system generates metadata describing the objects, including timestamps that describe when the objects are visible, together with position information (e.g., bounding boxes). A web application then uses this timed metadata to overlay labels and annotations on the video using HTML and CSS.
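As a sketch of how such timed metadata might be consumed, assume the detection results are exposed as cues on a metadata text track (the track label and JSON payload format below are assumed), and that the page has a container element positioned over the video:

```js
// Sketch only: assumes each cue's payload is JSON such as
// {"label": "dog", "x": 0.1, "y": 0.2, "width": 0.3, "height": 0.25},
// with coordinates normalized to the video dimensions, and that #overlay is
// an absolutely positioned element covering the video.
const video = document.querySelector('video');
const overlay = document.getElementById('overlay');
const track = Array.from(video.textTracks)
  .find((t) => t.kind === 'metadata' && t.label === 'object-detection');

track.mode = 'hidden'; // dispatch cue events without native rendering
track.addEventListener('cuechange', () => {
  overlay.textContent = ''; // clear previous labels
  for (const cue of track.activeCues) {
    const box = JSON.parse(cue.text);
    const label = document.createElement('div');
    label.className = 'bounding-box';
    label.style.left = `${box.x * 100}%`;
    label.style.top = `${box.y * 100}%`;
    label.style.width = `${box.width * 100}%`;
    label.style.height = `${box.height * 100}%`;
    label.textContent = box.label;
    overlay.appendChild(label);
  }
});
```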
During a live media presentation, dynamic and unpredictable events may occur which cause temporary suspension of the media presentation. During that suspension interval, auxiliary content such as the presentation of UI controls and media files may be unavailable. Depending on the specific user engagement (or not) with the UI controls and the time at which any such engagement occurs, specific web resources may be rendered at defined times in a synchronized manner. For example, a multimedia A/V clip along with subtitles corresponding to an advertisement, which were previously downloaded and cached by the UA, are played out.
This section describes gaps in existing web platform capabilities needed to support the use cases and requirements described in this document. Where applicable, this section also describes how existing web platform features can be used as workarounds, and any associated limitations.
The DataCue API has been previously discussed as a means to deliver in-band event data to web applications, but this is not implemented in all of the main browser engines. It is included in the 18 October 2018 HTML 5.3 draft [HTML53-20181018], but is not included in [HTML]. See discussion here and notes on implementation status here.
WebKit supports a DataCue interface that extends HTML5 DataCue with two attributes to support non-text metadata, type and value.
```webidl
interface DataCue : TextTrackCue {
  attribute ArrayBuffer data; // Always empty
  // Proposed extensions.
  attribute any value;
  readonly attribute DOMString type;
};
```
type is a string identifying the type of metadata:

WebKit DataCue metadata types:

| type | Description |
|---|---|
| "com.apple.quicktime.udta" | QuickTime User Data |
| "com.apple.quicktime.mdta" | QuickTime Metadata |
| "com.apple.itunes" | iTunes metadata |
| "org.mp4ra" | MPEG-4 metadata |
| "org.id3" | ID3 metadata |

and value is an object with the metadata item key, data, and optionally a locale:
```
value = {
  key: String
  data: String | Number | Array | ArrayBuffer | Object
  locale: String
}
```
Neither [MSE-BYTE-STREAM-FORMAT-ISOBMFF] nor [INBANDTRACKS] describe handling of emsg boxes.
On resource constrained devices such as smart TVs and streaming sticks, parsing media segments to extract event information leads to a significant performance penalty, which can have an impact on UI rendering updates if this is done on the UI thread. There can also be an impact on the battery life of mobile devices. Given that the media segments will be parsed anyway by the user agent, parsing in JavaScript is an expensive overhead that could be avoided.
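To illustrate the kind of work this pushes onto the application, the following sketch scans a media segment for top-level emsg boxes before it is appended to a Media Source Extensions SourceBuffer. It handles only the ISO BMFF box framing; a real implementation would also need to parse the emsg payload, which differs between versions 0 and 1 of the box.

```js
// Sketch only: returns the payloads of top-level 'emsg' boxes in an ISO BMFF
// segment (ArrayBuffer). Parsing of the payload fields (scheme_id_uri, value,
// timing, message data) is left out and differs between emsg versions 0 and 1.
function findEmsgBoxes(segment) {
  const view = new DataView(segment);
  const boxes = [];
  let offset = 0;
  while (offset + 8 <= view.byteLength) {
    let size = view.getUint32(offset);
    const type = String.fromCharCode(
      view.getUint8(offset + 4), view.getUint8(offset + 5),
      view.getUint8(offset + 6), view.getUint8(offset + 7));
    let headerSize = 8;
    if (size === 1) {
      size = Number(view.getBigUint64(offset + 8)); // 64-bit largesize
      headerSize = 16;
    } else if (size === 0) {
      size = view.byteLength - offset; // box extends to the end of the segment
    }
    if (size < headerSize) break; // malformed box; stop scanning
    if (type === 'emsg') {
      boxes.push(segment.slice(offset + headerSize, offset + size));
    }
    offset += size;
  }
  return boxes;
}

// const events = findEmsgBoxes(segmentBuffer); // before sourceBuffer.appendBuffer()
```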
[HBBTV] section 9.3.2 describes a mapping between the emsg fields described above and the TextTrack and DataCue APIs. A TextTrack instance is created for each event stream signalled in the MPD document (as identified by the schemeIdUri and value), and the inBandMetadataTrackDispatchType TextTrack attribute contains the scheme_id_uri and value values. Because HbbTV devices include a native DASH client, parsing of the MPD document and creation of the TextTracks is done by the user agent, rather than by application JavaScript code.
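A sketch of how an application on such a device might locate and use one of these tracks follows; the scheme_id_uri shown is a placeholder, and the exact way the scheme_id_uri and value are combined in inBandMetadataTrackDispatchType is left to the implementation.

```js
// Sketch only: selects the metadata TextTrack corresponding to a DASH event
// stream, following the HbbTV mapping described above. The scheme_id_uri is
// a placeholder, and the exact dispatch type string is implementation-defined.
function findEventTrack(video, schemeIdUri) {
  for (let i = 0; i < video.textTracks.length; i++) {
    const track = video.textTracks[i];
    if (track.kind === 'metadata' &&
        track.inBandMetadataTrackDispatchType.indexOf(schemeIdUri) === 0) {
      return track;
    }
  }
  return null;
}

const track = findEventTrack(document.querySelector('video'),
                             'urn:example:event:2020'); // placeholder scheme
if (track) {
  track.mode = 'hidden'; // receive cue events without native rendering
  track.oncuechange = () => {
    for (const cue of track.activeCues) {
      console.log('event cue', cue.startTime, cue.endTime, cue);
    }
  };
}
```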
        
Subtitles for video are typically authored against video at a nominal frame rate, e.g., 25 frames per second, which corresponds to 40 milliseconds per frame. The actual video frame rate may be adjusted dynamically according to the video encoding, but the subtitle timing must remain the same ([EBU-TT-D], Annex E).
This places a requirement on user agents for timely delivery of TextTrackCue and VTTCue events, so that application code can respond and render the cues. For subtitle rendering to be possible with frame accuracy, we recommend that cue events are fired within 20 milliseconds of their position on the media timeline.
The time marches on steps in [HTML] control the firing of cue events during media playback. Time marches on requires a timeupdate event to be fired at the HTMLMediaElement between 15 and 250 milliseconds after the last such event, and this requirement therefore specifies the rate at which time marches on is executed during playback. In practice, the timing has been found to vary between browser implementations.
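This variation can be observed with a simple measurement; for example, the following sketch logs the interval between successive timeupdate events during playback:

```js
// Sketch only: logs the interval between successive timeupdate events, which
// gives an indication of how often the time marches on steps run.
const video = document.querySelector('video');
let last = performance.now();
video.addEventListener('timeupdate', () => {
  const now = performance.now();
  console.log(`timeupdate after ${(now - last).toFixed(1)} ms, ` +
              `currentTime = ${video.currentTime.toFixed(3)} s`);
  last = now;
});
```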
        
There are two methods a web application can use to handle text track cues:

1. Add an oncuechange handler function to the TextTrack and inspect the track's activeCues list. Because activeCues contains the list of cues that are active at the time that time marches on is run, it is possible for cues to be missed by a web application using this method, where cues appear on the media timeline between successive executions of time marches on during media playback. This may occur if the cues have a short duration, or if an event handler function is long-running.
2. Add onenter and onexit handler functions to each cue. The time marches on steps guarantee that enter and exit events will be fired for all cues, including those that appear on the media timeline between successive executions of time marches on during media playback. This method is only possible for cues created by the web application, i.e., VTTCue objects, and not cue objects created by the user agent.

Both methods are sketched in the example below.
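The following minimal sketch contrasts the two approaches, using an application-created metadata track and an illustrative cue payload:

```js
// Sketch only: contrasts the two cue-handling approaches described above,
// using an application-created metadata track and an illustrative payload.
const video = document.querySelector('video');
const track = video.addTextTrack('metadata');
track.mode = 'hidden';

// Method 1: cuechange + activeCues. Works for any text track, but a cue that
// starts and ends between two runs of time marches on may never appear in
// activeCues, so it can be missed.
track.addEventListener('cuechange', () => {
  for (const cue of track.activeCues) {
    console.log('active cue at', cue.startTime);
  }
});

// Method 2: per-cue enter/exit handlers. Guaranteed to fire for every cue,
// but only available for cues the application creates itself.
const cue = new VTTCue(10, 10.04, '{"example": true}'); // 40 ms duration
cue.onenter = () => console.log('cue entered');
cue.onexit = () => console.log('cue exited');
track.addCue(cue);
```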
            An issue with handling of text track and data cue events in HbbTV was reported in 2013. HbbTV requires the user agent to implement an MPEG-DASH client, and so applications must use the first of the above methods for cue handling, which means that applications can miss cues as described above.
Describe gaps relating to synchronized rendering of web resources. Can we define a generic web API for scheduling page changes synchronized to playing media? Related: [css-animations-1], [web-animations-1], [css-transitions-1]. See also: https://github.com/bbc/VideoContext. Should this be in scope for the TF?
There is no API for surfacing web content embedded in ISO BMFF containers into the browser (e.g., the HTMLCue proposal discussed at TPAC 2015).
Add more detail on what's required. Some questions / considerations:
This section describes recommendations from the Media & Entertainment Interest Group for the development of a generic media timed event API, and associated synchronization considerations.
The API should allow web applications to subscribe to receive specific event types. For example, to support DASH emsg and MPD events, the API should allow subscription by id and (optional) value. This is to make receiving events opt-in from the application point of view. The user agent should deliver only those events to a web application for which the application has subscribed. The API should also allow web applications to unsubscribe from specific event streams by event type.
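Purely as an illustration of this subscription model, and not a proposal for a concrete interface, a page might use an API shaped along the following lines; none of the names below exist in any specification or browser:

```js
// Hypothetical API sketch: mediaTimedEvents, subscribe(), unsubscribe() and
// the event shape shown here do not exist; they only illustrate the opt-in
// subscription model recommended above.
const video = document.querySelector('video');
const events = video.mediaTimedEvents; // hypothetical attribute

// Subscribe by event type; for DASH, an id (scheme) and optional value.
events.subscribe('urn:mpeg:dash:event:2012', '1'); // MPD validity expiration, as an example

events.addEventListener('event', (e) => {
  console.log(e.startTime, e.endTime, e.data);
});

// Later, stop receiving this event stream.
events.unsubscribe('urn:mpeg:dash:event:2012', '1');
```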
        
To be able to handle out of band events, the API must allow web applications to create events to be added to the media timeline, to be triggered by the user agent. The API should allow the web application to provide all necessary parameters to define the event, including start and end times, event type, and data payload. The payload should be any data type (e.g., the set of types supported by the WebKit DataCue). For DASH MPD events, the event type is defined by the id and (optional) value fields.
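As a workaround with today's APIs, an application that parses the MPD itself can approximate this by creating a metadata text track and adding its own cues, as in the following sketch; the event values shown are illustrative, and handleEvent() is a placeholder for application logic:

```js
// Sketch only: places application-parsed MPD events on the media timeline
// using a metadata text track. The event values are illustrative, and
// handleEvent() is a placeholder. Where the extended WebKit DataCue is
// available, the payload could be carried in its value attribute instead.
const video = document.querySelector('video');
const track = video.addTextTrack('metadata', 'mpd-events');
track.mode = 'hidden';

function addMpdEvent(startTime, endTime, schemeIdUri, value, data) {
  const cue = new VTTCue(startTime, endTime,
                         JSON.stringify({ schemeIdUri, value, data }));
  cue.onenter = () => handleEvent(JSON.parse(cue.text));
  cue.onexit = () => console.log('event ended:', schemeIdUri);
  track.addCue(cue);
}

addMpdEvent(30, 60, 'urn:example:ad-break', '1', { adId: 'abc123' });
```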
        
For those events that the application has subscribed to receive, the API should:
The API must provide guarantees that no events can be missed during linear playback of the media.
We recommend updating [INBANDTRACKS] to describe handling of in-band media timed events supported on the web platform, following a registry approach with one specification per media format that describes the event details for that format.
We recommend that browser engines support MPEG-DASH emsg in-band events and MPD out-of-band events, as part of their support for the MPEG Common Media Application Format (CMAF) [MPEGCMAF].
In order to achieve greater synchronization accuracy between media playback and web content rendered by an application, the time marches on steps in [HTML] should be modified to allow delivery of media timed event start time and end time notifications within 20 milliseconds of their positions on the media timeline.
Thanks to Charles Lo, Nigel Megitt, Jon Piesing, and Rob Smith for their contributions to this document.