Media Timed Events

W3C Interest Group Note 05 August 2019

This version:
https://www.w3.org/TR/2019/NOTE-media-timed-events-20190805/
Latest published version:
https://www.w3.org/TR/media-timed-events/
Latest editor's draft:
https://w3c.github.io/me-media-timed-events/
Editor:
( British Broadcasting Corporation )
Former editor:
Giridhar Mandyam (Qualcomm) (until December 2018)
Participate:
GitHub w3c/me-media-timed-events
File a bug
Commit history
Pull requests

Abstract

This document collects use cases and requirements for improved support for timed events related to audio or video media on the web, where synchronization to a playing audio or video media stream is needed, and makes recommendations for new or changed web APIs to realize these requirements. The goal is to extend the existing support in HTML for text track cues to add support for dynamic content replacement cues and generic data cues that drive synchronized interactive media experiences, and improve the timing accuracy of rendering of web content intended to be synchronized with audio or video media playback.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.

The Media & Entertainment Interest Group may update these use cases and requirements over time. Development of new web APIs based on the requirements described here, for example, DataCue, will proceed in the Web Platform Incubator Community Group (WICG), with the goal of eventual standardization within a W3C Working Group. Contributors to this document are encouraged to participate in the WICG. Where the requirements described here affect the HTML specification, contributors will follow up with WHATWG. The Interest Group will continue to track these developments and provide input and review feedback on how any proposed API meets these requirements.

This document was published by the Media & Entertainment Interest Group as an Interest Group Note.

GitHub Issues are preferred for discussion of this specification.

Publication as an Interest Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

The disclosure obligations of the Participants of this group are described in the charter.

This document is governed by the 1 March 2019 W3C Process Document.

1. Introduction

There is a need in the media industry for an API to support arbitrary data associated with points in time or periods of time in a continuous media (audio or video) presentation. This data may include:

For the purpose of this document, we refer to these collectively as media timed events. These events can be used to carry information intended to be synchronized with the media stream, to support use cases such as dynamic content replacement, ad insertion, or presentation of supplemental content alongside the audio or video, or more generally, making changes to a web page, or executing application code triggered from JavaScript events, at specific points on the media timeline of an audio or video media stream.

Media timed events may be carried either in-band, meaning that they are delivered within the audio or video media container or multiplexed with the media stream, or out-of-band, meaning that they are delivered externally to the media container or media stream.

This document describes use cases and requirements that go beyond the existing support for timed text, using TextTrack and related APIs.

2. Terminology

The following terms are used in this document:

The following terms are defined in [HTML]:

The following term is defined in [HR-TIME]:

The following term is defined in [WEBVTT]:

3. Use cases

Media timed events carry metadata that is related to points in time or periods of time on the media timeline, which can be used to trigger retrieval and/or rendering of web resources synchronized with media playback. Such resources can be used to enhance user experience in the context of media that is being rendered. Some examples include display of social media feeds corresponding to a live video stream such as a sporting event, banner advertisements for sponsored content, accessibility-related assets such as large print rendering of captions, and display of track titles or images alongside an audio stream.

The following sections describe a few use cases in more detail.

3.1 Dynamic content insertion

A media content provider wants to allow insertion of content, such as personalised video, local news, or advertisements, into a video media stream that contains the main program content. To achieve this, media timed events can be used to describe the points on the media timeline, known as splice points, where switching playback to inserted content is possible.

The Society of Cable Telecommunications Engineers (SCTE) specification "Digital Program Insertion Cueing Message for Cable" [SCTE35] defines a data cue format for describing such insertion points. Use of these cues in MPEG-DASH and HLS streams is described in [SCTE35], sections 12.1 and 12.2.

3.2 Audio stream with titles and images

A media content provider wants to provide visual information alongside an audio stream, such as an image of the artist and title of the current playing track, to give users live information about the content they are listening to.

HLS timed metadata [HLS-TIMED-METADATA] uses in-band ID3 metadata to carry the artist and title information, and image content. RadioVIS in DVB ([DVB-DASH], section 9.1.7) defines in-band event messages that contain image URLs and text messages to be displayed, with information about when the content should be displayed in relation to the media timeline.

3.3 Control messages for media streaming clients

A media streaming server uses media timed events to send control messages to a media client application or library, such as dash.js. Typically, segmented streaming protocols such as HLS and MPEG-DASH make use of a manifest document that informs the client of the available encodings of a media stream, e.g., the Media Presentation Description (MPD) document in [MPEGDASH]. MPEG-DASH defines a number of control messages for media streaming clients, which are carried in-band in the media container files. Use cases include:

Reference: M&E IG call 1 Feb 2018: Minutes, [DASH-EVENTING].

3.4 Subtitle and caption rendering synchronization

A subtitle or caption author wants to ensure that subtitle changes are aligned as closely as possible to shot changes in the video. The BBC Subtitle Guidelines [BBC-SUBTITLE] describe authoring best practices. In particular, in section 6.1 authors are advised "it is likely to be less tiring for the viewer if shot changes and subtitle changes occur at the same time. Many subtitles therefore start on the first frame of the shot and end on the last frame."

3.5 Synchronized map animations

A user records footage with metadata, including geolocation, on a mobile video device, e.g., drone or dashcam, to share on the web alongside a map, e.g., OpenStreetMap.

[WEBVMT] is an open format for metadata cues, synchronized with a timed media file, that can be used to drive an online map rendered in a separate HTML element alongside the media element on the web page. The media playhead position controls presentation and animation of the map, e.g., pan and zoom, and allows annotations to be added and removed, e.g., markers, at specified times during media playback. Control can also be overridden by the user with the usual interactive features of the map at any time, e.g., zoom. Concrete examples are provided by the tech demos at the WebVMT website.

3.6 Media analysis visualization

A video image analysis system processes a media stream to detect and recognize objects shown in the video. This system generates metadata describing the objects, including timestamps that describe when the objects are visible, together with position information (e.g., bounding boxes). A web application then uses this timed metadata to overlay labels and annotations on the video using HTML and CSS.

3.7 Presentation of auxiliary content in live media

During a live media presentation, dynamic and unpredictable events may occur which cause temporary suspension of the media presentation. During that suspension interval, auxiliary content, such as the presentation of UI controls and media files, may be unavailable. Depending on the specific user engagement (or not) with the UI controls and the time at which any such engagement occurs, specific web resources may be rendered at defined times in a synchronized manner. For example, a multimedia A/V clip and the subtitles corresponding to an advertisement, previously downloaded and cached by the UA, are played out.

5. Gap analysis

This section describes gaps in existing web platform capabilities needed to support the use cases and requirements described in this document. Where applicable, this section also describes how existing web platform features can be used as workarounds, and any associated limitations.

5.1 MPEG-DASH and ISO BMFF emsg events

The DataCue API has been previously discussed as a means to deliver in-band media timed event data to web applications, but this is not implemented in all of the main browser engines. It is included in the 18 October 2018 HTML 5.3 draft [HTML53-20181018], but is not included in [HTML]. See discussion here and notes on implementation status here.

WebKit supports a DataCue interface that extends HTML5 DataCue with two attributes to support non-text metadata, type and value.

interface DataCue : TextTrackCue {
  attribute ArrayBuffer data; // Always empty
  // Proposed extensions.
  attribute any value;
  readonly attribute DOMString type;
};

type is a string identifying the type of metadata:

WebKit DataCue metadata types
"com.apple.quicktime.udta" QuickTime User Data
"com.apple.quicktime.mdta" QuickTime Metadata
"com.apple.itunes" iTunes metadata
"org.mp4ra" MPEG-4 metadata
"org.id3" ID3 metadata

and value is an object with the metadata item key, data, and optionally a locale:

value = {
  key: String
  data: String | Number | Array | ArrayBuffer | Object
  locale: String
}
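
As a rough illustration of how an application might consume these two attributes, the following sketch formats a cue's metadata for logging. The describeDataCue helper is hypothetical, and the sample object only mimics the shape described above (its key and data values are made up); it is not taken from WebKit.

```javascript
// Hypothetical helper: summarizes a cue that exposes the WebKit-style
// `type` and `value` attributes described above.
function describeDataCue(cue) {
  const { key, data, locale } = cue.value;
  // ArrayBuffer payloads are reported by size rather than dumped.
  const payload = data instanceof ArrayBuffer
    ? `<${data.byteLength} bytes>`
    : JSON.stringify(data);
  const suffix = locale ? ` (${locale})` : '';
  return `[${cue.type}] ${key} = ${payload}${suffix}`;
}

// Mock cue for illustration only; the key and data values are made up.
const sampleCue = {
  type: 'org.id3',
  value: { key: 'TIT2', data: 'Example Title', locale: 'en' },
};
console.log(describeDataCue(sampleCue));
```

In a browser that implements this interface, the same helper could be called on each entry of a metadata track's activeCues list.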

Neither [MSE-BYTE-STREAM-FORMAT-ISOBMFF] nor [INBANDTRACKS] describes handling of emsg boxes.

On resource-constrained devices such as smart TVs and streaming sticks, parsing media segments to extract event information leads to a significant performance penalty, which can have an impact on UI rendering updates if this is done on the UI thread. There can also be an impact on the battery life of mobile devices. Given that the media segments will be parsed anyway by the user agent, parsing in JavaScript is an expensive overhead that could be avoided.

Avoiding parsing in JavaScript is also important for low latency video streaming applications, where minimizing the time taken to pass media content through to the media element's playback buffer is essential.

[HBBTV] section 9.3.2 describes a mapping between the emsg fields described above and the TextTrack and DataCue APIs. A TextTrack instance is created for each event stream signalled in the MPD document (as identified by the schemeIdUri and value), and the inBandMetadataTrackDispatchType TextTrack attribute contains the scheme_id_uri and value values. Because HbbTV devices include a native DASH client, parsing of the MPD document and creation of the TextTrack objects is done by the user agent, rather than by application JavaScript code.
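
Under such a mapping, an application could locate the text tracks for a given event stream by inspecting the dispatch type of each metadata track. The following is a minimal sketch: the findEventTracks name and the urn:example: scheme identifiers are hypothetical, and because the exact layout of the dispatch type string is not described here, a substring match is assumed.

```javascript
// Returns the metadata text tracks whose dispatch type mentions the
// given scheme. `textTracks` is any array-like of TextTrack-shaped
// objects. A substring match is an assumption made for this sketch.
function findEventTracks(textTracks, schemeIdUri) {
  return Array.from(textTracks).filter((track) =>
    track.kind === 'metadata' &&
    String(track.inBandMetadataTrackDispatchType || '').includes(schemeIdUri));
}

// Mock tracks for illustration; the scheme identifiers are made up.
const mockTracks = [
  { kind: 'metadata', inBandMetadataTrackDispatchType: 'urn:example:events 1' },
  { kind: 'subtitles', inBandMetadataTrackDispatchType: '' },
  { kind: 'metadata', inBandMetadataTrackDispatchType: 'urn:example:other' },
];
console.log(findEventTracks(mockTracks, 'urn:example:events').length);

// In a browser: findEventTracks(video.textTracks, 'urn:example:events')
```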

5.2 Synchronized rendering of web resources

In browsers, non-media web rendering is handled through repaint operations at a rate that generally matches the display refresh rate (e.g., 60 times per second), following the user's wall clock. A web application can schedule actions and render web content at specific points on the user's wall clock, notably through Performance.now(), setTimeout(), setInterval(), and requestAnimationFrame().

In most cases, media rendering follows a different path, be it because it gets handled by a dedicated background process or by dedicated hardware circuitry. As a result, progress along the media timeline may follow a clock different from the user's wall clock. [HTML] recommends that the media clock approximate the user's wall clock but does not require it to match the user's wall clock.

To synchronize rendering of web content to a video with frame accuracy, a web application needs:

The following sub-sections discuss mechanisms currently available to web applications to track progress on the media timeline and render content at frame boundaries.

5.2.1 Using cues to track progress on the media timeline

Cues (e.g., TextTrackCue and VTTCue) are units of time-sensitive data on a media timeline [HTML]. The time marches on steps in [HTML] control the firing of cue DOM events during media playback. Time marches on is specified to be run "when the current playback position of a media element changes" but how often this should happen is unspecified. In practice it has been found that the timing varies between browser implementations, in some cases with a delay of up to 250 milliseconds (which corresponds to the lowest rate at which timeupdate events are expected to be fired).

There are two methods a web application can use to handle cues:

An issue with handling of text track and data cue events in HbbTV was reported in 2013. HbbTV requires the user agent to implement an MPEG-DASH client, and so applications must use the first of the above methods for cue handling, which means that applications can miss cues as described above. A similar issue has been filed against the HTML specification.
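
One way cues can be missed is when an application inspects only the cues that are active at the moments the cue-processing steps run. The simulation below is a minimal sketch of that effect, not a model of any particular browser; the function name, the cue data, and the 250 millisecond spacing of the sample times are assumptions chosen for illustration.

```javascript
// Returns the ids of cues that are active at one or more sample times,
// i.e., the cues an application would see if it only inspected the
// list of active cues at those moments.
function cuesSeenWhenSampling(cues, sampleTimes) {
  const seen = new Set();
  for (const t of sampleTimes) {
    for (const cue of cues) {
      if (cue.startTime <= t && t < cue.endTime) seen.add(cue.id);
    }
  }
  return [...seen];
}

const cues = [
  { id: 'long', startTime: 1.0, endTime: 3.0 },
  { id: 'short', startTime: 1.3, endTime: 1.4 }, // lasts 100 ms
];
// Assumed sample times, 250 ms apart.
const sampleTimes = [1.0, 1.25, 1.5, 1.75, 2.0];
console.log(cuesSeenWhenSampling(cues, sampleTimes)); // [ 'long' ]
```

The short cue starts and ends between two consecutive samples, so it never appears in the sampled set even though it lies on the media timeline.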

5.2.2 Using timeupdate events from the media element

Another approach to synchronizing rendering of web content to media playback is to use the timeupdate DOM event, and for the web application to manage the media timed event data to be triggered, rather than use the text track cue APIs in [HTML]. This approach has the same synchronization limitations as described above due to the 250 millisecond update rate specified in time marches on, and so is explicitly discouraged in [HTML]. In addition, the timing variability of timeupdate events between browser engines makes them unreliable for the purpose of synchronized rendering of web content.
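
For illustration only, an application-managed dispatcher driven by timeupdate might look like the sketch below. The createDispatcher name and the event object shape are hypothetical, and seeking is deliberately not handled. The second callback argument reports how late each event was delivered, which makes the synchronization limitation visible.

```javascript
// Creates a dispatcher that fires each event once, when the reported
// playhead position reaches or passes its start time.
// `events` must be sorted by startTime (in seconds).
function createDispatcher(events, onEvent) {
  let next = 0;
  return function onTimeUpdate(currentTime) {
    while (next < events.length && events[next].startTime <= currentTime) {
      // Second argument: how far past the start time we already are.
      onEvent(events[next], currentTime - events[next].startTime);
      next += 1;
    }
  };
}

// In a browser (sketch):
//   const dispatch = createDispatcher(myEvents, handleEvent);
//   video.addEventListener('timeupdate', () => dispatch(video.currentTime));
```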

5.2.3 Polling the current position on the media timeline

Synchronization accuracy can be improved by polling the media element's currentTime property from a setInterval() callback, or by using requestAnimationFrame() for greater accuracy. This technique can be useful where content should be animated smoothly in sync with the media, for example, rendering a playhead position marker in an audio waveform visualization, or displaying web content at specific points on the media timeline. However, the use of setInterval() or requestAnimationFrame() for media synchronized rendering is CPU intensive.
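
A minimal sketch of such a polling loop follows. The pollMediaTime name is hypothetical, and the scheduler is passed in as a parameter so that the same function can be driven by requestAnimationFrame() in a browser or by a fake scheduler elsewhere.

```javascript
// Polls `getTime` on every tick of `schedule` and reports changes.
// Returns a function that stops the polling loop.
function pollMediaTime(getTime, onChange, schedule) {
  let last = NaN; // NaN never equals any value, so the first read is reported
  let running = true;
  function tick() {
    if (!running) return;
    const now = getTime();
    if (now !== last) {
      last = now;
      onChange(now);
    }
    schedule(tick); // keeps running on every tick, hence the CPU cost
  }
  schedule(tick);
  return () => { running = false; };
}

// In a browser (sketch):
//   const stop = pollMediaTime(() => video.currentTime, drawPlayhead,
//                              (cb) => requestAnimationFrame(cb));
```

The loop runs on every scheduled tick whether or not the playhead has moved, which is where the CPU cost noted above comes from.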

5.2.4 Detecting when the next media frame will be rendered

[HTML] does not expose any precise mechanism to assess the time, from a user's wall clock perspective, at which a particular media frame is going to be rendered. A web application may only infer this information by looking at the media element's currentTime property to infer the frame being rendered and the time at which the user will see the next frame. This has several limitations:

  • currentTime is represented as a double value, which does not allow individual frames to be identified reliably due to rounding errors. This is a known issue.
  • currentTime is updated at a user-agent defined rate (typically the rate at which time marches on runs), and is kept stable while scripts are running. When a web application reads currentTime, it cannot tell when this property was last updated, and thus cannot reliably assess whether this property still represents the frame currently being rendered.
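
The rounding limitation can be illustrated with plain double arithmetic. In the sketch below, both function names and the tolerance value are assumptions made for illustration; the tolerant variant is a common application-level workaround, not something defined by [HTML].

```javascript
// Naive mapping from a time in seconds to a frame index.
function naiveFrameIndex(time, frameRate) {
  return Math.floor(time * frameRate);
}

// Variant that tolerates small rounding errors near frame boundaries.
// The tolerance value is an assumption chosen for illustration.
function tolerantFrameIndex(time, frameRate, tolerance = 1e-6) {
  return Math.floor(time * frameRate + tolerance);
}

// 0.29 s is nominally the start of frame 29 at 100 frames per second,
// but 0.29 is not exactly representable as a double, so the product
// falls just below 29.
console.log(naiveFrameIndex(0.29, 100));    // 28
console.log(tolerantFrameIndex(0.29, 100)); // 29
```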

6. Recommendations

This section describes recommendations from the Media & Entertainment Interest Group for the development of a generic media timed event API, and associated synchronization considerations.

6.1 Subscribing to receive media timed event cues

The API should allow web applications to subscribe to receive specific types of media timed event cue. For example, to support MPEG-DASH emsg and MPD events, the cue type is identified by a combination of the scheme_id_uri and (optional) value. The purpose of this is to make receiving cues of each type opt-in from the application's point of view. The user agent should deliver only those cues to a web application for which the application has subscribed. The API should also allow web applications to unsubscribe from specific cue types.
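
To make the opt-in model concrete, the following sketch shows one possible shape for the subscription bookkeeping. It is purely hypothetical: the CueSubscriptions class, its method names, and the cue object fields are invented for illustration and do not correspond to any specified or implemented API.

```javascript
// Hypothetical registry: none of these names come from a specification.
class CueSubscriptions {
  constructor() {
    this.handlers = new Map(); // cue type key -> Set of callbacks
  }

  // Cue types are keyed by scheme identifier plus optional value.
  static key(schemeIdUri, value = '') {
    return `${schemeIdUri} ${value}`;
  }

  subscribe(schemeIdUri, value, callback) {
    const k = CueSubscriptions.key(schemeIdUri, value);
    if (!this.handlers.has(k)) this.handlers.set(k, new Set());
    this.handlers.get(k).add(callback);
  }

  unsubscribe(schemeIdUri, value, callback) {
    const set = this.handlers.get(CueSubscriptions.key(schemeIdUri, value));
    if (set) set.delete(callback);
  }

  // Delivers a cue only to callbacks subscribed to its type and
  // returns the number of callbacks invoked.
  deliver(cue) {
    const set = this.handlers.get(
      CueSubscriptions.key(cue.schemeIdUri, cue.value));
    if (!set) return 0;
    set.forEach((cb) => cb(cue));
    return set.size;
  }
}
```

Cues whose type has no subscriber are simply dropped by deliver(), which mirrors the requirement that only subscribed cue types reach the application.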

6.2 Out-of-band events

To be able to handle out-of-band media timed event cues, including MPEG-DASH MPD events, the API should allow web applications to create and add timed data cues to the media timeline, to be triggered by the user agent. The API should allow the web application to provide all necessary parameters to define the cue, including start and end times, cue type identifier, and data payload. The payload should be any data type (e.g., the set of types supported by the WebKit DataCue). For MPEG-DASH MPD events, the event type is defined by the id and (optional) value fields.
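
As a point of comparison, applications today can approximate out-of-band cues by serializing the payload into the text of a cue on a metadata text track. The sketch below assumes a cue constructor that behaves like VTTCue (start time, end time, text); the helper names and the JSON envelope are hypothetical. Serializing to text is exactly the limitation that a payload of any data type would remove.

```javascript
// Adds a cue carrying a typed data payload to a text track. `CueCtor`
// is expected to behave like VTTCue: (startTime, endTime, text).
function addTimedDataCue(track, CueCtor, { startTime, endTime, type, data }) {
  const cue = new CueCtor(startTime, endTime, JSON.stringify({ type, data }));
  track.addCue(cue);
  return cue;
}

// Recovers the type and payload stored by addTimedDataCue().
function readTimedDataCue(cue) {
  return JSON.parse(cue.text);
}

// In a browser (sketch):
//   const track = video.addTextTrack('metadata');
//   addTimedDataCue(track, VTTCue, {
//     startTime: 10, endTime: 12, type: 'urn:example:ad', data: { id: 1 },
//   });
```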

6.3 Event triggering

For those events that the application has subscribed to receive, the API should:

The API should provide guarantees that no media timed event cues can be missed during linear playback of the media.

6.4 In-band media timed event processing

We recommend updating [INBANDTRACKS] to describe handling of in-band media timed events supported on the web platform, possibly following a registry approach with one specification per media format that describes the details of how media timed events are carried in that format.

6.5 MPEG-DASH events

We recommend that browser engines support MPEG-DASH emsg in-band events and MPD out-of-band events, as part of their support for the MPEG Common Media Application Format (CMAF) [MPEGCMAF].

6.6 Synchronization

In order to achieve greater synchronization accuracy between media playback and web content rendered by an application, the time marches on steps in [HTML] should be modified to allow delivery of cue onenter and onexit DOM events within 20 milliseconds of their positions on the media timeline.

Additionally, to allow such synchronization to happen at frame boundaries, we recommend introducing a mechanism that would allow a web application to accurately predict, using the user's wall clock, when the next frame will be rendered (e.g., as done in the Web Audio API).
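
In the Web Audio API, AudioContext.getOutputTimestamp() returns a pair of values (contextTime in seconds and performanceTime in milliseconds) that relate the audio timeline to the Performance.now() clock. Assuming a comparable pair were available for a media element, the prediction reduces to simple arithmetic. The function name and the reference object shape below are hypothetical.

```javascript
// Estimates the Performance.now()-style time (in milliseconds) at which
// a given media time (in seconds) will be presented, based on a single
// (mediaTime, performanceTime) reference pair and a constant rate.
function predictPresentationTime(targetMediaTime, reference, playbackRate = 1) {
  const deltaSeconds = (targetMediaTime - reference.mediaTime) / playbackRate;
  return reference.performanceTime + deltaSeconds * 1000;
}

// Hypothetical reference pair: media time 10 s was presented at 2000 ms.
const reference = { mediaTime: 10, performanceTime: 2000 };
console.log(predictPresentationTime(10.5, reference)); // 2500
```

The estimate holds only while playback continues at a constant rate without stalls or seeks, so an application would need to refresh the reference pair regularly.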

7. Acknowledgments

Thanks to François Daoust, Charles Lo, Nigel Megitt, Jon Piesing, Rob Smith, and Mark Vickers for their contributions and feedback on this document.

A. References

A.1 Informative references

[3GPP-INTERACTIVITY]
Interactivity Support for 3GPP-Based Streaming and Download Services (Release 15). 3GPP. June 2018. URL: https://www.3gpp.org/ftp/Specs/archive/26_series/26.953/26953-f00.zip
[BBC-SUBTITLE]
Subtitle Guidelines, Version 1.1.7. BBC. May 2018. URL: https://bbc.github.io/subtitle-guidelines/
[DASH-EVENTING]
DASH Eventing and HTML5. Giridhar Mandyam. February 2018. URL: https://www.w3.org/2011/webtv/wiki/images/a/a5/DASH_Eventing_and_HTML5.pdf
[DASHIF-EVENTS]
DASH Player’s Application Events and Timed Metadata Processing Models and APIs. DASH Industry Forum. 3 July 2019. URL: https://dashif-documents.azurewebsites.net/Events/master/event.html
[DASHIFIOP]
Guidelines for Implementation: DASH-IF Interoperability Points. DASH Industry Forum. 9 April 2018. Version 4.2. URL: https://dash-industry-forum.github.io/docs/DASH-IF-IOP-v4.2-clean.pdf
[DVB-DASH]
ETSI TS 103 285 V1.2.1 (2018-03): Digital Video Broadcasting (DVB); MPEG-DASH Profile for Transport of ISO BMFF Based DVB Services over IP Based Networks. ETSI. March 2018. Published. URL: http://www.etsi.org/deliver/etsi_ts/103200_103299/103285/01.02.01_60/ts_103285v010201p.pdf
[EBU-TT-D]
EBU TECH 3380: "EBU-TT-D Subtitling Distribution Format". European Broadcasting Union. URL: https://tech.ebu.ch/docs/tech/tech3380.pdf
[HBBTV]
HbbTV 2.0.2 Specification. HbbTV Association. 16 February 2018. URL: https://www.hbbtv.org/wp-content/uploads/2018/02/HbbTV_v202_specification_2018_02_16.pdf
[HLS-TIMED-METADATA]
Timed Metadata for HTTP Live Streaming. Apple, Inc. 28 April 2011. URL: https://developer.apple.com/library/archive/documentation/AudioVideo/Conceptual/HTTP_Live_Streaming_Metadata_Spec/Introduction/Introduction.html
[HR-TIME]
High Resolution Time Level 2. Ilya Grigorik. W3C. 21 November 2019. W3C Recommendation. URL: https://www.w3.org/TR/hr-time-2/
[HTML]
HTML Standard. Anne van Kesteren; Domenic Denicola; Ian Hickson; Philip Jägenstedt; Simon Pieters. WHATWG. Living Standard. URL: https://html.spec.whatwg.org/multipage/
[html51-20151008]
HTML 5.1. Simon Pieters; Anne van Kesteren; Philip Jägenstedt; Domenic Denicola; Ian Hickson; Steve Faulkner; Travis Leithead; Erika Doyle Navara; Theresa O'Connor; Robin Berjon. W3C. 8 October 2015. W3C Working Draft. URL: https://www.w3.org/TR/2015/WD-html51-20151008/
[HTML53-20181018]
HTML 5.3. Patricia Aas; Shwetank Dixit; Terence Eden; Bruce Lawson; Sangwhan Moon; Xiaoqian Wu; Scott O'Hara. W3C. 18 October 2018. W3C Working Draft. URL: https://www.w3.org/TR/2018/WD-html53-20181018/
[ID3v2]
ID3 tag version 2.4.0 - Main Structure. id3.org. URL: http://id3.org/id3v2.4.0-structure
[INBANDTRACKS]
Sourcing In-band Media Resource Tracks from Media Containers into HTML. Silvia Pfeiffer; Bob Lund. W3C. 26 April 2015. Unofficial Draft. URL: https://dev.w3.org/html5/html-sourcing-inband-tracks/
[ISOBMFF]
Information technology — Coding of audio-visual objects — Part 12: ISO Base Media File Format. ISO/IEC. December 2015. International Standard. URL: http://standards.iso.org/ittf/PubliclyAvailableStandards/c068960_ISO_IEC_14496-12_2015.zip
[MPEGCMAF]
Information technology -- Multimedia application format (MPEG-A) -- Part 19: Common media application format (CMAF) for segmented media. ISO/IEC. Under development. URL: https://www.iso.org/standard/79106.html
[MPEGDASH]
Information technology -- Dynamic adaptive streaming over HTTP (DASH) -- Part 1: Media presentation description and segment formats. ISO/IEC. December 2019. Published. URL: https://www.iso.org/standard/79329.html
[MSE-BYTE-STREAM-FORMAT-ISOBMFF]
ISO BMFF Byte Stream Format. Matthew Wolenetz; Jerry Smith; Mark Watson; Aaron Colwell; Adrian Bateman. W3C. 4 October 2016. W3C Note. URL: https://www.w3.org/TR/mse-byte-stream-format-isobmff/
[RFC8216]
HTTP Live Streaming. R. Pantos, Ed.; W. May. IETF. August 2017. Informational. URL: https://tools.ietf.org/html/rfc8216
[SCTE214-1]
ANSI/SCTE 214-1 2016: MPEG DASH for IP-Based Cable Services Part 1: MPD Constraints and Extensions. SCTE. 2016. URL: http://scte.org/SCTEDocs/Standards/ANSI_SCTE%20214-1%202016.pdf
[SCTE214-2]
ANSI/SCTE 214-2 2016: MPEG DASH for IP-Based Cable Services Part 2: DASH/TS Profile. SCTE. 2016. URL: http://scte.org/SCTEDocs/Standards/ANSI_SCTE%20214-2%202016.pdf
[SCTE214-3]
ANSI/SCTE 214-3 2015: MPEG DASH for IP-Based Cable Services Part 3: DASH/FF Profile. SCTE. 2015. URL: http://scte.org/SCTEDocs/Standards/ANSI_SCTE%20214-3%202015.pdf
[SCTE35]
ANSI/SCTE 35 2019r1: Digital Program Insertion Cueing Message for Cable. SCTE. 2019. URL: https://www.scte.org/SCTEDocs/Standards/ANSI_SCTE%2035%202019r1.pdf
[WEB-ISOBMFF]
ISO/IEC JTC1/SC29/WG11 N16944 Working Draft on Carriage of Web Resources in ISOBMFF. Thomas Stockhammer; Cyril Concolato. MPEG. July 2017. URL: https://mpeg.chiariglione.org/standards/mpeg-4/timed-text-and-other-visual-overlays-iso-base-media-file-format/wd-carriage-web
[WEBVMT]
WebVMT: The Web Video Map Tracks Format. Rob Smith. W3C. 29 January 2019. W3C Editor's Draft. URL: https://w3c.github.io/sdw/proposals/geotagging/webvmt/
[WEBVTT]
WebVTT: The Web Video Text Tracks Format. Silvia Pfeiffer. W3C. 4 April 2019. W3C Candidate Recommendation. URL: https://www.w3.org/TR/webvtt1/