Copyright © 2020 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and permissive document license rules apply.
This document collects use cases and requirements for improved support for timed events related to audio or video media on the web, where synchronization to a playing audio or video media stream is needed, and makes recommendations for new or changed web APIs to realize these requirements. The goal is to extend the existing support in HTML for text track cues to add support for dynamic content replacement cues and generic data cues that drive synchronized interactive media experiences, and improve the timing accuracy of rendering of web content intended to be synchronized with audio or video media playback.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.
The Media & Entertainment Interest Group may update these use cases and requirements over time. Development of new web APIs based on the requirements described here, for example, DataCue, will proceed in the Web Platform Incubator Community Group (WICG), with the goal of eventual standardization within a W3C Working Group. Contributors to this document are encouraged to participate in the WICG. Where the requirements described here affect the HTML specification, contributors will follow up with WHATWG. The Interest Group will continue to track these developments and provide input and review feedback on how any proposed API meets these requirements.
This document was published by the Media & Entertainment Interest Group as an Interest Group Note.
GitHub Issues are preferred for discussion of this specification.
Publication as an Interest Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
The disclosure obligations of the Participants of this group are described in the charter.
This document is governed by the 1 March 2019 W3C Process Document.
There is a need in the media industry for an API to support arbitrary data associated with points in time or periods of time in a continuous media (audio or video) presentation. This data may include:
For the purpose of this document, we refer to these collectively as media timed events . These events can be used to carry information intended to be synchronized with the media stream, used to support use cases such as dynamic content replacement, ad insertion, presentation of supplemental content alongside the audio or video, or more generally, making changes to a web page, or executing application code triggered at specific points on the media timeline of an audio or video media stream.
Media timed events may be carried either in-band , meaning that they are delivered within the audio or video media container or multiplexed with the media stream, or out-of-band , meaning that they are delivered externally to the media container or media stream.
This document describes use cases and requirements that go beyond the existing support for timed text, using TextTrack and related APIs.
The following terms are used in this document:
The following terms are defined in [ HTML ]:
activeCues
currentTime
enter
exit
oncuechange
onenter
onexit
TextTrack
TextTrackCue
timeupdate
setTimeout()
setInterval()
requestAnimationFrame()
The following term is defined in [ HR-TIME ]:
The following term is defined in [ WEBVTT ]:
Media timed events carry information that is related to points in time or periods of time on the media timeline, which can be used to trigger retrieval and/or rendering of web resources synchronized with media playback. Such resources can be used to enhance the user experience in the context of media that is being rendered. Some examples include display of social media feeds corresponding to a live video stream such as a sporting event, banner advertisements for sponsored content, and accessibility-related assets such as large print rendering of captions.
The following sections describe a few use cases in more detail.
A media content provider wants to allow insertion of content, such as personalised video, local news, or advertisements, into a video media stream that contains the main program content. To achieve this, media timed events can be used to describe the points on the media timeline , known as splice points, where switching playback to inserted content is possible.
The Society for Cable and Television Engineers (SCTE) specification "Digital Program Insertion Cueing for Cable" [ SCTE35 ] defines a data cue format for describing such insertion points. Use of these cues in MPEG-DASH and HLS streams is described in [ SCTE35 ], sections 12.1 and 12.2.
This use case typically requires frame accuracy, so that inserted content is played at the right time, and continuous playback is maintained.
A media content provider wants to provide visual information alongside an audio stream, such as an image of the artist and title of the current playing track, to give users live information about the content they are listening to.
HLS timed metadata [ HLS-TIMED-METADATA ] uses in-band ID3 metadata to carry the artist and title information, and image content. RadioVIS in DVB ([ DVB-DASH ], section 9.1.7) defines in-band event messages that contain image URLs and text messages to be displayed, with information about when the content should be displayed in relation to the media timeline .
The visual information should be rendered within a hundred milliseconds or so to maintain good synchronization with the audio content.
MPEG-DASH defines a number of control messages for media streaming clients (e.g., libraries such as dash.js). These messages are carried in-band in the media container files. Use cases include:
Reference: M&E IG call 1 Feb 2018: Minutes , [ DASH-EVENTING ].
A subtitle or caption author wants to ensure that subtitle changes are aligned as closely as possible to shot changes in the video. The BBC Subtitle Guidelines [ BBC-SUBTITLE ] describe authoring best practices. In particular, in section 6.1 authors are advised:
"[...] it is likely to be less tiring for the viewer if shot changes and subtitle changes occur at the same time. Many subtitles therefore start on the first frame of the shot and end on the last frame."
The NorDig technical specifications for DVB receivers for the Nordic and Irish markets [ NORDIG ], section 7.3.1, mandate that receivers support TTML in MPEG-2 Transport Streams. The presentation timing precision for subtitles is specified as being within 2 frames.
Another important use case is maintaining synchronization of subtitles during program content with fast dialog. The BBC Subtitle Guidelines, section 5.1 says:
"Impaired viewers make use of visual cues from the faces of television speakers. Therefore subtitle appearance should coincide with speech onset. [...] When two or more people are speaking, it is particularly important to keep in sync. Subtitles for new speakers must, as far as possible, come up as the new speaker starts to speak. Whether this is possible will depend on the action on screen and rate of speech."
A very fast word rate, for example, 240 words per minute, corresponds on average to one word every 250 milliseconds.
A user records footage with metadata, including geolocation, on a mobile video device, e.g., drone or dashcam, to share on the web alongside a map, e.g., OpenStreetMap.
[ WEBVMT ] is an open format for metadata cues, synchronized with a timed media file, that can be used to drive an online map rendered in a separate HTML element alongside the media element on the web page. The media playhead position controls presentation and animation of the map, e.g., pan and zoom, and allows annotations to be added and removed, e.g., markers, at specified times during media playback. Control can also be overridden by the user with the usual interactive features of the map at any time, e.g., zoom. The rendering of the map animation and annotations should usually be to within a hundred milliseconds or so to maintain good synchronization with the video. However, a shot change which instantly moves to a different location would require the map to be updated simultaneously, ideally with frame accuracy.
Concrete examples are provided by the tech demos at the WebVMT website.
A video image analysis system processes a media stream to detect and recognize objects shown in the video. This system generates metadata describing the objects, including timestamps that describe when the objects are visible, together with position information (e.g., bounding boxes). A web application then uses this timed metadata to overlay labels and annotations on the video using HTML and CSS.
This use case requires frame accurate synchronization of the content being rendered over the video.
Media content providers often cover live events where the timing of particular segments, although often pre-scheduled, can be subject to last minute change, or may not be known ahead of time.
The media content provider uses media timed events together with their video stream to add metadata to annotate the start and (where known) end times of each of these segments. This metadata drives a user interface that allows users to see information about the current playing and upcoming segments.
Examples of the dynamic nature of the timing include:
During a live media presentation, dynamic and unpredictable events may occur which cause temporary suspension of the media presentation. During that suspension interval, auxiliary content, such as the presentation of UI controls and media files, may be unavailable. Depending on the specific user engagement (or not) with the UI controls and the time at which any such engagement occurs, specific web resources may be rendered at defined times in a synchronized manner. For example, a multimedia A/V clip with subtitles corresponding to an advertisement, previously downloaded and cached by the UA, may be played out.
This section describes gaps in existing web platform capabilities needed to support the use cases and requirements described in this document. Where applicable, this section also describes how existing web platform features can be used as workarounds, and any associated limitations.
The DataCue API has been previously discussed as a means to deliver in-band media timed event data to web applications, but this is not implemented in all of the main browser engines. It is included in the 18 October 2018 HTML 5.3 draft [ HTML53-20181018 ], but is not included in [ HTML ]. See discussion here and notes on implementation status here.
WebKit supports a DataCue interface that extends HTML5 DataCue with two attributes to support non-text metadata, type and value.
interface DataCue : TextTrackCue {
    attribute ArrayBuffer data; // Always empty

    // Proposed extensions.
    attribute any value;
    readonly attribute DOMString type;
};
type is a string identifying the type of metadata:
WebKit DataCue metadata types

| type | Description |
| --- | --- |
| "com.apple.quicktime.udta" | QuickTime User Data |
| "com.apple.quicktime.mdta" | QuickTime Metadata |
| "com.apple.itunes" | iTunes metadata |
| "org.mp4ra" | MPEG-4 metadata |
| "org.id3" | ID3 metadata |
and value is an object with the metadata item key, data, and optionally a locale:
value = {
    key: String
    data: String | Number | Array | ArrayBuffer | Object
    locale: String
}
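As a non-normative illustration, the following JavaScript sketch shows how a web application might read such metadata cues from a "metadata" kind text track in a browser that implements WebKit's type and value extensions. The video element selector, the org.id3 filter, and the console logging are illustrative assumptions, not part of the WebKit API.

const video = document.querySelector('video');

video.textTracks.addEventListener('addtrack', (event) => {
    const track = event.track;
    if (track.kind !== 'metadata') return;
    track.mode = 'hidden'; // receive cue events without on-screen rendering

    track.addEventListener('cuechange', () => {
        const cues = track.activeCues;
        for (let i = 0; i < cues.length; i++) {
            const cue = cues[i];
            // type and value are the WebKit extensions; guard their absence
            // in engines that only expose the base TextTrackCue fields.
            if ('type' in cue && cue.type === 'org.id3') {
                console.log('ID3 item', cue.value.key, cue.value.data);
            }
        }
    });
});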
Neither [ MSE-BYTE-STREAM-FORMAT-ISOBMFF ] nor [ INBANDTRACKS ] describe handling of emsg boxes.
On resource constrained devices such as smart TVs and streaming sticks, parsing media segments to extract event information leads to a significant performance penalty, which can have an impact on UI rendering updates if this is done on the UI thread. There can also be an impact on the battery life of mobile devices. Given that the media segments will be parsed anyway by the user agent, parsing in JavaScript is an expensive overhead that could be avoided.
Avoiding parsing in JavaScript is also important for low latency video streaming applications, where minimizing the time taken to pass media content through to the media element's playback buffer is essential.
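To illustrate the overhead being described, the sketch below shows the kind of JavaScript-level parsing an application must currently perform: scanning an ISO BMFF media segment (an ArrayBuffer) for top-level emsg boxes and decoding the version 0 field layout. The function name and the handling of malformed or 64-bit box sizes are simplifications assumed here for brevity; this is not production code.

function findEmsgBoxes(segmentBuffer) {
    const view = new DataView(segmentBuffer);
    const decoder = new TextDecoder('utf-8');
    const events = [];
    let offset = 0;

    while (offset + 8 <= view.byteLength) {
        const size = view.getUint32(offset); // box size, big-endian
        const type = decoder.decode(new Uint8Array(segmentBuffer, offset + 4, 4));
        if (size < 8) break; // 64-bit or zero box sizes not handled in this sketch

        if (type === 'emsg') {
            let p = offset + 8;
            const version = view.getUint8(p);
            p += 4; // skip version and flags
            if (version === 0) {
                const readString = () => { // null-terminated UTF-8 string
                    let end = p;
                    while (view.getUint8(end) !== 0) end++;
                    const s = decoder.decode(new Uint8Array(segmentBuffer, p, end - p));
                    p = end + 1;
                    return s;
                };
                const schemeIdUri = readString();
                const value = readString();
                events.push({
                    schemeIdUri,
                    value,
                    timescale: view.getUint32(p),
                    presentationTimeDelta: view.getUint32(p + 4),
                    eventDuration: view.getUint32(p + 8),
                    id: view.getUint32(p + 12),
                    messageData: new Uint8Array(segmentBuffer, p + 16, offset + size - (p + 16)),
                });
            }
        }
        offset += size;
    }
    return events;
}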
[ HBBTV ] section 9.3.2 describes a mapping between the emsg fields described above and the TextTrack and DataCue APIs. A TextTrack instance is created for each event stream signalled in the MPD document (as identified by the schemeIdUri and value), and the inBandMetadataTrackDispatchType TextTrack attribute contains the scheme_id_uri and value values. Because HbbTV devices include a native DASH client, parsing of the MPD document and creation of the TextTracks is done by the user agent, rather than by application JavaScript code.
TextTrackCues with unbounded duration
It is not currently possible to create a TextTrackCue that extends from a given start time to the end of a live media stream. If the stream duration is known, the content author can set the cue's endTime equal to the media duration. However, for live media streams, where the duration is unbounded, it would be useful to allow content authors to specify that the TextTrackCue duration is also unbounded, e.g., by allowing the endTime to be set to Infinity. This would be consistent with the media element's duration property, which can be Infinity for unbounded streams.
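A minimal sketch of this gap, using existing APIs and illustrative cue times and payload: the content author has to choose a finite endTime today, whereas the change discussed above would allow an unbounded value.

const video = document.querySelector('video');
const track = video.addTextTrack('metadata');

// Workaround today: use the known media duration, or an arbitrarily large
// finite value when the live stream's duration is unbounded.
const cue = new VTTCue(120, Number.MAX_VALUE, '{"event": "programme-start"}');
track.addCue(cue);

// Desired: an unbounded cue duration, mirroring the media element's
// duration property for unbounded streams.
// cue.endTime = Infinity; // not currently possible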
In browsers, non-media web rendering is handled through repaint operations at a rate that generally matches the display refresh rate (e.g., 60 times per second), following the user's wall clock. A web application can schedule actions and render web content at specific points on the user's wall clock, notably through Performance.now(), setTimeout(), setInterval(), and requestAnimationFrame().
In most cases, media rendering follows a different path, be it because it gets handled by a dedicated background process or by dedicated hardware circuitry. As a result, progress along the media timeline may follow a clock different from the user's wall clock. [ HTML ] recommends that the media clock approximate the user's wall clock but does not require it to match the user's wall clock.
To synchronize rendering of web content to a video with frame accuracy, a web application needs:
The following sub-sections discuss mechanisms currently available to web applications to track progress on the media timeline and render content at frame boundaries.
Cues (e.g., TextTrackCue and VTTCue) are units of time-sensitive data on a media timeline [ HTML ]. The time marches on steps in [ HTML ] control the firing of cue DOM events during media playback. Time marches on is specified to run "when the current playback position of a media element changes" but how often this should happen is unspecified. In practice it has been found that the timing varies between browser implementations, in some cases with a delay of up to 250 milliseconds (which corresponds to the lowest rate at which timeupdate events are expected to be fired).
There are two methods a web application can use to handle cues:
Add an oncuechange handler function to the TextTrack and inspect the track's activeCues list. Because activeCues contains the list of cues that are active at the time that time marches on is run, it is possible for cues to be missed by a web application using this method, where cues appear on the media timeline between successive executions of time marches on during media playback. This may occur if the cues have short duration, or if execution is delayed by a long-running event handler function.
Add onenter and onexit handler functions to each cue. The time marches on steps guarantee that enter and exit events will be fired for all cues, including those that appear on the media timeline between successive executions of time marches on during media playback. The timing accuracy of these events varies between browser implementations, as the firing of the events is controlled by the rate of execution of time marches on. A sketch of both approaches follows.
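The following non-normative sketch shows both methods for a hypothetical metadata track that already contains cues; the video selector and the handleCue and clearCue functions are illustrative placeholders.

const video = document.querySelector('video');
const track = video.textTracks[0];
track.mode = 'hidden';

// Method 1: a single cuechange handler inspecting activeCues. Cues with a
// short duration may never appear in activeCues between successive runs
// of time marches on, so they can be missed.
track.oncuechange = () => {
    const active = track.activeCues;
    for (let i = 0; i < active.length; i++) {
        handleCue(active[i]);
    }
};

// Method 2: per-cue enter/exit handlers. The enter and exit events are
// guaranteed to fire for every cue, but their timing accuracy depends on
// how often time marches on runs.
for (let i = 0; i < track.cues.length; i++) {
    const cue = track.cues[i];
    cue.onenter = () => handleCue(cue);
    cue.onexit = () => clearCue(cue);
}

function handleCue(cue) { /* application-specific rendering */ }
function clearCue(cue) { /* application-specific cleanup */ }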
An issue with handling of text track and data cue events in HbbTV was reported in 2013. HbbTV requires the user agent to implement an MPEG-DASH client, and so applications must use the first of the above methods for cue handling, which means that applications can miss cues as described above. A similar issue has been filed against the HTML specification.
timeupdate events from the media element
Another approach to synchronizing rendering of web content to media playback is to use the timeupdate DOM event, and for the web application to manage the media timed event data to be triggered, rather than use the text track cue APIs in [ HTML ]. This approach has the same synchronization limitations as described above due to the 250 millisecond update rate specified in time marches on, and so is explicitly discouraged in [ HTML ]. In addition, the timing variability of timeupdate events between browser engines makes them unreliable for the purpose of synchronized rendering of web content.
Synchronization accuracy can be improved by polling the media element's currentTime property from a setInterval() callback, or by using requestAnimationFrame() for greater accuracy. This technique can be useful where content should be animated smoothly in synchronization with the media, for example, rendering a playhead position marker in an audio waveform visualization, or displaying web content at specific points on the media timeline. However, the use of setInterval() or requestAnimationFrame() for media-synchronized rendering is CPU intensive.
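As a non-normative illustration, the sketch below polls currentTime from a requestAnimationFrame() loop to drive synchronized rendering such as a playhead marker; the startSync function name and the render callback are illustrative assumptions.

function startSync(video, render) {
    let rafId;
    const tick = () => {
        if (!video.paused && !video.ended) {
            render(video.currentTime); // application-specific drawing
        }
        rafId = requestAnimationFrame(tick);
    };
    rafId = requestAnimationFrame(tick);
    // Return a function that stops the polling loop to limit CPU usage.
    return () => cancelAnimationFrame(rafId);
}

// Usage (illustrative):
// const stop = startSync(document.querySelector('video'),
//                        (t) => drawPlayhead(t));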
[ HTML ] does not expose any precise mechanism to assess the time, from a user's wall clock perspective, at which a particular media frame is going to be rendered. A web application may only infer this information by looking at the media element's currentTime property to determine the frame being rendered and the time at which the user will see the next frame. This has several limitations:
currentTime is represented as a double value, which does not allow individual frames to be identified reliably, due to rounding errors. This is a known issue.
currentTime is updated at a user-agent defined rate (typically the rate at which time marches on runs), and is kept stable while scripts are running. When a web application reads currentTime, it cannot tell when this property was last updated, and thus cannot reliably assess whether this property still represents the frame currently being rendered. A short sketch illustrating these limitations follows.
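A short sketch of these limitations, assuming an illustrative frame rate of 25 frames per second (the frame rate is not exposed by [ HTML ] and must be known out-of-band):

const video = document.querySelector('video');
const assumedFrameRate = 25; // illustrative; not available from the media element

// Deriving a frame index from currentTime is unreliable: the value is a
// rounded double, and it may not have been updated since time marches on
// last ran, so it can lag the frame actually being displayed.
const frameIndex = Math.floor(video.currentTime * assumedFrameRate);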
This section describes recommendations from the Media & Entertainment Interest Group for the development of a generic media timed event API, and associated synchronization considerations.
The API should allow web applications to subscribe to receive specific types of media timed event cue. For example, to support MPEG-DASH emsg and MPD events, the cue type is identified by a combination of the scheme_id_uri and (optional) value.
The purpose of this is to make receiving cues of each type opt-in from the application's point of view. The user agent should deliver only those cues to a web application for which the application has subscribed. The API should also allow web applications to unsubscribe from specific cue types.
To be able to handle out-of-band media timed event cues, including MPEG-DASH MPD events, the API should allow web applications to create and add timed data cues to the media timeline, to be triggered by the user agent. The API should allow the web application to provide all necessary parameters to define the cue, including start and end times, cue type identifier, and data payload. The payload should be any data type (e.g., the set of types supported by the WebKit DataCue).
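A non-normative sketch of how an application can approximate out-of-band timed data cues today, pending a richer DataCue: a hidden metadata track carrying VTTCues with a JSON string payload. The track label, cue times, scheme URI, and payload shape are illustrative assumptions.

const video = document.querySelector('video');
const track = video.addTextTrack('metadata', 'mpd-events');
track.mode = 'hidden';

const cue = new VTTCue(30, 40, JSON.stringify({
    schemeIdUri: 'urn:example:ad-break', // illustrative scheme
    value: '1',
    data: { adId: 'example-123' },
}));
cue.onenter = () => {
    const payload = JSON.parse(cue.text);
    // e.g., switch playback to inserted content and update the UI
};
cue.onexit = () => {
    // e.g., resume main programme presentation
};
track.addCue(cue);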
For those events that the application has subscribed to receive, the API should:
The API should provide guarantees that no media timed event cues can be missed during linear playback of the media.
We recommend updating [ INBANDTRACKS ] to describe handling of in-band media timed events supported on the web platform, possibly following a registry approach with one specification per media format that describes the details of how media timed events are carried in that format.
We recommend that browser engines support MPEG-DASH emsg in-band events and MPD out-of-band events, as part of their support for the MPEG Common Media Application Format (CMAF) [ MPEGCMAF ].
To support cues with unknown end time, where the cue is active from its start time to the end of the media stream, we recommend that the TextTrackCue interface be modified to allow the cue duration to be unbounded.
We recommend that the API allows media timed event information to be updated, such as an event's position on the media timeline, and its data payload. Where the media timed event is updated by the user agent, such as for in-band events, we recommend that the API allows the web application to be notified of any changes.
In order to achieve greater synchronization accuracy between media playback and web content rendered by an application, the time marches on steps in [ HTML ] should be modified to allow delivery of cue onenter and onexit DOM events within 20 milliseconds of their positions on the media timeline.
Additionally, to allow such synchronization to happen at frame boundaries, we recommend introducing a mechanism that would allow a web application to accurately predict, using the user's wall clock, when the next frame will be rendered (e.g., as done in the Web Audio API ).
Thanks to François Daoust, Charles Lo, Nigel Megitt, Jon Piesing, Rob Smith, and Mark Vickers for their contributions and feedback on this document.