Copyright © 2019 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and permissive document license rules apply.
This document collects use cases and requirements for improved support for timed events related to audio or video media on the web, where synchronization to a playing audio or video media stream is needed, and makes recommendations for new or changed web APIs to realize these requirements. The goal is to extend the existing support in HTML for text track cue events to add support for dynamic content replacement cues and generic metadata events that drive synchronized interactive media experiences, and improve the timing accuracy of rendering of web content intended to be synchronized with audio or video media playback.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.
This document was published by the Media & Entertainment Interest Group as an Interest Group Note.
GitHub Issues are preferred for discussion of this specification.
Publication as an Interest Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
The disclosure obligations of the Participants of this group are described in the charter.
This document is governed by the 1 February 2018 W3C Process Document.
There is a need in the media industry for an API to support metadata events synchronized to audio or video media, specifically for both out-of-band event streams and in-band discrete events (for example, MPD and emsg events in MPEG-DASH). These media timed events can be used to support use cases such as dynamic content replacement, ad insertion, or presentation of supplemental content alongside the audio or video, or more generally, making changes to a web page, or executing application code triggered from JavaScript events, at specific points on the media timeline of an audio or video media stream.
The following terms are used in this document:
The following terms are defined in [HTML]:
Media timed events carry metadata that is related to points in time, or regions of time on the media timeline, which can be used to trigger retrieval and/or rendering of web resources synchronized with media playback. Such resources can be used to enhance the user experience in the context of media that is being rendered. Some examples include display of social media feeds corresponding to a live video stream such as a sporting event, banner advertisements for sponsored content, accessibility-related assets such as large print rendering of captions, and display of track titles or images alongside an audio stream.
The following sections describe a few use cases in more detail.
A media content provider wants to allow insertion of content, such as personalised video, local news, or advertisements, into a video media stream that contains the main program content. To achieve this, media timed events are used to describe the points on the media timeline, known as splice points, where switching playback to inserted content is possible.
The Society of Cable Telecommunications Engineers (SCTE) specification "Digital Program Insertion Cueing for Cable" [SCTE35] defines a data cue format for describing such insertion points. Use of these cues in MPEG-DASH and HLS streams is described in [SCTE35], sections 12.1 and 12.2.
A media content provider wants to provide visual information alongside an audio stream, such as an image of the artist and title of the currently playing track, to give users live information about the content they are listening to.
HLS timed metadata [HLS-TIMED-METADATA] uses in-band ID3 metadata to carry the image content. RadioVIS in DVB ([DVB-DASH], section 9.1.7) defines in-band event messages that contain image URLs and text messages to be displayed, with information about when the content should be displayed in relation to the media timeline.
A media streaming server uses media timed events to send control messages to a media client library, such as dash.js.
Typically, segmented streaming protocols such as HLS and MPEG-DASH make use of a manifest document that informs the client of the available encodings of a media stream, e.g., the Media Presentation Description (MPD) document in [MPEGDASH]. Should any of the content in the manifest document need to change, the client should refresh it by requesting an updated copy from the server. Section 5.10.4 of [MPEGDASH] describes an MPEG-DASH specific event that is used to notify a client application. An in-band emsg event is used as an alternative to setting a cache duration in the response to the HTTP request for the manifest, so the client can refresh the MPD when it actually changes, as opposed to waiting for a cache duration expiry period to elapse. This also has the benefit of reducing the load on HTTP servers caused by frequent requests.
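As an illustration, the sketch below shows how a web application using dash.js might observe DASH events. dash.js dispatches MPD and in-band emsg events to listeners registered with the event's scheme id URI; the scheme URN used here is a placeholder, and the payload handling is an assumption for illustration only.

// Sketch: subscribing to DASH events via dash.js.
// 'urn:example:event:2023' is a placeholder scheme id URI.
const video = document.querySelector('video');
const player = dashjs.MediaPlayer().create();
player.initialize(video, 'https://example.com/stream.mpd', true);
// dash.js delivers DASH events to listeners registered by schemeIdUri.
player.on('urn:example:event:2023', (e) => {
  // e carries the event payload and timing information.
  console.log('DASH event received', e);
});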
Reference: M&E IG call 1 Feb 2018: Minutes, [DASH-EVENTING].
See also this issue against the [WEB-MEDIA-GUIDELINES]. TODO: Add detail here.
A user records footage with metadata, including geolocation, on a mobile video device, e.g., drone or dashcam, to share on the web alongside a map, e.g., OpenStreetMap.
[WEBVMT] is an open format for metadata cues, synchronized with a timed media file, that can be used to drive an online map rendered in a separate HTML element alongside the media element on the web page. The media playhead position controls presentation and animation of the map, e.g., pan and zoom, and allows annotations to be added and removed, e.g., markers, at specified times during media playback. Control can also be overridden by the user with the usual interactive features of the map at any time, e.g., zoom. Concrete examples are provided by the tech demos at the WebVMT website.
Reference: M&E IG TF call 17 Sept 2018: Minutes.
A video image analysis system processes a media stream to detect and recognize objects shown in the video. The system generates metadata describing these objects, including timestamps that describe when the objects are visible, together with position information (e.g., bounding boxes). A web application then uses this timed metadata to overlay labels and annotations on the video using HTML and CSS.
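One way to implement this today is to carry the analysis output as JSON in the text of VTTCue objects on a metadata text track, and to position overlay elements from a cuechange handler. This is a minimal sketch; the cue payload format and the element ids are assumptions for illustration.

// Sketch: overlaying object labels from timed metadata cues.
const video = document.querySelector('video');
const overlay = document.getElementById('overlay'); // positioned over the video
const track = video.addTextTrack('metadata', 'Object annotations');
track.mode = 'hidden'; // receive cue events without native rendering
// Example cue: label and bounding box (fractions of video size), 2s to 4s.
const cue = new VTTCue(2, 4, JSON.stringify({
  label: 'bicycle', box: { x: 0.1, y: 0.2, w: 0.3, h: 0.25 }
}));
track.addCue(cue);
track.oncuechange = () => {
  overlay.textContent = '';
  for (const active of track.activeCues) {
    const { label, box } = JSON.parse(active.text);
    const el = document.createElement('div');
    el.textContent = label;
    Object.assign(el.style, {
      position: 'absolute',
      left: `${box.x * 100}%`, top: `${box.y * 100}%`,
      width: `${box.w * 100}%`, height: `${box.h * 100}%`,
    });
    overlay.appendChild(el);
  }
};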
During a live media presentation, dynamic and unpredictable events may occur which cause temporary suspension of the media presentation. During that suspension interval, auxiliary content such as the presentation of UI controls and media files, may be unavailable. Depending on the specific user engagement (or not) with the UI controls and the time at which any such engagement occurs, specific web resources may be rendered at defined times in a synchronized manner. For example, a multimedia A/V clip along with subtitles corresponding to an advertisement, and which were previously downloaded and cached by the UA, are played out.
This section describes gaps in existing web platform capabilities needed to support the use cases and requirements described in this document. Where applicable, this section also describes how existing web platform features can be used as workarounds, and any associated limitations.
The DataCue API has been previously discussed as a means to deliver in-band event data to web applications, but this is not implemented in all of the main browser engines. It is included in the 18 October 2018 HTML 5.3 draft [HTML53-20181018], but is not included in [HTML]. See discussion here and notes on implementation status here.
WebKit supports a DataCue interface that extends HTML5 DataCue with two attributes to support non-text metadata, type and value:
interface DataCue : TextTrackCue {
attribute ArrayBuffer data; // Always empty
// Proposed extensions.
attribute any value;
readonly attribute DOMString type;
};
type is a string identifying the type of metadata:

WebKit DataCue metadata types:

| type | Metadata |
|---|---|
| "com.apple.quicktime.udta" | QuickTime User Data |
| "com.apple.quicktime.mdta" | QuickTime Metadata |
| "com.apple.itunes" | iTunes metadata |
| "org.mp4ra" | MPEG-4 metadata |
| "org.id3" | ID3 metadata |
and value is an object with the metadata item key, data, and optionally a locale:
value = {
key: String
data: String | Number | Array | ArrayBuffer | Object
locale: String
}
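For example, a page running in WebKit might read these attributes from a cuechange handler on a metadata text track. This is a sketch assuming the WebKit-proposed extensions above; other engines expose only the standard data attribute.

// Sketch: reading WebKit's extended DataCue attributes (non-standard).
const video = document.querySelector('video');
video.textTracks.addEventListener('addtrack', (e) => {
  const track = e.track;
  if (track.kind !== 'metadata') return;
  track.mode = 'hidden'; // enable cue delivery without rendering
  track.addEventListener('cuechange', () => {
    for (const cue of track.activeCues) {
      if (cue.type === 'org.id3') {
        // value holds { key, data, locale } per the WebKit extension.
        console.log('ID3 item', cue.value.key, cue.value.data);
      }
    }
  });
});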
Neither [MSE-BYTE-STREAM-FORMAT-ISOBMFF] nor [INBANDTRACKS] describe handling of emsg boxes.
On resource constrained devices such as smart TVs and streaming sticks, parsing media segments to extract event information leads to a significant performance penalty, which can have an impact on UI rendering updates if this is done on the UI thread. There can also be an impact on the battery life of mobile devices. Given that the media segments will be parsed anyway by the user agent, parsing in JavaScript is an expensive overhead that could be avoided.
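To illustrate the kind of work involved, the sketch below scans a fetched ISOBMFF segment for emsg boxes (version 0 layout only) using a DataView. A production parser must handle more cases (64-bit box sizes, version 1 events, malformed input); this simplified sketch illustrates why per-segment parsing in JavaScript is a costly duplication of work the user agent already does.

// Sketch: finding version 0 emsg boxes in an ISOBMFF segment (simplified).
function findEmsgBoxes(buffer) {
  const view = new DataView(buffer);
  const events = [];
  let offset = 0;
  while (offset + 8 <= view.byteLength) {
    const size = view.getUint32(offset);
    const type = String.fromCharCode(
      view.getUint8(offset + 4), view.getUint8(offset + 5),
      view.getUint8(offset + 6), view.getUint8(offset + 7));
    if (type === 'emsg' && view.getUint8(offset + 8) === 0) { // version 0
      let pos = offset + 12; // skip size, type, version, flags
      const readString = () => { // null-terminated UTF-8 (ASCII assumed here)
        let s = '';
        while (view.getUint8(pos) !== 0) s += String.fromCharCode(view.getUint8(pos++));
        pos++; // skip null terminator
        return s;
      };
      const schemeIdUri = readString();
      const value = readString();
      events.push({
        schemeIdUri, value,
        timescale: view.getUint32(pos),
        presentationTimeDelta: view.getUint32(pos + 4),
        eventDuration: view.getUint32(pos + 8),
        id: view.getUint32(pos + 12),
        messageData: new Uint8Array(buffer, pos + 16, offset + size - (pos + 16)),
      });
    }
    offset += size || view.byteLength; // size 0 means box extends to end
  }
  return events;
}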
[HBBTV] section 9.3.2 describes a mapping between the emsg fields described above and the TextTrack and DataCue APIs. A TextTrack instance is created for each event stream signalled in the MPD document (as identified by the schemeIdUri and value), and the inBandMetadataTrackDispatchType TextTrack attribute contains the scheme_id_uri and value values. Because HbbTV devices include a native DASH client, parsing of the MPD document and creation of the TextTracks is done by the user agent, rather than by application JavaScript code.
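Under this mapping, an application can locate the event track via the standard inBandMetadataTrackDispatchType attribute and attach a cue handler. A minimal sketch follows; the scheme_id_uri/value string is illustrative, as the exact dispatch type format depends on the stream.

// Sketch: locating an in-band event TextTrack by its dispatch type.
// 'urn:example:event:2023 1' is an illustrative scheme_id_uri/value pair.
const video = document.querySelector('video');
for (const track of video.textTracks) {
  if (track.kind === 'metadata' &&
      track.inBandMetadataTrackDispatchType === 'urn:example:event:2023 1') {
    track.mode = 'hidden';
    track.oncuechange = () => {
      for (const cue of track.activeCues) {
        console.log('event cue', cue.startTime, cue.endTime);
      }
    };
  }
}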
Subtitles for video are typically authored against video at a nominal frame rate, e.g., 25 frames per second, which corresponds to 40 milliseconds per frame. The actual video frame rate may be adjusted dynamically according to the video encoding, but the subtitle timing must remain the same ([EBU-TT-D], Annex E).
This places a requirement on user agents for timely delivery of TextTrackCue and VTTCue events, so that application code can respond and render the cues. For subtitle rendering to be possible with frame accuracy, we recommend that cue events are fired within 20 milliseconds of their position on the media timeline.
The time marches on steps in [HTML] control the firing of cue events during media playback. Time marches on requires a timeupdate event to be fired at the HTMLMediaElement between 15 and 250 milliseconds since the last such event, and this requirement therefore specifies the rate at which time marches on is executed during playback. In practice it has been found that the timing varies between browser implementations.
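This variability can be observed directly by measuring the interval between successive timeupdate events, as in this short sketch:

// Sketch: measuring the interval between successive timeupdate events.
const video = document.querySelector('video');
let last = performance.now();
video.addEventListener('timeupdate', () => {
  const now = performance.now();
  console.log(`timeupdate after ${(now - last).toFixed(1)} ms`);
  last = now;
});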
There are two methods a web application can use to handle text track cues; both are illustrated in the sketch following this list:
1. Add an oncuechange handler function to the TextTrack and inspect the track's activeCues list. Because activeCues contains the list of cues that are active at the time that time marches on is run, it is possible for cues to be missed by a web application using this method, where cues appear on the media timeline between successive executions of time marches on during media playback. This may occur if the cues have short duration, or if execution is delayed by a long-running event handler function.
2. Add onenter and onexit handler functions to each cue. The time marches on steps guarantee that enter and exit events will be fired for all cues, including those that appear on the media timeline between successive executions of time marches on during media playback. This method is only possible for cues created by the web application, i.e., VTTCue objects, and not cue objects created by the user agent.
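The following sketch shows both methods applied to an application-created cue:

// Sketch: the two cue handling methods for an application-created cue.
const video = document.querySelector('video');
const track = video.addTextTrack('metadata', 'Timed events');
track.mode = 'hidden';
const cue = new VTTCue(10, 10.05, 'short-lived event'); // 50 ms duration
track.addCue(cue);
// Method 1: track-level handler. Short cues can be missed if they start
// and end between successive runs of time marches on.
track.oncuechange = () => {
  console.log('active cues:', track.activeCues.length);
};
// Method 2: per-cue handlers. The enter and exit events are guaranteed to
// fire, but only for cues the application itself creates.
cue.onenter = () => console.log('cue entered at', video.currentTime);
cue.onexit = () => console.log('cue exited at', video.currentTime);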
An issue with handling of text track and data cue events in HbbTV was reported in 2013. HbbTV requires the user agent to implement an MPEG-DASH client, and so applications must use the first of the above methods for cue handling, which means that applications can miss cues as described above.
Another approach to synchronizing rendering of web content to media playback is to use the timeupdate event, and for the web application to manage the media timed events to be triggered, rather than use the text track cue APIs in [HTML]. This has the same synchronization limitations as described above, because the 250 millisecond update rate specified in time marches on is too infrequent to ensure that the web content can be updated smoothly (for example, rendering a playhead position marker in an audio waveform visualization), or to occur at specific points on the media timeline.
In addition, the timing variability of timeupdate events between browser engines makes them unreliable for the purpose of synchronized rendering of web content.
Synchronization accuracy can be improved by polling the media element's currentTime property from a setInterval callback, or by using requestAnimationFrame for greater accuracy. However, the use of setInterval or requestAnimationFrame for media synchronized rendering is CPU intensive.
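For example, an application can poll currentTime on each animation frame to drive a playhead marker; a minimal sketch (the marker element is assumed):

// Sketch: polling currentTime via requestAnimationFrame to drive rendering.
const video = document.querySelector('video');
const marker = document.getElementById('playhead'); // e.g., over a waveform
function update() {
  if (!video.paused && !video.ended) {
    const fraction = video.currentTime / video.duration;
    marker.style.left = `${fraction * 100}%`;
  }
  requestAnimationFrame(update); // runs at display refresh rate: CPU cost
}
requestAnimationFrame(update);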
This section describes recommendations from the Media & Entertainment Interest Group for the development of a generic media timed event API, and associated synchronization considerations.
The API should allow web applications to subscribe to receive specific event types. For example, to support DASH emsg and MPD events, the API should allow subscription by id and (optional) value. This is to make receiving events opt-in from the application point of view. The user agent should deliver only those events to a web application for which the application has subscribed. The API should also allow web applications to unsubscribe from specific event streams by event type.
To be able to handle out-of-band events, the API must allow web applications to create events to be added to the media timeline, to be triggered by the user agent. The API should allow the web application to provide all necessary parameters to define the event, including start and end times, event type, and data payload. The payload should be any data type (e.g., the set of types supported by the WebKit DataCue). For DASH MPD events, the event type is defined by the id and (optional) value fields.
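No such API exists today; the sketch below is purely hypothetical, showing one shape the recommended subscription and event-creation API could take. All names (subscribeMediaTimedEvents, addMediaTimedEvent, unsubscribeMediaTimedEvents) are invented for illustration.

// Hypothetical sketch only: no user agent implements this API.
const video = document.querySelector('video');
// Subscribe to an event stream by scheme id and (optional) value.
video.subscribeMediaTimedEvents('urn:example:event:2023', '1', (event) => {
  console.log('event', event.startTime, event.endTime, event.data);
});
// Create an out-of-band event on the media timeline, to be triggered
// by the user agent at the given start time.
video.addMediaTimedEvent({
  startTime: 30.0,
  endTime: 35.0,
  type: 'urn:example:app-event',
  data: { action: 'show-banner' },
});
// Unsubscribe by event type.
video.unsubscribeMediaTimedEvents('urn:example:event:2023');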
For those events that the application has subscribed to receive, the API should:
The API must provide guarantees that no events can be missed during linear playback of the media.
We recommend updating [INBANDTRACKS] to describe handling of in-band media timed events supported on the web platform, following a registry approach with one specification per media format that describes the event details for that format.
We recommend that browser engines support MPEG-DASH emsg in-band events and MPD out-of-band events, as part of their support for the MPEG Common Media Application Format (CMAF) [MPEGCMAF].
In order to achieve greater synchronization accuracy between media playback and web content rendered by an application, the time marches on steps in [HTML] should be modified to allow delivery of media timed event start time and end time notifications within 20 milliseconds of their positions on the media timeline.
Thanks to Charles Lo, Nigel Megitt, Jon Piesing, and Rob Smith for their contributions to this document.