Copyright © 2024 World Wide Web Consortium. W3C® liability, trademark and permissive document license rules apply.
This document defines a set of ECMAScript APIs in WebIDL to extend the [mediacapture-streams] specification.
This section describes the status of this document at the time of its publication. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.
This is an unofficial proposal.
This document was published by the Web Real-Time Communications Working Group as an Editor's Draft.
Publication as an Editor's Draft does not imply endorsement by W3C and its Members.
This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
This document is governed by the 03 November 2023 W3C Process Document.
This document contains proposed extensions and modifications to the [mediacapture-streams] specification.
New features and modifications to existing features proposed here may be considered for addition into the main specification post Recommendation. Deciding factors will include maturity of the extension or modification, consensus on adding it, and implementation experience.
A concrete long-term goal is reducing the fingerprinting surface of enumerateDevices() by deprecating exposure of the device label in its results. This requires relieving applications of the burden of building user interfaces to select cameras and microphones in-content, by offering this in user agents as part of getUserMedia() instead.
Miscellaneous other smaller features are under consideration as well, such as constraints to control multi-channel audio beyond stereo.
This document uses the definitions MediaDevices, MediaStreamTrack, MediaStreamConstraints, ConstrainablePattern, MediaTrackSupportedConstraints, MediaTrackCapabilities, MediaTrackConstraintSet, MediaTrackSettings and ConstrainBoolean from [mediacapture-streams].
The terms permission state, request permission to use, and prompt the user to choose are defined in [permissions].
Performance.now() is defined in [hr-time].
As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.
The key words MAY, MUST, MUST NOT, and SHOULD in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.
The existing enumerateDevices() function exposes camera and microphone labels to let applications build in-content user interfaces for camera and microphone selection. Applications have had to do this because getUserMedia() did not offer a web compatible in-agent device picker. This specification aims to rectify that.
Due to the significant fingerprinting vector caused by device labels, and the well-established nature of the existing APIs, the scope of this particular effort is limited to removing label, leaving the overall constraints-based model intact. This helps ensure a migration path more viable than to a less-powerful API. This specification augments the existing getUserMedia() function instead of introducing a new less-powerful API to compete with it, for that reason as well.
This specification introduces slightly altered semantics to the getUserMedia() function called "user-chooses" that guarantee a picker will be shown to the user in cases where the user agent would otherwise choose for the user (that is: when application constraints do not narrow down the choices to a single device). This is orthogonal to permission, and offers a better and more consistent user experience across applications and user agents.
Unfortunately, since the "user-chooses" semantics may produce user agent prompts at different times and in different situations compared to the old semantics, they are somewhat incompatible with expectations in some existing web applications that tend to call getUserMedia() repeatedly and lazily instead of using e.g. stream.clone().
User agents are encouraged to provide the new semantics as opt-in initially for web compatibility. User agents MUST deprecate (remove) label from MediaDeviceInfo over time, though specific migration strategies are left to user agents. User agents SHOULD migrate to offering the new semantics by default (opt-out) over time.
Since the constraints-based model remains intact, web compatibility problems are expected to be limited to:
WebIDL
partial interface MediaDevices {
  readonly attribute GetUserMediaSemantics defaultSemantics;
};

defaultSemantics of type GetUserMediaSemantics, readonly
  The default semantics of getUserMedia() in this user agent.
User agents SHOULD default to "browser-chooses" for backwards compatibility, until a transition plan has been enacted where a majority of user agents collectively switch their defaults to "user-chooses" for improved user privacy, and usage metrics suggest this transition is feasible without major breakage.
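The following non-normative sketch shows how a page might consult the proposed defaultSemantics attribute to decide whether to build its own in-content picker; the element id is illustrative.
<select id="cameraPickerFallback"></select>
<script>
// If the user agent's default semantics already guarantee an in-agent picker
// ("user-chooses"), the page can skip its in-content fallback selector.
if (navigator.mediaDevices.defaultSemantics === "user-chooses") {
  cameraPickerFallback.hidden = true;
}
</script>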
WebIDL
partial dictionary MediaStreamConstraints {
  GetUserMediaSemantics semantics;
};

MediaStreamConstraints Members
semantics of type GetUserMediaSemantics
  In cases where the specified constraints do not narrow multiple choices between devices down to one per kind, specifies how the final determination of which devices to pick from the remaining choices MUST be made. If not specified, then the defaultSemantics are used.
WebIDL
enum GetUserMediaSemantics {
  "browser-chooses",
  "user-chooses"
};

GetUserMediaSemantics Enumeration description
browser-chooses
  When application-specified constraints do not narrow multiple choices between devices down to one per kind, the user agent is allowed to make the final determination between the remaining choices.
user-chooses
  When application-specified constraints do not narrow multiple choices between devices down to one per kind, the user agent MUST prompt the user to choose between the remaining choices, even if the application already has permission to some or all of them.
When the getUserMedia() method is invoked, run the following steps before invoking the getUserMedia() algorithm:
1. Let mediaDevices be the object on which this method was invoked.
2. Let constraints be the method's first argument.
3. Let semanticsPresent be true if constraints.semantics exists, otherwise false.
4. Let semantics be constraints.semantics if it exists, or the value of mediaDevices.defaultSemantics otherwise.
Replace step 6.5.1. of the getUserMedia() algorithm in its entirety with the following two steps:
1. Let descriptor be a PermissionDescriptor with its name member set to the permission name associated with kind (e.g. "camera" for "video", "microphone" for "audio").
2. If the number of unique devices sourcing tracks of media type kind in candidateSet is greater than 1 and semantics is "user-chooses", then prompt the user to choose a device with descriptor, resulting in provided media. Otherwise, request permission to use a device with descriptor, while considering all devices being attached to a live and same-permission MediaStreamTrack in the current browsing context to mean having permission status "granted", resulting in provided media.
Same-permission in this context means a MediaStreamTrack that required the same level of permission to obtain as what is being requested.
When asking the user’s permission, the user agent MUST disclose whether permission will be granted only to the device chosen, or to all devices of that kind.
Let track be the provided media, which MUST be precisely one track of type kind from finalSet. If semantics is "browser-chooses" then the decision of which track to choose from finalSet is up to the User Agent, which MAY use the value of the computed "fitness distance" from the SelectSettings algorithm, the value of semanticsPresent, or any other internally-available information about the devices, as inputs to its decision. If semantics is "user-chooses", and the application has not narrowed down the choices to one, then the user agent MUST ask the user to make the final selection.
Once selected, the source of the MediaStreamTrack MUST NOT change.
User Agents are encouraged to default to or present a default choice based primarily on fitness distance, and secondarily on the user's primary or system default device for kind (when possible). User Agents MAY allow users to use any media source, including pre-recorded media files.
This example shows a setup with a start button and a camera selector using the new semantics (microphone is not shown for brevity but is equivalent).
<button id="start">Start</button>
<button id="chosenCamera" disabled>Camera: none</button>
<script>
let cameraTrack = null;
start.onclick = async () => {
try {
const stream = await navigator.mediaDevices.getUserMedia({
video: {deviceId: localStorage.cameraId}
});
setCameraTrack(stream.getVideoTracks()[0]);
} catch (err) {
console.error(err);
}
}
chosenCamera.onclick = async () => {
try {
const stream = await navigator.mediaDevices.getUserMedia({
video: true,
semantics: "user-chooses"
});
setCameraTrack(stream.getVideoTracks()[0]);
} catch (err) {
console.error(err);
}
}
function setCameraTrack(track) {
cameraTrack = track;
const {deviceId, label} = track.getSettings();
localStorage.cameraId = deviceId;
chosenCamera.innerText = `Camera: ${label}`;
chosenCamera.disabled = false;
}
</script>
A MediaStreamTrack is a transferable object. This allows manipulating real-time media outside the context it was requested or created in, for instance in workers or third-party iframes.
To preserve the existing privacy and security infrastructure, in particular for capture tracks, the track source lifetime management remains tied to the context that created it. The transfer algorithm MUST ensure the following behaviors:
The context named originalContext that created a track named originalTrack remains in control of the originalTrack source, named trackSource, even when originalTrack is transferred into transferredTrack.
In particular, originalContext remains the proxy to privacy indicators of trackSource. transferredTrack or any of its clones are considered as tracks using trackSource as if they were tracks created in and controlled by originalContext.
When originalContext goes away, trackSource gets ended, thus transferredTrack gets ended.
When originalContext would have muted/unmuted originalTrack, transferredTrack gets muted/unmuted.
If transferredTrack is cloned in transferredTrackClone, transferredTrackClone is tied to trackSource. It is not tied to originalTrack in any way.
If transferredTrack is transferred into transferredAgainTrack, transferredAgainTrack is tied to trackSource. It is not tied to transferredTrack or originalTrack in any way.
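The following non-normative sketch illustrates these rules by transferring a camera track to a worker: the worker consumes the track while its source remains under the control of the transferring context. MediaStreamTrackProcessor is assumed to be available in the worker.
// main.js: capture a camera track and transfer it to a worker.
const stream = await navigator.mediaDevices.getUserMedia({video: true});
const [track] = stream.getVideoTracks();
const worker = new Worker("worker.js");
worker.postMessage({track}, [track]); // track is now detached in this context
// Source lifetime and privacy indicators remain tied to this context.

// worker.js: the received track is still sourced from the original context.
self.onmessage = ({data: {track}}) => {
  const processor = new MediaStreamTrackProcessor({track});
  // ... consume processor.readable ...
};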
The WebIDL changes to make the track transferable are the following:
WebIDL
[Exposed=(Window,Worker), Transferable]
partial interface MediaStreamTrack {
};
At creation of a MediaStreamTrack object, called track, run the following steps:
1. Initialize track.[[IsDetached]] to false.

The MediaStreamTrack transfer steps, given value and dataHolder, are:
1. If value.[[IsDetached]] is true, throw a "DataCloneError" DOMException.
2. Set dataHolder.[[id]] to value.id.
3. Set dataHolder.[[kind]] to value.kind.
4. Set dataHolder.[[label]] to value.label.
5. Set dataHolder.[[readyState]] to value.readyState.
6. Set dataHolder.[[enabled]] to value.enabled.
7. Set dataHolder.[[muted]] to value.muted.
8. Set dataHolder.[[source]] to value's underlying source.
9. Set dataHolder.[[constraints]] to value's active constraints.
10. Set dataHolder.[[contentHint]] to value's application-set content hint.
11. Set value.[[IsDetached]] to true.
12. Set value.[[ReadyState]] to "ended" (without stopping the underlying source or firing an ended event).
The MediaStreamTrack transfer-receiving steps, given dataHolder and track, are:
1. Initialize track.id to dataHolder.[[id]].
2. Initialize track.kind to dataHolder.[[kind]].
3. Initialize track.label to dataHolder.[[label]].
4. Initialize track.readyState to dataHolder.[[readyState]].
5. Initialize track.enabled to dataHolder.[[enabled]].
6. Initialize track.muted to dataHolder.[[muted]].
7. Set track's application-set content hint to dataHolder.[[contentHint]].
8. Initialize the underlying source of track to dataHolder.[[source]].
9. Set track's constraints to dataHolder.[[constraints]].
The underlying source is supposed to be kept alive between the transfer and transfer-receiving steps, or as long as the data holder is alive. In a sense, between these steps, the data holder is attached to the underlying source as if it was a track.
On microphone audio tracks, frame counters let the application determine the ratio of audio that was delivered (one quality indicator), while the latency metrics measure the input delay from capture to application.
On camera and screenshare video tracks, frame counters allow the application to tell what the frame rate is, which may be lower than the target frameRate. For example, if the track is sourced from a camera then the production of frames could be slowed down if it's dark or frames could be dropped if the system is CPU starved. This could impact the total number of frames produced by the source and impact how many frames are delivered, discarded or dropped for other reasons.
WebIDL
partial interface MediaStreamTrack {
  [SameObject] readonly attribute (MediaStreamTrackAudioStats or MediaStreamTrackVideoStats)? stats;
};
Let the MediaStreamTrack have a [[Stats]] internal slot, initialized to null unless otherwise specified below.

If the track is of kind "audio", run the following steps:
1. If the MediaStreamTrack is sourced from getUserMedia(), initialize [[Stats]] to a new instance of MediaStreamTrackAudioStats set up to expose audio stats for this MediaStreamTrack.

If the track is of kind "video", run the following steps:
1. If the MediaStreamTrack is sourced from getUserMedia() or getDisplayMedia(), initialize [[Stats]] to a new instance of MediaStreamTrackVideoStats set up to expose video stats for this MediaStreamTrack.

stats of type (MediaStreamTrackAudioStats or MediaStreamTrackVideoStats), readonly
  When this getter is called, the user agent MUST run the following steps:
  1. Let track be the MediaStreamTrack that this getter is called on.
  2. Return track.[[Stats]].
WebIDL
[Exposed=Window]
interface MediaStreamTrackAudioStats {
  readonly attribute unsigned long long deliveredFrames;
  readonly attribute DOMHighResTimeStamp deliveredFramesDuration;
  readonly attribute unsigned long long totalFrames;
  readonly attribute DOMHighResTimeStamp totalFramesDuration;
  readonly attribute DOMHighResTimeStamp latency;
  readonly attribute DOMHighResTimeStamp averageLatency;
  readonly attribute DOMHighResTimeStamp minimumLatency;
  readonly attribute DOMHighResTimeStamp maximumLatency;
  undefined resetLatency();
  [Default] object toJSON();
};
The following metrics lack Working Group consensus: deliveredFrames, deliveredFramesDuration, totalFrames and totalFramesDuration. See Issue #129.
A MediaStreamTrackAudioStats object exposes frame counters for the MediaStreamTrack that created it. For this track, the user agent is required to count each audio frame from its source as follows:
A frame is considered a delivered audio frame if it either was delivered to a sink or would have been delivered to a sink, if one was connected.
The delivered audio frames duration is the total duration of all delivered audio frames. This measurement is incremented at the same time as delivered audio frames and is measured in milliseconds.
An audio frame that is discarded because it cannot be delivered on time, or it cannot be delivered for any other reason, is considered dropped.
The dropped audio frames duration is the total duration of all dropped audio frames. This measurement is incremented at the same time as dropped audio frames and is measured in milliseconds.
If the track is unmuted and enabled, the counters increase as audio is produced by the capture device. If no audio is flowing, such as if the track is muted or disabled, then the counters do not increase.
Input latency is the time, in milliseconds, between the point in time an audio input device has acquired a signal and the time it is available for consumption, which may include buffering by the user agent.
The latest input latency is the latest available input latency as estimated between the track's input device and delivery to any of its sinks.
The user agent updates its estimates at sufficient frequency to allow monitoring. The latency is representative of the experienced delay, but is not necessarily an exact measurement of the last individual audio frame that was delivered.
A sink that consumes audio may add additional processing latency not included in this measurement, such as playout delay or encode time.
Every time the latest input latency measurement is updated, the user agent also updates its average input latency, minimum input latency and maximum input latency which are the average, minimum and maximum observed measurements since the last latency reset time.
Let the MediaStreamTrackAudioStats have internal slots [[DeliveredFrames]], [[DeliveredFramesDuration]], [[DroppedFrames]], [[DroppedFramesDuration]], [[Latency]], [[AverageLatency]], [[MinimumLatency]] and [[MaximumLatency]], initialized to 0.
Let the MediaStreamTrackAudioStats also have internal slots [[LastTask]] and [[LastExposureTime]], initialized to undefined.
The expose audio frame counters steps are the following:
1. Let task be the current task.
2. If [[LastTask]] is equal to task, abort these steps.
3. Set [[LastTask]] to task.
4. Set [[DeliveredFrames]] to delivered audio frames, set [[DeliveredFramesDuration]] to delivered audio frames duration, set [[DroppedFrames]] to dropped audio frames, set [[DroppedFramesDuration]] to dropped audio frames duration, set [[Latency]] to the latest input latency, set [[AverageLatency]] to the average input latency, set [[MinimumLatency]] to the minimum input latency and set [[MaximumLatency]] to the maximum input latency.
5. Set [[LastExposureTime]] to reflect the time that these metrics were exposed.

Only updating these counters once per task preserves the run-to-completion semantics defined in [API-DESIGN-PRINCIPLES].
deliveredFrames of type unsigned long long, readonly
  Upon getting, run the expose audio frame counters steps and return [[DeliveredFrames]].
deliveredFramesDuration of type DOMHighResTimeStamp, readonly
  Upon getting, run the expose audio frame counters steps and return [[DeliveredFramesDuration]].
totalFrames of type unsigned long long, readonly
  Upon getting, run the expose audio frame counters steps and return the sum of [[DeliveredFrames]] and [[DroppedFrames]].
totalFramesDuration of type DOMHighResTimeStamp, readonly
  Upon getting, run the expose audio frame counters steps and return the sum of [[DeliveredFramesDuration]] and [[DroppedFramesDuration]].
  Because audio capture devices produce audio in real-time, audio frames may be dropped if not processed in a timely manner. The ratio of audio duration that was delivered, i.e. not dropped, can be calculated as deliveredFramesDuration / totalFramesDuration.
latency of type DOMHighResTimeStamp, readonly
  Upon getting, run the expose audio frame counters steps and return [[Latency]].
averageLatency of type DOMHighResTimeStamp, readonly
  Upon getting, run the expose audio frame counters steps and return [[AverageLatency]].
minimumLatency of type DOMHighResTimeStamp, readonly
  Upon getting, run the expose audio frame counters steps and return [[MinimumLatency]].
maximumLatency of type DOMHighResTimeStamp, readonly
  Upon getting, run the expose audio frame counters steps and return [[MaximumLatency]].
resetLatency
  When called, run the following steps:
  1. Run the expose audio frame counters steps.
  2. Set [[AverageLatency]], [[MinimumLatency]] and [[MaximumLatency]] to [[Latency]].
  3. Set the latency reset time to [[LastExposureTime]].
toJSON
  When called, run [WEBIDL]'s default toJSON steps.
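The following non-normative sketch polls the proposed stats attribute on a microphone track to compute the delivered-audio ratio described above and to watch the average input latency over one-second windows.
<script>
const stream = await navigator.mediaDevices.getUserMedia({audio: true});
const [micTrack] = stream.getAudioTracks();

setInterval(() => {
  const stats = micTrack.stats; // MediaStreamTrackAudioStats for getUserMedia() audio
  if (!stats) return;
  const deliveredRatio = stats.totalFramesDuration > 0 ?
      stats.deliveredFramesDuration / stats.totalFramesDuration : 1;
  console.log(`delivered audio: ${(deliveredRatio * 100).toFixed(1)}%, ` +
              `average latency: ${stats.averageLatency.toFixed(1)} ms`);
  stats.resetLatency(); // restart min/avg/max aggregation for the next window
}, 1000);
</script>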
WebIDL
[Exposed=Window]
interface MediaStreamTrackVideoStats {
  readonly attribute unsigned long long deliveredFrames;
  readonly attribute unsigned long long discardedFrames;
  readonly attribute unsigned long long totalFrames;
  [Default] object toJSON();
};
A MediaStreamTrackVideoStats object exposes frame counters for the MediaStreamTrack that created it. For this track, the user agent is required to count each video frame from its source as follows:
A frame is considered a delivered video frame if it either was delivered to a sink or would have been delivered to a sink, if one was connected. This is a subset of total video frames and it is incremented at the same time as total video frames.
A video frame is considered discarded if it was discarded in order to achieve the target frameRate. This is a subset of total video frames and it is incremented at the same time as total video frames.
The total video frames counter is the total number of frames that have been processed by this source, meaning it is known whether each frame was considered delivered, discarded or dropped for any other reason. The number of frames dropped for various unknown reasons can be calculated by subtracting delivered video frames and discarded video frames from total video frames.
If the track is unmuted and enabled and the source is backed by a camera, total frames is incremented by frames produced by the camera. If no frames are flowing, such as if the track is muted or disabled, then total frames does not increment.
Let the MediaStreamTrackVideoStats have internal slots [[DeliveredFrames]], [[DiscardedFrames]] and [[TotalFrames]], initialized to 0.
Let the MediaStreamTrackVideoStats also have an internal slot [[LastTask]] initialized to null.
The expose video frame counters steps are the following:
1. Let task be the current task.
2. If [[LastTask]] is equal to task, abort these steps.
3. Set [[LastTask]] to task.
4. Set [[DeliveredFrames]] to delivered video frames, set [[DiscardedFrames]] to discarded video frames and set [[TotalFrames]] to total video frames.

Only updating these counters once per task preserves the run-to-completion semantics defined in [API-DESIGN-PRINCIPLES].
deliveredFrames of type unsigned long long, readonly
  Upon getting, run the expose video frame counters steps and return [[DeliveredFrames]].
discardedFrames of type unsigned long long, readonly
  Upon getting, run the expose video frame counters steps and return [[DiscardedFrames]].
totalFrames of type unsigned long long, readonly
  Upon getting, run the expose video frame counters steps and return [[TotalFrames]].
toJSON
  When called, run [WEBIDL]'s default toJSON steps.
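The following non-normative sketch uses the proposed video stats to estimate how many frames were dropped for reasons other than frame-rate decimation, per the calculation described above.
<script>
const stream = await navigator.mediaDevices.getUserMedia({video: true});
const [cameraTrack] = stream.getVideoTracks();

setInterval(() => {
  const stats = cameraTrack.stats; // MediaStreamTrackVideoStats for camera tracks
  if (!stats) return;
  const dropped = stats.totalFrames - stats.deliveredFrames - stats.discardedFrames;
  console.log(`delivered=${stats.deliveredFrames} ` +
              `discarded=${stats.discardedFrames} dropped=${dropped}`);
}, 1000);
</script>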
WebIDL
partial dictionary MediaTrackSupportedConstraints {
  boolean powerEfficient = true;
};

MediaTrackSupportedConstraints Members
powerEfficient of type boolean, defaulting to true
WebIDL
partial dictionary MediaTrackCapabilities {
  sequence<boolean> powerEfficient;
};

MediaTrackCapabilities Members
powerEfficient of type sequence<boolean>
  The source may operate in different configurations. If all configurations have the same power efficiency impact, a single false is reported. Otherwise, the source reports a list with both true and false as possible values. See powerEfficient for additional details.
WebIDL
partial dictionary MediaTrackSettings {
  boolean powerEfficient;
};

MediaTrackSettings Members
powerEfficient of type boolean
The constrainable properties in this document are defined below.
Property Name: powerEfficient
Values: ConstrainBoolean
Notes: Cameras can often operate in different configurations. Configurations are typically selected based on constraints that are related to observable parameters like width or height. Configurations may have less directly observable characteristics: power consumption, low light sensitivity, fast autofocus... The powerEfficient constraint allows web applications to favor selection of configurations that consume less power. This may be useful for web applications that may use the camera for an extended amount of time, like video conference web applications. On the other hand, applications that may use the camera for a small amount of time may prefer to not use the powerEfficient constraint. This constraint is only applicable to camera sources. As a constraint, setting it to true instructs the user agent to prefer a configuration that it considers power efficient.
WebIDL
partial dictionary MediaTrackSupportedConstraints {
  boolean powerEfficientPixelFormat = true;
};

MediaTrackSupportedConstraints Members
powerEfficientPixelFormat of type boolean, defaulting to true
WebIDL
partial dictionary MediaTrackCapabilities {
  sequence<boolean> powerEfficientPixelFormat;
};

MediaTrackCapabilities Members
powerEfficientPixelFormat of type sequence<boolean>
  If the source only has power efficient pixel formats, a single true is reported. If the source only has power inefficient pixel formats, a single false is reported. If the script can control the feature, the source reports a list with both true and false as possible values. See powerEfficientPixelFormat for additional details.
WebIDL
partial dictionary MediaTrackSettings {
  boolean powerEfficientPixelFormat;
};

MediaTrackSettings Members
powerEfficientPixelFormat of type boolean
The constrainable properties in this document are defined below.
Property Name: powerEfficientPixelFormat
Values: ConstrainBoolean
Notes: Compressed pixel formats often need to be decoded, for instance for display purposes or when being encoded during a video call. The user agent SHOULD label compressed pixel formats that incur significant power penalty when decoded as power inefficient. The labeling is up to the user agent, but decoding MJPEG in software is an example of an expensive mode. Pixel formats that have not been labeled power inefficient by the user agent are for the purpose of this API considered power efficient. As a constraint, setting it to true allows filtering out inefficient pixel formats and setting it to false allows filtering out efficient pixel formats. As a setting, this reflects whether or not the current pixel format is considered power efficient by the user agent.
This section is non-normative.
Video media flowing inside media stream tracks consists of a sequence of video frames, where the frames are sampled from the media at instants spread out over time.
Each video frame must have a presentation timestamp which is relative to a source specific origin. A source of frames can define how this timestamp is set. A sink of frames can define how this timestamp is used.
The timestamp is present for sinks to be able to define an absolute presentation timeline of the frames relative to a clock reference, for example for playback.
Each frame may have an absolute capture timestamp representing the instant the frame capture process began, which is useful for example for delay measurements and synchronization. A source of frames can define how this timestamp is set, otherwise it is unset. A sink of frames can define how this timestamp is used if set.
Each frame may have an absolute receive timestamp representing the time at which the last packet used to produce this video frame was received in its entirety. The timestamp is useful for example for network jitter measurements. A source of frames can define how this timestamp is set, otherwise it is unset. A sink of frames can define how this timestamp is used if set.
Each frame may have a RTP timestamp representing the packet RTP timestamp used to produce this video frame. The timestamp is useful for example for frame identification and playback quality measurements. A source of frames can define how the timestamp is set, otherwise it is unset. A sink of frames can define how this timestamp is used if set. The packet RTP timestamp concept is defined in [RFC3550] Section 5.1.
The capture timestamp and receive timestamp are using the same clock and offset. The presentation timestamp and capture timestamp are using the same clock and have an offset which can be arbitrarily chosen by the user agent since it isn't directly observable by script.
VideoFrameMetadata

WebIDL
partial dictionary VideoFrameMetadata {
  DOMHighResTimeStamp captureTime;
  DOMHighResTimeStamp receiveTime;
  unsigned long rtpTimestamp;
};
captureTime of type DOMHighResTimeStamp
  The capture timestamp of the frame relative to Performance.timeOrigin. It corresponds to the capture timestamp of MediaStreamTrack video frames.
receiveTime of type DOMHighResTimeStamp
  The receive time of the corresponding encoded frame relative to Performance.timeOrigin. It corresponds to the receive timestamp of MediaStreamTrack video frames.
rtpTimestamp of type unsigned long
  The RTP timestamp of the corresponding encoded frame. It corresponds to the RTP timestamp of MediaStreamTrack video frames.
When a VideoFrame is created from this video media, its metadata members are set as follows:
- timestamp from the presentation timestamp minus offset.
- captureTime from the capture timestamp, if set.
- receiveTime from the receive timestamp, if set.
- rtpTimestamp from the RTP timestamp, if set.
When a VideoFrame is consumed as this video media, the frame timestamps are read as follows:
- the presentation timestamp from timestamp.
- the capture timestamp from captureTime, if present.
- the receive timestamp from receiveTime, if present.
- the RTP timestamp from rtpTimestamp, if present.
The user agent MUST set the capture timestamp of each video frame that is sourced from getUserMedia() and getDisplayMedia() to its best estimate of the time that the frame was captured. This value MUST be monotonically increasing.
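The following non-normative sketch reads the proposed metadata from frames of a camera track using MediaStreamTrackProcessor (assumed to be available in this context). For a camera-sourced track only captureTime is expected to be set; receiveTime and rtpTimestamp are typically set only on frames received over RTP.
<script>
const stream = await navigator.mediaDevices.getUserMedia({video: true});
const [track] = stream.getVideoTracks();
const {readable} = new MediaStreamTrackProcessor({track});
await readable.pipeTo(new WritableStream({
  write(frame) {
    const {captureTime, receiveTime, rtpTimestamp} = frame.metadata();
    console.log({captureTime, receiveTime, rtpTimestamp});
    frame.close();
  }
}));
</script>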
Some platforms or User Agents may provide built-in support for video effects triggered by user motion heuristics, in particular for camera video streams.
Web applications may either want to control or at least be aware that these heuristics are active and might trigger these effects at the source level.
This can for instance allow the web application to update its UI or to turn off these heuristics where having such effects triggered accidentally might be considered insensitive or inappropriate.
For that reason, we extend MediaStreamTrack with the following properties.
The WebIDL changes are the following:
WebIDL
partial dictionary MediaTrackSupportedConstraints {
  boolean gestureReactions = true;
};
partial dictionary MediaTrackConstraintSet {
  ConstrainBoolean gestureReactions;
};
partial dictionary MediaTrackSettings {
  boolean gestureReactions;
};
partial dictionary MediaTrackCapabilities {
  sequence<boolean> gestureReactions;
};
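As a non-normative illustration, an application for which accidentally triggered reactions would be inappropriate might attempt to turn the heuristics off.
<script>
// Open camera.
const stream = await navigator.mediaDevices.getUserMedia({video: true});
const [videoTrack] = stream.getVideoTracks();
// Try to disable platform-triggered gesture reactions.
const capabilities = videoTrack.getCapabilities();
if ((capabilities.gestureReactions || []).includes(false)) {
  await videoTrack.applyConstraints({gestureReactions: false});
} else {
  // The feature is either unsupported or cannot be disabled; consider
  // informing the user that reactions may still be triggered.
}
</script>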
Some platforms or User Agents may provide built-in support for automatic continuous framing based on the position of human faces within the field of view, in particular for camera video streams.
Web applications may either want to control or at least be aware that automatic continuous human face framing is applied at the source level. This may for instance allow the web application to update its UI or to not apply human face framing on its own.
For that reason, we extend MediaStreamTrack with the following properties.
The WebIDL changes are the following:
WebIDL
partial dictionary MediaTrackSupportedConstraints {
  boolean faceFraming = true;
};
partial dictionary MediaTrackCapabilities {
  sequence<boolean> faceFraming;
};
partial dictionary MediaTrackConstraintSet {
  ConstrainBoolean faceFraming;
};
partial dictionary MediaTrackSettings {
  boolean faceFraming;
};
When the "faceFraming" setting is set to true
by
the ApplyConstraints algorithm, the UA will attempt to
continuously improve framing by cropping to human faces.
When the "faceFraming" setting is set to false
by
the ApplyConstraints algorithm, the UA will not crop to human
faces.
<video></video>
<script>
// Open camera.
const stream = await navigator.mediaDevices.getUserMedia({video: true});
const [videoTrack] = stream.getVideoTracks();
// Try to improve framing.
const capabilities = videoTrack.getCapabilities();
if ("faceFraming" in capabilities) {
await videoTrack.applyConstraints({faceFraming: true});
} else {
// Face framing is not supported by the platform or by the camera.
// Consider falling back to some other method.
}
// Show to user.
const videoElement = document.querySelector("video");
videoElement.srcObject = stream;
</script>
Some platforms or User Agents may provide built-in support for human eye gaze correction to make the eyes of faces appear to look at the camera, in particular for camera video streams.
This may for instance allow the web application to update its UI or to not apply human eye gaze correction on its own.
For that reason, we extend MediaStreamTrack with the following properties.
The WebIDL changes are the following:
WebIDL
partial dictionary MediaTrackSupportedConstraints {
  boolean eyeGazeCorrection = true;
};
partial dictionary MediaTrackCapabilities {
  sequence<boolean> eyeGazeCorrection;
};
partial dictionary MediaTrackConstraintSet {
  ConstrainBoolean eyeGazeCorrection;
};
partial dictionary MediaTrackSettings {
  boolean eyeGazeCorrection;
};
When the "eyeGazeCorrection" setting is set to true
,
the User Agent will attempt to correct human eye gaze so that the eyes
of faces appear to look at the camera.
When the "eyeGazeCorrection" setting is set to false
,
the User Agent will not correct human eye gaze.
<video></video>
<script>
// Open camera.
const stream = await navigator.mediaDevices.getUserMedia({video: true});
const [videoTrack] = stream.getVideoTracks();
// Try to correct eye gaze.
const videoCapabilities = videoTrack.getCapabilities();
if ((videoCapabilities.eyeGazeCorrection || []).includes(true)) {
await videoTrack.applyConstraints({eyeGazeCorrection: {exact: true}});
} else {
// Eye gaze correction is not supported by the platform or by the camera.
// Consider falling back to some other method.
}
// Show to user.
const videoElement = document.querySelector("video");
videoElement.srcObject = stream;
</script>
Some platforms offer functionality for voice isolation: Attempting to remove all parts of an audio track that do not correspond to a human voice. Some platforms even attempt to remove extraneous voices, leaving the "main voice" as the dominant component of the audio. The exact methods used may vary between implementations.
This constraint permits the platform to turn on that functionality, with the desired result being that the "main voice" in the audio signal is the dominant component of the audio.
This will have large effects on audio that is presented for other reasons than to transmit voice (for instance music or ambient noises), so needs to be off by default.
This constraint is a stronger version of noise suppression, which means that if the "noiseSuppression" constraint is set to false and "voiceIsolation" is set to true, the value of "noiseSuppression" will be ignored.
This constraint has no such relationship with any other constraint; in particular it does not affect echoCancellation.
The WebIDL changes are the following:
WebIDL
partial dictionary MediaTrackSupportedConstraints {
  boolean voiceIsolation = true;
};
partial dictionary MediaTrackConstraintSet {
  ConstrainBoolean voiceIsolation;
};
partial dictionary MediaTrackSettings {
  boolean voiceIsolation;
};
partial dictionary MediaTrackCapabilities {
  sequence<boolean> voiceIsolation;
};
When the "voiceIsolation" setting is set to true
by the
ApplyConstraints algorithm, the UA
will attempt to remove the components of the audio track that
do not correspond to a human voice. If a dominant voice can be
identified, the UA will attempt to enhance that voice.
When the "voiceIsolation" constraint setting is set to false
by the ApplyConstraints algorithm, the UA will process the
audio according to other settings in its normal fashion.
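As a non-normative illustration, a voice-call application might request voice isolation where supported.
<script>
// Open microphone.
const stream = await navigator.mediaDevices.getUserMedia({audio: true});
const [micTrack] = stream.getAudioTracks();
// Try to isolate the main voice.
if ((micTrack.getCapabilities().voiceIsolation || []).includes(true)) {
  await micTrack.applyConstraints({voiceIsolation: true});
}
console.log("voiceIsolation:", micTrack.getSettings().voiceIsolation);
</script>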
The configuration (capabilities and settings) of a MediaStreamTrack may be changed dynamically outside the control of web applications. One example is when a user decides to switch on background blur through the operating system. Web applications might want to know that the configuration of a particular MediaStreamTrack has changed. For that purpose, a new event is defined below.
WebIDL
partial interface MediaStreamTrack {
  attribute EventHandler onconfigurationchange;
};
The onconfigurationchange attribute is an event handler IDL attribute for the onconfigurationchange event handler, whose event handler event type is configurationchange.
When the User Agent detects a change of configuration in a track's underlying source, the User Agent MUST run the following steps:
1. If track.muted is true, wait for track.muted to become false or track.readyState to be "ended".
2. Queue a task on the current settings object's responsible event loop to perform the following steps:
   This task will run before any other task that may set track.muted to true.
   1. If track.readyState is "ended", abort these steps.
   2. If track's capabilities and settings match the source configuration, abort these steps.
   3. Update track's capabilities and settings according to track's underlying source.
   4. Fire an event named configurationchange on track.
These events are potentially triggered simultaneously on documents of different origins. User Agents MAY add fuzzing on the timing of events to avoid cross-origin activity correlation.
This example shows how to monitor external background blur changes.
const stream = await navigator.mediaDevices.getUserMedia({video: true});
const [track] = stream.getVideoTracks();
let {backgroundBlur} = track.getSettings();
applyBlurInSoftwareInstead(!backgroundBlur);
track.addEventListener("configurationchange", () => {
if (backgroundBlur != track.getSettings().backgroundBlur) {
backgroundBlur = track.getSettings().backgroundBlur;
applyBlurInSoftwareInstead(!backgroundBlur);
}
});
Human face metadata describes the geometrical information of human faces in video frames. It can be set by web applications using the standard means when creating VideoFrameMetadata for VideoFrames or it can be set by a user agent when the media track constraint, defined below, is used to enable face detection for the MediaStreamTrack which provides the VideoFrames.
The facial metadata can be used by video encoders to enhance the quality of the faces in encoded video streams or for other suitable purposes.
VideoFrameMetadata

WebIDL
partial dictionary VideoFrameMetadata {
  sequence<Segment> segments;
};

segments of type sequence<Segment>
  The set of known geometrical segments in the video frame.
Segment

WebIDL
dictionary Segment {
  required SegmentType type;
  required long id;
  long partOf;
  required float probability;
  Point2D centerPoint;
  DOMRectInit boundingBox;
};

WebIDL
enum SegmentType {
  "human-face",
  "left-eye",
  "right-eye",
  "eye",
  "mouth",
};
Segment Members
type of type SegmentType
  The type of segment which the segment refers to. It must be one of the following values:
  human-face: The segment describes a human face.
  left-eye: The segment describes oculus sinister.
  right-eye: The segment describes oculus dexter.
  eye: The segment describes an eye, either left or right.
  mouth: The segment describes a mouth.
id of type long
  An identifier of the object described by the segment, unique within a sequence. If the same object can be tracked over multiple frames originating from the same MediaStreamTrack source or it can be matched to correspond to the same object in MediaStreamTracks which are cloned from the same original MediaStreamTrack, the user agent SHOULD use the same id for the segments which describe the object. id is also used in conjunction with the member partOf.
  The user agent MUST NOT select the value to assign to id in such a way that the detected objects could be correlated to match between different MediaStreamTrack sources unless the MediaStreamTracks are cloned from the same original MediaStreamTrack.
partOf of type long
  If defined, references another segment which has the member id set to the same value. The referenced segment corresponds to an object of which the object described by this segment is a part. If undefined, the object described by this segment is not known to be part of any other object described by any segment associated with the MediaStreamTrack.
probability of type float
  If nonzero, this is the estimate of the conditional probability that the segmented object actually is of the type indicated by the type member on the condition that the detection has been made. The value of this member must always be zero or above, with a maximum of one. The special value of zero indicates that the probability estimate is not available.
centerPoint of type Point2D
  The coordinates of the approximate center of the object described by this Segment. The object location in the frame can be specified even if it is obscured by other objects in front of it or it lies partially or fully outside of the frame.
  The x and y values of the point are interpreted to represent a coordinate in a normalized square space. The origin of coordinates {x,y} = {0.0, 0.0} represents the upper left corner whereas the {x,y} = {1.0, 1.0} represents the lower right corner relative to the rendered frame.
boundingBox of type DOMRectInit
  A bounding box surrounding the object described by this segment. The object bounding box in the frame can be specified even if it is obscured by other objects in front of it or it lies partially or fully outside of the frame. See the member centerPoint for the definition of the coordinate system.
MediaTrackSupportedConstraints dictionary extensions

WebIDL
partial dictionary MediaTrackSupportedConstraints {
  boolean humanFaceDetectionMode = true;
};

MediaTrackSupportedConstraints Members
humanFaceDetectionMode of type boolean, defaulting to true
dictionary extensionsWebIDLpartial dictionary MediaTrackCapabilities {
sequence<DOMString> humanFaceDetectionMode
;
};
MediaTrackCapabilities
MembershumanFaceDetectionMode
of type sequence<DOMString
>
The sequence of supported face detection modes.
Each string MUST be one of the members of HumanFaceDetectionModeEnum
.
See
humanFaceDetectionMode for additional details.
MediaTrackConstraintSet dictionary extensions

WebIDL
partial dictionary MediaTrackConstraintSet {
  ConstrainDOMString humanFaceDetectionMode;
};

MediaTrackConstraintSet Members
humanFaceDetectionMode of type ConstrainDOMString
MediaTrackSettings dictionary extensions

WebIDL
partial dictionary MediaTrackSettings {
  DOMString humanFaceDetectionMode;
};

MediaTrackSettings Members
humanFaceDetectionMode of type DOMString
WebIDL
enum HumanFaceDetectionModeEnum {
  "none",
  "bounding-box",
  "bounding-box-with-landmark-center-point",
};
HumanFaceDetectionModeEnum Enumeration Description
none
  This MediaStreamTrack source does not set metadata in VideoFrameMetadata of VideoFrames related to human faces or human face landmarks, that is, to any Segment which has the type set to any of the alternatives listed in enumeration SegmentType. As an input, this is interpreted as a command to turn off the setting of human face and landmark detection.
bounding-box
  This source sets metadata related to human faces (segment type of "human-face") including bounding box information in the member boundingBox of each Segment related to a detected face. The source does not set the human face landmark information. As an input, this is interpreted as a command to enable the setting of human face detection and to find the bounding box of each detected face.
bounding-box-with-landmark-center-point
  With this setting, the source sets a superset of the metadata compared to the "bounding-box" setting. The source sets the same metadata and additionally metadata related to human face landmarks (all other SegmentTypes except "human-face") including center point information in the member centerPoint of each Segment related to a detected landmark. As an input, this is interpreted as a command to enable the setting of human face and face landmark detection and to set bounding box related information to face segment metadata and to set the center point information of each detected face landmark.
The constrainable properties in this section are defined below.
Property Name: humanFaceDetectionMode
Values: ConstrainDOMString
Notes: This string (or each string, when a list) should be one of the members of HumanFaceDetectionModeEnum. As a setting, this reflects which face geometrical properties the user agent detects and sets in the metadata of the VideoFrames provided by this track.
// main.js:
// Open camera with face detection enabled
const stream = await navigator.mediaDevices.getUserMedia({
video: { humanFaceDetectionMode: 'bounding-box' }
});
const [videoTrack] = stream.getVideoTracks();
if (videoTrack.getSettings().humanFaceDetectionMode != 'bounding-box') {
  throw new Error('Face bounding box detection is not supported');
}
// Use a video worker and show to user.
const videoElement = document.querySelector('video');
const videoWorker = new Worker('video-worker.js');
videoWorker.postMessage({track: videoTrack}, [videoTrack]);
const {data} = await new Promise(r => videoWorker.onmessage = r);
videoElement.srcObject = new MediaStream([data.videoTrack]);
// video-worker.js:
self.onmessage = async ({data: {track}}) => {
const generator = new VideoTrackGenerator();
  self.postMessage({videoTrack: generator.track}, [generator.track]);
const {readable} = new MediaStreamTrackProcessor({track});
const transformer = new TransformStream({
async transform(frame, controller) {
for (const segment of frame.metadata().segments || []) {
if (segment.type === 'human-face') {
// the metadata is coming directly from the video track with
// bounding-box face detection enabled
console.log(
`Face @ (${segment.boundingBox.x}, ${segment.boundingBox.y}), size ` +
`${segment.boundingBox.width}x${segment.boundingBox.height}`);
}
}
controller.enqueue(frame);
}
});
await readable.pipeThrough(transformer).pipeTo(generator.writable);
};
MediaStreamTrack objects are exposed to workers, so MediaStream objects can be exposed there as well.
The WebIDL changes are the following:
WebIDL
[Exposed=(Window,Worker)]
partial interface MediaStream {
};
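A non-normative sketch of what this exposure enables: a worker that has received a transferred track can group it into a MediaStream without round-tripping through the main thread.
// worker.js
self.onmessage = ({data: {track}}) => {
  // MediaStream being exposed in workers allows assembling streams off the main thread.
  const stream = new MediaStream([track]);
  console.log("worker stream has", stream.getTracks().length, "track(s)");
};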