1. Introduction
This section is non-normative.
The Web Speech API aims to enable web developers to provide, in a web browser, speech-input and text-to-speech output features that are typically not available when using standard speech-recognition or screen-reader software. The API itself is agnostic of the underlying speech recognition and synthesis implementation and can support both server-based and client-based/embedded recognition and synthesis. The API is designed to enable both brief (one-shot) speech input and continuous speech input. Speech recognition results are provided to the web page as a list of hypotheses, along with other relevant information for each hypothesis.
This specification is a subset of the API defined in the HTML Speech Incubator Group Final Report . That report is entirely informative since it is not a standards track document. All portions of that report may be considered informative with regards to this document, and provide an informative background to this document. This specification is a fully-functional subset of that report. Specifically, this subset excludes the underlying transport protocol, the proposed additions to HTML markup, and it defines a simplified subset of the JavaScript API. This subset supports the majority of use-cases and sample code in the Incubator Group Final Report. This subset does not preclude future standardization of additions to the markup, API or underlying transport protocols, and indeed the Incubator Report defines a potential roadmap for such future work.
2. Use Cases
This section is non-normative.
This specification supports the following use cases, as defined in Section 4 of the Incubator Report .
- Voice Web Search
- Speech Command Interface
- Continuous Recognition of Open Dialog
- Speech UI present when no visible UI need be present
- Voice Activity Detection
- Temporal Structure of Synthesis to Provide Visual Feedback
- Hello World
- Speech Translation
- Speech Enabled Email Client
- Dialog Systems
- Multimodal Interaction
- Speech Driving Directions
- Multimodal Video Game
- Multimodal Search
To keep the API to a minimum, this specification does not directly support the following use case. This does not preclude adding support for this as a future API enhancement, and indeed the Incubator report provides a roadmap for doing so.
- Rerecognition
3. Security and privacy considerations
- User agents must only start speech input sessions with explicit, informed user consent. User consent can include, for example:
- User click on a visible speech input element which has an obvious graphical representation showing that it will start speech input.
- Accepting a permission prompt shown as the result of a call to start().
- Consent previously granted to always allow speech input for this web page.
- User agents must give the user an obvious indication when audio is being recorded.
- In a graphical user agent, this could be a mandatory notification displayed by the user agent as part of its chrome and not accessible by the web page. This could for example be a pulsating/blinking record icon as part of the browser chrome/address bar, an indication in the status bar, an audible notification, or anything else relevant and accessible to the user. This UI element must also allow the user to stop recording.
- In a speech-only user agent, the indication may for example take the form of the system speaking the label of the speech input element, followed by a short beep.
- The user agent may also give the user a longer explanation the first time speech input is used, to let the user know what it is and how they can tune their privacy settings to disable speech recording if required.
- To mitigate the risk of fingerprinting, user agents MUST NOT personalize speech recognition when performing speech recognition on a MediaStreamTrack.
3.1. Implementation considerations
This section is non-normative.
- Spoken password inputs can be problematic from a security perspective, but it is up to the user to decide if they want to speak their password.
- Speech input could potentially be used to eavesdrop on users. Malicious webpages could use tricks such as hiding the input element or otherwise making the user believe that it has stopped recording speech while continuing to do so. They could also potentially style the input element to appear as something else and trick the user into clicking them. An example of styling the file input element can be seen at https://www.quirksmode.org/dom/inputfile.html . The above recommendations are intended to reduce the risk of such attacks.
3.2. On-Device Model Privacy Considerations
This subsection, and the "On-Device Model Security Considerations" subsection below, unlike many "privacy considerations" sections which only summarize and restate considerations that are already normatively specified elsewhere in the document, contain some normative requirements that are not present elsewhere, and add more detail to the normative requirements present elsewhere. The novel normative requirements are called out using strong emphasis.
3.2.1. Language Pack Availability
For on-device speech recognition, the exact download status of language packs can present a fingerprinting vector. How many bits this vector provides depends on the options provided to available() or install(), and how they influence the download (e.g., if different language packs have different availability statuses).
3.2.2. Download Masking
One mitigation is for the user agent to mask the current download status by returning "downloadable" from available() even if the actual download status is "available" or "downloading". Because implementation strategies differ and other mitigations (like permission prompts for install()) are available, a specific masking scheme is not mandated.
For APIs where the user agent believes such masking is necessary, a suggested heuristic is to mask by default, subject to a masking state that is established for each (API, options, storage key) tuple. This state can be set to "unmasked" once a web page in a given storage key calls install() with a given set of options, and successfully starts a download or the promise resolves to true (indicating the language pack is ready). Since install() has stronger requirements (see Installation-time friction), this ensures that web pages only get access to the true download status after taking a more costly and less-repeatable action.
Implementations which use such a storage key -based masking scheme must ensure that the masking state is reset when other storage for that origin is reset.
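The following JavaScript sketch is non-normative and illustrates one possible shape of the suggested heuristic from the user agent's side. The maskingState map, tupleKey() serialization, and markUnmasked() hook are hypothetical implementation details and are not part of this API.

    // Hypothetical user-agent-internal sketch of the suggested masking heuristic.
    // maskingState records, per (API, options, storage key) tuple, whether the
    // true download status may be revealed to pages in that storage key.
    const maskingState = new Map();

    function tupleKey(api, options, storageKey) {
      // Serialization of the tuple is an implementation detail.
      return JSON.stringify([api, options.langs, options.processLocally, storageKey]);
    }

    function reportedAvailability(api, options, storageKey, actualStatus) {
      if (maskingState.get(tupleKey(api, options, storageKey)) === "unmasked") {
        return actualStatus;
      }
      // Mask "available" and "downloading" as "downloadable" until install()
      // has been successfully exercised for this tuple.
      return (actualStatus === "available" || actualStatus === "downloading")
          ? "downloadable"
          : actualStatus;
    }

    function markUnmasked(api, options, storageKey) {
      // Called once install() successfully starts a download or resolves to true.
      // Entries must be cleared whenever other storage for the origin is cleared.
      maskingState.set(tupleKey(api, options, storageKey), "unmasked");
    }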
3.2.2.1. Installation-time friction
The mitigation described in Download Masking works against attempts to silently fingerprint using available(). The specification also contains requirements to prevent install() from being easily used for fingerprinting, by introducing friction:
- The install() method both requires and consumes user activation, when it would initiate a download.
- The install() method allows the user agent to prompt the user for permission, or to implicitly reject download attempts based on previous signals (such as an observed pattern of abuse).
- Access to install() and available() is gated on a per-API policy-controlled feature, which means that only top-level origins and their delegates can use the API.
Additionally, initiating the download process via install() is more or less a one-time operation for a given language. The availability status will only transition from "downloadable" to "downloading" to "available" via these guarded installation operations. That is, while install() can be used to read some of these fingerprinting bits (by observing the resolution of its promise and subsequent calls to available()), doing so will effectively "destroy" those bits by changing the state.
(For details on cases where downloading might happen more than once, and how privacy and security are preserved in those cases, see Download Cancelation , Download Eviction , and Disk Space for Language Packs .)
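As a non-normative illustration of this friction, a page would typically call install() from inside a user-activation event such as a click. The element selector and language tag below are examples only.

    // Example (non-normative): trigger installation from a user gesture, since
    // install() requires and consumes user activation when it would start a download.
    document.querySelector('#enable-dictation').addEventListener('click', async () => {
      const options = { langs: ['en-US'], processLocally: true };

      const status = await SpeechRecognition.available(options);
      if (status === 'unavailable') return;       // on-device recognition not supported

      if (status !== 'available') {
        // May show a permission prompt; resolves to true once the pack is installed.
        const installed = await SpeechRecognition.install(options);
        if (!installed) return;
      }
      // On-device recognition for en-US is now ready to use.
    });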
3.2.2.2. Download Cancelation
An important part of making the download status a less-useful fingerprinting vector is to ensure that the website cannot toggle the availability state back and forth by starting and then effectively canceling downloads. The Web Speech API’s install() method returns a promise and does not take an AbortSignal to cancel the download itself (abort() is for an active recognition session). Once a download is initiated by install(), the user agent should preserve the download progress. User agents should not cancel an ongoing language pack download in response to page-controlled actions (e.g., navigation, page unload) that could be used to manipulate the download state for fingerprinting.
If a page navigates away, the download should ideally continue in the background, or at least its progress should be saved. The goal is to prevent the site from easily reverting a language pack’s state from "downloading" back to "downloadable".
Note that canceling downloads in response to explicit, out-of-band user-controlled actions (e.g., via browser UI) is not problematic from this perspective.
3.2.2.3. Download Eviction
Another ingredient in ensuring that websites cannot toggle the availability state back and forth is to ensure that user agents don’t use a quota-based eviction system for downloaded language packs that web pages can indirectly control. For example, if a user agent evicted less-recently-used language packs when new ones are installed, a web page could trigger such evictions to toggle the state of a target language pack.
To avoid this, user agents should not implement systems which allow web pages to control the eviction of downloaded language packs, including via indirect triggers such as further subsequent downloads. One way to fulfill this requirement is to never evict downloaded material in response to web page-initiated storage pressure, instead refusing to download new material (e.g., install() resolving to false) if doing so would cause storage pressure.
Evicting downloads in response to user-controlled actions (e.g., via a browser settings UI) is not problematic.
3.2.2.4. Alternate Options
While some of the above requirements are specified using "must" language, most are "should." This is because implementations might use different strategies to preserve user privacy, especially for APIs with smaller models or language packs.
The simplest is to treat language pack downloads like other stored resources, partitioning them by the downloading page’s storage key . This leverages existing web origin model privacy protections. The downside is potentially redundant downloads across sites, using more user bandwidth and disk space.
A variant is to re-download for new storage keys but re-use on-disk storage if the pack is already there, saving disk space but still using time/bandwidth.
User agents could also attempt to fake a download for new storage keys if the language pack is already present, by waiting a similar amount of time as the real download originally took. This saves bandwidth and disk space but is less private due to network side channels (e.g., a page observing no change in network throughput). Such a scheme needs caution, as the first site initiating the download could try to inflate this time. Nevertheless, faking download times might be useful, combined with other mitigations.
3.2.3. Sensitive Language Information
Even if fingerprinting risks from availability status are mitigated, knowing a user has downloaded a specific language pack (e.g., for a minority language) can be sensitive.
For this reason, on top of installation-time friction, user agents may artificially fake a download (e.g., by adding a delay to the resolution of the install() promise) if they believe it would be helpful for privacy reasons, instead of having install() resolve instantly if the language pack is already present. This provides plausible deniability. If install() takes a few seconds to resolve true, it could be a fake delay or a quick real download.
Such fake delays are not foolproof but offer some privacy benefit, especially when combined with other mitigations like prompts.
3.2.4. Model Version
The specific version or behavior of an on-device speech recognition model can also be a fingerprinting vector. These APIs do not expose model versions directly.
The best way to prevent the model version from becoming a fingerprinting vector is to tie it to the user agent’s version, such that the model’s version (and behavior) only updates alongside already-exposed information (like the User-Agent string).
User agents should limit the number of possible model versions that a single user agent version can be paired with when determining if a language pack is "available" via available(). This might involve not providing model updates to older user agent versions or ignoring already-downloaded models below a minimum version threshold after a user agent update (instead, available() might report "downloadable" for a newer version).
There’s a tradeoff: aggressively locking new UA versions to new model versions can increase transitions between "available" and "downloadable". This can be mitigated by allowing older models with newer UAs while a new model downloads, keeping the status "available" but briefly allowing identification of users with older-model/newer-UA combinations.
3.2.5. User Input and Speech Data
Speech data is inherently sensitive. Implementations must not train or fine-tune on-device speech recognition models on user speech input obtained through this API, or otherwise store user speech input in a way that models can consult in the future (e.g., for personalization beyond the current session or across origins).
Using user speech input in such a way would be a significant privacy leak, potentially exposing user information or information derived from interactions with one site to another.
This reinforces the existing requirement: "To mitigate the risk of fingerprinting, user agents MUST NOT personalize speech recognition when performing speech recognition on a MediaStreamTrack." The considerations here apply broadly to any speech processed by on-device models via this API.
3.2.6. Cloud-based vs. On-Device Implementations
The Web Speech API can support both server-based (cloud) and client-based/embedded (on-device) recognition and synthesis. The processLocally attribute allows developers to indicate a preference or requirement for on-device processing.
When processLocally is false (the default), user speech data may be sent to a remote server for processing. Web developers should be aware of this possibility and the associated privacy implications if they do not explicitly request local processing. User agents should also be transparent with users about where speech processing occurs.
When processLocally is true, the considerations in this "On-Device Model Privacy Considerations" section are paramount.
3.3. On-Device Model Security Considerations
3.3.1. Disk Space for Language Packs
Downloading language packs for on-device speech recognition via install() could use significant amounts of the user’s disk space. In the event of storage pressure, user agents should balance the utility of these APIs with the disk space they take up, possibly by having install() resolve to false for new downloads or by freeing up disk space in other ways. However, user agents need to be mindful of the privacy impacts discussed in Download Eviction when considering freeing up disk space by evicting language packs. User agents may involve the user in these decisions, e.g., via download-time prompts or a browser UI for managing downloaded language packs.
If a previously installed language pack is evicted (e.g., by the user or due to extreme storage pressure) while it might be in use or expected to be available, subsequent attempts to use it (e.g., via start() with lang set to that language and processLocally as true) should fail gracefully. This might involve available() returning "downloadable" or "unavailable", and start() potentially firing a SpeechRecognitionErrorEvent with an appropriate error code like language-not-supported or service-not-allowed.
3.3.2. Runtime Shared Resources
On-device speech recognition can consume significant runtime resources like CPU, memory, and potentially specialized hardware accelerators.
User agents should ensure that one web page’s use of on-device speech recognition does not overly interfere with another web page’s use of the API, or another web page’s general operation, or the overall system stability. For example, it should not be possible for a background tab to monopolize speech processing resources, preventing a foreground tab from using them.
This specification does not mandate any particular mitigation strategy, but possible approaches include queuing requests, rate limiting, prioritizing foreground tabs, or detecting abusive behavior. If necessary to prevent resource exhaustion or instability, the user agent may cause speech recognition operations to fail (e.g., by firing a SpeechRecognitionErrorEvent with service-not-allowed).
3.3.3. OS-Provided Models
One implementation strategy for on-device speech recognition is to delegate to models or capabilities provided by the underlying operating system. This can offer benefits like a consistent user experience and efficient resource usage.
However, this approach comes with the usual considerations of exposing OS capabilities to the web. User agents must still ensure that all privacy and security requirements of this specification are met when using OS-provided models. This includes the requirements in User Input and Speech Data (preventing training on user data) and Runtime Shared Resources (ensuring fair and stable resource sharing).
4. API Description
This section is normative.
4.1. The SpeechRecognition Interface
The speech recognition interface is the scripted web API for controlling a given recognition.
The term "final result" indicates a
SpeechRecognitionResult
in
which
the
isFinal
attribute
is
true.
The
term
"interim
result"
indicates
a
SpeechRecognitionResult
in
which
the
isFinal
attribute
is
false.
SpeechRecognition
has
the
following
internal
slots:
- [[started]] - A boolean flag representing whether the speech recognition started. The initial value is false.
- [[processLocally]] - A boolean flag indicating whether recognition MUST be performed locally. The initial value is false.
- [[phrases]] - A SpeechRecognitionPhraseList representing a list of phrases for contextual biasing. The initial value is null.
[SecureContext, Exposed=Window]
interface SpeechRecognition : EventTarget {
  constructor();

  // recognition parameters
  attribute SpeechGrammarList grammars;
  attribute DOMString lang;
  attribute boolean continuous;
  attribute boolean interimResults;
  attribute unsigned long maxAlternatives;
  attribute boolean processLocally;
  attribute SpeechRecognitionPhraseList phrases;

  // methods to drive the speech interaction
  undefined start();
  undefined start(MediaStreamTrack audioTrack);
  undefined stop();
  undefined abort();
  static Promise<AvailabilityStatus> available(SpeechRecognitionOptions options);
  static Promise<boolean> install(SpeechRecognitionOptions options);

  // event methods
  attribute EventHandler onaudiostart;
  attribute EventHandler onsoundstart;
  attribute EventHandler onspeechstart;
  attribute EventHandler onspeechend;
  attribute EventHandler onsoundend;
  attribute EventHandler onaudioend;
  attribute EventHandler onresult;
  attribute EventHandler onnomatch;
  attribute EventHandler onerror;
  attribute EventHandler onstart;
  attribute EventHandler onend;
};

dictionary SpeechRecognitionOptions {
  required sequence<DOMString> langs;
  boolean processLocally = false;
};

enum SpeechRecognitionErrorCode {
  "no-speech",
  "aborted",
  "audio-capture",
  "network",
  "not-allowed",
  "service-not-allowed",
  "language-not-supported",
  "phrases-not-supported"
};

enum AvailabilityStatus {
  "unavailable",
  "downloadable",
  "downloading",
  "available"
};

[SecureContext, Exposed=Window]
interface SpeechRecognitionErrorEvent : Event {
  constructor(DOMString type, SpeechRecognitionErrorEventInit eventInitDict);
  readonly attribute SpeechRecognitionErrorCode error;
  readonly attribute DOMString message;
};

dictionary SpeechRecognitionErrorEventInit : EventInit {
  required SpeechRecognitionErrorCode error;
  DOMString message = "";
};

// Item in N-best list
[SecureContext, Exposed=Window]
interface SpeechRecognitionAlternative {
  readonly attribute DOMString transcript;
  readonly attribute float confidence;
};

// A complete one-shot simple response
[SecureContext, Exposed=Window]
interface SpeechRecognitionResult {
  readonly attribute unsigned long length;
  getter SpeechRecognitionAlternative item(unsigned long index);
  readonly attribute boolean isFinal;
};

// A collection of responses (used in continuous mode)
[SecureContext, Exposed=Window]
interface SpeechRecognitionResultList {
  readonly attribute unsigned long length;
  getter SpeechRecognitionResult item(unsigned long index);
};

// A full response, which could be interim or final, part of a continuous response or not
[SecureContext, Exposed=Window]
interface SpeechRecognitionEvent : Event {
  constructor(DOMString type, SpeechRecognitionEventInit eventInitDict);
  readonly attribute unsigned long resultIndex;
  readonly attribute SpeechRecognitionResultList results;
};

dictionary SpeechRecognitionEventInit : EventInit {
  unsigned long resultIndex = 0;
  required SpeechRecognitionResultList results;
};

// The object representing a speech grammar. This interface has been deprecated and exists
// in this spec for the sole purpose of maintaining backwards compatibility.
[Exposed=Window]
interface SpeechGrammar {
  attribute DOMString src;
  attribute float weight;
};

// The object representing a speech grammar collection. This interface has been deprecated
// and exists in this spec for the sole purpose of maintaining backwards compatibility.
[Exposed=Window]
interface SpeechGrammarList {
  constructor();
  readonly attribute unsigned long length;
  getter SpeechGrammar item(unsigned long index);
  undefined addFromURI(DOMString src, optional float weight = 1.0);
  undefined addFromString(DOMString string, optional float weight = 1.0);
};

// The object representing a phrase for contextual biasing.
[SecureContext, Exposed=Window]
interface SpeechRecognitionPhrase {
  constructor(DOMString phrase, optional float boost = 1.0);
  readonly attribute DOMString phrase;
  readonly attribute float boost;
};

// The object representing a list of phrases for contextual biasing.
[SecureContext, Exposed=Window]
interface SpeechRecognitionPhraseList {
  constructor(sequence<SpeechRecognitionPhrase> phrases);
  readonly attribute unsigned long length;
  SpeechRecognitionPhrase item(unsigned long index);
  undefined addItem(SpeechRecognitionPhrase item);
  undefined removeItem(unsigned long index);
};
4.1.1. SpeechRecognition Attributes
-
grammarsattribute, of type SpeechGrammarList - The grammars attribute stores the collection of SpeechGrammar objects which represent the grammars that are active for this recognition. This attribute does nothing and exists in this spec for the sole purpose of maintaining backwards compatibility.
-
langattribute, of type DOMString - This attribute will set the language of the recognition for the request, using a valid BCP 47 language tag. [BCP47] If unset it remains unset for getting in script, but will default to use the language of the html document root element and associated hierarchy. This default value is computed and used when the input request opens a connection to the recognition service.
-
continuousattribute, of type boolean - When the continuous attribute is set to false, the user agent must return no more than one final result in response to starting recognition, for example a single turn pattern of interaction. When the continuous attribute is set to true, the user agent must return zero or more final results representing multiple consecutive recognitions in response to starting recognition, for example a dictation. The default value must be false. Note, this attribute setting does not affect interim results.
-
interimResultsattribute, of type boolean - Controls whether interim results are returned. When set to true, interim results should be returned. When set to false, interim results must not be returned. The default value must be false. Note, this attribute setting does not affect final results.
-
maxAlternatives attribute, of type unsigned long - This attribute will set the maximum number of SpeechRecognitionAlternatives per result. The default value is 1.
-
processLocallyattribute, of type boolean - This attribute, when set to true, indicates a requirement that the speech recognition process MUST be performed locally on the user’s device. If set to false, the user agent can choose between local and remote processing. The default value is false.
-
phrases attribute, of type SpeechRecognitionPhraseList - This attribute represents a list of phrases for contextual biasing.
- The getter steps are to return the value of [[phrases]].
- The setter steps are:
  - If the length of the given value is greater than 0 and the system does not support contextual biasing, throw a SpeechRecognitionErrorEvent with the phrases-not-supported error code and abort these steps.
  - Set [[phrases]] to the given value.
  - Send a copy of [[phrases]] to the system for initializing or updating the phrases for contextual biasing implementation.
The group has discussed whether WebRTC might be used to specify selection of audio sources and remote recognizers. See Interacting with WebRTC, the Web Audio API and other external sources thread on public-speech-api@w3.org.
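The following JavaScript snippet is a non-normative example of configuring these attributes before starting recognition; the attribute values and phrase strings are illustrative only.

    // Example (non-normative): configure a recognition session before starting it.
    const recognition = new SpeechRecognition();
    recognition.lang = 'en-US';          // a BCP 47 language tag
    recognition.continuous = true;       // keep returning final results (dictation-style)
    recognition.interimResults = true;   // also deliver interim hypotheses
    recognition.maxAlternatives = 3;     // up to 3 alternatives per result
    recognition.processLocally = false;  // allow local or remote processing

    // Contextual biasing: the setter throws phrases-not-supported if the
    // recognizer cannot bias towards the supplied phrases.
    recognition.phrases = new SpeechRecognitionPhraseList([
      new SpeechRecognitionPhrase('anaphylaxis', 3.0),
      new SpeechRecognitionPhrase('epinephrine', 2.0),
    ]);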
4.1.2. SpeechRecognition Methods
-
start() method - Start the speech recognition process, directly from a microphone on the device. When invoked, run the following steps:
-
Let requestMicrophonePermission be a boolean variable set to
true. -
Run the start session algorithm with requestMicrophonePermission .
-
-
start(MediaStreamTrack audioTrack) method - Start the speech recognition process, using a MediaStreamTrack. When invoked, run the following steps:
- Let audioTrack be the first argument.
- If audioTrack’s kind attribute is NOT "audio", throw an InvalidStateError and abort these steps.
- If audioTrack’s readyState attribute is NOT "live", throw an InvalidStateError and abort these steps.
- Let requestMicrophonePermission be false.
Run the start session algorithm with requestMicrophonePermission .
-
-
stop()method - The stop method represents an instruction to the recognition service to stop listening to more audio, and to try and return a result using just the audio that it has already received for this recognition. A typical use of the stop method might be for a web application where the end user is doing the end pointing, similar to a walkie-talkie. The end user might press and hold the space bar to talk to the system and on the space down press the start call would have occurred and when the space bar is released the stop method is called to ensure that the system is no longer listening to the user. Once the stop method is called the speech service must not collect additional audio and must not continue to listen to the user. The speech service must attempt to return a recognition result (or a nomatch) based on the audio that it has already collected for this recognition. If the stop method is called on an object which is already stopped or being stopped (that is, start was never called on it, the end or error event has fired on it, or stop was previously called on it), the user agent must ignore the call.
-
abort() method - The abort method is a request to immediately stop listening and stop recognizing, and to return no information other than that the system is done. When the abort method is called, the speech service must stop recognizing. The user agent must raise an end event once the speech service is no longer connected. If the abort method is called on an object which is already stopped or aborting (that is, start was never called on it, the end or error event has fired on it, or abort was previously called on it), the user agent must ignore the call.
-
available(SpeechRecognitionOptions options) method - The available method returns a Promise that resolves to an AvailabilityStatus indicating the recognition availability matching the SpeechRecognitionOptions argument. Access to this method is gated behind the policy-controlled feature "on-device-speech-recognition", which has a default allowlist of 'self'. When invoked, run these steps:
-
Let promise be a new promise .
-
Run the availability algorithm with options and promise . If it returns an exception, throw it and abort these steps.
-
Return promise .
-
-
install(SpeechRecognitionOptions options) method - The install method attempts to install speech recognition language packs for all languages specified in options.langs. It returns a Promise that resolves to a boolean. The promise resolves to true when all installation attempts for requested and supported languages succeed (or the languages were already installed). The promise resolves to false if options.langs is empty, if not all of the requested languages are supported, or if any installation attempt for a supported language fails. Access to this method is gated behind the policy-controlled feature "on-device-speech-recognition", which has a default allowlist of 'self'. When invoked, run these steps:
- If the current settings object’s relevant global object’s associated Document is NOT fully active, throw an InvalidStateError and abort these steps.
- If any lang in langs of options is not a valid [BCP47] language tag, throw a SyntaxError and abort these steps.
- If the on-device speech recognition language pack for any lang in langs of options is unsupported, return a resolved Promise with false and skip the rest of these steps.
- Let promise be a new promise.
- For each lang in langs of options, initiate the download of the on-device speech recognition language for lang. Note: The user agent can prompt the user for explicit permission to download the on-device speech recognition language pack.
- Queue a task on the relevant global object’s task queue to run the following step:
  - When the download of all languages specified by langs of options succeeds, resolve promise with true, otherwise resolve it with false. Note: The false resolution of the Promise does not indicate the specific cause of failure. User agents are encouraged to provide more detailed information about the failure in developer tools console messages. However, this detailed error information is not exposed to the script.
- Return promise.
Note: processLocally of options is not used in this algorithm.
-
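The following non-normative sketch shows the two start() overloads together with stop() and abort(); the getUserMedia() setup is only an example of obtaining a suitable MediaStreamTrack.

    // Example (non-normative): drive recognition sessions with the methods above.
    const micRecognition = new SpeechRecognition();
    micRecognition.start();     // capture directly from the microphone
    // ... later, ask for a result using only the audio received so far:
    micRecognition.stop();

    // Alternatively, recognize an existing audio MediaStreamTrack.
    async function recognizeFromTrack() {
      const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
      const [track] = stream.getAudioTracks();    // kind "audio", readyState "live"
      const trackRecognition = new SpeechRecognition();
      trackRecognition.start(track);
      // Abandon the session without waiting for any result:
      trackRecognition.abort();
    }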
4.1.3. AvailabilityStatus Enum Values
The AvailabilityStatus enum indicates the availability of speech recognition capabilities. Its values are:
-
"unavailable" -
Indicates that speech recognition is not available for the specified language(s) and processing preference. If processLocally of options is true, this means on-device recognition for the language is not supported by the user agent. If processLocally of options is false, it means neither local nor remote recognition is available for at least one of the specified languages.
"downloadable" -
Indicates that on-device speech recognition for the specified language(s) is supported by the user agent but not yet installed. It can potentially be installed using the install() method. This status is primarily relevant when processLocally of options is true.
"downloading" -
Indicates that on-device speech recognition for the specified language(s) is currently in the process of being downloaded. This status is primarily relevant when processLocally of options is true.
"available" -
Indicates that speech recognition is available for all specified language(s) and the given processing preference. If processLocally of options is true, this means on-device recognition is installed and ready. If processLocally of options is false, it means recognition (which could be local or remote) is available.
When the availability algorithm with options and promise is invoked, the user agent MUST run the following steps:
-
If the current settings object’s relevant global object’s associated Document is NOT fully active, throw an InvalidStateError and abort these steps.
- Let langs be langs of options.
- If any lang in langs is not a valid [BCP47] language tag, throw a SyntaxError and abort these steps.
- If processLocally of options is false:
  - If langs is an empty sequence, let status be unavailable.
  - Else if speech recognition (which may be remote) is available for all languages in langs, let status be available.
  - Else, let status be unavailable.
- If processLocally of options is true:
  - If langs is an empty sequence, let status be unavailable.
  - Else:
    - Let finalStatus be available.
    - For each language in langs:
      - Let currentLanguageStatus.
      - If on-device speech recognition for language is installed, set currentLanguageStatus to available.
      - Else if on-device speech recognition for language is currently being downloaded, set currentLanguageStatus to downloading.
      - Else if on-device speech recognition for language is supported by the user agent but not yet installed, set currentLanguageStatus to downloadable.
      - Else (on-device speech recognition for language is not supported), set currentLanguageStatus to unavailable.
      - If currentLanguageStatus comes after finalStatus in the ordered list [available, downloading, downloadable, unavailable], set finalStatus to currentLanguageStatus.
    - Let status be finalStatus.
Queue a task on the relevant global object ’s task queue to run the following step:
-
Resolve promise with status .
-
When the start session algorithm with requestMicrophonePermission is invoked, the user agent MUST run the following steps:
-
If the current settings object’s relevant global object’s associated Document is NOT fully active, throw an InvalidStateError and abort these steps.
- If [[started]] is true and no error event or end event has fired on it, throw an InvalidStateError and abort these steps.
- If this.[[processLocally]] is true:
  - If the user agent determines that local speech recognition is not available for this.lang, or if it cannot fulfill the local processing requirement for other reasons:
    - Queue a task to fire an event named error at this using SpeechRecognitionErrorEvent with its error attribute initialized to service-not-allowed and its message attribute set to an implementation-defined string detailing the reason.
    - Abort these steps.
- Set [[started]] to true.
- If requestMicrophonePermission is true and request permission to use "microphone" is "denied", abort these steps.
- Once the system is successfully listening to the recognition, queue a task to fire an event named start at this.
4.1.4. SpeechRecognition Events
The DOM Level 2 Event Model is used for speech recognition events. The methods in the EventTarget interface should be used for registering event listeners. The SpeechRecognition interface also contains convenience attributes for registering a single event handler for each event type. These events do not bubble and are not cancelable.
For all these events, the timeStamp attribute defined in the DOM Level 2 Event interface must be set to the best possible estimate of when the real-world event which the event object represents occurred. This timestamp must be represented in the user agent’s view of time, even for events where the timestamps in question could be raised on a different machine like a remote recognition service (i.e., in a speechend event with a remote speech endpointer).
Unless specified below, the ordering of the different events is undefined. For example, some implementations may fire audioend before speechstart or speechend if the audio detector is client-side and the speech detector is server-side.
-
audiostartevent - Fired when the user agent has started to capture audio.
-
soundstartevent - Fired when some sound, possibly speech, has been detected. This must be fired with low latency, e.g. by using a client-side energy detector. The audiostart event must always have been fired before the soundstart event.
-
speechstartevent - Fired when the speech that will be used for speech recognition has started. The audiostart event must always have been fired before the speechstart event.
-
speechendevent - Fired when the speech that will be used for speech recognition has ended. The speechstart event must always have been fired before speechend.
-
soundendevent - Fired when some sound is no longer detected. This must be fired with low latency, e.g. by using a client-side energy detector. The soundstart event must always have been fired before soundend.
-
audioendevent - Fired when the user agent has finished capturing audio. The audiostart event must always have been fired before audioend.
-
resultevent -
Fired when the speech recognizer returns a result. The event must use the SpeechRecognitionEvent interface. The audiostart event must always have been fired before the result event.
nomatchevent -
Fired when the speech recognizer returns a final result with no recognition hypothesis that meets or exceeds the confidence threshold. The event must use the SpeechRecognitionEvent interface. The results attribute in the event may contain speech recognition results that are below the confidence threshold or may be null. The audiostart event must always have been fired before the nomatch event.
errorevent -
Fired when a speech recognition error occurs. The event must use the SpeechRecognitionErrorEvent interface.
startevent - Fired when the recognition service has begun to listen to the audio with the intention of recognizing.
-
endevent - Fired when the service has disconnected. The event must always be generated when the session ends no matter the reason for the end.
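The following non-normative example registers handlers for the most common events described above; the console logging is illustrative.

    // Example (non-normative): observe the lifecycle and results of a session.
    const recognition = new SpeechRecognition();
    recognition.interimResults = true;

    recognition.onstart = () => console.log('listening…');
    recognition.onresult = (event) => {
      const result = event.results[event.results.length - 1];
      const { transcript, confidence } = result[0];           // best alternative
      console.log(result.isFinal ? 'final:' : 'interim:', transcript, confidence);
    };
    recognition.onnomatch = () => console.log('no confident hypothesis');
    recognition.onerror = (event) => console.error(event.error, event.message);
    recognition.onend = () => console.log('session ended');

    recognition.start();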
4.1.5. SpeechRecognitionErrorEvent
The SpeechRecognitionErrorEvent interface is used for the error event.
-
errorattribute, of type SpeechRecognitionErrorCode , readonly -
The errorCode is an enumeration indicating what has gone wrong. The values are:
-
"no-speech" - No speech was detected.
-
"aborted" - Speech input was aborted somehow, maybe by some user-agent-specific behavior such as UI that lets the user cancel speech input.
-
"audio-capture" - Audio capture failed.
-
"network" - Some network communication that was required to complete the recognition failed.
-
"not-allowed" - The user agent is not allowing any speech input to occur for reasons of security, privacy or user preference.
-
"service-not-allowed" - The user agent is not allowing the web application requested speech service, but would allow some speech service, to be used either because the user agent doesn’t support the selected one or because of reasons of security, privacy or user preference.
-
"language-not-supported" - The language was not supported.
-
"phrases-not-supported" - The speech recognition model does not support phrases for contextual biasing.
-
-
messageattribute, of type DOMString , readonly - The message content is implementation specific. This attribute is primarily intended for debugging and developers should not use it directly in their application user interface.
4.1.6. SpeechRecognitionAlternative
The SpeechRecognitionAlternative represents a simple view of the response that gets used in an n-best list.
-
transcriptattribute, of type DOMString , readonly - The transcript string represents the raw words that the user spoke. For continuous recognition, leading or trailing whitespace MUST be included where necessary such that concatenation of consecutive SpeechRecognitionResults produces a proper transcript of the session.
-
confidenceattribute, of type float , readonly -
The confidence represents a numeric estimate between 0 and 1 of how confident the recognition system is that the recognition is correct. A higher number means the system is more confident.
The group has discussed whether confidence can be specified in a speech-recognition-engine-independent manner and whether confidence threshold and nomatch should be included, because this is not a dialog API. See Confidence property thread on public-speech-api@w3.org.
4.1.7. SpeechRecognitionResult
The SpeechRecognitionResult object represents a single one-shot recognition match, either as one small part of a continuous recognition or as the complete return result of a non-continuous recognition.
-
length attribute, of type unsigned long , readonly - The length attribute represents how many n-best alternatives are represented in the item array.
-
item( index ) getter - The item getter returns a SpeechRecognitionAlternative from the index into an array of n-best values. If index is greater than or equal to length, this returns null. The user agent must ensure that the length attribute is set to the number of elements in the array. The user agent must ensure that the n-best list is sorted in non-increasing confidence order (the confidence of each element must be less than or equal to that of the preceding elements).
-
isFinalattribute, of type boolean , readonly - The final boolean must be set to true if this is the final time the speech service will return this particular index value. If the value is false, then this represents an interim result that could still be changed.
4.1.8. SpeechRecognitionResultList
The SpeechRecognitionResultList object holds a sequence of recognition results representing the complete return result of a continuous recognition. For a non-continuous recognition it will hold only a single value.
-
lengthattribute, of type unsigned long , readonly - The length attribute indicates how many results are represented in the item array.
-
item( index )getter - The item getter returns a SpeechRecognitionResult from the index into an array of result values. If index is greater than or equal to length, this returns null. The user agent must ensure that the length attribute is set to the number of elements in the array.
4.1.9. SpeechRecognitionEvent
The SpeechRecognitionEvent is the event that is raised each time there are any changes to interim or final results.
-
resultIndexattribute, of type unsigned long , readonly - The resultIndex must be set to the lowest index in the "results" array that has changed.
-
results attribute, of type SpeechRecognitionResultList , readonly - The array of all current recognition results for this session. Specifically all final results that have been returned, followed by the current best hypothesis for all interim results. It must consist of zero or more final results followed by zero or more interim results. On subsequent SpeechRecognitionEvent events, interim results may be overwritten by a newer interim result or by a final result or may be removed (when at the end of the "results" array and the array length decreases). Final results must not be overwritten or removed. All entries for indexes less than resultIndex must be identical to the array that was present when the last SpeechRecognitionEvent was raised. All array entries (if any) for indexes equal to or greater than resultIndex that were present in the array when the last SpeechRecognitionEvent was raised are removed and overwritten with new results. The length of the "results" array may increase or decrease, but must not be less than resultIndex. Note that when resultIndex equals results.length, no new results are returned; this may occur when the array length decreases to remove one or more interim results.
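The following non-normative sketch shows how a page might maintain a transcript from these events in continuous mode, using resultIndex to skip entries that have not changed; recognition and render() are assumed to be defined elsewhere by the page.

    // Example (non-normative): rebuild final and interim text on each result event.
    let finalTranscript = '';

    recognition.onresult = (event) => {
      let interimTranscript = '';
      // Entries before resultIndex are identical to the previous event.
      for (let i = event.resultIndex; i < event.results.length; i++) {
        const result = event.results[i];
        if (result.isFinal) {
          finalTranscript += result[0].transcript;    // final results are never revised
        } else {
          interimTranscript += result[0].transcript;  // may still change
        }
      }
      render(finalTranscript, interimTranscript);     // hypothetical page-defined helper
    };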
4.1.10. SpeechRecognitionPhrase
The SpeechRecognitionPhrase object represents a phrase for contextual biasing and has the following internal slots:
-
[[phrase]] -
A
DOMStringrepresenting the text string to be boosted. The initial value is null. An empty value is allowed but should be ignored by the speech recognition model.
-
[[boost]] -
A float representing approximately the natural log of the number of times more likely the website thinks this phrase is than what the speech recognition model knows. A valid boost must be a float value inside the range [0.0, 10.0], with a default value of 1.0 if not specified. A boost of 0.0 means the phrase is not boosted at all, and a higher boost means the phrase is more likely to appear. A boost of 10.0 means the phrase is extremely likely to appear and should be rarely set.
-
SpeechRecognitionPhrase( phrase , boost )constructor -
When this constructor is invoked, run the following steps:
-
If boost is smaller than 0.0 or greater than 10.0, throw a
SyntaxErrorand abort these steps. -
Let phr be a new object of type
SpeechRecognitionPhrase. -
Set phr .
[[phrase]]to be the value of phrase . -
Set phr .
[[boost]]to be the value of boost . -
Return phr .
-
-
phraseattribute, of type DOMString , readonly -
This attribute returns the value of [[phrase]].
-
boostattribute, of type float , readonly -
This attribute returns the value of [[boost]].
4.1.11. SpeechRecognitionPhraseList
The SpeechRecognitionPhraseList object holds a list of phrases for contextual biasing and has the following internal slot:
-
[[phrases]] -
A list of
SpeechRecognitionPhraserepresenting the phrases to be boosted. The initial value is an empty list.
-
SpeechRecognitionPhraseList( phrases )constructor -
When this constructor is invoked, run the following steps:
-
Let list be a new object of type
SpeechRecognitionPhraseList. -
Set list .
[[phrases]]to be the value of phrases . -
Return list .
-
-
lengthattribute, of type unsigned long , readonly -
This attribute indicates the number of phrases in the list. When invoked, return the number of items in [[phrases]].
-
item( index )method -
This method gets the SpeechRecognitionPhrase object at the index of the list. When invoked, run the following steps:
-
If index is smaller than 0, or greater than or equal to
length, throw aRangeErrorand abort these steps. -
Return the
SpeechRecognitionPhraseat the index of[[phrases]].
-
-
addItem( item )method -
This method adds the SpeechRecognitionPhrase object item to the list. When invoked, add item to the end of [[phrases]]. The list is allowed to have multiple SpeechRecognitionPhrase objects with the same [[phrase]] value, and the speech recognition model should use the last [[boost]] value for this [[phrase]] in the list.
-
removeItem( index )method -
This method removes the SpeechRecognitionPhrase object at the index of the list. When invoked, run the following steps:
-
If index is smaller than 0, or greater than or equal to
length, throw aRangeErrorand abort these steps. -
Remove the
SpeechRecognitionPhraseobject at the index of[[phrases]].
-
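A brief non-normative example of building and editing a phrase list with the methods above; the phrase strings and boosts are illustrative, and recognition is assumed to be an existing SpeechRecognition instance.

    // Example (non-normative): build and edit a contextual-biasing phrase list.
    const phrases = new SpeechRecognitionPhraseList([
      new SpeechRecognitionPhrase('latte', 2.0),
    ]);
    phrases.addItem(new SpeechRecognitionPhrase('macchiato', 3.0));

    console.log(phrases.length);           // 2
    console.log(phrases.item(1).phrase);   // "macchiato"
    console.log(phrases.item(1).boost);    // 3

    phrases.removeItem(0);                 // drop "latte"
    recognition.phrases = phrases;         // assign to a SpeechRecognition instance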
4.1.12. SpeechGrammar
The SpeechGrammar object represents a container for a grammar.
Grammar support has been deprecated and removed. The grammar objects remain in the spec for backwards compatibility purposes only and do not affect speech recognition.
This structure has the following attributes:
-
srcattribute, of type DOMString - The required src attribute is the URI for the grammar.
-
weightattribute, of type float - The optional weight attribute controls the weight that the speech recognition service should use with this grammar. By default, a grammar has a weight of 1. Larger weight values positively weight the grammar while smaller weight values make the grammar weighted less strongly.
4.1.13. SpeechGrammarList
The SpeechGrammarList object represents a collection of SpeechGrammar objects. This structure has the following attributes:
Grammar support has been deprecated and removed. The grammar objects remain in the spec for backwards compatibility purposes only and do not affect speech recognition.
-
lengthattribute, of type unsigned long , readonly - The length attribute represents how many grammars are currently in the array.
-
item( index )getter - The item getter returns a SpeechGrammar from the index into an array of grammars. The user agent must ensure that the length attribute is set to the number of elements in the array. The user agent must ensure that the index order from smallest to largest matches the order in which grammars were added to the array.
-
addFromURI( src , weight )method - This method appends a grammar to the grammars array parameter based on URI. The URI for the grammar is specified by the src parameter, which represents the URI for the grammar. Note, some services may support builtin grammars that can be specified by URI. The weight parameter represents this grammar’s weight relative to the other grammar.
-
addFromString( string , weight )method - This method appends a grammar to the grammars array parameter based on text. The content of the grammar is specified by the string parameter. This content should be encoded into a data: URI when the SpeechGrammar object is created. The weight parameter represents this grammar’s weight relative to the other grammar.
4.2. The SpeechSynthesis Interface
The SpeechSynthesis interface is the scripted web API for controlling a text-to-speech output.
[Exposed=Window]
interface SpeechSynthesis : EventTarget {
  readonly attribute boolean pending;
  readonly attribute boolean speaking;
  readonly attribute boolean paused;
  attribute EventHandler onvoiceschanged;
  undefined speak(SpeechSynthesisUtterance utterance);
  undefined cancel();
  undefined pause();
  undefined resume();
  sequence<SpeechSynthesisVoice> getVoices();
};

partial interface Window {
  [SameObject] readonly attribute SpeechSynthesis speechSynthesis;
};

[Exposed=Window]
interface SpeechSynthesisUtterance : EventTarget {
  constructor(optional DOMString text);
  attribute DOMString text;
  attribute DOMString lang;
  attribute SpeechSynthesisVoice? voice;
  attribute float volume;
  attribute float rate;
  attribute float pitch;
  attribute EventHandler onstart;
  attribute EventHandler onend;
  attribute EventHandler onerror;
  attribute EventHandler onpause;
  attribute EventHandler onresume;
  attribute EventHandler onmark;
  attribute EventHandler onboundary;
};

[Exposed=Window]
interface SpeechSynthesisEvent : Event {
  constructor(DOMString type, SpeechSynthesisEventInit eventInitDict);
  readonly attribute SpeechSynthesisUtterance utterance;
  readonly attribute unsigned long charIndex;
  readonly attribute unsigned long charLength;
  readonly attribute float elapsedTime;
  readonly attribute DOMString name;
};

dictionary SpeechSynthesisEventInit : EventInit {
  required SpeechSynthesisUtterance utterance;
  unsigned long charIndex = 0;
  unsigned long charLength = 0;
  float elapsedTime = 0;
  DOMString name = "";
};

enum SpeechSynthesisErrorCode {
  "canceled",
  "interrupted",
  "audio-busy",
  "audio-hardware",
  "network",
  "synthesis-unavailable",
  "synthesis-failed",
  "language-unavailable",
  "voice-unavailable",
  "text-too-long",
  "invalid-argument",
  "not-allowed"
};

[Exposed=Window]
interface SpeechSynthesisErrorEvent : SpeechSynthesisEvent {
  constructor(DOMString type, SpeechSynthesisErrorEventInit eventInitDict);
  readonly attribute SpeechSynthesisErrorCode error;
};

dictionary SpeechSynthesisErrorEventInit : SpeechSynthesisEventInit {
  required SpeechSynthesisErrorCode error;
};

[Exposed=Window]
interface SpeechSynthesisVoice {
  readonly attribute DOMString voiceURI;
  readonly attribute DOMString name;
  readonly attribute DOMString lang;
  readonly attribute boolean localService;
  readonly attribute boolean default;
};
4.2.1. SpeechSynthesis Attributes
-
pendingattribute, of type boolean , readonly - This attribute is true if the queue for the global SpeechSynthesis instance contains any utterances which have not started speaking.
-
speakingattribute, of type boolean , readonly - This attribute is true if an utterance is being spoken. Specifically if an utterance has begun being spoken and has not completed being spoken. This is independent of whether the global SpeechSynthesis instance is in the paused state.
-
paused attribute, of type boolean , readonly - This attribute is true when the global SpeechSynthesis instance is in the paused state. This state is independent of whether anything is in the queue. The default state of the global SpeechSynthesis instance for a new window is the non-paused state.
4.2.2. SpeechSynthesis Methods
-
speak( utterance )method - This method appends the SpeechSynthesisUtterance object utterance to the end of the queue for the global SpeechSynthesis instance. It does not change the paused state of the SpeechSynthesis instance. If the SpeechSynthesis instance is paused, it remains paused. If it is not paused and no other utterances are in the queue, then this utterance is spoken immediately, else this utterance is queued to begin speaking after the other utterances in the queue have been spoken. If changes are made to the SpeechSynthesisUtterance object after calling this method and prior to the corresponding end or error event, it is not defined whether those changes will affect what is spoken, and those changes may cause an error to be returned. The SpeechSynthesis object takes exclusive ownership of the SpeechSynthesisUtterance object. Passing it as a speak() argument to another SpeechSynthesis object should throw an exception. (For example, two frames may have the same origin and each will contain a SpeechSynthesis object.)
-
cancel()method - This method removes all utterances from the queue. If an utterance is being spoken, speaking ceases immediately. This method does not change the paused state of the global SpeechSynthesis instance.
-
pause()method - This method puts the global SpeechSynthesis instance into the paused state. If an utterance was being spoken, it pauses mid-utterance. (If called when the SpeechSynthesis instance was already in the paused state, it does nothing.)
-
resume()method - This method puts the global SpeechSynthesis instance into the non-paused state. If an utterance was speaking, it continues speaking the utterance at the point at which it was paused, else it begins speaking the next utterance in the queue (if any). (If called when the SpeechSynthesis instance was already in the non-paused state, it does nothing.)
-
getVoices() method - This method returns the available voices. It is user agent dependent which voices are available. If there are no voices available, or if the list of available voices is not yet known (for example: server-side synthesis where the list is determined asynchronously), then this method must return a SpeechSynthesisVoiceList of length zero.
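The following non-normative example queues two utterances and exercises the playback controls described above.

    // Example (non-normative): queue utterances, then pause and resume playback.
    const synth = window.speechSynthesis;

    const greeting = new SpeechSynthesisUtterance('Hello!');
    const followUp = new SpeechSynthesisUtterance('How can I help you today?');
    followUp.rate = 1.2;      // slightly faster than the voice default

    synth.speak(greeting);    // starts immediately if nothing is queued
    synth.speak(followUp);    // queued behind the first utterance

    synth.pause();            // pauses mid-utterance
    synth.resume();           // continues from the point at which it was paused
    // synth.cancel();        // would empty the queue and stop speaking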
4.2.3. SpeechSynthesis Events
-
voiceschangedevent - Fired when the contents of the SpeechSynthesisVoiceList, that the getVoices method will return, have changed. Examples include: server-side synthesis where the list is determined asynchronously, or when client-side voices are installed/uninstalled.
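A non-normative sketch of choosing a voice while accounting for the asynchronous population of the voice list signalled by voiceschanged; the language tag is an example.

    // Example (non-normative): pick a French voice once the voice list is known.
    function speakFrench(text) {
      const utterance = new SpeechSynthesisUtterance(text);
      utterance.lang = 'fr-FR';
      const voices = speechSynthesis.getVoices();
      utterance.voice = voices.find((v) => v.lang === 'fr-FR') ?? null;  // null => UA default
      speechSynthesis.speak(utterance);
    }

    if (speechSynthesis.getVoices().length > 0) {
      speakFrench('Bonjour !');
    } else {
      // The list may be populated asynchronously (e.g., server-side synthesis).
      speechSynthesis.addEventListener('voiceschanged',
          () => speakFrench('Bonjour !'), { once: true });
    }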
4.2.4. SpeechSynthesisUtterance Attributes
-
text attribute, of type DOMString - This attribute specifies the text to be synthesized and spoken for this utterance. This may be either plain text or a complete, well-formed SSML document. [SSML] For speech synthesis engines that do not support SSML, or only support certain tags, the user agent or speech engine must strip away the tags they do not support and speak the text. There may be a maximum length of the text; for example, it may be limited to 32,767 characters.
-
lang attribute, of type DOMString - This attribute specifies the language of the speech synthesis for the utterance, using a valid BCP 47 language tag. [BCP47] If unset it remains unset for getting in script, but will default to use the language of the html document root element and associated hierarchy. This default value is computed and used when the request opens a connection to the synthesis service.
-
voiceattribute, of type SpeechSynthesisVoice , nullable -
This attribute specifies the speech synthesis voice that the web application wishes to use. When a SpeechSynthesisUtterance object is created this attribute must be initialized to null. If, at the time of the speak() method call, this attribute has been set to one of the SpeechSynthesisVoice objects returned by getVoices(), then the user agent must use that voice. If this attribute is unset or null at the time of the speak() method call, then the user agent must use a user agent default voice. The user agent default voice should support the current language (see lang) and can be a local or remote speech service and can incorporate end user choices via interfaces provided by the user agent such as browser configuration parameters.
volumeattribute, of type float - This attribute specifies the speaking volume for the utterance. It ranges between 0 and 1 inclusive, with 0 being the lowest volume and 1 the highest volume, with a default of 1. If SSML is used, this value will be overridden by prosody tags in the markup.
- rate attribute, of type float - This attribute specifies the speaking rate for the utterance. It is relative to the default rate for this voice. 1 is the default rate supported by the speech synthesis engine or specific voice (which should correspond to a normal speaking rate). 2 is twice as fast, and 0.5 is half as fast. Values below 0.1 or above 10 are strictly disallowed, but speech synthesis engines or specific voices may constrain the minimum and maximum rates further; for example, a particular voice may not actually speak faster than 3 times normal even if you specify a value larger than 3. If SSML is used, this value will be overridden by prosody tags in the markup.
- pitch attribute, of type float - This attribute specifies the speaking pitch for the utterance. It ranges between 0 and 2 inclusive, with 0 being the lowest pitch and 2 the highest pitch. 1 corresponds to the default pitch of the speech synthesis engine or specific voice. Speech synthesis engines or specific voices may constrain the minimum and maximum pitch further. If SSML is used, this value will be overridden by prosody tags in the markup.
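The following non-normative sketch shows an utterance configured with the attributes above; the attribute values are arbitrary examples.

var u = new SpeechSynthesisUtterance();
u.text = 'Guten Tag';  // plain text; a complete, well-formed SSML document is also allowed
u.lang = 'de-DE';      // BCP 47 language tag
u.volume = 0.8;        // 0..1, default 1
u.rate = 1.5;          // relative to the voice's default rate
u.pitch = 1.0;         // 0..2, default 1
// voice is left null, so the user agent default voice is used.
speechSynthesis.speak(u);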
4.2.5. SpeechSynthesisUtterance Events
Each of these events must use the SpeechSynthesisEvent interface, except the error event which must use the SpeechSynthesisErrorEvent interface. These events do not bubble and are not cancelable.
- start event - Fired when this utterance has begun to be spoken.
- end event - Fired when this utterance has completed being spoken. If this event fires, the error event must not be fired for this utterance.
- error event - Fired if there was an error that prevented successful speaking of this utterance. If this event fires, the end event must not be fired for this utterance.
- pause event - Fired when and if this utterance is paused mid-utterance.
- resume event - Fired when and if this utterance is resumed after being paused mid-utterance. Adding the utterance to the queue while the global SpeechSynthesis instance is in the paused state, and then calling the resume method, does not cause the resume event to be fired; in this case the utterance’s start event is fired when the utterance starts.
- mark event - Fired when the spoken utterance reaches a named "mark" tag in SSML. [SSML] The user agent must fire this event if the speech synthesis engine provides the event.
- boundary event - Fired when the spoken utterance reaches a word or sentence boundary. The user agent must fire this event if the speech synthesis engine provides the event.
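A non-normative sketch of a web application observing these events on a single utterance:

var u = new SpeechSynthesisUtterance('Hello world');
u.onstart = function(e) { console.log('started speaking'); };
u.onend = function(e) { console.log('finished after ' + e.elapsedTime + ' seconds'); };
u.onerror = function(e) { console.log('error: ' + e.error); };
u.onboundary = function(e) {
  // e.name is "word" or "sentence" for boundary events
  console.log(e.name + ' boundary at character ' + e.charIndex);
};
speechSynthesis.speak(u);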
4.2.6. SpeechSynthesisEvent Attributes
- utterance attribute, of type SpeechSynthesisUtterance, readonly - This attribute contains the SpeechSynthesisUtterance that triggered this event.
- charIndex attribute, of type unsigned long, readonly - This attribute indicates the zero-based character index into the original utterance string that most closely approximates the current speaking position of the speech engine. No guarantee is given as to where charIndex will be with respect to word boundaries (such as at the end of the previous word or the beginning of the next word), only that all text before charIndex has already been spoken, and all text after charIndex has not yet been spoken. The user agent must return this value if the speech synthesis engine supports it, otherwise the user agent must return 0.
- charLength attribute, of type unsigned long, readonly - This attribute indicates the length of the text (word or sentence) that will be spoken corresponding to this event. This attribute is the length, in characters, starting from this event’s charIndex. The user agent must return this value if the speech synthesis engine supports it or the user agent can otherwise determine it, otherwise the user agent must return 0.
- elapsedTime attribute, of type float, readonly - This attribute indicates the time, in seconds, at which this event was triggered, relative to when this utterance began to be spoken. The user agent must return this value if the speech synthesis engine supports it or the user agent can otherwise determine it, otherwise the user agent must return 0.
- name attribute, of type DOMString, readonly - For mark events, this attribute indicates the name of the marker, as defined in SSML as the name attribute of a mark element. [SSML] For boundary events, this attribute indicates the type of boundary that caused the event: "word" or "sentence". For all other events, this value should return "".
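For example (non-normative), a page could use charIndex and charLength from boundary events to show the word currently being spoken, falling back gracefully when the engine reports neither value:

var text = 'The quick brown fox jumps over the lazy dog.';
var u = new SpeechSynthesisUtterance(text);
u.onboundary = function(e) {
  if (e.name === 'word' && e.charLength > 0) {
    var word = text.substr(e.charIndex, e.charLength);
    console.log('Speaking: ' + word);
  }
};
speechSynthesis.speak(u);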
4.2.7. SpeechSynthesisErrorEvent Attributes
The SpeechSynthesisErrorEvent is the interface used for the SpeechSynthesisUtterance error event.
- error attribute, of type SpeechSynthesisErrorCode, readonly - The error attribute is an enumeration indicating what has gone wrong. The values are:
- "canceled" - A cancel method call caused the SpeechSynthesisUtterance to be removed from the queue before it had begun being spoken.
- "interrupted" - A cancel method call caused the SpeechSynthesisUtterance to be interrupted after it had begun being spoken and before it completed.
- "audio-busy" - The operation cannot be completed at this time because the user-agent cannot access the audio output device. (For example, the user may need to correct this by closing another application.)
- "audio-hardware" - The operation cannot be completed at this time because the user-agent cannot identify an audio output device. (For example, the user may need to connect a speaker or configure system settings.)
- "network" - The operation cannot be completed at this time because some required network communication failed.
- "synthesis-unavailable" - The operation cannot be completed at this time because no synthesis engine is available. (For example, the user may need to install or configure a synthesis engine.)
- "synthesis-failed" - The operation failed because the synthesis engine had an error.
- "language-unavailable" - No appropriate voice is available for the language designated in the SpeechSynthesisUtterance lang attribute.
- "voice-unavailable" - The voice designated in the SpeechSynthesisUtterance voice attribute is not available.
- "text-too-long" - The contents of the SpeechSynthesisUtterance text attribute are too long to synthesize.
- "invalid-argument" - The contents of the SpeechSynthesisUtterance rate, pitch or volume attribute are not supported by the synthesizer.
- "not-allowed" - Synthesis was not allowed to start by the user agent or system in the current context.
4.2.8. SpeechSynthesisVoice Attributes
- voiceURI attribute, of type DOMString, readonly - The voiceURI attribute specifies the speech synthesis voice and the location of the speech synthesis service for this voice. Note that the voiceURI is a generic URI and can thus point to local or remote services, either through use of a URN with meaning to the user agent or by specifying a URL that the user agent recognizes as a local service.
- name attribute, of type DOMString, readonly - This attribute is a human-readable name that represents the voice. There is no guarantee that all names returned are unique.
- lang attribute, of type DOMString, readonly - This attribute is a BCP 47 language tag indicating the language of the voice. [BCP47]
- localService attribute, of type boolean, readonly - This attribute is true for voices supplied by a local speech synthesizer, and is false for voices supplied by a remote speech synthesizer service. (This may be useful because remote services may imply additional latency, bandwidth or cost, whereas local voices may imply lower quality; however, there is no guarantee that any of these implications hold.)
- default attribute, of type boolean, readonly - This attribute is true for at most one voice per language. There may be a different default for each language. It is user agent dependent how default voices are determined.
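These attributes allow a web application to choose among the available voices. The following non-normative sketch prefers the default voice for a language and otherwise any local voice; the selection policy is an arbitrary example.

function pickVoice(lang) {
  var voices = speechSynthesis.getVoices();
  var match = voices.filter(function(v) { return v.lang === lang; });
  return match.filter(function(v) { return v.default; })[0] ||
         match.filter(function(v) { return v.localService; })[0] ||
         match[0] ||
         null;  // null leaves the user agent default voice in effect
}

var u = new SpeechSynthesisUtterance('Hello');
u.voice = pickVoice('en-US');
speechSynthesis.speak(u);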
5. Examples
This section is non-normative.
5.1. Speech Recognition Examples
Using speech recognition to fill an input-field and perform a web search.
<script type="text/javascript">
  var recognition = new SpeechRecognition();
  recognition.onresult = function (event) {
    if (event.results.length > 0) {
      q.value = event.results[0][0].transcript;
      q.form.submit();
    }
  }
</script>

<form action="https://www.example.com/search">
  <input type="search" id="q" name="q" size=60>
  <input type="button" value="Click to Speak" onclick="recognition.start()">
</form>
Using speech recognition to fill an options list with alternative speech results.
<script type="text/javascript">
  var recognition = new SpeechRecognition();
  recognition.maxAlternatives = 10;
  recognition.onresult = function (event) {
    if (event.results.length > 0) {
      var result = event.results[0];
      for (var i = 0; i < result.length; ++i) {
        var text = result[i].transcript;
        select.options[i] = new Option(text, text);
      }
    }
  }

  function start() {
    select.options.length = 0;
    recognition.start();
  }
</script>

<select id="select"></select>
<button onclick="start()">Click to Speak</button>
Using continuous speech recognition to fill a textarea.
<textarea id="textarea" rows=10 cols=80></textarea>
<button id="button" onclick="toggleStartStop()"></button>

<script type="text/javascript">
  var recognizing;
  var recognition = new SpeechRecognition();
  recognition.continuous = true;
  reset();
  recognition.onend = reset;

  recognition.onresult = function (event) {
    for (var i = event.resultIndex; i < event.results.length; ++i) {
      if (event.results[i].isFinal) {
        textarea.value += event.results[i][0].transcript;
      }
    }
  }

  function reset() {
    recognizing = false;
    button.innerHTML = "Click to Speak";
  }

  function toggleStartStop() {
    if (recognizing) {
      recognition.stop();
      reset();
    } else {
      recognition.start();
      recognizing = true;
      button.innerHTML = "Click to Stop";
    }
  }
</script>
Using continuous speech recognition, showing final results in black and interim results in grey.
<button id="button" onclick="toggleStartStop()"></button>
<div style="border:dotted;padding:10px">
  <span id="final_span"></span>
  <span id="interim_span" style="color:grey"></span>
</div>

<script type="text/javascript">
  var recognizing;
  var recognition = new SpeechRecognition();
  recognition.continuous = true;
  recognition.interimResults = true;
  reset();
  recognition.onend = reset;

  recognition.onresult = function (event) {
    var final = "";
    var interim = "";
    for (var i = 0; i < event.results.length; ++i) {
      if (event.results[i].isFinal) {
        final += event.results[i][0].transcript;
      } else {
        interim += event.results[i][0].transcript;
      }
    }
    final_span.innerHTML = final;
    interim_span.innerHTML = interim;
  }

  function reset() {
    recognizing = false;
    button.innerHTML = "Click to Speak";
  }

  function toggleStartStop() {
    if (recognizing) {
      recognition.stop();
      reset();
    } else {
      recognition.start();
      recognizing = true;
      button.innerHTML = "Click to Stop";
      final_span.innerHTML = "";
      interim_span.innerHTML = "";
    }
  }
</script>
5.2. Speech Synthesis Examples
Spoken text.
<script type="text/javascript">
  speechSynthesis.speak(new SpeechSynthesisUtterance('Hello World'));
</script>
Spoken text with attributes and events.
<script type="text/javascript">
  var u = new SpeechSynthesisUtterance();
  u.text = 'Hello World';
  u.lang = 'en-US';
  u.rate = 1.2;
  u.onend = function (event) {
    alert('Finished in ' + event.elapsedTime + ' seconds.');
  }
  speechSynthesis.speak(u);
</script>
Acknowledgments
Adam Sobieski (Phoster), Björn Bringert (Google), Charles Pritchard, Dominic Mazzoni (Google), Gerardo Capiel (Benetech), Jerry Carter, Kagami Sascha Rosylight, Marcos Cáceres (Mozilla), Nagesh Kharidi (Openstream), Olli Pettay (Mozilla), Peter Beverloo (Google), Raj Tumuluri (Openstream), Satish Sampath (Google)
Also, the members of the HTML Speech Incubator Group, and the corresponding Final Report , which created the basis for this specification.