Copyright 2023, AOM

Licensing information is available at http://aomedia.org/license/
The MATERIALS ARE PROVIDED “AS IS.” The Alliance for Open Media, its members, and its contributors expressly disclaim any warranties (express, implied, or otherwise), including implied warranties of merchantability, non-infringement, fitness for a particular purpose, or title, related to the materials. The entire risk as to implementing or otherwise using the materials is assumed by the implementer and user.

IN NO EVENT WILL THE ALLIANCE FOR OPEN MEDIA, ITS MEMBERS, OR CONTRIBUTORS BE LIABLE TO ANY OTHER PARTY FOR LOST PROFITS OR ANY FORM OF INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES OF ANY CHARACTER FROM ANY CAUSES OF ACTION OF ANY KIND WITH RESPECT TO THIS DELIVERABLE OR ITS GOVERNING AGREEMENT, WHETHER BASED ON BREACH OF CONTRACT, TORT (INCLUDING NEGLIGENCE), OR OTHERWISE, AND WHETHER OR NOT THE OTHER MEMBER HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
This document specifies how to carry AV1 video elementary streams ([AV1]) in the MPEG-2 Transport Stream format ([MPEG-2 TS]). It defines the carriage of AV1 in a single PID, assuming that the buffer model information is taken from the first operating point. This carriage may not be optimal for layered streams or for streams with multiple operating points; future versions of this document may add support for these cases.
1.1 Modal verbs terminology

In the present document "shall", "shall not", "should", "should not", "may", "need not", "will", "will not", "can" and "cannot" are to be interpreted as described in clause 3.2 of the ETSI Drafting Rules (Verbal forms for the expression of provisions).

1.2 Definition of mnemonics and syntax function

In the present document the mnemonics, the syntax functions, and the syntax descriptors are to be interpreted as described in [MPEG-2 TS]. The uimsbf and bslbf mnemonics are defined in Section 2.2.6 of [MPEG-2 TS]. The nextbits() function is interpreted as in [MPEG-2 TS].
2.1 Registration descriptor

The presence of a Registration Descriptor, as defined in [MPEG-2 TS Registration Descriptor], is mandatory, with the format_identifier field set to 'AV01' (A-V-0-1). The Registration Descriptor shall be the first descriptor in the PMT loop and shall be included before the AV1 video descriptor.

2.1.1 Syntax

| Syntax | No. of bits | Mnemonic |
|---|---|---|
| registration_descriptor() { | | |
| descriptor_tag | 8 | uimsbf |
| descriptor_length | 8 | uimsbf |
| format_identifier | 32 | uimsbf |
| } | | |

2.1.2 Semantics

descriptor_tag - This value shall be set to 0x05.

descriptor_length - This value shall be set to 4.

format_identifier - This value shall be set to 'AV01' (A-V-0-1).
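As a minimal sketch (Python, illustrative only), the registration descriptor serializes to the 6 bytes `05 04 'A' 'V' '0' '1'`, and a demultiplexer can check that it is the first descriptor of the loop. The helper names `build_registration_descriptor`, `iter_descriptors` and `check_av1_descriptor_order` are hypothetical, and the sketch assumes that "the PMT loop" means the ES_info descriptor loop of the AV1 elementary stream, passed in as raw bytes.

```python
import struct
from typing import Iterator, Tuple

REGISTRATION_DESCRIPTOR_TAG = 0x05
AV1_FORMAT_IDENTIFIER = b"AV01"


def build_registration_descriptor() -> bytes:
    """Serialize registration_descriptor() with format_identifier 'AV01'."""
    # descriptor_tag (8 bits), descriptor_length (8 bits), format_identifier (32 bits)
    return struct.pack(">BB4s", REGISTRATION_DESCRIPTOR_TAG, 4, AV1_FORMAT_IDENTIFIER)


def iter_descriptors(loop: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (descriptor_tag, payload) pairs from a raw descriptor loop."""
    pos = 0
    while pos < len(loop):
        if pos + 2 > len(loop):
            raise ValueError("truncated descriptor header")
        tag, length = loop[pos], loop[pos + 1]
        end = pos + 2 + length
        if end > len(loop):
            raise ValueError("truncated descriptor payload")
        yield tag, loop[pos + 2:end]
        pos = end


def check_av1_descriptor_order(loop: bytes) -> bool:
    """True if the loop starts with an 'AV01' registration descriptor.

    When the registration descriptor is first, any AV1 video descriptor
    (tag 0x80) in the same loop necessarily comes after it.
    """
    first = next(iter_descriptors(loop), None)
    return first == (REGISTRATION_DESCRIPTOR_TAG, AV1_FORMAT_IDENTIFIER)
```

For example, a loop holding the registration descriptor followed by a 6-byte AV1 video descriptor passes the check, while the reverse order does not.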
2.2 AV1 video descriptor

The AV1_video_descriptor provides basic information for identifying the coding parameters of an AV1 video stream, such as its profile and level. It uses the same data structure as the AV1CodecConfigurationRecord in ISOBMFF to aid conversion between the two formats, except that two of the reserved bits are used for HDR/WCG identification. The syntax and semantics of this descriptor are given in the table below and in the subsequent text.

2.2.1 Syntax

| Syntax | No. of bits | Mnemonic |
|---|---|---|
| AV1_video_descriptor() { | | |
| descriptor_tag | 8 | uimsbf |
| descriptor_length | 8 | uimsbf |
| marker | 1 | bslbf |
| version | 7 | uimsbf |
| seq_profile | 3 | uimsbf |
| seq_level_idx_0 | 5 | uimsbf |
| seq_tier_0 | 1 | bslbf |
| high_bitdepth | 1 | bslbf |
| twelve_bit | 1 | bslbf |
| monochrome | 1 | bslbf |
| chroma_subsampling_x | 1 | bslbf |
| chroma_subsampling_y | 1 | bslbf |
| chroma_sample_position | 2 | uimsbf |
| hdr_wcg_idc | 2 | uimsbf |
| reserved_zeros | 1 | bslbf |
| initial_presentation_delay_present | 1 | bslbf |
| if (initial_presentation_delay_present) { | | |
| initial_presentation_delay_minus_one | 4 | uimsbf |
| } else { | | |
| reserved_zeros | 4 | uimsbf |
| } | | |
| } | | |

2.2.2 Semantics

descriptor_tag - This value shall be set to 0x80.

descriptor_length - This value shall be set to 4.

marker - This value shall be set to 1.

version - This field indicates the version of the AV1_video_descriptor. This value shall be set to 1.

seq_profile, seq_level_idx_0 and high_bitdepth - These fields shall be coded according to the semantics defined in [AV1]. If these fields are not coded in the Sequence Header OBU of the AV1 video stream, the inferred values are coded in the descriptor.

seq_tier_0, twelve_bit, monochrome, chroma_subsampling_x, chroma_subsampling_y, chroma_sample_position - These fields shall be coded according to the same semantics when they are present. If they are not present, they will be coded using the value inferred by the semantics.

hdr_wcg_idc - The value of this syntax element indicates the presence or absence of high dynamic range (HDR) and/or wide color gamut (WCG) video components in the associated PID, according to the table below. Video is defined to have high dynamic range (HDR) if the EOTF of the video stream is higher than the reference EOTF defined in [BT-1886]. Video is defined to have a wide color gamut (WCG) if it is coded using color primaries with a color gamut not contained within [BT-709].

| hdr_wcg_idc | Description |
|---|---|
| 0 | SDR, i.e., video is based on the reference EOTF defined in [BT-1886] with a color gamut that is contained within [BT-709] with a [BT-709] container |
| 1 | WCG only, i.e., video color gamut in a [BT-2020] container that exceeds [BT-709] |
| 2 | Both HDR and WCG are to be indicated in the stream |
| 3 | No indication made regarding HDR/WCG or SDR characteristics of the stream |

reserved_zeros - These bits will be set to zero.

initial_presentation_delay_present - Indicates whether the initial_presentation_delay_minus_one field is present.

initial_presentation_delay_minus_one - Ignored for [MPEG-2 TS] use; included only to aid conversion to/from ISOBMFF.
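The 4-byte descriptor payload falls on whole byte boundaries, so it can be built and parsed with plain bit shifts. The following Python sketch is illustrative only: the class name `AV1VideoDescriptor`, its default values (4:2:0 chroma subsampling, hdr_wcg_idc = 3) and the use of `None` for an absent initial_presentation_delay_minus_one are assumptions, not part of this specification.

```python
from dataclasses import dataclass
from typing import Optional

AV1_VIDEO_DESCRIPTOR_TAG = 0x80
MARKER_AND_VERSION = 0x81  # marker = 1 (1 bit), version = 1 (7 bits)


@dataclass
class AV1VideoDescriptor:
    seq_profile: int
    seq_level_idx_0: int
    seq_tier_0: int = 0
    high_bitdepth: int = 0
    twelve_bit: int = 0
    monochrome: int = 0
    chroma_subsampling_x: int = 1  # assumed default: 4:2:0
    chroma_subsampling_y: int = 1
    chroma_sample_position: int = 0
    hdr_wcg_idc: int = 3  # assumed default: no HDR/WCG indication
    initial_presentation_delay_minus_one: Optional[int] = None  # None: not present

    def pack(self) -> bytes:
        b1 = ((self.seq_profile & 0x07) << 5) | (self.seq_level_idx_0 & 0x1F)
        b2 = (((self.seq_tier_0 & 1) << 7) | ((self.high_bitdepth & 1) << 6)
              | ((self.twelve_bit & 1) << 5) | ((self.monochrome & 1) << 4)
              | ((self.chroma_subsampling_x & 1) << 3)
              | ((self.chroma_subsampling_y & 1) << 2)
              | (self.chroma_sample_position & 0x03))
        # hdr_wcg_idc (2 bits); the 1-bit reserved_zeros stays 0
        b3 = (self.hdr_wcg_idc & 0x03) << 6
        if self.initial_presentation_delay_minus_one is not None:
            # initial_presentation_delay_present = 1, then the 4-bit delay
            b3 |= (1 << 4) | (self.initial_presentation_delay_minus_one & 0x0F)
        # otherwise the flag and the 4-bit reserved_zeros stay 0
        return bytes([AV1_VIDEO_DESCRIPTOR_TAG, 4, MARKER_AND_VERSION, b1, b2, b3])

    @classmethod
    def parse(cls, data: bytes) -> "AV1VideoDescriptor":
        if len(data) < 6 or data[0] != AV1_VIDEO_DESCRIPTOR_TAG or data[1] != 4:
            raise ValueError("not an AV1_video_descriptor")
        b0, b1, b2, b3 = data[2:6]
        if b0 != MARKER_AND_VERSION:
            raise ValueError("unexpected marker/version")
        delay_present = (b3 >> 4) & 1
        return cls(
            seq_profile=b1 >> 5,
            seq_level_idx_0=b1 & 0x1F,
            seq_tier_0=(b2 >> 7) & 1,
            high_bitdepth=(b2 >> 6) & 1,
            twelve_bit=(b2 >> 5) & 1,
            monochrome=(b2 >> 4) & 1,
            chroma_subsampling_x=(b2 >> 3) & 1,
            chroma_subsampling_y=(b2 >> 2) & 1,
            chroma_sample_position=b2 & 0x03,
            hdr_wcg_idc=b3 >> 6,
            initial_presentation_delay_minus_one=(b3 & 0x0F) if delay_present else None,
        )
```

In practice the field values would be copied from the Sequence Header OBU of the stream; this sketch does not parse OBUs.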
For AV1 video streams, the following constraints apply:

- The stream_type shall be set to 0x06 (MPEG-2 PES packets containing private data).

In addition, a start code insertion and emulation prevention process shall be performed on the AV1 bitstream prior to its PES encapsulation. This process is described in § 3.2 Start-code based format.
3.2 Start-code based format

Prior to carriage in PES packets, each AV1 open_bitstream_unit() is encapsulated into a ts_open_bitstream_unit(). This is required to provide direct access to each OBU through a start code inserted before it.

The following syntax describes how to retrieve the open_bitstream_unit() from the ts_open_bitstream_unit() (tsOBU).
| Syntax | No. of bits | Mnemonic |
|---|---|---|
| ts_open_bitstream_unit(NumBytesInTsObu) { | | |
| obu_start_code /* equal to 0x000001 */ | 24 | uimsbf |
| NumBytesInObu = 0 | | |
| for( i = 3; i < NumBytesInTsObu; i++ ) { | | |
| if( i + 2 < NumBytesInTsObu && nextbits(24) == 0x000003 ) { | | |
| open_bitstream_unit[NumBytesInObu++] | 8 | uimsbf |
| open_bitstream_unit[NumBytesInObu++] | 8 | uimsbf |
| i += 2 | | |
| emulation_prevention_three_byte /* equal to 0x03 */ | 8 | uimsbf |
| } else | | |
| open_bitstream_unit[NumBytesInObu++] | 8 | uimsbf |
| } | | |
| } | | |
obu_start_code - This value shall be set to 0x000001.

open_bitstream_unit[i] - The i-th byte of the AV1 open bitstream unit (as defined in Section 5.3 of [AV1]).
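A minimal Python sketch of the demultiplexer side, assuming the input is a byte string holding complete tsOBUs (for example, one PES payload). The hypothetical `split_ts_obus` locates each obu_start_code, and the hypothetical `ts_open_bitstream_unit_to_obu` follows the syntax table: it starts at the first byte after the 3-byte obu_start_code and drops each emulation_prevention_three_byte that follows two zero bytes.

```python
from typing import List

START_CODE = b"\x00\x00\x01"
EPB_PATTERN = b"\x00\x00\x03"  # two zero bytes + emulation_prevention_three_byte


def split_ts_obus(stream: bytes) -> List[bytes]:
    """Split a start-code based byte string into tsOBUs (start codes kept).

    Bytes before the first start code are ignored.
    """
    starts = []
    pos = stream.find(START_CODE)
    while pos != -1:
        starts.append(pos)
        pos = stream.find(START_CODE, pos + len(START_CODE))
    ends = starts[1:] + [len(stream)]
    return [stream[s:e] for s, e in zip(starts, ends)]


def ts_open_bitstream_unit_to_obu(ts_obu: bytes) -> bytes:
    """Recover open_bitstream_unit() from one tsOBU, following the syntax table."""
    if ts_obu[:3] != START_CODE:
        raise ValueError("tsOBU must begin with obu_start_code 0x000001")
    num_bytes_in_ts_obu = len(ts_obu)
    obu = bytearray()
    i = 3  # first byte after the 24-bit obu_start_code
    while i < num_bytes_in_ts_obu:
        if i + 2 < num_bytes_in_ts_obu and ts_obu[i:i + 3] == EPB_PATTERN:
            obu += ts_obu[i:i + 2]  # keep the two zero bytes
            i += 3                   # skip emulation_prevention_three_byte
        else:
            obu.append(ts_obu[i])
            i += 1
    return bytes(obu)
```

Because the payload constraints forbid 0x000001 inside a tsOBU, the first 0x000001 found after a start code always marks the next start code, even when the previous payload ends in zero bytes.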
It is the responsibility of the TS muxer to prevent start code emulation by escaping all forbidden three-byte sequences with the emulation_prevention_three_byte (always equal to 0x03). The forbidden sequences are defined below.
Within the ts_open_bitstream_unit() payload, the following three-byte sequences shall not occur at any byte-aligned position:

- 0x000000
- 0x000001
- 0x000002

Within the ts_open_bitstream_unit() payload, any four-byte sequence that starts with 0x000003, other than the following sequences, shall not occur at any byte-aligned position:

- 0x00000300
- 0x00000301
- 0x00000302
- 0x00000303
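The escaping performed by the TS muxer can be sketched as follows (Python, illustrative; the function names are hypothetical). Whenever two consecutive zero bytes are followed by a byte in the range 0x00 to 0x03, a 0x03 byte is inserted. This removes every forbidden three-byte sequence, and every resulting 0x000003 is followed only by a byte from 0x00 to 0x03. `violates_constraints` checks a payload against the two rules above.

```python
FORBIDDEN_TRIPLES = (b"\x00\x00\x00", b"\x00\x00\x01", b"\x00\x00\x02")


def escape_obu(obu: bytes) -> bytes:
    """Insert emulation_prevention_three_byte (0x03) where needed."""
    out = bytearray()
    zeros = 0  # consecutive 0x00 bytes at the end of `out`
    for b in obu:
        if zeros >= 2 and b <= 0x03:
            out.append(0x03)  # emulation_prevention_three_byte
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0x00 else 0
    return bytes(out)


def make_ts_open_bitstream_unit(obu: bytes) -> bytes:
    """Prefix the escaped OBU with the 24-bit obu_start_code 0x000001."""
    return b"\x00\x00\x01" + escape_obu(obu)


def violates_constraints(payload: bytes) -> bool:
    """Check a tsOBU payload against the forbidden sequence rules."""
    for i in range(len(payload) - 2):
        triple = payload[i:i + 3]
        if triple in FORBIDDEN_TRIPLES:
            return True
        if triple == b"\x00\x00\x03" and i + 3 < len(payload) and payload[i + 3] > 0x03:
            return True
    return False
```

Note that an original 0x000003 in the OBU is also escaped (to 0x00000303), so the demultiplexer never mistakes it for an emulation prevention byte.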
An AV1 access unit consists of all OBUs, including headers, between the end of the last OBU associated with the previous frame and the end of the last OBU associated with the current frame. With this definition, an access unit sometimes maps to a Decodable Frame Group (DFG) as defined in Annex E of [AV1], sometimes to a Temporal Unit (TU) as defined in [AV1], and sometimes to both. This is illustrated in the figure below for a group of pictures with frames predicted as follows:
AV1 video encapsulated as defined in § 3.2 Start-code based format is carried in PES packets as PES_packet_data_bytes, using the stream_id 0xBD (private_stream_1). A PES packet shall encapsulate one, and only one, AV1 access unit, as defined above.

All PES packets shall have data_alignment_indicator set to 1. Usage of the data_stream_alignment_descriptor is not specified, and the only allowed alignment_type is 1 (access unit level).
The highest level that may occur in an AV1 video stream, as well as a profile and tier that the entire stream conforms to, shall be signalled using the AV1 video descriptor (AV1_video_descriptor).
If an AV1 video descriptor is associated with an AV1 video stream, then this descriptor shall be conveyed in the descriptor loop for the respective elementary stream entry in the program map table. This specification does not specify the presentation of AV1 streams in the context of a program stream.
For an AV1 video stream multiplexed into [MPEG-2 TS], the decoder_model_info need not be present. If decoder_model_info is present, the STD model shall match the decoder model defined in Annex E of [AV1].
For synchronization and STD management, PTSs and, when appropriate, DTSs are encoded in the header of the PES packet that carries the AV1 video stream data, with PTS_DTS_flags set to '10' (PTS only) or '11' (PTS and DTS).
For PTS and DTS encoding, the constraints and semantics defined for the PES packet header in [MPEG-2 TS] apply, including the associated constraints on timestamp intervals.
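The PES packetization rules (stream_id 0xBD, one access unit per PES packet, data_alignment_indicator set to 1, a PTS in every packet) can be sketched in Python as below. The PTS_DTS_flags values '10' (PTS only) and '11' (PTS and DTS) and the 5-byte timestamp layout with marker bits follow the PES header syntax of [MPEG-2 TS]. The function names, omitting the DTS when it equals the PTS, and writing PES_packet_length as 0 when the packet exceeds 65535 bytes are assumptions of this sketch; check [MPEG-2 TS] for when a zero PES_packet_length is permitted.

```python
from typing import Optional

PES_START_CODE_PREFIX = b"\x00\x00\x01"
PRIVATE_STREAM_1 = 0xBD


def encode_timestamp(prefix: int, ts: int) -> bytes:
    """Pack a 33-bit PTS or DTS into the 5-byte PES header field with marker bits."""
    ts &= (1 << 33) - 1
    return bytes([
        (prefix << 4) | (((ts >> 30) & 0x07) << 1) | 1,
        (ts >> 22) & 0xFF,
        (((ts >> 15) & 0x7F) << 1) | 1,
        (ts >> 7) & 0xFF,
        ((ts & 0x7F) << 1) | 1,
    ])


def build_av1_pes_packet(access_unit: bytes, pts: int, dts: Optional[int] = None) -> bytes:
    """Wrap exactly one start-code formatted AV1 access unit in one PES packet."""
    if dts is None or dts == pts:
        pts_dts_flags = 0b10  # PTS only
        header_data = encode_timestamp(0b0010, pts)
    else:
        pts_dts_flags = 0b11  # PTS and DTS
        header_data = encode_timestamp(0b0011, pts) + encode_timestamp(0b0001, dts)
    # '10', PES_scrambling_control = 00, PES_priority = 0,
    # data_alignment_indicator = 1, copyright = 0, original_or_copy = 0
    flags1 = 0b1000_0100
    # PTS_DTS_flags, then ESCR, ES_rate, DSM_trick_mode, additional_copy_info,
    # PES_CRC and PES_extension flags, all set to 0
    flags2 = pts_dts_flags << 6
    body = bytes([flags1, flags2, len(header_data)]) + header_data + access_unit
    # Assumption: write 0 (unbounded) when the length does not fit in 16 bits.
    pes_packet_length = len(body) if len(body) <= 0xFFFF else 0
    return (PES_START_CODE_PREFIX + bytes([PRIVATE_STREAM_1])
            + pes_packet_length.to_bytes(2, "big") + body)
```

Splitting the resulting PES packet into 188-byte transport stream packets is not shown.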
There are cases in AV1 bitstreams where information about a frame is sent multiple times, for example first to be decoded and subsequently to be displayed. For a frame that is decoded but not displayed, it is desirable to assign a valid DTS without the need for a PTS. However, the [MPEG-2 TS] specification prevents a DTS from being transmitted without a PTS. Hence, a PTS is always assigned to AV1 access units, and its value is not relevant for frames that are decoded but not displayed.
To achieve consistency between the STD model and the buffer model defined in Annex E of [AV1], the following PTS and DTS assignment rules shall be applied:
| show_existing_frame | show_frame | showable_frame | PTS | DTS |
|---|---|---|---|---|
| 0 | 0 | 0 | ScheduledRemovalTiming[dfg] | ScheduledRemovalTiming[dfg] |
| 0 | 0 | 1 | ScheduledRemovalTiming[dfg] | ScheduledRemovalTiming[dfg] |
| 0 | 1 | n/a | PresentationTime[frame] | ScheduledRemovalTiming[dfg] |
| 1 | n/a | n/a | PresentationTime[frame] | ScheduledRemovalTiming[dfg] |
Note: ScheduledRemovalTiming[] and PresentationTime[] are defined in Annex E of [AV1].
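The assignment table reduces to a simple rule: the DTS is always ScheduledRemovalTiming[dfg], and the PTS is PresentationTime[frame] for displayed frames (show_existing_frame = 1 or show_frame = 1) and ScheduledRemovalTiming[dfg] otherwise; showable_frame does not change the result. A Python sketch, assuming both times are given in seconds on a time base already aligned with the program clock, and converting them to 33-bit, 90 kHz PES timestamps (the function names are hypothetical):

```python
from typing import Optional, Tuple

PTS_CLOCK_HZ = 90_000
TIMESTAMP_MODULO = 1 << 33  # PTS and DTS are 33-bit fields


def to_90khz(seconds: float) -> int:
    """Convert seconds to a 90 kHz timestamp, wrapped to 33 bits."""
    return round(seconds * PTS_CLOCK_HZ) % TIMESTAMP_MODULO


def assign_pts_dts(show_existing_frame: int, show_frame: int,
                   scheduled_removal_timing: float,
                   presentation_time: Optional[float] = None) -> Tuple[int, int]:
    """Return (PTS, DTS) in 90 kHz ticks for one access unit.

    showable_frame does not affect the result, so it is not a parameter.
    """
    dts = to_90khz(scheduled_removal_timing)  # ScheduledRemovalTiming[dfg]
    if show_existing_frame == 1 or show_frame == 1:
        if presentation_time is None:
            raise ValueError("a displayed frame needs PresentationTime[frame]")
        pts = to_90khz(presentation_time)  # PresentationTime[frame]
    else:
        pts = dts  # decoded but not displayed: PTS = ScheduledRemovalTiming[dfg]
    return pts, dts
```

For a frame that is decoded but not displayed, PTS and DTS are equal, so a muxer may signal only the PTS for that access unit.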
Carriage of an AV1 video stream over [MPEG-2 TS] does not impact the size of the Buffer Pool. For decoding of an AV1 video stream in the STD, the size of the Buffer Pool is as defined in [AV1]. The Buffer Pool shall be managed as specified in Annex E of [AV1].
A decoded AV1 access unit enters the Buffer Pool instantaneously upon decoding of the AV1 access unit, hence at the ScheduledRemovalTiming of the AV1 access unit. A decoded AV1 access unit is presented at its PresentationTime.
If the AV1 video stream provides insufficient information to determine the ScheduledRemovalTiming and the PresentationTime of AV1 access units, then these time instants shall be determined in the STD model from PTS and DTS timestamps as follows:
When there is an AV1 video stream in an [MPEG-2 TS] program, the T-STD model described in the section "Transport stream system target decoder" of [MPEG-2 TS] is extended as specified below.
The following additional notations are used to describe the T-STD extensions and are illustrated in the figure above.
| Notation | Definition |
|---|---|
| t(i) | indicates the time, in seconds, at which the i-th byte of the transport stream enters the system target decoder |
| TBn | is the transport buffer for elementary stream n |
| TBS | is the size of the transport buffer TBn, measured in bytes |
| MBn | is the multiplexing buffer for elementary stream n |
| MBSn | is the size of the multiplexing buffer MBn, measured in bytes |
| EBn | is the elementary stream buffer for the AV1 video stream |
| EBSn | is the size of the elementary stream buffer EBn, measured in bytes |
| j | is an index to the AV1 access units of the AV1 video stream |
| An(j) | is the j-th access unit of the AV1 video stream |
| tdn(j) | is the decoding time of An(j), measured in seconds, in the system target decoder |
| Rxn | is the transfer rate from the transport buffer TBn to the multiplexing buffer MBn, as specified below |
| Rbxn | is the transfer rate from the multiplexing buffer MBn to the elementary stream buffer EBn, as specified below |
The following apply:
If there is PES packet payload data in MBn and buffer EBn is not full, the PES packet payload is transferred from MBn to EBn at a rate equal to Rbxn. If EBn is full, data are not removed from MBn. When a byte of data is transferred from MBn to EBn, all PES packet header bytes that are in MBn and precede that byte are instantaneously removed and discarded. When there is no PES packet payload data present in MBn, no data is removed from MBn. All data that enters MBn leaves it. All PES packet payload data bytes enter EBn instantaneously upon leaving MBn.
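A discrete-time Python sketch of the MBn to EBn transfer rule. It is illustrative only: MBn is modeled as a queue of byte kinds (header or payload), Rbxn and EBSn are parameters supplied by the caller because their values are not given here, the continuous transfer is approximated in steps of dt seconds, and removal of data from EBn by the decoder is not modeled. The class and field names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque

HEADER, PAYLOAD = "header", "payload"


@dataclass
class MbToEbModel:
    ebs: int    # EBSn: elementary stream buffer size in bytes (caller-supplied)
    rbx: float  # Rbxn: MBn -> EBn transfer rate in bytes per second (caller-supplied)
    mb: Deque[str] = field(default_factory=deque)  # kinds of the bytes queued in MBn
    eb_fullness: int = 0  # bytes currently held in EBn

    def step(self, dt: float) -> int:
        """Advance the model by dt seconds; return the payload bytes moved to EBn."""
        budget = int(self.rbx * dt)
        moved = 0
        while moved < budget and self.eb_fullness < self.ebs:
            if PAYLOAD not in self.mb:
                break  # no payload data in MBn: nothing is removed
            while self.mb[0] == HEADER:
                self.mb.popleft()  # header bytes preceding the payload byte are discarded
            self.mb.popleft()      # the payload byte leaves MBn...
            self.eb_fullness += 1  # ...and enters EBn instantaneously
            moved += 1
        return moved
```

When EBn is full, the loop stops and the remaining bytes stay in MBn, matching the rule that data are not removed from MBn while EBn is full.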
The STD delay of any AV1 video data through the system target decoder buffers TBn, MBn, and EBn shall be constrained by tdn(j) - t(i) ≤ 10 seconds for all j, and for all bytes i in access unit An(j).
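Because tdn(j) - t(i) is largest for the earliest-arriving byte of An(j), the delay constraint can be checked per access unit against the minimum arrival time. A minimal Python sketch, assuming the caller already knows tdn(j) and the arrival times t(i) of every byte of each access unit (the function name is hypothetical):

```python
from typing import Iterable, Sequence, Tuple

MAX_STD_DELAY_S = 10.0


def std_delay_ok(access_units: Iterable[Tuple[float, Sequence[float]]]) -> bool:
    """access_units: (tdn(j), [t(i) for every byte i of An(j)]) pairs, in seconds."""
    for decode_time, arrival_times in access_units:
        # The earliest byte has the longest delay, so checking it covers all bytes.
        if arrival_times and decode_time - min(arrival_times) > MAX_STD_DELAY_S:
            return False
    return True
```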
Transport streams shall be constructed so that the following conditions for buffer management are satisfied: