Copyright 2023, AOM
Licensing information is available at http://aomedia.org/license/
The MATERIALS ARE PROVIDED “AS IS.” The Alliance for Open Media, its members, and its contributors expressly disclaim any warranties (express, implied, or otherwise), including implied warranties of merchantability, non-infringement, fitness for a particular purpose, or title, related to the materials. The entire risk as to implementing or otherwise using the materials is assumed by the implementer and user. IN NO EVENT WILL THE ALLIANCE FOR OPEN MEDIA, ITS MEMBERS, OR CONTRIBUTORS BE LIABLE TO ANY OTHER PARTY FOR LOST PROFITS OR ANY FORM OF INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES OF ANY CHARACTER FROM ANY CAUSES OF ACTION OF ANY KIND WITH RESPECT TO THIS DELIVERABLE OR ITS GOVERNING AGREEMENT, WHETHER BASED ON BREACH OF CONTRACT, TORT (INCLUDING NEGLIGENCE), OR OTHERWISE, AND WHETHER OR NOT THE OTHER MEMBER HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
This document specifies how to carry AV1 video elementary streams ([ AV1 ]) in the MPEG-2 Transport Stream format ([ MPEG-2 TS ]). It does not specify the presentation of AV1 streams in the context of a program stream.
This document defines the carriage of AV1 in a single PID, assuming buffer model info from the first operating point. It may not be optimal for layered streams or streams with multiple operating points. Future versions may incorporate this capability.
1.1 Modal verbs terminology
In the present document "shall", "shall not", "should", "should not", "may", "need not", "will", "will not", "can" and "cannot" are to be interpreted as described in clause 3.2 of the ETSI Drafting Rules (Verbal forms for the expression of provisions).
1.2 Definition of mnemonics and syntax function
In the present document the mnemonics, the syntax functions, and the syntax descriptors are to be interpreted as described in [ MPEG-2 TS ]. The uimsbf and bslbf mnemonics are defined in Section 2.2.6 of [ MPEG-2 TS ]. The nextbits() function is interpreted as in [ MPEG-2 TS ].
The presence of a Registration Descriptor, as defined in [ MPEG-2 TS ], is mandatory with the format_identifier field set to 'AV01' (A-V-0-1). The Registration Descriptor shall be the first descriptor conveyed in the PMT descriptor loop for the respective elementary stream entry in the program map table, and shall be included before the AV1 video descriptor.
2.1.1 Syntax
| Syntax | No. Of bits | Mnemonic |
|---|---|---|
| registration_descriptor() { | | |
| descriptor_tag | 8 | uimsbf |
| descriptor_length | 8 | uimsbf |
| format_identifier | 32 | uimsbf |
| } | | |
2.1.2 Semantics
descriptor_tag - This value shall be set to 0x05.
descriptor_length - This value shall be set to 4.
format_identifier - This value shall be set to 'AV01' (A-V-0-1).
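As an illustration of the registration descriptor fields described in this section, the following minimal sketch serializes the six descriptor bytes. The function name `build_registration_descriptor` is hypothetical and not part of any specification; only the field values (tag 0x05, length 4, identifier 'AV01') come from the text.

```python
import struct


def build_registration_descriptor(format_identifier: bytes = b"AV01") -> bytes:
    """Serialize a registration_descriptor() carrying a 4-byte identifier.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if len(format_identifier) != 4:
        raise ValueError("format_identifier must be exactly 4 bytes (32 bits)")
    descriptor_tag = 0x05     # 8 bits, uimsbf
    descriptor_length = 4     # 8 bits, uimsbf: bytes following this field
    # format_identifier is 32 bits, most significant byte first.
    return struct.pack(">BB4s", descriptor_tag, descriptor_length, format_identifier)


descriptor = build_registration_descriptor()
print(descriptor.hex())  # 050441563031
```

A multiplexer would place these bytes first in the elementary stream descriptor loop of the PMT, ahead of the AV1 video descriptor.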
The AV1 video descriptor provides basic information for identifying coding parameters, such as profile and level parameters, of an AV1 video stream. The same data structure as AV1CodecConfigurationRecord in ISOBMFF is used to aid conversion between the two formats, except that two of the reserved bits are used for HDR/WCG identification. The syntax and semantics for this descriptor appear in the table below and in the subsequent text.
If an AV1 video descriptor is associated with an AV1 video stream, then this descriptor shall be conveyed in the descriptor loop for the respective elementary stream entry in the program map table.
2.2.1 Syntax
| Syntax | No. Of bits | Mnemonic |
|---|---|---|
| AV1_video_descriptor() { | | |
| descriptor_tag | 8 | uimsbf |
| descriptor_length | 8 | uimsbf |
| marker | 1 | bslbf |
| version | 7 | uimsbf |
| seq_profile | 3 | uimsbf |
| seq_level_idx_0 | 5 | uimsbf |
| seq_tier_0 | 1 | bslbf |
| high_bitdepth | 1 | bslbf |
| twelve_bit | 1 | bslbf |
| monochrome | 1 | bslbf |
| chroma_subsampling_x | 1 | bslbf |
| chroma_subsampling_y | 1 | bslbf |
| chroma_sample_position | 2 | uimsbf |
| hdr_wcg_idc | 2 | uimsbf |
| reserved_zeros | 1 | bslbf |
| initial_presentation_delay_present | 1 | bslbf |
| if (initial_presentation_delay_present) { | | |
| initial_presentation_delay_minus_one | 4 | uimsbf |
| } else { | | |
| reserved_zeros | 4 | uimsbf |
| } | | |
| } | | |
2.2.2 Semantics
descriptor_tag - This value shall be set to 0x80.
descriptor_length - This value shall be set to 4.
marker - This value shall be set to 1.
version - This field indicates the version of the AV1_video_descriptor. This value shall be set to 1.
seq_profile, seq_level_idx_0 and high_bitdepth - These fields shall be coded according to the semantics defined in [ AV1 ]. If these fields are not coded in the Sequence Header OBU in the AV1 video stream, the inferred values are coded in the descriptor.
seq_tier_0, twelve_bit, monochrome, chroma_subsampling_x, chroma_subsampling_y, chroma_sample_position - These fields shall be coded according to the same semantics when they are present. If they are not present, they will be coded using the value inferred by the semantics.
hdr_wcg_idc - The value of this syntax element indicates the presence or absence of high dynamic range (HDR) and/or wide color gamut (WCG) video components in the associated PID according to the table below. HDR is defined to be video that has high dynamic range if the video stream EOTF is higher than the reference EOTF defined in [ BT-1886 ]. WCG is defined to be video that is coded using colour primaries with a colour gamut not contained within [ BT-709 ].
reserved_zeros - Will be set to zeroes.
initial_presentation_delay_present - Indicates that the initial_presentation_delay_minus_one field is present.
initial_presentation_delay_minus_one - Ignored for [ MPEG-2 TS ] use, included only to aid conversion to/from ISOBMFF.
| hdr_wcg_idc | Description |
|---|---|
| 0 | SDR, i.e., video is based on the reference EOTF defined in [ BT-1886 ] with a color gamut that is contained within [ BT-709 ] with a [ BT-709 ] container |
| 1 | WCG only, i.e., video color gamut in a [ BT-2020 ] container that exceeds [ BT-709 ] |
| 2 | Both HDR and WCG are to be indicated in the stream |
| 3 | No indication made regarding HDR/WCG or SDR characteristics of the stream |
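To make the bit layout of the AV1 video descriptor concrete, here is a minimal sketch that packs the fields in the order and widths listed in the syntax for this descriptor. The function name, parameter defaults (8-bit 4:2:0 video) and validation are assumptions made for illustration; only the field order, widths and fixed values come from the text.

```python
from typing import Optional


def build_av1_video_descriptor(
    seq_profile: int,
    seq_level_idx_0: int,
    seq_tier_0: int = 0,
    high_bitdepth: int = 0,
    twelve_bit: int = 0,
    monochrome: int = 0,
    chroma_subsampling_x: int = 1,
    chroma_subsampling_y: int = 1,
    chroma_sample_position: int = 0,
    hdr_wcg_idc: int = 0,
    initial_presentation_delay_minus_one: Optional[int] = None,
) -> bytes:
    """Pack an AV1_video_descriptor() (hypothetical helper, illustrative only)."""
    def check(name: str, value: int, bits: int) -> int:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} does not fit in {bits} bit(s)")
        return value

    marker, version = 1, 1                      # both fixed by the semantics
    byte0 = (marker << 7) | version
    byte1 = (check("seq_profile", seq_profile, 3) << 5) | check(
        "seq_level_idx_0", seq_level_idx_0, 5)
    byte2 = (
        (check("seq_tier_0", seq_tier_0, 1) << 7)
        | (check("high_bitdepth", high_bitdepth, 1) << 6)
        | (check("twelve_bit", twelve_bit, 1) << 5)
        | (check("monochrome", monochrome, 1) << 4)
        | (check("chroma_subsampling_x", chroma_subsampling_x, 1) << 3)
        | (check("chroma_subsampling_y", chroma_subsampling_y, 1) << 2)
        | check("chroma_sample_position", chroma_sample_position, 2)
    )
    delay_present = initial_presentation_delay_minus_one is not None
    low_nibble = check("initial_presentation_delay_minus_one",
                       initial_presentation_delay_minus_one, 4) if delay_present else 0
    # hdr_wcg_idc (2) | reserved_zeros (1) | delay_present (1) | delay or zeros (4)
    byte3 = (check("hdr_wcg_idc", hdr_wcg_idc, 2) << 6) | (int(delay_present) << 4) | low_nibble
    descriptor_tag, descriptor_length = 0x80, 4
    return bytes([descriptor_tag, descriptor_length, byte0, byte1, byte2, byte3])


# Profile 0, seq_level_idx_0 = 8, 8-bit 4:2:0, SDR, no presentation delay.
print(build_av1_video_descriptor(0, 8).hex())  # 800481080c00
```

The four payload bytes after descriptor_length match the 32 bits counted in the syntax, which is why descriptor_length is 4.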
For AV1 video streams, the following constraints apply:
In addition, a start code insertion and emulation prevention process shall be performed on the AV1 Bitstream prior to its PES encapsulation. This process is described in section 3.2.
Prior to carriage into PES, the AV1 open_bitstream_unit() is encapsulated into ts_open_bitstream_unit() . This is required to provide direct access to OBU through a start-code mechanism inserted prior to each OBU. The following syntax describes how to retrieve the open_bitstream_unit() from the ts_open_bitstream_unit() (tsOBU).
| Syntax | No. Of bits | Mnemonic |
|---|---|---|
| ts_open_bitstream_unit(NumBytesInTsObu) { | | |
| obu_start_code /* equal to 0x000001 */ | 24 | uimsbf |
| NumBytesInObu = 0 | | |
| for( i = 2; i < NumBytesInTsObu; i++ ) { | | |
| if( i + 2 < NumBytesInTsObu && nextbits(24) == 0x000003 ) { | | |
| open_bitstream_unit[NumBytesInObu++] | 8 | uimsbf |
| open_bitstream_unit[NumBytesInObu++] | 8 | uimsbf |
| i += 2 | | |
| emulation_prevention_three_byte /* equal to 0x03 */ | 8 | uimsbf |
| } else | | |
| open_bitstream_unit[NumBytesInObu++] | 8 | uimsbf |
| } | | |
| } | | |
obu_start_code - This value shall be set to 0x000001.
open_bitstream_unit[i] - i-th byte of the AV1 open bitstream unit (as defined in section 5.3 of [ AV1 ]).
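The extraction described by the ts_open_bitstream_unit() syntax can be sketched from the receiver's point of view as follows. This is an illustrative reading, not a reference implementation: the function name is hypothetical, and the sketch assumes the input holds exactly one tsOBU and that the payload starts immediately after the 3-byte obu_start_code.

```python
def extract_obu(ts_obu: bytes) -> bytes:
    """Recover open_bitstream_unit() bytes from one ts_open_bitstream_unit().

    Hypothetical helper. Assumes ts_obu starts with obu_start_code and
    contains a single tsOBU.
    """
    if ts_obu[:3] != b"\x00\x00\x01":
        raise ValueError("missing obu_start_code 0x000001")
    payload = ts_obu[3:]
    obu = bytearray()
    i = 0
    while i < len(payload):
        # nextbits(24) == 0x000003: keep the two zero bytes and drop the
        # emulation_prevention_three_byte that follows them.
        if i + 2 < len(payload) and payload[i:i + 3] == b"\x00\x00\x03":
            obu += payload[i:i + 2]
            i += 3
        else:
            obu.append(payload[i])
            i += 1
    return bytes(obu)


escaped = bytes.fromhex("000001" "12" "000003" "01" "0a")
print(extract_obu(escaped).hex())  # 120000010a
```

In the example, the original OBU bytes contained 0x000001, which the muxer escaped as 0x00000301 so that it could not be mistaken for a start code.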
It is the responsibility of the TS muxer to prevent start code emulation by escaping all the forbidden three-byte sequences using the emulation_prevention_three_byte (always equal to 0x03). The forbidden sequences are defined below.
Within the ts_open_bitstream_unit() payload, the following three-byte sequences shall not occur at any byte-aligned position :
Within the ts_open_bitstream_unit() payload, any four-byte sequence that starts with 0x000003 other than the following sequences shall not occur at any byte-aligned position :
An AV1 Access Unit consists of all OBUs, including headers, between the end of the last OBU associated with the previous frame and the end of the last OBU associated with the current frame. With this definition, an Access Unit sometimes maps to a Decodable Frame Group (DFG) as defined in Annex E of [ AV1 ], sometimes to a Temporal Unit (TU) as defined in [ AV1 ], and sometimes to both. This is illustrated in the figure below for a group of pictures with frames predicted as follows:
AV1 video encapsulated as defined in clause 4.2 is carried in PES packets as PES_packet_data_bytes, using the stream_id 0xBD (private_stream_1).
A PES packet shall encapsulate one, and only one, AV1 access unit as defined in clause 4.3. All PES packets shall have data_alignment_indicator set to 1. Usage of data_stream_alignment_descriptor is not specified and the only allowed alignment_type is 1 (Access unit level).
The highest level that may occur in an AV1 video stream, as well as a profile and tier that the entire stream conforms to, shall be signalled using the AV1 video descriptor.
For an AV1 video stream multiplexed into [ MPEG-2 TS ], the decoder_model_info may be absent. If the decoder_model_info is present, then the STD model shall match the decoder model defined in Annex E of [ AV1 ].
For synchronization and STD management, PTSs and, when appropriate, DTSs are encoded in the header of the PES packet that carries the AV1 video stream data, setting the PTS_DTS_flags to '10' (PTS only) or '11' (PTS and DTS). For PTS and DTS encoding, the constraints and semantics apply as defined in the PES Header and associated constraints on timestamp intervals.
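For readers unfamiliar with how timestamps are laid out in a PES header, the sketch below encodes a 33-bit PTS or DTS into its 5-byte field, following the general MPEG-2 systems layout (4-bit prefix, then 3, 15 and 15 timestamp bits, each group followed by a marker bit). The function names are hypothetical, and the sketch assumes timestamps are already expressed in 90 kHz ticks.

```python
from typing import Optional, Tuple


def encode_timestamp(prefix: int, ticks: int) -> bytes:
    """Encode a 33-bit timestamp (90 kHz ticks) into a 5-byte PES field."""
    ticks &= (1 << 33) - 1                               # wrap to 33 bits
    return bytes([
        (prefix << 4) | (((ticks >> 30) & 0x07) << 1) | 1,  # bits 32..30 + marker
        (ticks >> 22) & 0xFF,                               # bits 29..22
        (((ticks >> 15) & 0x7F) << 1) | 1,                  # bits 21..15 + marker
        (ticks >> 7) & 0xFF,                                # bits 14..7
        ((ticks & 0x7F) << 1) | 1,                          # bits 6..0 + marker
    ])


def decode_timestamp(field: bytes) -> int:
    """Inverse of encode_timestamp; the prefix and marker bits are ignored."""
    return (
        (((field[0] >> 1) & 0x07) << 30)
        | (field[1] << 22)
        | ((field[2] >> 1) << 15)
        | (field[3] << 7)
        | (field[4] >> 1)
    )


def encode_pts_dts(pts: int, dts: Optional[int] = None) -> Tuple[int, bytes]:
    """Return (PTS_DTS_flags, timestamp bytes) for a PES header."""
    if dts is None:
        return 0b10, encode_timestamp(0b0010, pts)
    return 0b11, encode_timestamp(0b0011, pts) + encode_timestamp(0b0001, dts)


flags, fields = encode_pts_dts(pts=180000, dts=90000)   # 2 s and 1 s at 90 kHz
print(bin(flags), len(fields))  # 0b11 10
```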
There are cases in AV1 bitstreams where information about a frame is sent multiple times, for example first to be decoded and subsequently to be displayed. When a frame is decoded but not displayed, it would be desirable to assign a valid DTS without a PTS. However, [ MPEG-2 TS ] does not allow a DTS to be transmitted without a PTS. Hence, a PTS is always assigned to AV1 access units, and its value is not relevant for frames that are decoded but not displayed.
To achieve consistency between the STD model and the buffer model defined in Annex E of [ AV1 ], the following PTS and DTS assignment rules shall be applied :
| show_existing_frame | show_frame | showable_frame | PTS | DTS |
|---|---|---|---|---|
| 0 | 0 | 0 | ScheduledRemovalTiming[dfg] | ScheduledRemovalTiming[dfg] |
| 0 | 0 | 1 | ScheduledRemovalTiming[dfg] | ScheduledRemovalTiming[dfg] |
| 0 | 1 | n/a | PresentationTime[frame] | ScheduledRemovalTiming[dfg] |
| 1 | n/a | n/a | PresentationTime[frame] | ScheduledRemovalTiming[dfg] |
Note: ScheduledRemovalTiming[] and PresentationTime[] are defined in Annex E of [ AV1 ].
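The assignment rules in the table above reduce to a simple decision, sketched below. The function name and the way the two timing values are passed in are assumptions for illustration; computing ScheduledRemovalTiming[] and PresentationTime[] themselves is outside the scope of this sketch.

```python
from typing import Tuple


def assign_pts_dts(
    show_existing_frame: bool,
    show_frame: bool,
    scheduled_removal_timing: float,
    presentation_time: float,
) -> Tuple[float, float]:
    """Return (PTS, DTS) for one access unit per the assignment table.

    Hypothetical helper: both timing inputs are assumed to be already
    derived from the decoder model for this access unit.
    """
    # DTS always follows the scheduled removal timing of the DFG.
    dts = scheduled_removal_timing
    if show_existing_frame or show_frame:
        # The frame is displayed: PTS carries its presentation time.
        pts = presentation_time
    else:
        # Decoded but not displayed (showable_frame does not matter):
        # a PTS is still required, so it mirrors the DTS.
        pts = scheduled_removal_timing
    return pts, dts


print(assign_pts_dts(False, False, 1.0, 1.5))  # (1.0, 1.0)
print(assign_pts_dts(False, True, 1.0, 1.5))   # (1.5, 1.0)
```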
Carriage of an AV1 video stream over [ MPEG-2 TS ] does not impact the size of the Buffer Pool.
For decoding of an AV1 video stream in the STD, the size of the Buffer Pool is as defined in [ AV1 ]. The Buffer Pool shall be managed as specified in Annex E of [ AV1 ].
A decoded AV1 access unit enters the Buffer Pool instantaneously upon decoding the AV1 access unit, hence at the Scheduled Removal Timing of the AV1 access unit. A decoded AV1 access unit is presented at the Presentation Time.
If the AV1 video stream provides insufficient information to determine the Scheduled Removal Timing and the Presentation Time of AV1 access units, then these time instants shall be determined in the STD model from PTS and DTS timestamps as follows:
When there is an AV1 video stream in an [ MPEG-2 TS ] program, the T-STD model as described in the section "Transport stream system target decoder" is extended as specified below.
The following additional notations are used to describe the T-STD extensions and are illustrated in the figure above.
| Notation | Definition |
|---|---|
| t(i) | indicates the time in seconds at which the i-th byte of the transport stream enters the system target decoder |
| TBn | is the transport buffer for elementary stream n |
| TBS | is the size of the transport buffer TBn, measured in bytes |
| MBn | is the multiplexing buffer for elementary stream n |
| MBSn | is the size of the multiplexing buffer MBn, measured in bytes |
| EBn | is the elementary stream buffer for the AV1 video stream |
| EBSn | is the size of the elementary stream buffer EBn, measured in bytes |
| j | is an index to the AV1 access unit of the AV1 video stream |
| An(j) | is the j-th access unit of the AV1 video bitstream |
| tdn(j) | is the decoding time of An(j), measured in seconds, in the system target decoder |
| Rxn | is the transfer rate from the transport buffer TBn to the multiplexing buffer MBn as specified below |
| Rbxn | is the transfer rate from the multiplexing buffer MBn to the elementary stream buffer EBn as specified below |
The following apply:
If there is PES packet payload data in MBn, and buffer EBn is not full, the PES packet payload is transferred from MBn to EBn at a rate equal to Rbxn. If EBn is full, data are not removed from MBn. When a byte of data is transferred from MBn to EBn, all PES packet header bytes that are in MBn and precede that byte are instantaneously removed and discarded. When there is no PES packet payload data present in MBn, no data is removed from MBn. All data that enters MBn leaves it. All PES packet payload data bytes enter EBn instantaneously upon leaving MBn.
The STD delay of any AV1 video data through the system target decoder buffers TBn, MBn, and EBn shall be constrained by tdn(j) – t(i) ≤ 10 seconds for all j, and all bytes i in access unit An(j).
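The delay constraint above can be checked per access unit with a few lines of code. This is a minimal sketch: the function name is hypothetical, and it assumes the arrival time t(i) of every byte of the access unit and its decoding time are already known in seconds.

```python
from typing import Iterable

MAX_STD_DELAY_SECONDS = 10.0  # bound stated in the constraint


def std_delay_ok(byte_arrival_times: Iterable[float], decoding_time: float) -> bool:
    """Check that decoding_time - t(i) <= 10 s for every byte of one access unit."""
    return all(
        decoding_time - t_i <= MAX_STD_DELAY_SECONDS
        for t_i in byte_arrival_times
    )


print(std_delay_ok([0.0, 0.5], decoding_time=9.0))    # True
print(std_delay_ok([0.0, 5.0], decoding_time=10.5))   # False
```

Since the earliest byte has the largest delay, checking only the first byte of each access unit is sufficient in practice.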
Transport streams shall be constructed so that the following conditions for buffer management are satisfied: