Immersive Audio Model and Formats

1. Introduction

This specification defines an immersive audio model and formats (IAMF) to provide an immersive audio experience to end-users.

The term Immersive Audio (IA) means the combination of 3D audio signals recreating a sound experience close to that of a natural environment.
The term 3D audio signal means a representation of sound that incorporates additional information beyond traditional stereo or surround sound formats. Examples include Ambisonics (scene-based), object-based audio, and channel-based audio (e.g. 3.1.2ch or 7.1.4ch).

IAMF is used to provide Immersive Audio content for presentation on a wide range of devices in both streaming and offline applications. These applications include internet audio streaming, multicasting/broadcasting services, file download, gaming, communication, virtual and augmented reality, and others. In these applications, audio may be played back on a wide range of devices, e.g. headsets, mobile phones, tablets, TVs, sound bars, home theater systems, and big screens.

Here are some typical IAMF use cases and examples of how to instantiate the model for the use cases.

UC1: One Audio Element (e.g., 3.1.2ch or First Order Ambisonics (FOA)) is delivered to a big-screen TV (in Home) or a Mobile device through a unicast network. It is rendered to a loudspeaker layout (e.g. 3.1.2ch) or Binaural (or Stereo) with loudness normalization and is played on loudspeakers built into the big-screen TV or headset through the Mobile device, respectively.
UC2: Two Audio Elements (e.g., 5.1.2ch and Stereo) are delivered to a big-screen TV through a unicast network. Both are rendered to the same loudspeaker layout built into the big-screen TV and are mixed. After loudness normalization appropriate to the Home environment, the Rendered Mix Presentation is played back on the loudspeakers.
UC3: Two Audio Elements (e.g., FOA and Stereo) are delivered to a Mobile device through a unicast network. Both are rendered to the same loudspeaker layout and mixed. The mix is then rendered to Binaural (or Stereo) with loudness normalization and is played back on the headset through the Mobile device.

Example 1: 3D audio signal = 3.1.2ch of UC1,

Audio Substream: L and R channels are coded as one audio stream, Ltf and Rtf channels as one audio stream, Center as one audio stream, and LFE as one audio stream.
Audio Element (3.1.2ch): Consists of 4 Audio Substreams which are grouped into one ChannelGroup.
Mix Presentation: Provides the rendering algorithms for rendering the Audio Element to popular loudspeaker layouts, and the loudness information of the 3D audio signal.

Example 2: Two 3D audio signals = 5.1.2ch and Stereo of UC2,

Audio Substream: L and R channels are coded as one audio stream, Ls and Rs channels as one audio stream, Ltf and Rtf channels as one audio stream, Center as one audio stream, and LFE as one audio stream.
Audio Element 1 (5.1.2ch): Consists of 5 Audio Substreams which are grouped into one ChannelGroup.
Audio Element 2 (Stereo): Consists of 1 Audio Substream which is grouped into one ChannelGroup.
Parameter Substream 1-1: Mixing parameter values applied to Audio Element 1, taking the Home environment into account.
Parameter Substream 1-2: Mixing parameter values applied to Audio Element 2, taking the Home environment into account.
Mix Presentation: Provides the rendering algorithms for rendering Audio Elements 1 and 2 to popular loudspeaker layouts, the mixing information based on Parameter Substreams 1-1 and 1-2, and the loudness information of the Rendered Mix Presentation.

Example 3: Two 3D audio signals = FOA and Stereo of UC3,

Audio Substream: L and R channels are coded as one audio stream and each channel of FOA as one audio stream.
Audio Element 1 (FOA): Consists of 4 Audio Substreams which are grouped into one ChannelGroup.
Audio Element 2 (Stereo): Consists of 1 Audio Substream which is grouped into one ChannelGroup.
Parameter Substream 1-1: Mixing parameter values applied to Audio Element 1, taking the Mobile environment into account.
Parameter Substream 1-2: Mixing parameter values applied to Audio Element 2, taking the Mobile environment into account.
Mix Presentation: Provides the rendering algorithms for rendering Audio Elements 1 and 2 to popular loudspeaker layouts, the mixing information based on Parameter Substreams 1-1 and 1-2, and the loudness information of the Rendered Mix Presentation.

2. Immersive Audio Model

This specification defines a model for representing Immersive Audio content based on Audio Substreams contributing to Audio Elements, which are meant to be rendered and mixed to form one or more presentations, as depicted in the figure below.

Processing flow to decode, reconstruct, render, and mix the 3D audio signals for immersive audio playback.

The model comprises a number of coded Audio Substreams and the metadata that describes how to decode, render and mix the Audio Substreams for playback. The model itself is codec-agnostic; any supported audio codec may be used to code the Audio Substreams.

The model includes one or more Audio Elements, each of which consists of one or more Audio Substreams. The Audio Substreams that make up an Audio Element are grouped into one or more ChannelGroups. The model further includes Mix Presentations and Parameter Substreams.

2.1. Terminology

The term Audio Substream means a sequence of audio samples, which may be encoded with any compatible audio codec.

The term Audio Element means a 3D audio signal, and is constructed from one or more Audio Substreams and the metadata describing them. The Audio Substreams associated with one Audio Element use the same audio codec.

The term ChannelGroup means a set of Audio Substream(s) which is(are) able to provide a spatial resolution of audio content by itself or which is(are) able to provide an enhanced spatial resolution of audio content by combining with the preceding ChannelGroups.

The term Parameter Substream means a sequence of parameter values that are associated with the algorithms used for decoding, reconstructing, rendering, and mixing. It is applied to its associated Audio Element.

Parameter Substreams may change their values over time and may further be animated; for example, any changes in values may be smoothed over some time duration. As such, they may be viewed as a 1D signal with different metadata specified for different time durations.

The term Mix Presentation means a series of processes to present Immersive Audio content to end-users by using Audio Element(s). It contains metadata that describes how the Audio Element(s) is(are) rendered and mixed together for playback through physical loudspeakers or headsets, as well as loudness information.

The term Rendered Mix Presentation means a 3D audio signal after the Audio Element(s) defined in a Mix Presentation is(are) rendered and mixed together for playback through physical loudspeakers or headsets.

2.2. Architecture

Based on the model, this specification defines a hypothetical immersive audio model and format (IAMF) architecture as depicted in the figure below.

Hypothetical IAMF Architecture

For a given input 3D audio,

A Pre-Processor generates ChannelGroup(s), Descriptors and Parameter Substream(s).
A Codec Enc generates coded Audio Substream(s).
An OBU Packetizer generates an IA Sequence from the coded Audio Substream(s), the Descriptors and the Parameter Substream(s).
A File Packager (ISOBMFF Encapsulation) generates an IAMF File by encapsulating the IA Sequence into [ISOBMFF] track(s).
A File Parser (ISOBMFF Parser) reconstructs the IA Sequence by decapsulating the IAMF File.
An OBU Parser outputs the coded Audio Substream(s) and the Parameter Substream(s).
A Codec Dec outputs decoded ChannelGroup(s) after decoding of the coded Audio Substream(s).
A Post-Processor outputs Immersive Audio by using the ChannelGroup(s), the Descriptors and the Parameter Substream(s).
Pre-Processor, ChannelGroup(s), Codec Enc and OBU Packetizer are defined in § 8 IAMF Generation Process (Informative).
IA Sequence is defined in § 2.3.1 IA Sequence.
ISOBMFF Encapsulation, IAMF file (ISOBMFF file) and ISOBMFF Parser are defined in § 6 ISOBMFF IAMF Encapsulation.
OBU Parser, Codec Dec, and Post-Processor are defined in § 7 IAMF processing.

2.3. Bitstream Structure

2.3.1. IA Sequence

An IA Sequence is a bitstream to represent Immersive Audio content and consists of Descriptors and IA Data.

The metadata in the Descriptors and IA Data are packetized into individual Open Bitstream Units (OBUs). The term Open Bitstream Unit (OBU) means the concrete, physical unit used to represent the components in the model.

2.3.2. Use of OBU

2.3.2.1. Descriptors

Descriptors contain all the information that is required to set up and configure the decoders, reconstruction algorithm, renderers, and mixers. Descriptors do not contain audio signals.

IA Sequence Header OBU indicates the start of a full IA Sequence description and contains information related to profiles.
Codec Config OBU describes information to set up a decoder for a coded Audio Substream.
Audio Element OBU describes information to combine one or more Audio Substreams to reconstruct an Audio Element.
Mix Presentation OBU describes information to render and mix one or more Audio Elements to generate the final 3D audio output.
- Multiple Mix Presentations can be defined as alternatives to each other within the same IA Sequence. Furthermore, the choice of which Mix Presentation to use at playback is left to the user. For example, multi-language support is implemented by defining different Mix Presentations, where the first mix describes the use of the Audio Element with English dialogue, and the second mix describes the use of the Audio Element with French dialogue.

2.3.2.2. IA Data

IA Data contains the actual time-varying data that is required in the generation of the final 3D audio output.

Audio Frame OBU provides the coded audio frame for an Audio Substream. Each frame has a start timestamp and a duration, so a coded Audio Substream is represented as a sequence of Audio Frame OBUs with the same identifier, in time order. Audio Frame OBUs can be carried with different obu_type values (OBU_IA_Audio_Frame, or OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID17).
Parameter Block OBU provides the parameter values in a block for a time-varying Parameter Substream. Each block has a start timestamp and a duration, so a time-varying Parameter Substream is represented as a sequence of parameter values in Parameter Block OBUs with the same identifier, in time order.
Temporal Delimiter OBU identifies the Temporal Units. It may or may not be present in an IA Sequence. If present, the first OBU of every Temporal Unit is a Temporal Delimiter OBU.

2.4. Timing Model

A coded Audio Substream is made of consecutive Audio Frame OBUs. Each Audio Frame OBU is made of audio samples at a given sample rate. The decode duration of an Audio Frame OBU is the number of audio samples divided by the sample rate. The presentation duration of an Audio Frame OBU is the number of audio samples remaining after trimming divided by the sample rate. The decode start time (respectively presentation start time) of an Audio Frame OBU is the sum of the decode durations (respectively presentation durations) of the previous Audio Frame OBUs in the IA Sequence, or 0 if there are none. The decode duration (respectively presentation duration) of a coded Audio Substream is the sum of the decode durations (respectively presentation durations) of all its Audio Frame OBUs. The decode start time of an Audio Substream is the decode start time of its first Audio Frame OBU. The presentation start time of an Audio Substream is the presentation start time of its first Audio Frame OBU which is not entirely trimmed.
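
As a rough illustration of the rules above, the following Python sketch accumulates decode and presentation durations for the frames of one coded Audio Substream. The `Frame` type and the `frame_timings` helper are hypothetical names introduced only for this example; they are not defined by the specification.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class Frame:
    """Hypothetical stand-in for one Audio Frame OBU."""
    num_samples: int      # samples in the frame before trimming
    trim_start: int = 0   # num_samples_to_trim_at_start
    trim_end: int = 0     # num_samples_to_trim_at_end


def frame_timings(frames, sample_rate):
    """Return (decode_start, decode_dur, pres_start, pres_dur) per frame, in seconds."""
    decode_start = Fraction(0)
    pres_start = Fraction(0)
    timings = []
    for f in frames:
        decode_dur = Fraction(f.num_samples, sample_rate)
        remaining = f.num_samples - f.trim_start - f.trim_end
        pres_dur = Fraction(remaining, sample_rate)
        timings.append((decode_start, decode_dur, pres_start, pres_dur))
        # Start times are the running sums of the previous durations.
        decode_start += decode_dur
        pres_start += pres_dur
    return timings


# Three 960-sample frames at 48 kHz; 312 samples are trimmed from the first one.
frames = [Frame(960, trim_start=312), Frame(960), Frame(960)]
timings = frame_timings(frames, 48000)
```

Exact fractions are used here so that start times do not accumulate floating-point error; an implementation could equally count in samples.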

A Parameter Substream is made of consecutive Parameter Block OBUs. Each Parameter Block OBU is made of parameter values at a given sample rate. The decode duration of a Parameter Block OBU is the number of parameter values divided by the sample rate. The decode start time of a Parameter Block OBU is the sum of the decode durations of the previous Parameter Block OBUs if any, 0 otherwise. The decode duration of a Parameter Substream is the sum of the decode durations of all its Parameter Block OBUs. The start time of a Parameter Substream is the decode start time of its first Parameter Block OBU. When all parameter values of a Parameter Substream are constant, Parameter Block OBUs may be absent from the IA Sequence.

Within an Audio Element, the presentation start times of all Audio Substreams coincide and define the presentation start time of the Audio Element. All Audio Substreams have the same presentation duration, which is the presentation duration of the Audio Element.

The decode start times of all coded Audio Substreams and all Parameter Substreams coincide and define the decode start time of the Audio Element.
All coded Audio Substreams and all Parameter Substreams have the same decode duration, which is the decode duration of the Audio Element.

Within a Mix Presentation, the presentation start times of all Audio Elements coincide and all Audio Elements have the same duration, which defines the duration of the Mix Presentation.

Within an IA Sequence, all Mix Presentations have the same duration, which defines the duration of the IA Sequence, and have the same presentation start time, which defines the presentation start time of the IA Sequence.

The term Temporal Unit means the set of all Audio Frame OBUs with the same decode start time and the same duration from all coded Audio Substreams, together with all non-redundant Parameter Block OBUs whose decode start time falls within that duration.

The figure below shows an example of the Timing Model in terms of the decode start times and durations of the coded Audio Substream and Parameter Substream.

An example of the IAMF Timing Model. AFO: Audio Frame OBU, PBO: Parameter Block OBU, PT: time (ms) on the presentation layer’s timeline, DT: time (ms) on the decoding layer’s timeline.

NOTE: For a given decoded Audio Substream (before trimming) and its associated Parameter Substream(s), a decoder performs one of the following: 1) it applies the Parameter Substream(s) to the Audio Substream and then trims the audio samples to be trimmed, or 2) it trims both the audio samples to be trimmed and the parameter values of the Parameter Substream(s) that map to those samples, and then applies the remaining parameter values to the trimmed Audio Substream.

3. Open Bitstream Unit (OBU) Syntax and Semantics

The IA Sequence uses the OBU syntax.

This section specifies the OBU syntax elements and their semantics.

3.1. Immersive Audio OBU Syntax and Semantics

obu_header() and all OBU payloads including the reserved_obu() are byte aligned.

Syntax

class ia_open_bitstream_unit() {
  obu_header();
  if (obu_type == OBU_IA_Sequence_Header)
    ia_sequence_header_obu();
  else if (obu_type == OBU_IA_Codec_Config)
    codec_config_obu();
  else if (obu_type == OBU_IA_Audio_Element)
    audio_element_obu();
  else if (obu_type == OBU_IA_Mix_Presentation)
    mix_presentation_obu();
  else if (obu_type == OBU_IA_Parameter_Block)
    parameter_block_obu();
  else if (obu_type == OBU_IA_Temporal_Delimiter)
    temporal_delimiter_obu();
  else if (obu_type == OBU_IA_Audio_Frame)
    audio_frame_obu(true);
  else if (obu_type >= 6 && obu_type <= 23)
    audio_frame_obu(false);
  else if (obu_type >= 24 && obu_type <= 30)
    reserved_obu();
}

Semantics

If the syntax element obu_type is equal to OBU_IA_Sequence_Header, an ordered series of OBUs is presented to the decoding process as a string of bytes.

3.2. OBU Header Syntax and Semantics

Syntax

class obu_header() {
  unsigned int (5) obu_type;
  unsigned int (1) obu_redundant_copy;
  unsigned int (1) obu_trimming_status_flag;
  unsigned int (1) obu_extension_flag;
  leb128() obu_size;
  if (obu_trimming_status_flag) {
    leb128() num_samples_to_trim_at_end;
    leb128() num_samples_to_trim_at_start;
  }
  if (obu_extension_flag == 1) {
    leb128() extension_header_size;
    unsigned int (8*extension_header_size) extension_header_bytes;
  }
}

Semantics

OBUs are structured with a header and a payload.

obu_type specifies the type of data structure contained in the OBU payload.

obu_type: Name of obu_type
   0    : OBU_IA_Codec_Config
   1    : OBU_IA_Audio_Element
   2    : OBU_IA_Mix_Presentation
   3    : OBU_IA_Parameter_Block
   4    : OBU_IA_Temporal_Delimiter
   5    : OBU_IA_Audio_Frame
  6~23  : OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID17
 24~30  : Reserved
   31   : OBU_IA_Sequence_Header
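
The table above can be expressed as a small lookup, sketched here in Python. The function name `obu_type_name` and the dictionary `OBU_TYPE_NAMES` are hypothetical helpers for illustration only.

```python
# Fixed obu_type values from the table; ranges are handled in the function.
OBU_TYPE_NAMES = {
    0: "OBU_IA_Codec_Config",
    1: "OBU_IA_Audio_Element",
    2: "OBU_IA_Mix_Presentation",
    3: "OBU_IA_Parameter_Block",
    4: "OBU_IA_Temporal_Delimiter",
    5: "OBU_IA_Audio_Frame",
    31: "OBU_IA_Sequence_Header",
}


def obu_type_name(obu_type: int) -> str:
    """Map a 5-bit obu_type value to its name in the table."""
    if not 0 <= obu_type <= 31:
        raise ValueError("obu_type is a 5-bit field (0..31)")
    if 6 <= obu_type <= 23:
        # OBU_IA_Audio_Frame_ID0 .. OBU_IA_Audio_Frame_ID17
        return f"OBU_IA_Audio_Frame_ID{obu_type - 6}"
    if 24 <= obu_type <= 30:
        return "Reserved"
    return OBU_TYPE_NAMES[obu_type]
```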

obu_redundant_copy indicates whether this OBU is a redundant copy of the previous OBU in the IA Sequence with the same obu_type. A value of 1 indicates that it is a redundant copy, while a value of 0 indicates that it is not.

It SHALL always be set to 0 for the following obu_type values:

OBU_IA_Temporal_Delimiter
OBU_IA_Audio_Frame
OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID17

If a decoder encounters an OBU with obu_redundant_copy = 1, and it has also received the previous non-redundant OBU, it SHALL ignore the redundant OBU. If the decoder has not received the previous non-redundant OBU, it SHALL treat the redundant copy as a non-redundant OBU and process the OBU accordingly.

obu_trimming_status_flag indicates whether this OBU has audio samples to be trimmed or not. It SHALL be set only when obu_type is set to OBU_IA_Audio_Frame or OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID17.

For a given coded Audio Substream,

If an Audio Frame OBU has its num_samples_to_trim_at_start field set to a non-zero value N, the decoder SHALL discard the first N audio samples.
If an Audio Frame OBU has its num_samples_to_trim_at_end field set to a non-zero value N, the decoder SHALL discard the last N audio samples.

NOTE: Because of coding dependency, discarding a sample can sometimes mean decoding the entire audio frame.

For a given Audio Frame OBU, the sum of num_samples_to_trim_at_start and num_samples_to_trim_at_end SHALL be less than or equal to the number of samples (i.e. num_samples_per_frame) in the Audio Frame OBU.

NOTE: This means that if one of the values is set to the number of samples (i.e. num_samples_per_frame) in the Audio Frame OBU, the other value is set to 0.
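
A minimal Python sketch of the trimming rules above, applied to the already decoded samples of one frame. The helper names `trimming_is_valid` and `trim_frame` are hypothetical, and the sketch assumes the decoded frame is available as a list of samples.

```python
def trimming_is_valid(num_samples_per_frame, trim_at_start, trim_at_end):
    """Check that the two trim counts together do not exceed the frame length."""
    return (trim_at_start >= 0 and trim_at_end >= 0
            and trim_at_start + trim_at_end <= num_samples_per_frame)


def trim_frame(samples, trim_at_start, trim_at_end):
    """Discard the first trim_at_start and the last trim_at_end decoded samples."""
    if not trimming_is_valid(len(samples), trim_at_start, trim_at_end):
        raise ValueError("trim counts exceed the number of samples in the frame")
    return samples[trim_at_start:len(samples) - trim_at_end]


# A 10-sample frame with 3 samples trimmed at the start and 2 at the end.
kept = trim_frame(list(range(10)), 3, 2)
```

A fully trimmed frame (one count equal to the frame length, the other 0) yields an empty list.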

When num_samples_to_trim_at_start is non-zero, all Audio Frame OBUs with the same audio_substream_id that precede this OBU, back to the Codec Config OBU defining this Audio Substream, SHALL have their num_samples_to_trim_at_start field equal to the number of samples (i.e. num_samples_per_frame) in the corresponding Audio Frame OBU.
When num_samples_to_trim_at_end is non-zero in an Audio Frame OBU, there SHALL be no subsequent Audio Frame OBU with the same audio_substream_id until a non-redundant Codec Config OBU defining an Audio Substream with the same audio_substream_id.

obu_extension_flag indicates whether the extension_header_size field is present. If it is set to 0, the extension_header_size field SHALL NOT be present. Otherwise, the extension_header_size field SHALL be present.

This flag SHALL be set to 0 for this version of the specification. An OBU parser that is conformant with this version of the specification SHALL ignore the extension_header_bytes.

NOTE: A future version of the specification may use this flag to specify an extension header field by setting obu_extension_flag = 1 and setting the size of the extended header to extension_header_size.

obu_size indicates the size in bytes of the OBU immediately following the obu_size field of the OBU. An OBU MAY have extra bytes after consuming all the bytes per the OBU syntax definition. Parsers compliant with this version of the specification SHALL ignore the extra bytes.

num_samples_to_trim_at_start indicates the number of samples that need to be trimmed from the start of the samples in this Audio Frame OBU.

num_samples_to_trim_at_end indicates the number of samples that need to be trimmed from the end of the samples in this Audio Frame OBU.

extension_header_size indicates the size in bytes of the extension header immediately following this field.

extension_header_bytes indicates the byte representations of the syntaxes of the extension header.
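
The header layout can be illustrated with a short Python parser. This is a sketch under two assumptions that are not spelled out in this section: leb128() is taken to be an unsigned little-endian base-128 integer (7 value bits per byte, most significant bit as continuation flag, at most 8 bytes), and obu_size is taken to count every byte after the obu_size field, including the trimming and extension fields. The names `read_leb128` and `parse_obu_header` are hypothetical.

```python
def read_leb128(buf, pos):
    """Decode an unsigned LEB128 value; return (value, next_pos)."""
    value = 0
    for i in range(8):  # assumed upper bound of 8 bytes
        byte = buf[pos + i]
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return value, pos + i + 1
    raise ValueError("leb128 value longer than 8 bytes")


def parse_obu_header(buf, pos=0):
    """Return (header fields, payload start offset, offset just past the OBU)."""
    first = buf[pos]
    header = {
        "obu_type": first >> 3,                       # 5 bits
        "obu_redundant_copy": (first >> 2) & 1,       # 1 bit
        "obu_trimming_status_flag": (first >> 1) & 1, # 1 bit
        "obu_extension_flag": first & 1,              # 1 bit
    }
    header["obu_size"], pos = read_leb128(buf, pos + 1)
    obu_end = pos + header["obu_size"]
    if header["obu_trimming_status_flag"]:
        header["num_samples_to_trim_at_end"], pos = read_leb128(buf, pos)
        header["num_samples_to_trim_at_start"], pos = read_leb128(buf, pos)
    if header["obu_extension_flag"]:
        size, pos = read_leb128(buf, pos)
        header["extension_header_size"] = size
        pos += size  # extension_header_bytes are ignored
    return header, pos, obu_end


# obu_type = 5 (OBU_IA_Audio_Frame) with trimming: 0 at end, 200 at start.
data = bytes([0x2A, 0x04, 0x00, 0xC8, 0x01, 0xAA])
header, payload_start, obu_end = parse_obu_header(data)
```

Returning `obu_end` separately lets a caller skip any extra bytes at the end of the OBU, as required for parsers of this version.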

3.3. Reserved OBU Syntax and Semantics

Reserved OBUs SHALL be ignored by parsers compliant with this version of the specification. Future versions of the specification MAY define semantics for these reserved OBUs that would only be supported by parsers compliant with these future versions.

3.4. IA Sequence Header OBU Syntax and Semantics

This section specifies the OBU payload of OBU_IA_Sequence_Header.

This OBU is used to indicate the start of an IA Sequence. Therefore, the first OBU of an IA Sequence SHALL be OBU_IA_Sequence_Header.

NOTE: When an IA Sequence is stored in a file, the IA Sequence Header OBU can be used to detect that the file contains an IA Sequence.

This OBU MAY be placed frequently within one single IA Sequence for an application such as a broadcasting or multicasting scenario. In that case, all IA Sequence Header OBUs other than the first one SHALL be marked as redundant (i.e. obu_redundant_copy = 1).

Syntax

class ia_sequence_header_obu() {
  unsigned int (32) ia_code;
  unsigned int (8) primary_profile;
  unsigned int (8) additional_profile;
}

Semantics

ia_code indicates a ‘four-character code’ (4CC), ‘iamf’.

NOTE: When IA OBUs are delivered over a protocol that does not provide explicit IA Sequence boundaries, a parser may locate the IA Sequence start by searching for the code iamf preceded by specific OBU header values. For example, assuming obu_extension_flag is set to 0, and because obu_trimming_status_flag is set to 0 for an IA Sequence Header OBU, the OBU header can be 0xF806 or 0xFC06.
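
The search described in the NOTE could look like the following Python sketch. The function name `find_ia_sequence_start` is hypothetical. The two header patterns follow from obu_type = 31, obu_redundant_copy = 0 or 1, both remaining flags = 0, and obu_size = 6 (the 4 + 1 + 1 bytes of the payload).

```python
IA_CODE = b"iamf"
# 0xF8 = 11111 0 0 0 (non-redundant), 0xFC = 11111 1 0 0 (redundant copy);
# the second byte is obu_size = 6.
HEADER_CANDIDATES = (b"\xF8\x06", b"\xFC\x06")


def find_ia_sequence_start(data: bytes, start: int = 0) -> int:
    """Return the offset of the first candidate IA Sequence Header OBU, or -1."""
    hits = [data.find(h + IA_CODE, start) for h in HEADER_CANDIDATES]
    hits = [h for h in hits if h != -1]
    return min(hits) if hits else -1


stream = b"\x00\x01" + b"\xF8\x06iamf\x00\x00"
offset = find_ia_sequence_start(stream)
```

A match is only a candidate: the same six bytes could occur by chance inside another OBU payload, so a parser would normally confirm it by parsing the OBUs that follow.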

primary_profile indicates the primary profile this IA Sequence complies with. Parsers compliant with this version of the specification SHALL discard the IA Sequence when this value is not one they support.

The mappings below apply to both primary_profile and additional_profile.

0: Simple Profile
1: Base Profile
2~255: Reserved

additional_profile indicates an additional profile of this specification to which this IA Sequence complies. If an IA Sequence only complies with the primary_profile profile, this field SHALL be set to the same value as primary_profile.

NOTE: If a future version defines a new profile, e.g. HypotheticalProfile, that is backward compatible with the Base profile, for example by defining new OBUs that would be ignored by the Base-compatible parser, an IA writer can decide to set the primary_profile to Base while setting the additional_profile to HypotheticalProfile. This way an old processor will know it can parse and produce an acceptable rendering, while a new processor still knows it can produce a better result because it will not ignore the additional features.
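
As an illustration, the six payload bytes of this OBU can be unpacked as follows in Python. The function name `parse_ia_sequence_header` and the dictionary keys are hypothetical; the profile names come from the mapping above.

```python
import struct

PROFILE_NAMES = {0: "Simple Profile", 1: "Base Profile"}


def parse_ia_sequence_header(payload: bytes) -> dict:
    """Unpack ia_code (32 bits), primary_profile (8) and additional_profile (8)."""
    if len(payload) < 6:
        raise ValueError("IA Sequence Header payload needs at least 6 bytes")
    ia_code, primary, additional = struct.unpack_from(">4sBB", payload)
    if ia_code != b"iamf":
        raise ValueError("ia_code is not 'iamf'")
    return {
        "primary_profile": primary,
        "primary_profile_name": PROFILE_NAMES.get(primary, "Reserved"),
        "additional_profile": additional,
        "additional_profile_name": PROFILE_NAMES.get(additional, "Reserved"),
    }


info = parse_ia_sequence_header(b"iamf\x00\x01")
```

A parser for this version would then discard the IA Sequence if `primary_profile` is not a profile it supports.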

3.5. Codec Config OBU Syntax and Semantics

This section specifies the OBU payload of OBU_IA_Codec_Config.

Syntax

class codec_config_obu() {
  leb128() codec_config_id;  
  codec_config();
}
class codec_config() {
  unsigned int (32) codec_id;
  leb128() num_samples_per_frame;
  signed int (16) audio_roll_distance;
  decoder_config(codec_id);
}

Semantics

codec_config_id defines an identifier for a codec configuration. Within an IA Sequence, there SHALL be exactly one non-redundant Codec Config OBU with a given identifier. Each Codec Config OBU in the first Descriptors within the IA Sequence is regarded as a non-redundant OBU regardless of the value of its obu_redundant_copy. Audio Elements that need a decoder configuration based on this codec configuration refer to this identifier.

codec_id indicates a ‘four-character code’ (4CC) to identify the codec used to generate the coded Audio Substreams. For this version of the specification, it SHALL be set to one of the four codec_id values defined below:

Opus: All coded Audio Substreams of all Audio Elements referring to this codec configuration SHALL comply with the specification [RFC6716] and the decoder_config() structure SHALL comply with the constraints given in § 3.11.1 OPUS Specific.
mp4a: All coded Audio Substreams of all Audio Elements referring to this codec configuration SHALL comply with the specification [AAC] and the decoder_config() structure SHALL comply with the constraints given in § 3.11.2 AAC-LC Specific.
fLaC: All coded Audio Substreams of all Audio Elements referring to this codec configuration SHALL comply with the specification [FLAC] and the decoder_config() structure SHALL comply with the constraints given in § 3.11.3 FLAC Specific.
ipcm: All coded Audio Substreams of all Audio Elements referring to this codec configuration SHALL be audio samples for linear PCM (LPCM) and the decoder_config() structure SHALL comply with the constraints given in § 3.11.4 LPCM Specific.

Parsers compliant with this version of the specification SHALL ignore Codec Config OBUs with an unknown codec_id.

NOTE: ipcm should not be confused with lpcm, which is another 4CC used to identify codecs in other container formats (e.g. QuickTime).

num_samples_per_frame indicates the frame length, in samples, of the audio_frame() provided by audio_frame_obu(). It SHALL NOT be set to zero. If the decoder_config() structure for a given codec specifies a value for the frame length, the two values SHALL be equal.

audio_roll_distance indicates how many audio frames prior to the current audio frame need to be decoded (and the decoded samples discarded) to set the decoder in a state that will produce the perfect decoded audio signal. It SHALL always be a negative value or zero. For some audio codecs, even if an audio frame can be decoded independently, the decoded signal after decoding only that frame may not represent a perfect, decoded audio signal, even ignoring compression artifacts. This can be due to overlap transforms. While potentially acceptable when starting to decode an Audio Substream, it may be problematic when automatically switching between similar Audio Substreams of different quality and/or bitrate.

It SHALL be set to -R when codec_id is set to Opus, where R is roundup(3840 / num_samples_per_frame).
It SHALL be set to -1 when codec_id is set to mp4a.
It SHALL be set to 0 when codec_id is set to fLaC or ipcm.
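
The three rules above can be sketched in Python as follows. The function name `audio_roll_distance` is hypothetical, and the sketch assumes that R for Opus is the ceiling of 3840 divided by num_samples_per_frame, i.e. the number of whole frames needed to cover 3840 samples.

```python
def audio_roll_distance(codec_id: str, num_samples_per_frame: int) -> int:
    """Return the audio_roll_distance expected for a given codec configuration."""
    if num_samples_per_frame <= 0:
        raise ValueError("num_samples_per_frame SHALL NOT be zero")
    if codec_id == "Opus":
        # Ceiling division without floating point (assumed reading of roundup).
        r = -(-3840 // num_samples_per_frame)
        return -r
    if codec_id == "mp4a":
        return -1
    if codec_id in ("fLaC", "ipcm"):
        return 0
    raise ValueError(f"unknown codec_id: {codec_id!r}")


# 960-sample Opus frames: four frames cover 3840 samples.
opus_roll = audio_roll_distance("Opus", 960)
```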

decoder_config() specifies the set of codec parameters required to decode a coded Audio Substream with the codec identified by codec_id. It is byte aligned.

3.6. Audio Element OBU Syntax and Semantics

This section specifies the OBU payload of OBU_IA_Audio_Element.

Syntax

class audio_element_obu() {
  leb128() audio_element_id;
  unsigned int (3) audio_element_type;
  unsigned int (5) reserved;
  
  leb128() codec_config_id;  
  leb128() num_substreams;
  for (i = 0; i < num_substreams; i++) {
    leb128() audio_substream_id;
  }
  
  leb128() num_parameters;
  for (i = 0; i < num_parameters; i++) {
    leb128() param_definition_type;
    if (param_definition_type == PARAMETER_DEFINITION_DEMIXING) {
        DemixingParamDefinition demixing_info;
    }
    else if (param_definition_type == PARAMETER_DEFINITION_RECON_GAIN) {
        ReconGainParamDefinition recon_gain_info;
    }
    else if (param_definition_type > 2) {
        leb128() param_definition_size;
        unsigned int (8*param_definition_size) param_definition_bytes;
    }
  }
  if (audio_element_type == CHANNEL_BASED) {
    scalable_channel_layout_config();
  } else if (audio_element_type == SCENE_BASED) {
    ambisonics_config();
  } else {
    leb128() audio_element_config_size;
    unsigned int (8*audio_element_config_size) audio_element_config_bytes;
  }
}

class DemixingParamDefinition() extends ParamDefinition() {
  default_demixing_info_parameter_data();
}

class default_demixing_info_parameter_data() extends demixing_info_parameter_data() {
  unsigned int (4) default_w;
  unsigned int (4) reserved;
}

class ReconGainParamDefinition() extends ParamDefinition() {
}

Semantics

audio_element_id defines an identifier for an Audio Element. Within an IA Sequence, there SHALL be exactly one non-redundant Audio Element OBU with a given identifier. Each Audio Element OBU in the first Descriptors within the IA Sequence is regarded as a non-redundant OBU regardless of the value of its obu_redundant_copy. Mix Presentations that use an Audio Element refer to this identifier.

audio_element_type specifies the audio representation of this Audio Element, which is constructed from one or more Audio Substreams. Parsers compliant with this version of the specification SHALL ignore Audio Element OBUs with a reserved audio_element_type.

audio_element_type: The type of audio representation.
   0    : CHANNEL_BASED
   1    : SCENE_BASED
  2~7   : Reserved

codec_config_id indicates the identifier for the codec configuration which this Audio Element refers to. Parsers compliant with this version of the specification SHALL ignore Audio Element OBUs with a codec_config_id identifying an unknown codec_id.

num_substreams specifies the number of Audio Substreams that are used to reconstruct this Audio Element. It SHALL NOT be set to 0.

audio_substream_id indicates the identifier for an Audio Substream which this Audio Element refers to.

Let a particular ChannelGroup's Audio Substreams be indexed as [c, n_c], where

c = [1, ..., C] is the ChannelGroup index and C is the number of ChannelGroups.
n_c = [1, ..., N_c] is the Audio Substream index in the c-th ChannelGroup and N_c is the number of Audio Substreams in the c-th ChannelGroup.
The i-th audio_substream_id maps to a ChannelGroup's Audio Substreams as follows, where i is the index of the array:

[[1, 1], [1, 2], ..., [1, N_1], [2, 1], [2, 2], ..., [2, N_2], ..., [C, 1], [C, 2], ..., [C, N_C]]

A ChannelGroup is defined in § 8 IAMF Generation Process (Informative). The order of the Audio Substreams in each ChannelGroup, i.e. the semantics of n_c, is specified in § 3.6.2 Scalable Channel Layout Config Syntax and Semantics.
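
The index mapping above can be sketched in Python. The function name `map_substreams_to_channel_groups` is hypothetical, and the per-group counts N_c are taken as an input here; in a real parser they would come from the layout configuration of the Audio Element.

```python
def map_substreams_to_channel_groups(audio_substream_ids, substreams_per_group):
    """Pair each audio_substream_id with its 1-based [c, n_c] index.

    substreams_per_group is the list [N_1, ..., N_C].
    """
    if sum(substreams_per_group) != len(audio_substream_ids):
        raise ValueError("N_1 + ... + N_C must equal num_substreams")
    ids = iter(audio_substream_ids)
    mapping = []
    for c, count in enumerate(substreams_per_group, start=1):
        for n_c in range(1, count + 1):
            mapping.append((next(ids), c, n_c))
    return mapping


# Five substreams split into two ChannelGroups of sizes 3 and 2.
mapping = map_substreams_to_channel_groups([10, 11, 12, 20, 21], [3, 2])
```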

num_parameters specifies the number of Parameter Substreams that are used by the algorithms specified in this Audio Element.

When audio_element_type = 0, this field SHALL be set to 0, 1, or 2 for this version of the specification.
When audio_element_type = 1, this field SHALL be set to 0 for this version of the specification.
Parsers compliant with this version of the specification SHALL be able to parse any value of num_parameters.

NOTE: For a given audio_element_type, a future version of the specification may define a new Parameter Substream which may be ignored by IA decoders compliant with this version of the specification. In that case, a new param_definition_type will be defined in a future version of Audio Element OBU.

param_definition_type specifies the type of the parameter definition. All parameter definition types described in this version of the specification are listed in the table below, along with their associated parameter definitions.

The type PARAMETER_DEFINITION_MIX_GAIN SHALL NOT be present in an Audio Element OBU.
A given type SHALL NOT appear more than once in one Audio Element OBU.
When codec_id = fLaC or ipcm, the type PARAMETER_DEFINITION_RECON_GAIN SHALL NOT be present.
When num_layers > 1, the type PARAMETER_DEFINITION_RECON_GAIN SHALL be present.
When the highest loudspeaker_layout of (non-)scalable channel audio is less than or equal to 3.1.2ch, the type PARAMETER_DEFINITION_DEMIXING SHALL NOT be present.
When the highest loudspeaker_layout of scalable channel audio (num_layers > 1) is greater than 3.1.2ch, both PARAMETER_DEFINITION_DEMIXING and PARAMETER_DEFINITION_RECON_GAIN SHALL be present.
When num_layers = 1 and loudspeaker_layout is greater than 3.1.2ch, the type PARAMETER_DEFINITION_DEMIXING MAY be present.
An OBU parser that is conformant with this version of the specification SHALL be able to parse param_definition_type = P (where P > 2) and param_definition_size, and SHALL be able to ignore or skip the bytes indicated by param_definition_size.

param_definition_type	Parameter definition type	Parameter definition
0	PARAMETER_DEFINITION_MIX_GAIN	MixGainParamDefinition
1	PARAMETER_DEFINITION_DEMIXING	DemixingParamDefinition
2	PARAMETER_DEFINITION_RECON_GAIN	ReconGainParamDefinition
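
The forward-compatibility rule above can be sketched as follows. This is a minimal illustration, not a normative parser: the function names read_leb128 and read_param_definition_type are hypothetical, and it assumes that param_definition_type is coded as leb128() and that, for values greater than 2, it is immediately followed by param_definition_size and param_definition_bytes.

```python
PARAMETER_DEFINITION_MIX_GAIN = 0
PARAMETER_DEFINITION_DEMIXING = 1
PARAMETER_DEFINITION_RECON_GAIN = 2


def read_leb128(buf, pos):
    """Decode an unsigned LEB128 value at buf[pos]; return (value, new_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7


def read_param_definition_type(buf, pos):
    """Return (param_definition_type, new_pos, skipped_byte_count).

    For known types the caller is expected to parse the matching parameter
    definition starting at new_pos. For unknown types (> 2) the reserved
    bytes are skipped, as required of parsers of this version.
    """
    ptype, pos = read_leb128(buf, pos)
    if ptype == PARAMETER_DEFINITION_MIX_GAIN:
        # MIX_GAIN SHALL NOT be present in an Audio Element OBU.
        raise ValueError("PARAMETER_DEFINITION_MIX_GAIN in Audio Element OBU")
    if ptype in (PARAMETER_DEFINITION_DEMIXING, PARAMETER_DEFINITION_RECON_GAIN):
        return ptype, pos, 0
    size, pos = read_leb128(buf, pos)  # param_definition_size
    return ptype, pos + size, size     # skip param_definition_bytes
```

For example, the bytes 03 02 AA BB describe an unknown type 3 with two reserved bytes, which the sketch skips in full before reading the next definition.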

demixing_info provides the parameter definition for the demixing information to reconstruct channel audios according to loudspeaker_layout from scalable channel audio. The parameter definition is provided by DemixingParamDefinition() and the corresponding parameter data to be provided in parameter blocks is specified in demixing_info_parameter_data() .

In this parameter definition, parameter_rate SHALL be set to the sample rate of this Audio Element and param_definition_mode SHALL be set to 0.
- duration SHALL be the same as num_samples_per_frame of this Audio Element.
- num_subblocks SHALL be set to 1.
- constant_subblock_duration SHALL be the same as duration.

recon_gain_info provides the parameter definition for the gain value to reconstruct channel audios according to loudspeaker_layout from scalable channel audio. The parameter definition is provided by ReconGainParamDefinition() and the corresponding parameter data to be provided in parameter blocks is specified in recon_gain_info_parameter_data() .

In this parameter definition, parameter_rate SHALL be set to the sample rate of this Audio Element and param_definition_mode SHALL be set to 0.
- duration SHALL be the same as num_samples_per_frame of this Audio Element.
- num_subblocks SHALL be set to 1.
- constant_subblock_duration SHALL be the same as duration.

param_definition_size indicates the size in bytes of param_definition_bytes .

param_definition_bytes represents reserved bytes for future use when new param_definition_type values are defined. Parsers compliant with this version of the specification SHALL ignore these bytes.

scalable_channel_layout_config() is a class that provides the metadata required for combining the Audio Substream s identified here in order to reconstruct a scalable channel layout.

ambisonics_config() is a class that provides the metadata required for combining the Audio Substream s identified here in order to reconstruct an Ambisonics layout.

audio_element_config_size indicates the size in bytes of audio_element_config_bytes .

audio_element_config_bytes represents reserved bytes for future use when new audio_element_type values are defined. Parsers compliant with this version of the specification SHALL ignore these bytes.

default_demixing_info_parameter_data() is a class that provides the default parameter data for demixing to apply to all audio samples when there are no Parameter Block OBU s (with parameter_id defined in this DemixingParamDefinition()) provided.

In this class, w_idx_offset of the demixing_info_parameter_data() SHALL be ignored.
Instead of that, default_w directly indicates the weight value w(k).

default_w indicates the weight value w(k) for TF2toT2 de-mixer specified in § 7.2.2 De-mixer .

Mapping of default_w to w(k) SHOULD be as follows:

default_w :   w(k)
   0      :    0
   1      :  0.0179
   2      :  0.0391
   3      :  0.0658
   4      :  0.1038
   5      :  0.25
   6      :  0.3962
   7      :  0.4342
   8      :  0.4609
   9      :  0.4821
   10     :  0.5
11 ~ 15   :  reserved
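
The mapping table above can be expressed as a simple lookup. The sketch below is illustrative only; the names DEFAULT_W_TABLE and default_w_to_weight are hypothetical, and rejecting reserved values with an exception is an assumption about error handling rather than something this specification requires.

```python
# w(k) values indexed by default_w, copied from the table above.
DEFAULT_W_TABLE = (
    0.0, 0.0179, 0.0391, 0.0658, 0.1038, 0.25,
    0.3962, 0.4342, 0.4609, 0.4821, 0.5,
)


def default_w_to_weight(default_w):
    """Map default_w (0..10) to the weight value w(k)."""
    if not 0 <= default_w <= 15:
        raise ValueError("default_w out of range")
    if default_w >= len(DEFAULT_W_TABLE):
        raise ValueError("reserved default_w value")  # 11..15 are reserved
    return DEFAULT_W_TABLE[default_w]
```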

A default recon gain value of 0 dB is implied when there are no Parameter Block OBUs (with parameter_id defined in this ReconGainParamDefinition()) provided.

3.6.1. Parameter Definition Syntax and Semantics

Parameter definition classes inherit from the abstract ParamDefinition() class.

For a given Parameter Substream , its timeline is fully aligned with the timeline of the Audio Element to which the given Parameter Substream will be applied, where the timeline of the Audio Element is post-coding (i.e. before trimming data). So, when we assume the same sample rate between the given Parameter Substream and the Audio Element , the start timestamp and the duration of the given Parameter Substream are the same as those of the Audio Element .

Syntax

abstract class ParamDefinition() {
  leb128() parameter_id;
  leb128() parameter_rate;
  unsigned int (1) param_definition_mode;
  unsigned int (7) reserved;
  if (param_definition_mode == 0) {
    leb128() duration;
    leb128() constant_subblock_duration;
    if (constant_subblock_duration == 0) {
      leb128() num_subblocks;
      for (i=0; i< num_subblocks; i++) {
        leb128() subblock_duration;
      }
    }
    
  }
}

Semantics

parameter_id indicates the identifier for the Parameter Substream which this parameter definition refers to. There SHALL be one unique parameter_id per Parameter Substream.

parameter_rate specifies the rate used by this Parameter Substream , expressed as ticks per second. Time-related fields associated with this Parameter Substream , such as durations, SHALL be expressed in the number of ticks.

The rate SHALL be such a value that (the rate x num_samples_per_frame ) / (the sample rate of Audio Element ) is a non-zero integer.
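
As a worked illustration of this constraint, the check below tests whether a candidate rate yields a non-zero whole number of ticks per audio frame. The function name is_valid_parameter_rate is hypothetical.

```python
def is_valid_parameter_rate(parameter_rate, num_samples_per_frame, sample_rate):
    """True if (rate x num_samples_per_frame) / sample_rate is a non-zero integer."""
    scaled = parameter_rate * num_samples_per_frame
    return scaled != 0 and scaled % sample_rate == 0
```

With a 48000 Hz Audio Element and 960 samples per frame, a rate of 48000 gives 960 ticks per frame and a rate of 100 gives 2 ticks per frame, so both are acceptable. A rate of 25 gives half a tick per frame and is not.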

param_definition_mode indicates if this parameter definition specifies duration , num_subblocks , constant_subblock_duration and subblock_duration fields for the parameter blocks associated to the parameter_id .

When this field is set to 0, all of duration , num_subblocks , constant_subblock_duration , and subblock_duration fields SHALL be specified in this parameter definition mapped to the parameter_id . In that case, none of the parameter blocks associated with this parameter definition SHALL specify duration , num_subblocks , constant_subblock_duration , and subblock_duration fields.
When this field is set to 1, none of duration , num_subblocks , constant_subblock_duration , and subblock_duration fields SHALL be specified in this parameter definition. In that case, each of the parameter blocks associated with this parameter definition SHALL specify its own duration , num_subblocks , constant_subblock_duration , and subblock_duration fields.

duration specifies the duration for which all of the parameter blocks associated with this parameter definition are valid and applicable. It SHALL NOT be set to 0.

constant_subblock_duration specifies the duration of each subblock, in the case where all subblocks except the last subblock have equal durations. If all subblocks except the last subblock do not have equal durations, the value of constant_subblock_duration SHALL be set to 0.

Given D = the value of duration , NS = the value of num_subblocks , CSD = the value of constant_subblock_duration and SD = the value of subblock_duration .

When CSD != 0, num_subblocks is implicitly calculated as NS = roundup( D ÷ CSD ).
- If NS x CSD > D , the actual duration of the last subblock SHALL be D - ( NS - 1) x CSD .
When CSD = 0, the summation of all SD s in this parameter block SHALL be equal to D .
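
The two cases above can be sketched as a single helper that returns the effective duration of every subblock. This is a minimal sketch; the function name subblock_durations and the explicit_durations argument (standing in for the list of subblock_duration values) are hypothetical.

```python
def subblock_durations(duration, constant_subblock_duration, explicit_durations=()):
    """Return the list of subblock durations, in ticks."""
    if duration == 0:
        raise ValueError("duration SHALL NOT be 0")
    if constant_subblock_duration != 0:
        csd = constant_subblock_duration
        ns = -(-duration // csd)           # roundup(D / CSD)
        last = duration - (ns - 1) * csd   # shorter last subblock if NS x CSD > D
        return [csd] * (ns - 1) + [last]
    durations = list(explicit_durations)
    if any(sd == 0 for sd in durations):
        raise ValueError("subblock_duration SHALL NOT be 0")
    if sum(durations) != duration:
        raise ValueError("sum of subblock_duration values must equal duration")
    return durations
```

For D = 960 and CSD = 256, NS is 4 and the last subblock lasts 960 - 3 x 256 = 192 ticks.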

num_subblocks specifies the number of different sets of parameter values specified in each parameter block with the same parameter_id , where each set describes a different subblock of the timeline, contiguously.

subblock_duration specifies the duration for the given subblock. It SHALL NOT be set to 0.

Each value of duration , constant_subblock_duration , and subblock_duration SHALL be expressed as the number of ticks at the parameter_rate specified in the corresponding parameter definition.
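
To illustrate how the fields of ParamDefinition() follow one another in the bitstream, here is a minimal parsing sketch. It is not a conformant parser: the names read_leb128 and parse_param_definition are hypothetical, the result is returned as a plain dictionary for brevity, and no validation of the SHALL constraints is performed.

```python
def read_leb128(buf, pos):
    """Decode an unsigned LEB128 value at buf[pos]; return (value, new_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7


def parse_param_definition(buf, pos=0):
    """Parse the common ParamDefinition() fields; return (fields, new_pos)."""
    fields = {}
    fields["parameter_id"], pos = read_leb128(buf, pos)
    fields["parameter_rate"], pos = read_leb128(buf, pos)
    # 1-bit param_definition_mode followed by 7 reserved bits.
    fields["param_definition_mode"] = buf[pos] >> 7
    pos += 1
    if fields["param_definition_mode"] == 0:
        fields["duration"], pos = read_leb128(buf, pos)
        fields["constant_subblock_duration"], pos = read_leb128(buf, pos)
        if fields["constant_subblock_duration"] == 0:
            num_subblocks, pos = read_leb128(buf, pos)
            durations = []
            for _ in range(num_subblocks):
                sd, pos = read_leb128(buf, pos)
                durations.append(sd)
            fields["subblock_durations"] = durations
    return fields, pos
```

When param_definition_mode is 1, the sketch stops after the mode byte, because the timing fields are then carried by each parameter block instead.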

3.6.2. Scalable Channel Layout Config Syntax and Semantics

scalable_channel_layout_config() contains information regarding the configuration of scalable channel audio.

Syntax

class scalable_channel_layout_config() {
  unsigned int (3) num_layers;
  unsigned int (5) reserved;
  for (i = 1; i <= num_layers; i++) {
    channel_audio_layer_config(i);
  }
}
class channel_audio_layer_config(i) {
  unsigned int (4) loudspeaker_layout(i);
  unsigned int (1) output_gain_is_present_flag(i);
  unsigned int (1) recon_gain_is_present_flag(i);
  unsigned int (2) reserved;
  unsigned int (8) substream_count(i);
  unsigned int (8) coupled_substream_count(i);
  if (output_gain_is_present_flag(i) == 1) {
    unsigned int (6) output_gain_flags(i);
    unsigned int (2) reserved;
    signed int (16) output_gain(i);
  }
}
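
The bit layout of the two classes above can be illustrated with a short parsing sketch. It assumes that multi-bit fields are packed most significant bit first and that output_gain is stored big-endian; LayerConfig and parse_scalable_channel_layout_config are hypothetical names, and reserved bits are simply discarded.

```python
import struct
from dataclasses import dataclass
from typing import Optional


@dataclass
class LayerConfig:
    loudspeaker_layout: int
    output_gain_is_present: bool
    recon_gain_is_present: bool
    substream_count: int
    coupled_substream_count: int
    output_gain_flags: Optional[int] = None
    output_gain_db: Optional[float] = None


def parse_scalable_channel_layout_config(buf, pos=0):
    """Return (list of LayerConfig, new_pos)."""
    num_layers = buf[pos] >> 5            # 3 bits, then 5 reserved bits
    pos += 1
    layers = []
    for _ in range(num_layers):
        head = buf[pos]
        layer = LayerConfig(
            loudspeaker_layout=head >> 4,                   # 4 bits
            output_gain_is_present=bool((head >> 3) & 1),   # 1 bit
            recon_gain_is_present=bool((head >> 2) & 1),    # 1 bit
            substream_count=buf[pos + 1],
            coupled_substream_count=buf[pos + 2],
        )
        pos += 3
        if layer.output_gain_is_present:
            layer.output_gain_flags = buf[pos] >> 2         # 6 bits, then 2 reserved
            (raw,) = struct.unpack_from(">h", buf, pos + 1) # signed 16-bit
            layer.output_gain_db = raw / 256.0              # Q7.8 fixed point
            pos += 3
        layers.append(layer)
    return layers, pos
```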

When an Audio Element is composed of G(r) number of Audio Substream s, scalable channel audio for the Audio Element is layered into num_layers = r number of ChannelGroup s.

The order of ChannelGroup s in each Temporal Unit SHALL be same as the order of channel_audio_layer_config()s in scalable_channel_layout_config().
ChannelGroup #q consists of G(q) - G(q-1) Audio Substreams, where q = 1, 2, ..., r and G(0) = 0.
Audio Frames is a set of Audio Frame OBU s with the same start timestamp of the Audio Element for scalable channel audio. Each of them comes from each coded Audio Substream .
Every Audio Frame SHALL have the same number of Audio Frame OBU s.
A Parameter Block OBU MAY be present with Audio Frames, but is not required to be.

Immersive Audio Sequence with scalable channel audio (before OBU packing)

The IA decoder SHALL select one of one or more channel audios provided by scalable channel audio. The IA decoder SHOULD select the appropriate channel audio according to the following rules, in order:

The IA decoder SHOULD first attempt to select the channel audio whose loudspeaker layout matches the physical playback layout.
If there is no match, the IA decoder SHOULD select the channel audio with the closest specified loudspeaker layout to the physical layout and then apply up or down-mixing appropriately, after decoding and reconstruction of the channel audio. § 10.2.2 Annex B-2: Down-mix Mechanism and § 7.6 Down-mix Matrix provide examples of dynamic and static down-mixing matrices for some common layouts that MAY be used.

Semantics

num_layers indicates the number of ChannelGroup s for scalable channel audio. It SHALL NOT be set to zero and its maximum number SHALL be limited to 6.

For Binaural, this field SHALL be set to 1.

channel_audio_layer_config() is a class that provides information regarding the configuration of ChannelGroup for scalable channel audio. channel_audio_layer_config(i) provides information regarding the configuration of ChannelGroup #i.

loudspeaker_layout indicates the channel layout for the channels to be reconstructed from the preceding ChannelGroups and the current ChannelGroup among the ChannelGroups for scalable channel audio. When a reserved value for loudspeaker_layout is used, parsers compliant with this version of the specification SHOULD skip that layer and all subsequent ones, if any.

In this version of the specification, loudspeaker_layout indicates one of 10 channel layouts including Mono, Stereo, 5.1ch, 5.1.2ch, 5.1.4ch, 7.1ch, 7.1.2ch, 7.1.4ch, 3.1.2ch and Binaural, where

Stereo is the loudspeaker configuration as depicted in Loudspeaker configuration for Sound System A (0+2+0) of [ITU2051-3] .
5.1ch is the loudspeaker configuration as depicted in Loudspeaker configuration for Sound System B (0+5+0) of [ITU2051-3] .
5.1.2ch is the loudspeaker configuration as depicted in Loudspeaker configuration for Sound System C (2+5+0) of [ITU2051-3] .
5.1.4ch is the loudspeaker configuration as depicted in Loudspeaker configuration for Sound System D (4+5+0) of [ITU2051-3] .
7.1ch is the loudspeaker configuration as depicted in Loudspeaker configuration for Sound System I (0+7+0) of [ITU2051-3] .
7.1.2ch is the combination of the loudspeaker configuration as depicted in Loudspeaker configuration for Sound System I (0+7+0) of [ITU2051-3] and the left and right top front pair of the loudspeaker configuration as depicted in Loudspeaker configuration for Sound System J (4+7+0) of [ITU2051-3] .
7.1.4ch is the loudspeaker configuration as depicted in Loudspeaker configuration for Sound System J (4+7+0) of [ITU2051-3] .
3.1.2ch is the front subset (L/C/R/Ltf/Rtf/LFE) of 7.1.4ch .

Loudspeaker Layout (4 bits) :  Channel Layout  : Loudspeaker Location Ordering
             0000           :       Mono       : C
             0001           :      Stereo      : L/R
             0010           :      5.1ch       : L/C/R/Ls/Rs/LFE
             0011           :     5.1.2ch      : L/C/R/Ls/Rs/Ltf/Rtf/LFE
             0100           :     5.1.4ch      : L/C/R/Ls/Rs/Ltf/Rtf/Ltr/Rtr/LFE
             0101           :      7.1ch       : L/C/R/Lss/Rss/Lrs/Rrs/LFE
             0110           :     7.1.2ch      : L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/LFE
             0111           :     7.1.4ch      : L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/Ltb/Rtb/LFE
             1000           :     3.1.2ch      : L/C/R/Ltf/Rtf/LFE
             1001           :     Binaural     : L/R
            others          :     reserved     :

Where, C: Center, L: Left, R: Right, Ls: Left Surround, Lss: Left Side Surround,
Rs: Right Surround, Rss: Right Side Surround, Lrs: Left Rear Surround, Rrs: Right Rear Surround,
Ltf: Left Top Front, Rtf: Right Top Front, Ltr: Left Top Rear, Rtr: Right Top Rear,
Ltb: Left Top Back, Rtb: Right Top Back, LFE: Low-Frequency Effects

For a given input audio with audio_element_type = CHANNEL_BASED, if the input audio has height channels (e.g. 7.1.4ch or 5.1.2ch), it is RECOMMENDED to use channel layouts with height channels (i.e., higher than or equal to 3.1.2ch) for all loudspeaker_layouts.

Examples for RECOMMENDED list of channel layouts: 3.1.2ch/5.1.2ch, 3.1.2ch/5.1.2ch/7.1.4ch, 5.1.2ch/7.1.4ch, etc.
Examples for NOT RECOMMENDED list of channel layouts: 2ch/3.1.2ch/5.1.2ch, 2ch/3.1.2ch/5.1.2ch/7.1.4ch, 2ch/5.1.2ch/7.1.4ch, 2ch/7.1.4ch, etc.

NOTE: Content providers may be satisfied with down-mixed audio that has no height channels, even though the down-mix mechanism specified in § 10.2.2 Annex B-2: Down-mix Mechanism drops height channels when it down-mixes input channel audio with height channels to surround channels (for example, from 7.1.4ch to Mono, Stereo, 5.1ch or 7.1ch). In that case, an encoder may generate scalable audio that includes the down-mixed audio without height channels. In other words, this specification does not prohibit scalable audio from including down-mixed audio without height channels that is derived from input channel audio with height channels.

NOTE: The Ltr and Rtr of 5.1.4ch down-mixed from 7.1.4ch is within the range of Ltb and Rtb of 7.1.4ch.

output_gain_is_present_flag indicates whether output_gain information fields for the ChannelGroup are present.

0: No output_gain information fields for the ChannelGroup are present.
1: output_gain information fields for the ChannelGroup are present. In this case, the output_gain_flags and output_gain fields are present.

recon_gain_is_present_flag indicates whether recon_gain information fields for the ChannelGroup are present in recon_gain_info_parameter_data().

0: No recon_gain information fields for the ChannelGroup are present in recon_gain_info_parameter_data().
1: recon_gain information fields for the ChannelGroup are present in recon_gain_info_parameter_data(). In this case, the recon_gain_flags and recon_gain fields are present.

substream_count specifies the number of Audio Substream s. The sum of all substream_count s in this OBU SHALL be the same as num_substreams in this OBU. It SHALL NOT be set to 0.

coupled_substream_count specifies the number of referenced Audio Substream s, each of which is coded as coupled stereo channels.

Each pair of coupled stereo channels in the same ChannelGroup SHALL be coded in stereo mode to generate one single coded Audio Substream and each of the non-coupled channels in the same ChannelGroup SHALL be coded in mono mode to generate one single coded Audio Substream .

Coupled stereo channels : L/R, Ls/Rs, Lss/Rss, Lrs/Rrs, Ltf/Rtf, Ltb/Rtb
Non-coupled channels: C, LFE, L

The order of Audio Substream s in each ChannelGroup SHALL be as follows:

Coupled substreams come first, followed by non-coupled substreams.
Coupled substreams for surround channels come first, followed by those for top channels.
Coupled substreams for front channels come first, followed by those for side, rear and back channels.
Coupled substreams for side channels come first, followed by those for rear channels.
The Center channel comes first, followed by LFE and then L.
Here, a non-coupled substream is a coded Audio Substream from one of the non-coupled channels.

output_gain_flags indicates the channels which output_gain is applied to. If a bit is set to 1, output_gain SHALL be applied to the channel. Otherwise, output_gain SHALL NOT be applied to the channel.

Bit position : Channel Name
    b5(MSB)  : Left channel (L1, L2, L3)
      b4     : Right channel (R2, R3)
      b3     : Left Surround channel (Ls5)
      b2     : Right Surround channel (Rs5)
      b1     : Left Top Front channel (Ltf)
      b0     : Right Top Front channel (Rtf)

output_gain indicates the gain value to be applied to the mixed channels which are indicated by output_gain_flags , where each mixed channel is generated by downmixing two or more input channels. It is 20*log10 of the factor by which to scale the mixed channels. It is stored in a 16-bit, signed, two’s complement fixed-point value with 8 fractional bits (i.e. Q7.8 in [Q-Format] ).
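
As an illustration of the two fields above, the sketch below lists the channels selected by output_gain_flags and converts a stored output_gain value to dB and to a linear scale factor. The function names are hypothetical, the short channel labels are taken from the bit-position table, and big-endian storage of the 16-bit value is assumed.

```python
# Channel labels for bits b5 (MSB) down to b0, following the table above.
OUTPUT_GAIN_CHANNELS = ("L", "R", "Ls5", "Rs5", "Ltf", "Rtf")


def channels_with_output_gain(output_gain_flags):
    """Return the labels of the channels whose flag bit is set."""
    return [
        name
        for index, name in enumerate(OUTPUT_GAIN_CHANNELS)
        if output_gain_flags & (1 << (5 - index))
    ]


def output_gain_to_db(two_bytes):
    """Interpret 2 bytes as a signed Q7.8 value and return the gain in dB."""
    raw = int.from_bytes(two_bytes, "big", signed=True)
    return raw / 256.0  # 8 fractional bits


def db_to_linear(gain_db):
    """Invert gain_db = 20*log10(factor)."""
    return 10.0 ** (gain_db / 20.0)
```

For example, the bytes FA 00 decode to -6.0 dB, which corresponds to a scale factor of roughly 0.501.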

3.6.3. Ambisonics Config Syntax and Semantics

ambisonics_config() contains information regarding the configuration of Ambisonics. In this specification, the [AmbiX] format is adopted, which uses Ambisonics Channel Number (ACN) channel ordering and normalizes the channels with Schmidt Semi-Normalization (SN3D).

Syntax

class ambisonics_config() {
  leb128() ambisonics_mode;
  if (ambisonics_mode == MONO) {
    ambisonics_mono_config();
  } else if (ambisonics_mode == PROJECTION) {
    ambisonics_projection_config();
  }
}
class ambisonics_mono_config() {
  unsigned int (8) output_channel_count (C);
  unsigned int (8) substream_count (N);
  unsigned int (8 * C) channel_mapping;
}
class ambisonics_projection_config() {
  unsigned int (8) output_channel_count (C);
  unsigned int (8) substream_count (N);
  unsigned int (8) coupled_substream_count (M);
  signed int (16 * (N + M) * C) demixing_matrix;
}

Semantics

ambisonics_mode specifies the method of coding Ambisonics.

ambisonics_mode: Method of coding Ambisonics.
   0    : MONO
   1    : PROJECTION

If ambisonics_mode is equal to MONO, this indicates that the Ambisonics channels are coded as individual mono Audio Substream s. For LPCM, ambisonics_mode SHALL be equal to MONO.

If ambisonics_mode is equal to PROJECTION, this indicates that the Ambisonics channels are first linearly projected onto another subspace before coding as a mix of coupled stereo and mono Audio Substream s.

output_channel_count complies with the channel count in [RFC8486] with the following restrictions:

The allowed numbers of output_channel_count are (1+n)^2, where n = 0, 1, 2, ..., 14.
In other words, the scene-based Audio Element SHALL NOT include non-diegetic channels.
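
The restriction above means that output_channel_count must be a perfect square, from which the Ambisonics order n follows directly. A minimal sketch, with the hypothetical function name ambisonics_order:

```python
import math


def ambisonics_order(output_channel_count):
    """Return n such that output_channel_count == (1 + n)^2, with 0 <= n <= 14."""
    if output_channel_count < 1:
        raise ValueError("output_channel_count must be positive")
    n = math.isqrt(output_channel_count) - 1
    if (n + 1) ** 2 != output_channel_count or n > 14:
        raise ValueError("output_channel_count is not an allowed value")
    return n
```

Counts such as 4 (first order) and 16 (third order) are accepted, while a count such as 6, which would imply 4 Ambisonics channels plus 2 non-diegetic channels, is rejected.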

substream_count specifies the number of Audio Substream s. It SHALL be the same as num_substreams in this OBU.

channel_mapping complies with the one for ChannelMappingFamily = 2 in [RFC8486] .

coupled_substream_count specifies the number of referenced Audio Substream s that are coded as coupled stereo channels, where M <= N.

demixing_matrix complies with the one for ChannelMappingFamily = 3 in [RFC8486] except that the byte order of each of the matrix coefficients is converted to big-endian.

The order of Audio Substream s in ChannelGroup SHALL conform to [RFC8486] .

3.7. Mix Presentation OBU Syntax and Semantics

This section specifies the OBU payload of OBU_IA_Mix_Presentation.

The metadata in mix_presentation() specifies how to render, process and mix one or more Audio Element s, with details provided in § 7.3 Mix Presentation .

An IA Sequence MAY have one or more Mix Presentation s specified. The IA parser SHALL select the appropriate Mix Presentation to process according to the rules specified in § 7.3.1 Selecting a Mix Presentation .

A Mix Presentation MAY contain one or more sub-mixes. Common use cases MAY specify only one sub-mix, which includes all rendered and processed Audio Element s used in the Mix Presentation . The use-case for specifying more than one sub-mix arises if an IA multiplexer is merging two or more IA Sequence s. In this case, it MAY choose to capture the loudness information from the original IA Sequence s in multiple sub-mixes, instead of recomputing the loudness information for the final mix.

Syntax

class mix_presentation_obu() {
  leb128() mix_presentation_id;
  leb128() count_label;
  for (i = 0; i < count_label; i++) {
    string language_label;
  }
  for (i = 0; i < count_label; i++) {
    mix_presentation_annotations();
  }
  leb128() num_sub_mixes;
  for (i = 0; i < num_sub_mixes; i++) {    
    leb128() num_audio_elements;
    for (j = 0; j < num_audio_elements; j++) {
      leb128() audio_element_id;
      for (k = 0; k < count_label; k++) {
        mix_presentation_element_annotations();
      }
      rendering_config();
      element_mix_config();
    }
    output_mix_config();
    
    leb128() num_layouts;
    for (j = 0; j < num_layouts; j++) {
      layout loudness_layout;
      loudness_info loudness; 
    }
  }
}

Semantics

mix_presentation_id defines an identifier for a Mix Presentation . Within an IA Sequence , there SHALL be exactly one non-redundant Mix Presentation OBU with a given identifier. Each Mix Presentation OBU in the first Descriptors within the IA Sequence is regarded as a non-redundant OBU regardless of the value of its obu_redundant_copy . This identifier MAY be used by the application to select which Mix Presentation (s) to offer.

count_label indicates the number of labels in different languages.

language_label specifies the language which both mix_presentation_friendly_label and audio_element_friendly_label are written in. It SHALL conform to [BCP47] . The same language SHALL NOT be duplicated in this loop.

The ith label of both mix_presentation_annotations() and mix_presentation_element_annotations() SHALL be written in the language indicated by the ith language_label, where i = 0, 1, ..., count_label - 1.

mix_presentation_annotations() is a class that provides informational metadata that an IA parser SHOULD refer to when selecting the Mix Presentation to use. The metadata MAY also be used by the playback system to display information to the user but is not used in the rendering or mixing process to generate the final output audio signal.

num_sub_mixes specifies the number of sub-mixes. It SHALL NOT be set to 0.

num_audio_elements specifies the number of Audio Element s that are used in this Mix Presentation to generate the final output audio signal for playback. It SHALL NOT be set to 0.

audio_element_id indicates the identifier for an Audio Element which this Mix Presentation refers to.

mix_presentation_element_annotations() is a class that provides informational metadata that an IA parser SHOULD refer to when selecting the referenced Audio Element to use. The metadata MAY also be used by the playback system to display information to the user, but is not used in the rendering or mixing process to generate the final output audio signal.

rendering_config() is a class that provides the metadata required for rendering the referenced Audio Element .

element_mix_config() is a class that provides the metadata required for applying any processing to the referenced and rendered Audio Element before being summed with other processed Audio Element s.

output_mix_config() is a class that provides the metadata required for post-processing the mixed audio signal to generate the audio signal for playback.

num_layouts specifies the number of layouts for this sub-mix on which the loudness information was measured.

loudness_layout identifies the layout that was used to measure the loudness information provided in this sub-mix.

loudness provides the loudness information which was measured on loudness_layout for the Rendered Mix Presentation by this sub-mix.

The layout specified in loudness_layout SHOULD NOT be higher than the highest layout among layouts provided by the Audio Element s except zero-order Ambisonics or Mono. In other words, rendering from an Audio Element with the highest layout (except zero-order Ambisonics or Mono) to the loudness_layout SHOULD NOT require an up-mix.

loudness_layout for zero-order Ambisonics or Mono SHOULD NOT be higher than Stereo. Zero-order Ambisonics or Mono MAY be rendered to Stereo.

If one sub-mix of a Mix Presentation OBU includes only a single scalable channel audio, then the following applies:

num_layouts SHALL be greater than or equal to num_layers specified in scalable_channel_layout_config() of Audio Element OBU for the audio_element_id except the following cases:

The highest loudness_layout specified in one sub-mix except for zero-order Ambisonics or Mono is the layout that was used for authoring the sub-mix.

The highest loudness_layout for zero-order Ambisonics or Mono is Stereo.

Each sub-mix SHALL include loudness_layout to identify Loudspeaker configuration for Sound System A (0+2+0) (i.e. Stereo). In other words, each sub-mix SHALL include loudness_info() for Stereo.

If the Rendered Mix Presentation by each sub-mix is Mono, then its loudness for loudness_layout = Stereo SHOULD be measured on the Stereo generated from the Mono by the equations, L = 0.707 x Mono and R = 0.707 x Mono.

3.7.1. Mix Presentation Annotations Syntax and Semantics

Syntax

class mix_presentation_annotations() {
  string mix_presentation_friendly_label;
}

Semantics

mix_presentation_friendly_label specifies a human-friendly label to describe this Mix Presentation .

3.7.2. Mix Presentation Element Annotations Syntax and Semantics

Syntax

class mix_presentation_element_annotations() {
  string audio_element_friendly_label;
}

Semantics

audio_element_friendly_label specifies a human-friendly label to describe the referenced Audio Element .

3.7.3. Rendering Config Syntax and Semantics

During playback, an Audio Element SHOULD be rendered using a pre-defined renderer according to § 7.3.2 Rendering an Audio Element .

Syntax

class rendering_config() {
  unsigned int (2) headphones_rendering_mode;
  unsigned int (6) reserved;
  leb128() rendering_config_extension_size;
  unsigned int (8*rendering_config_extension_size) rendering_config_extension_bytes;
}

Semantics

headphones_rendering_mode indicates whether the input channel-based Audio Element is rendered to stereo loudspeakers or spatialized with a binaural renderer when played back on headphones. If the playback layout is a loudspeaker layout or the input Audio Element is not CHANNEL_BASED, the parsers SHALL ignore this field.

0: It indicates that the input Audio Element is rendered to loudspeaker_layout = Stereo.
1: It indicates that the input Audio Element is rendered to binaural output.
2~3: Reserved.

Parsers encountering a Reserved value of headphones_rendering_mode SHALL ignore the Mix Presentation OBU that contains this rendering_config() .

reserved SHALL be ignored by the parser.

rendering_config_extension_size indicates the size in bytes of rendering_config_extension_bytes .

rendering_config_extension_bytes represents reserved bytes for future use. Parsers compliant with this version of the specification SHALL ignore these bytes.
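
The handling of rendering_config() described above can be sketched as follows. The names read_leb128 and parse_rendering_config are hypothetical, and returning None for a reserved mode is only one possible way of signalling that the enclosing Mix Presentation OBU is to be ignored.

```python
def read_leb128(buf, pos):
    """Decode an unsigned LEB128 value at buf[pos]; return (value, new_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7


def parse_rendering_config(buf, pos=0):
    """Return (mode_name or None, new_pos); None marks a reserved mode."""
    mode = buf[pos] >> 6                # 2-bit mode; the 6 reserved bits are ignored
    pos += 1
    size, pos = read_leb128(buf, pos)   # rendering_config_extension_size
    pos += size                         # skip rendering_config_extension_bytes
    if mode > 1:
        return None, pos
    return ("STEREO", "BINAURAL")[mode], pos
```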

3.7.4. Element Mix Config Syntax and Semantics

element_mix_config() provides a gain value to be applied to the rendered Audio Element signal.

Syntax

class element_mix_config() {
  MixGainParamDefinition mix_gain;
}

class MixGainParamDefinition() extends ParamDefinition() {
  signed int (16) default_mix_gain;
}

Semantics

mix_gain provides the parameter definition for the gain value that is applied to all channels of the rendered Audio Element signal. The parameter definition is provided by MixGainParamDefinition() and the corresponding parameter data to be provided in parameter blocks is specified in mix_gain_parameter_data() .

default_mix_gain specifies the default mix gain value to apply when there are no mix gain parameter blocks provided. This value is expressed in dB and SHALL be applied to all channels in the rendered Audio Element . It is stored as a 16-bit, signed, two’s complement fixed-point value with 8 fractional bits (i.e. Q7.8 in [Q-Format] ).

3.7.5. Output Mix Config Syntax and Semantics

output_mix_config() provides a gain value to be applied to the mixed audio signal.

Syntax

class output_mix_config() {
  MixGainParamDefinition output_mix_gain;
}

Semantics

output_mix_gain provides the parameter definition for the gain value that is applied to all channels of the mixed audio signal. The parameter definition is provided by MixGainParamDefinition() and the corresponding parameter data to be provided in parameter blocks is specified in mix_gain_parameter_data() .

3.7.6. Layout Syntax and Semantics

The layout class specifies either a binaural system or the list of physical loudspeaker positions according to [ITU2051-3] .

Syntax

class layout() {
  unsigned int (2) layout_type;
  
  if (layout_type == LOUDSPEAKERS_SS_CONVENTION) {
    unsigned int (4) sound_system;
    unsigned int (2) reserved;
  }
  else if (layout_type == BINAURAL or RESERVED) {
    unsigned int (6) reserved;
  }
}

Semantics

layout_type specifies the layout type.

layout_type : Layout type
   0 - 1    : RESERVED
     2      : LOUDSPEAKERS_SS_CONVENTION
     3      : BINAURAL

A value of 0 or 1 is reserved.
A value of 2 indicates that the layout is defined using the sound system convention of [ITU2051-3] .
A value of 3 indicates that the layout is binaural.

sound_system specifies the sound system A to J as specified in [ITU2051-3] , 7.1.2ch and 3.1.2ch of loudspeaker_layout as follows:

0: It indicates Loudspeaker configuration for Sound System A (0+2+0)
1: It indicates Loudspeaker configuration for Sound System B (0+5+0)
2: It indicates Loudspeaker configuration for Sound System C (2+5+0)
3: It indicates Loudspeaker configuration for Sound System D (4+5+0)
4: It indicates Loudspeaker configuration for Sound System E (4+5+1)
5: It indicates Loudspeaker configuration for Sound System F (3+7+0)
6: It indicates Loudspeaker configuration for Sound System G (4+9+0)
7: It indicates Loudspeaker configuration for Sound System H (9+10+3)
8: It indicates Loudspeaker configuration for Sound System I (0+7+0)
9: It indicates Loudspeaker configuration for Sound System J (4+7+0)
10: It indicates the same loudspeaker configuration as loudspeaker_layout = 0110 (i.e. 7.1.2ch)
11: It indicates the same loudspeaker configuration as loudspeaker_layout = 1000 (i.e. 3.1.2ch)
12: It indicates Mono
13 ~ 15: Reserved

When a value for layout_type or sound_system is not supported, parsers compliant with this version of the specification SHOULD ignore this layout() and the following loudness_info() .
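
Since layout() always occupies a single byte, its interpretation can be sketched compactly. The function name parse_layout and the returned tuple shape are hypothetical, the field packing is assumed to be most significant bit first, and a result of (None, None) stands for a layout that a parser of this version does not support.

```python
SOUND_SYSTEMS = (
    "A (0+2+0)", "B (0+5+0)", "C (2+5+0)", "D (4+5+0)", "E (4+5+1)",
    "F (3+7+0)", "G (4+9+0)", "H (9+10+3)", "I (0+7+0)", "J (4+7+0)",
    "7.1.2ch", "3.1.2ch", "Mono",
)


def parse_layout(layout_byte):
    """Return (layout_type_name, sound_system_name) for one layout() byte."""
    layout_type = layout_byte >> 6              # upper 2 bits
    if layout_type == 3:
        return "BINAURAL", None
    if layout_type == 2:
        sound_system = (layout_byte >> 2) & 0x0F  # next 4 bits
        if sound_system < len(SOUND_SYSTEMS):
            return "LOUDSPEAKERS_SS_CONVENTION", SOUND_SYSTEMS[sound_system]
    # Reserved layout_type (0 or 1) or reserved sound_system (13..15).
    return None, None
```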

3.7.7. Loudness Info Syntax and Semantics

loudness_info() provides loudness information for a given audio signal.

All signed values are stored as signed Q7.8 fixed-point values (in [Q-Format] ).

Syntax

class loudness_info() {
  unsigned int (8) info_type;
  signed int (16) integrated_loudness;
  signed int (16) digital_peak;
  if (info_type & 1) {
    signed int (16) true_peak;
  }
  if (info_type & 2) {
    unsigned int (8) num_anchored_loudness;
    for (i = 0; i < num_anchored_loudness; i++) {
      unsigned int (8) anchor_element;
      signed int (16) anchored_loudness;
    }
  }
  if ((info_type & 0b11111100) > 0) {
    leb128() info_type_size;
    unsigned int (8*info_type_size) info_type_bytes;
  }
}

Semantics

info_type is a bitmask that specifies the type of loudness information provided. The bits are set as follows, where the first bit is the LSB:

Bit : Type of information provided
 0 (LSB)  : True peak
 1        : Anchored Loudness (one or more)
2~7 (MSB) : Reserved

When a reserved bit of info_type is set, parsers compliant with this version of the specification SHOULD ignore all bytes from the first byte of the syntax elements defined by that bit to the last byte of the OBU.

integrated_loudness provides the program integrated loudness information, specified in LKFS as defined in [ITU1770-4] , and measured according to [ITU1770-4] .

digital_peak specifies the digital (sampled) peak value of the audio signal, specified in dBFS.

true_peak specifies the true peak of the audio signal, specified in dBFS and measured according to [ITU1770-4] .

anchor_element specifies the anchor element used in computation of the anchored_loudness which follows, as defined in [ISOIEC-23091-3-2018] , as follows:

  0   : Unknown
  1   : Dialogue
  2   : Album
3~255 : Reserved

There SHALL be no duplicate values of anchor_element within one loudness_info(). When a reserved value of anchor_element is set, parsers compliant with this version of the specification MAY treat it as Unknown.

anchored_loudness specifies the loudness information according to the anchor element, specified in LKFS as defined in [ITU1770-4] .

NOTE: [ITU1770-4] adopts the convention of using the dBov unit for dBFS, where the RMS value of a full-scale square wave is 0 dBov. The same convention is adopted here.

info_type_size indicates the size in bytes of info_type_bytes .

info_type_bytes represents reserved bytes for future use when new bits of info_type are defined. Parsers compliant with this version of the specification SHALL ignore these bytes.
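The following is a minimal sketch of how a parser might read loudness_info() and convert its Q7.8 fields to floating point. It assumes big-endian byte order for multi-byte fields, and the names parse_loudness_info and q7_8_to_float are hypothetical helpers, not part of this specification. Reserved bits are only reported, not parsed.

```python
import struct


def q7_8_to_float(raw: int) -> float:
    """Convert a signed 16-bit Q7.8 integer to a float."""
    return raw / 256.0


def parse_loudness_info(buf: bytes) -> dict:
    """Parse the loudness_info() fields defined in this version.

    Assumption: multi-byte fields are stored big-endian.
    """
    info_type = buf[0]
    integrated, digital_peak = struct.unpack_from(">hh", buf, 1)
    pos = 5
    info = {
        "info_type": info_type,
        "integrated_loudness": q7_8_to_float(integrated),  # LKFS
        "digital_peak": q7_8_to_float(digital_peak),  # dBFS
    }
    if info_type & 1:  # bit 0: true peak
        (true_peak,) = struct.unpack_from(">h", buf, pos)
        info["true_peak"] = q7_8_to_float(true_peak)
        pos += 2
    if info_type & 2:  # bit 1: anchored loudness
        count = buf[pos]
        pos += 1
        anchored = {}
        for _ in range(count):
            anchor_element = buf[pos]
            (raw,) = struct.unpack_from(">h", buf, pos + 1)
            pos += 3
            if anchor_element in anchored:
                raise ValueError("duplicate anchor_element")
            anchored[anchor_element] = q7_8_to_float(raw)
        info["anchored_loudness"] = anchored
    # Reserved bits 2..7 are only reported; their bytes are not parsed here.
    info["reserved_bits_set"] = bool(info_type & 0b11111100)
    info["bytes_consumed"] = pos
    return info
```

For example, a raw integrated_loudness of -4096 corresponds to -16.0 LKFS under this conversion.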

3.8. Parameter Block OBU Syntax and Semantics

This section specifies the OBU payload of OBU_IA_Parameter_Block.

The metadata specified in this OBU defines the parameter values for an algorithm for an indicated duration, including any animation of the parameter values over this duration. The metadata is used in conjunction with a corresponding parameter definition and parameter data specification. The parameter definition is specified based on ParamDefinition() . The parameter data provides the values to apply in each parameter block. These are specified using the AnimatedParameterData() function template if parameter animation is supported.

Syntax

class parameter_block_obu() {
  leb128() parameter_id;
  
  (param_definition_type, param_definition_mode, duration, num_subblocks, constant_subblock_duration, subblock_duration) = get_param_definition(parameter_id);
  
  if (param_definition_mode) {
    leb128() duration;
    leb128() constant_subblock_duration;
    if (constant_subblock_duration == 0) {
      leb128() num_subblocks;
    } else {
      // num_subblocks = roundup(duration ÷ constant_subblock_duration)
    }
  }
  for (i = 0; i < num_subblocks; i++) {
    if (param_definition_mode) {
      if (constant_subblock_duration == 0) {
        leb128() subblock_duration;
      }
    }
    if (param_definition_type == PARAMETER_DEFINITION_MIX_GAIN) {
      mix_gain_parameter_data();
    }
    else if (param_definition_type == PARAMETER_DEFINITION_DEMIXING) {
      demixing_info_parameter_data();
    }
    else if (param_definition_type == PARAMETER_DEFINITION_RECON_GAIN) {
      recon_gain_info_parameter_data();
    }
    else {
      leb128() parameter_data_size;
      unsigned int (8*parameter_data_size) parameter_data_bytes;
    }
  }
}

Semantics

parameter_id defines an identifier for a Parameter Substream. Parameter Block OBUs refer to the Parameter Substream through this identifier. If no Audio Element OBUs or Mix Presentation OBUs refer to this parameter_id, parsers compliant with this version of the specification SHALL ignore Parameter Block OBUs with this identifier.

get_param_definition() is a run-time function to get the parameter definition type, the parameter definition mode, duration, num_subblocks, constant_subblock_duration, and subblock_duration mapped to the parameter_id.

When get_param_definition() returns an unknown param_definition_type, parsers compliant with this version of the specification SHALL ignore the Parameter Block OBU.

duration specifies the duration for which this parameter block is valid and applicable. It SHALL NOT be set to 0.

num_subblocks specifies the number of different sets of parameter values specified in this parameter block, where each set describes a different subblock of the timeline, contiguously. When constant_subblock_duration != 0, num_subblocks is implicitly calculated as num_subblocks = roundup( duration ÷ constant_subblock_duration ).

An Audio Element OBU and/or a Mix Presentation OBU maps a parameter_id to its parameter definition type, so IA decoders can determine the definition type mapped to the parameter_id.

subblock_duration specifies the duration for the given subblock. It SHALL NOT be set to 0.

Each value of duration, constant_subblock_duration, and subblock_duration SHALL be expressed as the number of ticks at the parameter_rate specified in the corresponding parameter definition.

parameter_data_size indicates the size in bytes of parameter_data_bytes .

parameter_data_bytes represents reserved bytes for future use when new syntaxes are defined. Parsers compliant with this version of the specification SHALL ignore these bytes.
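As an illustration of the implicit num_subblocks rule above, the sketch below computes roundup(duration ÷ constant_subblock_duration) with integer arithmetic. The function name implicit_num_subblocks is hypothetical; all values are in ticks at the parameter_rate.

```python
def implicit_num_subblocks(duration: int, constant_subblock_duration: int) -> int:
    """Return num_subblocks when constant_subblock_duration != 0."""
    if duration == 0:
        raise ValueError("duration SHALL NOT be 0")
    if constant_subblock_duration == 0:
        raise ValueError("num_subblocks is signalled explicitly in this case")
    # Ceiling division without floating point.
    return -(-duration // constant_subblock_duration)
```

With duration = 960 and constant_subblock_duration = 400, this yields 3 subblocks.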

3.8.1. Mix Gain Parameter Data Syntax and Semantics

mix_gain_parameter_data() specifies the parameter data to be used for mixing an Audio Element.

Syntax

class mix_gain_parameter_data() {
  leb128() animation_type;
  AnimatedParameterData<signed int (16)> param_data;
}

Semantics

animation_type specifies the type of animation applied to the parameter values. When an unknown value of animation_type is used, parsers compliant with this version of the specification SHALL ignore the Parameter Block OBU that contains this mix_gain_parameter_data() .

param_data uses the AnimatedParameterData function template. Each of the values defined within this instance (start_point_value, end_point_value, and control_point_value) is expressed in dB and SHALL be applied to all channels in the rendered Audio Element . They are stored as 16-bit, signed, two’s complement fixed-point values with 8 fractional bits (i.e. Q7.8 in [Q-Format] ).

animation_type : Animation Type
       0       : STEP
       1       : LINEAR
       2       : BEZIER

Classes that take animation_type as an input argument use the AnimatedParameterData() function template. The method of applying the animation is described in § 7.4 Animated Parameters .

template <class T>
class AnimatedParameterData(animation_type) {
  if (animation_type == STEP) {
    T start_point_value;
  }
  if (animation_type == LINEAR) {
    T start_point_value;
    T end_point_value;
  }
  if (animation_type == BEZIER) {
    T start_point_value;
    T end_point_value;
    T control_point_value;
    unsigned int (8) control_point_relative_time;
  }
}

start_point_value specifies the parameter value that is applied at the start of the subblock.

end_point_value specifies the parameter value that is applied at the end of the subblock.

control_point_value specifies the parameter value of the middle control point of a quadratic Bezier curve, i.e. its y-axis value.

control_point_relative_time specifies the time of the middle control point of a quadratic Bezier curve, i.e. its x-axis value. This value is expressed as a fraction of the parameter subblock duration, with valid values in the range from 0 to 1, inclusive. A value equal to 0 or 1 indicates that this animation implements a linear Bezier curve, in which case control_point_value shall be ignored by the IA parser. It is stored as an 8-bit, unsigned, fixed-point value with 8 fractional bits (i.e. Q0.8 in [Q-Format]).
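The field layout of AnimatedParameterData for mix gain can be sketched as follows. This is an illustrative reading only: it assumes big-endian byte order, and parse_animated_mix_gain is a hypothetical helper name. It converts Q7.8 values to dB and the Q0.8 control_point_relative_time to a fraction; applying the animation itself is described in § 7.4 and is not attempted here.

```python
import struct

STEP, LINEAR, BEZIER = 0, 1, 2


def parse_animated_mix_gain(animation_type: int, buf: bytes) -> dict:
    """Decode AnimatedParameterData<signed int (16)> for one subblock.

    Assumption: multi-byte fields are stored big-endian.
    """
    if animation_type == STEP:
        (start,) = struct.unpack_from(">h", buf)
        return {"start_point_value": start / 256.0}
    if animation_type == LINEAR:
        start, end = struct.unpack_from(">hh", buf)
        return {"start_point_value": start / 256.0,
                "end_point_value": end / 256.0}
    if animation_type == BEZIER:
        start, end, control, rel_time = struct.unpack_from(">hhhB", buf)
        return {"start_point_value": start / 256.0,
                "end_point_value": end / 256.0,
                "control_point_value": control / 256.0,
                # Q0.8: 8 fractional bits, so divide by 256.
                "control_point_relative_time": rel_time / 256.0}
    # Unknown animation_type: the enclosing Parameter Block OBU is ignored.
    raise ValueError("unknown animation_type")
```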

3.8.2. Demixing Info Parameter Data Syntax and Semantics

demixing_info_parameter_data() specifies the demixing parameter mode to be used to reconstruct output channel audio according to its loudspeaker_layout .

Syntax

class demixing_info_parameter_data() {
  unsigned int (3) dmixp_mode;
  unsigned int (5) reserved;
}

Semantics

dmixp_mode indicates a mode of pre-defined combinations of five demix parameters.

0: mode1, (alpha, beta, gamma, delta, w_idx_offset) = (1, 1, 0.707, 0.707, -1)
1: mode2, (alpha, beta, gamma, delta, w_idx_offset) = (0.707, 0.707, 0.707, 0.707, -1)
2: mode3, (alpha, beta, gamma, delta, w_idx_offset) = (1, 0.866, 0.866, 0.866, -1)
3: reserved
4: mode1, (alpha, beta, gamma, delta, w_idx_offset) = (1, 1, 0.707, 0.707, 1)
5: mode2, (alpha, beta, gamma, delta, w_idx_offset) = (0.707, 0.707, 0.707, 0.707, 1)
6: mode3, (alpha, beta, gamma, delta, w_idx_offset) = (1, 0.866, 0.866, 0.866, 1)
7: reserved

alpha and beta are gain values used for S7to5 down-mixer, gamma for T4to2 down-mixer, delta for S5to3 down-mixer and w_idx_offset is the offset to generate a gain value w used for T2toTF2 down-mixer.
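The mode table above lends itself to a simple lookup. The sketch below assumes that dmixp_mode, being the first field, occupies the three most significant bits of the byte; parse_demixing_info and DMIXP_MODES are hypothetical names used for illustration.

```python
# dmixp_mode -> (alpha, beta, gamma, delta, w_idx_offset), from the table above.
DMIXP_MODES = {
    0: (1.0, 1.0, 0.707, 0.707, -1),
    1: (0.707, 0.707, 0.707, 0.707, -1),
    2: (1.0, 0.866, 0.866, 0.866, -1),
    4: (1.0, 1.0, 0.707, 0.707, 1),
    5: (0.707, 0.707, 0.707, 0.707, 1),
    6: (1.0, 0.866, 0.866, 0.866, 1),
}


def parse_demixing_info(byte: int):
    """Return (dmixp_mode, parameters); parameters is None for reserved modes.

    Assumption: the 3-bit dmixp_mode is stored in the most significant bits,
    followed by 5 reserved bits.
    """
    dmixp_mode = (byte >> 5) & 0x07
    return dmixp_mode, DMIXP_MODES.get(dmixp_mode)
```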

IA Down-mix Mechanism

3.8.3. Recon Gain Info Parameter Data Syntax and Semantics

recon_gain_info_parameter_data() contains recon gain values for demixed channels.

NOTE: recon_gain_info_parameter_data() is required to compensate for errors that are introduced by lossy codecs such as OPUS and AAC-LC and then propagated by the De-mixer and Gain modules specified in § 7.2.2 De-mixer and § 7.2.1 Gain. It is not required for lossless codecs such as FLAC and LPCM because the propagated errors are negligible.

Syntax

class recon_gain_info_parameter_data() {
  for (i=0; i< num_layers; i++) {
    if (recon_gain_is_present_flag(i) == 1) {
      leb128() recon_gain_flags(i);
      for (j=0; j< n(i); j++) {
        if (recon_gain_flags(i)(j) == 1)
          unsigned int (8) recon_gain;
      }
    }
  }
}

Semantics

recon_gain_flags indicates the channels which recon_gain is applied to.

Table for Recon Gain Flags

Each bit of recon_gain_flags indicates the presence of recon_gain applied to the channel as depicted in the above figure.

0: It indicates that recon_gain is not present for the channel.
1: It indicates that recon_gain is present for the channel.

n(i) indicates the number of bits for recon_gain_flags(i), where i = 0, 1, ..., num_layers - 1. It SHALL be 7 or 12 as depicted in the above figure.

recon_gain indicates the gain value to be applied to the channel indicated by recon_gain_flags, after decoding the associated frames and performing the demixing operation. The detailed operation using this value is specified in § 7.2.3 Recon Gain.
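The inner loop of recon_gain_info_parameter_data() for a single layer can be sketched as below. Two assumptions are made that this section does not confirm: leb128() is decoded as an unsigned little-endian base-128 integer, and bit j of recon_gain_flags is the j-th least significant bit of the decoded value. The helper names read_leb128 and parse_recon_gains_for_layer are hypothetical.

```python
def read_leb128(buf: bytes, pos: int):
    """Decode an unsigned LEB128 value; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos


def parse_recon_gains_for_layer(buf: bytes, pos: int, n: int):
    """Read recon_gain_flags and the recon_gain bytes it announces.

    n is n(i), i.e. 7 or 12. Returns ({bit index: recon_gain}, next position).
    Assumption: flag j is bit j counted from the LSB of the decoded value.
    """
    if n not in (7, 12):
        raise ValueError("n(i) SHALL be 7 or 12")
    flags, pos = read_leb128(buf, pos)
    gains = {}
    for j in range(n):
        if (flags >> j) & 1:
            gains[j] = buf[pos]  # unsigned int (8) recon_gain
            pos += 1
    return gains, pos
```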

3.9. Audio Frame OBU Syntax and Semantics

This section specifies OBU payloads of OBU_IA_Audio_Frame and OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID17.

audio_substream_id is an identifier for the Audio Substream associated with this audio frame. Within an IA Sequence, there SHALL be exactly one non-redundant Audio Element OBU with a given audio_substream_id. Each Audio Element OBU in the first Descriptors within the IA Sequence is regarded as a non-redundant OBU regardless of the value of its obu_redundant_copy.

Syntax

class audio_frame_obu(audio_substream_id_in_bitstream) {
  if (audio_substream_id_in_bitstream) {
     leb128() explicit_audio_substream_id;
  }
  unsigned int (8*coded_frame_size) audio_frame();
}

Semantics

audio_substream_id_in_bitstream is not a syntax element in an IA Sequence; it indicates whether this OBU payload explicitly includes audio_substream_id. It is `true` for obu_type = OBU_IA_Audio_Frame and `false` for obu_type = OBU_IA_Audio_Frame_ID0, OBU_IA_Audio_Frame_ID1, ..., or OBU_IA_Audio_Frame_ID17.

explicit_audio_substream_id defines the audio_substream_id of this frame. The value SHALL be greater than 17. When this field is not present, audio_substream_id is implicit and is defined as a value from 0 to 17 for OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID17, respectively.

NOTE: The first 18 Audio Substream s in an IA Sequence MAY use the OBU types OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID17, which have predefined audio_substream_id s associated with them. This reduces bitrate by avoiding the extra explicit_audio_substream_id field in the bitstream.

coded_frame_size is the size of audio_frame() in bytes.

audio_frame() is the coded audio data for the frame. It is codec specific and its format is defined in § 3.11 Codec Specific .
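The rule for deriving audio_substream_id from the OBU type can be sketched as follows. The function resolve_audio_substream_id and its arguments are hypothetical: implicit_index stands for k in OBU_IA_Audio_Frame_IDk (or None for OBU_IA_Audio_Frame), and explicit_id is the decoded explicit_audio_substream_id when present.

```python
MAX_IMPLICIT_ID = 17


def resolve_audio_substream_id(implicit_index=None, explicit_id=None) -> int:
    """Return the audio_substream_id of an Audio Frame OBU."""
    if implicit_index is not None:
        # OBU_IA_Audio_Frame_ID0..ID17 carry no explicit identifier.
        if explicit_id is not None:
            raise ValueError("explicit id is not present for ID0..ID17 types")
        if not 0 <= implicit_index <= MAX_IMPLICIT_ID:
            raise ValueError("implicit index must be in 0..17")
        return implicit_index
    if explicit_id is None:
        raise ValueError("OBU_IA_Audio_Frame requires an explicit id")
    if explicit_id <= MAX_IMPLICIT_ID:
        raise ValueError("explicit_audio_substream_id SHALL be greater than 17")
    return explicit_id
```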

3.10. Temporal Delimiter OBU Syntax and Semantics

This section specifies the OBU payload of OBU_IA_Temporal_Delimiter.

Syntax

class temporal_delimiter_obu() {
}

NOTE: The Temporal Delimiter OBU has an empty payload.

3.11. Codec Specific

This section defines codec-specific information for Codec_Specific_Info and its coded Audio Substream . To generate one single coded Audio Substream , only mono or stereo coding SHALL be allowed for this version of the specification.

Codec_Specific_Info is composed of codec_id and decoder_config() .

For legacy codecs, decoder_config() SHALL be exactly the same information as the conventional file parser feeds to the codec decoders for decoding of the coded Audio Substream . For future codecs, decoder_config() SHALL include all of the decoding parameters which are required to decode the coded Audio Substream .

A coded Audio Substream is a coded stream for one or more channels. The format of audio_frame() is exactly the same as the sample format (before packing OBU) for the audio file which consists of only one single coded stream by the codec_id .

3.11.1. OPUS Specific

codec_id SHALL be Opus .

decoder_config() for OPUS conforms to ID Header with ChannelMappingFamily = 0 of [RFC7845] with the following constraints:

Magic Signature SHALL NOT be present.
Output Channel Count SHALL be set to 2. Parsers can ignore Output Channel Count because the real value can be determined from the Audio Element OBU and from the Opus packet header.
Pre-skip SHALL be the same as the number of audio samples to be trimmed at the start of coded Audio Substream s.
Output Gain SHALL NOT be used. In other words, it SHALL be set to 0dB.
The byte order of each field in ID Header is converted to big-endian.

The format of audio_frame() is an Opus packet of [RFC6716] which contains only one single frame of mono or stereo channels and which has a non-delimiting frame structure.

The sample rate used for computing offsets SHALL be 48 kHz.

3.11.2. AAC-LC Specific

codec_id SHALL be mp4a .

decoder_config() for AAC-LC is DecoderConfigDescriptor() of [MP4-Systems] , which is a subset of ESDBox for [MP4-Audio] , with the following constraints:

objectTypeIndication = 0x40
streamType = 0x05 (Audio Stream)
upstream = 0
decSpecificInfo(): The syntax and values conform to AudioSpecificConfig() of [MP4-Audio] with the following constraints:
- audioObjectType = 2
- channelConfiguration SHALL be set to 2. The real value can be implied from the Audio Element OBU .
- GASpecificConfig() : The syntax and values conform to GASpecificConfig() of [MP4-Audio] with the following constraints:
  - frameLengthFlag = 0 (1024 lines IMDCT)
  - dependsOnCoreCoder = 0
  - extensionFlag = 0

The format of audio_frame() is one single raw_data_block() of [AAC] which contains only one single frame of mono or stereo channels.

The sample rate used for computing offsets SHALL be the rate indicated by the samplingFrequencyIndex in AudioSpecificConfig().

3.11.3. FLAC Specific

codec_id SHALL be fLaC , the FLAC stream marker in ASCII, meaning byte 0 of the stream is 0x66, followed by 0x4C 0x61 0x43.

decoder_config() for FLAC is METADATA_BLOCK of [FLAC] .

The format of audio_frame() is FRAME of [FLAC] , which is composed of FRAME_HEADER , followed by SUBFRAME (s) (one SUBFRAME per channel) and followed by FRAME_FOOTER .

The sample rate used for computing offsets SHALL be the sampling rate indicated in the METADATA_BLOCK .

3.11.4. LPCM Specific

codec_id SHALL be ipcm .

decoder_config() for LPCM is as follows:

class decoder_config(ipcm) {
  unsigned int (8) sample_format_flags;
  unsigned int (8) sample_size;
  unsigned int (32) sample_rate;
}

sample_format_flags complies with format_flags specified in [MP4-PCM] . In other words, 0x01 indicates little-endian PCM sample format and 0x00 indicates big-endian PCM sample format.

sample_size complies with PCM_sample_size specified in [MP4-PCM]. In other words, it SHALL take a value from the set 16, 24, and 32.

sample_rate indicates the sample rate of the input audio in Hz. It SHALL take a value from the set 44.1k, 16k, 32k, 48k, and 96k.

audio_frame() of Audio_Frame_OBU is only one single PCM audio frame of mono or stereo channels.

When audio_frame() contains one single PCM audio frame of stereo channels, the samples are interleaved: the ith audio sample of the left channel is followed by the ith audio sample of the right channel, and then the (i+1)th audio sample of the left channel is followed by the (i+1)th audio sample of the right channel, where i = 1, 2, ..., num_samples_per_frame.
When more than one byte is used to represent a PCM sample, the byte order (i.e. its endianness) is indicated in sample_format_flags .

The sample rate used for computing offsets SHALL be sample_rate .
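The LPCM decoder_config() and the stereo sample ordering described above can be sketched as follows. The sketch assumes that the fields of decoder_config() itself are stored big-endian and that PCM samples are signed integers; parse_lpcm_decoder_config and deinterleave_stereo are hypothetical helper names.

```python
import struct

ALLOWED_SIZES = {16, 24, 32}
ALLOWED_RATES = {16000, 32000, 44100, 48000, 96000}


def parse_lpcm_decoder_config(buf: bytes) -> dict:
    """Decode decoder_config(ipcm). Assumption: big-endian field layout."""
    flags, sample_size, sample_rate = struct.unpack_from(">BBI", buf)
    if sample_size not in ALLOWED_SIZES:
        raise ValueError("unsupported sample_size")
    if sample_rate not in ALLOWED_RATES:
        raise ValueError("unsupported sample_rate")
    return {"little_endian": flags == 0x01,
            "sample_size": sample_size,
            "sample_rate": sample_rate}


def deinterleave_stereo(frame: bytes, sample_size: int, little_endian: bool):
    """Split an interleaved L/R PCM frame into (left, right) sample lists.

    Assumption: samples are signed integers.
    """
    step = sample_size // 8
    order = "little" if little_endian else "big"
    samples = [int.from_bytes(frame[i:i + step], order, signed=True)
               for i in range(0, len(frame), step)]
    return samples[0::2], samples[1::2]
```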

4. Profiles

The IA Profiles define a set of capabilities that are REQUIRED to parse, decode and process the corresponding IA Sequence .

IA decoders SHALL be able to parse all OBUs explicitly listed for this version of the specification. They can still encounter reserved OBUs that they SHOULD skip. This allows future versions of the specification to define new profiles that can be backward compatible with old profiles.

NOTE: In this section and its subsections, an OBU is still considered the same unique OBU if its copies differ only in the redundant flag.

Common restrictions on the IA Sequence for all profiles specified in this version of the specification:

There SHALL be only one unique set of Descriptors. So, if Descriptors are present in the middle of the IA Sequence, all OBUs of those Descriptors SHALL be marked as redundant (i.e. obu_redundant_copy = 1).
- When Descriptors are placed in the middle of the IA Sequence , it SHALL NOT be placed in the middle of a Temporal Unit . In other words, Descriptors SHALL be present after the last OBU of a Temporal Unit and before the first OBU of the next Temporal Unit .
There SHALL be only one unique Codec Config OBU .
Every Audio Substream in the IA Sequence SHALL have the same start timestamp, SHALL consist of the same number of Audio Frame OBU s, and SHALL have the same trimming information.
Every Parameter Substream in the IA Sequence SHALL have the same start timestamp as the Audio Substream to which it is applied and SHALL consist of the same number of Parameter Block OBUs.
- Every Parameter Block OBU SHALL have the same duration as that of the Audio Frame OBU under the same sample rate.
  - For example, when Audio Frame OBU has 960 audio samples at 48000Hz, the duration of every Parameter Block OBU SHALL be 960 units at 48000Hz or 480 units at 24000Hz.
In every Temporal Unit , the start timestamp of every Audio Frame OBU SHALL be the same as that of every Parameter Block OBU if present.
- There SHALL be no redundant Parameter Block OBU .
- Parameter Block OBUs shall come first and be followed by Audio Frame OBUs.
num_sub_mixes SHOULD be set to 1. Mix Presentation OBUs with num_sub_mixes > 1 SHOULD be ignored.
num_audio_elements SHOULD be set to 1 or 2. Mix Presentation OBUs with num_audio_elements > 2 SHOULD be ignored.

NOTE: This behavior is to allow future versions of this specification to define new profiles that support a number of audio elements and/or a number of sub-mixes greater than those recommended in this profile, while still permitting streams compliant with these new profiles to be processed by parsers compliant to the profiles defined in this version of the specification.

When num_layers = 1, DemixingParamDefinition() for demixing MAY be present in the Audio Element OBU and IA decoders MAY use the (default) parameter data for demixing for (dynamic) down-mixing.
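The duration rule for Parameter Block OBUs in the common restrictions above (960 samples at 48000 Hz corresponds to 960 ticks at 48000 Hz or 480 ticks at 24000 Hz) can be checked with exact rational arithmetic. The function name parameter_block_ticks is hypothetical, and rejecting non-integer tick counts is an assumption of this sketch.

```python
from fractions import Fraction


def parameter_block_ticks(num_samples: int, sample_rate: int,
                          parameter_rate: int) -> int:
    """Express an Audio Frame duration in ticks at parameter_rate."""
    ticks = Fraction(num_samples * parameter_rate, sample_rate)
    if ticks.denominator != 1:
        # Assumption: a duration that is not a whole number of ticks is invalid.
        raise ValueError("frame duration is not a whole number of ticks")
    return int(ticks)
```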

4.1. IA Simple Profile

This section specifies the conformance points of the simple profile.

Restrictions on the IA Sequence :

There SHALL be only one unique Audio Element OBU .
There SHALL NOT be any Temporal Delimiter OBU s present.
additional_profile SHALL be set to 0 to indicate that the IA Sequence complies with this profile.
primary_profile SHALL be set to 0.

Capabilities of the IA parser, decoder, and processor:

They SHALL be able to parse an IA Sequence with primary_profile = 0.
They SHALL be able to decode and process up to 16 channels.
They SHALL be able to reconstruct one Audio Element .
They MAY use (default_)demixing_info_parameter_data() to do down-mixing.

4.2. IA Base Profile

This section specifies the conformance points of the base profile.

Restrictions on IA Sequence :

There SHALL be at most two unique Audio Element OBU s at any one time.
- There SHALL be at most one Channel-based Audio Element having num_layers > 1 at any one time.
- There SHALL be at most one Scene-based Audio Element at any one time.
- In other words, only the following combinations of two Audio Elements are allowed.
  - Channel-based Audio Element having num_layers = 1 + Channel-based Audio Element having num_layers = 1.
  - Channel-based Audio Element having num_layers = 1 + Channel-based Audio Element having num_layers > 1.
  - Scene-based Audio Element + Channel-based Audio Element having num_layers = 1.
  - Scene-based Audio Element + Channel-based Audio Element having num_layers > 1.
There MAY be Temporal Delimiter OBU s present. If present, the first OBU of every Temporal Unit SHALL be Temporal Delimiter OBU .
additional_profile SHALL be set to 1 to indicate that the IA Sequence complies with this profile.
primary_profile MAY be set to 0 or 1.

Capabilities of the IA parser, decoder, and processor:

They SHALL be able to parse an IA Sequence with primary_profile = 0 or 1.
They SHALL be able to support the capabilities of the Simple Profile.
They SHALL be able to decode and process up to 18 channels.
- Here, 18 channels refers to the total number of channels before mixing the two Audio Elements, not the number of channels after mixing.
- One specific example of 18 channels is 3rd-order Ambisonics (16 channels) + non-diegetic stereo (2 channels).
They SHALL be able to reconstruct two Audio Element s.
They SHALL be able to mix two Audio Element s.
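The Base Profile restrictions on Audio Element combinations listed earlier in this subsection can be expressed as a small predicate. This is a sketch only: base_profile_allows is a hypothetical name, and each Audio Element is modelled as a (kind, num_layers) tuple, where kind is "channel" or "scene" and num_layers is ignored for scene-based elements.

```python
def base_profile_allows(elements) -> bool:
    """Check the Base Profile limits on simultaneous Audio Elements."""
    if len(elements) > 2:
        return False
    # At most one Channel-based Audio Element with num_layers > 1.
    scalable = sum(1 for kind, layers in elements
                   if kind == "channel" and layers > 1)
    # At most one Scene-based Audio Element.
    scene = sum(1 for kind, _ in elements if kind == "scene")
    return scalable <= 1 and scene <= 1
```

For instance, a Scene-based element plus a Channel-based element with num_layers > 1 passes, while two Scene-based elements do not.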

5. Standalone IAMF Representation

This section details the order in which the OBUs are sequenced in a standalone IAMF representation.

5.1. OBU Sequence Order

An IA Sequence is composed of a series of OBUs in the sequence of a set of Descriptors followed by their associated IA Data s.

NOTE: In a typical case, the first Descriptors of the IA Sequence are all non-redundant (i.e. obu_redundant_copy = 0). When two IA Sequences are concatenated, every OBU of the first Descriptors of the second IA Sequence is marked as non-redundant (i.e. obu_redundant_copy = 0).

The Descriptors MAY additionally be repeated redundantly and as frequently as necessary. In this case, the obu_redundant_copy field in their OBU headers SHALL be set to 1.

The below figure shows an example of IA Sequence .

Example of Immersive Audio Sequence

5.1.1. Descriptor OBUs

A set of Descriptors SHALL be placed in the following order regardless of where they appear in the bitstream:

One IA Sequence Header OBU
All Codec Config OBU s
All Audio Element OBU s
All Mix Presentation OBU s
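The ordering rule for a set of Descriptors can be checked with a rank comparison, as in the sketch below. The string labels and the function descriptors_in_order are hypothetical stand-ins for the actual OBU types.

```python
DESCRIPTOR_ORDER = ["sequence_header", "codec_config",
                    "audio_element", "mix_presentation"]


def descriptors_in_order(kinds) -> bool:
    """Return True if the Descriptor OBU kinds follow the required order."""
    try:
        ranks = [DESCRIPTOR_ORDER.index(kind) for kind in kinds]
    except ValueError:
        return False  # not a Descriptor OBU kind
    # Exactly one IA Sequence Header OBU, and kinds must be non-decreasing.
    return list(kinds).count("sequence_header") == 1 and ranks == sorted(ranks)
```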

5.1.2. IA Data OBUs

IA Data consists of a sequence of Audio Frame OBU s, Parameter Block OBU s and Temporal Delimiter OBU s (if present), according to the rules below:

Audio Frame OBU s and Parameter Block OBU s SHALL be ordered by their implied timestamp in the timeline.
If there are multiple Audio Frame OBU s that have the same implied start timestamp, they SHALL be grouped by Audio Element s.
A Temporal Delimiter OBU MAY be inserted at the beginning of a Temporal Unit .
If Temporal Delimiter OBU s are present, one of them SHALL be inserted at the beginning of every Temporal Unit .

Additionally, the following constraints apply to the Audio Frame OBU s and Parameter Block OBU s:

Audio Frame OBU s SHALL be provided non-redundantly, such that for each Audio Substream , there are no two Audio Frame OBU s that are overlapping in time.
Non-redundant Parameter Block OBU s SHALL NOT provide data for overlapping time regions.

5.1.3. Refreshing Descriptor OBUs

The above describes the full sequence of OBUs for a given set of Descriptors and their associated IA Data .

If the IAMF configuration changes, a new set of Descriptors is REQUIRED. In that case, a new IA Sequence of the complete set of Descriptors and their corresponding IA Data s SHALL follow, in the same order as described above.

Each OBU of the Descriptors of the new IA Sequence SHALL be marked as non-redundant (i.e. obu_redundant_copy = 0 in the OBU header).

6. ISOBMFF IAMF Encapsulation

6.1. General Requirements & Brands

A file conformant to this specification satisfies the following:

It SHALL conform to the normative requirements of [ISOBMFF]
It SHALL have the iamf brand among the compatible brands array of the FileTypeBox
It SHALL contain at least one track using an IASampleEntry
It SHOULD indicate a structural ISOBMFF brand among the compatible brands array of the FileTypeBox, such as iso6
It MAY indicate other brands not specified in this specification provided that the associated requirements do not conflict with those given in this specification

Parsers SHALL support the structures required by the iso6 brand and MAY support structures required by further ISOBMFF structural brands.

6.2. ISOBMFF IAMF Encapsulation

This section describes the basic data structures used to signal encapsulation of IA Sequence in [ISOBMFF] containers.

6.2.1. Requirement of IA Sequence

Even though an IA Sequence can theoretically group audio data coded with different codecs, potentially with different timing properties, which would require multiple tracks, this version of the specification only supports storing an IA Sequence as a single track thanks to the restrictions of the selected profiles.

6.2.2. Encapsulation Scheme

The result of encapsulating an IA Sequence into an [ISOBMFF] file is as follows:

If there are audio samples to be trimmed at the start or at the end, the edts and elst boxes SHALL be present to reflect the trimming status.
Sample Entry
- An IA Sample SHALL be associated with only one sample entry, and the configOBUs in that sample entry SHALL contain the Descriptors required to process the IA Sample . If a different set of Descriptors is needed, a new sample entry SHALL be defined.

NOTE: Multiple sample entries may be used in a track, for example when the track is the concatenation of multiple tracks or multiple IA Sequence s, and some IA Sample s have different configOBUs values.

Decoding Time to IA Sample
- The stts or trun box SHALL indicate the number of audio samples in an IA Sample (i.e. the duration of an IA Sample ).
- The duration of IA Sample includes audio samples trimmed at the beginning but excludes audio samples trimmed at the end.
Sample Group
- When the codec_id is set to Opus or mp4a in an IA Track, every sample SHALL be associated with a sample group of type roll . The roll_distance value SHALL equal the value of the audio_roll_distance field in the Codec Config OBU stored in the configOBUs array in the sample entry.
Composition Time Stamp (CTS)
- For each IA Sample, CTS = DTS (Decoding Time Stamp), and as a consequence, the ctts box (and similar signaling in movie fragments) SHALL NOT be used.

6.2.3. IA Sample Entry

Sample Entry Type: iamf
Container:         Sample Description Box ('stsd')
Mandatory:         Yes
Quantity:          One or more.

IASampleEntry specifies that the track contains IA Sample s.

Syntax

class IASampleEntry extends AudioSampleEntry('iamf') {
  unsigned int (8) configOBUs[];
}

The channelcount field of AudioSampleEntry SHALL be set to 0. The samplerate field of AudioSampleEntry SHALL be set to 0. There SHALL be no SamplingRateBox . Parsers SHALL ignore these two fields.

Semantics

configOBUs SHALL contain the following OBUs in order.

IA Sequence Header OBU
Codec Config OBU
One or more Audio Element OBU s
One or more Mix Presentation OBU s

6.2.4. IA Sample Format

Syntax

class ia_sample() {
  unsigned int (8) obus[];
}

Semantics

obus is a sequence of OBUs representing one Temporal Unit .

For tracks using the IASampleEntry , an IA Sample has the following constraints:

All IA Sample s SHALL be marked as sync samples.
One IA Sample SHALL be one Temporal Unit and SHALL NOT contain the Temporal Delimiter OBU .
The decode duration of an IA Sample SHALL equal the duration of the underlying Temporal Unit , i.e. the decode duration of the Audio Frame OBU .

NOTE: Per the restriction of the profiles carried in an IA track, all Audio Frame OBU s in an IA Sample have the same duration and have the same trimming information. If Audio Frame OBU s in the IA Sample contain trimming information, the corresponding audio samples are removed from the presentation using edit list information.

NOTE: In typical cases, when a track contains a single IA Sequence , trimming can only happen at the beginning or the end of the IA Sequence . Therefore, the edit list can describe the start and end trimming with a single edit entry. A track storing consecutive IA Sequence s may need multiple edits in the edit list.

6.3. Codecs Parameter String

DASH and other applications require defined values for the codecs parameter specified in [RFC6381] for ISO Media tracks. The codecs parameter string for codec_id SHALL be:

Per [RFC6381] and [ISOBMFF] , the first element of the codecs parameter string is iamf .
The second element indicates the primary_profile. It is three digits within the range of 0 to 255.
The third element indicates the additional_profile. It is three digits within the range of 0 to 255.
The fourth element and any additional elements SHALL be the elements of the codecs parameter string that would be used if that stream were carried in its own track (i.e. not encapsulated in IAMF).

For example,

the codecs parameter string for codec_id = Opus is

iamf.xxx.yyy.Opus

the codecs parameter string for codec_id = mp4a is

iamf.xxx.yyy.mp4a.40.2

the codecs parameter string for codec_id = fLaC is

iamf.xxx.yyy.fLaC

the codecs parameter string for codec_id = ipcm is

iamf.xxx.yyy.ipcm

where xxx is three digits to indicate the value of the primary_profile and yyy is three digits to indicate the value of the additional_profile .
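A codecs parameter string of this form can be assembled as sketched below. The function iamf_codecs_string is a hypothetical helper; it assumes the profile values are written as zero-padded decimal digits, and codec_elements holds the elements the stream would use in its own track.

```python
def iamf_codecs_string(primary_profile: int, additional_profile: int,
                       codec_elements) -> str:
    """Build 'iamf.xxx.yyy.<codec elements>'."""
    for value in (primary_profile, additional_profile):
        if not 0 <= value <= 255:
            raise ValueError("profile values must be in the range 0 to 255")
    parts = ["iamf", f"{primary_profile:03d}", f"{additional_profile:03d}"]
    parts.extend(codec_elements)
    return ".".join(parts)
```

For example, primary_profile = 0 and additional_profile = 1 with AAC-LC would give iamf.000.001.mp4a.40.2.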

6.4. ISOBMFF IAMF Decapsulation (Informative)

6.4.1. Decapsulating an ISOBMFF IAMF File with a Single Track

This section provides a guideline for IAMF parsers reconstructing IA Sequence s from an IAMF file with a single track, as follows.

The configOBUs from the IASampleEntry are placed at the beginning of the IA Sequence . These are the Descriptors .
Next, place the OBUs from the j = 1, 2, ..., m -th IA Sample s associated with the IASampleEntry in the IA Sequence , in order. These form the j = 1, 2, ..., m -th Temporal Unit s.
- If it is desirable to have Temporal Delimiter OBU s in the IA Sequence , insert a Temporal Delimiter OBU in front of every Temporal Unit .
- Otherwise, do not insert any Temporal Delimiter OBU s in the IA Sequence .
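The decapsulation steps above amount to a concatenation, sketched below. The function rebuild_ia_sequence is hypothetical; the serialized Temporal Delimiter OBU is passed in by the caller because its byte encoding is defined elsewhere in this specification and is not reproduced here.

```python
from typing import Iterable


def rebuild_ia_sequence(config_obus: bytes, samples: Iterable[bytes],
                        temporal_delimiter: bytes = b"") -> bytes:
    """Rebuild an IA Sequence from one sample entry and its IA Samples.

    temporal_delimiter: serialized Temporal Delimiter OBU supplied by the
    caller, or b"" to insert none.
    """
    sequence = bytearray(config_obus)  # the Descriptors come first
    for sample in samples:  # each IA Sample is one Temporal Unit
        sequence += temporal_delimiter
        sequence += sample
    return bytes(sequence)
```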

6.4.2. Handling Trimming Information

This section provides a guideline for handling trimming information in an ISOBMFF file.

Recommendation for handling ISOBMFF trimming information. PTS is the presentation start time. PTS1 is the presentation start time of the first audio sample before trimming. PTS2 is the presentation start time of the first audio sample after trimming.

As depicted in the figure above,

The ISOBMFF parser passes the Descriptors , PTS1 and IA Samples (or Temporal Unit s) to the IAMF decoder.
The ISOBMFF parser passes PTS1 and the trimming information to the ISOBMFF player. (This is optional if the IAMF decoder trims the audio samples.)
The IAMF decoder passes PTS and the audio samples after decoding to the ISOBMFF player.
- If the IAMF decoder trims the audio samples based on the trimming information within the Audio Frame OBU s, then the IAMF decoder passes PTS2 and the audio samples after trimming.
- If the IAMF decoder does not trim, then the IAMF decoder passes PTS1 and the audio samples before trimming.
The ISOBMFF player plays back the trimmed audio samples through the loudspeakers starting at PTS2.

7. IAMF processing

This section provides processes for IA decoding for a given IA Sequence .

IA decoding can be done by using the combination of the following decoding processing.

Decoding of a scene-based Audio Element (Ambisonics decoding)
Decoding of a channel-based Audio Element (Scalable Channel Audio decoding)
Rendering and mixing of each Audio Element before mixing of multiple Audio Element s.
- It may include re-sampling of each Audio Element .
Mixing of multiple Audio Element s with synchronization
Post-processing such as Loudness and Limiter.

Ambisonics decoding SHALL conform to [RFC8486], except for codec-specific processing.

Scalable Channel Audio decoding SHALL output the 3D audio signal (e.g. 3.1.2ch or 7.1.4ch) for the target channel layout.

IA decoder is composed of an OBU parser, Codec decoder, Audio Element Renderer, and Post-processor as depicted in the figure below.

OBU parser SHALL depacketize IA Sequence to output one or more Audio Substream s with one decoder_config() , Descriptors and Parameter Substream s.
Codec decoder for each Audio Substream SHALL output decoded channels.
Audio Element Renderer reconstructs the 3D audio signal from the decoded channels of the Codec decoders according to the type of Audio Element , which is specified in the Audio Element OBU , and renders the audio channels to the target loudspeaker layout.
- For scene-based audio element, it SHALL output 3D audio signal for the target loudspeaker layout from the reconstructed ambisonics channels.
- For channel-based audio element, it SHALL output 3D audio signal for the target loudspeaker layout from the reconstructed audio channels.
Post-processor outputs Immersive Audio according to the target loudspeaker layout after mixing and post-processing such as Loudness and Limiter.

IA Decoder Configuration

7.1. Ambisonics decoding

This section describes the decoding of Ambisonics.

The figure below shows the decoding flowchart of Ambisonics decoding.

OBU parser SHALL output the Audio Substream s for the scene-based Audio Element in the IA sequence .
- OBU parser SHALL output channel_mapping or demixing_matrix according to ambisonics_mode to Channel_Mapping/Demixing_Matrix module.
Codec decoder SHALL output as many decoded channels (PCM) as output_channel_count , in the transmission order, after decoding each Audio Substream .
Channel_Mapping/Demixing_Matrix module SHALL apply channel_mapping or demixing_matrix , according to ambisonics_mode , to the channels (PCM) and output as many channels as output_channel_count in ACN order.
Ambisonics to Channel Format module may render the output channels to 3D audio signal according to the target loudspeaker layout.

Ambisonics Decoding Flowchart

7.2. Scalable Channel Audio decoding

This section describes the decoding of Scalable Channel Audio.

The figure below shows the decoding flowchart of the decoding for Scalable Channel Audio.

Scalable Channel Audio Decoding Flowchart

For a given loudspeaker layout (i.e. CL #i) among the list of loudspeaker_layout in scalable_channel_layout_config() ,

OBU Parser SHALL get Audio Substream s for ChannelGroup #1 ~ ChannelGroup #i and pass them to Codec decoder with decoder_config() .
Codec decoder SHALL output decoded channels (PCM) in the transmission order.
- For non-scalable audio (i.e. i = num_layers = 1), its order SHALL be converted to the loudspeaker location order for CL #1.
Following are further processed for scalable audio (i.e. i > 1)
- When output_gain_is_present_flag (j) for ChannelGroup #j (j = 1, 2, …, i-1) is on, Gain module SHALL apply output_gain (j) to all audio samples of the mixed channels in the ChannelGroup #j indicated by output_gain_flag (j).
- De-Mixer SHALL output de-mixed channels (PCM) for CL #i generated through de-mixing of the mixed channels from the Gain module by using non-mixed channels and demixing parameters for each frame.
- Recon_Gain module SHALL output smoothed channels (PCM) by applying recon_gain to each frame of the de-mixed channels.
- The order for Non-mixed channels and Smoothed channels SHALL be converted to the loudspeaker location order for CL #i after going through necessary modules such as Gain, De-Mixer, Recon_Gain, etc.
Following may be further processed
- Loudness normalization module may output loudness normalized channels at -24 LKFS from non-mixed channels and smoothed channels (if present) by using loudness information for CL #i.
- Limiter module may limit the true peak of input channels at -1dB.

The following sections, § 7.2.1 Gain , § 7.2.2 De-mixer and § 7.2.3 Recon Gain , are needed only for decoding of scalable audio with num_layers > 1.

7.2.1. Gain

The Gain module is the mirror process of the Attenuation module. It recovers the reduced sample values using output_gain (i) when its output_gain_is_present_flag (i) for ChannelGroup #i is on. When its output_gain_is_present_flag (i) is off, then this module SHALL be bypassed for ChannelGroup #i. output_gain (i) for ChannelGroup #i SHALL be applied to all samples of the mixed channels in the ChannelGroup #i, where mixed channels means the mixed channels from an input channel audio (i.e. a channel audio for CL #n).

To apply the gain, an implementation SHALL use the following:

Sample *= pow(10, output_gain (i) / (20.0*256))

Where, n = num_layers and i = 1, 2, ..., n. output_gain (i) is the raw 16-bit value for the ith layer which is specified in channel_audio_layer_config() .
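
As a worked illustration of the formula above, the following sketch applies output_gain to a list of samples. The function name is hypothetical, and the raw 16-bit value is assumed here to be interpreted as a signed integer in steps of 1/256 dB.

```python
from typing import List


def apply_output_gain(samples: List[float], output_gain: int) -> List[float]:
    """Scale the mixed-channel samples of one ChannelGroup by output_gain.

    Assumption: output_gain is the raw value read as a signed 16-bit integer,
    so output_gain / 256 is the gain in dB.
    """
    if not -32768 <= output_gain <= 32767:
        raise ValueError("output_gain must fit in a signed 16-bit integer")
    factor = pow(10.0, output_gain / (20.0 * 256))
    return [s * factor for s in samples]
```

For instance, a raw value of 1536 corresponds to 6 dB, i.e. a linear factor of about 1.995.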

7.2.2. De-mixer

For scalable channel audio with num_layers > 1, some channels of down-mixed audio for CL #i are delivered as is but the rest are mixed with other channels for CL #i-1.

De-mixer module reconstructs the rest of the down-mixed audio for CL #i from the mixed channels, which is passed by the Gain module, and its relevant non-mixed channels using its relevant demixing parameters.

De-mixing for down-mixed audio for CL #i SHALL comply with the result by the combination of the following surround and top de-mixers:

Surround de-mixers
- S1to2 de-mixer : R2 = 2 x Mono – L2
- S2to3 de-mixer : L3 = L2 – 0.707 x C and R3 = R2 – 0.707 x C
- S3to5 de-mixer : Ls = 1/δ(k) x (L3 – L5) and Rs = 1/δ(k) x (R3 – R5)
- S5to7 de-mixer : Lrs = 1/β(k) x (Ls – α(k) x Lss) and Rrs = 1/β(k) x (Rs – α(k) x Rss)
Top de-mixers
- TF2toT2 de-mixer : Ltf2 = Ltf3 – w(k) x (L3 – L5) and Rtf2 = Rtf3 – w(k) x (R3 – R5)
- T2to4 de-mixer : Ltb = 1/γ(k) x (Ltf2 – Ltf4) and Rtb = 1/γ(k) x (Rtf2 – Rtf4)
Where, Ltf2 / Rtf2 is top channel of x.1.2ch, Ltf3 / Rtf3 is top channel of 3.1.2ch, and Ltf4 / Rtf4 is top channel of x.1.4ch (x = 5 or 7) and w(k) is determined from the value of wIdx(k).

Initially, wIdx(0) = 0 and the value of wIdx(k) SHALL be derived as follows:

wIdx(k) = Clip3(0, 10, wIdx(k-1) + w_idx_offset(k))

Mapping of wIdx(k) to w(k) SHOULD be as follows:

wIdx(k) :   w(k)
   0    :    0
   1    :  0.0179
   2    :  0.0391
   3    :  0.0658
   4    :  0.1038
   5    :  0.25
   6    :  0.3962
   7    :  0.4342
   8    :  0.4609
   9    :  0.4821
   10    : 0.5
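
The derivation of wIdx(k) and the mapping to w(k) can be sketched as follows. The names `W_TABLE`, `clip3` and `update_w` are illustrative; `clip3` is written here under the assumption that Clip3(lo, hi, x) clamps x to the range [lo, hi].

```python
from typing import Tuple

# w(k) for wIdx(k) = 0..10, copied from the mapping above.
W_TABLE = [0.0, 0.0179, 0.0391, 0.0658, 0.1038, 0.25,
           0.3962, 0.4342, 0.4609, 0.4821, 0.5]


def clip3(lo: int, hi: int, x: int) -> int:
    """Clamp x to [lo, hi] (assumed meaning of Clip3)."""
    return max(lo, min(hi, x))


def update_w(w_idx_prev: int, w_idx_offset: int) -> Tuple[int, float]:
    """Return (wIdx(k), w(k)) given wIdx(k-1) and w_idx_offset(k)."""
    w_idx = clip3(0, 10, w_idx_prev + w_idx_offset)
    return w_idx, W_TABLE[w_idx]
```

Starting from wIdx(0) = 0, an offset of -1 leaves the index at 0, and repeated offsets of +1 saturate at index 10.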

When D_set = { x | S1 < x ≤ Si and x is an integer},

If 2 is an element of D_set, the combination SHALL include S1to2 de-mixer .
If 3 is an element of D_set, the combination SHALL include S2to3 de-mixer .
If 5 is an element of D_set, the combination SHALL include S3to5 de-mixer .
If 7 is an element of D_set, the combination SHALL include S5to7 de-mixer .

When Ti = 2,

If Sj = 3 (j=1,2,…, i-1), the combination SHALL include TF2toT2 de-mixer .

When Ti = 4,

If Sj = 3 (j=1,2,…, i-1), the combination SHALL include TF2toT2 de-mixer and T2to4 de-mixer .
Elseif Tj = 2 (j=1,2,…, i-1), the combination SHALL include T2to4 de-mixer .

For example, when CL #1 = 2ch, CL #2 = 3.1.2ch, CL #3 = 5.1.2ch and CL #4 = 7.1.4ch, the rest (i.e. Ls5/Rs5/Ltf/Rtf) of the down-mixed audio 5.1.2ch is reconstructed as follows:

The combination includes S2to3 de-mixer , S3to5 de-mixer and TF2toT2 de-mixer .
Ls5 and Rs5 are recovered by S2to3 de-mixer and S3to5 de-mixer.
Ltf and Rtf are recovered by S2to3 de-mixer and TF2toT2 de-mixer.

Ls5 = 1/δ(k) × (L2 - 0.707 × C - L5) and Rs5 = 1/δ(k) × (R2 - 0.707 × C - R5).
Ltf = Ltf3 - w(k) x (L2 - 0.707 x C - L5) and Rtf = Rtf3 - w(k) x (R2 - 0.707 x C - R5).
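
The example above can be sketched per sample as follows. This is an illustrative sketch: the function name is hypothetical, and δ(k) and w(k) for the current frame are passed in as `delta` and `w`.

```python
from typing import Tuple


def demix_to_5_1_2(L2: float, R2: float, C: float, L5: float, R5: float,
                   Ltf3: float, Rtf3: float,
                   delta: float, w: float) -> Tuple[float, float, float, float]:
    """Recover (Ls5, Rs5, Ltf, Rtf) of 5.1.2ch for one sample."""
    # S2to3 de-mixer
    L3 = L2 - 0.707 * C
    R3 = R2 - 0.707 * C
    # S3to5 de-mixer
    Ls5 = (L3 - L5) / delta
    Rs5 = (R3 - R5) / delta
    # TF2toT2 de-mixer
    Ltf = Ltf3 - w * (L3 - L5)
    Rtf = Rtf3 - w * (R3 - R5)
    return Ls5, Rs5, Ltf, Rtf
```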

7.2.3. Recon Gain

Recon gain is REQUIRED only for num_layers > 1 and when codec_id is set to Opus or mp4a .

recon_gain SHALL be applied only to the audio samples of the de-mixed channels from the De-mixer module.

recon_gain_info_parameter_data() indicates each channel of CL #i to which recon gain needs to be applied and provides recon_gain value for each frame of the channel.
- Sample (k,i) *= Smoothed_Recon_Gain (k,i), where k is the frame index and i is the sample index of the frame.
- Smoothed_Recon_Gain (k) = MA_gain (k-1) x e_window + MA_gain (k) x s_window
- MA_gain (k) = 2 / (N+1) x recon_gain (k) / 255 + (1 – 2/(N+1)) x MA_gain (k-1), where MA_gain (0) = 1.
- e_window[0: olen] = hanning[olen:], e_window[olen:flen] = 0.
- s_window[0: olen] = hanning[:olen], s_window[olen:flen] = 1.
- Where hanning = np.hanning (2*olen), flen is the frame size and olen is the overlap size.
- Recommended value: N = 7

The figure below shows the smoothing scheme of recon_gain .

Smoothing Scheme of Recon Gain

RECOMMENDED values for specific codecs are as follows

When codec_id is set to Opus : olen = 60.
When codec_id is set to mp4a : olen = 64.
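
The smoothing described in this section can be sketched as follows. This is a non-normative sketch: the function names are hypothetical, and `hanning` is a pure-Python stand-in that follows the same definition as np.hanning.

```python
import math
from typing import List


def hanning(m: int) -> List[float]:
    """Same definition as np.hanning(m) for m >= 2."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (m - 1)) for n in range(m)]


def update_ma_gain(ma_prev: float, recon_gain: int, n: int = 7) -> float:
    """MA_gain(k) from MA_gain(k-1) and the 8-bit recon_gain(k)."""
    a = 2.0 / (n + 1)
    return a * recon_gain / 255.0 + (1.0 - a) * ma_prev


def smoothed_recon_gain(ma_prev: float, ma_cur: float,
                        flen: int, olen: int) -> List[float]:
    """Per-sample Smoothed_Recon_Gain(k, i) for one frame of flen samples."""
    win = hanning(2 * olen)
    e_window = win[olen:] + [0.0] * (flen - olen)  # fades out MA_gain(k-1)
    s_window = win[:olen] + [1.0] * (flen - olen)  # fades in MA_gain(k)
    return [ma_prev * e + ma_cur * s for e, s in zip(e_window, s_window)]
```

Each sample of a de-mixed channel in frame k is then multiplied by the corresponding entry of the returned list; after the first olen samples, the gain is simply MA_gain (k).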

7.3. Mix Presentation

An IA Sequence MAY contain more than one Mix Presentation . § 7.3.1 Selecting a Mix Presentation details how a Mix Presentation SHOULD be selected from multiple of them.

A Mix Presentation specifies how to render, process and mix one or more Audio Element s. Each Audio Element SHOULD first be individually rendered and processed before mixing. Then, any additional processing specified by output_mix_config() SHOULD be applied to the mixed audio signal in order to generate the final output audio for playback. § 7.3.2 Rendering an Audio Element details how each Audio Element SHOULD be rendered, while § 7.3.3 Mixing Audio Elements details how the Audio Element s SHOULD be processed and mixed.

7.3.1. Selecting a Mix Presentation

When an IA Sequence contains multiple Mix Presentation s, the IA parser SHOULD select the appropriate Mix Presentation in the following order.

If there are any user-selectable mixes, the IA parser SHOULD select the mix, or mixes, that match the user’s preferences. An example might be a mix with a specific language. Mix Presentation s may use mix_presentation_friendly_label to describe such mixes.
If there is more than one valid mix remaining, the IA parser SHOULD select an appropriate mix for rendering, in the following order.
1. If the playback layout is binaural, i.e. headphones:
  1. Select the mix with audio_element_id whose loudspeaker_layout is BINAURAL.
  2. If there is no such mix, select the mix with the highest available loudness_layout .
2. If the playback layout is loudspeakers:
  1. If there is a mix with a loudness_layout that matches the playback loudspeaker layout, it SHOULD be selected. If there is more than one matching mix, the first one SHOULD be selected.
  2. If there is no such mix, select the Mix Presentation with the highest available loudness_layout .

7.3.2. Rendering an Audio Element

This specification supports the rendering of either a channel-based or scene-based Audio Element to either a target loudspeaker layout or a binaural output.

In this section, for a given x.y.z layout, the next highest layout x'.y'.z' means that x', y', and z' are greater than or equal to x, y, and z, respectively.

`audio_element_type`	Playback layout	Section
CHANNEL_BASED	Loudspeakers	§ 7.3.2.1 Rendering a channel-based audio element to loudspeakers
SCENE_BASED	Loudspeakers	§ 7.3.2.2 Rendering a scene-based audio element to loudspeakers
CHANNEL_BASED	Binaural output	§ 7.3.2.3 Rendering a channel-based audio element to a binaural output
SCENE_BASED	Binaural output	§ 7.3.2.4 Rendering a scene-based audio element to a binaural output

7.3.2.1. Rendering a channel-based audio element to loudspeakers

This section defines the renderer to use, given a channel-based Audio Element and a loudspeaker playback layout.

The input layout (x.y.z) of the IA renderer is set as follows:
- If num_layers = 1, use the loudspeaker_layout of the Audio Element .
- Else, if one of the Audio Element 's loudspeaker_layout s matches the playback layout, use it.
- Else, use the next highest available layout from all available loudspeaker_layout .
The output layout of the IA renderer is set to the playback layout (X.Y.Z).
The IA renderer used is selected according to the following rules:
- If DemixingParamDefinition() is not present,
  - If the playback layout is neither 3.1.2ch nor 7.1.2ch,
    - If the playback layout complies with loudspeaker layouts supported by [ITU2051-3] , use EAR Direct Speakers renderer ( [ITU2127-0] ).
    - Else, use an implementation-specific renderer.
  - Else if the playback layout is 7.1.2ch, use EAR Direct Speakers renderer ( [ITU2127-0] ) to render the input audio to 7.1.4ch first, followed by down-mixing from 7.1.4ch to 7.1.2ch.
    - Where height channels of 7.1.4ch are down-mixed to height channels of 7.1.2ch as follows: Ltf2 = Ltf4 + 0.707 * Ltb and Rtf2 = Rtf4 + 0.707 * Rtb.
  - Else if the playback layout is 3.1.2ch,
    - If the input layout has height channels, use the static down-mix matrices specified in § 7.6.2 Static Down-mix Matrix .
    - Else if the surround channels(x) of the input layout > 3, use the static down-mix matrices specified in § 7.6.2 Static Down-mix Matrix after padding empty height channels to the input audio relevant to the input layout.
    - Else, pad empty channels to the input audio relevant to the input layout to make 3.1.2ch.
- Else,
  - If the playback layout matches a loudspeaker_layout which can be generated from the highest loudspeaker layout of the Audio Element according to § 10.2.3 Annex B-3: Channel Layout Generation Rule ,
    - If the playback layout has height channels, use demixing_info_parameter_data() or default_demixing_info_parameter_data() .
    - Else,
      - If the input layout does not have height channels, use demixing_info_parameter_data() or default_demixing_info_parameter_data() .
      - Else, use EAR Direct Speakers renderer ( [ITU2127-0] ).
  - Else,
    - If the playback layout is neither 3.1.2ch nor 7.1.2ch,
      - If the playback layout complies with loudspeaker layouts supported by [ITU2051-3] , use EAR Direct Speakers renderer ( [ITU2127-0] ).
      - Else, use an implementation-specific renderer.
    - Else if the playback layout is 7.1.2ch, use EAR Direct Speakers renderer ( [ITU2127-0] ) to render the input audio to 7.1.4ch first, followed by down-mixing from 7.1.4ch to 7.1.2ch.
      - Where height channels of 7.1.4ch are down-mixed to height channels of 7.1.2ch as follows: Ltf2 = Ltf4 + 0.707 * Ltb and Rtf2 = Rtf4 + 0.707 * Rtb.
    - Else if the playback layout is 3.1.2ch,
      - If the input layout has height channels, use the static down-mix matrices specified in § 7.6.2 Static Down-mix Matrix .
      - Else if the surround channels(x) of the input layout > 3, use the static down-mix matrices specified in § 7.6.2 Static Down-mix Matrix after padding empty height channels to the input audio relevant to the input layout.
      - Else, pad empty channels to the input audio relevant to the input layout to make 3.1.2ch.
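
The height down-mix used in the 7.1.2ch branches above can be illustrated per sample as follows. The function name is hypothetical; only the height channels are shown.

```python
from typing import Tuple


def downmix_heights_7_1_4_to_7_1_2(ltf4: float, rtf4: float,
                                   ltb: float, rtb: float) -> Tuple[float, float]:
    """Down-mix the four height channels of 7.1.4ch to the two of 7.1.2ch."""
    ltf2 = ltf4 + 0.707 * ltb
    rtf2 = rtf4 + 0.707 * rtb
    return ltf2, rtf2
```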

If the EAR Direct Speakers renderer is used, the following SHOULD be provided for each audio channel of the Audio Element :

speaker label: the label of the speaker position, using the same convention as "SP Label" in [ITU2051-3] . This is defined for each audio channel of the Audio Element based on the information from loudspeaker_layouts .

In [ITU2051-3] , an LFE audio channel MAY be identified either by an explicit label or its frequency content. In this specification, the LFE channel is identified based on the explicit label only, given by loudspeaker_layout .

7.3.2.2. Rendering a scene-based audio element to loudspeakers

This section defines the renderer to use, given a scene-based Audio Element and a loudspeaker playback layout.

The input layout of the IA renderer is set to Ambisonics.
The output layout of the IA renderer is set to the playback layout.
The IA renderer used is selected according to the following rules:
- If the playback layout complies with loudspeaker layouts supported by [ITU2051-3] , use EAR HOA renderer ( [ITU2127-0] ).
- Else, use an implementation-specific renderer.
  - If there is no implementation-specific Ambisonics renderer, use the EAR HOA renderer to render to the next highest [ITU2051-3] layout compared to the playback layout, and then downmix using an implementation-specific renderer or use the static down-mix matrices specified in § 7.6.2 Static Down-mix Matrix if available.

If the EAR HOA renderer is used, the following metadata SHOULD be provided to the renderer for each audio channel:

Ambisonics order
Ambisonics degree
Ambisonics normalization method

In this specification, the [AmbiX] format is adopted, which uses SN3D normalization and ACN channel ordering. Accordingly, the Ambisonics order and degree can be computed from the channel index k as follows:

order   n = floor(sqrt(k)),
degree  m = k - n * (n + 1).
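
As a small worked example of the two formulas above, the following sketch maps an ACN channel index to its order and degree. The function name is illustrative; math.isqrt is used so that floor(sqrt(k)) is computed exactly for integers.

```python
import math
from typing import Tuple


def acn_to_order_degree(k: int) -> Tuple[int, int]:
    """Return (order n, degree m) for ACN channel index k."""
    if k < 0:
        raise ValueError("channel index must be non-negative")
    n = math.isqrt(k)      # floor(sqrt(k))
    m = k - n * (n + 1)
    return n, m
```

For First Order Ambisonics, channel indices 0 to 3 map to (0, 0), (1, -1), (1, 0) and (1, 1).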

7.3.2.3. Rendering a channel-based audio element to a binaural output

Given a channel-based Audio Element and a binaural playback layout, the Binaural EBU ADM Direct Speaker renderer [EBU-Tech-3396] SHOULD be used. The highest layout provided in scalable_channel_layout_config() SHOULD be used as the input to the renderer.

7.3.2.4. Rendering a scene-based audio element to a binaural output

Given a scene-based Audio Element and a binaural playback system, the Resonance Audio renderer [Resonance-Audio] SHOULD be used.

7.3.3. Mixing Audio Elements

Each Audio Element SHOULD be processed individually before mixing as follows:

Render to the playback layout.
If the Audio Element s do not all have a common sample rate, re-sampling to 48 kHz is RECOMMENDED.
If the Audio Element s do not all have a common bit-depth, convert to a common bit-depth. This specification RECOMMENDs using 16 bits.
If loudness_layout matches with the playback layout, apply any per-element processing according to element_mix_config() .

The rendered and processed Audio Element s SHOULD then be summed, and output_mix_config() SHOULD then be applied to generate one sub-mixed audio signal.

7.4. Animated Parameters

This section describes how a set of parameter values is animated over a subblock in a parameter_block_obu() and applied to the corresponding audio samples, using the information provided in AnimatedParameterData() .

If animation_type is equal to STEP, the parameter value provided by start_point_value SHOULD be applied to all time steps in the subblock.

If animation_type is equal to LINEAR or BEZIER, the information provided in AnimatedParameterData() describes how the set of parameter values is animated as a Bezier curve. Let D be the subblock_duration defined in the parameter_block_obu() and P0, P1 and P2 be 2D coordinates defined as

P0 = (t0, start_point_value),
P1 = (t1, control_point_value),
P2 = (t2, end_point_value),

where t0 = 0 is the subblock start time, t2 = D is the subblock end time and t1 is the control point time given by

t1 = round(D * control_point_relative_time).

The values of t0, t1 and t2 are expressed as ticks at the parameter_rate given in the associated parameter definition.

If animation_type is equal to LINEAR, the set of parameter values is linearly interpolated between start_point_value and end_point_value at a given point in time as:

B_linear(a) = (1 - a) * P0 + a * P2,
0 <= a <= 1,

where B_linear(a) = (t, y) is a 2D coordinate with the parameter value y at time t.

If animation_type is equal to BEZIER, the set of parameter values is interpolated following a quadratic Bezier curve between start_point_value and end_point_value at a given point in time as:

B_quad(a) = (1 - a)^2 * P0 + 2 * (1 - a) * a * P1 + a^2 * P2,
0 <= a <= 1.

where B_quad(a) = (t, y) is a 2D coordinate with parameter value y at time t.

To apply the parameter values to the audio samples in the subblock without interpolation, the parameter_rate SHOULD be first resampled to the audio sample rate to give:

n0 = t0 * audio_sample_rate / parameter_rate,
n1 = t1 * audio_sample_rate / parameter_rate,
n2 = t2 * audio_sample_rate / parameter_rate.

Then, P0, P1, P2 can be rewritten as:

P0 = (n0, start_point_value),
P1 = (n1, control_point_value),
P2 = (n2, end_point_value).

Next, the parameter value y is computed for each time t that corresponds to an integer audio sample index, t = n = [0, 1, 2, ..., n2]. This is done by computing the equivalent value of a for every n, and then applying the Bezier equations B_linear(a) and B_quad(a) to find the parameter value y.

In the case of B_linear(a), the mapping between n and a is given by:

a = n ÷ n2.

In the case of B_quad(a), the mapping between n and a is given as follows. Let

alpha = n0 - 2 * n1 + n2,
beta = 2 * (n1 - n0),
gamma = n0 - n.

If alpha is equal to 0, then

a = -gamma ÷ beta,

else,

a = (-beta + sqrt(beta^2 - 4 * alpha * gamma)) ÷ (2 * alpha).
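
The per-sample evaluation described in this section can be sketched as follows. This is an illustrative sketch only: the function names and the string constants for animation_type are hypothetical, n0 is taken to be 0, and degenerate inputs (for example n2 = 0) are not handled.

```python
import math
from typing import Optional


def ticks_to_samples(t: float, audio_sample_rate: int, parameter_rate: int) -> float:
    """Convert a time in parameter_rate ticks to audio sample units."""
    return t * audio_sample_rate / parameter_rate


def parameter_value(animation_type: str, n: float, n2: float, start: float,
                    end: Optional[float] = None,
                    n1: Optional[float] = None,
                    control: Optional[float] = None) -> float:
    """Parameter value y at audio sample index n, with 0 <= n <= n2."""
    if animation_type == "STEP":
        return start
    if animation_type == "LINEAR":
        a = n / n2
        return (1 - a) * start + a * end
    if animation_type == "BEZIER":
        n0 = 0
        alpha = n0 - 2 * n1 + n2
        beta = 2 * (n1 - n0)
        gamma = n0 - n
        if alpha == 0:
            a = -gamma / beta
        else:
            a = (-beta + math.sqrt(beta * beta - 4 * alpha * gamma)) / (2 * alpha)
        return (1 - a) ** 2 * start + 2 * (1 - a) * a * control + a ** 2 * end
    raise ValueError("unknown animation_type")
```

In this sketch, the value found for a is substituted only into the y component of the Bezier equation, since the t component equals n by construction.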

7.5. Post Processing

7.5.1. Loudness Normalization

Loudness normalization SHOULD be done by adjusting the loudness level to a target output level using the information provided in § 3.7.7 Loudness Info Syntax and Semantics . A control MAY be provided to set unique target output levels for each anchored loudness and/or the integrated loudness. If loudness normalization increases the output level, a peak limiter to prevent saturation and/or clipping MAY be necessary; true_peak or digital_peak MAY be used to determine if peak limiting is needed. Alternatively, the total amount of normalization MAY be limited.

The rendered layouts that were used to measure the loudness information of a sub-mix are provided by loudness_layout s.

If one of them matches the playback layout, the loudness information SHOULD be used directly for normalization. If there is a mismatch between loudness_layout and the playback layout, the implementation MAY choose to use the provided loudness information of the highest loudness_layout as-is.
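
As a rough illustration of the normalization described above, the following sketch computes the gain needed to reach a target level and checks whether peak limiting may be needed. The function name and parameters are hypothetical, all levels are assumed to be already converted to decimal values in LKFS or dBTP, and the default ceiling of -1 dBTP is taken from § 7.5.2 Limiter .

```python
from typing import Optional, Tuple


def normalization_gain(integrated_loudness_lkfs: float,
                       target_lkfs: float,
                       true_peak_dbtp: Optional[float] = None,
                       ceiling_dbtp: float = -1.0) -> Tuple[float, float, bool]:
    """Return (gain in dB, linear gain, whether a limiter may be needed)."""
    gain_db = target_lkfs - integrated_loudness_lkfs
    linear = 10.0 ** (gain_db / 20.0)
    # The peak after normalization is the measured peak shifted by the gain.
    needs_limiter = (true_peak_dbtp is not None
                     and true_peak_dbtp + gain_db > ceiling_dbtp)
    return gain_db, linear, needs_limiter
```

For example, raising a mix measured at -30 LKFS with a true peak of -3 dBTP to a target of -24 LKFS requires 6 dB of gain and would push the peak above the ceiling, so limiting (or a smaller amount of normalization) would be needed.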

7.5.2. Limiter

The limiter SHOULD limit the true peak of an audio signal at -1 dBTP, where the true peak is defined in [ITU1770-4] . The limiter SHOULD apply to multichannel signals in a linked manner and further support auto-release.

7.6. Down-mix Matrix

7.6.1. Dynamic Down-mix Matrix

This section RECOMMENDs dynamic down-mixing matrices.

The dynamic down-mixing matrix complies with the down-mixing mechanism which is specified in § 10.2.2 Annex B-2: Down-mix Mechanism .

7.6.2. Static Down-mix Matrix

This section RECOMMENDs static down-mix matrices to render to 3.1.2ch from each of 5.1.2ch, 5.1.4ch, 7.1.2ch, and 7.1.4ch.

The figures below show static down-mix matrices to 3.1.2ch.

3.1.2ch Down-mix matrix for 5.1.2ch

3.1.2ch Down-mix matrix for 5.1.4ch

3.1.2ch Down-mix matrix for 7.1.2ch

3.1.2ch Down-mix matrix for 7.1.4ch

Where, p1 = 0.707. Implementations MAY use a limiter defined in § 7.5.2 Limiter to preserve the energy of audio signals instead of normalization factors.

8. IAMF Generation Process (Informative)

This section provides a guideline for IA encoding for a given input audio format.

Recommended input audio format for IA encoding is as follows:

Ambisonics format: It conforms to ChannelMappingFamily = 2 or 3 of [RFC8486] .
Channel Audio format: It conforms to loudspeaker_layout specified in channel_audio_layer_config() .
Input Sampling Rate: 48000Hz
Bit depth: 16 bits or 24 bits
- 16 bits are recommended for OPUS.
Input file format: .wav file (Linear PCM, simply called PCM)

For a given input audio and user inputs, the IA encoder outputs IA Sequence which conforms to § 3 Open Bitstream Unit (OBU) Syntax and Semantics .

Input audio is as follows:

Ambisonics format
Channel Audio format

User inputs are:

Ambisonics mode to indicate if ChannelMappingFamily = 2 or 3 of [RFC8486] .
List of channel layouts to be supported for scalable channel audio: it conforms to loudspeaker_layout .

IA encoding can be done by using the combination of following generation processing.

Encoding of an Audio Element (Ambisonics encoding or Scalable Channel Audio encoding)
Encoding of a Mix Presentation

The figure below shows the IA encoder configuration for one single Audio Element .

The IA encoder is composed of Pre-processor, Codec encoder, and OBU packetizer.

Pre-processor outputs one or more ChannelGroup s, Descriptors and optional Parameter Substream s based on the input audio and user inputs.
- It outputs one single ChannelGroup for scene-based Audio Element .
- It outputs one or more ChannelGroup s for channel-based Audio Element .
- It outputs Descriptors which are composed of one IA Sequence Header OBU , one Codec Config OBU , one Audio Element OBU , one or more Mix Presentation OBU s.
- It may output Parameter Substream s
  - For channel-based Audio Element with num_layers = 1, it may output Parameter Substream for demixing info.
  - For channel-based Audio Element with num_layers > 1, it outputs Parameter Substream s for demixing info and recon gain info, respectively.
  - It may further output Parameter Substream for mixing gain.
Codec encoder generates one or more Audio Substream s from each ChannelGroup based on Codec Config OBU .
OBU packetizer packetizes Descriptors , Parameter Substream s and Audio Substream s by OBU, and outputs IA Sequence .
- Temporal unit generator generates Temporal Unit for each frame from Audio Frame OBU s and Parameter Block OBU s (if present).

IA Encoder Configuration

8.1. Ambisonics Encoding

For Ambisonics encoding:

Pre-processor outputs one ChannelGroup and Descriptors and it is only composed of Meta Generator.
- Meta generator generates Descriptors based on Ambisonics mode and the number of channels for Ambisonics.
  - ambisonics_mode is set to 0 for ChannelMappingFamily = 2 of [RFC8486] or 1 for ChannelMappingFamily = 3 of [RFC8486] .
  - ambisonics_config() is set as follows:
    - output_channel_count is set to the number of channels for Ambisonics. For example 1, 4, 9, or 16.
    - channel_mapping for ambisonics_mode = 0 is assigned according to the order of Audio Substream s in the ChannelGroup .
    - demixing_matrix for ambisonics_mode = 1 is assigned according to the order of Audio Substream s in the ChannelGroup .
Codec Enc. outputs Audio Substream s as many as the number of channels which is indicated in substream_count .
Temporal Unit is composed of Audio Frame OBU s for the Audio Substream s.
- It may have the immediately preceding Temporal Delimiter OBU .
- The order of Audio Substream s in the ChannelGroup is aligned with channel_mapping for ambisonics_mode = 0 or demixing_matrix for ambisonics_mode = 1.

8.2. Scalable Channel Audio Encoding

For Scalable Channel Audio encoding:

Pre-processor outputs N ChannelGroup s ( num_layers = N), Descriptors and Parameter Substream s. It is composed of a Down-mix parameter generator, Down-mixer, Loudness, ChannelGroup generator, Attenuation, and Meta generator.
- For non-scalable channel audio (i.e. num_layers = 1):
  - Parameter Substream for recon gain is not generated.
  - Parameter Substream for demixing info may be generated by implementers who assume it to be recommended for dynamic downmixing on a decoder side.
  - Down-mixer, ChannelGroup generator, and Attenuation modules are not needed.
- Down-mix parameter generator generates 5 down-mix parameters (α(k), β(k), γ(k), δ(k) and w(k)) by analyzing input channel audio.
- Down-mixer generates down-mixed audio s according to the list of channel layouts and the down-mix parameters.
- Loudness module outputs the loudness level ( LKFS ) of each down-mixed audio based on [ITU1770-4] .
- ChannelGroup generator transforms the input channel audio to N ChannelGroup s for scalable channel audio with num_layers = N scalability by using the down-mix parameters and the list of channel layouts.
- The Attenuation module applies a gain to the transformed ChannelGroup s to prevent clipping.
- Meta generator generates Descriptors and Parameter Substream s.
  - Descriptors are set as follows:
    - num_layers is set to N (i.e. the number of channel layouts).
    - channel_audio_layer_config() is set as follows:
      - loudspeaker_layout is set to the ith list of channel layouts for the ith ChannelGroup .
      - output_gain_is_present_flag is set to 1 for the ith ChannelGroup if attenuation is applied to the mixed channels of the ith ChannelGroup . Otherwise, it is set to 0 for the ith ChannelGroup .
      - recon_gain_is_present_flag is set to 1 for the ith ChannelGroup if the preceding ChannelGroup s have one or more mixed channels from the down-mixed audio for the ith channel layout. Otherwise, it is set to 0 for the ith ChannelGroup . In particular, when num_layers = 1, this flag is set to 0.
        
        This flag is set to 0 for lossless codecs including LPCM.
      - substream_count is set to the number of Audio Substream s composing the ith ChannelGroup .
      - coupled_substream_count is set to the number of coupled substreams among the Audio Substream s composing the ith ChannelGroup .
      - loudness is set to the loudness ( LKFS ) of the down-mixed audio for the ith channel layout.
      - Each bit of output_gain_flags is set to 1 for the ith ChannelGroup if attenuation is applied to the relevant channel of the ith ChannelGroup . Otherwise, it is set to 0 for the ith ChannelGroup .
      - output_gain is set to the gain (i.e. the inverse of attenuation gain) which is applied to the channels which are indicated by output_gain_flags .
  - Parameter Substream s can be composed of one for demixing info and the other for recon gain. When recon_gain_is_present_flag = 0 for all ChannelGroup s, no Parameter Block OBU s for recon gain info are present in the IA Sequence .
    - dmixp_mode of demixing_info_parameter_data() for the kth frame is set to indicate (α(k), β(k), γ(k), δ(k)) and w_idx_offset(k). Where w_idx_offset(k) = 1 or -1.
    - recon_gain_flags of recon_gain_info_parameter_data() is set to indicate the de-mixed channels, which need to apply recon_gain among the output channels after demixing for ith channel layout.
    - recon_gain is set to the gain value to be applied to the channel which is indicated by recon_gain_flags for the ith ChannelGroup .
Temporal Unit for the kth frame is composed of zero or more Parameter Block OBU s followed by Audio Frame OBU s for the kth frame.
- It may have the immediately preceding Temporal Delimiter OBU .
- ChannelGroup s in Temporal Unit are placed in order. In other words, ChannelGroup for the first channel layout comes first, followed by ChannelGroup for the second channel layout, followed by ChannelGroup for the third channel layout, and so on.

The figure below shows the IA encoding flowchart for Scalable Channel Audio.

For a given input channel audio and a given list of channel layouts for scalability, PCMs for the input channel audio are passed to the CG Generation module.
CG Generation module generates the transformed audio according to the CG generation rule based on the list of CLs and the down-mix parameters.
- The transformed audio is structured as ChannelGroup s.
Non-mixed channels of the transformed audio (i.e., the original channels of the input channel audio) are directly input to the Codec encoder, but the mixed channels may be input first to the Attenuation module and then to the Codec encoder.
The Attenuation module reduces all sample values of the mixed channels in the same ChannelGroup at a uniform rate ( output_gain ).
- A range of 0dB to -6dB is recommended for attenuation. (i.e. a range of 0dB to 6dB for output_gain )
Codec Enc. generates the coded Audio Substreams from PCMs and passes the coded Audio Substreams and one single decoder_config() to the OBU packetizer.
The OBU packetizer generates Descriptors, which consist of one IA Sequence Header OBU, one Codec Config OBU, one Audio Element OBU and one or more Mix Presentation OBUs.
- The Codec Config OBU is generated based on decoder_config().
The OBU packetizer generates Parameter Block OBUs for each frame, which contain demixing_info_parameter_data() and recon_gain_info_parameter_data().
The OBU packetizer generates Audio Frame OBUs for each frame of the Audio Substreams.
The OBU packetizer generates a Temporal Unit for each frame.
- A Temporal Unit consists of zero or more Parameter Block OBUs followed by Audio Frame OBUs.
  - It may be immediately preceded by a Temporal Delimiter OBU.
The OBU packetizer outputs an IA Sequence, which is composed of the OBUs for Descriptors followed by the OBUs for Temporal Units.

IA Encoding Flowchart for Scalable Channel Audio
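The Attenuation step described above can be sketched as follows. This is a minimal illustration, not the encoder's implementation: the function name `attenuate` is hypothetical, samples are assumed to be floating-point values, and the dB value is assumed to follow the usual amplitude convention (factor = 10^(dB/20)).

```python
def attenuate(mixed_channels, attenuation_db):
    """Scale every sample of every mixed channel by one linear factor.

    mixed_channels: list of channels, each a list of float samples
                    (all from the same ChannelGroup).
    attenuation_db: attenuation in dB; 0 dB to -6 dB is the recommended range.
    Assumption: dB maps to amplitude as 10 ** (dB / 20).
    """
    factor = 10.0 ** (attenuation_db / 20.0)
    return [[sample * factor for sample in channel] for channel in mixed_channels]


# Example: attenuate two mixed channels by 6 dB.
quieter = attenuate([[1.0, -0.5], [0.25, 0.0]], -6.0)
```

Because the same factor is used for all mixed channels of a ChannelGroup, a single gain value is enough for a decoder to undo the attenuation.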

8.3. Mix Presentation Encoding

For a Mix Presentation with one single channel-based Audio Element, the Mix Presentation OBU obeys the following restrictions:

num_sub_mixes : set to 1
num_audio_elements : set to 1
element_mix_config() : No Parameter Block OBU s for element_mix_config() and default_mix_gain = 0dB
output_mix_config() : No Parameter Block OBU s for output_mix_config() and default_mix_gain = 0dB
num_layouts : set to N, where N is the number of input channel layouts.
loudness_layout : set to L(1), L(2), ..., L(N).
- loudness_info() on L(1), loudness_info() on L(2), ..., loudness_info() on L(N): loudness information of the rendered audio to the measured layout L(i).
- Where L(i) is the measured layout for the ith layer and i = 1, 2, ..., N.

NOTE: If the input channel layouts do not include Stereo, then num_layouts is set to N + 1 and the loudness_layouts include Stereo.

For a Mix Presentation with one single scene-based Audio Element, the Mix Presentation OBU obeys the following restrictions:

num_sub_mixes : set to 1
num_audio_elements : set to 1
element_mix_config() : set to mix_gain
output_mix_config() : set to output_mix_gain
num_layouts : set to M1, the number of loudness information entries that are provided.
loudness_layout : set to L(1), L(2), ..., L(M1). One of them is Stereo.
loudness_info() on L(1), loudness_info() on L(2), ..., loudness_info() on L(M1): loudness information of the rendered audio to the measured layout L(i).
Where L(i) is the measured layout for the ith loudness information and i = 1, 2, ..., M1.
This Mix Presentation is authored by using the highest loudness_layout .

For a Mix Presentation with 2 Audio Elements, the Mix Presentation OBU obeys the following restrictions:

num_sub_mixes : set to 1
num_audio_elements : set to 2
element_mix_config() for each Audio Element : set to mix_gain
output_mix_config() : set to output_mix_gain
num_layouts : set to M2, the number of loudness information entries that are provided.
loudness_layout : set to L(1), L(2), ..., L(M2). One of them is Stereo.
loudness_info() on L(1), loudness_info() on L(2), ..., loudness_info() on L(M2): loudness information of the rendered audio to the measured layout L(i).
Where L(i) is the measured layout for the ith loudness information and i = 1, 2, ..., M2.
This Mix Presentation is authored by using the highest loudness_layout .

8.3.1. Element Mix Config

This section provides a guideline to generate element_mix_config() .

An IA multiplexer may merge two IA Sequences (or two Audio Elements). In this case, it adjusts the gain values for the element_mix_config()s as necessary to describe the desired relative gains between the IA Sequences (or the two Audio Elements) when they are summed to generate the final mix. It also ensures that the selected gains do not result in clipping when the final mix is generated.
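One conservative way to pick such gains is sketched below. It is only an illustration under stated assumptions: the function name and arguments are hypothetical, peaks are absolute sample peaks of the rendered elements on a scale where 1.0 is full scale, and clipping is avoided with a worst-case bound (both peaks coinciding with the same sign), which may be more cautious than necessary.

```python
def mix_gains_without_clipping(peak_a, peak_b, relative_gain_db, full_scale=1.0):
    """Return linear gains (gain_a, gain_b) for two elements.

    gain_b / gain_a equals the requested relative gain, and the worst-case
    peak of the sum (gain_a * peak_a + gain_b * peak_b) stays within
    full_scale. Both gains are lowered together when needed, so the relative
    gain between the two elements is preserved.
    """
    ratio = 10.0 ** (relative_gain_db / 20.0)
    gain_a, gain_b = 1.0, ratio
    worst_case_peak = gain_a * peak_a + gain_b * peak_b
    if worst_case_peak > full_scale:
        scale = full_scale / worst_case_peak
        gain_a *= scale
        gain_b *= scale
    return gain_a, gain_b


# Two full-scale elements mixed at equal level: both gains are halved.
gain_a, gain_b = mix_gains_without_clipping(1.0, 1.0, 0.0)
```

The resulting linear gains would still need to be converted to the representation used by element_mix_config(), which this sketch does not cover.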

8.4. Two Audio Elements Encoding

This section provides a way to generate an IA Sequence having two Audio Elements from two IA Sequences of the IA simple profile.

8.4.1. Two Audio Elements with One Codec Config

This section provides a way to generate an IA Sequence having two Audio Elements from two IA Sequences of the IA simple profile that use the same Codec Config OBU. Note that the result complies with the base profile of IA Sequence.

Step 1: Descriptors are generated as follows:

IA Sequence Header OBU: take the larger primary_profile field and the larger additional_profile field, respectively.
Codec Config OBU: take the Codec Config OBU of one of the IA Sequences.
Two Audio Element OBUs: take both of them and make the following modifications:
- codec_config_id in each Audio Element OBU is updated to indicate the codec_config_id specified in the taken Codec Config OBU.
- Each audio_element_id is updated to be unique between the two Audio Element OBUs.
- Each audio_substream_id is updated to be unique between the two Audio Element OBUs.
- parameter_ids in the ParamDefinition()s carried in each Audio Element OBU are updated to refer to their associated Parameter Substreams, respectively.
Mix Presentation OBUs: generate new ones which are used for mixing the two Audio Elements. Make the following modifications:
- audio_element_ids in each Mix Presentation OBU are updated to indicate the audio_element_ids specified in the Audio Element OBUs, respectively.
- parameter_ids in the ParamDefinition()s carried in each Mix Presentation OBU are updated to refer to their associated Parameter Substreams, respectively.

Step 2: ith Temporal Unit is generated as follows:

Place the Parameter Block OBUs in the ith Temporal Unit from each Audio Element, followed by the Audio Frame OBUs in the ith Temporal Unit from each Audio Element, with the following modifications:
- obu_types are updated to be aligned with the audio_substream_ids specified in the two Audio Element OBUs.
- parameter_ids in Parameter Block OBUs are updated to identify the Parameter Substream in the IA Sequence, based on the parameter_ids in the ParamDefinition()s carried in the Descriptors.
Each Temporal Unit may be immediately preceded by a Temporal Delimiter OBU.

Step 3: Generate an IA Sequence which starts with Descriptors and is followed by Temporal Unit s in order.
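The ID updates in Step 1 and Step 2 can be sketched as a renumbering pass. This is a simplified illustration: Audio Elements are modeled as plain dictionaries rather than parsed OBUs, the function name and the dictionary keys are hypothetical, and sequential renumbering is just one of many ways to make the IDs unique.

```python
from itertools import count


def merge_audio_elements(elements, codec_config_id):
    """Renumber IDs so they are unique across the merged Audio Elements.

    elements: list of dicts with "audio_substream_ids" and "parameter_ids".
    Returns new dicts that also carry old-to-new maps, which can be used to
    rewrite the Audio Frame OBUs and Parameter Block OBUs of each Temporal Unit.
    """
    next_element, next_substream, next_parameter = count(), count(), count()
    merged = []
    for element in elements:
        substream_map = {old: next(next_substream)
                         for old in element["audio_substream_ids"]}
        parameter_map = {old: next(next_parameter)
                         for old in element["parameter_ids"]}
        merged.append({
            "audio_element_id": next(next_element),
            "codec_config_id": codec_config_id,  # the single shared Codec Config
            "audio_substream_ids": list(substream_map.values()),
            "parameter_ids": list(parameter_map.values()),
            "substream_map": substream_map,
            "parameter_map": parameter_map,
        })
    return merged


first = {"audio_substream_ids": [0, 1], "parameter_ids": [0]}
second = {"audio_substream_ids": [0], "parameter_ids": []}
merged = merge_audio_elements([first, second], codec_config_id=0)
```

In this example the second element's only substream, originally numbered 0, is renumbered to 2 so that it no longer collides with the first element's substreams.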

8.5. Post Processing

This section provides a way to generate algorithms for post-processing.

8.5.1. Loudness Information

This section provides a way to generate loudness_info() .

For a given Mix Presentation OBU and a given loudness_layout, the following steps are processed in order to produce loudness_info().

Each of the Audio Elements specified in the given Mix Presentation OBU is rendered to the given loudness_layout according to the rendering_config() for the Audio Element.
mix_gain is applied to each of the rendered Audio Elements according to the element_mix_config() for the Audio Element.
All of the rendered Audio Elements are summed, and then mix_gain is applied to the sum according to output_mix_config().
loudness_info() of the mixed audio is generated according to § 3.7.7 Loudness Info Syntax and Semantics.
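The gain and summing steps above can be sketched as follows. The sketch assumes the Audio Elements have already been rendered to the same loudness_layout (so they have the same number of channels and samples), that gains are given in dB and map to amplitude as 10^(dB/20), and it stops before the loudness measurement itself. All names are hypothetical.

```python
def db_to_linear(gain_db):
    """Convert a gain in dB to a linear amplitude factor (assumed convention)."""
    return 10.0 ** (gain_db / 20.0)


def mix_rendered_elements(rendered, element_gains_db, output_gain_db):
    """Apply per-element gains, sum the elements, then apply the output gain.

    rendered: list of elements; each element is a list of channels and each
              channel is a list of float samples.
    Returns the mixed audio as a list of channels.
    """
    num_channels = len(rendered[0])
    num_samples = len(rendered[0][0])
    mixed = [[0.0] * num_samples for _ in range(num_channels)]
    for element, gain_db in zip(rendered, element_gains_db):
        gain = db_to_linear(gain_db)
        for ch in range(num_channels):
            for n in range(num_samples):
                mixed[ch][n] += gain * element[ch][n]
    output_gain = db_to_linear(output_gain_db)
    return [[output_gain * sample for sample in channel] for channel in mixed]


# Two stereo elements, unity gains everywhere.
element_a = [[0.1, 0.2], [0.0, 0.1]]
element_b = [[0.1, 0.0], [0.1, 0.1]]
mixed = mix_rendered_elements([element_a, element_b], [0.0, 0.0], 0.0)
```

The returned mix is what a loudness meter would then measure to fill in loudness_info().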

9. Convention

9.1. Syntax Description

All syntax elements conform to the Syntactic Description Language specified in [MP4-Systems] unless it is explicitly described in the specification.

9.1.1. Data types

leb128() syntaxName

leb128() indicates the type of an unsigned integer. To encode the unsigned integer syntaxName, first represent the integer in binary with an N-bit representation, where N is a multiple of 7. Then break the integer up into groups of 7 bits. Output one encoded byte for each 7-bit group, from the least significant to the most significant group. Each byte carries its group in its 7 least significant bits. Set the most significant bit of each byte except the last byte.

syntaxName is an unsigned integer which is encoded by leb128() . Its size is limited to 32 bits.
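The encoding procedure can be sketched as below. The function names are hypothetical. The encoder shown always emits the shortest encoding, which is one valid choice of N; the description above also permits longer representations padded with zero groups, which the decoder shown here accepts as well.

```python
def leb128_encode(value):
    """Encode an unsigned integer (at most 32 bits) as leb128 bytes."""
    if not 0 <= value < 2 ** 32:
        raise ValueError("value must fit in 32 bits")
    out = bytearray()
    while True:
        group = value & 0x7F          # next 7-bit group, least significant first
        value >>= 7
        if value:
            out.append(group | 0x80)  # more groups follow: set the top bit
        else:
            out.append(group)         # last byte: top bit stays clear
            return bytes(out)


def leb128_decode(data):
    """Decode one leb128 value; return (value, number_of_bytes_consumed)."""
    value = 0
    for index, byte in enumerate(data):
        value |= (byte & 0x7F) << (7 * index)
        if not byte & 0x80:
            return value, index + 1
    raise ValueError("truncated leb128 value")


encoded = leb128_encode(300)
decoded, size = leb128_decode(encoded)
```

For example, 300 is 0b10_0101100 in binary, so the low group 0101100 is emitted first with the top bit set, followed by the group 10 with the top bit clear.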

string syntaxName

string indicates a null-terminated string (i.e. ending at the first byte set to 0x00), UTF-8 encoded as defined in [RFC3629], whose length SHALL be limited to 128 bytes.

syntaxName is a human readable label.

9.2. Arithmetic Operators

+	Addition.
-	Subtraction.
*	Multiplication.
÷	Floating point (arithmetic) division.
/	Integer division with truncation of the result toward zero.
floor(x)	The largest integer that is smaller than or equal to x.
sqrt(x)	The square root of x.
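The difference between the two division operators and floor() matters for negative operands. The sketch below is only an illustration in Python, whose own `//` operator rounds toward negative infinity rather than toward zero; the helper name `int_div_trunc` is hypothetical.

```python
import math


def int_div_trunc(a, b):
    """Integer division with truncation toward zero (the "/" operator above)."""
    quotient = abs(a) // abs(b)
    same_sign = (a >= 0) == (b > 0)
    return quotient if same_sign else -quotient


float_quotient = -7 / 2            # "÷": floating point division
truncated = int_div_trunc(-7, 2)   # "/": truncates toward zero
floored = math.floor(-7 / 2)       # floor(x): largest integer <= x
```

Here the truncated quotient is -3 while the floored value is -4, so the two must not be used interchangeably when operands can be negative.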

9.3. Function

9.3.1. Function templates

When the template keyword is used to decorate a class declaration, it indicates that the code is a template with a placeholder type that can be reused by other classes. Only classes that use the template are present in the bitstream; the template itself is not present in the bitstream. Classes that use a function template pass a data type that is specified in either [MP4-Systems] or § 9.1.1 Data types.

Example

template <class T>
class Foo {
  T t;
}
class Bar {
  Foo<int> f;
}

9.3.2. Mathematical functions

Clip3(x, y, z)

It conforms to Clip3 specified in [AV1-Convention] .

round(x)

The round() function returns the integer value closest to x and may be implemented as

round(x) = floor(x + 0.5).

roundup(x)

The roundup() function returns the smallest integer value greater than or equal to x.

MOD(Number, Divisor)

The MOD() function returns the remainder after Number is divided by Divisor .

pow(x, y)

The pow() function returns the value of x to the power of y.
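These functions can be sketched in a few lines. The names `clip3`, `spec_round`, `roundup` and `mod` are hypothetical Python spellings. Two assumptions are made: Clip3(x, y, z) clamps z to the range [x, y], and MOD() is only exercised with non-negative operands, since the text above does not state the sign of the remainder for negative ones.

```python
import math


def clip3(x, y, z):
    """Clamp z to the inclusive range [x, y]."""
    if z < x:
        return x
    if z > y:
        return y
    return z


def spec_round(x):
    """round(x) implemented as floor(x + 0.5), as suggested above."""
    return math.floor(x + 0.5)


def roundup(x):
    """Smallest integer greater than or equal to x."""
    return math.ceil(x)


def mod(number, divisor):
    """Remainder of number divided by divisor (non-negative operands assumed)."""
    return number % divisor


halfway = spec_round(2.5)
```

Note that floor(x + 0.5) rounds halfway cases upward, which differs from Python's built-in round(), where round(2.5) gives 2.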

10. Annex

10.1. Annex A: ID Linking Scheme (Informative)

The figure below shows the linking scheme among IDs in the obu_header or OBU payload.

ID Linking Scheme

In the above figure,

Codec Config OBU with codec_config_id = 0 is providing codec_id and its decoder_config() .
Mix Presentation OBU with mix_presentation_id = 21 is saying:
- There are two Audio Element s( audio_element_id = 11 and 12) which need to be mixed. The audio_element_id = 11 and the audio_element_id = 12 are linked to the Audio Element OBU s with audio_element_id = 11 and audio_element_id = 12, respectively.
  - There are Parameter Block OBU s with parameter_id = 32 to be used for mixing of the Audio Element with audio_element_id = 11.
  - There are Parameter Block OBU s with parameter_id = 33 to be used for mixing of the Audio Element with audio_element_id = 12.
- There are Parameter Block OBU s with parameter_id = 34 to be used for mixing of the two Audio Element s.
Audio Element OBU with audio_element_id = 11 is saying:
- This Audio Element has been coded using Codec Config OBU with codec_config_id = 0.
- There are two Audio Substream s ( audio_substream_id = 0 and 1) in this Audio Element . The audio_substream_id = 0 and the audio_substream_id = 1 are linked to the Audio Frame OBU s with audio_substream_id = 0 and audio_substream_id = 1(i.e. obu_type = OBU_IA_Audio_Frame_ID0 and obu_type = OBU_IA_Audio_Frame_ID1), respectively.
- There are Parameter Block OBU s with parameter_id = 31 to be used for demixing of this Audio Element .
Audio Element OBU with audio_element_id = 12 is saying:
- This Audio Element has been coded by using Codec Config OBU with codec_config_id = 0.
- There is one Audio Substream ( audio_substream_id = 2) in this Audio Element . The audio_substream_id = 2 is linked to the Audio Frame OBU s with audio_substream_id = 2 (i.e. obu_type = OBU_IA_Audio_Frame_ID2).
Audio Frame OBU with audio_substream_id = 0 (i.e. obu_type = OBU_IA_Audio_Frame_ID0) is providing the coded data of the Audio Substream with audio_substream_id = 0, which has been coded by using the Codec Config OBU with codec_config_id = 0.
Audio Frame OBU with audio_substream_id = 1 (i.e. obu_type = OBU_IA_Audio_Frame_ID1) is providing the coded data of the Audio Substream with audio_substream_id = 1, which has been coded by using the Codec Config OBU with codec_config_id = 0.
Audio Frame OBU with audio_substream_id = 2 (i.e. obu_type = OBU_IA_Audio_Frame_ID2) is providing the coded data of the Audio Substream with audio_substream_id = 2, which has been coded by using the Codec Config OBU with codec_config_id = 0.
Parameter Block OBU with parameter_id = 31 is providing demixing_info_parameter_data() to be applied for demixing of the Audio Element with audio_element_id = 11.
Parameter Block OBU with parameter_id = 32 is providing mix_gain_parameter_data() to be applied to the rendered Audio Element after rendering according to rendering_config() of the Audio Element with audio_element_id = 11.
Parameter Block OBU with parameter_id = 33 is providing mix_gain_parameter_data() to be applied to the rendered Audio Element after rendering according to rendering_config() of the Audio Element with audio_element_id = 12.
Parameter Block OBU with parameter_id = 34 is providing mix_gain_parameter_data() to be applied to the Rendered Mix Presentation of the two rendered Audio Element s.
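The links in this example can be written down as plain data and checked mechanically. The sketch below is illustrative only: the dictionaries are a hypothetical in-memory model (not OBU syntax), and `find_broken_links` checks just a few of the relationships described above.

```python
CODEC_CONFIG_IDS = {0}

AUDIO_ELEMENTS = {
    11: {"codec_config_id": 0, "audio_substream_ids": [0, 1], "parameter_ids": [31]},
    12: {"codec_config_id": 0, "audio_substream_ids": [2], "parameter_ids": []},
}

MIX_PRESENTATIONS = {
    21: {
        "audio_element_ids": [11, 12],
        "element_mix_parameter_ids": {11: 32, 12: 33},
        "output_mix_parameter_id": 34,
    },
}


def find_broken_links(codec_config_ids, audio_elements, mix_presentations):
    """Return a list of human-readable problems; empty when all links resolve."""
    problems = []
    seen_substreams = set()
    for element_id, element in audio_elements.items():
        if element["codec_config_id"] not in codec_config_ids:
            problems.append(f"audio element {element_id}: unknown codec_config_id")
        for substream_id in element["audio_substream_ids"]:
            if substream_id in seen_substreams:
                problems.append(f"audio_substream_id {substream_id} is reused")
            seen_substreams.add(substream_id)
    for mix_id, mix in mix_presentations.items():
        for element_id in mix["audio_element_ids"]:
            if element_id not in audio_elements:
                problems.append(
                    f"mix presentation {mix_id}: unknown audio_element_id {element_id}")
    return problems


problems = find_broken_links(CODEC_CONFIG_IDS, AUDIO_ELEMENTS, MIX_PRESENTATIONS)
```

For the example in the figure every link resolves, so the returned list is empty.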

10.2. Annex B: Rules for Scalable Channel Audio (Normative)

This Annex specifies normative rules for scalable channel audio with num_layers > 1.

10.2.1. Annex B-1: Down-mix parameter and Loudness

This section describes how to generate down-mix parameters and loudness levels for a given channel audio and a given list of channel layouts for scalability (i.e. num_layers > 1).

The figure below shows a block diagram for the down-mix parameter and loudness module including the down-mixer.

IA Down-mix Parameter and Loudness

For a given channel-based input audio (e.g. 7.1.4ch) and a given list of channel layouts based on the input audio,

Down-mix parameter generator SHALL generate 5 down-mix parameters (α(k), β(k), γ(k), δ(k) and w(k), where k is the frame index) by analyzing the input audio and referring to [AI-CAD-Mixing] .
- It is composed of an Audio Scene Classification module and a Height Energy Quantification module as depicted in Figure 11-2.
- Audio Scene Classification module generates 4 parameters (α(k), β(k), γ(k), δ(k)) by classifying audio scenes of the input audio in three modes.
  - Default scene: Neither Dialog nor Effect
  - Dialog scene: Center-channel oriented and clear dialog/voice sounds
  - Effect scene: Directional and spatially moving sounds.
- The Height Energy Quantification module generates a surround-to-height mixing parameter (w(k)) which is decided according to the relative energy difference between the top and surround channels of the input audio.
  - If the energy of the top channels is greater than that of the surround channels, then w_idx_offset(k) is set to 1. Otherwise, it is set to -1. w(k) is then calculated based on w_idx_offset(k) and conforms to § 7.2 Scalable Channel Audio decoding.
Down-mixer generates down-mixed audio from the input audio according to the list of channel layouts and the down-mix parameters, and outputs down-mixed audio for each channel layout to the Loudness module.
- It is not depicted in the figure but Down-mixer further generates dmixp_mode and recon_gain for each frame to be passed to the OBU packetizer.
Loudness module measures the loudness level ( LKFS ) of each down-mixed audio based on [ITU1770-4] , and passes them to OBU packetizer.
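The decision made by the Height Energy Quantification module can be sketched as follows. The text above does not define how energy is measured, so this sketch assumes the energy of a group of channels is the sum of squared samples over the frame; the function names are hypothetical, and the derivation of w(k) from the offset is not shown.

```python
def frame_energy(channels):
    """Sum of squared samples over all channels of one frame (assumed measure)."""
    return sum(sample * sample for channel in channels for sample in channel)


def w_idx_offset(top_channels, surround_channels):
    """Return 1 when the top channels carry more energy, otherwise -1."""
    if frame_energy(top_channels) > frame_energy(surround_channels):
        return 1
    return -1


# One frame with two top channels and two surround channels.
offset = w_idx_offset([[0.5, 0.5], [0.4, 0.4]], [[0.1, 0.1], [0.1, 0.0]])
```

When the two energies are equal, the sketch returns -1, matching the "Otherwise" branch of the rule above.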

10.2.2. Annex B-2: Down-mix Mechanism

This section specifies the down-mixing mechanism to generate down-mixed audio for scalable channel audio.

For a given channel-based input audio that conforms to loudspeaker_layout, the surround channels and the top channels (if any) are down-mixed separately, step by step, until the target channels are obtained.

Implementers MAY use another method to get the down-mixed audio from the given input audio, but the resulting down-mixed audio SHALL comply with the down-mixed audio produced by the mechanism specified in this section.

Therefore, a down-mixer based on the down-mix mechanism is a combination of the following surround down-mixer(s) and top down-mixer(s) as depicted in the figure below.

Surround down-mixers
- S7to5 enc. : Ls5 = α(k) x Lss7 + β(k) x Lrs7 and Rs5 = α(k) x Rss7 + β(k) x Rrs7.
- S5to3 enc. : L3 = L5 + δ(k) x Ls5 and R3 = R5 + δ(k) x Rs5
- S3to2 enc. : L2 = L3 + 0.707 x C and R2 = R3 + 0.707 x C
- S2to1 enc. : Mono = 0.5 x (L2 + R2)
Top Down-mixers
- T4to2 enc. : Ltf2 = Ltf4 + γ(k) x Ltb4 and Rtf2 = Rtf4 + γ(k) x Rtb4.
- T2toTF2 enc. : Ltf3 = Ltf2 + w(k) x δ(k) x Ls5 and Rtf3 = Rtf2 + w(k) x δ(k) x Rs5.

IA Down-mix Mechanism

For example, to get down-mixed audio 3.1.2ch from 7.1.4ch:

S3 of 3.1.2ch is generated by using S7to5 enc. and S5to3 enc. .
TF2 of 3.1.2ch is generated by using T4to2 enc. and T2toTF2 enc. .
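The down-mixers listed above translate directly into weighted sums. The sketch below is a plain-Python illustration for one frame, with each channel given as a list of float samples; the function names are hypothetical, and the parameter values in the usage example are arbitrary placeholders rather than values taken from this specification.

```python
def weighted_sum(*terms):
    """Sum gain * samples over (gain, samples) pairs of equal length."""
    length = len(terms[0][1])
    return [sum(gain * samples[n] for gain, samples in terms) for n in range(length)]


def s7to5(lss7, lrs7, rss7, rrs7, alpha, beta):
    ls5 = weighted_sum((alpha, lss7), (beta, lrs7))
    rs5 = weighted_sum((alpha, rss7), (beta, rrs7))
    return ls5, rs5


def s5to3(l5, r5, ls5, rs5, delta):
    return weighted_sum((1.0, l5), (delta, ls5)), weighted_sum((1.0, r5), (delta, rs5))


def s3to2(l3, r3, c):
    return weighted_sum((1.0, l3), (0.707, c)), weighted_sum((1.0, r3), (0.707, c))


def s2to1(l2, r2):
    return weighted_sum((0.5, l2), (0.5, r2))


def t4to2(ltf4, rtf4, ltb4, rtb4, gamma):
    return weighted_sum((1.0, ltf4), (gamma, ltb4)), weighted_sum((1.0, rtf4), (gamma, rtb4))


def t2totf2(ltf2, rtf2, ls5, rs5, w, delta):
    ltf3 = weighted_sum((1.0, ltf2), (w * delta, ls5))
    rtf3 = weighted_sum((1.0, rtf2), (w * delta, rs5))
    return ltf3, rtf3


# 7.1.4ch to 3.1.2ch for a one-sample frame, with placeholder parameters.
alpha, beta, gamma, delta, w = 1.0, 0.5, 0.5, 0.5, 0.25
ls5, rs5 = s7to5([0.2], [0.1], [0.2], [0.1], alpha, beta)
l3, r3 = s5to3([0.3], [0.3], ls5, rs5, delta)
ltf2, rtf2 = t4to2([0.1], [0.1], [0.2], [0.2], gamma)
ltf3, rtf3 = t2totf2(ltf2, rtf2, ls5, rs5, w, delta)
```

The usage lines follow the 7.1.4ch to 3.1.2ch example: the surround path chains S7to5 and S5to3, and the top path chains T4to2 and T2toTF2, reusing the Ls5 and Rs5 signals from the surround path.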

10.2.3. Annex B-3: Channel Layout Generation Rule

This section describes the generation rule for channel layouts for scalable channel audio.

For a given channel layout (CL #n) of channel-based input audio, any list of CLs ({CL #i: i = 1, 2, ..., n}) for scalable channel audio SHALL conform with the following rules:

S(i) ≤ S(i+1), W(i) ≤ W(i+1) and T(i) ≤ T(i+1), excluding the case where S(i) = S(i+1), W(i) = W(i+1) and T(i) = T(i+1), for i = n-1, n-2, …, 1, where the ith channel layout CL #i = S(i).W(i).T(i).
CL #i is one of loudspeaker_layout s supported in this version of the specification.

Only down-mix paths that conform to the above rules SHALL be allowed for scalable channel audio with num_layers > 1, as depicted in the figure below.

IA Down-mix Path
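The first rule can be checked mechanically, as in the sketch below. Each channel layout is modeled as a hypothetical (S, W, T) tuple, listed from CL #1 to CL #n. The second rule (each CL being a supported loudspeaker_layout) is not checked here, because the list of supported layouts is defined elsewhere in this specification.

```python
def is_valid_layout_list(layouts):
    """Check the ordering rule for consecutive channel layouts.

    layouts: list of (S, W, T) tuples ordered from CL #1 to CL #n.
    Every component must be non-decreasing from one layout to the next,
    and two consecutive layouts must not be identical.
    """
    for lower, higher in zip(layouts, layouts[1:]):
        if any(a > b for a, b in zip(lower, higher)):
            return False
        if lower == higher:
            return False
    return True


# 2ch / 3.1.2ch / 5.1.2ch / 7.1.4ch
valid = is_valid_layout_list([(2, 0, 0), (3, 1, 2), (5, 1, 2), (7, 1, 4)])
```

A list such as 5.1.4ch followed by 7.1.2ch fails the check, because the number of top channels would decrease.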

10.2.4. Annex B-4: Recon Gain Generation

This section RECOMMENDs how to generate recon_gain .

NOTE: Recon gain generation is not required when the codec is lossless, i.e., when codec_id is set to ipcm or fLaC .

Recon gain needs to be applied to de-mixed channels. For this, the IA encoder needs to deliver it to IA decoders.

Define the following:

Level Ok is the signal power for the frame #k of a channel of the down-mixed audio for CL #i.
Level Mk is the signal power for the frame #k of the relevant mixed channel of the down-mixed audio for CL #i-1.
Level Dk is the signal power for the frame #k of the de-mixed channel for CL #i (after demixing in the decoder side).

If 10*log10(level Ok / maxL^2) is less than the first threshold value (-80dB is RECOMMENDED), Recon_Gain (k, i) = 0, where maxL = 32767 for 16 bits.

Otherwise, if 10*log10(level Ok / level Mk) is less than the second threshold value (-6dB is RECOMMENDED), Recon_Gain (k, i) is set to the value which makes level Ok = Recon_Gain (k, i)^2 x level Dk. In all other cases, Recon_Gain (k, i) = 1. The actual value (i.e. recon_gain) to be delivered is floor(255*Recon_Gain (k, i)).
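The two threshold tests can be sketched as follows. This is an illustration under several assumptions that are not stated above: a non-positive level Ok is treated as falling below the first threshold, a zero level Mk or level Dk falls back to a gain of 1, and the computed gain is capped at 1 so that the delivered value stays within 0 to 255. The function name is hypothetical.

```python
import math


def recon_gain(level_o, level_m, level_d, max_level=32767.0,
               first_threshold_db=-80.0, second_threshold_db=-6.0):
    """Return (Recon_Gain, delivered recon_gain value) for one frame and channel.

    level_o, level_m, level_d are the signal powers Ok, Mk and Dk.
    """
    if level_o <= 0.0 or 10.0 * math.log10(level_o / max_level ** 2) < first_threshold_db:
        gain = 0.0  # original channel is effectively silent
    elif (level_m > 0.0 and level_d > 0.0
          and 10.0 * math.log10(level_o / level_m) < second_threshold_db):
        # Solve level_o = gain^2 * level_d; the cap at 1.0 is an assumption.
        gain = min(1.0, math.sqrt(level_o / level_d))
    else:
        gain = 1.0
    return gain, math.floor(255 * gain)


# Ok is 10 dB below Mk, and the de-mixed channel has 4 times the power of Ok.
gain, delivered = recon_gain(level_o=1.0e5, level_m=1.0e6, level_d=4.0e5)
```

In this example the gain that restores the original power is sqrt(1/4) = 0.5, which is delivered as floor(127.5) = 127.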

For example, if we assume CL #i = 7.1.4ch and CL #i-1 = 5.1.2ch, then de-mixed channels are D_Lrs7, D_Rrs7, D_Ltb4 and D_Rtb4.

D_Lrs7 and D_Rrs7 are de-mixed from Ls5 and Rs5 in the (i-1)th ChannelGroup by using Lss7 and Rss7 in the ith ChannelGroup and its relevant demixing parameters (i.e., α(k) and β(k)) , respectively.
D_Ltb4 and D_Rtb4 are de-mixed from Ltf2 and Rtf2 in the (i-1)th ChannelGroup by using Ltf4 and Rtf4 in the ith ChannelGroup and its relevant demixing parameter (i.e., γ(k)), respectively.

Recon_Gain for D_Lrs7:

Level Ok is the signal power for the frame #k of Lrs7 in the ith ChannelGroup .
Level Mk is the signal power for the frame #k of Ls5 in the (i-1)th ChannelGroup .
Level Dk is the signal power for the frame #k of D_Lrs7.

Recon_Gain for D_Rrs7:

Level Ok is the signal power for the frame #k of Rrs7 in the ith ChannelGroup .
Level Mk is the signal power for the frame #k of Rs5 in the (i-1)th ChannelGroup .
Level Dk is the signal power for the frame #k of D_Rrs7.

Recon_Gain for D_Ltb4:

Level Ok is the signal power for the frame #k of Ltf4 in the ith ChannelGroup .
Level Mk is the signal power for the frame #k of Ltf2 in the (i-1)th ChannelGroup .
Level Dk is the signal power for the frame #k of D_Ltb4.

Recon_Gain for D_Rtb4:

Level Ok is the signal power for the frame #k of Rtf4 in the ith ChannelGroup .
Level Mk is the signal power for the frame #k of Rtf2 in the (i-1)th ChannelGroup .
Level Dk is the signal power for the frame #k of D_Rtb4.

10.2.5. Annex B-5: ChannelGroup Generation Rule

This section describes the generation rule for ChannelGroup .

For a given channel-based input audio and the list of CLs ({CL #i: i = 1, 2, ..., n}), the CG Generation module outputs the transformed audio (i.e. ChannelGroups), which SHALL conform to the following rules:

It consists of C channels and is structured into n ChannelGroups, where C is the number of channels of the input audio.
ChannelGroup #1 (called BCG): This ChannelGroup is the down-mixed audio itself for CL #1 generated from the input audio. It contains C(1) channels.
ChannelGroup #i (called DCG, i = 2, 3, …, n): This ChannelGroup contains (C(i) - C(i-1)) channels, which consist of the following:
- (S(i) - S(i-1)) surround channel(s) if S(i) > S(i-1). When S_set = { x | S(i-1) < x ≤ S(i) and x is an integer},
  - If 2 is an element of S_set, the L2 channel is contained in this CG #i.
  - If 3 is an element of S_set, the Center channel is contained in this CG #i.
  - If 5 is an element of S_set, the L5 and R5 channels are contained in this CG #i.
  - If 7 is an element of S_set, the Lss7 and Rss7 channels are contained in this CG #i.
- The LFE channel if W(i) > W(i-1).
- (T(i) - T(i-1)) top channels if T(i) > T(i-1).
  - If T(i-1) = 0, the top channels of the down-mixed audio for CL #i are contained in this ChannelGroup #i.
  - If T(i-1) = 2, the Ltf and Rtf channels of the down-mixed audio for CL #i are contained in this ChannelGroup #i.

The figure below shows one example of a transformation matrix with 4 CGs (2ch/3.1.2ch/5.1.2ch/7.1.4ch).

Example of Transformation Matrix with 4 CGs
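The generation rules can be sketched for the 4-CG example as follows. Layouts are modeled as hypothetical (S, W, T) tuples. The labels for the first ChannelGroup and for top channels added when the previous layout has none are generic placeholders, since their exact names depend on the layout; only the surround, LFE, Ltf and Rtf labels follow the rules above.

```python
SURROUND_ADDITIONS = {2: ["L2"], 3: ["Center"], 5: ["L5", "R5"], 7: ["Lss7", "Rss7"]}


def channel_groups(layouts):
    """Return one list of channel labels per ChannelGroup.

    layouts: list of (S, W, T) tuples ordered from CL #1 to CL #n.
    """
    s1, w1, t1 = layouts[0]
    # ChannelGroup #1 is the whole down-mixed audio for CL #1 (placeholder labels).
    groups = [[f"CL1_ch{n}" for n in range(s1 + w1 + t1)]]
    for (prev_s, prev_w, prev_t), (s, w, t) in zip(layouts, layouts[1:]):
        group = []
        for x in range(prev_s + 1, s + 1):   # x runs over S_set
            group += SURROUND_ADDITIONS.get(x, [])
        if w > prev_w:
            group.append("LFE")
        if t > prev_t:
            if prev_t == 0:
                group += [f"top{n}" for n in range(t)]  # all top channels of CL #i
            elif prev_t == 2:
                group += ["Ltf", "Rtf"]
        groups.append(group)
    return groups


# 2ch / 3.1.2ch / 5.1.2ch / 7.1.4ch
groups = channel_groups([(2, 0, 0), (3, 1, 2), (5, 1, 2), (7, 1, 4)])
```

For this list the four groups hold 2, 4, 2 and 4 channels, which adds up to the 12 channels of the 7.1.4ch input.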

10.3. Annex C: Consumption of IAMF bitstream (informative)

TODO. Fill in example workflows.

Immersive Audio Model and Formats

AOM Working Group Draft, 10 July 2023
