Immersive Audio Model and Formats

1. Convention

1.1. Syntax Description

All syntax elements shall conform to the Syntactic Description Language specified in [MP4-Systems] unless explicitly described otherwise in this specification.

1.1.1. Data types

leb128() syntaxName

leb128() indicates the type of an unsigned integer. It indicates the following unsigned integer syntaxName shall be encoded by leb128() specified in [AV1-Convention].

syntaxName is an unsigned integer which is encoded by leb128() specified in [AV1-Convention].
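As an illustration of this data type, the following Python sketch encodes and decodes an unsigned integer as little-endian groups of 7 bits, with the most significant bit of each byte signalling continuation, which is how leb128() is laid out in [AV1-Convention]. The function names encode_leb128 and decode_leb128 are illustrative only and are not defined by this specification.

```python
def encode_leb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128 bytes."""
    if value < 0:
        raise ValueError("leb128() only carries unsigned integers")
    out = bytearray()
    while True:
        low_bits = value & 0x7F          # next 7 bits, least significant first
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # more bytes follow
        else:
            out.append(low_bits)         # last byte, continuation bit clear
            return bytes(out)


def decode_leb128(data: bytes, pos: int = 0) -> tuple:
    """Return (value, number_of_bytes_consumed) starting at data[pos]."""
    value = 0
    for i in range(8):                   # [AV1-Convention] reads at most 8 bytes
        byte = data[pos + i]
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return value, i + 1
    raise ValueError("leb128() value is longer than 8 bytes")
```

For instance, the value 300 is carried as the two bytes 0xAC 0x02.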

sleb128() syntaxName

sleb128() indicates the type of a signed integer. It indicates the following signed integer syntaxName shall be encoded by leb128() specified in [AV1-Convention].

syntaxName is a signed integer which is encoded by leb128() specified in [AV1-Convention].

string syntaxName

string indicates the type of a string which is terminated by a null of one byte (i.e. 0x00).

syntaxName is a human-readable label whose byte representation shall consist of a two-letter primary language subtag and a two-letter region subtag connected by a hyphen ("-"), followed by the byte representation of UTF-8_Enc(label).

The two-letter primary language subtag and the two-letter region subtag shall conform to [BCP47].
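The byte layout described above can be sketched as follows in Python. This is only an illustration under two assumptions that are not stated in this specification: that no separator byte sits between the five-byte language tag and the label, and that the helper names encode_label and decode_label are hypothetical.

```python
def encode_label(language: str, region: str, label: str) -> bytes:
    """Build '<ll>-<RR>' + UTF-8 label + 0x00 (assumed layout, no separator)."""
    if len(language) != 2 or len(region) != 2:
        raise ValueError("two-letter language and region subtags are expected")
    tag = f"{language}-{region}".encode("ascii")
    body = label.encode("utf-8")
    if b"\x00" in body:
        raise ValueError("the label itself must not contain a null byte")
    return tag + body + b"\x00"          # single null byte terminates the string


def decode_label(data: bytes, pos: int = 0) -> tuple:
    """Return (language_tag, label, bytes_consumed) for a string at data[pos]."""
    end = data.index(b"\x00", pos)       # position of the terminating null
    tag = data[pos:pos + 5].decode("ascii")
    label = data[pos + 5:end].decode("utf-8")
    return tag, label, end + 1 - pos
```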

1.2. Function

1.2.1. Function templates

When the template keyword is used to decorate the class declaration, it indicates that the code is a template with a placeholder type that can be reused by other classes. Only classes that use the template shall be present in the bitstream; the template itself shall not be present in the bitstream. Classes that use a function template shall pass a data type that is specified in either [MP4-Systems] or § 1.1.1 Data types.

Example

template <class T>
class Foo {
  T t;
}

class Bar {
  Foo<int> f;
}

1.2.2. Mathematical functions

Clip3(x, y, z)

It shall conform to Clip3 specified in [AV1-Convention].
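For reference, Clip3 in [AV1-Convention] limits the value z to the inclusive range from x to y. A minimal Python sketch follows; the lower-case name clip3 is illustrative only.

```python
def clip3(x, y, z):
    """Clamp z to the inclusive range [x, y]."""
    if z < x:
        return x   # below the lower bound
    if z > y:
        return y   # above the upper bound
    return z       # already inside the range
```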

1.2.3. Function UTF-8 Encoding

UTF-8_Enc(label)

UTF-8_Enc(label) is the byte representation of the encoded label, which is a null-terminated UTF-8 string as defined in [RFC3629].

2. Introduction

The IA sequence is a bitstream to represent immersive audio for presentation on a wide range of devices in both dynamic streaming and offline applications. These applications include internet audio streaming, multicasting/broadcasting services, file download, gaming, communication, virtual and augmented reality, and others. In these applications, audio may be played back on a wide range of devices, e.g. headsets, mobile phones, tablets, TVs, sound bars, home theater systems and big screens.

The bitstream comprises a number of coded audio substreams and the metadata that describes how to decode, render and mix the substreams to generate an audio signal for playback. The bitstream format itself is codec-agnostic; any supported audio codec may be used to code the audio substreams.

The immersive audio container (IAC) is the storage format for immersive audio (IA) sequence in one single [ISOBMFF] track.

The figure below shows the conceptual IAC architecture.

Conceptual IAC Architecture

For a given input 3D audio,

Pre-Processor generates Pre-Processed Audio and Codec Agnostic Metadata for immersive audio (IA).
Audio Codec Enc generates Codec-Dependent Bitstream, which consists of the coded streams, coded from Pre-Processed Audio.
File Packager generates IAC File by encapsulating IA sequence, which consists of Codec-Dependent Bitstream and Codec Agnostic Metadata, into [ISOBMFF] tracks.
File Parser reconstructs IA sequence by decapsulating IAC File.
Audio Codec Dec outputs a decoded Pre-Processed Audio after decoding of Codec-Dependent Bitstream.
Post-Processor outputs Immersive 3D Audio by using the decoded Pre-Processed Audio and Codec Agnostic Metadata.

The rest of this specification is formulated as follows:

§ 3 Overview describes the high level IA sequence architecture and introduces its components.
§ 4 Open Bitstream Unit (OBU) Syntax and Semantics specifies the syntax and semantics of the top level IA components and detailed IA components.
§ 5 Profiles specifies the profiles for IA sequences and IA decoders.
§ 6 Standalone IAC Representation specifies the representation of a standalone IA sequence.
§ 7 ISOBMFF IAC Encapsulation specifies the encapsulation of an IA sequence into [ISOBMFF] tracks.
§ 9 IAC processing specifies how the IA sequence should be decoded to generate the output immersive 3D audio.
§ 10 IAC Generation Process provides a guideline for generating the IA sequence.
§ 11 Consumption of IAC bitstream provides a guideline for consuming the IA sequence, for different use-cases.

3. Overview

3.1. IA sequence Components

The IA sequence includes one or more audio elements, each of which consists of one or more audio substreams. The IA sequence further includes mix presentations and parameters.

Audio substream is the actual audio signal, which may be encoded with any compatible audio codec.
Audio element is the 3D representation of the audio signals, and is constructed from one or more audio substreams and the metadata describing them. The audio substreams associated with one audio element use the same audio codec.
Mix presentations contain metadata that describe how the audio elements are rendered and mixed together for playback through physical loudspeakers or headphones. At any given time, only one mix presentation is used for playback. However, multiple mix presentations can be defined as alternatives to each other within the same IA sequence. Furthermore, the choice of which mix presentation to use at playback is left to the user. For example, multi-language support is implemented by defining different mix presentations, where the first mix describes the use of the audio element with English dialogue, and the second mix describes the use of the audio element with French dialogue.
Parameters are the values that are associated with the algorithms used for decoding, reconstructing, rendering and mixing. Parameters may change their values over time and may further be animated; for example, any changes in values may be smoothed over some time interval. The rate of change of a parameter is specific to its respective algorithm, and is independent of other algorithms and of the frame rates associated with the audio substreams. As such, a parameter may be viewed as a 1D signal that has different metadata specified for different time intervals.

The figure below shows the relationship between the audio substreams, audio elements and mix presentations and the processing flow to obtain the immersive audio playback.

Processing flow to decode, reconstruct, render and mix the audio signals for immersive audio playback.

3.2. Use of OBU Syntax

3.2.1. Descriptors

The descriptor OBUs contain all the information that is required to set up and configure the decoders, reconstruction algorithms, renderers and mixers.

Start Code OBU indicates the start of a full IA sequence description, version and profile version.
Codec Config OBU describes information to set up a decoder for an audio substream.
Audio Element OBU describes information to combine one or more audio substreams to reconstruct an audio element.
Mix Presentation OBU describes information to render and mix one or more audio elements to generate the final audio output.

3.2.2. Data

The data OBUs contain the actual time-varying data that is required in the generation of the final audio output.

The IA sequence supports the description of multiple audio substreams and algorithms, which may have different metadata update rates to each other. The update rate for the audio substreams and audio elements is governed by the frame rates of the audio codec used. Since a single bitstream may support multiple codecs, this may lead to multiple different frame rates. The algorithms for rendering and mixing may have parameters that update at different rates to each other and to the audio frame rates. Therefore, the IA sequence contains information to facilitate the synchronization of the different audio frames and parameters.

Audio Frame OBU provides the raw coded audio frame for an audio substream.
Parameter Block OBU provides the time-varying parameter values for an algorithm used in any of the decoding, reconstruction, rendering or mixing steps.
Sync OBU provides relative timestamp offsets to synchronize audio frames and parameter blocks.
Temporal Delimiter OBU identifies the temporal units.

The figure below shows the linking scheme between the obu_ids in the obu_header and the IDs in the OBU payload.

ID Linking Scheme

In the above figure,

The Codec Config OBU indicates that there are two audio elements (audio_element_id = 11 and 12) which are coded using the codec_config() in the OBU.
- The audio element having audio_element_id = 11 is linked to the Audio Element OBU having audio_element_id = 11.
  - The Audio Element OBU indicates that there are two substreams (substream_id = 31 and 32) which compose this audio element.
    - The audio substream having substream_id = 31 is linked to the Audio Frame OBUs having id = 31.
    - The audio substream having substream_id = 32 is linked to the Audio Frame OBUs having id = 32.
  - The Audio Element OBU indicates that there is one parameter block (parameter_id = 71) for demixing_info() which is applied to the audio element.
    - The parameter block having parameter_id = 71 is linked to the Parameter Block OBU having parameter_id = 71.
  - IAC decoders apply the parameter block to the audio substreams after they are decoded by the substream decoders.
- The audio element having audio_element_id = 12 is linked to the Audio Element OBU having audio_element_id = 12.
  - The Audio Element OBU indicates that there is one substream (substream_id = 33) which composes this audio element.
    - The audio substream having substream_id = 33 is linked to the Audio Frame OBUs having id = 33.
  - The substream decoder decodes the substream.
The Mix Presentation OBU indicates that there are two audio elements (audio_element_id = 11 and 12) which need to be mixed.
- The audio element having audio_element_id = 11 and the audio element having audio_element_id = 12 are mixed after each of them is decoded.
- IAC decoders may then apply loudness and DRC control by using mix_loudness_info() and drc_config().
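The ID links listed above can be followed programmatically. The Python sketch below uses the ID values from the list (11, 12, 31, 32, 33 and 71); the dictionary layout and the function names are hypothetical and are only meant to show how a parser might resolve which Audio Frame OBUs and Parameter Block OBUs a mix presentation depends on.

```python
# Audio Element OBUs from the example, keyed by audio_element_id.
audio_elements = {
    11: {"substream_ids": [31, 32], "parameter_ids": [71]},
    12: {"substream_ids": [33], "parameter_ids": []},
}

# audio_element_id values listed by the Mix Presentation OBU in the example.
mix_presentation_audio_element_ids = [11, 12]


def substreams_for_mix(element_ids, elements):
    """Collect the substream IDs whose Audio Frame OBUs feed this mix."""
    result = []
    for element_id in element_ids:
        result.extend(elements[element_id]["substream_ids"])
    return result


def parameters_for_mix(element_ids, elements):
    """Collect the parameter IDs whose Parameter Block OBUs feed this mix."""
    result = []
    for element_id in element_ids:
        result.extend(elements[element_id]["parameter_ids"])
    return result
```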

4. Open Bitstream Unit (OBU) Syntax and Semantics

4.1. Top Level OBU Syntax and Semantics

The IA sequence uses the OBU syntax.

This section specifies the top-level OBU syntax elements and their semantics.

4.1.1. Audio OBU Syntax and Semantics

Syntax

class audio_open_bitstream_unit() {
  obu_header();

  if (obu_type == OBU_IA_Start_Code)
    start_code_obu();
  else if (obu_type == OBU_IA_Codec_Config)
    codec_config_obu();
  else if (obu_type == OBU_IA_Audio_Element)
    audio_element_obu();
  else if (obu_type == OBU_IA_Mix_Presentation)
    mix_presentation_obu();
  else if (obu_type == OBU_IA_Parameter_Block)
    parameter_block_obu();
  else if (obu_type == OBU_IA_Temporal_Delimiter)
    temporal_delimiter_obu();
  else if (obu_type == OBU_IA_Sync)
    sync_obu();
  else if (obu_type == OBU_IA_Audio_Frame)
    audio_frame_obu_with_no_id();
  else if (obu_type >= 9 && obu_type <= 30)
    audio_frame_obu(obu_type - 9);
  else if (obu_type == 6 || obu_type == 7)
    reserved_obu();

  byte_alignment();
}

Semantics

If the syntax element obu_type is equal to OBU_IA_Start_Code, an ordered series of OBUs is presented to the decoding process as a string of bytes.

OBU data shall start on the first (most significant) bit and shall end on the last bit of the given bytes. The payload of an OBU shall lie between the first bit of the given bytes and the last bit before the first zero bit of the byte_alignment().

4.1.2. OBU Header Syntax and Semantics

Syntax

class obu_header() {
  unsigned int (5) obu_type;
  unsigned int (1) obu_redundant_copy;
  unsigned int (1) obu_trimming_status_flag;
  unsigned int (1) obu_extension_flag;
  leb128() obu_size;

  if (obu_trimming_status_flag) {
    leb128() num_samples_to_trim_at_end;
    leb128() num_samples_to_trim_at_start;
  }
  if (obu_extension_flag == 1)
    leb128() extension_header_size;
}

Semantics

OBUs are structured with a header and a payload.

obu_type specifies the type of data structure contained in the OBU payload.

obu_type: Name of obu_type
   0    : OBU_IA_Codec_Config
   1    : OBU_IA_Audio_Element
   2    : OBU_IA_Mix_Presentation
   3    : OBU_IA_Parameter_Block
   4    : OBU_IA_Temporal_Delimiter
   5    : OBU_IA_Sync
  6~7   : Reserved
   8    : OBU_IA_Audio_Frame
  9~30  : OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID21
   31   : OBU_IA_Start_Code

obu_redundant_copy indicates whether this OBU is a redundant copy of the previous OBU in the IA sequence with the same obu_type. A value of 1 shall indicate that it is a redundant copy, while a value of 0 shall indicate that it is not.

It shall always be set to 0 for the following obu_type values:

OBU_IA_Temporal_Delimiter
OBU_IA_Sync
OBU_IA_Audio_Frame
OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID21

obu_trimming_status_flag indicates whether this OBU has audio samples to be trimmed or not. If it is set to 1, the num_samples_to_trim_at_start and num_samples_to_trim_at_end fields shall be present.

obu_extension_flag indicates whether the extension_header_size field shall be present. If it is set to 0, the extension_header_size field shall not be present. Otherwise, the extension_header_size field shall be present.

This flag shall be set to 0 for the current version of the specification (i.e. version = 0). An IAC-OBU parser which is conformant with the current version of the specification shall be able to parse this flag and extension_header_size.

NOTE: A future version of specification may use this flag to specify an extension header field by setting obu_extension_flag = 1 and setting the size of extended header to extension_header_size.

obu_size shall indicate the size in bytes of the OBU, not including the bytes of the fields that precede it within the obu_header, i.e. obu_type, obu_redundant_copy, obu_trimming_status_flag and obu_extension_flag.

num_samples_to_trim_at_start shall indicate the number of samples that need to be trimmed from the start of the samples in this Audio Frame OBU.

num_samples_to_trim_at_end shall indicate the number of samples that need to be trimmed from the end of the samples in this Audio Frame OBU.

extension_header_size shall indicate the size in bytes of the extension header including this field.
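To make the field order of obu_header() concrete, the following Python sketch reads the header fields in the order given in the syntax above. It assumes that bit fields are read most significant bit first, and the names ObuHeader, parse_obu_header and _read_leb128 are illustrative and not part of this specification.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


def _read_leb128(data: bytes, pos: int) -> Tuple[int, int]:
    """Return (value, next_position) for an unsigned LEB128 at data[pos]."""
    value = 0
    for i in range(8):
        byte = data[pos + i]
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return value, pos + i + 1
    raise ValueError("leb128() value is longer than 8 bytes")


@dataclass
class ObuHeader:
    obu_type: int
    obu_redundant_copy: int
    obu_trimming_status_flag: int
    obu_extension_flag: int
    obu_size: int = 0
    num_samples_to_trim_at_end: int = 0
    num_samples_to_trim_at_start: int = 0
    extension_header_size: Optional[int] = None


def parse_obu_header(data: bytes, pos: int = 0) -> Tuple[ObuHeader, int]:
    """Parse obu_header() at data[pos]; return (header, next_position)."""
    first = data[pos]
    header = ObuHeader(
        obu_type=first >> 3,                       # upper 5 bits
        obu_redundant_copy=(first >> 2) & 1,
        obu_trimming_status_flag=(first >> 1) & 1,
        obu_extension_flag=first & 1,
    )
    header.obu_size, pos = _read_leb128(data, pos + 1)
    if header.obu_trimming_status_flag:
        # The syntax lists the end count before the start count.
        header.num_samples_to_trim_at_end, pos = _read_leb128(data, pos)
        header.num_samples_to_trim_at_start, pos = _read_leb128(data, pos)
    if header.obu_extension_flag:
        header.extension_header_size, pos = _read_leb128(data, pos)
    return header, pos
```

For example, the bytes 0x42 0x0A 0x00 0xB8 0x02 describe an OBU with obu_type = 8 (OBU_IA_Audio_Frame), obu_trimming_status_flag = 1, obu_size = 10, no samples trimmed at the end and 312 samples trimmed at the start.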

4.1.3. Byte Alignment Syntax and Semantics

Syntax

class byte_alignment() {
  while (get_position() & 7)
    unsigned int (1) zero_bit;
}

Semantics

zero_bit shall be equal to 0 and shall be inserted into the bitstream to align the bit position to a multiple of 8 bits.

4.1.4. Reserved OBU Syntax and Semantics

The reserved OBU allows the extension of this specification with additional OBU types in a way that allows IAC-OBU parsers compliant to this version of specification to ignore them.

4.1.5. Start Code OBU Syntax and Semantics

This section specifies obu payload of OBU_IA_Start_Code.

For this OBU, the OBU header (2 bytes) shall be set to 0xF806, i.e. obu_type = 31, all three flags equal to 0 and obu_size = 6.

Syntax

class start_code_obu() {
  unsigned int (32) ia_code;
  unsigned int (8) version;
  unsigned int (8) profile_version;
}

Semantics

ia_code shall be a ‘four-character code’ (4CC) to identify the start of the IA sequence. It shall be iamf.

version shall indicate the version of an IA sequence. It shall be set to 0 for this version of the specification. Implementations should treat IA sequences where the MSB four bits of the version number match that of a recognized specification as backwards compatible with that specification. That is, the version number can be split into "major" and "minor" version sub-fields, with changes to the minor sub-field (in the LSB four bits) signaling compatible changes. For example, an implementation of this specification should accept any stream with a version number of ’15’ or less, and should assume any stream with a version number ’16’ or greater is incompatible.

profile_version shall indicate the profile of an IA sequence, which is carried in its MSB four bits. Implementations should treat IA sequences where the MSB four bits of profile_version match those of a recognized profile as backwards compatible with that profile. That is, profile_version can be split into "profile major" and "profile minor" version sub-fields, with changes to the minor sub-field (in the LSB four bits) signaling changes that are compatible with the profile major version. The semantics of this field shall only be valid when the MSB four bits of version are equal to 0.
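The major/minor split of version and profile_version described above can be sketched in Python as follows. The function names and the returned dictionary keys are illustrative only; the sketch parses the six payload bytes of start_code_obu() and checks the compatibility rule given for version.

```python
def parse_start_code_payload(payload: bytes) -> dict:
    """Split the start_code_obu() payload into its fields and sub-fields."""
    if len(payload) < 6:
        raise ValueError("start_code_obu() payload is 6 bytes long")
    if payload[0:4] != b"iamf":
        raise ValueError("ia_code is not 'iamf'")
    version = payload[4]
    profile_version = payload[5]
    return {
        "version_major": version >> 4,            # MSB four bits
        "version_minor": version & 0x0F,          # LSB four bits
        "profile_major": profile_version >> 4,
        "profile_minor": profile_version & 0x0F,
    }


def is_version_compatible(version: int, supported_major: int = 0) -> bool:
    """True when the MSB four bits of version match the supported major version."""
    return (version >> 4) == supported_major
```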

4.1.6. Codec Config OBU Syntax and Semantics

This section specifies the OBU payload of OBU_IA_Codec_Config.

Syntax

class codec_config_obu() {
  leb128() codec_config_id;
  codec_config();
  leb128() num_audio_elements;
  for (i = 0; i < num_audio_elements; i++) {
    leb128() audio_element_id;
  }
}

class codec_config() {
  unsigned int (32) codec_id;
  decoder_config(codec_id);
  leb128() num_samples_per_frame;
  signed int (16) roll_distance;
}

Semantics

codec_config_id shall indicate a unique ID in an IA sequence for a given codec config.

codec_id shall be a ‘four-character code’ (4CC) to identify the codec used to generate the audio substreams. It shall be opus for IAC-OPUS, mp4a for IAC-AAC-LC, fLaC for IAC-FLAC and lpcm for IAC-LPCM.

For ISOBMFF encapsulation, it shall be the same as the boxtype of its AudioSampleEntry, if one exists.

decoder_config() specifies the set of codec parameters required to decode an audio substream for the given codec_id. It shall be byte aligned.

The codec_id and decoder_config() for IAC-OPUS shall conform to Codec_Specific_Info of § 4.3.1 IAC-OPUS Specific.
The codec_id and decoder_config() for IAC-AAC-LC shall conform to Codec_Specific_Info of § 4.3.2 IAC-AAC-LC Specific.
The codec_id and decoder_config() for IAC-FLAC shall conform to Codec_Specific_Info of § 4.3.3 IAC-FLAC Specific.
The codec_id and decoder_config() for IAC-LPCM shall conform to Codec_Specific_Info of § 4.3.4 IAC-LPCM Specific.

num_samples_per_frame shall indicate the frame length, in samples, of the raw coded audio provided by audio_frame_obu().

roll_distance is a signed integer that gives the number of frames that need to be decoded in order for a frame to be decoded correctly. A negative value indicates the number of frames before the frame to be decoded correctly.

It shall be set to -1 for IAC-AAC-LC and to -R for IAC-OPUS, where R is the smallest integer greater than or equal to 3840 divided by the frame size (e.g. R = 4 when the frame size is 960). IAC-FLAC may ignore this field.
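The value of R for IAC-OPUS is a ceiling division, which can be sketched in Python as follows. The function name opus_roll_distance is illustrative only.

```python
def opus_roll_distance(frame_size: int) -> int:
    """Return -R, where R = ceil(3840 / frame_size), for IAC-OPUS."""
    if frame_size <= 0:
        raise ValueError("frame_size must be a positive number of samples")
    r = -(-3840 // frame_size)   # integer ceiling division
    return -r
```

With a frame size of 960 samples this gives a roll_distance of -4, matching the example in the text.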

num_audio_elements shall specify the number of audio elements that refer to this codec config.

audio_element_id shall specify the unique ID associated with the specific audio element that refers to this codec config.

4.1.7. Audio Element OBU Syntax and Semantics

This section specifies the OBU payload of OBU_IA_Audio_Element.

Syntax

class audio_element_obu() {
  leb128() audio_element_id;
  unsigned int (3) audio_element_type;
  unsigned int (5) reserved;

  leb128() num_substreams;
  for (i = 0; i < num_substreams; i++) {
    leb128() audio_substream_id;
  }
  
  leb128() num_parameters;
  for (i = 0; i < num_parameters; i++) {
    leb128() parameter_id;
    leb128() parameter_name;
  }

  if (audio_element_type == CHANNEL_BASED) {
    scalable_channel_layout_config();
  } else if (audio_element_type == SCENE_BASED) {
    ambisonics_config();
  }
  
  
}

Semantics

audio_element_id shall indicate a unique ID in an IA sequence for a given audio element. A Codec Config OBU that refers to that audio element shall use the same value for its audio_element_id field.

audio_element_type shall specify the audio representation of this audio element which is constructed from one or more audio substreams.

audio_element_type: The type of audio representation.
   0    : CHANNEL_BASED
   1    : SCENE_BASED
  2~7   : Reserved

num_substreams shall specify the number of audio substreams that are used to reconstruct this audio element.

audio_substream_id shall specify the unique ID associated with the audio substream that is used to reconstruct this audio element.

num_parameters shall specify the number of parameters that are used by the algorithms specified in this audio element.

parameter_id shall be the unique ID associated with a parameter that is used by the algorithm specified in this audio element.

parameter_name shall specify the name of the parameter.

parameter_name : Parameter name.
       0       : SCALABLE_CHANNEL_LAYOUT_DEMIXING_INFO
       1       : SCALABLE_CHANNEL_LAYOUT_RECON_GAIN_INFO
   the others  : reserved

scalable_channel_layout_config() is a class that provides the metadata required for combining the substreams identified here in order to reconstruct a scalable channel layout.

ambisonics_config() is a class that provides the metadata required for combining the substreams identified here in order to reconstruct an Ambisonics layout.

4.1.8. Mix Presentation OBU Syntax and Semantics

This section specifies the OBU payload of OBU_IA_Mix_Presentation.

The metadata in mix_presentation() specifies how to render and process one or more audio elements. The processed audio elements shall then be summed to generate a mixed audio signal. Finally, any additional processing specified by the mix_bus_config() shall be applied to the mixed audio signal in order to generate the final output audio for playback.

Syntax

class mix_presentation_obu() {
  leb128() mix_presentation_id;
  string mix_presentation_friendly_label;
  unsigned int (2) mix_target_layout_type;

  mix_target_layout(mix_target_layout_type);

  leb128() num_audio_elements;
  for (i = 0; i < num_audio_elements; i++) {
    string audio_element_friendly_label;
    leb128() audio_element_id;
    rendering_config();
    element_mix_config();
  }

  mix_loudness_info();
  mix_bus_config();
}

Semantics

mix_presentation_id shall indicate a unique ID in an IA sequence for a given mix presentation.

mix_presentation_friendly_label shall specify a human-friendly label to describe this mix presentation.

mix_target_layout_type specifies a target layout type for this mix presentation. A value of 0 shall indicate no specific target layout, a value of 1 shall indicate that the target layout is defined using the SP Label of [ITU2051-3], a value of 2 shall indicate that the target layout is defined using the sound system convention of [ITU2051-3] and a value of 3 shall indicate that the target layout is binaural.

mix_target_layout_type : Mix Target layout type
           0           : NOT_DEFINED
           1           : LOUDSPEAKERS_SP_LABEL
           2           : LOUDSPEAKERS_SS_CONVENTION
           3           : BINAURAL

mix_target_layout() is a class that specifies the target playback layout that all referenced audio elements shall be rendered for.

An IA sequence may have one or more mix presentations specified, each with a different mix target layout. The IA parser shall select the appropriate target layout according to the following rules, in order:

The IA parser shall first attempt to select the mix presentation that matches the physical playback layout.
If there is no match, the IA parser shall select the mix presentation with mix_target_layout_type = 0. In this case, the renderer specified in rendering_config() shall render the physical playback layout appropriately.
If there is no mix presentation with mix_target_layout_type = 0, the IA parser should select the mix presentation with the closest specified layout to the physical layout. The renderer specified in rendering_config() shall first render the layout specified by mix_target_layout() and then apply up or down-mixing appropriately. Sections § 10.2.2 Down-mix Mechanism and § 9.5 Down-mix Matrix provide example dynamic and static down-mixing matrices for some common layouts that may be used.
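The three selection rules above can be sketched in Python as follows. The representation of a mix presentation as a dictionary, the "layout" key, and the caller-supplied distance function are all hypothetical: this specification does not define how the closeness of two layouts is measured, so that part is left to the caller.

```python
def select_mix_presentation(mixes, playback_layout, distance):
    """Pick a mix presentation following the three rules, in order.

    mixes: list of dicts with "mix_target_layout_type" and "layout" keys.
    distance: callable(layout_a, layout_b) -> number, smaller means closer.
    """
    # Rule 1: a mix presentation whose target layout matches the playback layout.
    for mix in mixes:
        if mix["mix_target_layout_type"] != 0 and mix["layout"] == playback_layout:
            return mix
    # Rule 2: a mix presentation with no specific target layout.
    for mix in mixes:
        if mix["mix_target_layout_type"] == 0:
            return mix
    # Rule 3: the mix presentation with the closest specified layout.
    if not mixes:
        return None
    return min(mixes, key=lambda mix: distance(mix["layout"], playback_layout))
```

In a toy setting where each layout is reduced to a channel count and the distance is the absolute difference of the counts, a 6-channel playback system offered only 2-channel and 8-channel mixes would select the 8-channel one under rule 3.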

num_audio_elements shall specify the number of audio elements that are used in this mix presentation to generate the final output audio signal for playback.

audio_element_friendly_label shall specify a human-friendly label to describe the referenced audio element.

audio_element_id shall indicate the unique ID associated with a specific audio element that is used in this mix presentation.

rendering_config() is a class that provides the metadata required for rendering the referenced audio element.

element_mix_config() is a class that provides the metadata required for applying any processing to the referenced and rendered audio element before being summed with other processed audio elements.

mix_loudness_info() is a class that provides the loudness information and statistics for the final output audio signal.

mix_bus_config() is a class that provides the metadata required for applying any post-processing to the mixed audio signal to generate the final output audio signal for playback.

4.1.9. Parameter Block OBU Syntax and Semantics

This section specifies the OBU payload of OBU_IA_Parameter_Block.

The metadata specified in this OBU defines the parameter values for an algorithm for an indicated duration, including any animation of the parameter values over this duration. The metadata shall be used in conjunction with a corresponding parameter definition and parameter data specification. The parameter definition shall be specified based on ParamDefinition(). The parameter data shall provide the values to apply in each parameter block. These shall be specified using the AnimatedParameterData() function template if parameter animation is supported.

Syntax

class parameter_block_obu() {
  leb128() parameter_id;
  leb128() duration;
  leb128() num_segments;
  leb128() constant_segment_interval;

  leb128() param_definition_type = get_param_definition_type(parameter_id);

  for (i = 0; i < num_segments; i++) {
    if (constant_segment_interval == 0) {
      leb128() segment_interval;
    }

    if (param_definition_type == PARAMETER_DEFINITION_MIX_GAIN) {
      leb128() animation_type;
      mix_gain_parameter_data(animation_type);
    }
    if (param_definition_type == PARAMETER_DEFINITION_DEMIXING_INFO) {
      demixing_info_parameter_data();
    }
    if (param_definition_type == PARAMETER_DEFINITION_RECON_GAIN_INFO) {
      recon_gain_info_parameter_data();
    }
  }
}

Semantics

parameter_id shall indicate the unique ID that is associated with a specific parameter definition. All parameter blocks that provide data for that parameter definition shall have the same parameter_id.

duration shall specify the duration for which this parameter block is valid and applicable. The duration shall be expressed as the number of ticks at the rate indicated by the time_base specified in the corresponding parameter definition.

num_segments shall specify the number of different sets of parameter values specified in this parameter block, where each set describes a different segment of the timeline, contiguously.

constant_segment_interval shall specify the interval of each segment, in the case where all segments have equal intervals. If all segments do not have equal intervals, the value of constant_segment_interval shall be set to 0. This value shall be expressed as the number of ticks at the rate indicated by the time base specified in the corresponding parameter definition.

get_param_definition_type() is a run-time function that shall map the parameter_id to its registered parameter definition type. All parameter definition types described in this version of the specification are listed in the table below, along with their associated parameter definitions.

param_definition_type : Parameter definition type            : Parameter definition
          0           : PARAMETER_DEFINITION_MIX_GAIN        : MixGainParamDefinition
          1           : PARAMETER_DEFINITION_DEMIXING_INFO   : DemixingInfoParamDefinition
          2           : PARAMETER_DEFINITION_RECON_GAIN_INFO : ReconGainInfoParamDefinition

NOTE: param_definition_type is not coded in the bitstream but is inferred at run time based on parameter_id.

segment_interval shall specify the interval for the given segment.
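The relationship between constant_segment_interval and the per-segment segment_interval values can be sketched in Python as follows. The function name segment_timeline and the returned list of (start_tick, interval) pairs are illustrative only; the sketch does not check the intervals against duration, since the text above does not state how the two relate.

```python
def segment_timeline(num_segments, constant_segment_interval, segment_intervals=None):
    """Return [(start_tick, interval), ...] for the segments of a parameter block."""
    if constant_segment_interval != 0:
        # All segments share one interval; no per-segment values are coded.
        intervals = [constant_segment_interval] * num_segments
    else:
        # A value of 0 means each segment carries its own segment_interval.
        if segment_intervals is None or len(segment_intervals) != num_segments:
            raise ValueError("one segment_interval per segment is required")
        intervals = list(segment_intervals)
    timeline = []
    start = 0
    for interval in intervals:
        timeline.append((start, interval))   # segments are contiguous
        start += interval
    return timeline
```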

animation_type shall specify how the parameter values in this parameter block shall be animated.

animation_type : Animation Type
       0       : STEP
       1       : BEZIER

In the case where animation_type is equal to BEZIER, parameters for the linear and quadratic Bezier curves may be defined in this version of the specification.

Classes that take animation_type as an input argument shall use the AnimatedParameterData() function template.

template <class T>
class AnimatedParameterData(animation_type) {
  if (animation_type == STEP) {
    T start_point_value;
  }
  if (animation_type == BEZIER) {
    T start_point_value;
    T end_point_value;
    T control_point_value;
    unsigned int (8) control_point_relative_time;
  }
}

start_point_value shall specify the parameter value that is applied at the start of the segment.

end_point_value shall specify the parameter value that is applied at the end of the segment.

control_point_value shall specify the parameter value of the middle control point of a quadratic Bezier curve, i.e. its y-axis value. If this animation implements a linear Bezier curve, control_point_value shall be ignored by the IA parser.

control_point_relative_time shall specify the time of the middle control point of a quadratic Bezier curve, i.e. its x-axis value. This value is expressed as a fraction of the parameter segment interval with valid values in the range of 0 and 1, inclusively. A value equal to 0 or 1 shall indicate that this animation implements a linear Bezier curve, in which case control_point_value shall be ignored by the IA parser. It is stored as an 8-bit, unsigned, fixed-point value with 8 fractional bits (i.e. Q0.8 in [Q-Format]).
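One possible way to evaluate a BEZIER segment is sketched below in Python. It rests on assumptions that are not stated in this section: that the curve has control points (0, start_point_value), (t_c, control_point_value) and (1, end_point_value) on a time axis normalized to the segment interval, and that the value at a given time is found by first solving the time component for the curve parameter. The name bezier_value is hypothetical. The raw 8-bit field is converted from Q0.8 by dividing by 256.

```python
import math


def bezier_value(start_value, end_value, control_value, control_time_q08, x):
    """Evaluate the segment at normalized time x in [0, 1] (assumed curve model)."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x is a fraction of the segment interval")
    if control_time_q08 == 0:
        # Linear Bezier curve: control_point_value is ignored.
        return start_value + (end_value - start_value) * x
    t_c = control_time_q08 / 256.0            # Q0.8 to a fraction
    # Time component: x(s) = (1 - 2*t_c) * s^2 + 2*t_c * s, solved for s.
    a = 1.0 - 2.0 * t_c
    if abs(a) < 1e-12:
        s = x                                  # t_c = 0.5, time is linear in s
    else:
        s = (-t_c + math.sqrt(t_c * t_c + a * x)) / a
    # Value component of the quadratic Bezier curve at parameter s.
    return ((1.0 - s) ** 2 * start_value
            + 2.0 * s * (1.0 - s) * control_value
            + s * s * end_value)
```

Under this model the curve passes through start_point_value at the start of the segment and end_point_value at its end, whatever the control point.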

4.1.9.1. Parameter Definition Syntax and Semantics

Parameter definition classes shall inherit from the abstract ParamDefinition() class. They may optionally further provide default parameter values, which are applied when there are no parameter blocks available.

Syntax

abstract class ParamDefinition() {
  leb128() parameter_id;
  leb128() time_base;
}

Semantics

parameter_id shall indicate the unique ID in an IA bitstream for a given parameter.

time_base shall specify the time base used by this parameter, expressed as seconds per tick. Time-related fields associated with this parameter, such as durations and intervals, shall be expressed in the number of ticks.

4.1.10. Audio Frame OBU Syntax and Semantics

This section specifies OBU payloads of OBU_IA_Audio_Frame and OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID21.

The first 22 audio substreams in an IA sequence may use the OBU types OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID21, which have predefined audio substream IDs associated with them. This avoids the need to manually specify an audio_substream_id.

Syntax

class audio_frame_obu_with_no_id() {
  leb128() audio_substream_id;
  audio_frame_obu(audio_substream_id);
}

class audio_frame_obu(audio_substream_id) {
  unsigned int (8*coded_frame_size) audio_frame();
}

Semantics

audio_substream_id shall indicate a unique ID in an IA sequence for a given substream. All Audio Frame OBUs of the same substream shall have the same audio_substream_id.

This value shall be greater than or equal to 22, in order to avoid collision with the reserved IDs for the OBU types OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID21.
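The mapping between obu_type and the audio substream ID can be sketched in Python as follows, using the obu_type values from the table in § 4.1.2. The constant and function names are illustrative only.

```python
OBU_IA_AUDIO_FRAME = 8          # carries an explicit audio_substream_id
OBU_IA_AUDIO_FRAME_ID0 = 9      # implies audio substream ID 0
OBU_IA_AUDIO_FRAME_ID21 = 30    # implies audio substream ID 21


def substream_id_for_obu(obu_type, explicit_id=None):
    """Return the audio substream ID for an Audio Frame OBU."""
    if OBU_IA_AUDIO_FRAME_ID0 <= obu_type <= OBU_IA_AUDIO_FRAME_ID21:
        return obu_type - OBU_IA_AUDIO_FRAME_ID0    # predefined IDs 0 to 21
    if obu_type == OBU_IA_AUDIO_FRAME:
        if explicit_id is None or explicit_id < 22:
            raise ValueError("explicit audio_substream_id shall be >= 22")
        return explicit_id
    raise ValueError("not an Audio Frame OBU type")
```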

coded_frame_size is the size of audio_frame() in bytes.

audio_frame() is the raw coded audio data for the frame. It shall be an Opus packet of [RFC6716] for IAC-OPUS, a raw_data_block() of [AAC] for IAC-AAC-LC and a FRAME of [FLAC] for IAC-FLAC.

For IAC-LPCM, audio_frame() shall be LPCM samples. When more than one byte is used to represent an LPCM sample, the byte order shall be little-endian.

For this version of the specification, all audio frames for a given substream must be gapless.

4.1.11. Temporal Delimiter OBU Syntax and Semantics

This section specifies the OBU payload of OBU_IA_Temporal_Delimiter.

Syntax

class temporal_delimiter_obu() {
}

NOTE: The Temporal Delimiter OBU has an empty payload.

4.1.12. Sync OBU Syntax and Semantics

This section specifies the OBU payload of OBU_IA_Sync.

Syntax

class sync_obu() {
  leb128() global_offset;
  leb128() num_obu_ids;
  for (i = 0; i < num_obu_ids; i++) {
    leb128() obu_id;
    unsigned int (1) obu_data_type;
    unsigned int (1) reinitialize_decoder;
    unsigned int (6) reserved;
    sleb128() relative_offset;
  }
  leb128() concatenation_rule;
}

Semantics

global_offset shall specify the offset that is applied to all substreams and parameters specified in this Sync OBU, in addition to their individual relative offsets.

For this version of the specification, the value of global_offset shall be set to 0.

num_obu_ids shall specify the number of substream and parameter IDs that this Sync OBU specifies the offset for.

obu_id shall specify the unique ID associated with the substream or parameter that is being referred to.

obu_data_type shall specify the type of data that is being referred to.

obu_data_type : Type of OBU data
      0       : SUBSTREAM
      1       : PARAMETER

reinitialize_decoder shall be used to specify the behaviour of a decoder when encountering gaps in the audio substream, where the gap shall be identified as described in § 6.2 Synchronizing Data OBUs. If obu_data_type does not equal SUBSTREAM, an IAC-OBU parser shall ignore this field.

If reinitialize_decoder = 0, the decoder shall not be reinitialized before decoding the audio frames after the gap. This may be used in the case where it is preferable for the decoder to fill the gap with silence instead.

If reinitialize_decoder = 1, the decoder shall be reinitialized before decoding the audio frames after the gap. If a pre-skip is specified in the relevant Codec Config OBU, it is applicable after reinitializing the decoder.

For this version of the specification, the value of reinitialize_decoder shall be set to 0. If a value of 1 is seen, the IA sequence shall be rejected as invalid.

reserved shall be set to 0. Reserved units are for future use and shall be ignored by an IAC-OBU parser.

relative_offset is the offset that is applied to the first audio frame or parameter block with the referenced obu_id that comes after this Sync OBU. It describes the position of audio and parameters in a local frame of reference. The local frame of reference is unique for each Sync OBU.

concatenation_rule shall specify the type of concatenation rule that is applied to position the audio frames and parameters that come after a Sync OBU with respect to the timeline before the Sync OBU. A value of 0 shall indicate that Concatenation Rule 1 specified in § 6.2 Synchronizing Data OBUs shall be used, while a value of 1 shall indicate that Concatenation Rule 2 shall be used.
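As an illustration only, the Sync OBU payload could be parsed along the following lines in Python. The function and key names are hypothetical. Two assumptions are made that this section does not confirm: sleb128() is taken to be the conventional signed LEB128 encoding (sign-extended from the last byte), and bit fields are read most significant bit first. The sketch rejects reinitialize_decoder = 1 only for SUBSTREAM entries, because the field is ignored for other data types.

```python
from typing import Dict, Tuple

SUBSTREAM, PARAMETER = 0, 1


def decode_leb128(data: bytes, pos: int) -> Tuple[int, int]:
    """Decode an unsigned leb128() value and return (value, next position)."""
    value = 0
    for i in range(8):  # at most 8 bytes, as in [AV1-Convention]
        byte = data[pos + i]
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return value, pos + i + 1
    raise ValueError("leb128 value is longer than 8 bytes")


def decode_sleb128(data: bytes, pos: int) -> Tuple[int, int]:
    """Decode a signed value (assumed: conventional signed LEB128)."""
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            if byte & 0x40:  # sign bit of the final byte
                value -= 1 << shift
            return value, pos


def parse_sync_obu(payload: bytes) -> Dict:
    """Parse sync_obu() and apply the restrictions of this version."""
    global_offset, pos = decode_leb128(payload, 0)
    if global_offset != 0:
        raise ValueError("global_offset shall be 0 in this version")
    num_obu_ids, pos = decode_leb128(payload, pos)
    entries = []
    for _ in range(num_obu_ids):
        obu_id, pos = decode_leb128(payload, pos)
        flags = payload[pos]
        pos += 1
        obu_data_type = flags >> 7               # first bit of the byte
        reinitialize_decoder = (flags >> 6) & 1  # second bit
        # The remaining 6 reserved bits are ignored.
        if obu_data_type == SUBSTREAM and reinitialize_decoder == 1:
            raise ValueError("reinitialize_decoder = 1 is invalid in this version")
        relative_offset, pos = decode_sleb128(payload, pos)
        entries.append({
            "obu_id": obu_id,
            "obu_data_type": obu_data_type,
            "relative_offset": relative_offset,
        })
    concatenation_rule, pos = decode_leb128(payload, pos)
    return {
        "global_offset": global_offset,
        "entries": entries,
        "concatenation_rule": concatenation_rule,
    }
```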

4.2. Detailed OBU Syntax and Semantics

4.2.1. Scalable Channel Layout Config Syntax and Semantics

scalable_channel_layout_config() contains information regarding the configuration of scalable channel audio.

Syntax

class scalable_channel_layout_config() {
  unsigned int (3) num_layers;
  unsigned int (5) reserved;
  for (i = 1; i <= num_layers; i++) {
    channel_audio_layer_config(i);
  }
}

class channel_audio_layer_config(i) {
  unsigned int (4) loudspeaker_layout(i);
  unsigned int (1) output_gain_is_present_flag(i);
  unsigned int (1) recon_gain_is_present_flag(i);
  unsigned int (2) reserved;
  unsigned int (8) substream_count(i);
  unsigned int (8) coupled_substream_count(i);
  signed int (16) loudness(i);
  if (output_gain_is_present_flag(i) == 1) {
    unsigned int (6) output_gain_flag(i);
    unsigned int (2) reserved;
    signed int (16) output_gain(i);
  }
}

When an audio element is composed of G(r) number of substreams, scalable channel audio for the audio element shall be layered into num_layers = r number of ChannelGroups.

The order of ChannelGroups in each temporal unit shall be the same as the order of the channel_audio_layer_config()s in scalable_channel_layout_config().
A ChannelGroup is a set of substreams that can provide a spatial resolution of the audio content by itself, or an enhanced spatial resolution of the audio content when combined with the preceding ChannelGroups within the audio frames.
ChannelGroup #q consists of G(q) - G(q-1) substreams, where q = 1, 2, ..., r and G(0) = 0.
An IA frame shall be the set of audio_frame_obus of a single audio element for scalable channel audio that have the same sync offsets, one from each substream.
Every IA frame shall have the same number of audio_frame_obus.
When r > 1, a parameter_block_obu may be present with the IA frame.
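The relationship between the cumulative substream counts G(q) and the size of each ChannelGroup can be sketched in Python as follows. The function name and the example counts are hypothetical and serve only to illustrate the G(q) - G(q-1) rule.

```python
from typing import List


def substreams_per_channel_group(cumulative_counts: List[int]) -> List[int]:
    """Given [G(1), ..., G(r)], return the substream count of each ChannelGroup.

    ChannelGroup #q holds G(q) - G(q-1) substreams, with G(0) = 0.
    """
    counts = []
    previous = 0  # G(0) = 0
    for g in cumulative_counts:
        if g < previous:
            raise ValueError("G(q) must not decrease")
        counts.append(g - previous)
        previous = g
    return counts
```

With hypothetical values G(1) = 1, G(2) = 2 and G(3) = 4, the three ChannelGroups hold 1, 1 and 2 substreams respectively, and every IA frame carries G(3) = 4 audio_frame_obus.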

Immersive Audio Sequence with scalable channel audio (before OBU packing)

Semantics

num_layers shall indicate the number of ChannelGroups for scalable channel audio. It shall not be set to zero and its maximum number shall be limited to 6.

channel_audio_layer_config() is a class that provides information regarding the configuration of a ChannelGroup for scalable channel audio. channel_audio_layer_config(i) shall provide information regarding the configuration of ChannelGroup #i.

loudspeaker_layout shall indicate the channel layout for the channels to be reconstructed from the preceding ChannelGroups and the current ChannelGroup among the ChannelGroups for scalable channel audio.

In the current version of the specification, loudspeaker_layout shall indicate one of 9 channel layouts including Mono, Stereo, 5.1ch, 5.1.2ch, 5.1.4ch, 7.1ch, 7.1.2ch, 7.1.4ch and 3.1.2ch. Where,

Stereo is the loudspeaker configuration as depicted in Loudspeaker configuration for Sound System A (0+2+0) of [ITU2051-3].
5.1ch is the loudspeaker configuration as depicted in Loudspeaker configuration for Sound System B (0+5+0) of [ITU2051-3].
5.1.2ch is the loudspeaker configuration as depicted in Loudspeaker configuration for Sound System C (2+5+0) of [ITU2051-3].
5.1.4ch is the loudspeaker configuration as depicted in Loudspeaker configuration for Sound System D (4+5+0) of [ITU2051-3].
7.1ch is the loudspeaker configuration as depicted in Loudspeaker configuration for Sound System I (0+7+0) of [ITU2051-3].
7.1.2ch is the combination of the loudspeaker configuration as depicted in Loudspeaker configuration for Sound System I (0+7+0) of [ITU2051-3] and the left and right top front pair of the loudspeaker configuration as depicted in Loudspeaker configuration for Sound System J (4+7+0) of [ITU2051-3].
7.1.4ch is the loudspeaker configuration as depicted in Loudspeaker configuration for Sound System J (4+7+0) of [ITU2051-3].
3.1.2ch is the front subset (L/C/R/Ltf/Rtf/LFE) of 7.1.4ch.

Loudspeaker Layout (4 bits) :  Channel Layout  : Loudspeaker Location Ordering
             0000           :       Mono       : C
             0001           :      Stereo      : L/R
             0010           :      5.1ch       : L/C/R/Ls/Rs/LFE
             0011           :     5.1.2ch      : L/C/R/Ls/Rs/Ltf/Rtf/LFE
             0100           :     5.1.4ch      : L/C/R/Ls/Rs/Ltf/Rtf/Ltr/Rtr/LFE
             0101           :      7.1ch       : L/C/R/Lss/Rss/Lrs/Rrs/LFE
             0110           :     7.1.2ch      : L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/LFE
             0111           :     7.1.4ch      : L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/Ltb/Rtb/LFE
             1000           :     3.1.2ch      : L/C/R/Ltf/Rtf/LFE
            others          :     reserved     :

Where, C: Center, L: Left, R: Right, Ls: Left Surround, Lss: Left Side Surround, 
Rs: Right Surround, Rss: Right Side Surround, 
Ltf: Left Top Front, Rtf: Right Top Front, Ltr: Left Top Rear, Rtr: Right Top Rear, 
Ltb: Left Top Back, Rtb: Right Top Back, LFE: Low-Frequency Effects
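A decoder might hold the table above as a simple lookup, as in the Python sketch below. The names are hypothetical. The 3.1.2ch ordering follows the front subset L/C/R/Ltf/Rtf/LFE stated earlier in this section.

```python
from typing import List, Tuple

# loudspeaker_layout (4 bits) -> (channel layout, loudspeaker location ordering)
LOUDSPEAKER_LAYOUTS = {
    0b0000: ("Mono", "C"),
    0b0001: ("Stereo", "L/R"),
    0b0010: ("5.1ch", "L/C/R/Ls/Rs/LFE"),
    0b0011: ("5.1.2ch", "L/C/R/Ls/Rs/Ltf/Rtf/LFE"),
    0b0100: ("5.1.4ch", "L/C/R/Ls/Rs/Ltf/Rtf/Ltr/Rtr/LFE"),
    0b0101: ("7.1ch", "L/C/R/Lss/Rss/Lrs/Rrs/LFE"),
    0b0110: ("7.1.2ch", "L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/LFE"),
    0b0111: ("7.1.4ch", "L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/Ltb/Rtb/LFE"),
    0b1000: ("3.1.2ch", "L/C/R/Ltf/Rtf/LFE"),
}


def channel_ordering(loudspeaker_layout: int) -> Tuple[str, List[str]]:
    """Return (layout name, ordered channel labels) for a 4-bit layout code."""
    try:
        name, ordering = LOUDSPEAKER_LAYOUTS[loudspeaker_layout]
    except KeyError:
        raise ValueError("reserved loudspeaker_layout value") from None
    return name, ordering.split("/")
```

The length of the returned list gives the number of channels to reconstruct for the layer, for example 12 for 7.1.4ch.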

output_gain_is_present_flag shall indicate whether output_gain information fields for the ChannelGroup are present.

0: No output_gain information fields for the ChannelGroup are present.
1: output_gain information fields for the ChannelGroup are present. In this case, the output_gain_flags and output_gain fields are present.

recon_gain_is_present_flag shall indicate whether recon_gain information fields for the ChannelGroup are present in recon_gain_info().

0: No recon_gain information fields for the ChannelGroup are present in recon_gain_info().
1: recon_gain information fields for the ChannelGroup are present in recon_gain_info(). In this case, the recon_gain_flags and recon_gain fields are present.

loudness shall indicate the loudness value of the downmixed channels, for the channel layout indicated by loudspeaker_layout, from the original channel audio. It shall be stored as a fixed-point value with 8 fractional bits (i.e. Q7.8 in [Q-Format]) and should be in LKFS based on [ITU1770-4], so it shall represent a zero or negative value.

output_gain_flags shall indicate the channels to which output_gain is applied. If a bit is set to 1, output_gain shall be applied to the channel. Otherwise, output_gain shall not be applied to the channel.

Bit position : Channel Name
    b5(MSB)  : Left channel (L1, L2, L3)
      b4     : Right channel (R2, R3)
      b3     : Left Surround channel (Ls5)
      b2     : Right Surround channel (Rs5)
      b1     : Left Top Front channel (Ltf)
      b0     : Right Top Front channel (Rtf)

output_gain shall indicate the gain value to be applied to the mixed channels which are indicated by output_gain_flags. It is 20*log10 of the factor by which to scale the mixed channels. It is stored in a 16-bit, signed, two’s complement fixed-point value with 8 fractional bits (i.e. Q7.8 in [Q-Format]). Where, each mixed channel is generated by downmixing two or more input channels.
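The Q7.8 representation and the dB-to-linear relationship described above can be sketched in Python as follows. The helper names and the short channel labels are hypothetical. The conversion itself follows from the definition of output_gain as 20*log10 of the scale factor.

```python
from typing import List

# Channel selected by each bit of output_gain_flags, from b5 (MSB) to b0.
OUTPUT_GAIN_CHANNELS = ["L", "R", "Ls", "Rs", "Ltf", "Rtf"]


def q7_8_to_float(raw: int) -> float:
    """Interpret a 16-bit two's complement Q7.8 field as a real number."""
    raw &= 0xFFFF
    if raw & 0x8000:  # negative value
        raw -= 0x10000
    return raw / 256.0


def db_to_linear(gain_db: float) -> float:
    """Invert gain_db = 20*log10(factor) to get the linear scale factor."""
    return 10.0 ** (gain_db / 20.0)


def channels_with_output_gain(output_gain_flags: int) -> List[str]:
    """List the channels whose bit is set in the 6-bit output_gain_flags."""
    return [
        name
        for bit, name in zip(range(5, -1, -1), OUTPUT_GAIN_CHANNELS)
        if (output_gain_flags >> bit) & 1
    ]
```

For example, a coded output_gain of 0xFA00 is -6.0 dB, which corresponds to a linear factor of about 0.501 applied to each channel flagged in output_gain_flags.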

4.2.2. Ambisonics Config Syntax and Semantics

ambisonics_config() contains information regarding the configuration of Ambisonics.

Syntax

class ambisonics_config() {
  leb128() ambisonics_mode;
  if (ambisonics_mode == MONO) {
    ambisonics_mono_config();
  } else if (ambisonics_mode == PROJECTION) {
    ambisonics_projection_config();
  }
}

class ambisonics_mono_config() {
  unsigned int (8) output_channel_count (C);
  unsigned int (8) substream_count (N);
  unsigned int (8 * C) channel_mapping;
}

class ambisonics_projection_config() {
  unsigned int (8) output_channel_count (C);
  unsigned int (8) substream_count (N);
  unsigned int (8) coupled_substream_count (M);
  unsigned int (16 * (N + M) * C) demixing_matrix;
}

Semantics

ambisonics_mode shall specify the method of coding Ambisonics.

ambisonics_mode: Method of coding Ambisonics.
   0    : MONO
   1    : PROJECTION

If ambisonics_mode is equal to MONO, this shall indicate that the Ambisonics channels are coded as individual mono substreams.

If ambisonics_mode is equal to PROJECTION, this shall indicate that the Ambisonics channels are first linearly projected onto another subspace before coding as a mix of coupled stereo and mono substreams.

output_channel_count shall be the same as the channel count in [RFC8486].

substream_count shall specify the number of audio substreams. It must be the same as num_substreams in its corresponding audio_element().

channel_mapping shall be the same as the one for ChannelMappingFamily = 2 in [RFC8486].

coupled_substream_count shall specify the number of referenced substreams that are coded as coupled stereo channels, where M <= N.

demixing_matrix shall be the same as the one for ChannelMappingFamily = 3 in [RFC8486] except the byte order of each of matrix coefficients shall be converted to big endian.
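As a non-normative sketch, ambisonics_projection_config() could be read in Python as shown below. The function name is hypothetical. The coefficients are unpacked as signed 16-bit big-endian integers, on the assumption that they keep the signed representation of [RFC8486] with only the byte order changed. The sketch returns them as a flat list and does not interpret the matrix layout.

```python
import struct
from typing import Dict


def parse_ambisonics_projection_config(data: bytes) -> Dict:
    """Parse output_channel_count (C), substream_count (N),
    coupled_substream_count (M) and the (N + M) * C matrix coefficients."""
    if len(data) < 3:
        raise ValueError("ambisonics_projection_config is truncated")
    c, n, m = data[0], data[1], data[2]
    if m > n:
        raise ValueError("coupled_substream_count shall satisfy M <= N")
    coefficient_count = (n + m) * c
    end = 3 + 2 * coefficient_count  # 16 bits per coefficient
    if len(data) < end:
        raise ValueError("demixing_matrix is truncated")
    # '>' selects big-endian, 'h' a signed 16-bit integer (assumed signed).
    coefficients = list(struct.unpack(f">{coefficient_count}h", data[3:end]))
    return {
        "output_channel_count": c,
        "substream_count": n,
        "coupled_substream_count": m,
        "demixing_matrix": coefficients,
    }
```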

4.2.3. Demixing Info Syntax and Semantics

demixing_info() specifies demixing parameter mode to be used to reconstruct output channel audio according to its loudspeaker_layout.

Syntax

class demixing_info() {
  unsigned int (3) dmixp_mode;
  unsigned int (5) reserved;
}

Semantics

dmixp_mode shall indicate a mode of pre-defined combinations of five demix parameters.

0: mode1, (alpha, beta, gamma, delta, w_idx_offset) = (1, 1, 0.707, 0.707, -1)
1: mode2, (alpha, beta, gamma, delta, w_idx_offset) = (0.707, 0.707, 0.707, 0.707, -1)
2: mode3, (alpha, beta, gamma, delta, w_idx_offset) = (1, 0.866, 0.866, 0.866, -1)
3: reserved
4: mode1, (alpha, beta, gamma, delta, w_idx_offset) = (1, 1, 0.707, 0.707, 1)
5: mode2, (alpha, beta, gamma, delta, w_idx_offset) = (0.707, 0.707, 0.707, 0.707, 1)
6: mode3, (alpha, beta, gamma, delta, w_idx_offset) = (1, 0.866, 0.866, 0.866, 1)
7: reserved

alpha and beta shall be gain values used for S7to5 down-mixer, gamma for T4to2 down-mixer, delta for S5to3 down-mixer and w_idx_offset shall be the offset to generate a gain value w used for T2toTF2 down-mixer.
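The dmixp_mode table lends itself to a direct lookup. The Python sketch below is illustrative only: the type and function names are hypothetical, and the values are copied from the list above.

```python
from typing import NamedTuple


class DemixParams(NamedTuple):
    mode: str
    alpha: float
    beta: float
    gamma: float
    delta: float
    w_idx_offset: int


# dmixp_mode (3 bits) -> predefined demix parameters; 3 and 7 are reserved.
DMIXP_MODES = {
    0: DemixParams("mode1", 1.0, 1.0, 0.707, 0.707, -1),
    1: DemixParams("mode2", 0.707, 0.707, 0.707, 0.707, -1),
    2: DemixParams("mode3", 1.0, 0.866, 0.866, 0.866, -1),
    4: DemixParams("mode1", 1.0, 1.0, 0.707, 0.707, 1),
    5: DemixParams("mode2", 0.707, 0.707, 0.707, 0.707, 1),
    6: DemixParams("mode3", 1.0, 0.866, 0.866, 0.866, 1),
}


def parse_demixing_info(byte: int) -> DemixParams:
    """Decode demixing_info(): dmixp_mode is the top 3 bits of the byte."""
    dmixp_mode = (byte >> 5) & 0x7  # the low 5 bits are reserved
    if dmixp_mode not in DMIXP_MODES:
        raise ValueError("reserved dmixp_mode value")
    return DMIXP_MODES[dmixp_mode]
```

Modes 4 to 6 repeat the gains of modes 0 to 2 and differ only in the sign of w_idx_offset.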

IA Down-mix Mechanism

4.2.4. Recon Gain Info Syntax and Semantics

recon_gain_info() contains recon gain values for demixed channels.

Syntax

class recon_gain_info() {
  for (i=0; i< channel_audio_layer; i++) {
    if (recon_gain_is_present_flag(i) == 1) {
      leb128() recon_gain_flags(i);
      for (j=0; j< n(i); j++) {
        if (recon_gain_flag(i)(j) == 1)
          unsigned int (8) recon_gain;
      }
    }
  }
}

Semantics

recon_gain_flags shall indicate the channels which recon_gain is applied to.

recon_gain shall indicate the gain value to be applied to the channel, which is indicated by recon_gain_flags, after decoding of the following associated frames.

4.2.5. Mix Target Layout Syntax and Semantics

The mix target layout specifies the list of physical loudspeaker positions according to [ITU2051-3].

Syntax

class mix_target_layout(mix_target_layout_type) {
  if (mix_target_layout_type == LOUDSPEAKERS_SP_LABEL) {
    unsigned int (6) num_loudspeakers;
    for (i = 0; i < num_loudspeakers; i++) {
      unsigned int (8) sp_label;
    }
  } 
  else if (mix_target_layout_type == LOUDSPEAKERS_SS_CONVENTION) {
    unsigned int (4) sound_system;
    unsigned int (2) reserved;
  }
  else if (mix_target_layout_type == BINAURAL or NOT_DEFINED) {
    unsigned int (6) reserved;
  }
}

Semantics

num_loudspeakers shall specify the number of loudspeakers.

sp_label shall define the SP Label as specified in [ITU2051-3].

sp_label	SP label	sp_label	SP label	sp_label	SP label
0	M+000	18	U+000	36	B+000
1	M+022	19	U+022	37	B+022
2	M-022	20	U-022	38	B-022
3	M+SC	21	U+030	39	B+030
4	M-SC	22	U-030	40	B-030
5	M+030	23	U+045	41	B+045
6	M-030	24	U-045	42	B-045
7	M+045	25	U+060	43	B+060
8	M-045	26	U-060	44	B-060
9	M+060	27	U+090	45	B+090
10	M-060	28	U-090	46	B-090
11	M+090	29	U+110	47	B+110
12	M-090	30	U-110	48	B-110
13	M+110	31	U+135	49	B+135
14	M-110	32	U-135	50	B-135
15	M+135	33	U+180	51	B+180
16	M-135	34	UH+180	52	LFE1
17	M+180	35	T+000	53	LFE2
				54 ~ 255	Reserved

sound_system shall specify the sound system A to J as specified in [ITU2051-3] as follows:

0: It shall indicate Loudspeaker configuration for Sound System A (0+2+0)
1: It shall indicate Loudspeaker configuration for Sound System B (0+5+0)
2: It shall indicate Loudspeaker configuration for Sound System C (2+5+0)
3: It shall indicate Loudspeaker configuration for Sound System D (4+5+0)
4: It shall indicate Loudspeaker configuration for Sound System E (4+5+1)
5: It shall indicate Loudspeaker configuration for Sound System F (3+7+0)
6: It shall indicate Loudspeaker configuration for Sound System G (4+9+0)
7: It shall indicate Loudspeaker configuration for Sound System H (9+10+3)
8: It shall indicate Loudspeaker configuration for Sound System I (0+7+0)
9: It shall indicate Loudspeaker configuration for Sound System J (4+7+0)
10 ~ 15: Reserved

4.2.6. Rendering Config Syntax and Semantics

Syntax

class rendering_config() {
  // TODO
}

Semantics

4.2.7. Element Mix Config Syntax and Semantics

element_mix_config() provides a gain value to be applied to the rendered audio element signal.

Syntax

class element_mix_config() {
  MixGainParamDefinition mix_gain;
}

Semantics

mix_gain provides the parameter definition for the gain value that is applied to all channels of the rendered audio element signal. The parameter definition is provided by MixGainParamDefinition() and the corresponding parameter data to be provided in parameter blocks is specified in mix_gain_parameter_data().

4.2.7.1. Mix Gain Parameter Definition and Data Syntax and Semantics

Syntax

class MixGainParamDefinition extends ParamDefinition() {
  signed int (16) default_mix_gain;
}

class mix_gain_parameter_data(animation_type) {
  AnimatedParameterData<signed int (16)> param_data;
}

Semantics

default_mix_gain shall specify the default mix gain value to apply when there are no mix gain parameter blocks provided. This value is expressed in dB and shall be applied to all channels in the rendered audio element. It is stored as a 16-bit, signed, two’s complement fixed-point value with 8 fractional bits (i.e. Q7.8 in [Q-Format]).

param_data shall use the AnimatedParameterData function template. Each of the values defined within this instance (start_point_value, end_point_value and control_point_value) shall be expressed in dB and shall be applied to all channels in the rendered audio element. They are stored as 16-bit, signed, two’s complement fixed-point values with 8 fractional bits (i.e. Q7.8 in [Q-Format]).

4.2.8. Mix Loudness Info Syntax and Semantics

Syntax

class mix_loudness_info() {
  signed int (16) mix_loudness;
}

Semantics

mix_loudness shall indicate the loudness value of the mixed channels, for the mix_target_layout(), from the audio elements that are specified in the mix_presentation_obu(). It is stored as a fixed-point value with 8 fractional bits (i.e. Q7.8 in [Q-Format]) and the value should be in LKFS based on [ITU1770-4], so it shall represent a zero or negative value.

4.2.9. Mix Bus Config Syntax and Semantics

Syntax

class mix_bus_config() {
  drc_config();
}

class drc_config() {
  // TODO
}

Semantics

4.3. Codec Specific

This section defines codec specific information for Codec_Specific_Info and Substream.

Codec_Specific_Info shall be composed of Codec_ID and Decoder_Config(). Codec_ID shall indicate the codec that has been used to generate a given substream within the IA sequence, and Decoder_Config() shall indicate the decoding parameters that are applied to the substream within the IA sequence.

For legacy codecs, Decoder_Config() shall be exactly the same information as a conventional file parser feeds to the codec decoder for decoding of the substream. For future codecs, Decoder_Config() shall include all of the decoding parameters that are required to decode Substreams.

Substream shall be a raw coded stream for one or more channels. Substream format shall be exactly the same as the sample format (before OBU packing and excluding parameter blocks) of an audio file that consists of only one single coded stream generated by the codec indicated by Codec_ID.

4.3.1. IAC-OPUS Specific

Codec_Specific_Info for IAC-OPUS shall conform to ID Header with ChannelMappingFamily = 0 of [RFC7845] with the following constraints:

Channel Count should be set to 2.
Output Gain shall not be used. In other words, it shall be set to 0dB.
The byte order of each field in ID Header shall be converted to big endian.

Substream format shall be an Opus packet of [RFC6716] that contains only one single frame of mono or stereo channels and that has a non-delimiting frame structure.

4.3.2. IAC-AAC-LC Specific

Codec_ID shall be mp4a.

Decoder_Config() for IAC-AAC-LC shall be DecoderConfigDescriptor() of [MP4-Systems], which is a subset of ESDBox for [MP4-Audio], with the following constraints:

objectTypeIndication = 0x40
streamType = 0x05 (Audio Stream)
upstream = 0
decSpecificInfo(): The syntax and values shall conform to AudioSpecificConfig() of [MP4-Audio] with the following constraints:
- audioObjectType = 2
- channelConfiguration should be set to 2.
- GASpecificConfig(): The syntax and values shall conform to GASpecificConfig() of [MP4-Audio] with the following constraints:
  - frameLengthFlag = 0 (1024 lines IMDCT)
  - dependsOnCoreCoder = 0
  - extensionFlag = 0

Substream format shall be one single raw_data_block() of [AAC] which contains only one single frame of mono or stereo channels.

4.3.3. IAC-FLAC Specific

Codec_ID shall be fLaC, the FLAC stream marker in ASCII, meaning byte 0 of the stream is 0x66, followed by 0x4C 0x61 0x43.

Decoder_Config() for IAC-FLAC shall be METADATA_BLOCK of [FLAC].

Substream format shall be FRAME of [FLAC], which is composed of FRAME_HEADER, followed by SUBFRAME(s) (one SUBFRAME per channel) and followed by FRAME_FOOTER.

4.3.4. IAC-LPCM Specific

Codec_ID shall be lpcm.

Decoder_Config() for IAC-LPCM shall be as follows:

class decoder_config(lpcm) {
  unsigned int (32) sample_rate;
  unsigned int (8) sample_size;
}

sample_rate shall indicate the sample rate of the input audio in Hz.

sample_size shall indicate the size of a PCM sample in bit units. The value shall be less than or equal to 24.

Substream format shall be the LPCM audio samples for the frame size.
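The Python sketch below shows one way to read decoder_config(lpcm) and to unpack an LPCM audio_frame(). It is illustrative only. The function names are hypothetical, sample_rate is assumed to be stored big-endian like other fixed-width fields, and the samples are assumed to be signed and to occupy a whole number of bytes, none of which is stated in this section. The little-endian sample byte order follows the Audio Frame OBU semantics.

```python
import struct
from typing import List, Tuple


def parse_lpcm_decoder_config(data: bytes) -> Tuple[int, int]:
    """Return (sample_rate in Hz, sample_size in bits)."""
    if len(data) < 5:
        raise ValueError("decoder_config(lpcm) is truncated")
    # '>I' is a big-endian 32-bit unsigned int (assumed), 'B' one byte.
    sample_rate, sample_size = struct.unpack(">IB", data[:5])
    if sample_size > 24:
        raise ValueError("sample_size shall be less than or equal to 24")
    return sample_rate, sample_size


def decode_lpcm_samples(frame: bytes, sample_size: int) -> List[int]:
    """Split an LPCM audio_frame() into integer samples.

    Assumes signed samples stored in ceil(sample_size / 8) bytes each,
    least significant byte first.
    """
    width = (sample_size + 7) // 8
    if width == 0 or len(frame) % width:
        raise ValueError("frame length is not a whole number of samples")
    return [
        int.from_bytes(frame[i:i + width], "little", signed=True)
        for i in range(0, len(frame), width)
    ]
```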

5. Profiles

The IA Profiles define a set of capabilities that are required to parse, decode and process the corresponding IA sequence.

5.1. IA Simple Profile

This section specifies the conformance points of the simple profile.

Restrictions on the IA sequence:

There shall be only one unique Codec Config OBU.
There shall be only one unique Audio Element OBU.
There shall be only one unique set of Descriptor OBUs.
There shall not be any Temporal Delimiter OBUs present.
version shall be set to 0 for this version of specification.
profile_version shall be set to 0 for this version of specification.
num_layers shall be set to 1 or up to 6 for Channel-based audio element (i.e. scalable channel audio).
- In this case, demixing_info() and recon_gain_info() may be present in the IA sequence.
- In case of simple scalable channel audio (e.g. mono for layer 1 & stereo for layer 2), demixing_info() and recon_gain_info() shall not be present in the bitstream.
- When num_layers = 1, OBU_IA_Parameter_Block including demixing_info() may be present in the IA sequence and IA decoders may use the demixing_info() for dynamic down-mixing.
All audio frames shall have aligned frame boundaries.

Capabilities of the IA parser, decoder and processor:

They shall be able to parse an IA sequence with the MSB four bits of profile_version = 0 and the MSB four bits of version = 0 (i.e., profile_version = 0 to 15 and version = 0 to 15).
They shall be able to decode and process up to 16 channels.
They shall be able to reconstruct one audio element.
They may use demixing_info() to do down-mixing.

5.2. IA Base Profile

This section specifies the conformance points of the base profile.

Restrictions on IA sequence:

There shall be only one unique Codec Config OBU.
There shall be at most two unique Audio Element OBUs at any one time.
There may be more than one unique set of Descriptor OBUs.
There may be Temporal Delimiter OBUs present.
version shall be set to 0 for this version of specification.
profile_version shall be set to 16 for this version of specification.
num_layers shall be set to 1 or up to 6 for Channel-based audio element (i.e. scalable channel audio)
- In this case, demixing_info() and recon_gain_info() may be present in the IA sequence.
- In case of simple scalable channel audio (e.g. mono for layer 1 & stereo for layer 2), demixing_info() and recon_gain_info() shall not be present in the bitstream.
- When num_layers = 1, OBU_IA_Parameter_Block including demixing_info() may be present in the IA sequence and IA decoders may use the demixing_info() for dynamic down-mixing.
All audio frames shall have aligned frame boundaries.

Capabilities of the IA parser, decoder and processor:

They shall be able to parse an IA sequence with the MSB four bits of profile_version = 0 or 1 and the MSB four bits of version = 0 (i.e., profile_version = 0 to 31 and version = 0 to 15).
They shall be able to support the capabilities of the Simple Profile.
They shall be able to decode and process up to 16 channels.
They shall be able to reconstruct two audio elements.
They shall be able to mix two audio elements.
They shall be able to process short-lived audio elements.

5.3. IA Enhanced Profile

This section specifies the conformance points of the enhanced profile.

Restrictions on IA sequence:

There may be more than one unique Codec Config OBU.
There may be more than one unique Audio Element OBU.
There may be more than one unique Mix Presentation OBU.
There shall not be Temporal Delimiter OBUs present.
version shall be set to 0 for this version of specification.
profile_version shall be set to 32 for this version of specification.
The different Codec Config OBUs may have different codec_ids specified with the following constraints:
- The combination of codec_id = fLaC for one substream and codec_id = opus for another substream shall not be allowed.
- The combination of codec_id = fLaC for one substream and codec_id = mp4a for another substream shall not be allowed.
num_layers shall be set to 1 or up to 6 for Channel-based audio element (i.e. scalable channel audio)
- In this case, demixing_info() and recon_gain_info() may be present.
- In case of simple scalable channel audio (e.g. mono for layer 1 & stereo for layer 2), demixing_info() and recon_gain_info() shall not be present.

Capabilities of the IA parser, decoder and processor:

They shall be able to parse an IA sequence with the MSB four bits of profile_version = 0, 1 or 2 and the MSB four bits of version = 0 (i.e., profile_version = 0 to 47 and version = 0 to 15).
They shall be able to support the capabilities of the base profile.
They shall be able to decode and process up to 36 channels.
They shall be able to decode one or more different audio codecs in the same sequence, with the exception of the following combinations:
- IAC-FLAC and IAC-OPUS
- IAC-FLAC and IAC-AAC-LC
An IA decoder that conforms to this profile shall be able to synchronize two or more audio elements with different frame sizes.
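The parsing capabilities of the three profiles can be summarized with a small check on the four most significant bits of profile_version and version. The Python sketch below is illustrative. The names are hypothetical, and both fields are assumed to be 8 bits wide, as the value ranges quoted above suggest.

```python
# Highest value of the MSB four bits of profile_version each profile can parse.
MAX_PROFILE_MAJOR = {"simple": 0, "base": 1, "enhanced": 2}


def can_parse(profile: str, profile_version: int, version: int) -> bool:
    """Return True if a parser of the given profile shall be able to parse
    an IA sequence with these profile_version and version values."""
    profile_major = (profile_version >> 4) & 0xF  # MSB four bits (8-bit field assumed)
    version_major = (version >> 4) & 0xF
    return profile_major <= MAX_PROFILE_MAJOR[profile] and version_major == 0
```

For example, a Base Profile parser accepts profile_version values 0 to 31 and therefore also parses Simple Profile sequences, while a Simple Profile parser does not accept profile_version = 16.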

6. Standalone IAC Representation

This section details the order in which the OBUs shall be sequenced in a standalone IAC representation. It further specifies how the Data OBUs shall be synchronized, with the aid of the Sync OBUs.

6.1. OBU Sequence Order

An IA sequence shall be composed of a series of OBUs as follows:

One set of Descriptor OBUs shall be placed at the beginning of the IA sequence, followed by a sequence of data OBUs.

6.1.1. Descriptor OBUs

A set of Descriptor OBUs shall be placed at the beginning of the bitstream in the following order:

One Start Code OBU
All Codec Config OBUs
All Mix Presentation OBUs
All Audio Element OBUs

6.1.2. Data OBUs

One Sync OBU shall be placed immediately after the Descriptor OBUs. This shall be followed by a sequence of Audio Frame OBUs, Parameter Block OBUs, one or more additional Sync OBUs and one or more Temporal Delimiter OBUs, according to the rules below:

Audio Frame OBUs and Parameter Block OBUs must be ordered by their implied timestamp in the timeline, and may be interleaved.
If there are multiple Audio Frame OBUs that have the same implied start timestamp, they must be grouped by audio elements.
A new Sync OBU may be inserted anywhere in the sequence of data OBUs, as frequently as needed.
Between two Sync OBUs, a sequence of audio frames or parameter blocks must be gapless.
If an Audio Frame OBU or Parameter Block OBU has a substream or parameter ID that is not defined in the most recent Sync OBU, it must not appear in the bitstream, until a new Sync OBU is provided that specifies them.
A Temporal Delimiter OBU may be inserted at the beginning of a temporal unit, defined as a set of all audio frames with the same start timestamp and the same duration from all substreams and all non-redundant parameter blocks with the start timestamp within the duration. A temporal unit may include redundant parameter blocks.
If Temporal Delimiter OBUs are present, they must be inserted at the beginning of every temporal unit.

Additionally, the following constraints apply to the Audio Frame and Parameter Block OBUs:

Audio Frame OBUs must be provided non-redundantly, such that for each substream, there shall not be two audio frames that are overlapping in time.
Parameter Block OBUs may be provided redundantly, such that they contain the same data as a previously provided Parameter Block OBU for the same time region. In this case, the "obu_redundant_copy" field in the OBU header shall be set to 1.
Redundant Parameter Block OBUs do not need to be ordered by their implied timestamp in the timeline. The implied timestamp should be inferred from the initial non-redundant version.
Non-redundant Parameter Block OBUs must not provide data for overlapping time regions.

6.1.3. Refreshing Descriptor OBUs

The above describes the full sequence of OBUs for a given set of descriptor OBUs and their associated data OBUs. If the IAC configuration changes, a new set of descriptor OBUs is required. In that case, a new sequence of the complete set of descriptor OBUs, a Sync OBU and their corresponding data OBUs shall follow, in the same order as described above.

The descriptor OBUs may additionally be repeated redundantly and as frequently as necessary. In this case, the "obu_redundant_copy" field in the OBU header of each of the descriptor OBUs shall be set to 1.

If there is a set of descriptor OBUs placed mid-stream, there may be parameter blocks that came before them which are still valid and applicable for the duration after the descriptor OBUs. In this case, these parameter blocks must be redundantly copied and placed after the first Sync OBU that follows the descriptor OBUs. This ensures that any receiver joining mid-stream and encountering a set of descriptor OBUs is guaranteed to be able to receive the complete set of metadata that is applicable to all audio frames that come after.

6.2. Synchronizing Data OBUs

The audio frames and parameter data provided in the Data OBUs may be asynchronous; different audio substreams may have different audio frame sizes, parameter blocks may have different durations from the audio frames, or there may be gaps in a parameter’s timeline. This section details how these Data OBUs may be synchronized, based on their duration and the information provided in the Sync OBUs.

The Sync OBU contains two pieces of information that apply to all substreams and parameters that follow it:

1) a relative offset for each of the substreams and parameters, and

2) a global offset.

The relative offsets describe how the substreams and parameters are positioned within a local frame of reference, which is unique for each Sync OBU. For example, the Sync OBU given below indicates that Substream 1 has a start timestamp that is 15 units before Substream 2, 10 units after Parameter 1, and 25 units before Parameter 2.

ID (name)	Relative offset
N/A (Global offset)	0
1 (Substream 1)	-5
2 (Substream 2)	+10
3 (Parameter 1)	-15
4 (Parameter 2)	+20

Within a Sync OBU, only the relative information between the relative offsets is meaningful for positioning it within the global frame of reference, where the method of positioning is described further below. This is not affected by any constant shift in the relative offsets. As such, each Sync OBU can have any number of variants, as long as there is a constant difference between the two variants (see the example below). This removes any restrictions on how the absolute values of the relative offsets are selected. For example, some implementations may wish to always set the relative offset of an arbitrarily selected substream or parameter to 0.

ID (name)	Relative offset (variant 1)	Relative offset (variant 2)
N/A (Global offset)	0	0
1 (Substream 1)	-5	+10
2 (Substream 2)	+40	+55
3 (Parameter 1)	-15	0
4 (Parameter 2)	+20	+35

The global offset defines an additional offset that is applied to all substreams and parameters, and can be used to express intentional gaps between the substreams and parameters associated with two Sync OBUs.

The local frame of reference can be positioned in a global frame of reference by using one of the two concatenation rules provided below. These rules specify how two timelines associated with different Sync OBUs shall be aligned.

Concatenation Rule 1

Ignoring the global offset, the new timeline after a Sync OBU is shifted as early as possible such that the earliest substream or parameter in the new timeline concatenates with its counterpart in the previous timeline. Then, the global offset is applied to additionally shift the new timeline.

Concatenation Rule 2

Ignoring the global offset, the new timeline after a Sync OBU is shifted as early as possible such that the earliest substream in the new timeline concatenates with the latest substream in the old timeline. Then, the global offset is applied to additionally shift the new timeline.

Choose if this applies to 1) all audio substreams + params, or to 2) audio substreams only. Option 1) can lead to audio gaps. 2) can lead to overlapping params. See https://github.com/AOMediaCodec/iac/issues/102

The algorithm below may be used to implement the concatenation rules.

// Timestamp at the end of the most recent frame before the Sync OBU, for a
// given substream or parameter ID.
end_timestamp[ID];

// Offset specified by the new Sync OBU.
relative_offset[ID];

if (concatenation_rule_1) {
  // The timestamp for the "zero offset" is computed by applying relative
  // offsets to the end timestamps, and seeing which one comes latest in time.
  timestamp_for_zero_offset =
    max(end_timestamp[ID] - relative_offset[ID] for each ID)
    + global_offset;
}

if (concatenation_rule_2) {
  timestamp_for_zero_offset =
    max(end_timestamp[ID]) - min(relative_offset[ID])
    + global_offset;
}

// Find the timestamp of each substream and parameter relative to the new
// “zero”.
For each ID:
  next_start_timestamp[ID] = timestamp_for_zero_offset + relative_offset[ID]
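The pseudocode above can be made executable as follows. This is a minimal sketch under stated assumptions: timestamps are plain integers in arbitrary units, the function name next_start_timestamps is hypothetical, the end timestamps in the example are made up, and Rule 2 follows the pseudocode literally by taking the maximum and minimum over all IDs rather than over substreams only.

```python
def next_start_timestamps(end_timestamp, relative_offset, global_offset=0, rule=1):
    """Return the start timestamp of every ID after a Sync OBU.

    end_timestamp and relative_offset are dicts keyed by substream or
    parameter ID; rule selects Concatenation Rule 1 or 2.
    """
    ids = list(relative_offset)
    if rule == 1:
        # Latest position of "zero" that is required so that no ID starts
        # before its own previous frame has ended.
        zero = max(end_timestamp[i] - relative_offset[i] for i in ids)
    elif rule == 2:
        # The earliest new start is aligned with the latest previous end.
        zero = (max(end_timestamp[i] for i in ids)
                - min(relative_offset[i] for i in ids))
    else:
        raise ValueError("rule must be 1 or 2")
    zero += global_offset
    return {i: zero + relative_offset[i] for i in ids}


# Relative offsets from the first example table; end timestamps are made up.
relative_offset = {1: -5, 2: 10, 3: -15, 4: 20}
end_timestamp = {1: 100, 2: 100, 3: 90, 4: 120}

rule_1 = next_start_timestamps(end_timestamp, relative_offset, rule=1)
rule_2 = next_start_timestamps(end_timestamp, relative_offset, rule=2)
```

With these inputs, Rule 1 makes Substream 1 and Parameter 1 continue without a gap, while Rule 2 starts every ID at or after the latest previous end timestamp.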

The figure below illustrates three examples of how the same new timeline (orange and blue) is aligned with different previous timelines (white) when using Concatenation Rule 1.

Aligning timelines before and after a Sync OBU using the concatenation rule.

Add a similar diagram for Concat Rule 2 when its corresponding issue is resolved.

Since the Data OBUs between two Sync OBUs must be gapless, the remainder of the timeline can be inferred from the durations of the audio frames and parameter blocks. The duration of an audio frame is specified by the num_samples_per_frame field in its corresponding Codec Config OBU, while the duration of a parameter block is specified in its duration field.
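The inference of the remaining timeline can be sketched as a running sum of durations. The function name and the example durations below are hypothetical; 960 is used only as a stand-in for a num_samples_per_frame value.

```python
def block_start_timestamps(first_start, durations):
    """Return (start timestamps, end timestamp) of gapless consecutive blocks."""
    starts = []
    timestamp = first_start
    for duration in durations:
        starts.append(timestamp)
        timestamp += duration
    return starts, timestamp


# Three audio frames of 960 samples each, starting at timestamp 100.
audio_starts, audio_end = block_start_timestamps(100, [960, 960, 960])

# Two parameter blocks with different durations, starting at timestamp 90.
param_starts, param_end = block_start_timestamps(90, [480, 1920])
```

The returned end timestamp is the value that would be used as end_timestamp[ID] when the next Sync OBU is processed.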

7. ISOBMFF IAC Encapsulation

7.1. General Requirements & Brands

A file conformant to this specification satisfies the following:

It shall conform to the normative requirements of [ISOBMFF]
It shall have the iamf brand among the compatible brands array of the FileTypeBox
It shall contain at least one track using an IASampleEntry
It SHOULD indicate a structural ISOBMFF brand among the compatible brands array of the FileTypeBox, such as iso6
It MAY indicate other brands not specified in this specification provided that the associated requirements do not conflict with those given in this specification

Parsers shall support the structures required by the iso6 brand and MAY support structures required by further ISOBMFF structural brands.

7.2. ISOBMFF IAC Encapsulation with single track

This section describes the basic data structures used to signal encapsulation of IA sequence in [ISOBMFF] containers.

7.2.1. Requirement of IA sequence

For encapsulation into ISOBMFF with a single track, the IA sequence shall comply with the bitstream specified in [#profiles-simple] or [#profiles-base].

7.2.2. Encapsulation Scheme

During the encapsulation process, the OBUs of an IA sequence are encapsulated into [ISOBMFF] as follows:

Start Code OBU: the version and profile_version fields shall be moved to IASampleEntry.
Codec Config OBU:
- codec_id and decoder_config() shall move to IASampleEntry.
- num_samples_per_frame shall move to stts.
- roll_distance shall be stored as AudioPreRollEntry having grouping_type, prol.
Mix Presentation OBUs and Audio Element OBUs (with OBU syntax) shall be stored as a new sample group having grouping_type, iagd.
Sync OBU: parse the input timeline using the Sync OBU information and construct a PTRO box that describes the relative offsets for each parameter block.
Each temporal unit:
- Temporal Delimiter OBU: shall be discarded if present.
- Parameter Block OBU for demixing_info() (with OBU syntax) shall be stored as a new sample group having grouping_type, demi.
- The remaining OBUs of each temporal unit shall be stored as the data of one sample, without gaps between the OBUs.
Audio Frame OBUs:
- Select one substream.
- If obu_trimming_status_flag of the first Audio Frame OBU of the substream is set to 1, keep parsing the following Audio Frame OBUs of the substream until an Audio Frame OBU with obu_trimming_status_flag = 0 is met, and sum num_samples_to_trim_at_start over the parsed OBUs. Then reflect the result of the summation in edts.
- If obu_trimming_status_flag of the last Audio Frame OBU of the substream is set to 1, then reflect num_samples_to_trim_at_end in stts.

IAC Encapsulation Scheme

7.2.3. IA Sample Entry

Sample Entry Type: iamf
Container:         Sample Description Box ('stsd')
Mandatory:         Yes
Quantity:          One or more.

The IASampleEntry identifies that the track contains IA Samples, and uses one single codec specific box.

Syntax

class IASampleEntry extends AudioSampleEntry('iamf') {
  unsigned int (8) version;
  unsigned int (8) profile_version;
  CodecSpecificBox config;
}

No optional boxes of AudioSampleEntry shall be present.

Semantics

Both channelcount and samplerate fields of AudioSampleEntry shall be ignored.

version and profile_version shall be the same as the ones in start_code_obu.

7.2.4. Codec Specific Box

This section describes the codec specific box that carries the decoding parameters, as determined by the codec_id of audio_substream_config(), needed to decode one single substream of an IA sequence. The iamf sample entry shall contain exactly one codec specific box regardless of the number of substreams in the IA sequence, so the codec specific box applies to all substreams in the sample data.

7.2.4.1. OPUS Specific Box

This shall be the OpusSpecificBox (dOps) for the Opus AudioSampleEntry, which is specified in [OPUS-IN-ISOBMFF].

Box Type:  dOps
Container: IA Sample Entry ('iamf')
Mandatory: Yes
Quantity:  One

This box shall be for one single substream.

Syntax

It shall be the same as the dOps box for Opus, except that ChannelMappingFamily shall be set to 0.

Semantics

It shall be the same as the semantics in [OPUS-IN-ISOBMFF], except for the following:

OutputChannelCount should be set to 2. OutputChannelCount can be ignored because the real value can be determined from the Audio Element OBU and from the opus packet header.
In case of channel_audio_layer > 1, OutputGain shall be set to 0.
ChannelMappingFamily shall be set to 0.

7.2.4.2. MP4A Specific Box

This shall be ESDBox (esds) for mp4a which is specified in [MP4].

Box Type:  esds
Container: IA Sample Entry ('iamf')
Mandatory: Yes
Quantity:  One or more

This box shall be for one single Substream.

Syntax

It shall be the same as esds box for Low Complexity Profile of [AAC] (AAC-LC).

Semantics

It shall be the same as the semantics of esds, except for the following:

channelConfiguration field should be set to 2. The real value can be implied from the Audio Element OBU.

We need to add specific boxes for FLAC and LPCM.

7.2.5. IA Sample Format

For tracks using the IASampleEntry, an IA Sample has the following constraints:

The data of one sample shall be the remaining OBUs of each temporal unit after the processing specified in § 7.2.2 Encapsulation Scheme.

7.2.6. IA Sample Group

7.2.6.1. Global Descriptor Sample Group

During the encapsulation process, global descriptors shall be discarded from the IA bitstream. A new sample group for global descriptors shall be defined by using sgpd and sbgp boxes with the following requirements:

grouping_type shall be set to iagd.
SampleGroupDescriptionEntry shall be the Mix Presentation OBUs followed by the Audio Element OBUs, with OBU syntax.

7.2.6.2. Demixing Info Sample Group

During the encapsulation process, the Parameter Block OBU for demixing_info() shall be discarded from the IA sequence. A new sample group for demixing_info() shall be defined by using sgpd and sbgp boxes with the following requirements:

grouping_type shall be set to demi.
Each SampleGroupDescriptionEntry shall be Parameter Block OBU for demixing_info with OBU syntax.

7.3. Common Encryption

TBA

7.4. Codecs Parameter String

DASH and other applications require defined values for the Codecs parameter specified in [RFC6381] for ISO Media tracks. The codecs parameter string for the AOM IA codec shall be:

For IAC-OPUS

iamf.IAC-specific-needs.Opus

For IAC-AAC-LC

iamf.IAC-specific-needs.mp4a.40.2

For IAC-FLAC

iamf.IAC-specific-needs.fLaC

For IAC-LPCM

iamf.IAC-specific-needs.lpcm

IAC-specific-needs shall be V.PV as follows:

V shall be four digits and shall represent the version of the IA sequence.
- The first two digits shall represent the major version, within the range 0 to 15.
- The last two digits shall represent the minor version, within the range 0 to 15.
PV shall be four digits and shall represent the profile version of the IA sequence.
- The first two digits shall represent the profile major version, within the range 0 to 15.
- The last two digits shall represent the profile minor version, within the range 0 to 15.

For example, for this version of the specification

The codecs parameter string of IAC-OPUS for the simple profile:

iamf.0000.0000.Opus

The codecs parameter string of IAC-AAC-LC for the base profile:

iamf.0000.0100.mp4a.40.2
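The construction of the codecs parameter string can be sketched as follows. This is an informative sketch only: the function and table names are hypothetical, the major and minor values are passed as separate integers, and the two digits of each part are assumed to be decimal, which is consistent with the two examples above.

```python
# Codec-specific suffixes taken from the strings listed in this section.
CODEC_SUFFIX = {
    "IAC-OPUS": "Opus",
    "IAC-AAC-LC": "mp4a.40.2",
    "IAC-FLAC": "fLaC",
    "IAC-LPCM": "lpcm",
}


def two_digits(value):
    """Format a major or minor number (0 to 15) as two decimal digits."""
    if not 0 <= value <= 15:
        raise ValueError("version component must be within 0 to 15")
    return f"{value:02d}"


def codecs_string(codec, version, profile_version):
    """Build 'iamf.V.PV.<codec suffix>' from (major, minor) pairs."""
    v = two_digits(version[0]) + two_digits(version[1])
    pv = two_digits(profile_version[0]) + two_digits(profile_version[1])
    return f"iamf.{v}.{pv}.{CODEC_SUFFIX[codec]}"


simple_opus = codecs_string("IAC-OPUS", (0, 0), (0, 0))
base_aac = codecs_string("IAC-AAC-LC", (0, 0), (1, 0))
```

The two calls reproduce the example strings given above for the simple and base profiles.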

8. ISOBMFF IAC Decapsulation

8.1. ISOBMFF IAC Decapsulation with single track

This section provides a guideline for an IAC parser to reconstruct IA sequences from an IAC file.

When the IAC parser feeds the reconstructed IA sequences to the IAC-OBU parser, the descriptor OBUs shall be placed first, followed by the Temporal Units.

The figure below shows the mirroring process of the encapsulation scheme of IA sequence specified in § 7 ISOBMFF IAC Encapsulation.

IAC Decapsulation Guideline

During decapsulation process, IAC file is decapsulated into IA sequences which conform to § 4 Open Bitstream Unit (OBU) Syntax and Semantics as follows:

Step1: Reconstruction of descriptor OBUs (one Start Code OBU, one Codec Config OBU, one or more Mix Presentation OBUs and one or more Audio Element OBUs) for the ith IA sequence.
- [Step1-1] Start Code OBU: take version and profile_version fields from iamf sample entry and packetize it with ia_code and the pre-fixed header value (i.e. 0xF006) by OBU.
- [Step1-2] Codec Config OBU: generate codec_id and decoder_config() from CodecSpecificBox of iamf sample entry, num_samples_per_frame from stts box and take roll_distance from AudioPreRollEntry, and packetize it by OBU with obu_type = OBU_IA_Codec_Config.
- [Step1-3] Mix Presentation OBUs and Audio Element OBUs: take the ith SampleGroupDescriptionEntry as it is in SampleGroup with grouping_type, iagd.
- [Step1-4] Figure out the offset (i1) and number (im) of Samples, which the ith SampleGroupDescriptionEntry is applied to, from the SampleGroup.
Step2: Reconstructing of the jth Temporal Unit of the ith IA sequence (j = i1, i2, …, im)
- [Step2-1] Prepare Sync_OBU with obu_type = OBU_IA_Sync.
- [Step2-2] If there is a SampleGroup with grouping_type = demi, then take the Parameter Block OBU for demixing_info and the jth Sample. Otherwise, take the jth Sample as it is.
  - Parameter Block OBU for demixing_info: take the SampleGroupDescriptionEntry mapped to the jth Sample as it is, from the SampleGroup with grouping_type = demi.
- [Step2-3] Place Sync_OBU in front of the result of Step2-2, without gap, to reconstruct the jth Temporal Unit.
Step3: Place the descriptor OBUs followed by the Temporal Units in order (j = i1, i2, …, im), without gap, to reconstruct the ith IA sequence.

codec_id and decoder_config() for IAC-OPUS is generated as follows:

The syntax and values shall conform to ID Header of [RFC7845] with following constraints.
- OutputChannelCount, PreSkip, InputSampleRate, OutputGain and ChannelMappingFamily are copied from dOps box.

codec_id and decoder_config() for IAC-AAC-LC is generated as follows:

codec_id: mp4a
decoder_config() is generated from DecoderConfigDescriptor() of esds box.

9. IAC processing

This section provides a guideline for IA decoding for a given IA sequence.

IA decoding can be done by using a combination of the following decoding processes.

Decoding of a scene-based audio element (Ambisonics decoding)
Decoding of a channel-based audio element (Scalable Channel Audio decoding)
Rendering and mixing of each audio element before mixing of multiple audio elements.
- It may include re-sampling of each audio element.
Mixing of multiple audio elements with synchronization
Post processing such as Loudness, DRC and Limiter.

Ambisonics decoding shall conform to [RFC8486], except for codec-specific processing, and shall output Ambisonics channels in ACN (Ambisonics Channel Number) order.

Scalable Channel Audio decoding shall output the channel audio (e.g. 3.1.2ch or 7.1.4ch) for the target channel layout.

The IA decoder is composed of OBU parser, Codec decoder, Audio Element Renderer and Post-processor, as depicted in the figure below.

OBU parser shall depacketize the IA sequence to output one or more substreams with one or more Decoder_Config(), but only one decoder_config() per audio element, as well as descriptors and parameters.
Codec decoder for each substream shall output decoded channels.
Audio Element Renderer reconstructs audio channels from the decoded channels of the Codec decoders according to the type of audio element, which is specified in the Audio Element OBU.
- For scene-based audio element, it shall output ambisonics channels.
- For channel-based audio element, it shall output audio channels for the given loudspeaker layout.
Post-processor outputs audio channels according to the target loudspeaker layout after processing optional rendering, mixing and post processing such as DRC, Loudness and Limiter.
- For a given scene-based audio element, one of mix presentations shall be used to render the given scene-based audio element.
- To mix a given multiple audio elements, one of mix presentations shall be used to render each of the given multiple audio elements.

IA Decoder Configuration

9.1. Ambisonics decoding

This section describes the decoding of Ambisonics.

The figure below shows the decoding flowchart of Ambisonics decoding.

OBU parser shall output the substreams for the scene-based audio element in IA sequence.
- OBU parser shall output channel_mapping or demixing_matrix according to ambisonics_mode to Channel_Mapping/Demixing_Matrix module
Codec decoder shall output decoded channels (PCM) in the transmission order, as many as output_channel_count, after decoding each Substream.
Channel_Mapping/Demixing_Matrix module shall apply channel_mapping or demixing_matrix, according to Ambisonics_Mode, to the channels (PCM) and output as many channels as output_channel_count in ACN order.
Ambisonics to Channel Format module may convert the output channels to channel audio according to the target loudspeaker layout.

Ambisonics Decoding Flowchart

9.2. Scalable Channel Audio decoding

This section describes the decoding of Scalable Channel Audio.

The figure below shows the decoding flowchart for Scalable Channel Audio.

Scalable Channel Audio Decoding Flowchart

For a given loudspeaker layout (i.e. CL #i) among the list of loudspeaker_layout in scalable channel layout config,

OBU Parser shall get substreams for ChannelGroup #1 ~ ChannelGroup #i and pass them to Codec decoder with Decoder_Config().
Codec decoder shall output decoded channels (PCM) in the transmission order.
- For non-scalable audio (i.e i = 1), its order shall be converted to the loudspeaker location order for CL #1.
Following are further processed for scalable audio (i.e. i > 1)
- When Output_Gain_Is_Present_Flag(j) for ChannelGroup #j (j = 1, 2, …, i-1) is on, Gain module shall apply Output_Gain(j) to all audio samples of the mixed channels in the ChannelGroup #j indicated by Output_Gain_Flag(j).
- De-Mixer shall output de-mixed channels (PCM) for CL #i, generated by de-mixing the mixed channels from Gain module using the non-mixed channels and the demixing parameters for each frame.
- Recon_Gain module shall output smoothed channels (PCM) by applying Recon_Gain to each frame of the de-mixed channels.
- The order of the non-mixed channels and smoothed channels shall be converted to the loudspeaker location order for CL #i after going through the necessary modules such as Gain, De-Mixer and Recon_Gain.
Following may be further processed
- Loudness normalization module may output loudness normalized channels at -24 LKFS from non-mixed channels and smoothed channels (if present) by using loudness value for CL #i.
- DRC control module may apply the pre-defined DRC compression to the loudness normalized channels, after that it outputs loudness normalized channels at -16 LKFS.
- Limiter module may limit the true peak of input channels at -1dB.

Following sections, § 9.2.1 Gain, § 9.2.2 De-mixer and § 9.2.3 Recon Gain are only needed for decoding of scalable audio with num_layers > 1.

9.2.1. Gain

Gain module is the mirror process of Attenuation module. It recovers the reduced sample values using Output_Gain when its flag for ChannelGroup #j is on. When its flag is off, then this module shall be bypassed for ChannelGroup #j. Output_Gain(j) for ChannelGroup #j shall be applied to all samples of the mixed channels in the ChannelGroup #j. Where, mixed channels means the mixed channels from an input channel audio (i.e. a channel audio for CL #n).

To apply the gain, an implementation MUST use the following:

Sample *= pow(10, Output_Gain(j) / (20.0*256))

Where, Output_Gain(j) is the raw 16-bit value for jth layer which is specified in channel_audio_layer_config().
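The gain formula above can be illustrated as follows. The sketch takes Output_Gain(j) as an already-parsed integer and samples as a list of floats; the function names are hypothetical.

```python
def output_gain_factor(raw_output_gain):
    """Linear factor for a raw Output_Gain value (in units of 1/256 dB)."""
    return 10.0 ** (raw_output_gain / (20.0 * 256))


def apply_output_gain(samples, raw_output_gain):
    """Scale every sample of a mixed channel by the Output_Gain factor."""
    factor = output_gain_factor(raw_output_gain)
    return [sample * factor for sample in samples]


# A raw value of 256 corresponds to 1 dB, i.e. a factor of 10 ** (1 / 20).
one_db = output_gain_factor(256)
scaled = apply_output_gain([0.5, -0.25], 0)
```

A raw value of 0 leaves the samples unchanged, which matches the case in which the Gain module has nothing to recover.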

9.2.2. De-mixer

For scalable channel audio with num_layers > 1, some channels of down-mixed audio for CL #i are delivered as is but the rest are mixed with other channels for CL #i-1.

De-mixer module reconstructs the rest of the down-mixed audio for CL #i from the mixed channels, which is passed by Gain module, and its relevant non-mixed channels using its relevant demixing parameters.

De-mixing for down-mixed audio for CL #i shall comply with the result by the combination of following surround and top de-mixers:

Surround de-mixers
- S1to2 de-mixer: R2 = 2 × Mono - L2
- S2to3 de-mixer: L3 = L2 - 0.707 × C and R3 = R2 - 0.707 × C
- S3to5 de-mixer: Ls = 1/δ(k) × (L3 - L5) and Rs = 1/δ(k) × (R3 - R5)
- S5to7 de-mixer: Lrs = 1/β(k) × (Ls - α(k) × Lss) and Rrs = 1/β(k) × (Rs - α(k) × Rss)
Top de-mixers
- TF2toT2 de-mixer: Ltf2 = Ltf3 - w(k) × (L3 - L5) and Rtf2 = Rtf3 - w(k) × (R3 - R5)
- T2to4 de-mixer: Ltb = 1/γ(k) × (Ltf2 - Ltf4) and Rtb = 1/γ(k) × (Rtf2 - Rtf4)
Where, Ltf2 / Rtf2 is the top channel of x.1.2ch, Ltf3 / Rtf3 is the top channel of 3.1.2ch, and Ltf4 / Rtf4 is the top channel of x.1.4ch (x = 5 or 7), and w(k) is determined from the value of wIdx(k).
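The de-mixer equations above can be written per sample as follows. This is an informative sketch: the function names are hypothetical, each function operates on a single sample value per channel, and the demixing parameters α(k), β(k), γ(k), δ(k) and w(k) of the current frame are passed in by the caller.

```python
def s1to2(mono, l2):
    """S1to2: recover R2 from Mono and L2."""
    return 2.0 * mono - l2


def s2to3(l2, r2, c):
    """S2to3: recover L3 and R3 from L2, R2 and C."""
    return l2 - 0.707 * c, r2 - 0.707 * c


def s3to5(l3, r3, l5, r5, delta):
    """S3to5: recover Ls and Rs."""
    return (l3 - l5) / delta, (r3 - r5) / delta


def s5to7(ls, rs, lss, rss, alpha, beta):
    """S5to7: recover Lrs and Rrs."""
    return (ls - alpha * lss) / beta, (rs - alpha * rss) / beta


def tf2tot2(ltf3, rtf3, l3, r3, l5, r5, w):
    """TF2toT2: recover Ltf2 and Rtf2."""
    return ltf3 - w * (l3 - l5), rtf3 - w * (r3 - r5)


def t2to4(ltf2, rtf2, ltf4, rtf4, gamma):
    """T2to4: recover Ltb and Rtb."""
    return (ltf2 - ltf4) / gamma, (rtf2 - rtf4) / gamma
```

A real decoder would apply these to whole frames rather than single samples; the scalar form is used here only to keep the correspondence with the equations visible.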

Initially, wIdx(0) = 0 and the value of wIdx(k) shall be derived as follows:

wIdx(k) = Clip3(0, 10, wIdx(k-1) + w_idx_offset(k))

Mapping of wIdx(k) to w(k) should be as follows:

wIdx(k) :   w(k)
   0    :    0
   1    :  0.0179
   2    :  0.0391
   3    :  0.0658
   4    :  0.1038
   5    :  0.25
   6    :  0.3962
   7    :  0.4342
   8    :  0.4609
   9    :  0.4821
   10    : 0.5
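The derivation of w(k) can be sketched as a clipped running sum followed by a table lookup. The names W_TABLE, clip3 and w_sequence are hypothetical, and the offsets in the example are made-up values for w_idx_offset(k).

```python
# w(k) values for wIdx(k) = 0..10, copied from the table above.
W_TABLE = [0.0, 0.0179, 0.0391, 0.0658, 0.1038, 0.25,
           0.3962, 0.4342, 0.4609, 0.4821, 0.5]


def clip3(low, high, value):
    """Clamp value to the range [low, high], as Clip3 does."""
    return max(low, min(high, value))


def w_sequence(w_idx_offsets):
    """Return w(k) for k = 1, 2, ... given w_idx_offset(k), with wIdx(0) = 0."""
    w_idx = 0
    result = []
    for offset in w_idx_offsets:
        w_idx = clip3(0, 10, w_idx + offset)
        result.append(W_TABLE[w_idx])
    return result


# wIdx goes 1, 2, 1, 0 and is then clipped at 0.
w_values = w_sequence([1, 1, -1, -1, -1])
```

The clipping keeps wIdx(k) inside the table even when the accumulated offsets would otherwise leave the range 0 to 10.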

When D_set = { x | S1 < x ≤ Si and x is an integer},

If 2 is an element of D_set, the combination shall include S1to2 de-mixer.
If 3 is an element of D_set, the combination shall include S2to3 de-mixer.
If 5 is an element of D_set, the combination shall include S3to5 de-mixer.
If 7 is an element of D_set, the combination shall include S5to7 de-mixer.

When Ti = 2,

If Sj = 3 (j=1,2,…, i-1), the combination shall include TF2toT2 de-mixer.

When Ti = 4,

If Sj = 3 (j=1,2,…, i-1), the combination shall include TF2toT2 de-mixer and T2to4 de-mixer.
Elseif Tj = 2 (j=1,2,…, i-1), the combination shall include T2to4 de-mixer.

For example, when CL #1 = 2ch, CL #2 = 3.1.2ch, CL #3 = 5.1.2ch and CL #4 = 7.1.4ch, the rest (i.e. Ls5/Rs5/Ltf/Rtf) of the down-mixed 5.1.2ch is reconstructed as follows:

The combination includes S2to3 de-mixer, S3to5 de-mixer and TF2toT2 de-mixer.
Ls5 and Rs5 are recovered by S2to3 de-mixer and S3to5 de-mixer.
Ltf and Rtf are recovered by S2to3 de-mixer and TF2toT2 de-mixer.

Ls5 = 1/δ(k) × (L2 - 0.707 × C - L5) and Rs5 = 1/δ(k) × (R2 - 0.707 × C - R5).
Ltf = Ltf3 - w(k) × (L2 - 0.707 × C - L5) and Rtf = Rtf3 - w(k) × (R2 - 0.707 × C - R5).
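The selection rules for the de-mixer combination can be sketched as follows. This is a non-normative sketch under stated assumptions: each channel layout is represented by a hypothetical (S, T) pair holding its number of surround and top channels, and a condition such as "Sj = 3 (j = 1, 2, …, i-1)" is read as "Sj = 3 for at least one lower layer j".

```python
def select_demixers(layouts, i):
    """Return the de-mixers needed to reconstruct CL #i (i is 1-based).

    layouts is the list of (S, T) pairs for CL #1 .. CL #n.
    """
    s = [layout[0] for layout in layouts]
    t = [layout[1] for layout in layouts]
    # D_set = { x | S1 < x <= Si and x is an integer }
    d_set = set(range(s[0] + 1, s[i - 1] + 1))
    chosen = [name for n, name in ((2, "S1to2"), (3, "S2to3"),
                                   (5, "S3to5"), (7, "S5to7")) if n in d_set]
    lower_s, lower_t, t_i = s[:i - 1], t[:i - 1], t[i - 1]
    if t_i == 2 and 3 in lower_s:
        chosen.append("TF2toT2")
    elif t_i == 4:
        if 3 in lower_s:
            chosen += ["TF2toT2", "T2to4"]
        elif 2 in lower_t:
            chosen.append("T2to4")
    return chosen


# 2ch, 3.1.2ch, 5.1.2ch and 7.1.4ch, as in the example above.
layouts = [(2, 0), (3, 2), (5, 2), (7, 4)]
for_cl3 = select_demixers(layouts, 3)
for_cl4 = select_demixers(layouts, 4)
```

For CL #3 the sketch selects the same three de-mixers as the worked example above.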

9.2.3. Recon Gain

Recon_Gain shall be applied only to the audio samples of the de-mixed channels output by the De-mixer module.

Recon_Gain_Info() indicates each channel of CL #i which Recon_Gain needs to be applied to and provides Recon_Gain value for each frame of the channel.
- Sample (k,i) *= Smoothed_Recon_Gain (k,i), where k is the frame index and i is the sample index of the frame.
- Smoothed_Recon_Gain (k) = MA_gain (k-1) x e_window + MA_gain (k) x s_window
- MA_gain (k) = 2 / (N+1) x Recon_Gain (k) / 255 + (1 – 2/(N+1)) x MA_gain (k-1), where MA_gain (0) = 1.
- e_window[:ps – olen] = 1, e_window[ps – olen: ps] = hanning[olen:], e_window[ps:flen] = 0.
- s_window[:ps – olen] = 0, s_window[ps – olen: ps] = hanning[:olen], s_window[ps:flen] = 1.
- Where, hanning = np.hanning (2*olen), ps is the pre-skip value, flen is the frame size and olen is the overlap size.
- Recommended value: N = 7

The figure below shows the smoothing scheme of Recon_Gain.

Smoothing Scheme of Recon Gain

Recommended values for specific codecs are as follows:

IAC-OPUS: olen = 60, the pre-skip (ps) value is indicated in Codec_Specific_Info for IAC-OPUS.
IAC-AAC-LC: olen = 64, ps = 720.
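The smoothing of Recon_Gain described in this section can be sketched in plain Python as follows. The function names are hypothetical; hanning() reproduces the values of np.hanning for lengths greater than 1, and the frame size of 1024 used in the example is an assumption made only for illustration.

```python
import math


def hanning(m):
    """Symmetric Hann window of length m (same values as np.hanning, m > 1)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (m - 1)) for n in range(m)]


def recon_gain_windows(flen, ps, olen):
    """Build e_window and s_window for one frame of flen samples."""
    han = hanning(2 * olen)
    e_window = [1.0] * (ps - olen) + han[olen:] + [0.0] * (flen - ps)
    s_window = [0.0] * (ps - olen) + han[:olen] + [1.0] * (flen - ps)
    return e_window, s_window


def update_ma_gain(prev_ma_gain, recon_gain, n=7):
    """MA_gain(k) from MA_gain(k-1) and the raw Recon_Gain(k) in 0..255."""
    a = 2.0 / (n + 1)
    return a * recon_gain / 255.0 + (1.0 - a) * prev_ma_gain


def smoothed_recon_gain(prev_ma_gain, ma_gain, e_window, s_window):
    """Per-sample gain: MA_gain(k-1) x e_window + MA_gain(k) x s_window."""
    return [prev_ma_gain * e + ma_gain * s for e, s in zip(e_window, s_window)]


# IAC-AAC-LC values from above (olen = 64, ps = 720); flen = 1024 is assumed.
e_window, s_window = recon_gain_windows(flen=1024, ps=720, olen=64)
ma_prev = 1.0                          # MA_gain(0)
ma_now = update_ma_gain(ma_prev, 0)    # Recon_Gain(1) = 0, for illustration
gains = smoothed_recon_gain(ma_prev, ma_now, e_window, s_window)
```

Each sample of the de-mixed channel in frame k would then be multiplied by the entry of gains at the same sample index.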

9.3. Mix Presentation

//To Do: Fill in the text

9.3.1. Rendering for Audio Element

This section provides a guideline for rendering according to the rendering_config() which is specified in mix presentation OBU.

//To Do: Fill in rendering method for scene-based audio element if any

//To Do: Fill in rendering method for channel-based audio element if any

9.3.2. Mixing for Audio Elements

This section provides a guideline for mixing according to the element_mix_config() which is specified in mix presentation OBU.

When the output channel audio of a scene-based audio element or channel-based audio element does not match the loudspeaker layout indicated by mix_target_layout() in the mix presentation OBU, the down-mixing matrices specified in § 9.5 Down-mix Matrix are recommended for down-mixing the output channel audio.

When multiple audio elements are mixed into one channel audio:

Each of them is mixed based on the element_mix_config() before mixing them.
If a sampling rate or an audio sample size of an audio element differs from the target sampling rate or the target audio sample size, then the audio element shall be re-sampled to the target sampling rate or the target audio sample size, respectively.
- If the mix presentation OBU does not provide the sampling rate or the sample size, then 48000 Hz for the sampling rate and 16 bits for the audio sample size are recommended.

After the relevant processing, multiple audio elements are mixed into one channel audio according to the target loudspeaker layout at the target sampling rate, keeping them synchronized on an audio-sample-by-audio-sample basis.

//To Do: Fill in the text based on element_mix_config()

9.4. Post Processing

9.4.1. Loudness Normalization

Loudness normalization is done by adjusting a loudness level to -24 LKFS based on the loudness value of the target channel layout (i.e. CL #i) which is signaled in Channel_Audio_Layer_Config() or the loudness value in mix presentation OBU.

Real implementations of § 9.4.1 Loudness Normalization, § 9.4.2 DRC Control and § 9.4.3 Limiter are solely up to implementers (i.e., out of scope of this specification) unless the mix presentation OBU provides algorithms for them. This specification only recommends the principles for the former.

9.4.2. DRC Control

In this specification, DRC control can be guided by a pre-defined DRC or by the algorithm specified in mix presentation OBU.

The pre-defined DRC assumes an input loudness of -24 LKFS and targets an output loudness of -16 LKFS. The DRC control module applies the pre-defined DRC compression, with the target loudness adjusted to -16 LKFS, as follows:

DRC Segment 0
- Threshold: not applicable
- Ratio: 1:1
- Type: Neutral
DRC Segment 1
- Threshold: -16.5 dBFS
- Ratio: 1.5:1
- Type: Compressor
DRC Segment 2
- Threshold: -9 dBFS
- Ratio: 2:1
- Type: Compressor
DRC Segment 3
- Threshold: -6 dBFS
- Ratio: 3:1
- Type: Compressor

The figure below shows the schematic diagram of the pre-defined DRC compression.

Pre-defined DRC Compression Scheme

The equation below represents the pre-defined DRC compression scheme.

Y = D_T(i) + (X - T(i)) / R(i), where X ∈ Seg(i) and
D_T(i) = T(0) + ∑ ((T(k+1) - T(k)) / R(k)) for k = 0 to i-1.
Seg(i): ith Segment
T(i)  : Threshold value in dBFS for Seg(i), where T(0) = -96.33
R(i)  : Ratio value for Seg(i)
D_T(i): Threshold value in dBFS for Seg(i) after DRC compression, where D_T(0) = T(0)
X     : Input sample value in dBFS
Y     : Output sample value in dBFS
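The DRC equation can be evaluated with the thresholds and ratios of the four pre-defined segments as follows. This is an informative sketch: the names are hypothetical, Seg(i) is assumed to cover input levels from T(i) up to, but not including, T(i+1), and the last segment is assumed to extend to all higher levels.

```python
# T(i) in dBFS and R(i) for DRC Segments 0..3, taken from the list above.
THRESHOLDS = [-96.33, -16.5, -9.0, -6.0]
RATIOS = [1.0, 1.5, 2.0, 3.0]


def compressed_threshold(i):
    """D_T(i): the threshold of Seg(i) after DRC compression."""
    d_t = THRESHOLDS[0]
    for k in range(i):
        d_t += (THRESHOLDS[k + 1] - THRESHOLDS[k]) / RATIOS[k]
    return d_t


def drc_output_db(x_db):
    """Y for an input level X in dBFS."""
    i = 0
    for index, threshold in enumerate(THRESHOLDS):
        if x_db >= threshold:
            i = index  # highest segment whose threshold is not above X
    return compressed_threshold(i) + (x_db - THRESHOLDS[i]) / RATIOS[i]


full_scale_out = drc_output_db(0.0)
```

Under these assumptions, levels below -16.5 dBFS pass through unchanged, and a full-scale input of 0 dBFS maps to approximately -8 dBFS.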

9.4.3. Limiter

This module limits the true peak of the input signal at -1dB. The definition of the true peak is based on [ITU1770-4].

Below is a recommended loudness normalization and DRC control principle according to application.

For AV application, it only applies Limiter at -1dBTP.
For TV application, it only applies Loudness normalization at -24LKFS and Limiter at -1dBTP.
For Mobile application, it applies Loudness normalization at -24LKFS, the pre-defined DRC control and adjusting of target loudness at -16 LKFS, and Limiter at -1dBTP.

NOTE: The definitions of AV, TV and Mobile applications are as follows:
- AV application: Sound devices with external speakers such as Soundbar, AV receiver, HiFi speaker etc.
- TV application: Television with built-in speakers such as LCD/OLED slim TV.
- Mobile application: Handheld devices with built-in speakers such as smartphone, tablet etc.

9.5. Down-mix Matrix

9.5.1. Static Down-mix Matrix

This section recommends static down-mix matrices.

IAC players need to support any valid channel layout, even if the number of channels does not match the physically connected audio hardware. Players need to perform channel mixing to increase or reduce the number of channels as needed.

Implementations can use the matrices below to implement down-mixing from the output channel audio, which are known to give acceptable results for stereo, 5.1ch, 7.1ch and 3.1.2ch.

Down-mixing can be done directly by using one of the matrices below or a combination of them. For example, stereo down-mixing for 7.1.4ch can be done by the combination of the 7.1ch down-mix matrix for 7.1.4ch, 5.1ch down-mix matrix for 7.1ch and stereo down-mix matrix for 5.1ch.

The figures below show recommended static down-mix matrices to stereo, 5.1ch and 7.1ch.

7.1ch Down-mix matrix for 7.1.4ch

7.1ch Down-mix matrix for 7.1.2ch

5.1ch Down-mix matrix for 5.1.4ch

5.1ch Down-mix matrix for 5.1.2ch

5.1ch Down-mix matrix for 7.1ch

Stereo Down-mix matrix for 5.1ch

Stereo Down-mix matrix for 3.1.2ch

The figures below show static down-mix matrices to 3.1.2ch.

3.1.2ch Down-mix matrix for 5.1.2ch

3.1.2ch Down-mix matrix for 5.1.4ch

3.1.2ch Down-mix matrix for 7.1.2ch

3.1.2ch Down-mix matrix for 7.1.4ch

Where, p1 = 0.707 and p2 = 0.3535. Implementations may use limiter defined in § 9.4.3 Limiter to preserve energy of audio signals instead of normalization factors.

9.5.2. Dynamic Down-mix Matrix

This section recommends dynamic down-mixing matrices.

The dynamic down-mixing matrices shall comply with the down-mixing mechanism which is specified in § 10.2.2 Down-mix Mechanism.

10. IAC Generation Process

This section provides a guideline for IA encoding for a given input audio format.

Recommended input audio format for IA encoding is as follows:

Ambisonics format: It shall conform to ChannelMappingFamily = 2 or 3 of [RFC8486].
Channel Audio format: It shall conform to loudspeaker_layout specified in channel_audio_layer_config().
Input Sampling Rate: 48000 Hz
Bitdepth: 16 bits or 24 bits
- 16 bits are recommended for IAC-OPUS.
Input file format: .wav file (Linear PCM, simply called as PCM)

For a given input audio and user inputs, IA encoder shall output IA sequence which conforms to § 4 Open Bitstream Unit (OBU) Syntax and Semantics.

Input audio shall be one of the following:

Ambisonics format
Channel Audio format

User inputs are:

Ambisonics mode to indicate if ChannelMappingFamily = 2 or 3 of [RFC8486].
List of channel layouts to be supported for scalable channel audio: it shall conform to loudspeaker_layout.

IA encoding can be done by using a combination of the following generation processes.

Encoding of an audio element (Ambisonics encoding or Scalable Channel Audio encoding)
Encoding of mix presentation

The figure below shows the IA encoder configuration for one single audio element.

The IA encoder is composed of Pre-processor, Codec encoder and OBU packetizer.

Pre-processor outputs one or more ChannelGroups, descriptors and optional parameter blocks based on the input audios and user inputs.
- It outputs one single ChannelGroup for scene-based audio element.
- It outputs one or more ChannelGroups for channel-based audio element.
- It outputs descriptors which are composed of one Start Code, one Codec Config, one Audio Element config, one or more Mix Presentation config.
- It may output parameter blocks.
  - For channel-based audio element with num_layers = 1, it may output parameter blocks for demixing info.
  - For channel-based audio element with num_layers > 1, it outputs parameter blocks for demixing info and recon_gain_info.
  - It may further output parameter blocks for post processing such as Loudness and DRC control.
Codec encoder generates one or more substreams from each ChannelGroup based on Codec Config.
- Only mono or stereo coding shall be allowed.
  - Channel Audio format: each pair of coupled channels in the same ChannelGroup shall be coded as stereo mode to generate one single substream and each of non-coupled channels in the same ChannelGroup shall be coded as mono mode to generate one single substream.
    - Coupled channels: L/R, Ls/Rs, Lss/Rss, Lrs/Rrs, Ltf/Rtf, Ltb/Rtb
    - Non-coupled channels: C, LFE, L
OBU packetizer packetizes descriptors, parameter blocks and audio frames by OBU, and outputs the IA sequence.
- Temporal unit generator generates temporal unit for each frame from audio frame OBUs and parameter block OBUs (if present).

IA Encoder Configuration

The order of substreams in each ChannelGroup shall be as follows:

In ChannelGroup for Ambisonics: The order shall conform to [RFC8486].
In ChannelGroup for Scalable Channel Audio: The order shall conform to following rules:
- Coupled Substreams come first, followed by non-coupled Substreams.
- Coupled Substreams for surround channels come first, followed by the one(s) for top channels.
- Coupled Substreams for front channels come first, followed by the one(s) for side, rear and back channels.
- Coupled Substreams for side channels come first, followed by the one(s) for rear channels.
- The Center channel comes first, followed by LFE and then by the other one.

Where, non-coupled substream is a coded substream from one of non-coupled channels.
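The ordering rules for Scalable Channel Audio can be sketched as follows. This is an informative sketch under stated assumptions: channels are identified by the generic labels used in the coupled and non-coupled channel lists, the function name is hypothetical, and Ls/Rs is placed before Lss/Rss only because that is the order in which the coupled channels are listed (the rules above do not rank them against each other).

```python
# Coupled pairs in the order implied by the rules: surround before top,
# front before side and rear, and top front before top back.
COUPLED_PAIRS = [("L", "R"), ("Ls", "Rs"), ("Lss", "Rss"),
                 ("Lrs", "Rrs"), ("Ltf", "Rtf"), ("Ltb", "Rtb")]
# Center first, then LFE, then the remaining non-coupled channel.
NON_COUPLED_ORDER = ["C", "LFE", "L"]


def order_substreams(channels):
    """Return the substreams of one ChannelGroup as ordered channel tuples."""
    remaining = set(channels)
    substreams = []
    for left, right in COUPLED_PAIRS:
        if left in remaining and right in remaining:
            substreams.append((left, right))   # coded in stereo mode
            remaining -= {left, right}
    for name in NON_COUPLED_ORDER:
        if name in remaining:
            substreams.append((name,))         # coded in mono mode
            remaining.discard(name)
    if remaining:
        raise ValueError(f"unexpected channels: {sorted(remaining)}")
    return substreams


group = order_substreams(["C", "LFE", "Ltf", "Rtf", "L", "R", "Ls", "Rs"])
```

The result lists one tuple per substream, so its length equals the number of substreams and the number of two-element tuples equals the number of coupled substreams.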

10.1. Ambisonics Encoding

For Ambisonics encoding:

Pre-processor outputs one ChannelGroup and descriptors and it is only composed of Meta Generator.
- Meta generator generates descriptors based on Ambisonics mode and the number of channels for Ambisonics.
  - ambisonics_mode shall be set to 0 for ChannelMappingFamily = 2 of [RFC8486] or 1 for ChannelMappingFamily = 3 of [RFC8486].
  - ambisonics_config is set as follows:
    - output_channel_count, substream_count and coupled_substream_count shall be set to the number of channels for Ambisonics.
    - channel_mapping for ambisonics_mode = 0 is assigned according to the order of substreams in ChannelGroup.
    - demixing_matrix for ambisonics_mode = 1 is assigned according to the order of substreams in ChannelGroup.
Codec Enc. outputs as many substreams as the number indicated by substream_count.
Temporal unit shall be composed of audio frame OBUs for substreams.
- It may have the immediately preceding temporal delimiter OBU.
- The order of substreams in ChannelGroup shall be aligned with channel_mapping for ambisonics_mode = 0 or demixing_matrix for ambisonics_mode = 1.

10.2. Scalable Channel Audio Encoding

For Scalable Channel Audio encoding:

Pre-processor outputs one or more ChannelGroups, descriptors and parameter blocks. It is composed of Down-mix parameter generator, Down-mixer, Loudness, ChannelGroup generator, Attenuation and Meta generator.
- For non-scalable channel audio (i.e. num_layers = 1):
  - Parameter blocks for recon_gain_info are not generated.
  - Parameter blocks for demixing_info may be generated by implementers who want to recommend them for dynamic down-mixing on the decoder side.
  - Down-mixer, ChannelGroup generator and Attenuation modules are not needed.
- Down-mix parameter generator shall generate 5 down-mix parameters (α(k), β(k), γ(k), δ(k) and w(k)) by analyzing input channel audio.
- Down-mixer shall generate down-mixed audios according to the list of channel layouts and the down-mix parameters.
- Loudness module should output the loudness level (LKFS) of each down-mixed audio based on [ITU1770-4].
- ChannelGroup generator shall transform the input channel audio to ChannelGroups for scalable channel audio with num_layers > 1 scalability by using the down-mix parameters and the list of channel layouts.
- Attenuation module shall apply a gain to the transformed ChannelGroups to prevent clipping.
- Meta generator generates descriptors, and parameter blocks for each frame.
  - descriptors shall be set as follows:
    - num_layers shall be set to the number of channel layouts.
    - channel_audio_layer_config() shall be set as follows:
      - loudspeaker_layout shall be set to the ith channel layout in the list of channel layouts for the ith ChannelGroup.
      - output_gain_is_present_flag shall be set to 1 for the ith ChannelGroup if attenuation is applied to the mixed channels of the ith ChannelGroup. Otherwise, it shall be set to 0 for the ith ChannelGroup.
      - recon_gain_is_present_flag shall be set to 1 for the ith ChannelGroup if the preceding ChannelGroups have one or more mixed channels from the down-mixed audio for the ith channel layout. Otherwise, it shall be set to 0 for the ith ChannelGroup. In particular, when num_layers = 1, this flag shall be set to 0.
      - substream_count shall be set to the number of substreams composing the ith ChannelGroup.
      - coupled_substream_count shall be set to the number of coupled substreams among the substreams composing the ith ChannelGroup.
      - loudness shall be set to the loudness (LKFS) of the down-mixed audio for the ith channel layout for the ith ChannelGroup.
      - Each bit of output_gain_flags shall be set to 1 for the ith ChannelGroup if attenuation is applied to the relevant channel of the ith ChannelGroup. Otherwise, it shall be set to 0 for the ith ChannelGroup.
      - output_gain shall be set to the inverse of the gain which is applied to the channels indicated by output_gain_flags.
  - Parameter blocks can be composed of demixing_info() and recon_gain_info(). When recon_gain_is_present_flag = 0 for all ChannelGroups, recon_gain_info shall not be present in the IA sequence.
    - dmixp_mode of demixing_info for the kth frame shall be set to indicate (α(k), β(k), γ(k), δ(k)) and w_idx_offset(k). Where w_idx_offset(k) = 1 or -1.
    - recon_gain_flags of recon_gain_info shall be set to indicate the de-mixed channels to which recon_gain needs to be applied, among the output channels after demixing for the ith channel layout.
    - recon_gain shall be set to the gain value to be applied to the channel which is indicated by recon_gain_flags for the ith ChannelGroup.
Temporal unit for the kth frame shall be composed of audio frame OBUs for the kth frames of the substreams, followed by zero or more parameter block OBUs.
- It may have the immediately preceding temporal delimiter OBU.
- ChannelGroups in temporal unit shall be placed in order. In other words, ChannelGroup for the first channel layout shall come first, followed by ChannelGroup for the second channel layout, followed by ChannelGroup for the third channel layout and so on.

The figure below shows the IA encoding flowchart for Scalable Channel Audio.

For a given Channel Audio and a given list of channel layouts for scalability, PCMs for Channel Audio are passed to CG Generation module.
CG Generation module generates the transformed audio according to CG generation rule based on the list of CLs and the down-mix parameters.
- The transformed audio is structured as ChannelGroups.
Non-mixed channels of the transformed audio (i.e., the original channels of the input channel audio) are directly input to Codec encoder, but the mixed channels may be input first to Attenuation module and then to Codec encoder.
The Attenuation module reduces all sample values of the mixed channels in the same CG at a uniform rate (Output_Gain).
- A range of 0dB to -6dB is recommended for the attenuation. (i.e. a range of 0dB to 6dB for Output_Gain)
Codec Enc. generates the coded substreams from PCMs and passes substreams and one single decoder_config to OBU Packetizer.
OBU packetizer generates descriptor OBUs which consist of one Start Code OBU, one Codec Config OBU, one Audio Element OBU and zero or more Mix Presentation OBUs.
- Codec Config OBU is generated based on decoder_config().
OBU packetizer generates zero or more parameter block OBUs for each frame, which contain demixing_info and recon_gain_info.
OBU packetizer generates audio frame OBUs for each frame of the substreams.
OBU packetizer generates temporal unit for each frame.
- Temporal unit consists of audio frame OBUs followed by zero or more parameter block OBUs.
  - It may have the immediately preceding temporal delimiter OBU.
OBU Packetizer outputs IA sequence which is composed of descriptor OBUs followed by the OBUs for temporal units.
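
The Attenuation step in the flow above can be illustrated as follows. This is a minimal sketch assuming floating-point PCM samples and gains expressed in dB; the function name `attenuate_mixed_channels` and the dictionary layout are hypothetical. It applies one uniform gain to all mixed channels of a CG and returns the Output_Gain as the inverse of that gain (negation in the dB domain).

```python
def attenuate_mixed_channels(channels, attenuation_db):
    """Scale every sample of the mixed channels in one CG by a uniform gain.

    channels: dict mapping a channel label to a list of PCM samples (floats).
    attenuation_db: applied attenuation in dB; 0 dB to -6 dB is the
        recommended range.
    Returns the attenuated channels and the Output_Gain in dB.
    """
    linear_gain = 10.0 ** (attenuation_db / 20.0)
    attenuated = {
        label: [sample * linear_gain for sample in samples]
        for label, samples in channels.items()
    }
    # Output_Gain is the inverse of the applied gain: -6 dB attenuation
    # corresponds to an Output_Gain of 6 dB.
    output_gain_db = -attenuation_db
    return attenuated, output_gain_db


mixed = {"L3": [1000.0, -2000.0], "R3": [500.0, 250.0]}
attenuated, output_gain_db = attenuate_mixed_channels(mixed, -6.0)

# Applying Output_Gain to the attenuated samples restores the original level.
restored = [s * 10.0 ** (output_gain_db / 20.0) for s in attenuated["L3"]]
```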

IA Encoding Flowchart for Scalable Channel Audio

The following sections, § 10.2.1 Down-mix parameter and Loudness, § 10.2.2 Down-mix Mechanism, § 10.2.3 Channel Layout Generation Rule, § 10.2.4 Recon Gain Generation and § 10.2.5 ChannelGroup Generation Rule, are not needed for non-scalable channel audio (i.e., when num_layers specified in scalable_channel_layout_config() is set to 1).

10.2.1. Down-mix parameter and Loudness

This section describes how to generate down-mix parameters and loudness level for a given channel audio and a given list of channel layouts for scalability.

The figure below shows a block diagram for the down-mix parameter and loudness module, including the down-mixer.

IA Down-mix Parameter and Loudness

For a given Channel Audio (e.g. 7.1.4ch) and a given list of channel layouts based on the Channel Audio,

Down-mix parameter generator shall generate 5 down-mix parameters (α(k), β(k), γ(k), δ(k) and w(k)) by analyzing input Channel Audio, referring to [AI-CAD-Mixing], where k is a frame index.
- It is composed of Audio Scene Classification module and Height Energy Quantification module as depicted in Figure 11-2.
- Audio Scene Classification module generates 4 parameters (α(k), β(k), γ(k), δ(k)) by classifying audio scenes of input channel audio in three modes.
  - Default scene: Neither Dialog nor Effect
  - Dialog scene: Center-channel oriented and clear dialog/voice sounds
  - Effect scene: Directional and spatially moving sounds.
- Height Energy Quantification module generates a surround to height mixing parameter (w(k)) which is decided according to the relative energy difference between the top and surround channels of input channel audio.
  - If the energy of the top channels is greater than that of the surround channels, then w_idx_offset(k) is set to 1. Otherwise, it is set to -1. w(k) is then calculated based on w_idx_offset(k) and conforms to § 9.2 Scalable Channel Audio decoding.
Down-mixer generates down-mixed audios from input Channel Audio according to the list of channel layouts and the down-mix parameters, and outputs down-mixed audio for each channel layout to Loudness module.
- It is not depicted in the figure but Down-mixer further generates Dmixp_Mode and Recon_Gains for each frame to be passed to OBU packetizer.
Loudness module measures the loudness level (LKFS) of each down-mixed audio based on [ITU1770-4], and passes them to OBU packetizer.
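
The decision made by the Height Energy Quantification module can be sketched as below. This is an illustration under stated assumptions: energy is taken as the sum of squared samples over all channels of a group for one frame, which this section does not define precisely, and the function names are hypothetical. The derivation of w(k) from w_idx_offset(k) is left to § 9.2 and is not shown.

```python
def frame_energy(channels):
    """Sum of squared samples over all channels of one frame (assumed metric)."""
    return sum(sample * sample for channel in channels for sample in channel)


def w_idx_offset(top_channels, surround_channels):
    """Return 1 if the top channels carry more energy than the surround ones, else -1."""
    if frame_energy(top_channels) > frame_energy(surround_channels):
        return 1
    return -1


# One frame with two samples per channel (illustrative values).
top = [[0.5, 0.5], [0.4, 0.4]]
surround = [[0.1, 0.1], [0.1, 0.2]]
print(w_idx_offset(top, surround))
```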

10.2.2. Down-mix Mechanism

This section specifies the down-mixing mechanism to generate down-mixed audio for scalable channel audio.

For a given Channel Audio which conforms to loudspeaker_layout, the surround and top channels (if any) are down-mixed separately, step by step, until the target channels are obtained.

Implementors may use another method to get the down-mixed audio from the given channel audio, but the down-mixed audio shall comply with the result of the mechanism in this section.

Therefore, a down-mixer based on the down-mix mechanism is a combination of the following surround down-mixer(s) and top down-mixer(s) as depicted in the figure below.

Surround Down-mixers: S7to5 enc., S5to3 enc., S3to2 enc., S2to1 enc.

S7to5 enc.: Ls5 = α(k) x Lss7 + β(k) x Lrs7 and Rs5 = α(k) x Rss7 + β(k) x Rrs7.
S5to3 enc.: L3 = L5 + δ(k) x Ls5 and R3 = R5 + δ(k) x Rs5
S3to2 enc.: L2 = L3 + 0.707 x C and R2 = R3 + 0.707 x C
S2to1 enc.: Mono = 0.5 x (L2 + R2)

Top Down-mixers: T4to2 enc., T2toTF2 enc.

T4to2 enc.: Ltf2 = Ltf4 + γ(k) x Ltb4  and Rtf2 = Rtf4 + γ(k) x Rtb4.
T2toTF2 enc.: Ltf3 = Ltf2 + w(k) x δ(k) x Ls5 and Rtf3 = Rtf2 + w(k) x δ(k) x Rs5.

IA Down-mix Mechanism

For example, to get down-mixed 3.1.2ch from 7.1.4ch:
- S3 of 3.1.2ch is generated by using S7to5 and S5to3 encs.
- TF2 of 3.1.2ch is generated by using T4to2 and T2toTF2 encs.
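
The down-mixers above can be written out per sample as in the following sketch. It assumes floating-point samples and that the front channels pass through S7to5 unchanged (L5 = L7, R5 = R7), which the equations imply but do not state. The function names, the dictionary keys "L7"/"R7" and the parameter values in the example are illustrative only.

```python
def s7to5(ch, alpha, beta):
    """S7to5 enc.: fold side and rear surrounds into Ls5 and Rs5."""
    return (alpha * ch["Lss7"] + beta * ch["Lrs7"],
            alpha * ch["Rss7"] + beta * ch["Rrs7"])


def s5to3(l5, r5, ls5, rs5, delta):
    """S5to3 enc.: fold Ls5/Rs5 into the front pair."""
    return l5 + delta * ls5, r5 + delta * rs5


def s3to2(l3, r3, c):
    """S3to2 enc.: fold the Center channel into the front pair."""
    return l3 + 0.707 * c, r3 + 0.707 * c


def s2to1(l2, r2):
    """S2to1 enc.: average the stereo pair into Mono."""
    return 0.5 * (l2 + r2)


def t4to2(ch, gamma):
    """T4to2 enc.: fold the top back pair into the top front pair."""
    return (ch["Ltf4"] + gamma * ch["Ltb4"],
            ch["Rtf4"] + gamma * ch["Rtb4"])


def t2totf2(ltf2, rtf2, ls5, rs5, w, delta):
    """T2toTF2 enc.: add the weighted surround contribution to the top pair."""
    return ltf2 + w * delta * ls5, rtf2 + w * delta * rs5


def downmix_714_to_312(ch, alpha, beta, gamma, delta, w):
    """Down-mix one 7.1.4ch sample set to 3.1.2ch using the encoders above."""
    ls5, rs5 = s7to5(ch, alpha, beta)
    # Assumption: the front channels are unchanged by S7to5.
    l3, r3 = s5to3(ch["L7"], ch["R7"], ls5, rs5, delta)
    ltf2, rtf2 = t4to2(ch, gamma)
    ltf3, rtf3 = t2totf2(ltf2, rtf2, ls5, rs5, w, delta)
    return {"L3": l3, "R3": r3, "C": ch["C"], "LFE": ch["LFE"],
            "Ltf3": ltf3, "Rtf3": rtf3}


# Illustrative sample values and parameters (not taken from any table).
sample = {"L7": 0.1, "R7": 0.2, "C": 0.3, "LFE": 0.0,
          "Lss7": 0.4, "Rss7": 0.4, "Lrs7": 0.2, "Rrs7": 0.2,
          "Ltf4": 0.1, "Rtf4": 0.1, "Ltb4": 0.2, "Rtb4": 0.2}
out = downmix_714_to_312(sample, alpha=1.0, beta=0.5, gamma=0.5, delta=0.5, w=0.5)
```

Continuing from 3.1.2ch to stereo or mono would chain s3to2 and s2to1 on the L3, R3 and C outputs.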

10.2.3. Channel Layout Generation Rule

This section describes the generation rule for channel layouts for scalable channel audio.

For a given channel layout (CL #n) of input Channel Audio, any list of CLs ({CL #i: i = 1, 2, ..., n}) for a scalable channel audio shall conform to the following rules:

S_i ≤ S_{i+1} and W_i ≤ W_{i+1} and T_i ≤ T_{i+1}, except for the case S_i = S_{i+1} and W_i = W_{i+1} and T_i = T_{i+1}, for i = n-1, n-2, …, 1, where the ith Channel Layout CL #i = S_i.W_i.T_i.
CL #i is one of loudspeaker_layouts supported in this specification.

Only down-mix paths which conform to the above rules shall be allowed for scalable channel audio with num_layers > 1, as depicted in the figure below.
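
The first rule above can be checked mechanically, as in this minimal sketch. It assumes each channel layout is written as an "S.W.T" string (for example "2.0.0" for stereo); the function names are hypothetical. It does not check the second rule, because the set of supported loudspeaker_layouts is defined elsewhere in the specification.

```python
def parse_layout(text):
    """Split an "S.W.T" string into a tuple of integers (S, W, T)."""
    s, w, t = (int(part) for part in text.split("."))
    return s, w, t


def is_valid_layout_list(layouts):
    """Check that S, W and T never decrease and that no two adjacent CLs are equal."""
    parsed = [parse_layout(layout) for layout in layouts]
    for prev, cur in zip(parsed, parsed[1:]):
        if any(p > c for p, c in zip(prev, cur)):
            return False  # a component decreased
        if prev == cur:
            return False  # identical consecutive layouts are excluded
    return True


print(is_valid_layout_list(["2.0.0", "3.1.2", "5.1.2", "7.1.4"]))
```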

IA Down-mix Path

10.2.4. Recon Gain Generation

This section describes how to generate Recon_Gain.

Recon_Gain needs to be applied to de-mixed channels. For this, IA encoder needs to deliver it to IA decoders.

The following levels are defined:

Level Ok is the signal power for the frame #k of a channel of the down-mixed audio for CL #i.
Level Mk is the signal power for the frame #k of the relevant mixed channel of the down-mixed audio for CL #i-1.
Level Dk is the signal power for the frame #k of the de-mixed channel for CL #i (after demixing).

If 10*log10(level Ok / maxL^2) is less than the first threshold value (e.g. -80dB), Recon_Gain (k, i) = 0, where maxL = 32767 for 16 bits.

If 10*log10(level Ok / level Mk ) is less than the second threshold value (e.g. -6dB), Recon_Gain (k, i) is set to the value which makes level Ok = Recon_Gain (k, i)^2 x level Dk. Otherwise, Recon_Gain (k, i) = 1. Actual value to be delivered is floor(255*Recon_Gain).
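
The two threshold tests above can be sketched as follows. This is an illustration, not a normative algorithm: the function names are hypothetical, the thresholds default to the example values (-80 dB and -6 dB), and two behaviours are assumptions that the text does not specify, namely falling back to 1 when level Mk or level Dk is zero and clamping the gain to 1 so that the delivered value fits in the range 0 to 255.

```python
import math


def to_db(ratio):
    """10*log10 of a power ratio, with -inf for a non-positive ratio."""
    return 10.0 * math.log10(ratio) if ratio > 0 else float("-inf")


def recon_gain(level_o, level_m, level_d, max_l=32767.0,
               first_threshold_db=-80.0, second_threshold_db=-6.0):
    """Compute Recon_Gain(k, i) from the signal powers Ok, Mk and Dk."""
    if to_db(level_o / (max_l ** 2)) < first_threshold_db:
        return 0.0
    # Assumption: without a usable Mk or Dk, no correction is applied.
    if level_m <= 0 or level_d <= 0:
        return 1.0
    if to_db(level_o / level_m) < second_threshold_db:
        # Gain that makes level Ok = Recon_Gain^2 x level Dk.
        # Assumption: clamp to 1 so the delivered value stays within 0..255.
        return min(1.0, math.sqrt(level_o / level_d))
    return 1.0


def delivered_value(gain):
    """Value carried in the bitstream: floor(255 * Recon_Gain)."""
    return math.floor(255 * gain)


print(delivered_value(recon_gain(1e6, 1e8, 4e6)))
```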

For example, if we assume CL #i = 7.1.4ch and CL #i-1 = 5.1.2ch, then de-mixed channels are D_Lrs7, D_Rrs7, D_Ltb4 and D_Rtb4.
- D_Lrs7 and D_Rrs7 are de-mixed from Ls5 and Rs5 in the (i-1)th ChannelGroup by using Lss7 and Rss7 in the ith ChannelGroup and its relevant demixing parameters (i.e., α(k) and β(k)), respectively.
- D_Ltb4 and D_Rtb4 are de-mixed from Ltf2 and Rtf2 in the (i-1)th ChannelGroup by using Ltf4 and Rtf4 in the ith ChannelGroup and its relevant demixing parameter (i.e., γ(k)), respectively.

Recon_Gain for D_Lrs7:
- Level Ok is the signal power for the frame #k of Lrs7 in the ith ChannelGroup.
- Level Mk is the signal power for the frame #k of Ls5 in the (i-1)th ChannelGroup.
- Level Dk is the signal power for the frame #k of D_Lrs7.
Recon_Gain for D_Rrs7:
- Level Ok is the signal power for the frame #k of Rrs7 in the ith ChannelGroup.
- Level Mk is the signal power for the frame #k of Rs5 in the (i-1)th ChannelGroup.
- Level Dk is the signal power for the frame #k of D_Rrs7.
Recon_Gain for D_Ltb4:
- Level Ok is the signal power for the frame #k of Ltb4 in the ith ChannelGroup.
- Level Mk is the signal power for the frame #k of Ltf2 in the (i-1)th ChannelGroup.
- Level Dk is the signal power for the frame #k of D_Ltb4.
Recon_Gain for D_Rtb4:
- Level Ok is the signal power for the frame #k of Rtb4 in the ith ChannelGroup.
- Level Mk is the signal power for the frame #k of Rtf2 in the (i-1)th ChannelGroup.
- Level Dk is the signal power for the frame #k of D_Rtb4.

10.2.5. ChannelGroup Generation Rule

This section describes the generation rule for ChannelGroup.

For a given Channel Audio and the list of CLs ({CL #i: i = 1, 2, ..., n}), CG Generation module outputs the transformed audio (i.e. ChannelGroups) which shall conform to the following rules:

It consists of C channels and is structured into n CGs, where C is the number of channels of the Channel Audio.
CG #1 (called BCG): This CG is the down-mixed audio itself for CL #1 generated from the Channel Audio. It contains C_1 channels.
CG #i (called DCG, i = 2, 3, …, n): This CG contains (C_i – C_{i-1}) channels, which consist of the following:
- (S_i – S_{i-1}) surround channel(s) if S_i > S_{i-1}. When S_set = { x | S_{i-1} < x ≤ S_i and x is an integer},
  - If 2 is an element of S_set, the L2 channel is contained in this CG #i.
  - If 3 is an element of S_set, the Center channel is contained in this CG #i.
  - If 5 is an element of S_set, the L5 and R5 channels are contained in this CG #i.
  - If 7 is an element of S_set, the Lss7 and Rss7 channels are contained in this CG #i.
- The LFE channel if W_i > W_{i-1}.
- (T_i – T_{i-1}) top channels if T_i > T_{i-1}.
  - If T_{i-1} = 0, the top channels of the down-mixed audio for CL #i are contained in this CG #i.
  - If T_{i-1} = 2, the Ltf and Rtf channels of the down-mixed audio for CL #i are contained in this CG #i.
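
The DCG rules above can be sketched as a function that lists which channels a CG #i carries. It assumes layouts given as (S, W, T) tuples with T limited to 0, 2 or 4; the function name and the generic top-channel labels ("Ltf", "Rtf", "Ltb", "Rtb", meaning the top channels of the down-mixed audio for CL #i) are illustrative. The returned list order is not the substream order.

```python
def dcg_channels(prev_layout, layout):
    """List the channels contained in CG #i, given CL #(i-1) and CL #i."""
    s_prev, w_prev, t_prev = prev_layout
    s, w, t = layout
    channels = []

    # S_set = { x | S_(i-1) < x <= S_i and x is an integer }
    s_set = range(s_prev + 1, s + 1)
    if 2 in s_set:
        channels.append("L2")
    if 3 in s_set:
        channels.append("C")
    if 5 in s_set:
        channels += ["L5", "R5"]
    if 7 in s_set:
        channels += ["Lss7", "Rss7"]

    if w > w_prev:
        channels.append("LFE")

    if t > t_prev:
        if t_prev == 0:
            # All top channels of the down-mixed audio for CL #i.
            channels += ["Ltf", "Rtf"] if t == 2 else ["Ltf", "Rtf", "Ltb", "Rtb"]
        else:
            # T_(i-1) = 2: only the top front pair of CL #i.
            channels += ["Ltf", "Rtf"]
    return channels


# The 4-CG example: 2ch / 3.1.2ch / 5.1.2ch / 7.1.4ch.
layouts = [(2, 0, 0), (3, 1, 2), (5, 1, 2), (7, 1, 4)]
dcgs = [dcg_channels(a, b) for a, b in zip(layouts, layouts[1:])]
print(dcgs)
```

In each case the number of listed channels equals C_i – C_{i-1}, where C_i = S_i + W_i + T_i.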

The figure below shows one example of a transformation matrix with 4 CGs (2ch/3.1.2ch/5.1.2ch/7.1.4ch).

Example of Transformation Matrix with 4 CGs

10.2.6. Mix Presentation Encoding

//To Do: Fill in the text

10.2.6.1. Rendering Config

This section provides a guideline to generate rendering_config().

//To Do: Fill in how to generate rendering_config() for scene-based audio element

//To Do: Fill in how to generate rendering_config() for channel-based audio element

10.2.6.2. Element Mix Config

This section provides a guideline to generate element_mix_config().

//To Do: Fill in how to generate element_mix_config() for scene-based audio element

//To Do: Fill in how to generate element_mix_config() for channel-based audio element

10.2.7. Multiple Audio Elements Encoding

This section provides a guideline to generate an IA sequence having multiple audio elements from two IA simple or base profiles.

10.2.7.1. Multiple Audio Elements with One Codec Config

This section describes how to generate an IA sequence having multiple audio elements from two IA simple profiles with the same Codec Config OBU. The result shall comply with the base profile of IA sequence.

Step 1: Descriptor OBUs are generated as follows:

Start Code OBU: take the larger version field and the larger profile version field, respectively.
Codec Config OBU:
- Take just one codec_id and codec_config().
- Update num_audio_elements and audio_element_id.
Mix Presentation OBUs: take all of them and, if needed, generate the ones which are used after mixing of multiple audio elements, except for the following:
- audio_element_ids are updated to be aligned with Codec Config OBU.
Audio Element OBUs: take all of them, except for the following:
- audio_element_ids are updated to be aligned with Mix Presentation OBUs.
- audio_substream_ids are updated to be unique across all Audio Element OBUs.
- parameter_ids of audio_element_obu() are updated to be unique across all Audio Element OBUs.

Step 2: The ith temporal unit is generated as follows:

Take all of the temporal units for the ith frames from each audio element and keep the order of temporal units the same as the order of Audio Element OBUs in descriptor OBUs, except for the following:
- obu_types are updated to be aligned with audio_element_ids specified in Audio Element OBUs.
- parameter_ids in Parameter Block OBUs are updated to be aligned with parameter_ids in Audio Element OBUs.
Add a Sync OBU in front of the ith temporal unit if needed.
- The new Sync OBU is generated based on the Sync OBUs of each IA sequence and the updated audio_substream_ids and parameter_ids of parameter_block_obu().
It may have the immediately preceding temporal delimiter OBU for each temporal unit.

Step 3: Generate an IA sequence which starts with descriptor OBUs, followed by temporal units in order.
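
The ID updates required in the steps above can be sketched as a renumbering pass. This is only an illustration under assumptions: each audio element is modelled as a plain dictionary with hypothetical keys, and new IDs are handed out sequentially from 0, which the steps do not require (they only require uniqueness and alignment). The returned maps are what a merger would use to rewrite Mix Presentation OBUs, Parameter Block OBUs and Sync OBUs.

```python
from itertools import count


def renumber_ids(sequences):
    """Assign unique IDs across the audio elements of several IA sequences.

    sequences: one list of audio element dicts per input IA sequence; each
        dict has "audio_element_id", "audio_substream_ids" and
        "parameter_ids" (a hypothetical in-memory model).
    Returns the merged audio elements and per-kind maps keyed by
    (sequence_index, old_id).
    """
    element_ids, substream_ids, parameter_ids = count(), count(), count()
    element_map, substream_map, parameter_map = {}, {}, {}
    merged = []
    for seq_index, elements in enumerate(sequences):
        for element in elements:
            old_element_id = element["audio_element_id"]
            element_map[(seq_index, old_element_id)] = next(element_ids)
            for old in element["audio_substream_ids"]:
                substream_map[(seq_index, old)] = next(substream_ids)
            for old in element["parameter_ids"]:
                # A parameter shared inside one sequence keeps one new ID.
                if (seq_index, old) not in parameter_map:
                    parameter_map[(seq_index, old)] = next(parameter_ids)
            merged.append({
                "audio_element_id": element_map[(seq_index, old_element_id)],
                "audio_substream_ids": [
                    substream_map[(seq_index, old)]
                    for old in element["audio_substream_ids"]
                ],
                "parameter_ids": [
                    parameter_map[(seq_index, old)]
                    for old in element["parameter_ids"]
                ],
            })
    return merged, element_map, substream_map, parameter_map


seq_a = [{"audio_element_id": 0, "audio_substream_ids": [0, 1], "parameter_ids": [0]}]
seq_b = [{"audio_element_id": 0, "audio_substream_ids": [0, 1], "parameter_ids": [0]}]
merged, element_map, substream_map, parameter_map = renumber_ids([seq_a, seq_b])
```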

10.2.7.2. Multiple Audio Elements with Multiple Codec Config

This section describes how to generate an IA sequence having multiple audio elements from two IA simple or base profiles with different Codec Config OBUs. The result shall comply with the enhanced profile of IA sequence.

Step 1: Descriptor OBUs are generated as follows:

Start Code OBU: take the larger version field and the larger profile version field, respectively.
Codec Config OBU: if some of the Codec Config OBUs are the same, merge them into one Codec Config OBU, and take each of the others as is.
- Update audio_element_ids to be unique across all Codec Config OBUs.
Mix Presentation OBUs: take all of them and, if needed, generate the ones which are used after mixing of multiple audio elements, except for the following:
- audio_element_ids are updated to be aligned with Codec Config OBU.
Audio Element OBUs: take all of them, except for the following:
- audio_element_ids are updated to be aligned with Mix Presentation OBUs.
- audio_substream_ids are updated to be unique across all Audio Element OBUs.
- parameter_ids of audio_element_obu() are updated to be unique across all Audio Element OBUs.

Step 2: Data OBUs are generated as follows:

Place Temporal Units from multiple audio elements in timing order.
Add a Sync OBU in front of a Temporal Unit, frequently.
- The new Sync OBU is generated based on the Sync OBUs from each IA sequence and the updated audio_substream_ids and parameter_ids of parameter_block_obu().
It may have the immediately preceding temporal delimiter OBU for each audio element.

Step 3: Generate an IA sequence which starts with descriptor OBUs, followed by Temporal Units in order.

10.2.8. Post Processing

This section provides a guideline to generate algorithms for post processing.

10.2.8.1. Loudness Config

This section provides a guideline to generate loudness_config().

//To Do: Fill in how to generate loudness_config()

10.2.8.2. DRC Config

This section provides a guideline to generate drc_config().

//To Do: Fill in how to generate drc_config()

11. Consumption of IAC bitstream

TODO. Fill in example workflows.

12. Annex A: Audio Substream Gaps

This annex describes a number of scenarios where a gap may exist in the audio signals, where a gap is defined as no audio frames for some period of time.

12.1. A gap within an audio substream

A gap within an audio substream may be expressed via the Sync OBU offsets. A decoder encountering such a gap may either:

insert silent audio frames in the gap without reinitializing, or
reinitialize before decoding the audio frames after the gap.

The appropriate behaviour in this case is signalled via the reinitialize_decoder field in the Sync OBU.

In this version of the specification, gaps within an audio substream are not supported.

12.2. A gap between two audio substreams

A gap may occur if there is a period of time between the end of one substream and the start of another. Such a gap may be expressed via the Sync OBU offsets. Similar to the case of a gap within an audio substream, the behaviour of the decoder is determined by the reinitialize_decoder field in the Sync OBU.

A gap may further occur if there is a period of time between the end of all substreams and the start of another. This case may be expressed by setting a non-zero value for the global_offset field in the Sync OBU.

12.3. A gap due to packet loss

In the case where a gap is not signalled by the Sync OBUs, any unexpected absence of audio frames shall be interpreted as packet loss. The IAC parser is unable to guarantee the correctness of following OBUs received until the next set of Descriptor OBUs.

In this version of the specification, gaps in the audio substreams are not supported, so any gap that is encountered can always be interpreted as packet loss.

Immersive Audio Model and Formats

AOM Working Group Draft, 28 November 2022

Abstract

1. Convention

1.1. Syntax Description

1.1.1. Data types

1.2. Function

1.2.1. Function templates

1.2.2. Mathematical functions

1.2.3. Function UTF-8 Encoding

2. Introduction

3. Overview

3.1. IA sequence Components

3.2. Use of OBU Syntax

3.2.1. Descriptors

3.2.2. Data

4. Open Bitstream Unit (OBU) Syntax and Semantics

4.1. Top Level OBU Syntax and Semantics

4.1.1. Audio OBU Syntax and Semantics

4.1.2. OBU Header Syntax and Semantics

4.1.3. Byte Alignment Syntax and Semantics

4.1.4. Reserved OBU Syntax and Semantics

4.1.5. Start Code OBU Syntax and Semantics

4.1.6. Codec Config OBU Syntax and Semantics

4.1.7. Audio Element OBU Syntax and Semantics

4.1.8. Mix Presentation OBU Syntax and Semantics

4.1.9. Parameter Block OBU Syntax and Semantics

4.1.9.1. Parameter Definition Syntax and Semantics

4.1.10. Audio Frame OBU Syntax and Semantics

4.1.11. Temporal Delimiter OBU Syntax and Semantics

4.1.12. Sync OBU Syntax and Semantics

4.2. Detailed OBU Syntax and Semantics

4.2.1. Scalable Channel Layout Config Syntax and Semantics

4.2.2. Ambisonics Config Syntax and Semantics

4.2.3. Demixing Info Syntax and Semantics

4.2.4. Recon Gain Info Syntax and Semantics

4.2.5. Mix Target Layout Syntax and Semantics

4.2.6. Rendering Config Syntax and Semantics

4.2.7. Element Mix Config Syntax and Semantics

4.2.7.1. Mix Gain Parameter Definition and Data Syntax and Semantics

4.2.8. Mix Loudness Info Syntax and Semantics

4.2.9. Mix Bus Config Syntax and Semantics

4.3. Codec Specific

4.3.1. IAC-OPUS Specific

4.3.2. IAC-AAC-LC Specific

4.3.3. IAC-FLAC Specific

4.3.4. IAC-LPCM Specific

5. Profiles

5.1. IA Simple Profile

5.2. IA Base Profile

5.3. IA Enhanced Profile

6. Standalone IAC Representation

6.1. OBU Sequence Order

6.1.1. Descriptor OBUs

6.1.2. Data OBUs

6.1.3. Refreshing Descriptor OBUs

6.2. Synchronizing Data OBUs

7. ISOBMFF IAC Encapsulation

7.1. General Requirements & Brands

7.2. ISOBMFF IAC Encapsulation with single track

7.2.1. Requirement of IA sequence

7.2.2. Encapsulation Scheme

7.2.3. IA Sample Entry

7.2.4. Codec Specific Box

7.2.4.1. OPUS Specific Box

7.2.4.2. MP4A Specific Box

7.2.5. IA Sample Format

7.2.6. IA Sample Group

7.2.6.1. Global Descriptor Sample Group

7.2.6.2. Demixing Info Sample Group

7.3. Common Encryption

7.4. Codecs Parameter String

8. ISOBMFF IAC Decapsulation

8.1. ISOBMFF IAC Decapsulation with single track

9. IAC processing

9.1. Ambisonics decoding

9.2. Scalable Channel Audio decoding

9.2.1. Gain

9.2.2. De-mixer

9.2.3. Recon Gain