
Immersive Audio Model and Formats

AOM Working Group Draft,

This version:
https://aomediacodec.github.io/iamf/
Issue Tracking:
GitHub
Editors:
(Samsung)
(Google)
Not Ready For Implementation

This spec is not yet ready for implementation. It exists in this repository to record the ideas and promote discussion.

Before attempting to implement this spec, please contact the editors.

Copyright 2022, AOM

Licensing information is available at http://aomedia.org/license/

The MATERIALS ARE PROVIDED “AS IS.” The Alliance for Open Media, its members, and its contributors expressly disclaim any warranties (express, implied, or otherwise), including implied warranties of merchantability, non-infringement, fitness for a particular purpose, or title, related to the materials. The entire risk as to implementing or otherwise using the materials is assumed by the implementer and user. IN NO EVENT WILL THE ALLIANCE FOR OPEN MEDIA, ITS MEMBERS, OR CONTRIBUTORS BE LIABLE TO ANY OTHER PARTY FOR LOST PROFITS OR ANY FORM OF INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES OF ANY CHARACTER FROM ANY CAUSES OF ACTION OF ANY KIND WITH RESPECT TO THIS DELIVERABLE OR ITS GOVERNING AGREEMENT, WHETHER BASED ON BREACH OF CONTRACT, TORT (INCLUDING NEGLIGENCE), OR OTHERWISE, AND WHETHER OR NOT THE OTHER MEMBER HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


Abstract

This document specifies an immersive audio (IA) architecture and model, a standalone IA sequence format and an [ISOBMFF]-based IA container format.

1. Convention

1.1. Syntax Description

All syntax elements shall conform to the Syntactic Description Language specified in [MP4-Systems] unless explicitly described otherwise in this specification.

1.1.1. Data types

leb128() syntaxName

leb128() indicates the type of an unsigned integer. It indicates the following unsigned integer syntaxName shall be encoded by leb128() specified in [AV1-Convention].

syntaxName is an unsigned integer which is encoded by leb128() specified in [AV1-Convention].
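As an illustration of the leb128() data type, the following is a minimal Python sketch of the little-endian base-128 coding described in [AV1-Convention]: each byte carries 7 value bits, the most significant bit of each byte signals that another byte follows, and at most 8 bytes are read. The function names decode_leb128 and encode_leb128 are hypothetical, and raising an error on an over-long value is an assumption of this sketch rather than a requirement stated here.

```python
from typing import Tuple


def decode_leb128(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, bytes_consumed) for a leb128() field starting at pos."""
    value = 0
    for i in range(8):  # at most 8 bytes are read per leb128() value
        if pos + i >= len(data):
            raise ValueError("truncated leb128() value")
        byte = data[pos + i]
        value |= (byte & 0x7F) << (7 * i)  # low 7 bits carry the payload
        if not byte & 0x80:  # MSB clear: this was the last byte
            return value, i + 1
    # Assumption of this sketch: treat an over-long value as an error.
    raise ValueError("leb128() value longer than 8 bytes")


def encode_leb128(value: int) -> bytes:
    """Encode a non-negative integer using the shortest leb128() form."""
    if value < 0:
        raise ValueError("leb128() only codes unsigned integers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)
```

For example, the value 300 occupies two bytes (0xAC 0x02), while any value below 128 occupies a single byte.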

sleb128() syntaxName

sleb128() indicates the type of a signed integer. It indicates that the following signed integer syntaxName shall be encoded by leb128() specified in [AV1-Convention].

syntaxName is a signed integer which is encoded by leb128() specified in [AV1-Convention].

string syntaxName

string indicates the type of a string which is terminated by a single null byte (i.e. 0x00).

syntaxName is a human readable label whose byte representation shall consist of a two-letter primary language subtag and a two-letter region subtag which are connected by a hyphen ("-"), followed by the byte representation of UTF-8_Enc(label).

Where, two-letter primary language subtags and two-letter region subtags shall conform to [BCP47].
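One possible reading of the string layout can be sketched in Python as follows. The function name encode_label is hypothetical, and the sketch assumes that the language tag is immediately followed by the UTF-8 bytes of the label with no separator, since the text above does not mention one. Only the two-letter shape of the subtags is checked; full [BCP47] validation is out of scope.

```python
def encode_label(language: str, region: str, label: str) -> bytes:
    """Build the bytes of a `string` field: tag, UTF-8 label, null terminator."""
    for subtag in (language, region):
        # Only the two-letter shape is checked here, not BCP 47 registration.
        if not (len(subtag) == 2 and subtag.isascii() and subtag.isalpha()):
            raise ValueError(f"expected a two-letter subtag, got {subtag!r}")
    if "\x00" in label:
        raise ValueError("label must not contain an embedded null")
    tag = f"{language}-{region}".encode("ascii")
    # Assumption: no separator between the tag and the label bytes.
    return tag + label.encode("utf-8") + b"\x00"
```

Under these assumptions, encode_label("en", "US", "Main mix") yields the tag bytes en-US, then the label bytes, then a single 0x00 terminator.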

1.2. Function

1.2.1. Function templates

When the template keyword is used to decorate the class declaration, it indicates that the code is a template with a placeholder type that can be reused by other classes. Only classes that use the template shall be present in the bitstream; the template itself shall not be present in the bitstream. Classes that use a function template shall pass a data type that is specified in either [MP4-Systems] or § 1.1.1 Data types.

Example

template <class T>
class Foo {
  T t;
}

class Bar {
  Foo<int> f;
}

1.2.2. Mathematical functions

Clip3(x, y, z)

It shall conform to Clip3 specified in [AV1-Convention].
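For reference, Clip3 in [AV1-Convention] clamps z to the range [x, y]. A minimal Python equivalent (the lowercase name clip3 is only for this illustration) is:

```python
def clip3(x, y, z):
    """Clamp z to the inclusive range [x, y], following the AV1 Clip3 convention."""
    if z < x:
        return x
    if z > y:
        return y
    return z
```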

1.2.3. Function UTF-8 Encoding

UTF-8_Enc(label)

UTF-8_Enc(label) is the byte representation of the encoded label, which is a null-terminated UTF-8 string as defined in [RFC3629].

2. Introduction

The IA sequence is a bitstream to represent immersive audio for presentation on a wide range of devices in both dynamic streaming and offline applications. These applications include internet audio streaming, multicasting/broadcasting services, file download, gaming, communication, virtual and augmented reality, and others. In these applications, audio may be played back on a wide range of devices, e.g. headsets, mobile phones, tablets, TVs, sound bars, home theater systems and big screens.

The bitstream comprises a number of coded audio substreams and the metadata that describes how to decode, render and mix the substreams to generate an audio signal for playback. The bitstream format itself is codec-agnostic; any supported audio codec may be used to code the audio substreams.

The immersive audio container (IAC) is the storage format for an immersive audio (IA) sequence in a single [ISOBMFF] track.

The figure below shows the conceptual IAC architecture.

Conceptual IAC Architecture

For a given input 3D audio,

The rest of this specification is formulated as follows:

3. Overview

3.1. IA sequence Components

The IA sequence includes one or more audio elements, each of which consists of one or more audio substreams. The IA sequence further includes mix presentations and parameters.

The figure below shows the relationship between the audio substreams, audio elements and mix presentations and the processing flow to obtain the immersive audio playback.

Processing flow to decode, reconstruct, render and mix the audio signals for immersive audio playback.

3.2. Use of OBU Syntax

3.2.1. Descriptors

The descriptor OBUs contain all the information that is required to set up and configure the decoders, reconstruction algorithms, renderers and mixers.

3.2.2. Data

The data OBUs contain the actual time-varying data that is required in the generation of the final audio output.

The IA sequence supports the description of multiple audio substreams and algorithms, which may have different metadata update rates to each other. The update rate for the audio substreams and audio elements is governed by the frame rates of the audio codec used. Since a single bitstream may support multiple codecs, this may lead to multiple different frame rates. The algorithms for rendering and mixing may have parameters that update at different rates to each other and to the audio frame rates. Therefore, the IA sequence contains information to facilitate the synchronization of the different audio frames and parameters.

The figure below shows the linking scheme among obu_ids in obu_header and ids in the OBU payload.

ID Linking Scheme

In the above figure,

4. Open Bitstream Unit (OBU) Syntax and Semantics

4.1. Top Level OBU Syntax and Semantics

The IA sequence uses the OBU syntax.

This section specifies the top-level OBU syntax elements and their semantics.

4.1.1. Audio OBU Syntax and Semantics

Syntax

class audio_open_bitstream_unit() {
  obu_header();

  if (obu_type == OBU_IA_Start_Code)
    start_code_obu();
  else if (obu_type == OBU_IA_Codec_Config)
    codec_config_obu();
  else if (obu_type == OBU_IA_Audio_Element)
    audio_element_obu();
  else if (obu_type == OBU_IA_Mix_Presentation)
    mix_presentation_obu();
  else if (obu_type == OBU_IA_Parameter_Block)
    parameter_block_obu();
  else if (obu_type == OBU_IA_Temporal_Delimiter)
    temporal_delimiter_obu();
  else if (obu_type == OBU_IA_Sync)
    sync_obu();
  else if (obu_type == OBU_IA_Audio_Frame)
    audio_frame_obu_with_no_id();
  else if (obu_type >= 9 && obu_type <= 30)
    audio_frame_obu(obu_type - 9);
  else if (obu_type == 6 || obu_type == 7)
    reserved_obu();

  byte_alignment();
}

Semantics

If the syntax element obu_type is equal to OBU_IA_Start_Code, an ordered series of OBUs is presented to the decoding process as a string of bytes.

OBU data shall start on the first (most significant) bit and shall end on the last bit of the given bytes. The payload of an OBU shall lie between the first bit of the given bytes and the last bit before the first zero bit of the byte_alignment().

4.1.2. OBU Header Syntax and Semantics

Syntax

class obu_header() {
  unsigned int (5) obu_type;
  unsigned int (1) obu_redundant_copy;
  unsigned int (1) obu_trimming_status_flag;
  unsigned int (1) obu_extension_flag;
  leb128() obu_size;

  if (obu_trimming_status_flag) {
    leb128() num_samples_to_trim_at_end;
    leb128() num_samples_to_trim_at_start;
  }
  if (obu_extension_flag == 1)
    leb128() extension_header_size;
}

Semantics

OBUs are structured with a header and a payload.

obu_type specifies the type of data structure contained in the OBU payload.

obu_type: Name of obu_type
   0    : OBU_IA_Codec_Config
   1    : OBU_IA_Audio_Element
   2    : OBU_IA_Mix_Presentation
   3    : OBU_IA_Parameter_Block
   4    : OBU_IA_Temporal_Delimiter
   5    : OBU_IA_Sync
  6~7   : Reserved
   8    : OBU_IA_Audio_Frame
  9~30  : OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID21
   31   : OBU_IA_Start_Code

obu_redundant_copy indicates whether this OBU is a redundant copy of the previous OBU in the IA sequence with the same obu_type. A value of 1 shall indicate that it is a redundant copy, while a value of 0 shall indicate that it is not.

It shall always be set to 0 for the following obu_type values:

obu_trimming_status_flag indicates whether this OBU has audio samples to be trimmed or not. If it is set to 1, the num_samples_to_trim_at_start and num_samples_to_trim_at_end fields shall be present.

obu_extension_flag indicates whether the extension_header_size field shall be present. If it is set to 0, the extension_header_size field shall not be present. Otherwise, the extension_header_size field shall be present.

This flag shall be set to 0 for the current version of the specification (i.e. version = 0). An IAC-OBU parser which is conformant with the current version of the specification shall be able to parse this flag and extension_header_size.

NOTE: A future version of the specification may use this flag to specify an extension header field by setting obu_extension_flag = 1 and setting the size of the extended header to extension_header_size.

obu_size shall indicate the size in bytes of the OBU, not including the bytes within the obu_header of the preceding fields, i.e. obu_type, obu_redundant_copy, obu_trimming_status_flag and obu_extension_flag.

num_samples_to_trim_at_start shall indicate the number of samples that needs to be trimmed from the start of the samples in this Audio Frame OBU.

num_samples_to_trim_at_end shall indicate the number of samples that needs to be trimmed from the end of the samples in this Audio Frame OBU.

extension_header_size shall indicate the size in bytes of the extension header including this field.
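The header layout above can be sketched as a small Python parser. This is an illustrative reading of the obu_header() syntax, not a conformant implementation: the names ObuHeader, parse_obu_header and _read_leb128 are hypothetical, bits are assumed to be read most significant bit first, and the contents of a future extension header are not interpreted.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


def _read_leb128(data: bytes, pos: int) -> Tuple[int, int]:
    """Return (value, new_pos) for a leb128() field starting at pos."""
    value = 0
    for i in range(8):
        byte = data[pos + i]
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return value, pos + i + 1
    raise ValueError("leb128() value longer than 8 bytes")


@dataclass
class ObuHeader:
    obu_type: int
    obu_redundant_copy: int
    obu_trimming_status_flag: int
    obu_extension_flag: int
    obu_size: int = 0
    num_samples_to_trim_at_end: int = 0
    num_samples_to_trim_at_start: int = 0
    extension_header_size: Optional[int] = None


def parse_obu_header(data: bytes, pos: int = 0) -> Tuple[ObuHeader, int]:
    """Parse one obu_header() and return it with the position after it."""
    first = data[pos]
    hdr = ObuHeader(
        obu_type=first >> 3,                      # upper 5 bits
        obu_redundant_copy=(first >> 2) & 1,
        obu_trimming_status_flag=(first >> 1) & 1,
        obu_extension_flag=first & 1,
    )
    hdr.obu_size, pos = _read_leb128(data, pos + 1)
    if hdr.obu_trimming_status_flag:
        # Note the coded order: trim-at-end precedes trim-at-start.
        hdr.num_samples_to_trim_at_end, pos = _read_leb128(data, pos)
        hdr.num_samples_to_trim_at_start, pos = _read_leb128(data, pos)
    if hdr.obu_extension_flag:
        hdr.extension_header_size, pos = _read_leb128(data, pos)
    return hdr, pos
```

For instance, the bytes 0x42 0x05 0x00 0x80 0x01 parse as obu_type = 8 (OBU_IA_Audio_Frame) with the trimming flag set, obu_size = 5, no samples trimmed at the end and 128 samples trimmed at the start.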

4.1.3. Byte Alignment Syntax and Semantics

Syntax

class byte_alignment() {
  while (get_position() & 7)
    unsigned int (1) zero_bit;
}

Semantics

zero_bit shall be equal to 0 and shall be inserted into the bitstream to align the bit position to a multiple of 8 bits.

4.1.4. Reserved OBU Syntax and Semantics

The reserved OBU allows the extension of this specification with additional OBU types in a way that allows IAC-OBU parsers compliant with this version of the specification to ignore them.

4.1.5. Start Code OBU Syntax and Semantics

This section specifies obu payload of OBU_IA_Start_Code.

For this OBU, the OBU header (2 bytes) shall be set to 0xF806, i.e. obu_type = 31 with all three flags set to 0, followed by obu_size = 6.

Syntax

class start_code_obu() {
  unsigned int (32) ia_code;
  unsigned int (8) version;
  unsigned int (8) profile_version;
}

Semantics

ia_code shall be a ‘four-character code’ (4CC) to identify the start of the IA sequence. It shall be iamf.

version shall indicate the version of an IA sequence. It shall be set to 0 for this version of the specification. Implementations should treat IA sequences where the MSB four bits of the version number match that of a recognized specification as backwards compatible with that specification. That is, the version number can be split into "major" and "minor" version sub-fields, with changes to the minor sub-field (in the LSB four bits) signaling compatible changes. For example, an implementation of this specification should accept any stream with a version number of ’15’ or less, and should assume any stream with a version number ’16’ or greater is incompatible.
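The major/minor split of the version field can be illustrated with a short Python sketch. The function names are hypothetical; the only logic used is the rule stated above, namely that the upper four bits must match a recognized major version.

```python
from typing import Tuple


def split_version(version: int) -> Tuple[int, int]:
    """Split the 8-bit version field into (major, minor) 4-bit sub-fields."""
    if not 0 <= version <= 0xFF:
        raise ValueError("version is an 8-bit field")
    return version >> 4, version & 0x0F


def is_compatible_version(version: int, supported_major: int = 0) -> bool:
    """True if the MSB four bits match the major version this parser knows."""
    major, _minor = split_version(version)
    return major == supported_major
```

With supported_major = 0, version values 0 to 15 are accepted and values of 16 or greater are treated as incompatible.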

profile_version shall indicate the profile of an IA sequence. The MSB four bits shall indicate the profile of an IA sequence. Implementations should treat IA sequences where the MSB four bits of the version number match that of a recognized profile as backwards compatible with that specification. That is, the version number can be split into "profile major" and "profile minor" version sub-fields, with changes to the minor sub-field (in the LSB four bits) signaling compatible changes with the profile major version. The semantics of this field shall only be valid when the MSB four bits of version = 0.

4.1.6. Codec Config OBU Syntax and Semantics

This section specifies the OBU payload of OBU_IA_Codec_Config.

Syntax

class codec_config_obu() {
  leb128() codec_config_id;
  codec_config();
  leb128() num_audio_elements;
  for (i = 0; i < num_audio_elements; i++) {
    leb128() audio_element_id;
  }
}

class codec_config() {
  unsigned int (32) codec_id;
  decoder_config(codec_id);
  leb128() num_samples_per_frame;
  signed int (16) roll_distance;
}

Semantics

codec_config_id shall indicate a unique ID in an IA sequence for a given codec config.

codec_id shall be a ‘four-character code’ (4CC) to identify the codec used to generate the audio substreams. It shall be opus for IAC-OPUS, mp4a for IAC-AAC-LC, fLaC for IAC-FLAC and lpcm for IAC-LPCM.

For ISOBMFF encapsulation, it shall be the same as the boxtype of its AudioSampleEntry, if one exists.

decoder_config() specifies the set of codec parameters required to decode an audio substream for the given codec_id. It shall be byte aligned.

num_samples_per_frame shall indicate the frame length, in samples, of the raw coded audio provided by audio_frame_obu().

roll_distance is a signed integer that gives the number of frames that need to be decoded in order for a frame to be decoded correctly. A negative value indicates the number of frames before the frame to be decoded correctly.

num_audio_elements shall specify the number of audio elements that refer to this codec config.

audio_element_id shall specify the unique ID associated with the specific audio element that refers to this codec config.

4.1.7. Audio Element OBU Syntax and Semantics

This section specifies the OBU payload of OBU_IA_Audio_Element.

Syntax

class audio_element_obu() {
  leb128() audio_element_id;
  unsigned int (3) audio_element_type;
  unsigned int (5) reserved;

  leb128() num_substreams;
  for (i = 0; i < num_substreams; i++) {
    leb128() audio_substream_id;
  }
  
  leb128() num_parameters;
  for (i = 0; i < num_parameters; i++) {
    leb128() parameter_id;
    leb128() parameter_name;
  }

  if (audio_element_type == CHANNEL_BASED) {
    scalable_channel_layout_config();
  } else if (audio_element_type == SCENE_BASED) {
    ambisonics_config();
  }
  
  
}

Semantics

audio_element_id shall indicate a unique ID in an IA sequence for a given audio element. A Codec Config OBU that refers to that audio element shall use the same value for its audio_element_id field.

audio_element_type shall specify the audio representation of this audio element which is constructed from one or more audio substreams.

audio_element_type: The type of audio representation.
   0    : CHANNEL_BASED
   1    : SCENE_BASED
  2~7   : Reserved

num_substreams shall specify the number of audio substreams that are used to reconstruct this audio element.

audio_substream_id shall specify the unique ID associated with the audio substream that is used to reconstruct this audio element.

num_parameters shall specify the number of parameters that are used by the algorithms specified in this audio element.

parameter_id shall be the unique ID associated with a parameter that is used by the algorithm specified in this audio element.

parameter_name shall specify the name of the parameter.

parameter_name : Parameter name.
       0       : SCALABLE_CHANNEL_LAYOUT_DEMIXING_INFO
       1       : SCALABLE_CHANNEL_LAYOUT_RECON_GAIN_INFO
   the others  : reserved

scalable_channel_layout_config() is a class that provides the metadata required for combining the substreams identified here in order to reconstruct a scalable channel layout.

ambisonics_config() is a class that provides the metadata required for combining the substreams identified here in order to reconstruct an Ambisonics layout.

4.1.8. Mix Presentation OBU Syntax and Semantics

This section specifies the OBU payload of OBU_IA_Mix_Presentation.

The metadata in mix_presentation() specifies how to render and process one or more audio elements. The processed audio elements shall then be summed to generate a mixed audio signal. Finally, any additional processing specified by the mix_bus_config() shall be applied to the mixed audio signal in order to generate the final output audio for playback.

Syntax

class mix_presentation_obu() {
  leb128() mix_presentation_id;
  string mix_presentation_friendly_label;
  unsigned int (2) mix_target_layout_type;

  mix_target_layout(mix_target_layout_type);

  leb128() num_audio_elements;
  for (i = 0; i < num_audio_elements; i++) {
    string audio_element_friendly_label;
    leb128() audio_element_id;
    rendering_config();
    element_mix_config();
  }

  mix_loudness_info();
  mix_bus_config();
}

Semantics

mix_presentation_id shall indicate a unique ID in an IA sequence for a given mix presentation.

mix_presentation_friendly_label shall specify a human-friendly label to describe this mix presentation.

mix_target_layout_type specifies a target layout type for this mix presentation. A value of 0 shall indicate no specific target layout, a value of 1 shall indicate that the target layout is defined using the SP Label of [ITU2051-3], a value of 2 shall indicate that the target layout is defined using the sound system convention of [ITU2051-3] and a value of 3 shall indicate that the target layout is binaural.

mix_target_layout_type : Mix Target layout type
           0           : NOT_DEFINED
           1           : LOUDSPEAKERS_SP_LABEL
           2           : LOUDSPEAKERS_SS_CONVENTION
           3           : BINAURAL

mix_target_layout() is a class that specifies the target playback layout that all referenced audio elements shall be rendered for.

An IA sequence may have one or more mix presentations specified, each with a different mix target layout. The IA parser shall select the appropriate target layout according to the following rules, in order:

  1. The IA parser shall first attempt to select the mix presentation that matches the physical playback layout.

  2. If there is no match, the IA parser shall select the mix presentation with mix_target_layout_type = 0. In this case, the renderer specified in rendering_config() shall render the physical playback layout appropriately.

  3. If there is no mix presentation with mix_target_layout_type = 0, the IA parser should select the mix presentation with the closest specified layout to the physical layout. The renderer specified in rendering_config() shall first render the layout specified by mix_target_layout() and then apply up or down-mixing appropriately. Sections § 10.2.2 Down-mix Mechanism and § 9.5 Down-mix Matrix provide example dynamic and static down-mixing matrices for some common layouts that may be used.
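The three selection rules above can be sketched in Python as follows. This is only an illustration: the MixPresentation class, the use of a plain string to stand in for mix_target_layout(), and the caller-supplied distance function are all hypothetical, because this section does not define how the "closest" layout is measured.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

NOT_DEFINED = 0  # mix_target_layout_type value for "no specific target layout"


@dataclass
class MixPresentation:
    mix_presentation_id: int
    mix_target_layout_type: int
    layout: Optional[str]  # hypothetical stand-in for mix_target_layout()


def select_mix_presentation(
    presentations: Sequence[MixPresentation],
    playback_layout: str,
    distance: Callable[[str, str], float],
) -> MixPresentation:
    """Apply rules 1 to 3 in order and return the chosen mix presentation."""
    if not presentations:
        raise ValueError("no mix presentations available")
    # Rule 1: an exact match with the physical playback layout.
    for p in presentations:
        if p.mix_target_layout_type != NOT_DEFINED and p.layout == playback_layout:
            return p
    # Rule 2: a presentation with no specific target layout.
    for p in presentations:
        if p.mix_target_layout_type == NOT_DEFINED:
            return p
    # Rule 3: the closest layout, using the caller's (assumed) distance metric.
    return min(presentations, key=lambda p: distance(p.layout, playback_layout))
```

A caller might, for example, pass a distance based on the difference in channel counts; the renderer would then up-mix or down-mix from the selected layout as described in rule 3.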

num_audio_elements shall specify the number of audio elements that are used in this mix presentation to generate the final output audio signal for playback.

audio_element_friendly_label shall specify a human-friendly label to describe the referenced audio element.

audio_element_id shall indicate the unique ID associated with a specific audio element that is used in this mix presentation.

rendering_config() is a class that provides the metadata required for rendering the referenced audio element.

element_mix_config() is a class that provides the metadata required for applying any processing to the referenced and rendered audio element before being summed with other processed audio elements.

mix_loudness_info() is a class that provides the loudness information and statistics for the final output audio signal.

mix_bus_config() is a class that provides the metadata required for applying any post-processing to the mixed audio signal to generate the final output audio signal for playback.

4.1.9. Parameter Block OBU Syntax and Semantics

This section specifies the OBU payload of OBU_IA_Parameter_Block.

The metadata specified in this OBU defines the parameter values for an algorithm for an indicated duration, including any animation of the parameter values over this duration. The metadata shall be used in conjunction with a corresponding parameter definition and parameter data specification. The parameter definition shall be specified based on ParamDefinition(). The parameter data shall provide the values to apply in each parameter block. These shall be specified using the AnimatedParameterData() function template if parameter animation is supported.

Syntax

class parameter_block_obu() {
  leb128() parameter_id;
  leb128() duration;
  leb128() num_segments;
  leb128() constant_segment_interval;

  leb128() param_definition_type = get_param_definition_type(parameter_id);

  for (i = 0; i < num_segments; i++) {
    if (constant_segment_interval == 0) {
      leb128() segment_interval;
    }

    if (param_definition_type == PARAMETER_DEFINITION_MIX_GAIN) {
      leb128() animation_type;
      mix_gain_parameter_data(animation_type);
    }
    if (param_definition_type == PARAMETER_DEFINITION_DEMIXING_INFO) {
      demixing_info_parameter_data();
    }
    if (param_definition_type == PARAMETER_DEFINITION_RECON_GAIN_INFO) {
      recon_gain_info_parameter_data();
    }
  }
}

Semantics

parameter_id shall indicate the unique ID that is associated with a specific parameter definition. All parameter blocks that provide data for that parameter definition shall have the same parameter_id.

duration shall specify the duration for which this parameter block is valid and applicable. The duration shall be expressed as the number of ticks at the rate indicated by the time_base specified in the corresponding parameter definition.

num_segments shall specify the number of different sets of parameter values specified in this parameter block, where each set describes a different segment of the timeline, contiguously.

constant_segment_interval shall specify the interval of each segment, in the case where all segments have equal intervals. If all segments do not have equal intervals, the value of constant_segment_interval shall be set to 0. This value shall be expressed as the number of ticks at the rate indicated by the time base specified in the corresponding parameter definition.

get_param_definition_type() is a run-time function that shall map the parameter_id to its registered parameter definition type. All parameter definition types described in this version of the specification are listed in the table below, along with their associated parameter definitions.

param_definition_type : Parameter definition type            : Parameter definition
          0           : PARAMETER_DEFINITION_MIX_GAIN        : MixGainParamDefinition
          1           : PARAMETER_DEFINITION_DEMIXING_INFO   : DemixingInfoParamDefinition
          2           : PARAMETER_DEFINITION_RECON_GAIN_INFO : ReconGainInfoParamDefinition

NOTE: param_definition_type is not coded in the bitstream but is inferred at run time based on parameter_id.

segment_interval shall specify the interval for the given segment.
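The relationship between constant_segment_interval and the per-segment segment_interval fields can be sketched as follows. The helper names are hypothetical, and the sketch assumes that segments are laid out back to back starting at tick 0 of the parameter block, which follows from the description of segments as contiguous.

```python
from typing import List, Sequence


def segment_intervals(
    num_segments: int,
    constant_segment_interval: int,
    coded_segment_intervals: Sequence[int] = (),
) -> List[int]:
    """Return the interval, in ticks, of each segment in a parameter block."""
    if constant_segment_interval != 0:
        # All segments share one interval; no segment_interval fields are coded.
        return [constant_segment_interval] * num_segments
    if len(coded_segment_intervals) != num_segments:
        raise ValueError("one segment_interval is expected per segment")
    return list(coded_segment_intervals)


def segment_start_ticks(intervals: Sequence[int]) -> List[int]:
    """Return the start tick of each segment, assuming contiguous segments."""
    starts, tick = [], 0
    for interval in intervals:
        starts.append(tick)
        tick += interval
    return starts
```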

animation_type shall specify how the parameter values in this parameter block shall be animated.

animation_type : Animation Type
       0       : STEP
       1       : BEZIER

In the case where animation_type is equal to BEZIER, parameters for the linear and quadratic Bezier curves may be defined in this version of the specification.

Classes that take animation_type as an input argument shall use the AnimatedParameterData() function template.

template <class T>
class AnimatedParameterData(animation_type) {
  if (animation_type == STEP) {
    T start_point_value;
  }
  if (animation_type == BEZIER) {
    T start_point_value;
    T end_point_value;
    T control_point_value;
    unsigned int (8) control_point_relative_time;
  }
}

start_point_value shall specify the parameter value that is applied at the start of the segment.

end_point_value shall specify the parameter value that is applied at the end of the segment.

control_point_value shall specify the parameter value of the middle control point of a quadratic Bezier curve, i.e. its y-axis value. If this animation implements a linear Bezier curve, control_point_value shall be ignored by the IA parser.

control_point_relative_time shall specify the time of the middle control point of a quadratic Bezier curve, i.e. its x-axis value. This value is expressed as a fraction of the parameter segment interval, with valid values in the range of 0 to 1, inclusive. A value equal to 0 or 1 shall indicate that this animation implements a linear Bezier curve, in which case control_point_value shall be ignored by the IA parser. It is stored as an 8-bit, unsigned, fixed-point value with 8 fractional bits (i.e. Q0.8 in [Q-Format]).
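To make the BEZIER fields concrete, the following Python sketch evaluates a parameter value at a given point within a segment. It is an assumption-laden illustration: this section does not give the evaluation formula, so the sketch uses the standard quadratic Bezier curve through the points (0, start_point_value), (control_point_relative_time, control_point_value) and (1, end_point_value), and solves for the curve parameter t from the elapsed fraction x of the segment. The function names are hypothetical.

```python
import math


def q0_8_to_float(raw: int) -> float:
    """Convert an 8-bit Q0.8 field (e.g. control_point_relative_time) to a float."""
    if not 0 <= raw <= 0xFF:
        raise ValueError("Q0.8 is an unsigned 8-bit value")
    return raw / 256.0


def bezier_value(start: float, end: float, control: float,
                 control_time: float, x: float) -> float:
    """Value at elapsed fraction x (0..1) of the segment interval."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie within the segment")
    if control_time <= 0.0 or control_time >= 1.0:
        # Linear Bezier: the control point value is ignored.
        return start + (end - start) * x
    # Time axis: x(t) = (1 - 2c) t^2 + 2c t, with c = control_time. Solve for t.
    a = 1.0 - 2.0 * control_time
    if abs(a) < 1e-12:
        t = x  # control point at mid-interval: time is linear in t
    else:
        t = (-control_time + math.sqrt(control_time ** 2 + a * x)) / a
    # Value axis: standard quadratic Bezier blend of the three values.
    return (1 - t) ** 2 * start + 2 * (1 - t) * t * control + t ** 2 * end
```

With the control point placed exactly on the straight line between the start and end points, the result reduces to linear interpolation, which is a convenient sanity check.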

4.1.9.1. Parameter Definition Syntax and Semantics

Parameter definition classes shall inherit from the abstract ParamDefinition() class. They may optionally further provide default parameter values, which are applied when there are no parameter blocks available.

Syntax

abstract class ParamDefinition() {
  leb128() parameter_id;
  leb128() time_base;
}

Semantics

parameter_id shall indicate the unique ID in an IA bitstream for a given parameter.

time_base shall specify the time base used by this parameter, expressed as seconds per tick. Time-related fields associated with this parameter, such as durations and intervals, shall be expressed in the number of ticks.

4.1.10. Audio Frame OBU Syntax and Semantics

This section specifies OBU payloads of OBU_IA_Audio_Frame and OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID21.

The first 22 audio substreams in an IA sequence may use the OBU types OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID21, which have predefined audio substream IDs associated with them. This avoids the need to manually specify an audio_substream_id.

Syntax

class audio_frame_obu_with_no_id() {
  leb128() audio_substream_id;
  audio_frame_obu(audio_substream_id);
}
class audio_frame_obu(audio_substream_id) {
  unsigned int (8*coded_frame_size) audio_frame();
}

Semantics

audio_substream_id shall indicate a unique ID in an IA sequence for a given substream. All Audio Frame OBUs of the same substream shall have the same audio_substream_id.

This value must be greater than or equal to 22, in order to avoid collision with the reserved IDs for the OBU types OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID21.
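The mapping between OBU types and substream IDs can be sketched in Python as follows. The constant and function names are hypothetical; the arithmetic follows the audio_frame_obu(obu_type - 9) call in the top-level OBU syntax and the lower bound of 22 for explicitly coded IDs.

```python
from typing import Optional

OBU_IA_AUDIO_FRAME = 8        # audio_substream_id is coded in the payload
OBU_IA_AUDIO_FRAME_ID0 = 9    # first OBU type with a predefined ID
OBU_IA_AUDIO_FRAME_ID21 = 30  # last OBU type with a predefined ID


def implicit_substream_id(obu_type: int) -> Optional[int]:
    """Return the predefined substream ID for obu_type 9..30, else None."""
    if OBU_IA_AUDIO_FRAME_ID0 <= obu_type <= OBU_IA_AUDIO_FRAME_ID21:
        return obu_type - OBU_IA_AUDIO_FRAME_ID0
    return None


def check_explicit_substream_id(audio_substream_id: int) -> int:
    """Validate an ID coded in an OBU_IA_Audio_Frame payload."""
    if audio_substream_id < 22:
        raise ValueError("explicit audio_substream_id must be >= 22")
    return audio_substream_id
```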

coded_frame_size is the size of audio_frame() in bytes.

audio_frame() is the raw coded audio data for the frame. It shall be an Opus packet of [RFC6716] for IAC-OPUS, a raw_data_block() of [AAC] for IAC-AAC-LC and a FRAME of [FLAC] for IAC-FLAC.

For IAC-LPCM, audio_frame() shall be LPCM samples. When more than one byte is used to represent an LPCM sample, the byte order shall be little endian.
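As an illustration of the little-endian byte order for IAC-LPCM, the following Python sketch unpacks an audio_frame() payload into integer samples. The function name is hypothetical, and two details are assumptions not stated in this section: that samples are signed two's complement, and that the sample size (16 bits by default here) is known from the decoder configuration.

```python
from typing import List


def decode_lpcm_samples(audio_frame: bytes, bytes_per_sample: int = 2) -> List[int]:
    """Unpack little-endian LPCM samples from an audio_frame() payload."""
    if bytes_per_sample < 1:
        raise ValueError("bytes_per_sample must be positive")
    if len(audio_frame) % bytes_per_sample:
        raise ValueError("payload is not a whole number of samples")
    return [
        # Assumption: samples are signed two's complement integers.
        int.from_bytes(audio_frame[i:i + bytes_per_sample], "little", signed=True)
        for i in range(0, len(audio_frame), bytes_per_sample)
    ]
```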

For this version of the specification, all audio frames for a given substream must be gapless.

4.1.11. Temporal Delimiter OBU Syntax and Semantics

This section specifies the OBU payload of OBU_IA_Temporal_Delimiter.

Syntax

class temporal_delimiter_obu() {
}

NOTE: The Temporal Delimiter OBU has an empty payload.

4.1.12. Sync OBU Syntax and Semantics

This section specifies the OBU payload of OBU_IA_Sync.

Syntax

class sync_obu() {
  leb128() global_offset;
  leb128() num_obu_ids;
  for (i = 0; i < num_obu_ids; i++) {
    leb128() obu_id;
    unsigned int (1) obu_data_type;
    unsigned int (1) reinitialize_decoder;
    unsigned int (6) reserved;
    sleb128() relative_offset;
  }
  leb128() concatenation_rule;
}

Semantics

global_offset shall specify the offset that is applied to all substreams and parameters specified in this Sync OBU, in addition to their individual relative offsets.

For this version of the specification, the value of global_offset shall be set to 0.

num_obu_ids shall specify the number of substream and parameter IDs that this Sync OBU specifies the offset for.

obu_id shall specify the unique ID associated with the substream or parameter that is being referred to.

obu_data_type shall specify the type of data that is being referred to.

obu_data_type : Type of OBU data
      0       : SUBSTREAM
      1       : PARAMETER

reinitialize_decoder shall be used to specify the behaviour of a decoder when encountering gaps in the audio substream, where the gap shall be identified as described in § 6.2 Synchronizing Data OBUs. If obu_data_type does not equal SUBSTREAM, an IAC-OBU parser shall ignore this field.

If reinitialize_decoder = 0, the decoder shall not be reinitialized before decoding the audio frames after the gap. This may be used in the case where it is preferable for the decoder to fill the gap with silence instead.

If reinitialize_decoder = 1, the decoder shall be reinitialized before decoding the audio frames after the gap. If a pre-skip is specified in the relevant Codec Config OBU, it is applicable after reinitializing the decoder.

For this version of the specification, the value of reinitialize_decoder shall be set to 0. If a value of 1 is seen, the IA sequence shall be rejected as invalid.

reserved shall be set to 0. Reserved units are for future use and shall be ignored by an IAC-OBU parser.

relative_offset is the offset that is applied to the first audio frame or parameter block with the referenced obu_id that comes after this Sync OBU. It describes the position of audio and parameters in a local frame of reference. The local frame of reference is unique for each Sync OBU.

concatenation_rule shall specify the type of concatenation rule that is applied to position the audio frames and parameters that occur after a Sync OBU with respect to the timeline before the Sync OBU. A value of 0 shall indicate that Concatenation Rule 1 specified in § 6.2 Synchronizing Data OBUs shall be used, while a value of 1 shall indicate that Concatenation Rule 2 shall be used.

4.2. Detailed OBU Syntax and Semantics

4.2.1. Scalable Channel Layout Config Syntax and Semantics

scalable_channel_layout_config() contains information regarding the configuration of scalable channel audio.

Syntax

class scalable_channel_layout_config() {
  unsigned int (3) num_layers;
  unsigned int (5) reserved;
  for (i = 1; i <= num_layers; i++) {
    channel_audio_layer_config(i);
  }
}

class channel_audio_layer_config(i) {
  unsigned int (4) loudspeaker_layout(i);
  unsigned int (1) output_gain_is_present_flag(i);
  unsigned int (1) recon_gain_is_present_flag(i);
  unsigned int (2) reserved;
  unsigned int (8) substream_count(i);
  unsigned int (8) coupled_substream_count(i);
  signed int (16) loudness(i);
  if (output_gain_is_present_flag(i) == 1) {
    unsigned int (6) output_gain_flag(i);
    unsigned int (2) reserved;
    signed int (16) output_gain(i);
  }
}

When an audio element is composed of G(r) number of substreams, scalable channel audio for the audio element shall be layered into num_layers = r number of ChannelGroups.

Immersive Audio Sequence with scalable channel audio (before OBU packing)

Semantics

num_layers shall indicate the number of ChannelGroups for scalable channel audio. It shall not be set to zero and its maximum number shall be limited to 6.

channel_audio_layer_config() is a class that provides the information regarding the configuration of a ChannelGroup for scalable channel audio. channel_audio_layer_config(i) shall provide information regarding the configuration of ChannelGroup #i.

loudspeaker_layout shall indicate the channel layout for the channels to be reconstructed from the preceding ChannelGroups and the current ChannelGroup among the ChannelGroups for scalable channel audio.

In the current version of the specification, loudspeaker_layout shall indicate one of 9 channel layouts including Mono, Stereo, 5.1ch, 5.1.2ch, 5.1.4ch, 7.1ch, 7.1.2ch, 7.1.4ch and 3.1.2ch. Where,

Loudspeaker Layout (4 bits) :  Channel Layout  : Loudspeaker Location Ordering
             0000           :       Mono       : C
             0001           :      Stereo      : L/R
             0010           :      5.1ch       : L/C/R/Ls/Rs/LFE
             0011           :     5.1.2ch      : L/C/R/Ls/Rs/Ltf/Rtf/LFE
             0100           :     5.1.4ch      : L/C/R/Ls/Rs/Ltf/Rtf/Ltr/Rtr/LFE
             0101           :      7.1ch       : L/C/R/Lss/Rss/Lrs/Rrs/LFE
             0110           :     7.1.2ch      : L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/LFE
             0111           :     7.1.4ch      : L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/Ltb/Rtb/LFE
             1000           :     3.1.2ch      : L/C/R/Ltf/Rtf/LFE
            others          :     reserved     :
Where, C: Center, L: Left, R: Right, Ls: Left Surround, Lss: Left Side Surround, 
Rs: Right Surround, Rss: Right Side Surround, 
Ltf: Left Top Front, Rtf: Right Top Front, Ltr: Left Top Rear, Rtr: Right Top Rear, 
Ltb: Left Top Back, Rtb: Right Top Back, LFE: Low-Frequency Effects
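
As a non-normative illustration, the table above can be held as a simple lookup keyed by the 4-bit loudspeaker_layout value. The names LOUDSPEAKER_LAYOUTS and channels_for_layout are hypothetical and are not defined by this specification.

```python
# Non-normative sketch: loudspeaker_layout (4 bits) -> (layout name, channel order).
# The entries mirror the table above; reserved values are rejected.
LOUDSPEAKER_LAYOUTS = {
    0b0000: ("Mono", "C"),
    0b0001: ("Stereo", "L/R"),
    0b0010: ("5.1ch", "L/C/R/Ls/Rs/LFE"),
    0b0011: ("5.1.2ch", "L/C/R/Ls/Rs/Ltf/Rtf/LFE"),
    0b0100: ("5.1.4ch", "L/C/R/Ls/Rs/Ltf/Rtf/Ltr/Rtr/LFE"),
    0b0101: ("7.1ch", "L/C/R/Lss/Rss/Lrs/Rrs/LFE"),
    0b0110: ("7.1.2ch", "L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/LFE"),
    0b0111: ("7.1.4ch", "L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/Ltb/Rtb/LFE"),
    0b1000: ("3.1.2ch", "L/C/R/Ltf/Rtf/LFE"),
}


def channels_for_layout(code):
    """Return (layout name, ordered list of loudspeaker labels) for a 4-bit code."""
    if code not in LOUDSPEAKER_LAYOUTS:
        raise ValueError("reserved loudspeaker_layout value: %d" % code)
    name, order = LOUDSPEAKER_LAYOUTS[code]
    return name, order.split("/")
```

The length of the returned list gives the number of channels to be reconstructed for that layout.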

output_gain_is_present_flag shall indicate whether the output_gain information fields for the ChannelGroup are present.

recon_gain_is_present_flag shall indicate whether the recon_gain information fields for the ChannelGroup are present in Recon_Gain_Info().

loudness shall indicate the loudness value of the downmixed channels, for the channel layout which is indicated by loudspeaker_layout, from the original channel audio. It shall be stored as a fixed-point value with 8 fractional bits (i.e. Q7.8 in [Q-Format]) and should be in LKFS based on [ITU1770-4], so it shall represent zero or a negative value.

output_gain_flags shall indicate the channels which output_gain is applied to. If a bit is set to 1, output_gain shall be applied to the channel. Otherwise, output_gain shall not be applied to the channel.

Bit position : Channel Name
    b5(MSB)  : Left channel (L1, L2, L3)
      b4     : Right channel (R2, R3)
      b3     : Left Surround channel (Ls5)
      b2     : Right Surround channel (Rs5)
      b1     : Left Top Front channel (Ltf)
      b0     : Right Top Front channel (Rtf)

output_gain shall indicate the gain value to be applied to the mixed channels which are indicated by output_gain_flags. It is 20*log10 of the factor by which to scale the mixed channels. It is stored in a 16-bit, signed, two’s complement fixed-point value with 8 fractional bits (i.e. Q7.8 in [Q-Format]). Where, each mixed channel is generated by downmixing two or more input channels.
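
The field layout and the Q7.8 conversion described above can be sketched as follows. This is a non-normative Python example that assumes the multi-byte fields are stored most significant byte first; the function names and dictionary keys are hypothetical.

```python
import struct


def q7_8_to_float(raw):
    """Convert a signed 16-bit Q7.8 value to a floating-point number."""
    return raw / 256.0


def parse_channel_audio_layer_config(buf, pos=0):
    """Parse one channel_audio_layer_config() starting at byte offset pos.

    Returns (fields, next_pos). Assumes big-endian 16-bit fields.
    """
    b0 = buf[pos]
    layer = {
        "loudspeaker_layout": b0 >> 4,                # upper 4 bits
        "output_gain_is_present_flag": (b0 >> 3) & 1,
        "recon_gain_is_present_flag": (b0 >> 2) & 1,  # low 2 bits are reserved
        "substream_count": buf[pos + 1],
        "coupled_substream_count": buf[pos + 2],
    }
    (loudness_raw,) = struct.unpack_from(">h", buf, pos + 3)
    layer["loudness_lkfs"] = q7_8_to_float(loudness_raw)
    pos += 5
    if layer["output_gain_is_present_flag"]:
        layer["output_gain_flags"] = buf[pos] >> 2    # 6 flag bits, b5 is the MSB
        (gain_raw,) = struct.unpack_from(">h", buf, pos + 1)
        layer["output_gain_db"] = q7_8_to_float(gain_raw)
        pos += 3
    return layer, pos
```

For example, a raw loudness of -6144 corresponds to -24.0 LKFS, and a raw output_gain of 384 corresponds to 1.5 dB.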

4.2.2. Ambisonics Config Syntax and Semantics

ambisonics_config() contains information regarding the configuration of Ambisonics.

Syntax

class ambisonics_config() {
  leb128() ambisonics_mode;
  if (ambisonics_mode == MONO) {
    ambisonics_mono_config();
  } else if (ambisonics_mode == PROJECTION) {
    ambisonics_projection_config();
  }
}

class ambisonics_mono_config() {
  unsigned int (8) output_channel_count (C);
  unsigned int (8) substream_count (N);
  unsigned int (8 * C) channel_mapping;
}

class ambisonics_projection_config() {
  unsigned int (8) output_channel_count (C);
  unsigned int (8) substream_count (N);
  unsigned int (8) coupled_substream_count (M);
  unsigned int (16 * (N + M) * C) demixing_matrix;
}

Semantics

ambisonics_mode shall specify the method of coding Ambisonics.

ambisonics_mode: Method of coding Ambisonics.
   0    : MONO
   1    : PROJECTION

If ambisonics_mode is equal to MONO, this shall indicate that the Ambisonics channels are coded as individual mono substreams.

If ambisonics_mode is equal to PROJECTION, this shall indicate that the Ambisonics channels are first linearly projected onto another subspace before coding as a mix of coupled stereo and mono substreams.

output_channel_count shall be the same as the channel count in [RFC8486].

substream_count shall specify the number of audio substreams. It must be the same as num_substreams in its corresponding audio_element().

channel_mapping shall be the same as the one for ChannelMappingFamily = 2 in [RFC8486].

coupled_substream_count shall specify the number of referenced substreams that are coded as coupled stereo channels, where M <= N.

demixing_matrix shall be the same as the one for ChannelMappingFamily = 3 in [RFC8486] except the byte order of each of matrix coefficients shall be converted to big endian.
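
A non-normative Python sketch of reading ambisonics_config() is given below. It assumes the leb128() encoding of [AV1-Convention] (7 payload bits per byte, least significant group first, at most 8 bytes) and reads each demixing matrix coefficient as a signed 16-bit big-endian value; the signedness and the function names are assumptions made for illustration.

```python
import struct

MONO, PROJECTION = 0, 1


def read_leb128(buf, pos=0):
    """Decode an unsigned leb128 value; returns (value, next_pos)."""
    value = 0
    for i in range(8):
        byte = buf[pos + i]
        value |= (byte & 0x7F) << (7 * i)
        if not (byte & 0x80):       # top bit clear marks the last byte
            return value, pos + i + 1
    raise ValueError("leb128 value longer than 8 bytes")


def parse_ambisonics_config(buf, pos=0):
    """Parse ambisonics_config(); returns (fields, next_pos)."""
    mode, pos = read_leb128(buf, pos)
    if mode == MONO:
        c, n = buf[pos], buf[pos + 1]
        mapping = list(buf[pos + 2:pos + 2 + c])      # one byte per output channel
        if len(mapping) != c:
            raise ValueError("truncated channel_mapping")
        fields = {"mode": mode, "output_channel_count": c,
                  "substream_count": n, "channel_mapping": mapping}
        return fields, pos + 2 + c
    if mode == PROJECTION:
        c, n, m = buf[pos], buf[pos + 1], buf[pos + 2]
        if m > n:
            raise ValueError("coupled_substream_count exceeds substream_count")
        count = (n + m) * c                           # number of 16-bit coefficients
        matrix = list(struct.unpack_from(">%dh" % count, buf, pos + 3))
        fields = {"mode": mode, "output_channel_count": c, "substream_count": n,
                  "coupled_substream_count": m, "demixing_matrix": matrix}
        return fields, pos + 3 + 2 * count
    raise ValueError("unknown ambisonics_mode: %d" % mode)
```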

4.2.3. Demixing Info Syntax and Semantics

demixing_info() specifies demixing parameter mode to be used to reconstruct output channel audio according to its loudspeaker_layout.

Syntax

class demixing_info() {
  unsigned int (3) dmixp_mode;
  unsigned int (5) reserved;
}

Semantics

dmixp_mode shall indicate a mode of pre-defined combinations of five demix parameters.

alpha and beta shall be gain values used for S7to5 down-mixer, gamma for T4to2 down-mixer, delta for S5to3 down-mixer and w_idx_offset shall be the offset to generate a gain value w used for T2toTF2 down-mixer.

IA Down-mix Mechanism

4.2.4. Recon Gain Info Syntax and Semantics

recon_gain_info() contains recon gain values for demixed channels.

Syntax

class recon_gain_info() {
  for (i=0; i< channel_audio_layer; i++) {
    if (recon_gain_is_present_flag(i) == 1) {
      leb128() recon_gain_flags(i);
      for (j=0; j< n(i); j++) {
        if (recon_gain_flag(i)(j) == 1)
          unsigned int (8) recon_gain;
      }
    }
  }
}

Semantics

recon_gain_flags shall indicate the channels which recon_gain is applied to.

recon_gain shall indicate the gain value to be applied to the channel, which is indicated by recon_gain_flags, after decoding of the following associated frames.

4.2.5. Mix Target Layout Syntax and Semantics

The mix target layout specifies the list of physical loudspeaker positions according to [ITU2051-3].

Syntax

class mix_target_layout(mix_target_layout_type) {
  if (mix_target_layout_type == LOUDSPEAKERS_SP_LABEL) {
    unsigned int (6) num_loudspeakers;
    for (i = 0; i < num_loudspeakers; i++) {
      unsigned int (8) sp_label;
    }
  } 
  else if (mix_target_layout_type == LOUDSPEAKERS_SS_CONVENTION) {
    unsigned int (4) sound_system;
    unsigned int (2) reserved;
  }
  else if (mix_target_layout_type == BINAURAL or NOT_DEFINED) {
    unsigned int (6) reserved;
  }
}

Semantics

num_loudspeakers shall specify the number of loudspeakers.

sp_label shall define the SP Label as specified in [ITU2051-3].

sp_label : SP label : sp_label : SP label : sp_label : SP label
    0    :  M+000   :    18    :  U+000   :    36    :  B+000
    1    :  M+022   :    19    :  U+022   :    37    :  B+022
    2    :  M-022   :    20    :  U-022   :    38    :  B-022
    3    :  M+SC    :    21    :  U+030   :    39    :  B+030
    4    :  M-SC    :    22    :  U-030   :    40    :  B-030
    5    :  M+030   :    23    :  U+045   :    41    :  B+045
    6    :  M-030   :    24    :  U-045   :    42    :  B-045
    7    :  M+045   :    25    :  U+060   :    43    :  B+060
    8    :  M-045   :    26    :  U-060   :    44    :  B-060
    9    :  M+060   :    27    :  U+090   :    45    :  B+090
   10    :  M-060   :    28    :  U-090   :    46    :  B-090
   11    :  M+090   :    29    :  U+110   :    47    :  B+110
   12    :  M-090   :    30    :  U-110   :    48    :  B-110
   13    :  M+110   :    31    :  U+135   :    49    :  B+135
   14    :  M-110   :    32    :  U-135   :    50    :  B-135
   15    :  M+135   :    33    :  U+180   :    51    :  B+180
   16    :  M-135   :    34    :  UH+180  :    52    :  LFE1
   17    :  M+180   :    35    :  T+000   :    53    :  LFE2
54 ~ 255 : Reserved

sound_system shall specify the sound system A to J as specified in [ITU2051-3] as follows:

4.2.6. Rendering Config Syntax and Semantics

Syntax

class rendering_config() {
  // TODO
}

Semantics

4.2.7. Element Mix Config Syntax and Semantics

element_mix_config() provides a gain value to be applied to the rendered audio element signal.

Syntax

class element_mix_config() {
  MixGainParamDefinition mix_gain;
}

Semantics

mix_gain provides the parameter definition for the gain value that is applied to all channels of the rendered audio element signal. The parameter definition is provided by MixGainParamDefinition() and the corresponding parameter data to be provided in parameter blocks is specified in mix_gain_parameter_data().

4.2.7.1. Mix Gain Parameter Definition and Data Syntax and Semantics

Syntax

class MixGainParamDefinition extends ParamDefinition() {
  signed int (16) default_mix_gain;
}

class mix_gain_parameter_data(animation_type) {
  AnimatedParameterData<signed int (16)> param_data;
}

Semantics

default_mix_gain shall specify the default mix gain value to apply when there are no mix gain parameter blocks provided. This value is expressed in dB and shall be applied to all channels in the rendered audio element. It is stored as a 16-bit, signed, two’s complement fixed-point value with 8 fractional bits (i.e. Q7.8 in [Q-Format]).

param_data shall use the AnimatedParameterData function template. Each of the values defined within this instance (start_point_value, end_point_value and control_point_value) shall be expressed in dB and shall be applied to all channels in the rendered audio element. They are stored as 16-bit, signed, two’s complement fixed-point values with 8 fractional bits (i.e. Q7.8 in [Q-Format]).

4.2.8. Mix Loudness Info Syntax and Semantics

Syntax

class mix_loudness_info() {
  signed int (16) mix_loudness;
}

Semantics

mix_loudness shall indicate the loudness value of the mixed channels, for the mix_target_layout(), from the audio elements which are specified in the mix_presentation_obu(). It is stored as a fixed-point value with 8 fractional bits (i.e. Q7.8 in [Q-Format]) and the value should be in LKFS based on [ITU1770-4], so it shall represent zero or a negative value.

4.2.9. Mix Bus Config Syntax and Semantics

Syntax

class mix_bus_config() {
  drc_config();
}

class drc_config() { // TODO }

Semantics

4.3. Codec Specific

This section defines codec specific information for Codec_Specific_Info and Substream.

For legacy codecs, Decoder_Config() shall be exactly the same information as the conventional file parser feeds to the codec decoders for decoding of the substream. For future codecs, Decoder_Config() shall include all decoding parameters which are required to decode Substreams.

4.3.1. IAC-OPUS Specific

Codec_Specific_Info for IAC-OPUS shall conform to ID Header with ChannelMappingFamily = 0 of [RFC7845] with following constraints:

Substream format shall be an Opus packet of [RFC6716] which contains only one single frame of mono or stereo channels and which has a non-delimiting frame structure.

4.3.2. IAC-AAC-LC Specific

Codec_ID shall be mp4a.

Decoder_Config() for IAC-AAC-LC shall be DecoderConfigDescriptor() of [MP4-Systems], which is a subset of ESDBox for [MP4-Audio], with following constraints:

Substream format shall be one single raw_data_block() of [AAC] which contains only one single frame of mono or stereo channels.

4.3.3. IAC-FLAC Specific

Codec_ID shall be fLaC, the FLAC stream marker in ASCII, meaning byte 0 of the stream is 0x66, followed by 0x4C 0x61 0x43.

Decoder_Config() for IAC-FLAC shall be METADATA_BLOCK of [FLAC].

Substream format shall be FRAME of [FLAC], which is composed of FRAME_HEADER, followed by SUBFRAME(s) (one SUBFRAME per channel) and followed by FRAME_FOOTER.

4.3.4. IAC-LPCM Specific

Codec_ID shall be lpcm.

Decoder_Config() for IAC-LPCM shall be as follows:

class decoder_config(lpcm) {
  unsigned int (32) sample_rate;
  unsigned int (8) sample_size;
}

sample_rate shall indicate the sample rate of the input audio in Hz.

sample_size shall indicate the size of a PCM sample in bit units. The value shall be less than or equal to 24.

Substream format shall be the LPCM audio samples for the frame size.
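
A non-normative Python sketch of reading decoder_config(lpcm) is shown below. It assumes the two fields are packed back to back with sample_rate stored most significant byte first; the function name is hypothetical.

```python
import struct


def parse_lpcm_decoder_config(buf):
    """Return (sample_rate, sample_size) from a 5-byte decoder_config(lpcm)."""
    # ">IB" = big-endian unsigned 32-bit followed by unsigned 8-bit, no padding.
    sample_rate, sample_size = struct.unpack_from(">IB", buf, 0)
    if sample_size > 24:
        raise ValueError("sample_size shall be less than or equal to 24")
    return sample_rate, sample_size
```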

5. Profiles

The IA Profiles define a set of capabilities that are required to parse, decode and process the corresponding IA sequence.

5.1. IA Simple Profile

This section specifies the conformance points of the simple profile.

Restrictions on the IA sequence:

Capabilities of the IA parser, decoder and processor:

5.2. IA Base Profile

This section specifies the conformance points of the base profile.

Restrictions on IA sequence:

Capabilities of the IA parser, decoder and processor:

5.3. IA Enhanced Profile

This section specifies the conformance points of the enhanced profile.

Restrictions on IA sequence:

Capabilities of the IA parser, decoder and processor:

6. Standalone IAC Representation

This section details the order in which the OBUs shall be sequenced in a standalone IAC representation. It further specifies how the Data OBUs shall be synchronized, with the aid of the Sync OBUs.

6.1. OBU Sequence Order

IA sequence shall be composed of a series of OBUs as follows:

6.1.1. Descriptor OBUs

A set of Descriptor OBUs shall be placed at the beginning of the bitstream in the following order:
  1. One Start Code OBU

  2. All Codec Config OBUs

  3. All Mix Presentation OBUs

  4. All Audio Element OBUs
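
The ordering above can be checked with a short routine. The sketch below is non-normative Python; the string labels used for the OBU types and the function name are hypothetical.

```python
# Required order of Descriptor OBU types at the beginning of the bitstream.
DESCRIPTOR_ORDER = ["start_code", "codec_config", "mix_presentation", "audio_element"]


def descriptors_in_order(obu_types):
    """Return True if the list of Descriptor OBU types follows the required order."""
    # Exactly one Start Code OBU, and it comes first.
    if obu_types.count("start_code") != 1 or obu_types[0] != "start_code":
        return False
    if any(t not in DESCRIPTOR_ORDER for t in obu_types):
        return False
    ranks = [DESCRIPTOR_ORDER.index(t) for t in obu_types]
    return ranks == sorted(ranks)   # types never go back to an earlier group
```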

6.1.2. Data OBUs

One Sync OBU shall be placed immediately after the Descriptor OBUs. This shall be followed by a sequence of Audio Frame OBUs, Parameter Block OBUs, one or more additional Sync OBUs and one or more Temporal Delimiter OBUs, according to the rules below:

Additionally, the following constraints apply to the Audio Frame and Parameter Block OBUs:

6.1.3. Refreshing Descriptor OBUs

The above describes the full sequence of OBUs for a given set of descriptor OBUs and their associated data OBUs. If the IAC configuration changes, a new set of descriptor OBUs is required. In that case, a new sequence of the complete set of descriptor OBUs, a Sync OBU and their corresponding data OBUs shall follow, in the same order as described above.

The descriptor OBUs may additionally be repeated redundantly and as frequently as necessary. In this case, the "obu_redundant_copy" field in the OBU header of each of the descriptor OBUs shall be set to 1.

If there is a set of descriptor OBUs placed mid-stream, there may be parameter blocks that came before them which are still valid and applicable for the duration after the descriptor OBUs. In this case, these parameter blocks must be redundantly copied and placed after the first Sync OBU that follows the descriptor OBUs. This ensures that any receiver joining mid-stream and encountering a set of descriptor OBUs is guaranteed to be able to receive the complete set of metadata that is applicable to all audio frames that come after.

6.2. Synchronizing Data OBUs

The audio frames and parameter data provided in the Data OBUs may be asynchronous; different audio substreams may have different audio frame sizes, parameter blocks may have different durations from the audio frames, or there may be gaps in a parameter’s timeline. This section details how these Data OBUs may be synchronized, based on their duration and the information provided in the Sync OBUs.

The Sync OBU contains two pieces of information that apply to all substreams and parameters that follow it:

1) a relative offset for each of the substreams and parameters, and

2) a global offset.

The relative offsets describe how the substreams and parameters are positioned within a local frame of reference, which is unique for each Sync OBU. For example, the Sync OBU given below indicates that Substream 1 has a start timestamp that is 15 units before Substream 2, 10 units after Parameter 1, and 25 units before Parameter 2.

ID (name) Relative offset
N/A (Global offset) 0
1 (Substream 1) -5
2 (Substream 2) +10
3 (Parameter 1) -15
4 (Parameter 2) +20

Within a Sync OBU, only the relative information between the relative offsets is meaningful for positioning it within the global frame of reference, where the method of positioning is described further below. This is not affected by any constant shift in the relative offsets. As such, each Sync OBU can have any number of variants, as long as there is a constant difference between the two variants (see the example below). This removes any restrictions on how the absolute values of the relative offsets are selected. For example, some implementations may wish to always set the relative offset of an arbitrarily selected substream or parameter to 0.

ID (name) Relative offset (variant 1) Relative offset (variant 2)
N/A (Global offset) 0 0
1 (Substream 1) -5 +10
2 (Substream 2) +40 +55
3 (Parameter 1) -15 0
4 (Parameter 2) +20 +35
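
The equivalence of the two variants can be verified by removing the constant shift. The following non-normative Python sketch shifts each set of relative offsets so that the smallest one becomes 0; the function name is hypothetical.

```python
def normalise_offsets(relative_offset):
    """Shift all relative offsets so that the smallest one becomes 0."""
    base = min(relative_offset.values())
    return {obu_id: offset - base for obu_id, offset in relative_offset.items()}


# Relative offsets from the table above, keyed by ID.
variant_1 = {1: -5, 2: 40, 3: -15, 4: 20}
variant_2 = {1: 10, 2: 55, 3: 0, 4: 35}

# Both variants describe the same local frame of reference.
assert normalise_offsets(variant_1) == normalise_offsets(variant_2)
```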

The global offset defines an additional offset that is applied to all substreams and parameters, and can be used to express intentional gaps between the substreams and parameters associated with two Sync OBUs.

The local frame of reference can be positioned in a global frame of reference by using one of the two concatenation rules provided below. These rules specify how two timelines associated with different Sync OBUs shall be aligned.

Concatenation Rule 1

Ignoring the global offset, the new timeline after a Sync OBU is shifted as early as possible such that the earliest substream or parameter in the new timeline concatenates with its counterpart in the previous timeline. Then, the global offset is applied to additionally shift the new timeline.

Concatenation Rule 2

Ignoring the global offset, the new timeline after a Sync OBU is shifted as early as possible such that the earliest substream in the new timeline concatenates with the latest substream in the old timeline. Then, the global offset is applied to additionally shift the new timeline.

Choose if this applies to 1) all audio substreams + params, or to 2) audio substreams only. Option 1) can lead to audio gaps. 2) can lead to overlapping params. See https://github.com/AOMediaCodec/iac/issues/102

The algorithm below may be used to implement the concatenation rules.

// Timestamp at the end of the most recent frame before the Sync OBU, for a
// given substream or parameter ID.
end_timestamp[ID];

// Offset specified by the new Sync OBU.
relative_offset[ID];

if (concatenation_rule_1) {
  // The timestamp for the “zero offset” is computed by applying relative
  // offsets to the end timestamps, and seeing which one comes latest in time.
  timestamp_for_zero_offset =
    max(end_timestamp[ID] - relative_offset[ID] for each ID)
    + global_offset;
}

if (concatenation_rule_2) {
  timestamp_for_zero_offset =
    max(end_timestamp[ID]) - min(relative_offset[ID])
    + global_offset;
}

// Find the timestamp of each substream and parameter relative to the new
// “zero”.
For each ID:
  next_start_timestamp[ID] = timestamp_for_zero_offset + relative_offset[ID]
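
The same algorithm can be written as a runnable, non-normative Python sketch. It follows the pseudocode above literally, so Concatenation Rule 2 is evaluated over all IDs; the function name and the sample timestamps are hypothetical.

```python
def next_start_timestamps(end_timestamp, relative_offset, global_offset=0, rule=1):
    """Return the start timestamp of each ID after a Sync OBU.

    end_timestamp and relative_offset are dictionaries keyed by ID.
    """
    ids = list(relative_offset)
    if rule == 1:
        # Latest "zero" implied by any ID when its new data follows its old data.
        zero = max(end_timestamp[i] - relative_offset[i] for i in ids)
    elif rule == 2:
        # Earliest new start lines up with the latest old end.
        zero = max(end_timestamp[i] for i in ids) - min(relative_offset[i] for i in ids)
    else:
        raise ValueError("rule must be 1 or 2")
    zero += global_offset
    return {i: zero + relative_offset[i] for i in ids}


# Hypothetical end timestamps, with the relative offsets of the first example.
end = {1: 100, 2: 120, 3: 90, 4: 130}
rel = {1: -5, 2: 10, 3: -15, 4: 20}
starts_rule_1 = next_start_timestamps(end, rel, rule=1)
starts_rule_2 = next_start_timestamps(end, rel, rule=2)
```

With Rule 1, no ID starts before its own previous end timestamp and at least one ID is gapless. With Rule 2, the earliest start coincides with the latest previous end timestamp.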

The figure below illustrates three examples of how the same new timeline (orange and blue) is aligned with different previous timelines (white) when using Concatenation Rule 1.

Aligning timelines before and after a Sync OBU using the concatenation rule.

Add a similar diagram for Concat Rule 2 when its corresponding issue is resolved.

Since the Data OBUs between two Sync OBUs must be gapless, the remainder of the timeline can be inferred from the durations of the audio frames and parameter blocks. The duration of an audio frame is specified by the num_samples_per_frame field in its corresponding Codec Config OBU, while the duration of a parameter block is specified in its duration field.

7. ISOBMFF IAC Encapsulation

7.1. General Requirements & Brands

A file conformant to this specification satisfies the following:

Parsers shall support the structures required by the iso6 brand and MAY support structures required by further ISOBMFF structural brands.

7.2. ISOBMFF IAC Encapsulation with single track

This section describes the basic data structures used to signal encapsulation of IA sequence in [ISOBMFF] containers.

7.2.1. Requirement of IA sequence

IA sequence shall comply with the bitstream which is specified in § 5.1 IA Simple Profile or § 5.2 IA Base Profile for encapsulation of ISOBMFF with single track.

7.2.2. Encapsulation Scheme

During encapsulation process, OBUs of IA sequence are encapsulated into [ISOBMFF] as follows:

IAC Encapsulation Scheme

7.2.3. IA Sample Entry

Sample Entry Type: iamf
Container:         Sample Description Box ('stsd')
Mandatory:         Yes
Quantity:          One or more.

The IASampleEntry identifies that the track contains IA Samples, and uses one single codec specific box.

Syntax

class IASampleEntry extends AudioSampleEntry('iamf') {
  unsigned int (8) version;
  unsigned int (8) profile_version;
  CodecSpecificBox config;
}

No optional boxes of AudioSampleEntry shall be present.

Semantics

Both channelcount and samplerate fields of AudioSampleEntry shall be ignored.

version and profile_version shall be the same as the ones in start_code_obu.

7.2.4. Codec Specific Box

This section describes a codec specific box for the decoding parameters, which is defined by codec_id of audio_substream_config(), to decode one single substream of IA sequence. iamf shall contain only one single codec specific box regardless of the number of substreams in IA sequence, so the codec specific box is applied to all substreams in the sample data.

7.2.4.1. OPUS Specific Box

This shall be OpusSpecificBox (dOps) for the Opus AudioSampleEntry which is specified in [OPUS-IN-ISOBMFF].

Box Type:  dOps
Container: IA Sample Entry ('iamf')
Mandatory: Yes
Quantity:  One

This box shall be for one single substream.

Syntax

It shall be the same as the dOps box for Opus, except that ChannelMappingFamily shall be set to 0.

Semantics

It shall be the same as the semantics in [OPUS-IN-ISOBMFF] except for the following:

7.2.4.2. MP4A Specific Box

This shall be ESDBox (esds) for mp4a which is specified in [MP4].

Box Type:  esds
Container: IA Sample Entry ('iamf')
Mandatory: Yes
Quantity:  One or more

This box shall be for one single Substream.

Syntax

It shall be the same as esds box for Low Complexity Profile of [AAC] (AAC-LC).

Semantics

It shall be the same as the semantics in esds except for the following:

We need to add specific boxes for FLAC and LPCM.

7.2.5. IA Sample Format

For tracks using the IASampleEntry, an IA Sample has the following constraints:

7.2.6. IA Sample Group

7.2.6.1. Global Descriptor Sample Group

During the encapsulation process, global descriptors shall be discarded from the IA bitstream. A new sample group for global descriptors shall be defined by using sgpd and sbgp boxes with the following requirements:

7.2.6.2. Demixing Info Sample Group

During encapsulation process, Parameter Block OBU for demixing_info shall be discarded from IA sequence. A new sample group for demixing_info() shall be defined by using sgpd and sbgp boxes with following requirements:

7.3. Common Encryption

TBA

7.4. Codecs Parameter String

DASH and other applications require defined values for the Codecs parameter specified in [RFC6381] for ISO Media tracks. The codecs parameter string for the AOM IA codec shall be:
iamf.IAC-specific-needs.Opus
iamf.IAC-specific-needs.mp4a.40.2
iamf.IAC-specific-needs.fLaC
iamf.IAC-specific-needs.lpcm

IAC-specific-needs shall be V.PV as follows:

For example, for this version of the specification

iamf.0000.0000.Opus
iamf.0000.0100.mp4a.40.2

8. ISOBMFF IAC Decapsulation

8.1. ISOBMFF IAC Decapsulation with single track

This section provides a guideline for IAC parser to reconstruct IA sequences from IAC file.

When the IAC parser feeds the reconstructed IA sequences to the IAC-OBU parser, descriptor OBUs shall be placed first and followed by Temporal Units.

Below figure shows the mirroring process of the encapsulation scheme of IA sequence specified in § 7 ISOBMFF IAC Encapsulation.

IAC Decapsulation Guideline

During decapsulation process, IAC file is decapsulated into IA sequences which conform to § 4 Open Bitstream Unit (OBU) Syntax and Semantics as follows:

codec_id and decoder_config() for IAC-OPUS is generated as follows:

codec_id and decoder_config() for IAC-AAC-LC is generated as follows:

9. IAC processing

This section provides a guideline for IA decoding for a given IA sequence.

IA decoding can be done by using the combination of following decoding processing.

Ambisonics decoding: it shall conform to [RFC8486] except codec specific processing and shall output Ambisonics channels in ACN (Ambisonics Channel Number) order.

Scalable Channel Audio decoding, it shall output the channel audio (e.g. 3.1.2ch or 7.1.4ch) for the target channel layout.

IA decoder is composed of OBU parser, Codec decoder, Audio Element Renderer and Post-processor as depicted in below figure.

IA Decoder Configuration

9.1. Ambisonics decoding

This section describes the decoding of Ambisonics.

Below figure shows the decoding flowchart of Ambisonics decoding.

Ambisonics Decoding Flowchart

9.2. Scalable Channel Audio decoding

This section describes the decoding of Scalable Channel Audio.

Below figure shows the decoding flowchart of the decoding for Scalable Channel Audio.

Scalable Channel Audio Decoding Flowchart

For a given loudspeaker layout (i.e. CL #i) among the list of loudspeaker_layout in scalable channel layout config,

Following sections, § 9.2.1 Gain, § 9.2.2 De-mixer and § 9.2.3 Recon Gain are only needed for decoding of scalable audio with num_layers > 1.

9.2.1. Gain

Gain module is the mirror process of Attenuation module. It recovers the reduced sample values using Output_Gain when its flag for ChannelGroup #j is on. When its flag is off, then this module shall be bypassed for ChannelGroup #j. Output_Gain(j) for ChannelGroup #j shall be applied to all samples of the mixed channels in the ChannelGroup #j. Where, mixed channels means the mixed channels from an input channel audio (i.e. a channel audio for CL #n).

To apply the gain, an implementation MUST use the following:

Sample *= pow(10, Output_Gain(j) / (20.0*256))

Where, Output_Gain(j) is the raw 16-bit value for jth layer which is specified in channel_audio_layer_config().
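
The expression above can be exercised with the following non-normative Python sketch, which applies the gain to a list of floating-point samples. The function name is hypothetical.

```python
def apply_output_gain(samples, output_gain_raw):
    """Scale samples by the linear factor for a raw Q7.8 Output_Gain value."""
    # Dividing by 256 converts Q7.8 to dB; dividing by 20 converts dB to log10 amplitude.
    scale = 10.0 ** (output_gain_raw / (20.0 * 256))
    return [s * scale for s in samples]
```

A raw value of 0 leaves the samples unchanged, and a raw value of 5120 (20 dB) scales them by a factor of 10.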

9.2.2. De-mixer

For scalable channel audio with num_layers > 1, some channels of down-mixed audio for CL #i are delivered as is but the rest are mixed with other channels for CL #i-1.

De-mixer module reconstructs the rest of the down-mixed audio for CL #i from the mixed channels, which is passed by Gain module, and its relevant non-mixed channels using its relevant demixing parameters.

De-mixing for down-mixed audio for CL #i shall comply with the result by the combination of following surround and top de-mixers:

Initially, wIdx(0) = 0 and the value of wIdx(k) shall be derived as follows:

Mapping of wIdx(k) to w(k) should be as follows:

wIdx(k) :   w(k)
   0    :    0
   1    :  0.0179
   2    :  0.0391
   3    :  0.0658
   4    :  0.1038
   5    :  0.25
   6    :  0.3962
   7    :  0.4342
   8    :  0.4609
   9    :  0.4821
   10    : 0.5

When D_set = { x | S1 < x ≤ Si and x is an integer},

When Ti = 2,

When Ti = 4,

For example, when CL #1 = 2ch, CL #2 = 3.1.2ch, CL #3 = 5.1.2ch and CL #4 = 7.1.4ch, the rest (i.e. Ls5/Rs5/Ltf/Rtf) of the down-mixed 5.1.2ch is reconstructed as follows:

Ls5 = 1/δ(k) × (L2 - 0.707 × C - L5) and Rs5 = 1/δ(k) × (R2 - 0.707 × C - R5).
Ltf = Ltf3 - w(k) × (L2 - 0.707 × C - L5) and Rtf = Rtf3 - w(k) × (R2 - 0.707 × C - R5).
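
The example above can be sketched per sample in non-normative Python as follows. W_TABLE repeats the wIdx(k) to w(k) mapping given earlier in this section; the function name and argument names are hypothetical, and the derivation of wIdx(k) itself is not shown.

```python
# Mapping of wIdx(k) to w(k), copied from the table in this section.
W_TABLE = [0, 0.0179, 0.0391, 0.0658, 0.1038, 0.25,
           0.3962, 0.4342, 0.4609, 0.4821, 0.5]


def demix_512_from_312(l2, r2, c, l5, r5, ltf3, rtf3, delta, w_idx):
    """Reconstruct Ls5/Rs5/Ltf/Rtf of the down-mixed 5.1.2ch for one sample."""
    w = W_TABLE[w_idx]
    # Surround energy that was folded into the front channels.
    left_residual = l2 - 0.707 * c - l5
    right_residual = r2 - 0.707 * c - r5
    return {
        "Ls5": left_residual / delta,
        "Rs5": right_residual / delta,
        "Ltf": ltf3 - w * left_residual,
        "Rtf": rtf3 - w * right_residual,
    }
```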

9.2.3. Recon Gain

Recon_Gain shall be applied only to the audio samples of the de-mixed channels from the De-mixer module.

Below figure shows the smoothing scheme of Recon_Gain.

Smoothing Scheme of Recon Gain

Recommended values for specific codecs are as follows:

9.3. Mix Presentation

//To Do: Fill in the text

9.3.1. Rendering for Audio Element

This section provides a guideline for rendering based on the rendering_config() which is specified in the mix presentation OBU.

//To Do: Fill in rendering method for scene-based audio element if any

//To Do: Fill in rendering method for channel-based audio element if any

9.3.2. Mixing for Audio Elements

This section provides a guideline for mixing based on the element_mix_config() which is specified in the mix presentation OBU.

When the output channel audio of a scene-based or channel-based audio element does not match the loudspeaker layout indicated by mix_target_layout() in the mix presentation OBU:

Down-mixing matrices, which are specified in § 9.5 Down-mix Matrix, are recommended for down-mixing of the output channel audio.

When multiple audio elements are mixed into one channel audio:

After relevant processing, multiple audio elements are mixed into one channel audio according to the target loudspeaker layout with the target sampling rate by considering the synchronization in audio sample by audio sample among them.

//To Do: Fill in the text based on element_mix_config()

9.4. Post Processing

9.4.1. Loudness Normalization

Loudness normalization is done by adjusting a loudness level to -24 LKFS based on the loudness value of the target channel layout (i.e. CL #i) which is signaled in Channel_Audio_Layer_Config() or the loudness value in mix presentation OBU.

Real implementations for § 9.4.1 Loudness Normalization, § 9.4.2 DRC Control and § 9.4.3 Limiter are solely dependent on implementers (i.e., out of scope of this specification) unless the mix presentation OBU provides algorithms for those. This specification only recommends the principles for the former.
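
As one possible illustration of the principle, the non-normative Python sketch below derives a static gain from the signaled Q7.8 loudness value and applies it to the samples. Using a single static gain is an assumption of this sketch, not a requirement; the constant and function names are hypothetical.

```python
TARGET_LKFS = -24.0   # normalization target stated in this section


def normalization_gain_db(loudness_raw):
    """Gain in dB that moves the signaled loudness (raw Q7.8) to the target."""
    return TARGET_LKFS - loudness_raw / 256.0


def normalize(samples, loudness_raw):
    """Apply the static normalization gain to a list of samples."""
    scale = 10.0 ** (normalization_gain_db(loudness_raw) / 20.0)
    return [s * scale for s in samples]
```

For example, content signaled at -18 LKFS receives a gain of -6 dB.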

9.4.2. DRC Control

In this specification, DRC control can be guided by a pre-defined DRC or by the algorithm specified in mix presentation OBU.

The pre-defined DRC assumes an input loudness of -24 LKFS and targets an output loudness of -16 LKFS. The DRC control module applies the pre-defined DRC compression, assuming that the target loudness is adjusted to -16 LKFS, as follows:

Below figure shows the schematic diagram of the pre-defined DRC compression.

Pre-defined DRC Compression Scheme

The below is the equation that represents the pre-defined DRC compression scheme.

Y = D_T(i) + (X - T(i)) / R(i), where X ∈ Seg(i) and
D_T(i) = T(0) + ∑ ((T(k+1) - T(k)) / R(k)) for k = 0 to i-1.

Seg(i): ith segment
 T(i) : Threshold value in dBFS for Seg(i), where T(0) = -96.33
 R(i) : Ratio value for Seg(i)
D_T(i): Threshold value in dBFS for Seg(i) after DRC compression, where D_T(0) = T(0)
  X   : Input sample value in dBFS
  Y   : Output sample value in dBFS
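
The equation can be evaluated as in the following non-normative Python sketch. It assumes that Seg(i) covers input levels from T(i) up to, but not including, T(i+1). Apart from T(0) = -96.33, the thresholds and ratios in the example are hypothetical values chosen only to illustrate the arithmetic.

```python
def drc_compress_db(x_db, thresholds, ratios):
    """Map an input level X (dBFS) to the output level Y (dBFS).

    thresholds[i] is T(i) in ascending order and ratios[i] is R(i).
    """
    # Find the segment index i such that T(i) <= X < T(i+1).
    i = 0
    while i + 1 < len(thresholds) and x_db >= thresholds[i + 1]:
        i += 1
    # D_T(i): threshold of Seg(i) after compression of all lower segments.
    d_t = thresholds[0] + sum(
        (thresholds[k + 1] - thresholds[k]) / ratios[k] for k in range(i)
    )
    return d_t + (x_db - thresholds[i]) / ratios[i]


# Hypothetical three-segment curve (only T(0) comes from this section).
example_thresholds = [-96.33, -40.0, -20.0]
example_ratios = [1.0, 2.0, 4.0]
y = drc_compress_db(-10.0, example_thresholds, example_ratios)
```

With these example values, an input of -10 dBFS falls in Seg(2), where D_T(2) is approximately -30 dBFS, so the output is approximately -27.5 dBFS.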

9.4.3. Limiter

This module limits the true peak of the input signal at -1 dB. The definition of the true peak is based on [ITU1770-4].

Below is a recommended loudness normalization and DRC control principle according to application.

NOTE: The definitions of AV, TV and Mobile applications are as follows:
  AV application: Sound devices with external speakers such as Soundbar, AV receiver, HiFi speaker etc.
  TV application: Television with built-in speakers such as LCD/OLED slim TV.
  Mobile application: Handheld devices with built-in speakers such as smartphone, tablet etc.

9.5. Down-mix Matrix

9.5.1. Static Down-mix Matrix

This section recommends static down-mix matrices.

IAC players need to support any valid channel layout, even if the number of channels does not match the physically connected audio hardware. Players need to perform channel mixing to increase or reduce the number of channels as needed.

Implementations can use the matrices below to implement down-mixing from the output channel audio, which are known to give acceptable results for stereo, 5.1ch, 7.1ch and 3.1.2ch.

Down-mixing can be done directly by using one of the matrices below or a combination of them. For example, stereo down-mixing for 7.1.4ch can be done by the combination of the 7.1ch down-mix matrix for 7.1.4ch, 5.1ch down-mix matrix for 7.1ch and stereo down-mix matrix for 5.1ch.

The figures below show recommended static down-mix matrices to stereo, 5.1ch and 7.1ch.

7.1ch Down-mix matrix for 7.1.4ch
7.1ch Down-mix matrix for 7.1.2ch
5.1ch Down-mix matrix for 5.1.4ch
5.1ch Down-mix matrix for 5.1.2ch
5.1ch Down-mix matrix for 7.1ch
Stereo Down-mix matrix for 5.1ch
Stereo Down-mix matrix for 3.1.2ch

The figures below show static down-mix matrices to 3.1.2ch.

3.1.2ch Down-mix matrix for 5.1.2ch
3.1.2ch Down-mix matrix for 5.1.4ch
3.1.2ch Down-mix matrix for 7.1.2ch
3.1.2ch Down-mix matrix for 7.1.4ch

Where, p1 = 0.707 and p2 = 0.3535. Implementations may use the limiter defined in § 9.4.3 Limiter to preserve the energy of audio signals instead of normalization factors.

9.5.2. Dynamic Down-mix Matrix

This section recommends dynamic down-mixing matrices.

The dynamic down-mixing matrices shall comply with the down-mixing mechanism which is specified in § 10.2.2 Down-mix Mechanism.

10. IAC Generation Process

This section provides a guideline for IA encoding for a given input audio format.

Recommended input audio format for IA encoding is as follows:

For a given input audio and user inputs, the IA encoder shall output an IA sequence which conforms to § 4 Open Bitstream Unit (OBU) Syntax and Semantics.

Input audio shall be one of the following:

User inputs are:

IA encoding can be done by using a combination of the following generation processes.

The figure below shows the IA encoder configuration for a single audio element.

The IA encoder is composed of a Pre-processor, a Codec encoder and an OBU packetizer.

IA Encoder Configuration

The order of substreams in each ChannelGroup shall be as follows:

Where a non-coupled substream is a coded substream from one of the non-coupled channels.

10.1. Ambisonics Encoding

For Ambisonics encoding:

10.2. Scalable Channel Audio Encoding

For Scalable Channel Audio encoding:

The figure below shows the IA encoding flowchart for Scalable Channel Audio.

IA Encoding Flowchart for Scalable Channel Audio

The following sections, § 10.2.1 Down-mix parameter and Loudness, § 10.2.2 Down-mix Mechanism, § 10.2.3 Channel Layout Generation Rule, § 10.2.4 Recon Gain Generation and § 10.2.5 ChannelGroup Generation Rule, are not needed for non-scalable channel audio (i.e., when num_layers specified in scalable_channel_layout_config() is set to 1).

10.2.1. Down-mix parameter and Loudness

This section describes how to generate down-mix parameters and loudness level for a given channel audio and a given list of channel layouts for scalability.

The figure below shows a block diagram of the down-mix parameter and loudness module, including the down-mixer.

IA Down-mix Parameter and Loudness

For a given Channel Audio (e.g. 7.1.4ch) and a given list of channel layouts based on the Channel Audio,

10.2.2. Down-mix Mechanism

This section specifies the down-mixing mechanism to generate down-mixed audio for scalable channel audio.

For a given Channel Audio which conforms to loudspeaker_layout, the surround channels and the top channels (if any) are down-mixed separately, step by step, until the target channels are obtained.

Implementers may use another method to get the down-mixed audio from the given channel audio, but the resulting down-mixed audio shall comply with the one produced by the mechanism in this section.

Therefore, a down-mixer based on the down-mix mechanism is a combination of the following surround down-mixer(s) and top down-mixer(s), as depicted in the figure below.

- S7to5 enc.: Ls5 = α(k) x Lss7 + β(k) x Lrs7 and Rs5 = α(k) x Rss7 + β(k) x Rrs7.
- S5to3 enc.: L3 = L5 + δ(k) x Ls5 and R3 = R5 + δ(k) x Rs5.
- S3to2 enc.: L2 = L3 + 0.707 x C and R2 = R3 + 0.707 x C.
- S2to1 enc.: Mono = 0.5 x (L2 + R2).
- T4to2 enc.: Ltf2 = Ltf4 + γ(k) x Ltb4 and Rtf2 = Rtf4 + γ(k) x Rtb4.
- T2toTF2 enc.: Ltf3 = Ltf2 + w(k) x δ(k) x Ls5 and Rtf3 = Rtf2 + w(k) x δ(k) x Rs5.

IA Down-mix Mechanism

For example, to get down-mixed 3.1.2ch from 7.1.4ch:
- S3 of 3.1.2ch is generated by using S7to5 and S5to3 encs.
- TF2 of 3.1.2ch is generated by using T4to2 and T2toTF2 encs.
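
The down-mixers above can be sketched per sample as follows. This is a minimal illustration, not a reference implementation: the function names, the dictionary keys (such as "L7") and the parameter values used in the usage note are hypothetical. It also assumes that the front, centre and LFE channels pass through the S7to5 stage unchanged (L5 = L7, R5 = R7), which this section does not state explicitly.

```python
# Per-sample sketch of the down-mixers listed in this section.
# alpha, beta, gamma, delta and w stand for the frame values
# alpha(k), beta(k), gamma(k), delta(k) and w(k).

def s7to5(lss7, rss7, lrs7, rrs7, alpha, beta):
    return alpha * lss7 + beta * lrs7, alpha * rss7 + beta * rrs7


def s5to3(l5, r5, ls5, rs5, delta):
    return l5 + delta * ls5, r5 + delta * rs5


def s3to2(l3, r3, c):
    return l3 + 0.707 * c, r3 + 0.707 * c


def s2to1(l2, r2):
    return 0.5 * (l2 + r2)


def t4to2(ltf4, rtf4, ltb4, rtb4, gamma):
    return ltf4 + gamma * ltb4, rtf4 + gamma * rtb4


def t2totf2(ltf2, rtf2, ls5, rs5, w, delta):
    return ltf2 + w * delta * ls5, rtf2 + w * delta * rs5


def downmix_714_to_312(ch, alpha, beta, gamma, delta, w):
    """Down-mix one 7.1.4ch sample (dict keyed by channel name) to 3.1.2ch."""
    # Surround path: S7to5 followed by S5to3.
    ls5, rs5 = s7to5(ch["Lss7"], ch["Rss7"], ch["Lrs7"], ch["Rrs7"], alpha, beta)
    # Assumption of this sketch: L5 = L7 and R5 = R7 (fronts pass through).
    l3, r3 = s5to3(ch["L7"], ch["R7"], ls5, rs5, delta)
    # Top path: T4to2 followed by T2toTF2.
    ltf2, rtf2 = t4to2(ch["Ltf4"], ch["Rtf4"], ch["Ltb4"], ch["Rtb4"], gamma)
    ltf3, rtf3 = t2totf2(ltf2, rtf2, ls5, rs5, w, delta)
    return {"L3": l3, "R3": r3, "C": ch["C"], "LFE": ch["LFE"],
            "Ltf3": ltf3, "Rtf3": rtf3}
```

For instance, calling `downmix_714_to_312` with every input channel set to 1.0 and the arbitrary test values alpha = beta = 1.0 and gamma = delta = w = 0.5 gives L3 = 2.0 and Ltf3 = 2.0. These values are chosen only to make the arithmetic easy to follow.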

10.2.3. Channel Layout Generation Rule

This section describes the generation rule for channel layouts for scalable channel audio.

For a given channel layout (CL #n) of the input Channel Audio, any list of CLs ({CL #i: i = 1, 2, ..., n}) for a scalable channel audio shall conform to the following rules:

Only down-mix paths which conform to the above rules shall be allowed for scalable channel audio with num_layers > 1, as depicted in the figure below.

IA Down-mix Path

10.2.4. Recon Gain Generation

This section describes how to generate Recon_Gain.

Recon_Gain is applied to de-mixed channels, so the IA encoder needs to deliver it to IA decoders.

Let us define the following:

If 10*log10(level Ok / maxL^2) is less than the first threshold value (e.g. -80dB), Recon_Gain (k, i) = 0, where maxL = 32767 for 16 bits.

If 10*log10(level Ok / level Mk) is less than the second threshold value (e.g. -6dB), Recon_Gain (k, i) is set to the value which makes level Ok = Recon_Gain (k, i)^2 x level Dk. Otherwise, Recon_Gain (k, i) = 1. The actual value to be delivered is floor(255*Recon_Gain).

For example, if we assume CL #i = 7.1.4ch and CL #i-1 = 5.1.2ch, then de-mixed channels are D_Lrs7, D_Rrs7, D_Ltb4 and D_Rtb4.
- D_Lrs7 and D_Rrs7 are de-mixed from Ls5 and Rs5 in the (i-1)th ChannelGroup by using Lss7 and Rss7 in the ith ChannelGroup and its relevant demixing parameters (i.e., α(k) and β(k)), respectively.
- D_Ltb4 and D_Rtb4 are de-mixed from Ltf2 and Rtf2 in the (i-1)th ChannelGroup by using Ltf4 and Rtf4 in the ith ChannelGroup and its relevant demixing parameter (i.e., γ(k)), respectively.

Recon_Gain for D_Lrs7:
- Level Ok is the signal power for the frame #k of Lrs7 in the ith ChannelGroup.
- Level Mk is the signal power for the frame #k of Ls5 in the (i-1)th ChannelGroup.
- Level Dk is the signal power for the frame #k of D_Lrs7.
Recon_Gain for D_Rrs7:
- Level Ok is the signal power for the frame #k of Rrs7 in the ith ChannelGroup.
- Level Mk is the signal power for the frame #k of Rs5 in the (i-1)th ChannelGroup.
- Level Dk is the signal power for the frame #k of D_Rrs7.
Recon_Gain for D_Ltb4:
- Level Ok is the signal power for the frame #k of Ltf4 in the ith ChannelGroup.
- Level Mk is the signal power for the frame #k of Ltf2 in the (i-1)th ChannelGroup.
- Level Dk is the signal power for the frame #k of D_Ltb4.
Recon_Gain for D_Rtb4:
- Level Ok is the signal power for the frame #k of Rtf4 in the ith ChannelGroup.
- Level Mk is the signal power for the frame #k of Rtf2 in the (i-1)th ChannelGroup.
- Level Dk is the signal power for the frame #k of D_Rtb4.
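
The two threshold rules of this section can be sketched as below, given the three signal powers for one frame and one de-mixed channel. This is a minimal sketch under stated assumptions: the function names are hypothetical, the default thresholds are the example values (-80dB and -6dB), and the first rule is assumed to take precedence over the second. The guards for zero power and the clamp to 1.0 are also assumptions of this sketch, since this section does not specify those cases.

```python
import math


def recon_gain(level_o, level_m, level_d, max_l=32767.0,
               first_threshold_db=-80.0, second_threshold_db=-6.0):
    """Return Recon_Gain(k, i) in the range 0.0 .. 1.0 for one frame."""
    # First rule: the original channel is (almost) silent.
    # Assumption: zero power counts as below the threshold (avoids log10(0)).
    if level_o <= 0.0:
        return 0.0
    if 10.0 * math.log10(level_o / (max_l ** 2)) < first_threshold_db:
        return 0.0
    # Second rule: the original channel is well below the down-mixed channel.
    if level_m > 0.0 and 10.0 * math.log10(level_o / level_m) < second_threshold_db:
        # Assumption: with no de-mixed power any gain is equivalent, so use 1.0.
        if level_d <= 0.0:
            return 1.0
        # Choose the gain so that level_o = gain^2 x level_d.
        gain = math.sqrt(level_o / level_d)
        # Assumption: clamp so that the delivered value fits in 0 .. 255.
        return min(gain, 1.0)
    return 1.0


def delivered_value(gain):
    """Value to be delivered: floor(255 * Recon_Gain)."""
    return math.floor(255 * gain)
```

With level Ok = 1e6, level Mk = 1e8 and level Dk = 4e6, the second rule applies (the ratio is -20dB), the gain is 0.5 and the delivered value is 127.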

10.2.5. ChannelGroup Generation Rule

This section describes the generation rule for ChannelGroup.

For a given Channel Audio and the list of CLs ({CL #i: i = 1, 2, ..., n}), the CG Generation module outputs the transformed audio (i.e. ChannelGroups), which shall conform to the following rules:

The figure below shows one example of a transformation matrix with 4 CGs (2ch/3.1.2ch/5.1.2ch/7.1.4ch).

Example of Transformation Matrix with 4 CGs

10.2.6. Mix Presentation Encoding

//To Do: Fill in the text

10.2.6.1. Rendering Config

This section provides a guideline to generate rendering_config().

//To Do: Fill in how to generate rendering_config() for scene-based audio element

//To Do: Fill in how to generate rendering_config() for channel-based audio element

10.2.6.2. Element Mix Config

This section provides a guideline to generate element_mix_config().

//To Do: Fill in how to generate element_mix_config() for scene-based audio element

//To Do: Fill in how to generate element_mix_config() for channel-based audio element

10.2.7. Multiple Audio Elements Encoding

This section provides a guideline to generate an IA sequence having multiple audio elements from two IA simple or base profiles.

10.2.7.1. Multiple Audio Elements with One Codec Config

This section describes how to generate an IA sequence having multiple audio elements from two IA simple profiles with the same codec config OBU. The result shall comply with the base profile of IA sequence.

Step1: Descriptor OBUs are generated as follows:

Step2: ith temporal unit is generated as follows:

Step3: Generate the IA sequence, which starts with the descriptor OBUs followed by the temporal units in order.

10.2.7.2. Multiple Audio Elements with Multiple Codec Config

This section describes how to generate an IA sequence having multiple audio elements from two IA simple or base profiles with different codec config OBUs. The result shall comply with the enhanced profile of IA sequence.

Step1: Descriptor OBUs are generated as follows:

Step2: Data OBUs are generated as follows:

Step3: Generate the IA sequence, which starts with the descriptor OBUs followed by the Temporal Units in order.

10.2.8. Post Processing

This section provides a guideline to generate algorithms for post processing.

10.2.8.1. Loudness Config

This section provides a guideline to generate loudness_config().

//To Do: Fill in how to generate loudness_config()

10.2.8.2. DRC Config

This section provides a guideline to generate drc_config().

//To Do: Fill in how to generate drc_config()

11. Consumption of IAC bitstream

TODO. Fill in example workflows.

12. Annex A: Audio Substream Gaps

This annex describes a number of scenarios where a gap may exist in the audio signals, where a gap is defined as no audio frames for some period of time.

12.1. A gap within an audio substream

A gap within an audio substream may be expressed via the Sync OBU offsets. A decoder encountering such a gap may either:

  1. insert silent audio frames in the gap without reinitializing, or

  2. reinitialize before decoding the audio frames after the gap.

The appropriate behaviour in this case is signalled via the reinitialize_decoder field in the Sync OBU.

In this version of the specification, gaps within an audio substream are not supported.

12.2. A gap between two audio substreams

A gap may occur if there is a period of time between the end of one substream and the start of another. Such a gap may be expressed via the Sync OBU offsets. Similar to the case of a gap within an audio substream, the behaviour of the decoder is determined by the reinitialize_decoder field in the Sync OBU.

A gap may further occur if there is a period of time between the end of all substreams and the start of another. This case may be expressed by setting a non-zero value for the global_offset field in the Sync OBU.

12.3. A gap due to packet loss

In the case where a gap is not signalled by the Sync OBUs, any unexpected absence of audio frames shall be interpreted as packet loss. The IAC parser is unable to guarantee the correctness of the OBUs that follow until the next set of Descriptor OBUs is received.

In this version of the specification, gaps in the audio substreams are not supported, so any gap that is encountered can always be interpreted as packet loss.

Conformance

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. [RFC2119]

Examples in this specification are introduced with the words “for example” or are set apart from the normative text with class="example", like this:

This is an example of an informative example.

Informative notes begin with the word “Note” and are set apart from the normative text with class="note", like this:

Note, this is an informative note.

References

Normative References

[AAC]
Information technology — Generic coding of moving pictures and associated audio information — Part 7: Advanced Audio Coding (AAC). Standard. URL: https://www.iso.org/standard/43345.html
[AV1-Convention]
Conventions. Spec. URL: https://aomedia.org/av1/specification/conventions/
[BCP47]
BCP 47. Best Practice. URL: https://www.rfc-editor.org/info/bcp47
[FLAC]
Free Lossless Audio Codec. Best Practice. URL: https://xiph.org/flac/format.html
[ISOBMFF]
Information technology — Coding of audio-visual objects — Part 12: ISO Base Media File Format. December 2015. International Standard. URL: http://standards.iso.org/ittf/PubliclyAvailableStandards/c068960_ISO_IEC_14496-12_2015.zip
[ITU1770-4]
Algorithms to measure audio programme loudness and true-peak audio level. Standard. URL: https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.1770-4-201510-I!!PDF-E.pdf
[ITU2051-3]
Advanced sound system for programme production. Standard. URL: https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.2051-3-202205-I!!PDF-E.pdf
[MP4]
Information technology — Coding of audio-visual objects — Part 14: MP4 file format. January 2020. Published. URL: https://www.iso.org/standard/79110.html
[MP4-Audio]
Information technology — Coding of audio-visual objects — Part 3: Audio. Standard. URL: https://www.iso.org/standard/76383.html
[MP4-Systems]
Information technology — Coding of audio-visual objects — Part 1: Systems. Standard. URL: https://www.iso.org/standard/55688.html
[OPUS-IN-ISOBMFF]
Encapsulation of Opus in ISO Base Media File Format. Best Practice. URL: https://opus-codec.org/docs/opus_in_isobmff.html
[Q-Format]
Q (number format). Best Practice. URL: https://en.wikipedia.org/wiki/Q_(number_format)
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://datatracker.ietf.org/doc/html/rfc2119
[RFC3629]
F. Yergeau. UTF-8, a transformation format of ISO 10646. November 2003. Internet Standard. URL: https://www.rfc-editor.org/rfc/rfc3629
[RFC6381]
R. Gellens; D. Singer; P. Frojdh. The 'Codecs' and 'Profiles' Parameters for "Bucket" Media Types. August 2011. Proposed Standard. URL: https://www.rfc-editor.org/rfc/rfc6381
[RFC6716]
JM. Valin; K. Vos; T. Terriberry. Definition of the Opus Audio Codec. September 2012. Proposed Standard. URL: https://www.rfc-editor.org/rfc/rfc6716
[RFC7845]
T. Terriberry; R. Lee; R. Giles. Ogg Encapsulation for the Opus Audio Codec. April 2016. Proposed Standard. URL: https://www.rfc-editor.org/rfc/rfc7845
[RFC8486]
J. Skoglund; M. Graczyk. Ambisonics in an Ogg Opus Container. October 2018. Proposed Standard. URL: https://www.rfc-editor.org/rfc/rfc8486

Informative References

[AI-CAD-Mixing]
AI 3D immersive audio codec based on content-adaptive dynamic down-mixing and up-mixing framework. Paper. URL: https://www.aes.org/e-lib/browse.cfm?elib=21489