Immersive Audio Model and Formats

AOM Working Group Draft,

This version:
https://aomediacodec.github.io/iamf/
Issue Tracking:
GitHub
Editors:
( Samsung )
( Google )
Not Ready For Implementation

This spec is not yet ready for implementation. It exists in this repository to record the ideas and promote discussion.

Before attempting to implement this spec, please contact the editors.

Copyright 2023, AOM

Licensing information is available at http://aomedia.org/license/

The MATERIALS ARE PROVIDED “AS IS.” The Alliance for Open Media, its members, and its contributors expressly disclaim any warranties (express, implied, or otherwise), including implied warranties of merchantability, non-infringement, fitness for a particular purpose, or title, related to the materials. The entire risk as to implementing or otherwise using the materials is assumed by the implementer and user. IN NO EVENT WILL THE ALLIANCE FOR OPEN MEDIA, ITS MEMBERS, OR CONTRIBUTORS BE LIABLE TO ANY OTHER PARTY FOR LOST PROFITS OR ANY FORM OF INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES OF ANY CHARACTER FROM ANY CAUSES OF ACTION OF ANY KIND WITH RESPECT TO THIS DELIVERABLE OR ITS GOVERNING AGREEMENT, WHETHER BASED ON BREACH OF CONTRACT, TORT (INCLUDING NEGLIGENCE), OR OTHERWISE, AND WHETHER OR NOT THE OTHER MEMBER HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


Abstract

This document specifies an immersive audio (IA) model, a standalone IA sequence format and an [ISOBMFF]-based IA container format.

1. Introduction

This specification defines an immersive audio model and formats (IAMF) to provide immersive audio experience to end-users.

This specification defines a model for representing Immersive Audio contents based on Audio Substreams contributing to Audio Elements meant to be rendered and mixed to form one or more presentations, as depicted in the figure below.

Processing flow to decode, reconstruct, render and mix the 3D audio signals for immersive audio playback.

The model comprises a number of coded Audio Substreams and the metadata that describes how to decode, render and mix the Audio Substreams for playback. The model itself is codec-agnostic; any supported audio codec may be used to code the Audio Substreams.

The model includes one or more Audio Elements, each of which consists of one or more Audio Substreams. The Audio Substreams that compose an Audio Element are grouped into one or more ChannelGroups. The model further includes Mix Presentations and Parameter Substreams.

IAMF can be used to provide Immersive Audio contents for presentation on a wide range of devices in both streaming and offline applications. These applications include internet audio streaming, multicasting/broadcasting services, file download, gaming, communication, virtual and augmented reality, and others. In these applications, audio may be played back on a wide range of devices, e.g. headsets, mobile phones, tablets, TVs, sound bars, home theater systems and big screens.

Here are some typical IAMF use cases and examples of how to instantiate the model for the use cases.

Example 1: 3D audio signal = 3.1.2ch of UC1,

Example 2: Two 3D audio signals = 5.1.2ch and Stereo of UC2,

Example 3: Two 3D audio signals = FOA and Stereo of UC3,

NOTE: Big-screen TVs select Mix Presentation 1 and Mobile devices select Mix Presentation 2.

Based on the model, this specification defines a hypothetical immersive audio model and format (IAMF) architecture as depicted in the figure below.

Hypothetical IAMF Architecture

For a given input 3D audio,

1.1. IA Sequence

An IA Sequence is a bitstream that represents Immersive Audio contents and consists of Descriptors and IA Data.

Each Descriptor and each unit of IA Data is packetized in its own Open Bitstream Unit (OBU). The OBU is the concrete, physical unit used to represent the components in the model.

1.2. Use of OBU Syntax

1.2.1. Descriptors

Descriptors contain all the information that is required to set up and configure the decoders, reconstruction algorithm, renderers and mixers. Descriptors do not contain audio signals.

1.2.2. IA Data

IA Data contains the actual time-varying data that is required in the generation of the final 3D audio output.

1.3. Timing Model

An Audio Substream is made of consecutive Audio Frame OBUs. Each Audio Frame OBU is made of audio samples at a given sample rate. The decode duration of an Audio Frame OBU is the number of audio samples divided by the sample rate. The presentation duration of an Audio Frame OBU is the number of audio samples remaining after trimming divided by the sample rate. The decode start time (respectively presentation start time) of an Audio Frame OBU is the sum of the decode durations (respectively presentation durations) of the previous Audio Frame OBUs in the IA Sequence, if any, or 0 otherwise. The decode duration (respectively presentation duration) of an Audio Substream is the sum of the decode durations (respectively presentation durations) of all its Audio Frame OBUs. The start time of an Audio Substream is the presentation start time of its first Audio Frame OBU.
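The timing rules for Audio Frame OBUs can be illustrated with a short Python sketch. It is not part of the specification: the function name `frame_timeline` and the tuple layout describing each frame are hypothetical, chosen only for this illustration.

```python
from fractions import Fraction

def frame_timeline(frames, sample_rate):
    """Return (decode_start, decode_dur, pres_start, pres_dur) per frame, in seconds.

    `frames` is a list of (num_samples, trim_at_start, trim_at_end) tuples,
    a hypothetical layout used only for this sketch.
    """
    timeline = []
    decode_start = Fraction(0)
    pres_start = Fraction(0)
    for num_samples, trim_start, trim_end in frames:
        # Decode duration counts every coded sample.
        decode_dur = Fraction(num_samples, sample_rate)
        # Presentation duration counts only the samples left after trimming.
        pres_dur = Fraction(num_samples - trim_start - trim_end, sample_rate)
        timeline.append((decode_start, decode_dur, pres_start, pres_dur))
        # Start times are the running sums of the previous durations.
        decode_start += decode_dur
        pres_start += pres_dur
    return timeline

# Three 960-sample frames at 48 kHz, trimmed at both ends of the substream.
timeline = frame_timeline([(960, 312, 0), (960, 0, 0), (960, 0, 100)], 48000)
```

Exact fractions are used so that the running sums do not accumulate floating-point error over long substreams.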

A Parameter Substream is made of consecutive Parameter Block OBUs. Each Parameter Block OBU is made of parameter values at a given sample rate. The decode duration of a Parameter Block OBU is the number of parameter values divided by the sample rate. The decode start time of a Parameter Block OBU is the sum of the decode durations of the previous Parameter Block OBUs, if any, or 0 otherwise. The decode duration of a Parameter Substream is the sum of the decode durations of all its Parameter Block OBUs. The start time of a Parameter Substream is the decode start time of its first Parameter Block OBU. When all parameter values of a Parameter Substream are constant, there may be no Parameter Block OBUs present in the IA Sequence.

Within an Audio Element, the presentation start times of all Audio Substreams coincide and define the presentation start time of the Audio Element. All Audio Substreams have the same presentation duration, which is the presentation duration of the Audio Element.

Within a Mix Presentation, the presentation start times of all Audio Elements coincide, and all Audio Elements have the same duration, which defines the duration of the Mix Presentation.

Within an IA Sequence, all Mix Presentations have the same duration, which defines the duration of the IA Sequence, and the same presentation start time, which defines the presentation start time of the IA Sequence.

The term Temporal Unit means the set of all Audio Frame OBUs with the same decode start time and the same duration from all Audio Substreams, together with all non-redundant Parameter Block OBUs whose decode start time falls within that duration.

The figure below shows an example of the Timing Model in terms of the decode start times and durations of an Audio Substream and a Parameter Substream.

Example of IAMF Timing Model

2. Open Bitstream Unit (OBU) Syntax and Semantics

The IA Sequence uses the OBU syntax.

This section specifies the OBU syntax elements and their semantics.

2.1. Immersive Audio OBU Syntax and Semantics

Syntax

class ia_open_bitstream_unit() {
  obu_header();
  if (obu_type == OBU_IA_Sequence_Header)
    ia_sequence_header_obu();
  else if (obu_type == OBU_IA_Codec_Config)
    codec_config_obu();
  else if (obu_type == OBU_IA_Audio_Element)
    audio_element_obu();
  else if (obu_type == OBU_IA_Mix_Presentation)
    mix_presentation_obu();
  else if (obu_type == OBU_IA_Parameter_Block)
    parameter_block_obu();
  else if (obu_type == OBU_IA_Temporal_Delimiter)
    temporal_delimiter_obu();
  else if (obu_type == OBU_IA_Audio_Frame)
    audio_frame_obu(true);
  else if (obu_type >= 9 && obu_type <= 30)
    audio_frame_obu(false);
  else if (obu_type >= 5 && obu_type <= 7)
    reserved_obu();
  byte_alignment();
}

Semantics

If the syntax element obu_type is equal to OBU_IA_Sequence_Header, an ordered series of OBUs is presented to the decoding process as a string of bytes.

2.2. OBU Header Syntax and Semantics

Syntax

class obu_header() {
  unsigned int (5) obu_type;
  unsigned int (1) obu_redundant_copy;
  unsigned int (1) obu_trimming_status_flag;
  unsigned int (1) obu_extension_flag;
  leb128() obu_size;
  if (obu_trimming_status_flag) {
    leb128() num_samples_to_trim_at_end;
    leb128() num_samples_to_trim_at_start;
  }
  if (obu_extension_flag == 1)
    leb128() extension_header_size;
}

Semantics

OBUs are structured with a header and a payload.

obu_type specifies the type of data structure contained in the OBU payload.

obu_type: Name of obu_type
   0    : OBU_IA_Codec_Config
   1    : OBU_IA_Audio_Element
   2    : OBU_IA_Mix_Presentation
   3    : OBU_IA_Parameter_Block
   4    : OBU_IA_Temporal_Delimiter
  5~7   : Reserved
   8    : OBU_IA_Audio_Frame
  9~30  : OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID21
   31   : OBU_IA_Sequence_Header

obu_redundant_copy indicates whether this OBU is a redundant copy of the previous OBU in the IA sequence with the same obu_type. A value of 1 indicates that it is a redundant copy, while a value of 0 indicates that it is not.

It shall always be set to 0 for the following obu_type values:

obu_trimming_status_flag indicates whether this OBU has audio samples to be trimmed or not. It SHALL be set only when obu_type is set to OBU_IA_Audio_Frame or OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID21.

For a given substream,

NOTE: Because of coding dependency, discarding a sample can sometimes mean decoding the entire audio frame.

NOTE: This means that if one of the values is set to the number of samples (i.e. num_samples_per_frame) in the Audio Frame OBU, the other value is set to 0.

obu_extension_flag indicates whether the extension_header_size field is present. If it is set to 0, the extension_header_size field shall not be present. Otherwise, the extension_header_size field shall be present.

This flag shall be set to 0 for this version of the specification (i.e. version = 0). An OBU parser which is conformant with the current version of the specification shall be able to parse this flag and extension_header_size.

NOTE: A future version of the specification may use this flag to specify an extension header field by setting obu_extension_flag = 1 and setting the size of the extended header to extension_header_size.

obu_size indicates the size in bytes of the OBU immediately following the obu_size field of the OBU.

num_samples_to_trim_at_start indicates the number of samples that need to be trimmed from the start of the samples in this Audio Frame OBU.

num_samples_to_trim_at_end indicates the number of samples that need to be trimmed from the end of the samples in this Audio Frame OBU.

extension_header_size indicates the size in bytes of the extension header including this field.
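The header layout can be made concrete with a small Python sketch of a parser for obu_header(). This is an illustration under stated assumptions, not a reference implementation: it assumes that fields are packed most significant bit first and that leb128() is the unsigned little-endian base-128 encoding of at most 8 bytes, as used in AV1. The names `read_leb128` and `parse_obu_header` are hypothetical.

```python
def read_leb128(data, pos):
    """Decode an unsigned LEB128 value; return (value, next_pos)."""
    value = 0
    for i in range(8):  # assumption: at most 8 bytes, as in AV1's leb128()
        byte = data[pos + i]
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:  # top bit clear marks the last byte
            return value, pos + i + 1
    raise ValueError("leb128 value longer than 8 bytes")

def parse_obu_header(data, pos=0):
    """Parse obu_header() starting at `pos`; return (fields, next_pos)."""
    first = data[pos]
    fields = {
        "obu_type": first >> 3,                        # 5 bits
        "obu_redundant_copy": (first >> 2) & 1,        # 1 bit
        "obu_trimming_status_flag": (first >> 1) & 1,  # 1 bit
        "obu_extension_flag": first & 1,               # 1 bit
    }
    fields["obu_size"], pos = read_leb128(data, pos + 1)
    if fields["obu_trimming_status_flag"]:
        # The syntax places the end trim before the start trim.
        fields["num_samples_to_trim_at_end"], pos = read_leb128(data, pos)
        fields["num_samples_to_trim_at_start"], pos = read_leb128(data, pos)
    if fields["obu_extension_flag"]:
        fields["extension_header_size"], pos = read_leb128(data, pos)
    return fields, pos

# An Audio Frame OBU header (obu_type 8) trimming 312 samples at the start.
header, next_pos = parse_obu_header(bytes([0x42, 0xC8, 0x01, 0x00, 0xB8, 0x02]))
```

In this example `obu_size` decodes to 200 from the two bytes 0xC8 0x01, and the payload would begin at `next_pos`.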

2.3. Byte Alignment Syntax and Semantics

Syntax

class byte_alignment() {
  while (get_position() & 7)
    unsigned int (1) zero_bit;
}

Semantics

zero_bit shall be equal to 0 and is inserted into the bitstream to align the bit position to a multiple of 8 bits.

2.4. Reserved OBU Syntax and Semantics

Reserved OBUs shall be ignored by parsers compliant with this version of the specification. Future versions of the specification may define semantics for these reserved OBUs that would only be supported by parsers compliant with those future versions.

2.5. IA Sequence Header OBU Syntax and Semantics

This section specifies the OBU payload of OBU_IA_Sequence_Header.

This OBU is used to indicate the start of an IA Sequence. Therefore, the first OBU of an IA Sequence shall be OBU_IA_Sequence_Header.

NOTE: When an IA Sequence is stored in a file, the IA Sequence Header OBU can be used to detect that the file contains an IA Sequence.

This OBU may be repeated within a single IA Sequence for applications such as broadcasting or multicasting. In that case, every IA Sequence Header OBU except the first one shall be marked as redundant (i.e. obu_redundant_copy = 1).

Syntax

class ia_sequence_header_obu() {
  unsigned int (32) ia_code;
  unsigned int (8) profile_name;
  unsigned int (8) profile_compatible;
}

Semantics

ia_code indicates a ‘four-character code’ (4CC), ‘iamf’.

NOTE: When IA OBUs are delivered over a protocol that does not provide explicit IA Sequence boundaries, a parser may locate the start of an IA Sequence by searching for the code iamf preceded by specific OBU header values. For example, assuming obu_extension_flag is set to 0, and because obu_trimming_status_flag is set to 0 for an IA Sequence Header OBU, the OBU header can be 0xF806 or 0xFC06.
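The search described in this note could look like the following Python sketch. It assumes that ia_code is stored as the four ASCII bytes i, a, m, f in that order and that obu_extension_flag is 0. The names `SYNC_PATTERNS` and `find_sequence_header` are hypothetical.

```python
# Two-byte OBU header (obu_redundant_copy = 0 or 1) followed by ia_code.
# Assumption: ia_code is stored as the ASCII bytes "iamf".
SYNC_PATTERNS = (b"\xf8\x06iamf", b"\xfc\x06iamf")

def find_sequence_header(data):
    """Return the offset of the first candidate IA Sequence Header OBU, or -1."""
    offsets = [data.find(pattern) for pattern in SYNC_PATTERNS]
    offsets = [offset for offset in offsets if offset >= 0]
    return min(offsets) if offsets else -1

stream = b"\x00\x01" + b"\xfc\x06iamf\x00\x00" + b"\x42\x05"
start = find_sequence_header(stream)
```

A match is only a candidate: the same byte pattern can in principle occur inside another OBU's payload, so a parser would normally confirm the hit by parsing the OBUs that follow.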

profile_name indicates that this IA sequence can be processed by IA decoders that support the profile indicated in this field. Such IA decoders shall be able to parse all OBUs explicitly listed in that profile. They may still encounter reserved OBUs, which they should skip. This allows future versions of the specification to define new profiles that are backwards compatible with old profiles.

profile_compatible indicates an additional profile of this specification with which this IA sequence complies. If a sequence only complies with the profile_name profile, this field shall be set to the same value as profile_name.

NOTE: For example, an IA sequence with profile_name = 0 and profile_compatible = 1 means that the IA sequence complies with the Base profile of this specification and that an IA decoder supporting the Simple profile can play back the IA sequence. When a future profile, called the Enhanced profile, is defined with profile_compatible = 2, an IA sequence with profile_name = 1 and profile_compatible = 2 will mean that the IA sequence complies with the Enhanced profile and that an IA decoder supporting the Base profile of this specification will be able to play it back by ignoring unknown OBUs and/or unknown syntax, if present.

2.6. Codec Config OBU Syntax and Semantics

This section specifies the OBU payload of OBU_IA_Codec_Config.

Syntax

class codec_config_obu() {
  leb128() codec_config_id;  
  codec_config();
}
class codec_config() {
  unsigned int (32) codec_id;
  leb128() num_samples_per_frame;
  signed int (16) audio_roll_distance;
  decoder_config(codec_id);
}

Semantics

codec_config_id defines an identifier for a codec configuration. Within an IA Sequence, there shall be exactly one non-redundant Codec Config OBU with a given identifier. Audio Elements that need a decoder configuration based on this codec configuration refer to this identifier.

codec_id indicates a ‘four-character code’ (4CC) to identify the codec used to generate the audio substreams. For version 0 of this specification, it shall be set to one of four codec_id values defined below:

NOTE: ipcm should not be confused with lpcm which is another 4CC to identify codecs in other container formats (e.g. QuickTime).

num_samples_per_frame indicates the frame length, in samples, of the audio_frame() provided by audio_frame_obu(). It shall not be set to zero. If the decoder_config() structure for a given codec specifies a value for the frame length, the two values shall be equal.

audio_roll_distance is a signed integer that gives the number of frames that need to be decoded in order for a frame to be decoded correctly. A negative value indicates the number of frames, preceding the frame in question, that need to be decoded.

decoder_config() specifies the set of codec parameters required to decode the payload of a substream for the given codec_id. It is byte aligned.

2.7. Audio Element OBU Syntax and Semantics

This section specifies the OBU payload of OBU_IA_Audio_Element.

Syntax

class audio_element_obu() {
  leb128() audio_element_id;
  unsigned int (3) audio_element_type;
  unsigned int (5) reserved;
  
  leb128() codec_config_id;  
  leb128() num_substreams;
  for (i = 0; i < num_substreams; i++) {
    leb128() audio_substream_id;
  }
  
  leb128() num_parameters;
  for (i = 0; i < num_parameters; i++) {
    leb128() param_definition_type;
    if (param_definition_type == PARAMETER_DEFINITION_DEMIXING) {
        DemixingParamDefinition demixing_info;
    }
    if (param_definition_type == PARAMETER_DEFINITION_RECON_GAIN) {
        ReconGainParamDefinition recon_gain_info;
    }
  }
  if (audio_element_type == CHANNEL_BASED) {
    scalable_channel_layout_config();
  } else if (audio_element_type == SCENE_BASED) {
    ambisonics_config();
  }  
}
class DemixingParamDefinition() extends ParamDefinition() {
  default_demixing_info_parameter_data();
  unsigned int (4) default_w;
  unsigned int (4) reserved;
}
class ReconGainParamDefinition() extends ParamDefinition() {
}

Semantics

audio_element_id defines an identifier for an Audio Element. Within an IA Sequence, there shall be exactly one non-redundant Audio Element OBU with a given identifier. Mix Presentations that use an Audio Element refer to this identifier.

audio_element_type specifies the audio representation of this audio element which is constructed from one or more audio substreams.

audio_element_type: The type of audio representation.
   0    : CHANNEL_BASED
   1    : SCENE_BASED
  2~7   : Reserved

codec_config_id indicates the identifier for the codec configuration which this Audio Element refers to.

num_substreams specifies the number of audio substreams that are used to reconstruct this audio element.

audio_substream_id indicates the identifier for a Substream which this Audio Element refers to.

Let a particular ChannelGroup’s substream be indexed as [ c , n_c ], where

[[1, 1], [1, 2], ..., [1, N_1], [2, 1], [2, 2], ..., [2, N_2], ..., [C, 1], [C, 2], ..., [C, N_c]]

A ChannelGroup is defined in § 7 IAMF Generation Process (Informative). The order of the substreams in each ChannelGroup, i.e. the semantics of n_c, is specified in § 2.7.2 Scalable Channel Layout Config Syntax and Semantics.

num_parameters specifies the number of parameters that are used by the algorithms specified in this audio element.

param_definition_type specifies the type of the parameter definition. All parameter definition types described in this version of the specification are listed in the table below, along with their associated parameter definitions.

param_definition_type : Parameter definition type        : Parameter definition
          0           : PARAMETER_DEFINITION_MIX_GAIN    : MixGainParamDefinition
          1           : PARAMETER_DEFINITION_DEMIXING    : DemixingParamDefinition
          2           : PARAMETER_DEFINITION_RECON_GAIN  : ReconGainParamDefinition

demixing_info provides the parameter definition for the demixing information to reconstruct channel audios according to loudspeaker_layout from scalable channel audio. The parameter definition is provided by DemixingParamDefinition() and the corresponding parameter data to be provided in parameter blocks is specified in demixing_info_parameter_data().

recon_gain_info provides the parameter definition for the gain value to reconstruct channel audios according to loudspeaker_layout from scalable channel audio. The parameter definition is provided by ReconGainParamDefinition() and the corresponding parameter data to be provided in parameter blocks is specified in recon_gain_info_parameter_data().

scalable_channel_layout_config() is a class that provides the metadata required for combining the substreams identified here in order to reconstruct a scalable channel layout.

ambisonics_config() is a class that provides the metadata required for combining the substreams identified here in order to reconstruct an Ambisonics layout.

default_demixing_info_parameter_data() and default_w specify the default parameter data for demixing to apply to all audio samples when there are no Parameter Block OBUs (with parameter_id defined in this DemixingParamDefinition()) provided.

Mapping of default_w to w(k) should be as follows:

default_w :   w(k)
   0      :    0
   1      :  0.0179
   2      :  0.0391
   3      :  0.0658
   4      :  0.1038
   5      :  0.25
   6      :  0.3962
   7      :  0.4342
   8      :  0.4609
   9      :  0.4821
   10     :  0.5
   11     :  reserved
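The mapping table translates directly into a lookup. The Python sketch below is illustrative only. It assumes that every 4-bit value from 11 upwards is reserved (the table lists only 11 explicitly), and the names `W_TABLE` and `default_w_to_w` are hypothetical.

```python
# w(k) values for default_w = 0..10, copied from the table above.
W_TABLE = (0.0, 0.0179, 0.0391, 0.0658, 0.1038, 0.25,
           0.3962, 0.4342, 0.4609, 0.4821, 0.5)

def default_w_to_w(default_w):
    """Map the 4-bit default_w field to w(k); reject reserved values."""
    if not 0 <= default_w <= 15:
        raise ValueError("default_w is a 4-bit field")
    if default_w >= len(W_TABLE):
        # Assumption: values 11..15 are all reserved in this version.
        raise ValueError("reserved default_w value: %d" % default_w)
    return W_TABLE[default_w]

w = default_w_to_w(5)
```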

A default recon gain value of 0 dB is implied when there are no Parameter Block OBUs (with parameter_id defined in this ReconGainParamDefinition()) provided.

2.7.1. Parameter Definition Syntax and Semantics

Parameter definition classes inherit from the abstract ParamDefinition() class.

For a given parameter, its timeline is fully aligned with that of the Audio Element to which the parameter is applied. The timeline of the Audio Element is in the post-coding domain (i.e. before trimming). So, when the same sample rate is assumed for the given Parameter and the Audio Element,

In other words, the start timestamp and the duration of the given Parameter are the same as those of the Audio Element.

Syntax

abstract class ParamDefinition() {
  leb128() parameter_id;
  leb128() parameter_rate;
  unsigned int (1) param_definition_mode;
  unsigned int (7) reserved;
  if (param_definition_mode == 0) {
    leb128() duration;
    leb128() num_subblocks;
    leb128() constant_subblock_duration;
    if (constant_subblock_duration == 0) {
      for (i=0; i< num_subblocks; i++) {
        leb128() subblock_duration;
      }
    }
    
  }
}

Semantics

parameter_id indicates the identifier for the Parameter which this parameter definition refers to.

parameter_rate specifies the rate used by this parameter, expressed as ticks per second. Time-related fields associated with this parameter, such as durations, shall be expressed in the number of ticks.

param_definition_mode indicates whether this parameter definition specifies the duration, num_subblocks, constant_subblock_duration and subblock_duration fields for the parameter blocks associated with the parameter_id.

duration specifies the duration for which all of the parameter blocks associated with this parameter definition are valid and applicable.

num_subblocks specifies the number of different sets of parameter values specified in all of parameter blocks associated to this parameter definition, where each set describes a different subblock of the timeline, contiguously.

constant_subblock_duration specifies the duration of each subblock, in the case where all subblocks except the last subblock have equal durations. If all subblocks except the last subblock do not have equal durations, the value of constant_subblock_duration shall be set to 0.

Let D = the value of duration, NS = the value of num_subblocks, CSI = the value of constant_subblock_duration and SI = the value of subblock_duration.

subblock_duration specifies the duration for the given subblock.

Each value of duration, constant_subblock_duration and subblock_duration shall be expressed as the number of ticks at the parameter_rate specified in the corresponding parameter definition.
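The relationship between duration, num_subblocks, constant_subblock_duration and subblock_duration can be sketched in Python as follows. The sketch rests on two assumptions that are not stated in the text above: the subblocks cover the duration exactly, and, when constant_subblock_duration is non-zero, the last subblock takes the remaining ticks and is no longer than the others. The function name `subblock_durations` is hypothetical.

```python
def subblock_durations(duration, num_subblocks, constant_subblock_duration,
                       explicit_durations=()):
    """Return the per-subblock durations, in ticks at parameter_rate."""
    if constant_subblock_duration == 0:
        # Durations are signalled one by one through subblock_duration.
        durations = list(explicit_durations)
        if len(durations) != num_subblocks:
            raise ValueError("expected one subblock_duration per subblock")
    else:
        csd = constant_subblock_duration
        # Assumption: the last subblock takes whatever remains.
        last = duration - (num_subblocks - 1) * csd
        if not 0 < last <= csd:
            raise ValueError("fields are inconsistent with each other")
        durations = [csd] * (num_subblocks - 1) + [last]
    if sum(durations) != duration:
        # Assumption: the subblocks tile the whole duration.
        raise ValueError("subblocks do not cover the duration")
    return durations

ticks = subblock_durations(duration=10, num_subblocks=3,
                           constant_subblock_duration=4)
```

Dividing each returned tick count by parameter_rate gives the subblock duration in seconds.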

2.7.2. Scalable Channel Layout Config Syntax and Semantics

scalable_channel_layout_config() contains information regarding the configuration of scalable channel audio.

Syntax

class scalable_channel_layout_config() {
  unsigned int (3) num_layers;
  unsigned int (5) reserved;
  for (i = 1; i <= num_layers; i++) {
    channel_audio_layer_config(i);
  }
}
class channel_audio_layer_config(i) {
  unsigned int (4) loudspeaker_layout(i);
  unsigned int (1) output_gain_is_present_flag(i);
  unsigned int (1) recon_gain_is_present_flag(i);
  unsigned int (2) reserved;
  unsigned int (8) substream_count(i);
  unsigned int (8) coupled_substream_count(i);
  if (output_gain_is_present_flag(i) == 1) {
    unsigned int (6) output_gain_flags(i);
    unsigned int (2) reserved;
    signed int (16) output_gain(i);
  }
}

When an audio element is composed of G(r) number of substreams, scalable channel audio for the audio element is layered into num_layers = r number of ChannelGroups.

Immersive Audio Sequence with scalable channel audio (before OBU packing)

The IA decoder shall select one of one or more channel audios provided by scalable channel audio. The IA decoder should select the appropriate channel audio according to the following rules, in order:

Semantics

num_layers indicates the number of ChannelGroups for scalable channel audio. It shall not be set to zero and its maximum number shall be limited to 6.

channel_audio_layer_config() is a class that provides the information regarding the configuration of ChannelGroup for scalable channel audio. channel_audio_layer_config(i) provides information regarding the configuration of ChannelGroup #i.

loudspeaker_layout indicates the channel layout for the channels to be reconstructed from the preceding ChannelGroups and the current ChannelGroup among the ChannelGroups for scalable channel audio.

In the current version of the specification, loudspeaker_layout indicates one of 10 channel layouts including Mono, Stereo, 5.1ch, 5.1.2ch, 5.1.4ch, 7.1ch, 7.1.2ch, 7.1.4ch, 3.1.2ch and Binaural. Where,

Loudspeaker Layout (4 bits) :  Channel Layout  : Loudspeaker Location Ordering
             0000           :       Mono       : C
             0001           :      Stereo      : L/R
             0010           :      5.1ch       : L/C/R/Ls/Rs/LFE
             0011           :     5.1.2ch      : L/C/R/Ls/Rs/Ltf/Rtf/LFE
             0100           :     5.1.4ch      : L/C/R/Ls/Rs/Ltf/Rtf/Ltr/Rtr/LFE
             0101           :      7.1ch       : L/C/R/Lss/Rss/Lrs/Rrs/LFE
             0110           :     7.1.2ch      : L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/LFE
             0111           :     7.1.4ch      : L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/Ltb/Rtb/LFE
             1000           :     3.1.2ch      : L/C/R/Ltf/Rtf/LFE
             1001           :     Binaural     : L/R
            others          :     reserved     :
Where, C: Center, L: Left, R: Right, Ls: Left Surround, Lss: Left Side Surround,
Rs: Right Surround, Rss: Right Side Surround, Lrs: Left Rear Surround, Rrs: Right Rear Surround,
Ltf: Left Top Front, Rtf: Right Top Front, Ltr: Left Top Rear, Rtr: Right Top Rear,
Ltb: Left Top Back, Rtb: Right Top Back, LFE: Low-Frequency Effects
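A decoder would typically hold this table as a lookup keyed by the 4-bit loudspeaker_layout value. The Python sketch below copies the table as given and is illustrative only. The names `LOUDSPEAKER_LAYOUTS` and `layout_channels` are hypothetical, and the 3.1.2ch entry assumes the ordering L/C/R/Ltf/Rtf/LFE.

```python
# loudspeaker_layout value -> (channel layout name, loudspeaker location ordering)
LOUDSPEAKER_LAYOUTS = {
    0b0000: ("Mono", "C"),
    0b0001: ("Stereo", "L/R"),
    0b0010: ("5.1ch", "L/C/R/Ls/Rs/LFE"),
    0b0011: ("5.1.2ch", "L/C/R/Ls/Rs/Ltf/Rtf/LFE"),
    0b0100: ("5.1.4ch", "L/C/R/Ls/Rs/Ltf/Rtf/Ltr/Rtr/LFE"),
    0b0101: ("7.1ch", "L/C/R/Lss/Rss/Lrs/Rrs/LFE"),
    0b0110: ("7.1.2ch", "L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/LFE"),
    0b0111: ("7.1.4ch", "L/C/R/Lss/Rss/Lrs/Rrs/Ltf/Rtf/Ltb/Rtb/LFE"),
    0b1000: ("3.1.2ch", "L/C/R/Ltf/Rtf/LFE"),  # assumed ordering
    0b1001: ("Binaural", "L/R"),
}

def layout_channels(loudspeaker_layout):
    """Return (layout name, ordered list of channel labels) for a 4-bit value."""
    if loudspeaker_layout not in LOUDSPEAKER_LAYOUTS:
        raise ValueError("reserved loudspeaker_layout: %d" % loudspeaker_layout)
    name, ordering = LOUDSPEAKER_LAYOUTS[loudspeaker_layout]
    return name, ordering.split("/")

name, channels = layout_channels(0b0111)
```

For the layouts named in the a.b.c form, the number of labels equals the sum of the digits in the name (for example, 7 + 1 + 4 = 12 for 7.1.4ch), which is a convenient sanity check on the table.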

For a given input audio with audio_element_type = CHANNEL_BASED, if the input audio has height channels (e.g. 7.1.4ch or 5.1.2ch), each of the channel layouts for scalable channel audio is recommended to have height channels (i.e. it is recommended to be higher than or equal to 3.1.2ch).

NOTE: Content providers may be satisfied with down-mixed audio that has no height channels, even though the down-mix mechanism specified in § 9.2.2 Annex B-2: Down-mix Mechanism drops the height channels when it down-mixes input channel audio with height channels to surround channels, for example from 7.1.4ch to Mono, Stereo, 5.1ch or 7.1ch. In that case, the encoder may generate scalable audio that includes down-mixed audio without height channels from input channel audio with height channels. In other words, this specification does not disallow scalable audio from having down-mixed audio without height channels derived from input channel audio with height channels.

NOTE: The Ltr and Rtr of 5.1.4ch down-mixed from 7.1.4ch are within the range of Ltb and Rtb of 7.1.4ch.

output_gain_is_present_flag indicates whether the output_gain information fields for the ChannelGroup are present.

recon_gain_is_present_flag indicates whether the recon_gain information fields for the ChannelGroup are present in recon_gain_info_parameter_data().

substream_count specifies the number of audio substreams. It must be the same as num_substreams in its corresponding audio_element().

coupled_substream_count specifies the number of referenced substreams that are coded as coupled stereo channels.

Only mono or stereo coding shall be allowed in this version of the specification.

The order of substreams in each ChannelGroup shall be as follows:

output_gain_flags indicates the channels to which output_gain is applied. If a bit is set to 1, output_gain shall be applied to the corresponding channel. Otherwise, output_gain shall not be applied to that channel.

Bit position : Channel Name
    b5(MSB)  : Left channel (L1, L2, L3)
      b4     : Right channel (R2, R3)
      b3     : Left Surround channel (Ls5)
      b2     : Right Surround channel (Rs5)
      b1     : Left Top Front channel (Ltf)
      b0     : Right Top Front channel (Rtf)

output_gain indicates the gain value to be applied to the mixed channels which are indicated by output_gain_flags. It is 20*log10 of the factor by which to scale the mixed channels. It is stored as a 16-bit, signed, two’s complement fixed-point value with 8 fractional bits (i.e. Q7.8 in [Q-Format]). Each mixed channel is generated by down-mixing two or more input channels.
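To make the Q7.8 encoding and the bit flags concrete, the following Python sketch decodes both fields. It is an illustration only: the short channel labels and the function name `decode_output_gain` are hypothetical, and the function takes output_gain as the raw 16-bit unsigned value read from the bitstream.

```python
# Labels for bits b5 (MSB) down to b0, abbreviated from the table above.
OUTPUT_GAIN_CHANNELS = ("L", "R", "Ls5", "Rs5", "Ltf", "Rtf")

def decode_output_gain(output_gain_flags, raw_output_gain):
    """Return (flagged channels, gain in dB, linear scale factor)."""
    channels = [
        label
        for bit, label in zip(range(5, -1, -1), OUTPUT_GAIN_CHANNELS)
        if (output_gain_flags >> bit) & 1
    ]
    # Reinterpret the 16-bit field as a two's complement signed integer.
    signed = raw_output_gain - 0x10000 if raw_output_gain & 0x8000 else raw_output_gain
    gain_db = signed / 256.0          # Q7.8: 8 fractional bits
    factor = 10.0 ** (gain_db / 20.0)  # invert gain_db = 20*log10(factor)
    return channels, gain_db, factor

# Flags 0b110000 select the Left and Right channels; 0xFA00 encodes -6 dB.
channels, gain_db, factor = decode_output_gain(0b110000, 0xFA00)
```

Here 0xFA00 is -1536 as a signed value, which gives -6.0 dB and a linear factor of about 0.501.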

2.7.3. Ambisonics Config Syntax and Semantics

ambisonics_config() contains information regarding the configuration of Ambisonics. In this specification, the [AmbiX] format is adopted, which uses Ambisonics Channel Number (ACN) channel ordering and normalizes the channels with Schmidt Semi-Normalization (SN3D).

Syntax

class ambisonics_config() {
  leb128() ambisonics_mode;
  if (ambisonics_mode == MONO) {
    ambisonics_mono_config();
  } else if (ambisonics_mode == PROJECTION) {
    ambisonics_projection_config();
  }
}
class ambisonics_mono_config() {
  unsigned int (8) output_channel_count (C);
  unsigned int (8) substream_count (N);
  unsigned int (8 * C) channel_mapping;
}
class ambisonics_projection_config() {
  unsigned int (8) output_channel_count (C);
  unsigned int (8) substream_count (N);
  unsigned int (8) coupled_substream_count (M);
  signed int (16 * (N + M) * C) demixing_matrix;
}

Semantics

ambisonics_mode specifies the method of coding Ambisonics.

ambisonics_mode: Method of coding Ambisonics.
   0    : MONO
   1    : PROJECTION

If ambisonics_mode is equal to MONO, this indicates that the Ambisonics channels are coded as individual mono substreams. For LPCM, ambisonics_mode shall be equal to MONO.

If ambisonics_mode is equal to PROJECTION, this indicates that the Ambisonics channels are first linearly projected onto another subspace before coding as a mix of coupled stereo and mono substreams.

output_channel_count complies with the channel count in [RFC8486], with the following restrictions:

substream_count specifies the number of audio substreams. It must be the same as num_substreams in its corresponding audio_element().

channel_mapping complies with the one for ChannelMappingFamily = 2 in [RFC8486].

coupled_substream_count specifies the number of referenced substreams that are coded as coupled stereo channels, where M <= N.

demixing_matrix complies with the one for ChannelMappingFamily = 3 in [RFC8486], except that the byte order of each matrix coefficient is converted to big endian.

The order of substreams in the ChannelGroup shall conform to [RFC8486].
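Reading the demixing_matrix field could be sketched in Python as follows. The sketch only extracts the (N + M) * C signed 16-bit big-endian coefficients as a flat list of raw integers. It deliberately does not reshape or scale them, since the matrix layout is defined by [RFC8486] and not restated here. The function name `read_demixing_coefficients` is hypothetical.

```python
import struct

def read_demixing_coefficients(payload, n, m, c):
    """Return the (n + m) * c raw coefficients as signed 16-bit integers.

    `payload` starts at the first byte of demixing_matrix; n, m and c are
    substream_count, coupled_substream_count and output_channel_count.
    """
    count = (n + m) * c
    if len(payload) < 2 * count:
        raise ValueError("payload too short for demixing_matrix")
    # ">" selects big-endian byte order, "h" a signed 16-bit integer.
    return list(struct.unpack(">%dh" % count, payload[:2 * count]))

raw = bytes([0x7F, 0xFF, 0x80, 0x00, 0x00, 0x01, 0xFF, 0xFF])
coefficients = read_demixing_coefficients(raw, n=1, m=1, c=2)
```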

2.8. Mix Presentation OBU Syntax and Semantics

This section specifies the OBU payload of OBU_IA_Mix_Presentation.

The metadata in mix_presentation() specifies how to render, process and mix one or more audio elements, with details provided in § 6.3 Mix Presentation.

An IA sequence may have one or more mix presentations specified. The IA parser shall select the appropriate mix presentation to process according to the rules specified in § 6.3.1 Selecting a Mix Presentation.

A mix presentation may contain one or more sub-mixes. Common use-cases may specify only one sub-mix, which includes all rendered and processed audio elements used in the mix presentation. The use-case for specifying more than one sub-mix arises if an IA multiplexer is merging two or more IA sequences. In this case, it may choose to capture the loudness information from the original IA sequences in multiple sub-mixes, instead of recomputing the loudness information for the final mix.

Syntax

class mix_presentation_obu() {
  leb128() mix_presentation_id;
  leb128() count_label;
  for (i = 0; i < count_label; i++) {
    string language_label;
  }
  for (i = 0; i < count_label; i++) {
    mix_presentation_annotations();
  }
  leb128() num_sub_mixes;
  for (i = 0; i < num_sub_mixes; i++) {
    leb128() num_audio_elements;
    for (j = 0; j < num_audio_elements; j++) {
      leb128() audio_element_id;
      for (k = 0; k < count_label; k++) {
        mix_presentation_element_annotations();
      }
      rendering_config();
      element_mix_config();
    }
    output_mix_config();
    
    leb128() num_layouts;
    for (j = 0; j < num_layouts; j++) {
      layout loudness_layout;
      loudness_info loudness; 
    }
  }
}  

Semantics

mix_presentation_id defines an identifier for a Mix Presentation. Within an IA Sequence, there shall be exactly one non-redundant Mix Presentation OBU with a given identifier. This identifier may be used by the application to select which Mix Presentation(s) to offer.

count_label indicates the number of labels in different languages.

language_label specifies the language in which both mix_presentation_friendly_label and audio_element_friendly_label are written. It shall conform to [BCP47]. The same language shall not be duplicated in this loop.

mix_presentation_annotations() is a class that provides informational metadata that an IA parser should refer to when selecting the mix presentation to use. The metadata may also be used by the playback system to display information to the user, but is not used in the rendering or mixing process to generate the final output audio signal.

num_sub_mixes specifies the number of sub-mixes.

num_audio_elements specifies the number of audio elements that are used in this mix presentation to generate the final output audio signal for playback.

audio_element_id indicates the identifier for an Audio Element which this Mix Presentation refers to.

mix_presentation_element_annotations() is a class that provides informational metadata that an IA parser should refer to when selecting the referenced audio element to use. The metadata may also be used by the playback system to display information to the user, but is not used in the rendering or mixing process to generate the final output audio signal.

rendering_config() is a class that provides the metadata required for rendering the referenced audio element.

element_mix_config() is a class that provides the metadata required for applying any processing to the referenced and rendered audio element before being summed with other processed audio elements.

output_mix_config() is a class that provides the metadata required for post-processing the mixed audio signal to generate the audio signal for playback.

num_layouts specifies the number of layouts on which the loudness information for this sub-mix was measured.

loudness_layout identifies the layout that was used to measure the loudness information provided in this sub-mix.

loudness provides the loudness information which was measured on loudness_layout for the mixed audio element by this sub-mix.

The layout specified in loudness_layout should not be higher than the highest layout among layouts provided by the audio elements except zero-order Ambisonics or Mono. In other words, rendering from an audio element with the highest layout (except zero-order Ambisonics or Mono) to the loudness_layout should not require an upmix.

If a sub-mix of a Mix Presentation OBU includes only one scalable channel audio element, it complies with the following:

The highest loudness_layout specified in one sub-mix except zero-order Ambisonics or Mono is the layout which was used for authoring the sub-mix.

Each sub-mix shall include loudness_layout to identify Loudspeaker configuration for Sound System A (0+2+0) (i.e. Stereo). In other words, each sub-mix shall include loudness_info() for Stereo.

2.8.1. Mix Presentation Annotations Syntax and Semantics

Syntax

class mix_presentation_annotations() {
  string mix_presentation_friendly_label;
}

Semantics

mix_presentation_friendly_label specifies a human-friendly label to describe this mix presentation.

2.8.2. Mix Presentation Element Annotations Syntax and Semantics

Syntax

class mix_presentation_element_annotations() {
  string audio_element_friendly_label;
}

Semantics

audio_element_friendly_label specifies a human-friendly label to describe the referenced audio element.

2.8.3. Rendering Config Syntax and Semantics

During playback, an audio element should be rendered using a pre-defined renderer according to § 6.3.2 Rendering an Audio Element . In this version of the specification, no additional metadata is required to configure the renderers, and as such, rendering_config() has an empty payload.

Syntax

class rendering_config() {
}

Semantics

2.8.4. Element Mix Config Syntax and Semantics

element_mix_config() provides a gain value to be applied to the rendered audio element signal.

Syntax

class element_mix_config() {
  MixGainParamDefinition mix_gain;
}
class MixGainParamDefinition() extends ParamDefinition() {
  signed int (16) default_mix_gain;
}

Semantics

mix_gain provides the parameter definition for the gain value that is applied to all channels of the rendered audio element signal. The parameter definition is provided by MixGainParamDefinition() and the corresponding parameter data to be provided in parameter blocks is specified in mix_gain_parameter_data().

default_mix_gain specifies the default mix gain value to apply when there are no mix gain parameter blocks provided. This value is expressed in dB and shall be applied to all channels in the rendered audio element. It is stored as a 16-bit, signed, two’s complement fixed-point value with 8 fractional bits (i.e. Q7.8 in [Q-Format] ).
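The Q7.8 storage described above can be illustrated with a short Python sketch that decodes a raw 16-bit field into a dB value and a linear amplitude factor. The function names and the raw value 0xFA00 are hypothetical. The dB-to-linear conversion assumes the usual 20·log10 amplitude convention, which matches the Output_Gain formula in § 6.2.1 Gain but is not stated explicitly for default_mix_gain.

```python
def q7_8_to_float(raw: int) -> float:
    """Interpret a 16-bit two's complement field as Q7.8 (8 fractional bits)."""
    raw &= 0xFFFF
    if raw & 0x8000:          # sign bit set: the value is negative
        raw -= 0x10000
    return raw / 256.0


def db_to_linear(gain_db: float) -> float:
    """Assumed amplitude convention: linear = 10^(dB / 20)."""
    return 10.0 ** (gain_db / 20.0)


default_mix_gain_db = q7_8_to_float(0xFA00)   # hypothetical raw field value
print(default_mix_gain_db, db_to_linear(default_mix_gain_db))
```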

2.8.5. Output Mix Config Syntax and Semantics

output_mix_config() provides a gain value to be applied to the mixed audio signal.

Syntax

class output_mix_config() {
  MixGainParamDefinition output_mix_gain;
}

Semantics

output_mix_gain provides the parameter definition for the gain value that is applied to all channels of the mixed audio signal. The parameter definition is provided by MixGainParamDefinition() and the corresponding parameter data to be provided in parameter blocks is specified in mix_gain_parameter_data().

2.8.6. Layout Syntax and Semantics

The layout class specifies either a binaural system or the list of physical loudspeaker positions according to [ITU2051-3] .

Syntax

class layout() {
  unsigned int (2) layout_type;
  
  if (layout_type == LOUDSPEAKERS_SP_LABEL) {
    unsigned int (6) num_loudspeakers;
    for (i = 0; i < num_loudspeakers; i++) {
      unsigned int (8) sp_label;
    }
  } 
  else if (layout_type == LOUDSPEAKERS_SS_CONVENTION) {
    unsigned int (4) sound_system;
    unsigned int (2) reserved;
  }
  else if (layout_type == BINAURAL || layout_type == NOT_DEFINED) {
    unsigned int (6) reserved;
  }
}

Semantics

layout_type specifies the layout type.

layout_type : Layout type
     0      : NOT_DEFINED
     1      : LOUDSPEAKERS_SP_LABEL
     2      : LOUDSPEAKERS_SS_CONVENTION
     3      : BINAURAL

num_loudspeakers specifies the number of loudspeakers.

sp_label defines the SP Label as specified in [ITU2051-3] .

sp_label : SP label    sp_label : SP label    sp_label : SP label
    0    : M+000          18    : U+000          36    : B+000
    1    : M+022          19    : U+022          37    : B+022
    2    : M-022          20    : U-022          38    : B-022
    3    : M+SC           21    : U+030          39    : B+030
    4    : M-SC           22    : U-030          40    : B-030
    5    : M+030          23    : U+045          41    : B+045
    6    : M-030          24    : U-045          42    : B-045
    7    : M+045          25    : U+060          43    : B+060
    8    : M-045          26    : U-060          44    : B-060
    9    : M+060          27    : U+090          45    : B+090
   10    : M-060          28    : U-090          46    : B-090
   11    : M+090          29    : U+110          47    : B+110
   12    : M-090          30    : U-110          48    : B-110
   13    : M+110          31    : U+135          49    : B+135
   14    : M-110          32    : U-135          50    : B-135
   15    : M+135          33    : U+180          51    : B+180
   16    : M-135          34    : UH+180         52    : LFE1
   17    : M+180          35    : T+000          53    : LFE2
54 ~ 255 : Reserved
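As a non-normative illustration, the layout() syntax can be read with a few lines of Python. This sketch assumes that bit fields are packed most-significant-bit first within each byte, which is not stated in this section, and parse_layout is a hypothetical helper name.

```python
NOT_DEFINED, LOUDSPEAKERS_SP_LABEL, LOUDSPEAKERS_SS_CONVENTION, BINAURAL = range(4)


def parse_layout(data: bytes, pos: int = 0):
    """Return (layout, next_pos). Assumes MSB-first bit packing."""
    first = data[pos]
    pos += 1
    layout_type = first >> 6          # upper 2 bits
    low6 = first & 0x3F               # remaining 6 bits of the first byte
    if layout_type == LOUDSPEAKERS_SP_LABEL:
        num_loudspeakers = low6
        labels = list(data[pos:pos + num_loudspeakers])
        if len(labels) != num_loudspeakers:
            raise ValueError("truncated layout()")
        return {"layout_type": layout_type, "sp_labels": labels}, pos + num_loudspeakers
    if layout_type == LOUDSPEAKERS_SS_CONVENTION:
        # 4-bit sound_system followed by 2 reserved bits
        return {"layout_type": layout_type, "sound_system": low6 >> 2}, pos
    # BINAURAL or NOT_DEFINED: the 6 remaining bits are reserved
    return {"layout_type": layout_type}, pos


layout, next_pos = parse_layout(bytes([0b01000010, 5, 6]))
print(layout, next_pos)
```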

sound_system specifies the sound system A to J as specified in [ITU2051-3] , 7.1.2ch and 3.1.2ch of loudspeaker_layout as follows:

2.8.7. Loudness Info Syntax and Semantics

loudness_info() provides loudness information for a given audio signal.

All signed values are stored as signed Q7.8 fixed-point values (in [Q-Format] ).

Syntax

class loudness_info() {
  unsigned int (8) info_type;
  signed int (16) integrated_loudness;
  signed int (16) digital_peak;
  if (info_type & 1) {
    signed int (16) true_peak;
  }
  if (info_type & 2) {
    unsigned int (8) num_anchored_loudness;
    for (i = 0; i < num_anchored_loudness; i++) {
      unsigned int (8) anchor_element;
      signed int (16) anchored_loudness;
    }
  }
}

Semantics

info_type is a bitmask that specifies the type of optional loudness information provided. The bits are set as follows, where the first bit is the LSB:

Bit : Type of information provided
 0  : True peak
 1  : Anchored Loudness (one or more)
2~7 : Reserved

integrated_loudness provides the program integrated loudness information, specified in LKFS as defined in [ITU1770-4] , and measured according to [ITU1770-4] .

digital_peak specifies the digital (sampled) peak value of the audio signal, specified in dBFS.

true_peak specifies the true peak of the audio signal, specified in dBFS and measured according to [ITU1770-4] .

anchor_element specifies the anchor element used in computation of the anchored_loudness which follows, as defined in [ISOIEC-23091-3-2018] , as follows:

  0   : Unknown
  1   : Dialogue
  2   : Album
3~255 : Reserved

There shall be no duplicate values of anchor_element within one loudness_info().

anchored_loudness specifies the loudness information according to the anchor element, specified in LKFS as defined in [ITU1770-4] .

NOTE: [ITU1770-4] adopts the convention of using the dBov unit for dBFS, where the RMS value of a full-scale square wave is 0 dBov. The same convention is adopted here.
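The following Python sketch shows one way to read loudness_info() and apply the info_type bitmask. It is illustrative only: parse_loudness_info is a hypothetical name, and the 16-bit fields are assumed to be stored big-endian, which this section does not state.

```python
import struct


def parse_loudness_info(data: bytes, pos: int = 0):
    """Return (info, next_pos) for one loudness_info() structure."""

    def q7_8(offset: int) -> float:
        # Signed 16-bit Q7.8 value; big-endian byte order is an assumption.
        return struct.unpack_from(">h", data, offset)[0] / 256.0

    info_type = data[pos]
    info = {
        "info_type": info_type,
        "integrated_loudness": q7_8(pos + 1),
        "digital_peak": q7_8(pos + 3),
    }
    pos += 5
    if info_type & 1:                      # bit 0: true peak present
        info["true_peak"] = q7_8(pos)
        pos += 2
    if info_type & 2:                      # bit 1: anchored loudness present
        num_anchored_loudness = data[pos]
        pos += 1
        anchored = {}
        for _ in range(num_anchored_loudness):
            anchor_element = data[pos]
            if anchor_element in anchored:
                raise ValueError("duplicate anchor_element")
            anchored[anchor_element] = q7_8(pos + 1)
            pos += 3
        info["anchored_loudness"] = anchored
    return info, pos
```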

2.9. Parameter Block OBU Syntax and Semantics

This section specifies the OBU payload of OBU_IA_Parameter_Block.

The metadata specified in this OBU defines the parameter values for an algorithm for an indicated duration, including any animation of the parameter values over this duration. The metadata are used in conjunction with a corresponding parameter definition and parameter data specification. The parameter definition is specified based on ParamDefinition() . The parameter data provides the values to apply in each parameter block. These are specified using the AnimatedParameterData() function template if parameter animation is supported.

Syntax

class parameter_block_obu() {
  leb128() parameter_id;
  
  (param_definition_type, param_definition_mode, duration, num_subblocks, constant_subblock_duration, subblock_duration) = get_param_definition(parameter_id);
  
  if (param_definition_mode) {
    leb128() duration;
    leb128() num_subblocks;
    leb128() constant_subblock_duration;
  }
  for (i = 0; i < num_subblocks; i++) {
    if (param_definition_mode) {
      if (constant_subblock_duration == 0) {
        leb128() subblock_duration;
      }
    }
    if (param_definition_type == PARAMETER_DEFINITION_MIX_GAIN) {
      mix_gain_parameter_data();
    }
    if (param_definition_type == PARAMETER_DEFINITION_DEMIXING) {
      demixing_info_parameter_data();
    }
    if (param_definition_type == PARAMETER_DEFINITION_RECON_GAIN) {
      recon_gain_info_parameter_data();
    }
  }
}

Semantics

parameter_id defines an identifier for a Parameter. Within an IA Sequence, there shall be exactly one non-redundant Audio Element OBU with this identifier. Parameter Block OBUs refer to the Parameter through this identifier.

get_param_definition() is a run-time function to get the parameter definition type, the parameter definition mode, duration, num_subblocks, constant_subblock_duration and subblock_duration mapped to the parameter_id.

duration specifies the duration for which this parameter block is valid and applicable.

num_subblocks specifies the number of different sets of parameter values specified in this parameter block, where each set describes a different subblock of the timeline, contiguously.

constant_subblock_duration specifies the duration of each subblock, in the case where all subblocks except the last subblock have equal durations. If all subblocks except the last subblock do not have equal durations, the value of constant_subblock_duration shall be set to 0.

The Audio Element OBU and/or Mix Presentation OBU maps a parameter_id to its parameter definition type, so IA decoders can determine the definition type mapped to the parameter_id.

subblock_duration specifies the duration for the given subblock.

Each value of duration, constant_subblock_duration and subblock_duration shall be expressed as the number of ticks at the parameter_rate specified in the corresponding parameter definition.
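The relationship between duration, num_subblocks, constant_subblock_duration and subblock_duration can be sketched as follows. The helper name is hypothetical, and the sketch assumes that the subblock durations sum to duration, so that the last subblock takes whatever remains when constant_subblock_duration is non-zero.

```python
def subblock_durations(duration, num_subblocks, constant_subblock_duration,
                       explicit_durations=None):
    """Return the duration of each subblock, in ticks at parameter_rate."""
    if num_subblocks < 1:
        raise ValueError("num_subblocks must be at least 1")
    if constant_subblock_duration == 0:
        # Each subblock carries its own subblock_duration.
        if explicit_durations is None or len(explicit_durations) != num_subblocks:
            raise ValueError("one subblock_duration is needed per subblock")
        return list(explicit_durations)
    # All subblocks except the last share the constant duration.
    durations = [constant_subblock_duration] * (num_subblocks - 1)
    durations.append(duration - sum(durations))   # assumed remainder rule
    return durations


print(subblock_durations(100, 3, 40))
```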

2.9.1. Mix Gain Parameter Data Syntax and Semantics

Syntax

class mix_gain_parameter_data() {
  leb128() animation_type;
  AnimatedParameterData<signed int (16)> param_data;
}

Semantics

animation_type specifies the type of animation applied to the parameter values.

param_data uses the AnimatedParameterData function template. Each of the values defined within this instance (start_point_value, end_point_value and control_point_value) is expressed in dB and shall be applied to all channels in the rendered audio element. They are stored as 16-bit, signed, two’s complement fixed-point values with 8 fractional bits (i.e. Q7.8 in [Q-Format] ).

animation_type : Animation Type
       0       : STEP
       1       : LINEAR
       2       : BEZIER

Classes that take animation_type as an input argument use the AnimatedParameterData() function template. The method of applying the animation is described in § 6.4 Animated Parameters .

template <class T>
class AnimatedParameterData(animation_type) {
  if (animation_type == STEP) {
    T start_point_value;
  }
  if (animation_type == LINEAR) {
    T start_point_value;
    T end_point_value;
  }
  if (animation_type == BEZIER) {
    T start_point_value;
    T end_point_value;
    T control_point_value;
    unsigned int (8) control_point_relative_time;
  }
}

start_point_value specifies the parameter value that is applied at the start of the subblock.

end_point_value specifies the parameter value that is applied at the end of the subblock.

control_point_value specifies the parameter value of the middle control point of a quadratic Bezier curve, i.e. its y-axis value.

control_point_relative_time specifies the time of the middle control point of a quadratic Bezier curve, i.e. its x-axis value. This value is expressed as a fraction of the parameter subblock duration, with valid values in the range of 0 to 1, inclusive. A value equal to 0 or 1 indicates that this animation implements a linear Bezier curve, in which case control_point_value shall be ignored by the IA parser. It is stored as an 8-bit, unsigned, fixed-point value with 8 fractional bits (i.e. Q0.8 in [Q-Format]).
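The method of applying the animation is given in § 6.4 Animated Parameters. As an informal illustration only, a quadratic Bezier curve with these fields can be evaluated as below, assuming the control points P0 = (0, start_point_value), P1 = (control_point_relative_time, control_point_value) and P2 = (1, end_point_value), with time normalized to the subblock duration. The function name and the sample values are hypothetical.

```python
import math


def bezier_value(start, end, control, control_time, t):
    """Parameter value at normalized time t in [0, 1] within the subblock."""
    if control_time in (0.0, 1.0):
        # Linear case: control_point_value is ignored.
        return start + (end - start) * t
    # Solve x(s) = (1 - 2*cx) * s^2 + 2*cx * s = t for the curve parameter s.
    a = 1.0 - 2.0 * control_time
    b = 2.0 * control_time
    if abs(a) < 1e-12:
        s = t / b
    else:
        disc = max(0.0, b * b + 4.0 * a * t)
        s = (-b + math.sqrt(disc)) / (2.0 * a)
    # Evaluate the y-coordinate of the curve at s.
    return (1 - s) ** 2 * start + 2 * (1 - s) * s * control + s * s * end


control_time = 64 / 256.0   # hypothetical raw Q0.8 field value of 64
print(bezier_value(0.0, -6.0, -1.0, control_time, 0.5))
```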

2.9.2. Demixing Info Parameter Data Syntax and Semantics

demixing_info_parameter_data() specifies the demixing parameter mode to be used to reconstruct the output channel audio according to its loudspeaker_layout.

Syntax

class demixing_info_parameter_data() {
  unsigned int (3) dmixp_mode;
  unsigned int (5) reserved;
}

Semantics

dmixp_mode indicates a mode of pre-defined combinations of five demix parameters.

alpha and beta are gain values used for S7to5 down-mixer, gamma for T4to2 down-mixer, delta for S5to3 down-mixer and w_idx_offset is the offset to generate a gain value w used for T2toTF2 down-mixer.

IA Down-mix Mechanism

2.9.3. Recon Gain Info Parameter Data Syntax and Semantics

recon_gain_info_parameter_data() contains recon gain values for demixed channels.

NOTE: recon_gain_info_parameter_data() is required to compensate for errors introduced by lossy codecs such as OPUS and AAC-LC, which propagate through the Gain and De-mixer modules specified in § 6.2.1 Gain and § 6.2.2 De-mixer. It is not required for lossless codecs such as FLAC and LPCM because the propagated errors are negligible.

Syntax

class recon_gain_info_parameter_data() {
  for (i=0; i< num_layers; i++) {
    if (recon_gain_is_present_flag(i) == 1) {
      leb128() recon_gain_flags(i);
      for (j=0; j< n(i); j++) {
        if (recon_gain_flag(i)(j) == 1)
          unsigned int (8) recon_gain;
      }
    }
  }
}

Semantics

recon_gain_flags indicates the channels which recon_gain is applied to.

Table for Recon Gain Flags

Each bit of recon_gain_flags indicates the presence of recon_gain applied to the channel as depicted in the above figure.

n(i) indicates the number of bits for recon_gain_flag(i), where i = 0, 1, ..., num_layers - 1. It shall be 7 or 12 as depicted in the above figure.

recon_gain indicates the gain value to be applied to the channel indicated by recon_gain_flags, after decoding of the associated frames and the demixing operation. The detailed operation using this value is specified in § 6.2.3 Recon Gain.
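A possible reading loop for recon_gain_info_parameter_data() is sketched below. It is a rough illustration under several assumptions: read_leb128 implements the usual unsigned LEB128 encoding, recon_gain_flag(i)(j) is taken to be bit j of recon_gain_flags counted from the least significant bit, and the per-layer presence flags and n(i) values are passed in by the caller because they come from other OBUs. All function and argument names are hypothetical.

```python
def read_leb128(data: bytes, pos: int):
    """Read an unsigned LEB128 value; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def parse_recon_gain_info(data: bytes, present_flags, bits_per_layer, pos: int = 0):
    """Return (per-layer {bit index: recon_gain}, next_pos)."""
    layers = []
    for present, n in zip(present_flags, bits_per_layer):
        gains = {}
        if present:
            flags, pos = read_leb128(data, pos)
            for j in range(n):
                if (flags >> j) & 1:       # assumed LSB-first flag order
                    gains[j] = data[pos]   # 8-bit recon_gain
                    pos += 1
        layers.append(gains)
    return layers, pos
```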

2.10. Audio Frame OBU Syntax and Semantics

This section specifies OBU payloads of OBU_IA_Audio_Frame and OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID21.

audio_substream_id is an identifier for the substream associated with this audio frame. Within an IA Sequence, there shall be exactly one non-redundant Audio Element OBU with a given audio_substream_id.

Syntax

class audio_frame_obu(audio_substream_id_in_bitstream) {
  if(audio_substream_id_in_bitstream) {
     leb128() explicit_audio_substream_id;
  }
  unsigned int (8*coded_frame_size) audio_frame();
}

Semantics

explicit_audio_substream_id defines the audio_substream_id of this frame. The value shall be greater than 21. When this field is not present audio_substream_id is implicit and is defined as a value from 0 to 21 for OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID21 respectively.

NOTE: The first 22 audio substreams in an IA sequence can use the OBU types OBU_IA_Audio_Frame_ID0 to OBU_IA_Audio_Frame_ID21, which have predefined audio substream identifiers associated with them. This reduces bitrate by avoiding the extra explicit_audio_substream_id field in the bitstream.

coded_frame_size is the size of audio_frame() in bytes.

audio_frame() is the raw coded audio data for the frame. It shall be an opus packet of [RFC6716] for OPUS, a raw_data_block() of [AAC] for AAC-LC and a FRAME of [FLAC] for FLAC.

For LPCM, audio_frame() shall be LPCM samples. When more than one byte is used to represent an LPCM sample, the byte order is indicated in sample_format_flags.
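The rule for implicit and explicit substream identifiers can be expressed compactly. In this sketch, resolve_audio_substream_id is a hypothetical helper: implicit_index stands for N in OBU_IA_Audio_Frame_IDN, and explicit_id stands for the explicit_audio_substream_id field of OBU_IA_Audio_Frame.

```python
MAX_IMPLICIT_ID = 21   # OBU_IA_Audio_Frame_ID0 .. OBU_IA_Audio_Frame_ID21


def resolve_audio_substream_id(implicit_index=None, explicit_id=None):
    """Return the audio_substream_id for one Audio Frame OBU."""
    if implicit_index is not None:
        if not 0 <= implicit_index <= MAX_IMPLICIT_ID:
            raise ValueError("implicit index out of range")
        return implicit_index
    if explicit_id is None or explicit_id <= MAX_IMPLICIT_ID:
        raise ValueError("explicit_audio_substream_id shall be greater than 21")
    return explicit_id


print(resolve_audio_substream_id(implicit_index=5))
print(resolve_audio_substream_id(explicit_id=22))
```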

2.11. Temporal Delimiter OBU Syntax and Semantics

This section specifies the OBU payload of OBU_IA_Temporal_Delimiter.

Syntax

class temporal_delimiter_obu() {
}

NOTE: The Temporal Delimiter OBU has an empty payload.

2.12. Codec Specific

This section defines codec specific information for Codec_Specific_Info and Substream.

For legacy codecs, decoder_config() shall be exactly the same information as the conventional file parser feeds to the codec decoders for decoding of the Substream. For future codecs, decoder_config() shall include all decoding parameters that are required to decode the Substream.

2.12.1. OPUS Specific

codec_id shall be Opus .

decoder_config() for OPUS conforms to the ID Header with ChannelMappingFamily = 0 of [RFC7845] with the following constraints:

The payload format of Substream is an opus packet of [RFC6716] which contains only one single frame of mono or stereo channels and which has a non-delimiting frame structure.

The sample rate used for computing offsets shall be 48 kHz.

2.12.2. AAC-LC Specific

codec_id shall be mp4a .

decoder_config() for AAC-LC is DecoderConfigDescriptor() of [MP4-Systems], which is a subset of ESDBox for [MP4-Audio], with the following constraints:

The payload format of Substream is one single raw_data_block() of [AAC] which contains only one single frame of mono or stereo channels.

The sample rate used for computing offsets shall be the rate indicated by the samplingFrequencyIndex in GASpecificConfig().

2.12.3. FLAC Specific

codec_id shall be fLaC , the FLAC stream marker in ASCII, meaning byte 0 of the stream is 0x66, followed by 0x4C 0x61 0x43.

decoder_config() for FLAC is METADATA_BLOCK of [FLAC] .

The payload format of Substream is FRAME of [FLAC] , which is composed of FRAME_HEADER , followed by SUBFRAME (s) (one SUBFRAME per channel) and followed by FRAME_FOOTER .

The sample rate used for computing offsets shall be the sampling rate indicated in the METADATA_BLOCK .

2.12.4. LPCM Specific

codec_id shall be ipcm .

decoder_config() for LPCM is as follows:

class decoder_config(ipcm) {
  unsigned int (8) sample_format_flags;
  unsigned int (8) sample_size;
  unsigned int (32) sample_rate;
}

sample_format_flags complies with format_flags specified in [MP4-PCM]. In other words, 0x01 indicates little-endian PCM sample format and 0x00 indicates big-endian PCM sample format.

sample_size complies with PCM_sample_size specified in [MP4-PCM] . In other words, it shall take a value from the set 16, 24 and 32.

sample_rate indicates the sample rate of the input audio in Hz. It shall take a value from the set 44.1k, 16k, 32k, 48k and 96k.

The payload format of Substream is one audio_frame() of Audio_Frame_OBU which contains only one single PCM audio frame of mono or stereo channels.

The sample rate used for computing offsets shall be sample_rate .
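The LPCM decoder_config() and the byte-order rule for samples can be illustrated as follows. This is a sketch under assumptions: the 32-bit sample_rate field is assumed to be stored big-endian, samples are assumed to be signed integers, and both function names are hypothetical.

```python
import struct

VALID_SAMPLE_SIZES = {16, 24, 32}
VALID_SAMPLE_RATES = {16000, 32000, 44100, 48000, 96000}


def parse_lpcm_decoder_config(data: bytes):
    """Parse sample_format_flags, sample_size and sample_rate (6 bytes)."""
    flags, size, rate = struct.unpack_from(">BBI", data, 0)  # assumed big-endian
    if size not in VALID_SAMPLE_SIZES:
        raise ValueError("unsupported sample_size")
    if rate not in VALID_SAMPLE_RATES:
        raise ValueError("unsupported sample_rate")
    return {"little_endian": flags == 0x01, "sample_size": size, "sample_rate": rate}


def decode_lpcm_samples(frame: bytes, config):
    """Split audio_frame() bytes into integers (signed PCM is an assumption)."""
    step = config["sample_size"] // 8
    order = "little" if config["little_endian"] else "big"
    return [int.from_bytes(frame[i:i + step], order, signed=True)
            for i in range(0, len(frame) - step + 1, step)]


config = parse_lpcm_decoder_config(bytes([0x01, 16]) + (48000).to_bytes(4, "big"))
print(config, decode_lpcm_samples(b"\x01\x00\xff\xff", config))
```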

3. Profiles

The IA Profiles define a set of capabilities that are required to parse, decode and process the corresponding IA sequence.

NOTE: In this section and subsections, the meaning of a unique OBU is that it is still unique if it only varies by the redundant flag.

Common restrictions on the IA sequence for all profiles specified in this version of specification:

3.1. IA Simple Profile

This section specifies the conformance points of the simple profile.

Restrictions on the IA sequence:

Capabilities of the IA parser, decoder and processor:

3.2. IA Base Profile

This section specifies the conformance points of the base profile.

Restrictions on IA sequence:

Capabilities of the IA parser, decoder and processor:

4. Standalone IAMF Representation

This section details the order in which the OBUs are sequenced in a standalone IAMF representation.

4.1. OBU Sequence Order

An IA sequence is composed of a series of OBUs in the sequence of a set of Descriptor OBUs followed by their associated Data OBUs.

NOTE: In the typical case, the first Descriptor OBUs of the IA sequence are all non-redundant (i.e. obu_redundant_copy = 0).

The descriptor OBUs may additionally be repeated redundantly and as frequently as necessary. In this case, the "obu_redundant_copy" field in the OBU header of each of the descriptor OBUs shall be set to 1.

The figure below shows an example of an IA sequence.

Example of Immersive Audio Sequence

4.1.1. Descriptor OBUs

A set of Descriptor OBUs shall be placed in the following order regardless of where they appear in the bitstream:
  1. One IA Sequence Header OBU

  2. All Codec Config OBUs

  3. All Audio Element OBUs

  4. All Mix Presentation OBUs
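The ordering rule above lends itself to a simple check. In this sketch the OBU kinds are represented by hypothetical string labels rather than real OBU type values, and descriptors_in_order is an illustrative helper name.

```python
DESCRIPTOR_ORDER = ["sequence_header", "codec_config", "audio_element", "mix_presentation"]


def descriptors_in_order(obu_kinds):
    """True if one set of Descriptor OBUs follows the required order."""
    ranks = [DESCRIPTOR_ORDER.index(kind) for kind in obu_kinds]
    exactly_one_header = obu_kinds.count("sequence_header") == 1
    return exactly_one_header and ranks == sorted(ranks)


print(descriptors_in_order(
    ["sequence_header", "codec_config", "audio_element", "audio_element", "mix_presentation"]))
print(descriptors_in_order(
    ["sequence_header", "audio_element", "codec_config", "mix_presentation"]))
```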

4.1.2. Data OBUs

Data OBUs consist of a sequence of Audio Frame OBUs, Parameter Block OBUs and Temporal Delimiter OBUs (if present), according to the rules below:

Additionally, the following constraints apply to the Audio Frame and Parameter Block OBUs:

4.1.3. Refreshing Descriptor OBUs

The above describes the full sequence of OBUs for a given set of Descriptor OBUs and their associated Data OBUs.

If the IAMF configuration changes, a new set of Descriptor OBUs is required. In that case, a new IA sequence of the complete set of Descriptor OBUs and their corresponding Data OBUs shall follow, in the same order as described above.

5. ISOBMFF IAMF Encapsulation

5.1. General Requirements & Brands

A file conformant to this specification satisfies the following:

Parsers shall support the structures required by the iso6 brand and MAY support structures required by further ISOBMFF structural brands.

5.2. ISOBMFF IAMF Encapsulation

This section describes the basic data structures used to signal encapsulation of IA sequence in [ISOBMFF] containers.

5.2.1. Requirement of IA sequence

Even though an IA sequence can theoretically group audio data coded with different codecs, potentially with different timing properties, which would require multiple tracks, this version of the specification only supports storing an IA Sequence as a single track thanks to the restrictions of the selected profiles.

5.2.2. Encapsulation Scheme

The result of encapsulating an IA Sequence into an [ISOBMFF] file is as follows:

NOTE: Multiple sample entries may be used in a track, for example when the track is the concatenation of multiple tracks or multiple IA Sequences and some IA samples have different configOBUs values.

5.2.3. IA Sample Entry

Sample Entry Type: iamf
Container:         Sample Description Box ('stsd')
Mandatory:         Yes
Quantity:          One or more.

IASampleEntry identifies that the track contains IA Samples , and contains configOBUs.

Syntax

class IASampleEntry extends AudioSampleEntry('iamf') {
  unsigned int (8) configOBUs[];
}

The channelcount and samplerate fields of AudioSampleEntry are unused.

None of AudioSampleEntry’s optional boxes shall be present.

Semantics

configOBUs shall contain the following OBUs in order.

5.2.4. IA Sample Format

For tracks using the IASampleEntry , an IA Sample has the following constraints:

NOTE: Per the restriction of the profiles carried in an IA track, all Audio Frame OBUs in an IA Sample have the same duration and have the same trimming information. If Audio Frame OBUs in the IA sample contain trimming information, the corresponding audio samples SHALL be removed from presentation using edit list information.

NOTE: In the typical case, when a track contains a single IA Sequence, trimming can only happen at the beginning or end of the IA sequence, and therefore at the beginning or end of the track, so the edit list can describe the start and end trimming with a single edit entry. A track storing consecutive IA Sequences may need multiple edits in the edit list.

5.3. Codecs Parameter String

DASH and other applications require defined values for the Codecs parameter specified in [RFC6381] for ISO Media tracks. The codecs parameter string for the AOM IA codec shall be:
iamf.IAMF-specific-needs.Opus
iamf.IAMF-specific-needs.mp4a.40.2
iamf.IAMF-specific-needs.fLaC
iamf.IAMF-specific-needs.ipcm

IAMF-specific-needs shall be PC as follows:

For example, for this version of the specification

iamf.000.Opus
iamf.001.mp4a.40.2

5.4. ISOBMFF IAMF Decapsulation (Informative)

5.4.1. ISOBMFF IAMF Decapsulation with single track

This section provides a guideline for an IAMF parser to reconstruct IA sequences from an IAMF file.

When the IAMF parser feeds the reconstructed IA sequences to the OBU parser, descriptor OBUs shall be placed first, followed by Temporal Units.

During the decapsulation process, the IAMF file is decapsulated into IA sequences which conform to § 2 Open Bitstream Unit (OBU) Syntax and Semantics as follows:

5.4.2. Recommended handling of Trimming Information

This section recommends how to handle the trimming information of an ISOBMFF file.

Recommendation for ISOBMFF Trimming Information Handling

As depicted in the above figure,

Where, PTS1 is the presentation time stamp of the first audio sample before trimming and PTS2 is the presentation time stamp of the first audio sample after trimming.

6. IAMF processing

This section provides processes for IA decoding for a given IA sequence .

IA decoding can be done using a combination of the following decoding processes.

Ambisonics decoding shall conform to [RFC8486], except for codec-specific processing, and shall output Ambisonics channels in ACN (Ambisonics Channel Number) order.

Scalable Channel Audio decoding shall output the channel audio (e.g. 3.1.2ch or 7.1.4ch) for the target channel layout.

The IA decoder is composed of an OBU parser, a Codec decoder, an Audio Element Renderer and a Post-processor, as depicted in the figure below.

IA Decoder Configuration

6.1. Ambisonics decoding

This section describes the decoding of Ambisonics.

The figure below shows the decoding flowchart for Ambisonics.

Ambisonics Decoding Flowchart

6.2. Scalable Channel Audio decoding

This section describes the decoding of Scalable Channel Audio.

The figure below shows the decoding flowchart for Scalable Channel Audio.

Scalable Channel Audio Decoding Flowchart

For a given loudspeaker layout (i.e. CL #i) among the list of loudspeaker_layout in scalable channel layout config,

The following sections, § 6.2.1 Gain, § 6.2.2 De-mixer and § 6.2.3 Recon Gain, are only needed for decoding of scalable audio with num_layers > 1.

6.2.1. Gain

The Gain module is the mirror process of the Attenuation module. It recovers the reduced sample values using Output_Gain when the corresponding flag for ChannelGroup #j is on. When the flag is off, this module shall be bypassed for ChannelGroup #j. Output_Gain(j) for ChannelGroup #j shall be applied to all samples of the mixed channels in ChannelGroup #j, where the mixed channels are the channels mixed from an input channel audio (i.e. a channel audio for CL #n).

To apply the gain, an implementation MUST use the following:

Sample *= pow(10, Output_Gain(j) / (20.0*256))

where Output_Gain(j) is the raw 16-bit value for the jth layer, which is specified in channel_audio_layer_config().
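The gain formula above translates directly into Python. In this sketch, apply_output_gain is a hypothetical helper, samples are plain floating-point values, and output_gain_raw is the raw 16-bit field already interpreted as a signed integer.

```python
def apply_output_gain(samples, output_gain_raw):
    """Scale samples by 10^(Output_Gain / (20 * 256)), as in the formula above."""
    factor = 10.0 ** (output_gain_raw / (20.0 * 256))
    return [sample * factor for sample in samples]


# A hypothetical raw value of 1536 corresponds to 1536 / 256 = 6 dB.
print(apply_output_gain([0.25, -0.5], 1536))
```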

6.2.2. De-mixer

For scalable channel audio with num_layers > 1, some channels of down-mixed audio for CL #i are delivered as is but the rest are mixed with other channels for CL #i-1.

The De-mixer module reconstructs the rest of the down-mixed audio for CL #i from the mixed channels, which are passed by the Gain module, and their relevant non-mixed channels, using the relevant demixing parameters.

De-mixing for down-mixed audio for CL #i shall comply with the result by the combination of following surround and top de-mixers:

Initially, wIdx(0) = 0 and the value of wIdx(k) shall be derived as follows:

Mapping of wIdx(k) to w(k) should be as follows:

wIdx(k) :   w(k)
   0    :  0
   1    :  0.0179
   2    :  0.0391
   3    :  0.0658
   4    :  0.1038
   5    :  0.25
   6    :  0.3962
   7    :  0.4342
   8    :  0.4609
   9    :  0.4821
  10    :  0.5

When D_set = { x | S1 < x ≤ Si and x is an integer},

When Ti = 2,

When Ti = 4,

For example, when CL #1 = 2ch, CL #2 = 3.1.2ch, CL #3 = 5.1.2ch and CL #4 = 7.1.4ch, the rest (i.e. Ls5/Rs5/Ltf/Rtf) of the down-mixed 5.1.2ch is reconstructed as follows:

Ls5 = 1/δ(k) × (L2 - 0.707 × C - L5) and Rs5 = 1/δ(k) × (R2 - 0.707 × C - R5).
Ltf = Ltf3 - w(k) × (L2 - 0.707 × C - L5) and Rtf = Rtf3 - w(k) × (R2 - 0.707 × C - R5).
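The left-side equations of this example can be written as a small per-sample routine. The sketch below only transcribes the two formulas and the wIdx(k) to w(k) table given in this section. The function name is hypothetical, and delta and w_idx are passed in by the caller because their derivation is not covered here. The right-side channels follow the same pattern with R2, R5 and Rtf3.

```python
# w(k) values indexed by wIdx(k), copied from the mapping table above.
W_TABLE = [0.0, 0.0179, 0.0391, 0.0658, 0.1038, 0.25,
           0.3962, 0.4342, 0.4609, 0.4821, 0.5]


def demix_5_1_2_left(L2, C, L5, Ltf3, delta, w_idx):
    """Reconstruct one sample of Ls5 and Ltf from the delivered channels."""
    w = W_TABLE[w_idx]
    residual = L2 - 0.707 * C - L5      # term shared by both equations
    Ls5 = residual / delta
    Ltf = Ltf3 - w * residual
    return Ls5, Ltf


print(demix_5_1_2_left(L2=1.0, C=0.0, L5=0.5, Ltf3=0.3, delta=0.5, w_idx=5))
```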

6.2.3. Recon Gain

recon_gain shall be applied only to the audio samples of the de-mixed channels from the De-mixer module.

The figure below shows the smoothing scheme of recon_gain.

Smoothing Scheme of Recon Gain

Recommended values for specific codecs are as follows:

6.3. Mix Presentation

An IA sequence may contain more than one mix presentation. § 6.3.1 Selecting a Mix Presentation details how a mix presentation should be selected when several are available.

A mix presentation specifies how to render, process and mix one or more audio elements. Each audio element should first be individually rendered and processed before mixing. Then, any additional processing specified by output_mix_config() should be applied to the mixed audio signal in order to generate the final output audio for playback. § 6.3.2 Rendering an Audio Element details how each audio element should be rendered, while § 6.3.3 Mixing Audio Elements details how the audio elements should be processed and mixed.

6.3.1. Selecting a Mix Presentation

An IA sequence may contain more than one mix presentation. The IA parser should select the appropriate mix presentation in the following order.

  1. If there are any user-selectable mixes, the IA parser should select the mix, or mixes, that match the user’s preferences. An example might be a mix with a specific language. Mix presentations may use mix_presentation_friendly_label to describe such mixes.

  2. If more than one valid mix remains, the IA parser should select an appropriate mix for rendering, in the following order.

    1. If the playback layout is binaural, i.e. headphones:

      1. Select the mix with audio_element_id whose loudspeaker_layout is BINAURAL.

      2. If there is no such mix, select the mix with the highest available loudness_layout .

    2. If the playback layout is loudspeakers:

      1. If there is a mix with a loudness_layout that matches the playback loudspeaker layout, it should be selected. If there is more than one matching mix, the first one should be selected.

      2. If there is no such mix, select the mix presentation with the highest available loudness_layout .
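
The selection order above can be sketched as follows. This is a minimal illustration, not a normative procedure: the dictionary keys (label, loudness_layouts, has_binaural_element), the "x.y.z" layout strings and the layout_rank() ordering (total channel count) are all hypothetical, since this section does not define how "highest" is ranked. Falling back to all mixes when no label matches is also an assumption.

```python
def layout_rank(layout):
    # Hypothetical ordering: total channel count of an "x.y.z" layout string.
    return sum(int(part) for part in layout.split("."))


def select_mix(mixes, playback_layout, preferred_label=None):
    """Pick one mix from a non-empty list of dict-based mix descriptions."""
    # Step 1: keep the mixes matching the user's preference, if any match.
    candidates = [m for m in mixes if m["label"] == preferred_label]
    if not candidates:
        candidates = list(mixes)  # assumption: no match means no filtering
    if len(candidates) == 1:
        return candidates[0]

    # Step 2: prefer a mix that matches the playback layout.
    if playback_layout == "BINAURAL":
        for mix in candidates:
            if mix["has_binaural_element"]:
                return mix
    else:
        for mix in candidates:  # the first matching mix wins
            if playback_layout in mix["loudness_layouts"]:
                return mix

    # Fallback: the mix with the highest available loudness_layout.
    return max(candidates,
               key=lambda m: max(layout_rank(l) for l in m["loudness_layouts"]))


mixes = [
    {"label": "en", "loudness_layouts": ["2.0.0"], "has_binaural_element": False},
    {"label": "en", "loudness_layouts": ["5.1.2", "2.0.0"], "has_binaural_element": False},
    {"label": "fr", "loudness_layouts": ["7.1.4"], "has_binaural_element": False},
]
chosen = select_mix(mixes, "5.1.2", preferred_label="en")
```

Here chosen is the second mix, because it is the first English mix whose loudness_layout matches the 5.1.2 playback layout.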

6.3.2. Rendering an Audio Element

This specification supports the rendering of either a multichannel or ambisonics audio element to either a target loudspeaker layout or a binaural output.

In this section, for a given x.y.z layout, the next highest layout x'.y'.z' means that x', y' and z' are greater than or equal to x, y and z, respectively.

audio_element_type Playback layout Section
CHANNEL_BASED Loudspeakers § 6.3.2.1 Rendering a channel-based audio element to loudspeakers
SCENE_BASED Loudspeakers § 6.3.2.2 Rendering a scene-based audio element to loudspeakers
CHANNEL_BASED Binaural output § 6.3.2.3 Rendering a channel-based audio element to a binaural output
SCENE_BASED Binaural output § 6.3.2.4 Rendering a scene-based audio element to a binaural output
6.3.2.1. Rendering a channel-based audio element to loudspeakers

This section defines the renderer to use, given a channel-based audio element and a loudspeaker playback layout.

If the EAR Direct Speakers renderer is used, the following should be provided for each audio channel of the audio element:

In [ITU2051-3] , an LFE audio channel may be identified either by an explicit label or its frequency content. In this specification, the LFE channel is identified based on the explicit label only, given by loudspeaker_layout .

6.3.2.2. Rendering a scene-based audio element to loudspeakers

This section defines the renderer to use, given a scene-based audio element and a loudspeaker playback layout.

If the EAR HOA renderer is used, the following metadata should be provided to the renderer for each audio channel:

  1. Ambisonics order

  2. Ambisonics degree

  3. Ambisonics normalization method

In this specification, the [AmbiX] format is adopted, which uses SN3D normalization and ACN channel ordering. Accordingly, the Ambisonics order and degree can be computed from the channel index k as follows:

order   n = floor(sqrt(k)),
degree  m = k - n * (n + 1).
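
As an illustration of the two formulas above, the following sketch maps an ACN channel index k to its Ambisonics order and degree. The function name is hypothetical; math.isqrt is used because it returns floor(sqrt(k)) exactly for non-negative integers.

```python
import math


def acn_to_order_degree(k):
    """Return (order n, degree m) for ACN channel index k >= 0."""
    n = math.isqrt(k)      # order  n = floor(sqrt(k))
    m = k - n * (n + 1)    # degree m = k - n * (n + 1)
    return n, m


# First-order Ambisonics (4 channels): (0, 0), (1, -1), (1, 0), (1, 1)
first_order = [acn_to_order_degree(k) for k in range(4)]
```
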
6.3.2.3. Rendering a channel-based audio element to a binaural output

Given a channel-based audio element and a binaural playback layout, the Binaural EBU ADM Direct Speaker renderer [BEAR] should be used. The highest layout provided in scalable_channel_layout_config() should be used as the input to the renderer.

6.3.2.4. Rendering a scene-based audio element to a binaural output

Given a scene-based audio element and a binaural playback system, the Resonance Audio renderer [Resonance-Audio] should be used.

6.3.3. Mixing Audio Elements

Each audio element is processed individually before mixing as follows:

  1. Render to the playback layout.

  2. If the audio elements do not all share a common sample rate, re-sample to 48 kHz.

  3. If the audio elements do not all share a common bit-depth, convert to a common bit-depth. This specification recommends using 16 bits.

  4. If loudness_layout matches the playback layout, apply any per-element processing according to element_mix_config() .

The rendered and processed audio elements are then summed, and output_mix_config() is applied to the sum to generate one sub-mixed audio signal. If there is more than one sub-mix, the outputs of the sub-mixes are further summed to generate the final mixed audio signal.

6.4. Animated Parameters

This section describes how a set of parameters is animated over a subblock in a parameter block and applied to the corresponding audio samples, using the information provided in AnimatedParameterData() .

If animation_type is equal to STEP, the parameter value provided by start_point_value should be applied to all time steps in the subblock.

If animation_type is equal to LINEAR or BEZIER, the information provided in AnimatedParameterData() describes how the parameter is animated as a Bezier curve. Let D be the subblock duration defined in the parameter_block_obu() and P0 , P1 and P2 be 2D coordinates defined as

P0 = (t0, start_point_value),
P1 = (t1, control_point_value),
P2 = (t2, end_point_value),

where t0 = 0 is the subblock start time, t2 = D is the subblock end time and t1 is the control point time given by

t1 = round(D * control_point_relative_time).

The values of t0 , t1 and t2 are expressed as ticks at the parameter_rate given in the associated parameter definition.

If animation_type is equal to LINEAR, the parameter value is linearly interpolated between start_point_value and end_point_value at a given point in time as:

B_linear(a) = (1 - a) * P0 + a * P2,
0 <= a <= 1,

where B_linear(a) = (t, y) is a 2D coordinate with the parameter value y at time t .

If animation_type is equal to BEZIER, the parameter value is interpolated following a quadratic Bezier curve between start_point_value and end_point_value at a given point in time as:

B_quad(a) = (1 - a)^2 * P0 + 2 * (1 - a) * a * P1 + a^2 * P2,
0 <= a <= 1,

where B_quad(a) = (t, y) is a 2D coordinate with parameter value y at time t .

To apply the parameter values to the audio samples in the subblock without interpolation, the time values t0 , t1 and t2 are first converted from ticks at the parameter_rate to audio sample indices:

n0 = t0 * audio_sample_rate / parameter_rate,
n1 = t1 * audio_sample_rate / parameter_rate,
n2 = t2 * audio_sample_rate / parameter_rate.

Then, P0 , P1 , P2 can be rewritten as:

P0 = (n0, start_point_value),
P1 = (n1, control_point_value),
P2 = (n2, end_point_value).

Next, the parameter value y is computed for each time t that corresponds to an integer audio sample index, t = n = [0, 1, 2, ..., n2] . This is done by computing the equivalent value of a for every n , and then applying the Bezier equations B_linear(a) and B_quad(a) to find the parameter value y .

In the case of B_linear(a) , the mapping between n and a is given by:

a = n ÷ n2.

In the case of B_quad(a) , the mapping between n and a is given by:

a = (-beta + sqrt(beta^2 - 4 * alpha * gamma)) ÷ (2 * alpha),

where

alpha = n0 - 2 * n1 + n2,
beta = 2 * (n1 - n0),
gamma = n0 - n.
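
The per-sample evaluation described above can be sketched as follows. This is an illustrative sketch only: the function and argument names are hypothetical, n0 is taken as 0 and n2 as greater than 0, and the special case alpha = 0 (control point exactly halfway, where the quadratic degenerates to a linear equation) is handled as an assumption because the text above does not address it.

```python
import math


def animated_value(animation_type, n, n2, start, end, control=None, n1=None):
    """Parameter value at audio sample index n, for 0 <= n <= n2."""
    n0 = 0
    if animation_type == "STEP":
        return start
    if animation_type == "LINEAR":
        a = n / n2
        return (1 - a) * start + a * end
    # BEZIER: solve alpha * a^2 + beta * a + gamma = 0 for a.
    alpha = n0 - 2 * n1 + n2
    beta = 2 * (n1 - n0)
    gamma = n0 - n
    if alpha == 0:
        a = -gamma / beta  # assumption: degenerate (linear-in-time) case
    else:
        a = (-beta + math.sqrt(beta * beta - 4 * alpha * gamma)) / (2 * alpha)
    return (1 - a) ** 2 * start + 2 * (1 - a) * a * control + a ** 2 * end


# Gain animated from 0.0 to 1.0 over 10 samples, control point at sample 2.
curve = [animated_value("BEZIER", n, 10, 0.0, 1.0, control=0.8, n1=2)
         for n in range(11)]
```

The first and last entries of curve equal start_point_value and end_point_value, respectively, since a = 0 at n = 0 and a = 1 at n = n2.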

6.5. Post Processing

6.5.1. Loudness Normalization

Loudness normalization should be done by adjusting the loudness level to a target output level using information provided in § 2.8.7 Loudness Info Syntax and Semantics . A control may be provided to set unique target output levels for each anchored loudness and/or the integrated loudness. If loudness normalization increases the output level, a peak limiter to prevent saturation and/or clipping may be necessary; true_peak or digital_peak may be used to determine if peak limiting is needed. Alternately, the total amount of normalization may be limited.

The rendered layouts that were used to measure the loudness information of a sub-mix are provided by loudness_layout s.

If one of them matches the playback layout, the loudness information should be used directly for normalization. If there is a mismatch between loudness_layout and the playback layout, the implementation may choose to use the provided loudness information of the highest loudness_layout as-is.

If there is more than one selected loudness_info() specified in the mix presentation (i.e. in case of multiple sub-mixes), the implementation should normalize the loudness of each sub-mix independently before summing them.
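
As a minimal sketch of the normalization step, assuming the loudness and peak values have already been converted to floating-point decibel values (LKFS and dBTP), the gain and the need for peak limiting could be derived as below. The function names are hypothetical, and the -1 dBTP ceiling is taken from § 6.5.2 Limiter.

```python
def normalization_gain(integrated_loudness_db, target_loudness_db):
    """Return (gain in dB, linear gain factor) to reach the target level."""
    gain_db = target_loudness_db - integrated_loudness_db
    return gain_db, 10.0 ** (gain_db / 20.0)


def needs_peak_limiting(peak_db, gain_db, ceiling_db=-1.0):
    """True if the normalized peak would exceed the limiter ceiling."""
    return peak_db + gain_db > ceiling_db


gain_db, gain_linear = normalization_gain(-24.0, -16.0)
limit = needs_peak_limiting(-3.0, gain_db)
```

With these example numbers the signal is raised by 8 dB, so a -3 dBTP peak would end up above the ceiling and limiting would be needed.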

6.5.2. Limiter

The limiter should limit the true peak of the audio signal to -1 dBTP, where true peak is defined in [ITU1770-4] . The limiter should apply to multichannel signals in a linked manner and further support auto-release.

6.6. Down-mix Matrix

6.6.1. Dynamic Down-mix Matrix

This section recommends dynamic down-mixing matrices.

The dynamic down-mixing matrices comply with the down-mixing mechanism specified in § 9.2.2 Annex B-2: Down-mix Mechanism .

6.6.2. Static Down-mix Matrix

This section specifies static down-mix matrices to render to 3.1.2ch from each of 5.1.2ch, 5.1.4ch, 7.1.2ch and 7.1.4ch.

The figures below show static down-mix matrices to 3.1.2ch.

3.1.2ch Down-mix matrix for 5.1.2ch
3.1.2ch Down-mix matrix for 5.1.4ch
3.1.2ch Down-mix matrix for 7.1.2ch
3.1.2ch Down-mix matrix for 7.1.4ch

where p1 = 0.707. Implementations may use the limiter defined in § 6.5.2 Limiter to preserve the energy of the audio signals instead of normalization factors.

7. IAMF Generation Process (Informative)

This section provides a guideline for IA encoding for a given input audio format.

Recommended input audio format for IA encoding is as follows:

For a given input audio and user inputs, the IA encoder outputs an IA sequence which conforms to § 2 Open Bitstream Unit (OBU) Syntax and Semantics .

Input audio is as follows:

User inputs are:

IA encoding can be done by using a combination of the following generation processes.

The figure below shows the IA encoder configuration for a single audio element.

The IA encoder is composed of a Pre-processor, a Codec encoder and an OBU packetizer.

IA Encoder Configuration

7.1. Ambisonics Encoding

For Ambisonics encoding:

7.2. Scalable Channel Audio Encoding

For Scalable Channel Audio encoding:

The figure below shows the IA encoding flowchart for Scalable Channel Audio.

IA Encoding Flowchart for Scalable Channel Audio

7.3. Mix Presentation Encoding

For a Mix Presentation for a single channel-based audio element, the Mix Presentation OBU follows the following restrictions:

For a Mix Presentation for a single scene-based audio element, the Mix Presentation OBU follows the following restrictions:

For a Mix Presentation for N (>1) audio elements (when num_sub-mixes = 1), the Mix Presentation OBU follows the following restrictions:

7.3.1. Element Mix Config

This section provides a guideline to generate element_mix_config().

An IA multiplexer may merge two or more IA sequences. In this case, it should adjust the gain values for element_mix_config() s as necessary to describe the desired relative gains between the IA sequences when they are summed to generate the final mix. It should also ensure that the gains selected do not result in clipping when the final mix is generated.

7.4. Multiple Audio Elements Encoding

This section provides a way to generate an IA sequence having multiple audio elements from two IA sequences conforming to the simple or base profile.

7.4.1. Multiple Audio Elements with One Codec Config

This section provides a way to generate an IA sequence having multiple audio elements from two IA sequences of the simple profile that share the same codec config OBU. Note that the result complies with the base profile.

Step1: Descriptor OBUs are generated as follows:

Step2: ith temporal unit is generated as follows:

Step3: Generate the IA sequence, which starts with the descriptor OBUs followed by the temporal units in order.

7.5. Post Processing

This section provides a way to generate algorithms for post processing.

7.5.1. Loudness Information

This section provides a way to generate loudness_info().

For a given Mix Presentation OBU and a given loudness_layout of a sub-mix of the given Mix Presentation OBU, the following steps are performed in order to produce loudness_info().

8. Convention

8.1. Syntax Description

All syntax elements conform to the Syntactic Description Language specified in [MP4-Systems] unless explicitly described otherwise in this specification.

8.1.1. Data types

leb128() syntaxName

leb128() indicates the type of an unsigned integer. It indicates the following unsigned integer syntaxName is encoded by leb128() specified in [AV1-Convention] .

syntaxName is an unsigned integer which is encoded by leb128() specified in [AV1-Convention] .

sleb128() syntaxName

sleb128() indicates the type of a signed integer. It indicates the following signed integer syntaxName is encoded by leb128() specified in [AV1-Convention] .

syntaxName is a signed integer which is encoded by leb128() specified in [AV1-Convention] .

string syntaxName

string indicates a null-terminated string (i.e. ending at the first byte set to 0x00) that is UTF-8 encoded as defined in [RFC3629] and whose length shall be limited to 128 bytes.

syntaxName is a human readable label. The label shall not include the number 0.
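
To illustrate the leb128() data type described above, the following sketch encodes and decodes an unsigned integer. It assumes the layout from [AV1-Convention]: 7 payload bits per byte, least-significant group first, the most significant bit of each byte acting as a continuation flag, and at most 8 bytes read. The function names are hypothetical.

```python
def encode_leb128(value):
    """Encode a non-negative integer as little-endian base-128 bytes."""
    if value < 0:
        raise ValueError("leb128 encodes unsigned integers only")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # last byte: continuation flag cleared
            return bytes(out)


def decode_leb128(data, pos=0):
    """Return (value, number of bytes consumed), reading at most 8 bytes."""
    value = 0
    for i in range(8):
        byte = data[pos + i]
        value |= (byte & 0x7F) << (7 * i)
        if not (byte & 0x80):
            return value, i + 1
    return value, 8


encoded = encode_leb128(300)
decoded = decode_leb128(encoded)
```
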

8.2. Arithmetic Operators

+ Addition.
- Subtraction.
* Multiplication.
÷ Floating point (arithmetic) division.
/ Integer division with truncation of the result toward zero.
floor(x) The largest integer that is smaller than or equal to x.
sqrt(x) The square root of x.

8.3. Function

8.3.1. Function templates

When the template keyword is used to decorate the class declaration, it indicates that the code is a template with a placeholder type that can be reused by other classes. Only classes that use the template are present in the bitstream; the template itself is not present in the bitstream. Classes that use a function template pass a data type that is specified in either [MP4-Systems] or § 8.1.1 Data types .

Example

template <class T>
class Foo {
  T t;
}
class Bar {
  Foo<int> f;
}

8.3.2. Mathematical functions

Clip3(x, y, z)

It conforms to Clip3 specified in [AV1-Convention] .

round(x)

The round() function returns the integer value closest to x and may be implemented as

round(x) = floor(x + 0.5).

MOD(Number, Divisor)

The MOD() function returns the remainder after Number is divided by Divisor .
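
As a hedged sketch of the round() and Clip3() definitions above: the helper names below are hypothetical, and Clip3(x, y, z) is taken to clamp z to the range [x, y] as in [AV1-Convention]. Note that Python's built-in round() rounds halves to the nearest even integer, so it does not match floor(x + 0.5).

```python
import math


def iamf_round(x):
    """round(x) = floor(x + 0.5), as defined above."""
    return math.floor(x + 0.5)


def clip3(x, y, z):
    """Clamp z to the inclusive range [x, y]."""
    if z < x:
        return x
    if z > y:
        return y
    return z


halves = [iamf_round(v) for v in (0.5, 1.5, 2.5)]
clamped = clip3(0, 255, 300)
```

For non-negative operands, MOD(Number, Divisor) corresponds to Python's % operator.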

9. Annex

9.1. Annex A: ID Linking Scheme (Informative)

The figure below shows the linking scheme among IDs in the obu_header or OBU payload.

ID Linking Scheme

In the above figure,

9.2. Annex B: Rules for Scalable Channel Audio (Normative)

This Annex specifies normative rules for scalable channel audio with num_layers > 1.

9.2.1. Annex B-1: Down-mix parameter and Loudness

This section describes how to generate down-mix parameters and loudness level for a given channel audio and a given list of channel layouts for scalability (i.e. num_layers > 1).

The figure below shows a block diagram for the down-mix parameter and loudness module, including the down-mixer.

IA Down-mix Parameter and Loudness

For a given channel-based input audio (e.g. 7.1.4ch) and a given list of channel layouts based on the input audio,

9.2.2. Annex B-2: Down-mix Mechanism

This section specifies the down-mixing mechanism to generate down-mixed audio for scalable channel audio.

For a given channel-based input audio which conforms to loudspeaker_layout , the surround and top channels (if any) are down-mixed separately, step by step, until the target channels are obtained.

Implementers may use another method to obtain the down-mixed audio from the given input audio, but the resulting down-mixed audio shall comply with that produced by the mechanism in this section.

Therefore, a down-mixer based on the down-mix mechanism is a combination of the following surround down-mixer(s) and top down-mixer(s), as depicted in the figure below.

IA Down-mix Mechanism

For example, to get down-mixed audio 3.1.2ch from 7.1.4ch:

  - S3 of 3.1.2ch is generated by using S7to5 and S5to3 encs.

  - TF2 of 3.1.2ch is generated by using T4to2 and T2toTF2 encs.

9.2.3. Annex B-3: Channel Layout Generation Rule

This section describes the generation rule for channel layouts for scalable channel audio.

For a given channel layout (CL #n) of a channel-based input audio, any list of CLs ({CL #i: i = 1, 2, ..., n}) for a scalable channel audio shall conform to the following rules:

Only down-mix paths that conform to the above rules shall be allowed for scalable channel audio with num_layers > 1, as depicted in the figure below.

IA Down-mix Path

9.2.4. Annex B-4: Recon Gain Generation

This section describes how to generate recon_gain .

NOTE: Recon gain generation is not required when the codec is lossless, e.g. when codec_id is set to ipcm or fLaC .

Recon gain needs to be applied to de-mixed channels. For this, the IA encoder needs to deliver it to IA decoders.

Let’s define the following:

If 10*log10(level Ok / maxL^2) is less than the first threshold value (e.g. -80dB), Recon_Gain (k, i) = 0, where maxL = 32767 for 16 bits.

If 10*log10(level Ok / level Mk ) is less than the second threshold value (e.g. -6dB), Recon_Gain (k, i) is set to the value which makes level Ok = Recon_Gain (k, i)^2 × level Dk. Otherwise, Recon_Gain (k, i) = 1. The actual value (i.e. recon_gain ) to be delivered is floor(255 * Recon_Gain (k, i)).
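
The two threshold tests above can be sketched as follows for one de-mixed channel and one frame. This is an illustrative sketch, not the encoder's definitive behavior: the function and argument names are hypothetical, the thresholds are the example values given above, and the handling of zero signal power and the clamp of the gain to 1 are assumptions added so that the code is well defined and the result fits in the range 0 to 255.

```python
import math


def recon_gain_value(level_ok, level_mk, level_dk, max_l=32767.0,
                     first_threshold_db=-80.0, second_threshold_db=-6.0):
    """Return floor(255 * Recon_Gain) from the three signal powers."""
    if level_ok <= 0.0:
        gain = 0.0  # assumption: silence is below the first threshold
    elif 10.0 * math.log10(level_ok / max_l ** 2) < first_threshold_db:
        gain = 0.0
    elif (level_mk > 0.0 and level_dk > 0.0 and
          10.0 * math.log10(level_ok / level_mk) < second_threshold_db):
        # Choose the gain so that level_ok = gain^2 * level_dk.
        gain = math.sqrt(level_ok / level_dk)
    else:
        gain = 1.0
    gain = min(gain, 1.0)  # assumption: keep the delivered value in 0..255
    return math.floor(255 * gain)


# Original power is 20 dB below the mixed power; de-mixed power is 4x too high.
value = recon_gain_value(level_ok=1.0e4, level_mk=1.0e6, level_dk=4.0e4)
```

In this example Recon_Gain is sqrt(1.0e4 / 4.0e4) = 0.5, so the delivered value is floor(127.5) = 127.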

For example, if we assume CL #i = 7.1.4ch and CL #i-1 = 5.1.2ch, then de-mixed channels are D_Lrs7, D_Rrs7, D_Ltb4 and D_Rtb4.

  - D_Lrs7 and D_Rrs7 are de-mixed from Ls5 and Rs5 in the (i-1)th ChannelGroup by using Lss7 and Rss7 in the ith ChannelGroup and its relevant demixing parameters (i.e., α(k) and β(k)), respectively.

  - D_Ltb4 and D_Rtb4 are de-mixed from Ltf2 and Rtf2 in the (i-1)th ChannelGroup by using Ltf4 and Rtf4 in the ith ChannelGroup and its relevant demixing parameter (i.e., γ(k)), respectively.

Recon_Gain for D_Lrs7:

  - Level Ok is the signal power for the frame #k of Lrs7 in the ith ChannelGroup.

  - Level Mk is the signal power for the frame #k of Ls5 in the (i-1)th ChannelGroup.

  - Level Dk is the signal power for the frame #k of D_Lrs7.

Recon_Gain for D_Rrs7:

  - Level Ok is the signal power for the frame #k of Rrs7 in the ith ChannelGroup.

  - Level Mk is the signal power for the frame #k of Rs5 in the (i-1)th ChannelGroup.

  - Level Dk is the signal power for the frame #k of D_Rrs7.

Recon_Gain for D_Ltb4:

  - Level Ok is the signal power for the frame #k of Ltf4 in the ith ChannelGroup.

  - Level Mk is the signal power for the frame #k of Ltf2 in the (i-1)th ChannelGroup.

  - Level Dk is the signal power for the frame #k of D_Ltb4.

Recon_Gain for D_Rtb4:

  - Level Ok is the signal power for the frame #k of Rtf4 in the ith ChannelGroup.

  - Level Mk is the signal power for the frame #k of Rtf2 in the (i-1)th ChannelGroup.

  - Level Dk is the signal power for the frame #k of D_Rtb4.

9.2.5. Annex B-5: ChannelGroup Generation Rule

This section describes the generation rule for ChannelGroup .

For a given channel-based input audio and the list of CLs ({CL #i: i = 1, 2, ..., n}), the CG Generation module outputs the transformed audio (i.e. ChannelGroups), which shall conform to the following rules:

The figure below shows one example of a transformation matrix with 4 CGs (2ch/3.1.2ch/5.1.2ch/7.1.4ch).

Example of Transformation Matrix with 4 CGs

9.3. Annex C: Consumption of IAMF bitstream (informative)

TODO. Fill in example workflows.

Conformance

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. [RFC2119]

Examples in this specification are introduced with the words “for example” or are set apart from the normative text with class="example" , like this:

This is an example of an informative example.

Informative notes begin with the word “Note” and are set apart from the normative text with class="note" , like this:

Note, this is an informative note.

References

Normative References

[AAC]
Information technology — Generic coding of moving pictures and associated audio information — Part 7: Advanced Audio Coding (AAC) . Standard. URL: https://www.iso.org/standard/43345.html
[AmbiX]
AMBIX - A SUGGESTED AMBISONICS FORMAT . Paper. URL: https://iem.kug.ac.at/fileadmin/media/iem/projects/2011/ambisonics11_nachbar_zotter_sontacchi_deleflie.pdf
[AV1-Convention]
Conventions . Spec. URL: https://aomedia.org/av1/specification/conventions/
[BCP47]
BCP 47 . Best Practice. URL: https://www.rfc-editor.org/info/bcp47
[BEAR]
Developing a Binaural Renderer for Audio Definition Model Content . Paper. URL: https://www.aes.org/e-lib/browse.cfm?elib=21729
[FLAC]
Free Lossless Audio Codec . Best Practice. URL: https://xiph.org/flac/format.html
[ISO14496-14]
Information technology — Coding of audio-visual objects — Part 14: MP4 file format . January 2020. Published. URL: https://www.iso.org/standard/79110.html
[ISOBMFF]
Information technology — Coding of audio-visual objects — Part 12: ISO Base Media File Format . December 2015. International Standard. URL: http://standards.iso.org/ittf/PubliclyAvailableStandards/c068960_ISO_IEC_14496-12_2015.zip
[ISOIEC-23091-3-2018]
Information Technology - Coding-Independent Code Points - Part 3: Audio . Standard. URL: https://www.iso.org/standard/73413.html
[ITU1770-4]
Algorithms to measure audio programme loudness and true-peak audio level . Standard. URL: https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.1770-4-201510-I!!PDF-E.pdf
[ITU2051-3]
Advanced sound system for programme production . Standard. URL: https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.2051-3-202205-I!!PDF-E.pdf
[ITU2127-0]
Audio Definition Model renderer for advanced sound systems . Standard. URL: https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.2127-0-201906-I!!PDF-E.pdf
[MP4-Audio]
Information technology — Coding of audio-visual objects — Part 3: Audio . Standard. URL: https://www.iso.org/standard/76383.html
[MP4-PCM]
Information technology — MPEG audio technologies — Part 5: Uncompressed audio in MPEG-4 file format . Standard. URL: https://www.iso.org/standard/77752.html
[MP4-Systems]
Information technology — Coding of audio-visual objects — Part 1: Systems . Standard. URL: https://www.iso.org/standard/55688.html
[OPUS-IN-ISOBMFF]
Encapsulation of Opus in ISO Base Media File Format . Best Practice. URL: https://opus-codec.org/docs/opus_in_isobmff.html
[Q-Format]
Q (number format) . Best Practice. URL: https://en.wikipedia.org/wiki/Q_(number_format)
[Resonance-Audio]
Efficient Encoding and Decoding of Binaural Sound with Resonance Audio . Paper. URL: https://www.aes.org/e-lib/browse.cfm?elib=20446
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels . March 1997. Best Current Practice. URL: https://datatracker.ietf.org/doc/html/rfc2119
[RFC3629]
F. Yergeau. UTF-8, a transformation format of ISO 10646 . November 2003. Internet Standard. URL: https://www.rfc-editor.org/rfc/rfc3629
[RFC6381]
R. Gellens; D. Singer; P. Frojdh. The 'Codecs' and 'Profiles' Parameters for "Bucket" Media Types . August 2011. Proposed Standard. URL: https://www.rfc-editor.org/rfc/rfc6381
[RFC6716]
JM. Valin; K. Vos; T. Terriberry. Definition of the Opus Audio Codec . September 2012. Proposed Standard. URL: https://www.rfc-editor.org/rfc/rfc6716
[RFC7845]
T. Terriberry; R. Lee; R. Giles. Ogg Encapsulation for the Opus Audio Codec . April 2016. Proposed Standard. URL: https://www.rfc-editor.org/rfc/rfc7845
[RFC8486]
J. Skoglund; M. Graczyk. Ambisonics in an Ogg Opus Container . October 2018. Proposed Standard. URL: https://www.rfc-editor.org/rfc/rfc8486

Informative References

[AI-CAD-Mixing]
AI 3D immersive audio codec based on content-adaptive dynamic down-mixing and up-mixing framework . Paper. URL: https://www.aes.org/e-lib/browse.cfm?elib=21489