Trends in Standardization of Audio Coding Technologies

Tomoyasu Komori
Advanced Television Systems Research Division

In 2011, the Ministry of Internal Affairs and Communications (MIC) issued an ordinance revising the audio coding formats for 8K Super Hi-Vision (8K) broadcasting, which uses 22.2 multichannel (22.2 ch) sound. The ordinance makes it possible to use 22.2 ch sound in broadcasting satellite (BS) digital broadcasts and other media. In particular, it specifies that digital broadcast audio formats conform to either MPEG-4 Advanced Audio Coding (AAC) or MPEG-4 Audio Lossless Coding (ALS). The Association of Radio Industries and Businesses (ARIB) revised ARIB STD-B32 accordingly. These revisions set the maximum number of audio input channels for digital broadcasts to “22 channels and two low-frequency effects (LFE) channels” and added MPEG-4 AAC and ALS to the available formats. This article describes the latest trends in standardization of audio coding formats for 3D sound.

1. Introduction

In Japan, audio coding formats were revised in 2011 by MIC ordinance No. 87, “Standard digital broadcasting formats for television broadcasting1)”, to enable 8K broadcasts with 22.2 ch sound. The ordinance increases the maximum number of audio input channels for BS and communications satellite (CS) digital broadcasts from 5.1 ch (5 channels and 1 LFE channel) to 22.2 ch (22 channels and 2 LFE channels). Audio coding for 8K broadcasts was also required to conform to either the MPEG-4 AAC standard2), a highly efficient lossy compression coding, or MPEG-4 ALS3), a lossless compression coding.

In response to the MIC ordinance, ARIB revised its standard ARIB STD-B32, “Video coding, audio coding and multiplexing specifications for digital broadcasting4)”. This revision added detailed specifications supporting 22.2 ch audio modes in MPEG-4 AAC audio coding5). For MPEG-4 ALS audio coding, regulations were added on the number of channels and on constraints such as the prediction order.

This article describes these trends in international and domestic standardization and introduces the latest 3D audio coding scheme, MPEG-H 3D Audio, which was standardized in February 2015.

2. Overview of 22.2 ch sound

The 22.2 ch sound system is a 3D sound format with a total of 24 channels arranged in three layers6).

There are nine channels in the top layer, above the viewing position; ten channels in the middle layer, at the level of the viewer’s ears; three channels in the bottom layer, below the viewing position; and two LFE channels. The arrangement and labels of the channels in the 22.2 ch sound system are shown in Figure 1.
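
As a compact reference for this layout, the following minimal sketch tabulates the per-layer channel counts in Python. The layer totals come from the description above; the channel labels mentioned in the comments (FC, BtFC) appear later in this article, and everything else is illustrative.

```python
# Channel counts per layer in the 22.2 ch sound system: 9 + 10 + 3 = 22, plus 2 LFE.
LAYERS = {
    "top":    9,   # above the viewing position
    "middle": 10,  # at ear level; includes FC, the front-center channel
    "bottom": 3,   # below the viewing position; includes BtFC
    "LFE":    2,   # low-frequency effects channels
}

assert sum(count for layer, count in LAYERS.items() if layer != "LFE") == 22
```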

NHK set requirements for a highly realistic sound format suitable for 8K broadcasts, conducted subjective evaluations showing that the 22.2 ch sound system meets these requirements, and has been contributing to standardization of the format both in Japan and internationally6).

Figure 1: 22.2 ch audio channel placement and labels

3. Overview of MPEG-4 AAC standard and ALS standard

3.1 Compression encoding technology for audio

There are two main types of encoding technology used for compression of audio signals.

  1. Coding methods that exploit auditory characteristics: with these methods, the distortion produced by encoding is completely, or almost completely, undetectable by ear, even after compression.
  2. Methods that eliminate redundancy in the audio data using techniques such as waveform prediction or statistical modeling: if the original signal can be perfectly reproduced from the received data, the method is called lossless coding.

AAC is a method of type 1, while ALS is a method of type 2.

3.2 Overview of MPEG-4 AAC

MPEG-4 AAC is standardized in ISO/IEC 14496-3 Subpart 4 (ISO: International Organization for Standardization; IEC: International Electrotechnical Commission). It is an extension of MPEG-2 AAC (ISO/IEC 13818-7)7); it can efficiently encode audio signals such as music and can handle multichannel signals such as 22.2 ch in addition to monaural and stereo.

MPEG-4 AAC is a type of frequency-domain compression coding: it analyzes the frequency components of the audio signal and uses techniques such as masking*1 to achieve high compression rates by exploiting the characteristics of human hearing. A block diagram of audio encoding using auditory characteristics is shown in Figure 2. To break the audio down into frequency components, MPEG-4 AAC uses “transform coding”, which applies the modified discrete cosine transform (MDCT) to convert the signal directly into a frequency-domain signal. The long window (block) used to transform the signal from the time domain into the frequency domain is 2,048 samples, but the encoder can switch adaptively to 256-sample blocks when finer time resolution is needed.
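
As a rough illustration of the transform stage just described, the following sketch computes an MDCT over a frame and switches between 2,048- and 256-sample blocks using a crude energy-based transient detector. The detector and its threshold are illustrative assumptions, not the psychoacoustic decision logic of a real encoder.

```python
import numpy as np

def mdct(frame):
    """MDCT of a 2N-sample frame -> N coefficients (windowing omitted for brevity)."""
    two_n = len(frame)
    n_half = two_n // 2
    n = np.arange(two_n)
    k = np.arange(n_half)
    basis = np.cos(np.pi / n_half * (n[None, :] + 0.5 + n_half / 2) * (k[:, None] + 0.5))
    return basis @ frame

def choose_block_length(frame, threshold=4.0):
    """Pick a long (2,048) or short (256) block from a crude transient measure."""
    half = len(frame) // 2
    early = np.sum(frame[:half] ** 2) + 1e-12  # avoid division by zero on silence
    late = np.sum(frame[half:] ** 2)
    return 256 if late / early > threshold else 2048
```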

MPEG-4 AAC has several audio object types*2, but broadcast services currently use only “Low Complexity” (LC), which offers a good balance between decoder circuit size and sound quality.

With MPEG-4 AAC, almost no distortion due to encoding can be detected, even when a stereo signal is compressed to approximately 1/12 of its original size, into the range of 128 to 144 kbps (uncompressed stereo at 48 kHz and 16 bits is 2 × 48,000 × 16 = 1,536 kbps, and 1,536/12 = 128 kbps).

Figure 2: Block diagram of audio coding using a psychoacoustic model

3.3 Differences between MPEG-2 AAC and MPEG-4 AAC

MPEG-2 AAC (ISO/IEC 13818-7) and MPEG-4 AAC (ISO/IEC 14496-3 Subpart 4) use almost the same tools for compressing audio signals, but MPEG-4 AAC adds an encoding tool called Perceptual Noise Substitution (PNS)*3. When encoding audio, much of the required bit rate goes to transmitting the MDCT coefficients obtained by transforming the audio signal into the frequency domain. PNS reduces the bit rate by treating noise-like signals within a scale-factor band*4 as noise and transmitting only the corresponding power information. The decoder then uses that information to insert noise of a suitable level when reconstructing the audio signal.
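
The following decoder-side sketch illustrates the idea under simplified assumptions: the scale-factor band edges and the transmitted per-band powers are given directly, rather than parsed from the AAC bitstream syntax.

```python
import numpy as np

def pns_reconstruct(band_edges, band_powers, rng=np.random.default_rng(0)):
    """Fill each scale-factor band with noise scaled to the transmitted power."""
    coeffs = np.zeros(band_edges[-1])
    for lo, hi, power in zip(band_edges[:-1], band_edges[1:], band_powers):
        noise = rng.standard_normal(hi - lo)
        noise *= np.sqrt(power / np.mean(noise ** 2))  # match the band's mean power
        coeffs[lo:hi] = noise
    return coeffs

# Example: three bands with decreasing power
spectrum = pns_reconstruct([0, 4, 12, 24], [1.0, 0.25, 0.05])
```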

3.4 Overview of MPEG-4 ALS

MPEG-4 ALS was standardized as ISO/IEC 14496-3:2005/Amd.2, MPEG-4 Audio Lossless Coding, in March 2006. It is a type of lossless coding: by applying linear prediction to past samples, it can exactly reproduce the original waveform, even for multichannel signals and signals with high sampling rates. The input audio signal is analyzed to calculate the linear prediction parameters and the prediction residual, and both are variable-length encoded to form the encoded bitstream (Figure 3). The amplitude of the prediction residual is generally small compared with the original signal, and this characteristic can be used to reduce the amount of data by 15% to 70% relative to the uncompressed data.

Figure 3: Basic architecture of MPEG-4 ALS encoding and decoding
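
To make the prediction step concrete, here is a minimal sketch using a fixed-order least-squares predictor; MPEG-4 ALS itself uses an adaptive predictor and entropy-codes the residual, so this is a simplified stand-in rather than the standard's algorithm. The order of 15 matches the broadcasting restriction described in Section 4.3.

```python
import numpy as np

def lpc_residual(x, order=15):
    """Predict each sample from `order` past samples; return coefficients and residual."""
    # Each row of `past` holds the `order` previous samples for one target sample.
    past = np.stack([x[order - 1 - i : len(x) - 1 - i] for i in range(order)], axis=1)
    target = x[order:]
    coeffs, *_ = np.linalg.lstsq(past, target, rcond=None)
    residual = target - past @ coeffs  # small for predictable signals, so it codes cheaply
    return coeffs, residual
```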

4. ARIB STD-B32 revisions

Several revisions were made to ARIB STD-B32 to support ultra-high-definition television in advanced BS digital broadcasts. In addition to support for 22.2 ch audio input signals, down-mixing*5 parameters for reproducing 22.2 ch audio encoded in MPEG-4 AAC on receivers with 5.1 ch or stereo playback were standardized, along with formats for transmitting those parameters. Dialog enhancement*6 and dialog switching*7 functions were also introduced to extend conventional broadcast services. In addition, some restrictions were placed on the parameters that can be used with MPEG-4 ALS.

Note that the MPEG-4 audio coding standard allows a wide range of sampling frequencies and numbers of channels, but MIC ordinances and bulletins, as well as the ARIB standards, specify that 8K broadcasts use a sampling frequency of 48 kHz and quantization of 16 bits or greater. Table 1 gives the audio formats applicable to each digital broadcast standard (from 2011 MIC ordinances No. 87 and No. 94).

The MPEG-4 audio coding standard also assigns separate channel configuration numbers to commonly used audio systems such as two-channel stereo and 5.1 ch audio. Table 2 gives the channel configuration numbers and the numbers of channels usable with MPEG-4 AAC and ALS. Note that 22.2 ch audio is assigned the number 13.

Table 1: Audio formats applicable to digital broadcasting

Broadcast standard                           | Sampling frequency | Max. audio input channels | MPEG-2 AAC | MPEG-2 BC†2 | MPEG-4 AAC | MPEG-4 ALS
Digital terrestrial TV broadcasting          | 32/44.1/48 kHz     | 5.1 ch                    | Y          |             |            |
V-High multimedia broadcasting               | 32/44.1/48 kHz     | 5.1 ch                    | Y          |             |            |
V-Low multimedia broadcasting                | 32 kHz or greater  | 5.1 ch                    | Y          |             | Y          | Y
BS digital broadcasting                      | 32/44.1/48 kHz     | 5.1 ch                    | Y          |             |            |
Advanced BS digital broadcasting             | 48 kHz             | 22.2 ch                   |            |             | Y          | Y
Narrow band CS digital broadcasting          | 32/44.1/48 kHz     | 5.1 ch                    | Y          | Y           |            |
Wide band CS digital broadcasting            | 32/44.1/48 kHz     | 5.1 ch                    | Y          |             |            |
Advanced narrow band CS digital broadcasting | 32/44.1/48 kHz     | 22.2 ch†1                 | Y          |             | Y          | Y
Advanced wide band CS digital broadcasting   | 48 kHz             | 22.2 ch                   |            |             | Y          | Y

†1 Limited to 5.1 in operational regulations.
†2 Encoding that is backward compatible with MPEG-1 Layer 2.

Table 2: Channel configurations and numbers of channels usable with MPEG-4 AAC and ALS

Channel configuration number | Number of channels
 1                           | 1 ch (1/0)
 2                           | 2 ch (2/0)
 3                           | 3 ch (3/0)
 4                           | 4 ch (3/1)
 5                           | 5 ch (3/2)
 6                           | 5.1 ch (3/2.1)
 7                           | 7.1 ch (5/2.1)
11                           | 6.1 ch (3/0/3.1)
12                           | 7.1 ch (3/2/2.1)
13                           | 22.2 ch (3/3/3-5/2/3-3/0/0+2)
14                           | 7.1 ch (2/0/0-3/0/2-0/0/0+1)
 0                           | 3 ch (2/1), 4 ch (2/2), or two audio tracks (dual mono) (1/0+1/0)

・The channels are expressed as “top layer (front/side/back)-middle layer (front/side/back)-bottom layer (front/side/back)+LFE”.
・0 indicates that no channels are allocated in that direction.
・Audio modes with only the middle layer are expressed as “middle layer (front/side/back).LFE”; audio modes with only the middle layer and no side channels, and stereo modes, are expressed as “middle layer (front/back).LFE”.

4.1 Revisions for transmitting AAC down-mix coefficients

When down-mixing from multichannel sound with more than 5.1 channels (audio modes with channel configuration numbers 7, 11, 12, 13, and 14) to two-channel stereo, the signals are first down-mixed to 5.1 ch and then to two-channel stereo. A data stream element (DSE)*8, as described in ISO/IEC 14496-3:2009/Amd.4, is used to transmit the coefficients*9 for down-mixing from 5.1 ch to two-channel stereo.
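
The second stage (5.1 ch to two-channel stereo) can be sketched as below, assuming the center and surround coefficients have already been read from the DSE. The default values shown (1/√2) are a common convention for this kind of down-mix, not necessarily the coefficients a broadcaster transmits, and the LFE channel is omitted for simplicity.

```python
def downmix_51_to_stereo(fl, fr, fc, ls, rs, k_center=2 ** -0.5, k_surround=2 ** -0.5):
    """Two-channel down-mix of a 5.1 ch signal (LFE omitted).

    k_center and k_surround stand in for the coefficients carried in the DSE.
    Channel arguments are sample arrays (e.g., NumPy arrays).
    """
    left = fl + k_center * fc + k_surround * ls
    right = fr + k_center * fc + k_surround * rs
    return left, right
```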

Note that when these standards were being drafted, NHK used materials from many programs to conduct experiments8) examining how to down-mix appropriately from 22.2 ch to 5.1 ch. It derived a default down-mixing method and set of coefficients and contributed them as revisions to ARIB STD-B32.

4.2 Revisions to the AAC dialog control function

(1) Dialog enhancement function

The dialog enhancement function uses flags to distinguish the dialog channels (containing speech, narration, etc.) from the background audio channels in a program, and it enables the dialog channel signal levels to be adjusted independently of the background channels.
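
A minimal receiver-side sketch of this behavior, assuming the per-channel dialog flags have already been parsed from the stream and that channel signals are sample arrays:

```python
def enhance_dialog(channels, dialog_flags, gain_db):
    """Scale only the flagged dialog channels, leaving background channels untouched."""
    gain = 10 ** (gain_db / 20)  # convert the user-selected dB value to a linear gain
    return [ch * gain if flagged else ch
            for ch, flagged in zip(channels, dialog_flags)]
```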

(2) Dialog signal switching function

The dialog switching function enables alternate dialog signals (such as English or French dialog) to be transmitted separately from the 22.2 ch audio signal, using a data stream element (DSE) within the same audio stream, and to be substituted on the receiver for the originally allocated signal (the initial dialog signal). The alternate audio can be reproduced from one or more channels, as selected by the broadcaster, and the audio level of each channel can also be specified by the broadcaster (e.g., FC 0 dB, BtFC -3 dB).
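
The substitution step might look like the following sketch, assuming the alternate dialog signals and their broadcaster-specified levels have already been extracted from the DSE; the channel labels follow Figure 1.

```python
def switch_dialog(mix, alternates, levels_db):
    """Replace original dialog channels with alternate-language signals at given levels.

    mix:        dict mapping channel label (e.g., 'FC', 'BtFC') to its signal
    alternates: dict mapping channel label to the alternate dialog signal
    levels_db:  dict mapping channel label to the broadcaster-specified level in dB
    """
    out = dict(mix)
    for label, signal in alternates.items():
        out[label] = signal * 10 ** (levels_db[label] / 20)
    return out

# e.g., switch_dialog(mix, {"FC": en_fc, "BtFC": en_btfc}, {"FC": 0.0, "BtFC": -3.0})
```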

Receivers with the dialog switching function can receive external instructions to switch, for example, the original Japanese dialog in FC and BtFC (see Figure 1) to English or French dialog. Moreover, the dialog’s level can be controlled after the language has been switched.

NHK submitted draft revisions including these dialog control functions after conducting a study of the MPEG-4 AAC syntax (the rules for expressing data within the encoded bit stream). It also prototyped a codec conforming to the standard and demonstrated the feasibility of the functions9).

4.3 ALS parameters

The MPEG-4 ALS standard supports up to 65,536 channels and linear prediction orders of up to 1,023, but its use in digital broadcasting is restricted to a maximum of 22.2 ch and a prediction order of 15.

5. Future coding formats

Besides MPEG-4 AAC and ALS, a number of 3D audio formats with more channels than 5.1 ch have recently come into use in movie theaters and home reproduction systems. For example, Auro-3D places additional loudspeakers above the horizontal plane of the 5.1 ch loudspeaker layout, and there are 3D audio formats, such as Dolby Atmos, that can mix independent audio signals, called objects, with other channels during playback. This section introduces MPEG-H 3D Audio as one such format that is undergoing international standardization.

5.1 Latest trends in MPEG Audio standards: MPEG-H 3D Audio

MPEG is currently working on standardization of MPEG-H 3D Audio10) as a next-generation audio coding format for video formats exceeding the quality of HDTV, including 4K and 8K ultra-high-definition television.

MPEG-H 3D Audio will encode multichannel audio such as 22.2 ch more efficiently and will render 3D audio in smaller spaces with a more practical number of loudspeakers (e.g., 10.1 or 8.1 channels) by redistributing signals to the individual speaker channels.

The specification mainly targets home reproduction systems with loudspeakers positioned overhead. However, it also accommodates other viewing environments, such as personal televisions, smartphones, and tablets used with headphones.

The features of MPEG-H 3D Audio include advanced encoding technology, based on MPEG Unified Speech and Audio Coding (USAC)11)*10 and MPEG Spatial Audio Object Coding (SAOC)12)*11, and the use of multiple rendering technologies. The base rendering method is called Vector Base Amplitude Panning (VBAP)13)*12, which is combined with technology to play back the rendered signals in headphones or other loudspeaker arrangements.
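
The core idea of VBAP for a single loudspeaker triplet is small enough to sketch: solve for the three gains that make the weighted sum of the loudspeaker direction vectors point at the source direction. This follows the textbook formulation13) and is not MPEG-H's full renderer.

```python
import numpy as np

def vbap_gains(source_dir, speaker_dirs):
    """Gains for a three-loudspeaker triplet so the panned image points at source_dir."""
    L = np.stack([d / np.linalg.norm(d) for d in speaker_dirs])  # rows are unit vectors
    g = np.linalg.solve(L.T, source_dir / np.linalg.norm(source_dir))
    g = np.clip(g, 0.0, None)     # negative gains mean the source lies outside this triplet
    return g / np.linalg.norm(g)  # normalize so overall loudness stays constant
```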

It also uses a format called Higher Order Ambisonics (HOA)14), which expands the sound field into a sum of spherical harmonic functions*13 for recording and playback.
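
The first-order case of this expansion is classic B-format Ambisonics, sketched below for a mono source at a given azimuth and elevation (with the conventional 1/√2 scaling of the omnidirectional component); higher orders add further spherical harmonic terms in the same manner.

```python
import numpy as np

def foa_encode(signal, azimuth, elevation):
    """First-order Ambisonics (B-format) encoding of a mono source (angles in radians)."""
    w = signal / np.sqrt(2.0)                         # omnidirectional component
    x = signal * np.cos(azimuth) * np.cos(elevation)  # front-back
    y = signal * np.sin(azimuth) * np.cos(elevation)  # left-right
    z = signal * np.sin(elevation)                    # up-down
    return np.stack([w, x, y, z])
```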

5.2 MPEG-H 3D Audio coding technology

A block diagram of audio encoding with MPEG-H 3D Audio is shown in Figure 4. The encoding efficiency of channel-based*14 objects is improved by pre-rendering them before encoding. Conversely, for objects whose playback position may change at the receiver, a monaural signal is provided to the encoder, and rendering and mixing are done by the receiver. Multiple objects can also be handled together using technologies such as MPEG SAOC; this reduces the number of transmission channels and the amount of data, improving coding efficiency. The core encoding block is based on AAC, and Single Channel Elements (SCE)*15, Channel Pair Elements (CPE)*16, and Quad Channel Elements (QCE)*17 are used to improve efficiency. MPEG-H 3D Audio can also encode object metadata (OAM)*18 efficiently.

Figure 4: Block diagram of MPEG-H 3D Audio encoding
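
The encoder-side pre-rendering step in Figure 4 can be sketched as a simple mix of mono objects into a channel bed. Here the rendering gains are given directly, whereas in MPEG-H they would be derived from the object metadata (OAM), so this is an illustrative simplification rather than the standard's renderer.

```python
import numpy as np

def prerender_objects(bed, objects, gains):
    """Mix mono objects into a channel bed before core (AAC-based) encoding.

    bed:     (channels, samples) array of channel-based signals
    objects: (num_objects, samples) array of mono object signals
    gains:   (channels, num_objects) rendering gains derived from OAM
    """
    return bed + gains @ objects
```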

6. Conclusion

This article described the revisions to MIC ordinances and ARIB standards that have been issued to standardize audio coding technology for 8K broadcasting. It introduced the formats conforming to the MPEG-4 AAC and ALS standards that will enable 22.2 ch sound broadcast services on advanced BS digital broadcasts and other services. It also described revisions to the ARIB standard concerning down-mixing and dialog control functions, revisions related to new broadcast services, and standardization trends related to MPEG-H 3D Audio, the latest coding format for 3D audio. NHK will continue to contribute to domestic and international standardization in the future.