Technologies and Services on Digital Broadcasting (3)

Source Coding of Audio
"Technologies and Services on Digital Broadcasting" (in Japanese, ISBN4-339-01162-2) is published by CORONA publishing co., Ltd. Copying, reprinting, translation, or retransmission of this document is prohibited without the permission of the authors and the publishers, CORONA publishing co., Ltd. and NHK.

This issue features the second of a new series of articles on state-of-the-art digital broadcasting technology. Researchers at NHK have just published a new book titled "Technologies and Services on Digital Broadcasting", whose chief editor is STRL's Director General Dr. Osamu Yamada. Although the book is written in Japanese, it contains a great deal of useful information on digital broadcasting technology, from basic techniques to application systems. Over the next several issues, the editors will serialize the book as English-language articles.

1. Digitization of audio signals

The audio signal of Broadcasting Satellite (BS) analog television in Japan is broadcast digitally using a system that includes a high-quality audio transmission mode, called B-mode, having better-than-CD (Compact Disc) quality. Many countries have comparable low-bit-rate audio coding systems. Low-bit-rate audio coding is now in use in Communication Satellite (CS) digital television and BS digital television broadcasting in Japan and in the digital television and digital audio broadcasting of Europe and the United States.
The parameters that affect audio quality are bandwidth, signal-to-noise ratio, and dynamic range. Each of these relates to sampling frequency and number of quantization bits. In a high-quality audio system, an audio bandwidth of 15 to 20 kHz is desirable, and for this reason, sampling frequencies of 32, 44.1, and 48 kHz are used. The number of quantization bits is 16 or 14, which covers a dynamic range of 96 or 86 dB, respectively. In low-bit-rate audio coding, moreover, sound quality is greatly affected by the audio coding scheme and the transmission bit rate.
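As a rough check of the relation between quantization bits and dynamic range, the following sketch evaluates two common conventions for an N-bit uniform quantizer. The function names are illustrative and do not come from the book. The first convention is the ratio of full scale to one quantization step; the second is the signal-to-quantization-noise ratio of a full-scale sine wave, which is about 1.76 dB higher. Published figures can differ by a couple of dB depending on which convention is used.

```python
import math

def step_ratio_db(bits: int) -> float:
    """Ratio of full scale to one quantization step: 20*log10(2**bits)."""
    return 20.0 * math.log10(2 ** bits)

def sine_sqnr_db(bits: int) -> float:
    """Approximate SQNR of a full-scale sine wave with uniform quantization."""
    return 6.02 * bits + 1.76

for bits in (16, 14):
    print(bits, "bits:",
          round(step_ratio_db(bits), 1), "dB (step ratio),",
          round(sine_sqnr_db(bits), 1), "dB (sine SQNR)")
```

Each additional bit adds roughly 6 dB under either convention.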

2. Low-bit-rate audio coding

Linear PCM digital audio coding requires a high transmission bit rate. Thus, there is a need for high-quality low-bit-rate audio coding that can make efficient use of limited resources.
Low-bit-rate audio coding methods generally fall under two categories.
(a) Coding that takes human auditory characteristics into account: Coding noise, if generated, is either inaudible or, at most, barely perceptible.
(b) Elimination of redundancies in audio-signal data by using predictive coding of waveforms and statistical techniques: This is called "lossless coding" if all of the original signal can be regenerated from the received data.
Compared with method (b), method (a) gives a substantially higher level of bit compression.
Low-bit-rate audio coding methods can also be divided into time-domain coding and frequency-domain coding types. While time-domain coding processes audio-signal waveforms in the time domain, frequency-domain coding decomposes the audio-signal waveform into frequency components that then become the object of processing. Table 1 shows current audio coding techniques and their applications.

Table 1: Examples of audio coding technologies and applications
Coding domain | Coding technology | Applications
Time-domain coding | Instantaneous companding coding | Transmission line for FM broadcasting
Time-domain coding | Near-instantaneous companding coding | A-mode audio for BS television
Time-domain coding | Predictive coding | G.722
Frequency-domain coding | Sub-band coding | MPEG-1 Layer I and Layer II audio coding
Frequency-domain coding | Transform coding | AAC audio coding, AC-3 audio coding

Instantaneous companding coding, near-instantaneous companding coding, and predictive coding are time-domain codings. Near-instantaneous companding coding is used in the A-mode audio coding system of current BS analog television while near-instantaneous differential companding coding, which combines near-instantaneous companding and differential PCM coding, is the audio coding method of the HiVision MUSE system.
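The idea behind near-instantaneous companding can be sketched as follows. The block-based range selection is the essence of the technique, but the specific parameters (14-bit input, 10-bit output, arbitrary block length) and the function names are assumptions chosen for illustration, not values taken from the book. For each short block, the encoder picks one shift value from the block peak, drops that many low-order bits from every sample, and sends the shift as side information.

```python
def compand_block(samples, in_bits=14, out_bits=10):
    """Encode one block: choose a shift from the block peak, drop low bits."""
    peak = max(abs(s) for s in samples)
    limit = 1 << (out_bits - 1)            # 512 for 10-bit signed codes
    shift = 0
    while shift < in_bits - out_bits and (peak >> shift) >= limit:
        shift += 1
    return shift, [s >> shift for s in samples]

def expand_block(shift, codes):
    """Decode one block by restoring the original scale."""
    return [c << shift for c in codes]

quiet = [100, -200, 50]                    # small block: no bits are dropped
loud = [8000, -8192, 12]                   # large block: 4 low bits are dropped
for block in (quiet, loud):
    shift, codes = compand_block(block)
    print(shift, codes, expand_block(shift, codes))
```

Quiet blocks keep their full resolution, while loud blocks lose low-order bits whose error is small relative to the block peak.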
Two types of frequency-domain coding, sub-band coding and transform coding, are used for coding audio signals. A coding scheme in the frequency domain enables human auditory properties to be incorporated in a very efficient way. Although this type of audio coding requires a high computation rate, advances in signal-processing devices such as digital signal processors (DSPs) are making real-time processing possible. Current frequency-domain audio coding systems include the ATRAC system for MiniDiscs (MDs) and the MPEG-2 AAC and MPEG-2 BC audio coding systems used in BS digital broadcasting and CS digital broadcasting.

3. MPEG-1 audio coding system

MPEG-1 audio coding operating at 128 kbps per channel achieves a near-CD level of quality. By employing the properties of human perception, MPEG-1 can maintain the sound quality of the original signal despite a compression ratio of about 6:1.
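The 6:1 figure can be checked with simple arithmetic. The sketch below assumes that the reference is 16-bit linear PCM at a 48-kHz sampling rate, which the text does not state explicitly; the function name is illustrative.

```python
def pcm_bit_rate_kbps(sampling_khz: float, bits: int, channels: int = 1) -> float:
    """Bit rate of linear PCM in kbps: sampling rate x word length x channels."""
    return sampling_khz * bits * channels

pcm_rate = pcm_bit_rate_kbps(48, 16)       # 768 kbps per channel
coded_rate = 128                           # kbps per channel, from the text
ratio = pcm_rate / coded_rate
print(f"{pcm_rate:.0f} kbps -> {coded_rate} kbps, ratio {ratio:.1f}:1")
```

With a 44.1-kHz reference the same calculation gives about 5.5:1, so "about 6:1" holds for either sampling rate.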

(1) Human auditory property: masking
Masking is the effect by which a fainter but distinctly audible signal (B) becomes inaudible when a louder signal (A) occurs simultaneously. Figure 1 shows the masking threshold derived from the masking effect. In the figure, the horizontal axis represents log-frequency and the vertical axis masking level; all signals with a level below the threshold are not audible. Changing the frequency of the louder sound results in masking properties in which the threshold generally shifts to the left or right on the graph. Changing the signal level, on the other hand, results in masking properties in which the threshold generally shifts upward or downward.

(2) Coding scheme
The MPEG-1 audio coding scheme uses frequency-component decomposition and a psychoacoustic model incorporating the masking effect to determine adaptive bit allocation.
This system divides the input audio signal into 32 equally spaced sub-band signals and performs processing for each sub-band. In parallel with this frequency decomposition, a psychoacoustic model computes the global masking threshold at which noise from the frequency components of the audio signal is not audible. Figure 2 shows an example of calculating a global masking threshold for an input audio signal. One procedure for calculating a global masking threshold based on a short-time Fast Fourier Transform (ST-FFT) is given below.
(a) Calculation of frequency components: Transform the audio signal into frequency spectral components by using an ST-FFT.
(b) Calculation of individual masking thresholds: For each spectral component, calculate individual masking thresholds by translating the masking template shown in Figure 1 upward or downward or to the left or right.
(c) Calculation of global masking threshold: Combine the masking thresholds computed for each spectral component and the absolute threshold (the minimum level at which sound is audible) to determine the global masking threshold shown in Figure 2.
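The three steps above can be sketched in a few lines. This is a minimal illustration under several assumptions, not the psychoacoustic model of the standard: the triangular masking template (10 dB offset, 27 and 10 dB per Bark slopes) is a hypothetical stand-in for the curve of Figure 1, the thresholds are combined by adding powers, and the level calibration (full-scale sine at 96 dB) is arbitrary. The Bark-scale and threshold-in-quiet formulas are the widely used Zwicker and Terhardt approximations.

```python
import cmath
import math

FS = 48000        # sampling rate in Hz (assumed)
N = 256           # transform length (assumed; kept small for the naive DFT)

def bark(f_hz):
    """Zwicker's approximation of the critical-band (Bark) scale."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

def absolute_threshold_db(f_hz):
    """Terhardt's approximation of the threshold in quiet."""
    k = f_hz / 1000.0
    return 3.64 * k ** -0.8 - 6.5 * math.exp(-0.6 * (k - 3.3) ** 2) + 1e-3 * k ** 4

def spectrum_db(x):
    """Frequency components: Hann-windowed DFT, DC bin skipped."""
    n = len(x)
    w = [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]
    levels = {}
    for k in range(1, n // 2):
        acc = sum(x[i] * w[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        levels[k] = 96.0 + 20.0 * math.log10(max(abs(acc), 1e-12) / (n / 4))
    return levels

def individual_threshold_db(masker_db, masker_z, z):
    """Individual threshold: hypothetical template shifted to the masker."""
    dz = z - masker_z
    slope = 10.0 if dz >= 0 else 27.0      # dB per Bark, assumed values
    return masker_db - 10.0 - slope * abs(dz)

def global_threshold_db(levels):
    """Global threshold: power sum of all individual thresholds and the
    absolute threshold at each frequency bin."""
    zs = {k: bark(k * FS / N) for k in levels}
    result = {}
    for k in levels:
        power = 10 ** (absolute_threshold_db(k * FS / N) / 10)
        for m, level in levels.items():
            power += 10 ** (individual_threshold_db(level, zs[m], zs[k]) / 10)
        result[k] = 10 * math.log10(power)
    return result

# A full-scale 1500-Hz tone, which falls exactly on bin 8.
tone = [math.sin(2 * math.pi * 1500 * i / FS) for i in range(N)]
threshold = global_threshold_db(spectrum_db(tone))
print(round(threshold[8], 1), round(threshold[40], 1))
```

Near the tone the threshold is raised far above the threshold in quiet, while several critical bands away it falls back to roughly the absolute threshold.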

Figure 1: Masking
Figure 2: Audio signal and masking threshold

Noise and other audio signals below this global masking threshold are made inaudible by the audio signal itself. Since sound below the global masking threshold cannot be perceived, the decomposed components below this threshold need not be coded or transmitted. In short, a significantly higher compression ratio can be obtained with a perceptual audio coding method.
MPEG-1 audio coding features three layers, called Layer I, Layer II, and Layer III, that make tradeoffs between complexity and coding efficiency. The following describes the Layer II coding scheme. Figure 3 shows a block diagram of an MPEG-1 Layer II audio coding system. In this system, a 24-ms regular interval is used as one frame in the case of a 48-kHz sampling rate.
(a) Decomposition of the audio signal: Decompose the input audio signal into 32 equally spaced sub-band signals by using an analysis filter bank. One frame consists of 36 samples per sub-band.
(b) Calculation of scale factors: For each sub-band, calculate a scale factor from the maximum absolute value of each group of 12 consecutive sub-band samples. One frame therefore carries three scale factors per sub-band.
(c) Dynamic bit allocation: Using the global masking threshold computed as in Figure 2, allocate optimal quantization bits for each sub-band so as to make quantization noise inaudible.
(d) Quantization and coding: Requantize the audio frequency components according to the allocated quantization bits.
(e) Multiplexing: Assemble the bit stream, which consists of quantized and coded spectrum components, scale factors, and bit allocation data.
The decoder reverses the encoder procedure by using a synthesis filter bank and inverse quantizing to reconstruct the broadband audio signal.
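The scale-factor and bit-allocation steps of the encoder can be illustrated with the sketch below. It is a simplification under stated assumptions: the scale factor is taken as the raw block peak instead of an entry from the standard's scale-factor table, each added bit is assumed to lower quantization noise by about 6.02 dB, and the greedy loop and all names are illustrative, not the normative procedure.

```python
SUBBANDS = 32
SAMPLES_PER_SUBBAND = 36
FRAME_SAMPLES = SUBBANDS * SAMPLES_PER_SUBBAND      # 1152 samples
FRAME_MS = 1000.0 * FRAME_SAMPLES / 48000           # 24 ms at 48 kHz

def scale_factors(subband_samples):
    """Peak magnitude of each group of 12 samples (36 samples give 3 values)."""
    return [max(abs(s) for s in subband_samples[i:i + 12])
            for i in range(0, len(subband_samples), 12)]

def allocate_bits(smr_db, bit_pool, samples_per_band=SAMPLES_PER_SUBBAND, max_bits=15):
    """Greedy allocation: repeatedly give one more bit per sample to the
    sub-band whose quantization noise is furthest above its masking threshold."""
    bits = [0] * len(smr_db)
    while True:
        # Noise-to-mask ratio: signal-to-mask ratio minus ~6.02 dB per bit.
        nmr = [smr - 6.02 * b for smr, b in zip(smr_db, bits)]
        candidates = [i for i, b in enumerate(bits) if b < max_bits and nmr[i] > 0]
        if not candidates or bit_pool < samples_per_band:
            return bits
        worst = max(candidates, key=lambda i: nmr[i])
        bits[worst] += 1
        bit_pool -= samples_per_band

smr = [20.0, 5.0, -3.0, 12.0]       # hypothetical signal-to-mask ratios in dB
print(FRAME_SAMPLES, FRAME_MS)
print(allocate_bits(smr, bit_pool=1000))
print(allocate_bits(smr, bit_pool=100))
```

With a generous bit pool every sub-band receives just enough bits to push its noise below the masking threshold, and a sub-band that is already masked receives none. With a small pool the bits go to the sub-bands where noise would be most audible.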
Subjective listening tests have shown that the sound quality of an MPEG-1 audio coding system operating at 128-kbps/channel is comparable to that of the original signal, even for critical audio materials.

Figure 3: MPEG-1 Layer II audio coding system

4. MPEG-1 stereo coding

A two-channel stereo signal can play back audio with presence. That is, multiple sound sources can be localized on the horizontal plane by playing one signal through the left speaker and another through the right speaker.
By using the human auditory property of 'sound-image' localization perception, a two-channel stereo signal can be coded more efficiently than a dual mono signal.

(1) Human auditory property: perception of 'sound-image' localization
Figure 4: Perception of sound arrival direction and frequency range of direction cues
Humans perceive the direction of a 'sound image' based on the difference in sound intensity and sound arrival time at the left and right ears. A 'sound image' is sensed to be in the direction of the louder sound and in the direction of the sound that arrives earlier. For example, if sound intensity and arrival time are identical at both ears, a sound will be perceived as coming from straight ahead.
Figure 4 shows the frequency range in which differences in sound intensity and in arrival time affect sound-image localization. In the high frequency range, differences in sound-pressure level and differences in sound energy envelope time between the ears affect localization of the sound image.

(2) MPEG-1 intensity stereo audio coding
Intensity stereo coding is a simplified approximation of directional transform coding based on the perception of sound-image localization. In this scheme, the two channels are combined so that only one channel of sub-band samples has to be transmitted. Direction information, moreover, is transmitted via coding of independent scale factor values for the left and right channels. As a result, only the energy envelope is transmitted for both channels. In short, intensity stereo coding can be performed without affecting localization of the sound image. This coding tool is specified as an option in MPEG-1 audio coding.
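A minimal sketch of the intensity stereo idea for one sub-band block is given below. The combining rule (average of the two channels, normalized to unit peak) and the function names are assumptions made for illustration; the standard defines its own combination and scale-factor coding.

```python
def intensity_encode(left, right):
    """Combine two channels into one normalized waveform plus a scale factor
    for each channel."""
    combined = [(l + r) / 2 for l, r in zip(left, right)]
    peak = max(abs(c) for c in combined) or 1.0
    carrier = [c / peak for c in combined]          # shared waveform, peak 1.0
    sf_left = max(abs(l) for l in left)
    sf_right = max(abs(r) for r in right)
    return carrier, sf_left, sf_right

def intensity_decode(carrier, sf_left, sf_right):
    """Rebuild both channels from the shared waveform and the scale factors."""
    return [c * sf_left for c in carrier], [c * sf_right for c in carrier]

# Same waveform in both channels, with the right channel 6 dB lower.
left = [0.8, -0.4, 0.2]
right = [0.4, -0.2, 0.1]
decoded_left, decoded_right = intensity_decode(*intensity_encode(left, right))
print(decoded_left, decoded_right)
```

When the two channels differ only in level, as in this example, the reconstruction is exact. In general only the level relationship between the channels is preserved, which is the cue that dominates localization at high frequencies.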

5. MPEG-2 AAC audio coding

In Japan, MPEG-2 Advanced Audio Coding (AAC) is applied to BS digital broadcasting. It is also the standard for audio coding in Japanese digital broadcasting services, including digital terrestrial television broadcasting, 2.6-GHz band digital satellite audio broadcasting, digital terrestrial audio broadcasting, and wideband CS digital broadcasting.
MPEG-2 AAC supports various audio configurations, from mono to multi-channel stereo audio services. The AAC scheme has been designed to deliver audio quality equivalent to that of the MPEG-1 audio coding scheme at a lower bit rate.
The AAC system has three profiles: Main profile, Low Complexity (LC) profile, and Scalable Sampling Rate (SSR) profile. The Main profile enables encoding of the highest sound quality at the same bit rate as MPEG-1. The LC profile can be decoded with lower complexity than the Main profile, although at slightly degraded audio quality. The SSR profile allows decoding according to several audio bandwidths. Of these, the LC profile is used in Japan's digital broadcasting.
Advanced Audio Coding can achieve near-CD quality at half the bit rate of MPEG-2 BC (see section 6). For example, a multi-channel stereo signal can be given a near-CD quality encoding at 320 kbps, while a 2-channel stereo-audio signal can be coded at near-CD quality at 128 kbps/2ch using the Main profile and at 144 kbps/2ch using the LC and SSR profiles.
In contrast to MPEG-1 and MPEG-2 BC coding, which use sub-band coding, AAC uses transform coding based on a Modified Discrete Cosine Transform (MDCT) and compresses the transformed components by using human auditory properties like masking. A block diagram of an AAC system is shown in Figure 5.
In AAC, the MDCT transforms 2048 audio samples into 1024 MDCT frequency coefficients. Between adjacent blocks, a 50% window overlap is used to eliminate distortion between blocks, as shown in Figure 6-(1). Also, for sudden transient audio signals, eight short blocks, each consisting of 256 samples, are used to suppress so-called "pre-echo" distortion (Figure 6-(2)).
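How a 50% overlap removes distortion between blocks can be seen in a small numerical experiment. The sketch below uses a generic MDCT with a sine window and a block of 16 samples giving 8 coefficients, a toy size chosen for speed in place of the 2048-sample blocks of AAC; the sine window and the 2/N scaling of the inverse transform are assumptions of this sketch, and the function names are illustrative.

```python
import math

def sine_window(two_n):
    """Sine window, which satisfies w[n]**2 + w[n + N]**2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / two_n) for n in range(two_n)]

def mdct(block):
    """Window 2N samples and transform them into N coefficients."""
    two_n = len(block)
    n_half = two_n // 2
    w = sine_window(two_n)
    n0 = 0.5 + n_half / 2
    return [sum(block[n] * w[n] * math.cos(math.pi / n_half * (n + n0) * (k + 0.5))
                for n in range(two_n))
            for k in range(n_half)]

def imdct(coeffs):
    """Transform N coefficients back into 2N windowed, time-aliased samples."""
    n_half = len(coeffs)
    two_n = 2 * n_half
    w = sine_window(two_n)
    n0 = 0.5 + n_half / 2
    return [w[n] * (2.0 / n_half) *
            sum(coeffs[k] * math.cos(math.pi / n_half * (n + n0) * (k + 0.5))
                for k in range(n_half))
            for n in range(two_n)]

def analyze_synthesize(signal, n_half):
    """Process blocks of 2N samples with a hop of N and overlap-add the output."""
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * n_half + 1, n_half):
        for n, v in enumerate(imdct(mdct(signal[start:start + 2 * n_half]))):
            out[start + n] += v
    return out

signal = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(32)]
reconstructed = analyze_synthesize(signal, n_half=8)
error = max(abs(reconstructed[n] - signal[n]) for n in range(8, 24))
print(error)
```

Each block on its own contains time-domain aliasing, but the aliasing terms of adjacent blocks cancel in the overlap-add. Samples covered by two blocks (indices 8 to 23 here) are therefore recovered to within rounding error even though each block produces only half as many coefficients as it has input samples.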
These MDCT coefficients are quantized based on the number of quantization bits necessary to make quantization noise fall under the global masking threshold and are then processed by Huffman coding.
The Main profile, moreover, includes a tool for predicting MDCT coefficients. The prediction signal is used as needed to decrease redundancy in audio having little spectral variation.

Figure 5: MPEG-2 AAC audio coding

Figure 6: Window function of AAC coding

6. MPEG-2 BC and AC-3 audio coding

(1) Multi-channel audio systems
Figure 7: Loudspeaker arrangement for multi-channel stereo audio
Multi-channel audio systems provide improved presence and stereophonic sound compared with conventional 2-channel stereo systems. They complement the high-quality video signals of high-definition television.
The basic loudspeaker arrangement of a multi-channel audio system is shown in Figure 7. In addition to the left (L) and right (R) channels, there are a center channel (C) and two left-and-right surround channels (Ls and Rs). The center channel gives a stable location to the sound images and the left-and-right surround channels enhance presence. These features hold true even if listeners are seated off-center. This channel configuration is called "3/2 stereo" and is recommended in ITU-R Rec. BS.775, which includes an option for a low-frequency enhancement signal using a subwoofer speaker, giving the 5.1-channel configuration of cinema sound. Multi-channel audio signals can be efficiently coded using schemes like MPEG-2 AAC, MPEG-2 BC, and AC-3.

(2) MPEG-2 BC audio coding
Japan has adopted MPEG-2 BC as the audio coding system for its current CS digital television broadcasting systems. MPEG-2 BC is an extension of MPEG-1 coding and features backward/forward compatibility with MPEG-1. "Backward compatible" means that an MPEG-1 two-channel decoder can decode an MPEG-2 BC multi-channel encoded stream, and "forward compatible" means that an MPEG-2 BC multi-channel decoder can decode an MPEG-1 two-channel encoded stream.
A block diagram of an MPEG-2 BC audio coding system is shown in Figure 8. To satisfy the requirement of backward compatibility, the system first down mixes the original multi-channel signals to yield a two-channel stereo signal and then encodes the matrixed signals with an MPEG-1 encoder. While the additional channels for multi-channel audio are encoded by a similar scheme, the following actions make encoding more efficient:
(a) Perform dynamic transmission channel switching in order to provide greater orthogonality between the down-mixed signals and the additional signals.
(b) Detect signals that do not contribute to the localization of sound images and transmit these components as a single channel.
(c) Reduce inter-channel redundancy by adaptive multi-channel prediction.
(d) Transmit the high-frequency components of the center channel in the left-and-right channels and constitute a phantom source at the location of the center loudspeaker.
The two-channel audio stream is transmitted in the MPEG-1 main data field and the additional multi-channel audio stream is transmitted in the auxiliary data field.
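The backward-compatible down mix can be sketched as a simple matrix operation. The coefficient 1/sqrt(2) for the center and surround channels is a commonly used down-mix value and is assumed here, the overall normalization factor is omitted, and the choice of C, Ls, and Rs as the additional transmitted channels is only one of the possible assignments; the function names are illustrative.

```python
import math

A = 1 / math.sqrt(2)      # down-mix coefficient for C, Ls, Rs (assumed value)

def downmix(l, r, c, ls, rs):
    """Encoder side: matrix five channels into a compatible stereo pair."""
    lo = l + A * c + A * ls
    ro = r + A * c + A * rs
    return lo, ro

def dematrix(lo, ro, c, ls, rs):
    """Multi-channel decoder side: recover L and R from the stereo pair and
    the three additionally transmitted channels."""
    return lo - A * c - A * ls, ro - A * c - A * rs

# One sample per channel: L, R, C, Ls, Rs.
l, r, c, ls, rs = 0.5, -0.25, 0.3, 0.1, -0.2
lo, ro = downmix(l, r, c, ls, rs)          # what an MPEG-1 decoder plays
l2, r2 = dematrix(lo, ro, c, ls, rs)       # what a multi-channel decoder recovers
print(lo, ro, l2, r2)
```

An MPEG-1 decoder reads only the stereo pair and ignores the additional channels, while a multi-channel decoder uses both parts of the stream to restore all five channels.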

Figure 8: MPEG-2 BC audio coding

(3) AC-3 audio coding
The United States' digital terrestrial television (DTV) broadcasting system uses AC-3 audio coding. The AC-3 audio coding system encodes a multi-channel stereo signal at bit rates from 320 to 384 kbps. A block diagram of an AC-3 audio encoder is shown in Figure 9.
This coding system employs transform coding using MDCT, and compresses the transformed audio components using human auditory properties like the masking phenomenon.
The block size of the transform is 512 samples, and the 512-point MDCT outputs 256 real-valued spectral coefficients. This transformation is performed with the window function overlapping by 50% between adjacent blocks, as shown in Figure 10. In addition, the system switches to a transform block length of 256 samples (half the normal size) for transient audio signals.
The exponent parts of the MDCT coefficients represent the spectral envelope of the audio signal. The exponent parts of adjacent blocks are coded by differential PCM coding because the spectral envelope changes relatively little between adjacent blocks. The mantissa parts of the MDCT coefficients, on the other hand, are coded based on allocated quantization bits using auditory properties such as masking. If the total number of bits needed to code the individual audio channels exceeds the number of available transmission bits, a coupling channel is used, which is equivalent to multi-channel intensity stereo coding. In this case, only one channel with added intensity information is transmitted for the coupled MDCT coefficients.
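The split of a coefficient into an exponent and a mantissa, and the differential coding of exponents, can be sketched as below. This is a generic floating-point illustration, not the AC-3 bit-stream format: the exponent limit of 24, the mantissa rounding rule, and the function names are assumptions, and the delta coder works on any sequence of exponents, with the choice of which neighbours are differenced left open.

```python
def split(coeff, max_exp=24):
    """Return (exponent, mantissa) such that coeff == mantissa * 2**-exponent
    and 0.5 <= |mantissa| < 1 whenever the exponent limit allows it."""
    if coeff == 0.0:
        return max_exp, 0.0
    exponent, mantissa = 0, coeff
    while exponent < max_exp and abs(mantissa) < 0.5:
        mantissa *= 2.0
        exponent += 1
    return exponent, mantissa

def quantize_mantissa(mantissa, bits):
    """Round the mantissa to a grid of 2**-(bits - 1)."""
    step = 2.0 ** -(bits - 1)
    return round(mantissa / step) * step

def delta_encode(exponents):
    """Differential coding: first value, then successive differences."""
    return [exponents[0]] + [b - a for a, b in zip(exponents, exponents[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

coeffs = [0.3, 0.01, 0.012, 0.2]           # hypothetical MDCT coefficients
exps = [split(c)[0] for c in coeffs]
print(exps, delta_encode(exps))
```

The exponents trace a coarse spectral envelope and vary slowly, so their differences are small numbers that are cheap to code. The mantissas carry the fine detail and receive however many bits the bit allocation grants them.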
An AC-3 serial coded audio bit stream is made up of synchronization frames containing six coded Audio Blocks (AB), each consisting of 256 samples, as shown in Figure 10.
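The frame structure implies a fixed frame duration and, for a given bit rate, a fixed frame size. The short calculation below assumes a 48-kHz sampling rate and the bit rates quoted in this section; the function names are illustrative.

```python
BLOCKS_PER_FRAME = 6
SAMPLES_PER_BLOCK = 256
SAMPLES_PER_FRAME = BLOCKS_PER_FRAME * SAMPLES_PER_BLOCK    # 1536 samples

def frame_duration_ms(samples_per_frame, sampling_hz):
    """Time span of audio represented by one synchronization frame."""
    return 1000.0 * samples_per_frame / sampling_hz

def frame_size_bytes(bit_rate_bps, samples_per_frame, sampling_hz):
    """Number of bytes available to one frame at a constant bit rate."""
    return bit_rate_bps * samples_per_frame // (8 * sampling_hz)

print(frame_duration_ms(SAMPLES_PER_FRAME, 48000))
for rate in (320_000, 384_000):
    print(rate, frame_size_bytes(rate, SAMPLES_PER_FRAME, 48000))
```

Under these assumptions each frame spans 32 ms of audio for all channels together.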

Figure 9: AC-3 audio coding

Figure 10: Window function and frame structure of AC-3

7. Standards for audio coding in broadcasting

International standards related to broadcasting are recommended by ITU-R.

(1) Audio studio standard
ITU-R Recommendation BS.646-1 specifies 16 quantization bits and a sampling frequency of 48 kHz for a broadcasting studio's audio signal.

(2) Low-bit-rate audio coding standards for broadcasting
ITU-R Recommendation BS.1115 gives a low-bit-rate audio coding standard for 2-channel stereo signals. This recommendation specifies the audio coding systems and operational bit rate for four broadcasting applications: contribution, distribution, emission, and commentary. Coding standards and bit rates are shown in Table 2.
For emission, this recommendation prescribes MPEG-1 Layer II at a bit rate of 128 kbps/ch. For contribution and distribution, it specifies MPEG-1 Layer II at a bit rate of at least 180 kbps/ch, and for commentary, MPEG-1 Layer III at a bit rate of at least 60 kbps/ch.
MPEG has standardized MPEG-2 BC and MPEG-2 AAC coding for multi-channel audio, and ITU-R BS.1196 has specified MPEG-2 BC and AC-3 as audio coding systems for terrestrial digital television broadcasting. The audio coding systems adopted by various regions are shown in Table 3.

Table 2: 2-channel audio coding standards for broadcasting applications
Parameter | Emission | Distribution | Contribution | Commentary
Sampling frequency (kHz) | 48 (32) | 32, 48 | 48 | 32 (48)
No. of input/output quantization bits (bit) | 16 | 16 | 18 | 14
Compression system | MPEG-1 Layer II | MPEG-1 Layer II | MPEG-1 Layer II | MPEG-1 Layer III
Audio bit rate per channel (kb/s) | 128 | 180* | 180* | 60*
*: minimum

Table 3: Audio coding used in digital television broadcasting and DVDs
System | Japan | United States | Europe
Digital television broadcasting | CS: MPEG-2 BC, MPEG-2 AAC; BS: MPEG-2 AAC | AC-3 (DTV system) | MPEG-2 BC (DVB system)
DVD | AC-3 | AC-3 | MPEG-2 BC

References
* ISO/IEC 11172-3: Information Technology - Coding of Moving Pictures and Associated Audio for digital storage media up to about 1.5Mbit/s, Audio (1993)
* ISO/IEC 13818-7: Information Technology - Generic Coding of Moving Pictures and Associated Audio, Advanced Audio Coding (1997)
* D.G.Kirby and K.Watanabe : Formal subjective testing of MPEG-2 NBC multichannel audio coding algorithm, AES 102nd Convention, preprint 4418 (1997.3)
* M.Bosi et al : ISO/IEC MPEG-2 Advanced Audio Coding, JAES, 45, 10 (1997.10)
* ITU-R Rec. BS.775-1 : Multichannel Stereophonic Sound System with and without Accompanying Picture (1994)
* ISO/IEC 13818-3 : Information Technology - Generic Coding of Moving Pictures and Associated Audio, Audio (1994)
* M.F.Davis : The AC-3 Multichannel Coder, 95th AES Convention preprint 3774 (1995)

(Mr. Kaoru Watanabe)





Copyright 2002 NHK (Japan Broadcasting Corporation) All rights reserved. Unauthorized copy of the pages is prohibited.
