| Technologies and Services on Digital Broadcasting (3) |
Source Coding of Audio |
| "Technologies and Services on Digital Broadcasting" (in Japanese, ISBN4-339-01162-2) is published by CORONA publishing co., Ltd. Copying, reprinting, translation, or retransmission of this document is prohibited without the permission of the authors and the publishers, CORONA publishing co., Ltd. and NHK.
|
|
| This issue features the second of a new series of articles on state-of-the-art digital broadcasting technology. Researchers at NHK have just published a new book titled "Technologies and Services on Digital Broadcasting", whose chief editor is STRL's Director General Dr. Osamu Yamada. Although the book is written in Japanese, it has lots of useful information on digital broadcasting technology, from basic techniques to application systems. Over the next several issues, the editors will serialize the Japanese book as a series of English articles. |
|
| 1. Digitization of audio signals |
|
The audio signal of Broadcasting Satellite (BS) analog television in Japan is broadcast digitally using a system that includes a high-quality audio transmission mode, called B-mode, with better-than-CD (Compact Disc) quality. Many countries have comparable low-bit-rate audio coding systems. Low-bit-rate audio coding is now in use in Communication Satellite (CS) digital television and BS digital television broadcasting in Japan and in the digital television and digital audio broadcasting of Europe and the United States.
The parameters that affect audio quality are bandwidth, signal-to-noise ratio, and dynamic range. Each of these relates to sampling frequency and number of quantization bits. In a high-quality audio system, an audio bandwidth of 15 to 20 kHz is desirable, and for this reason, sampling frequencies of 32, 44.1, and 48 kHz are used. The number of quantization bits is 16 or 14, which covers a dynamic range of about 96 or 84 dB, respectively (roughly 6 dB per bit). In low-bit-rate audio coding, moreover, sound quality is greatly affected by the audio coding scheme and the transmission bit rate.
| 2. Low-bit-rate audio coding |
|
Linear PCM digital audio coding requires a high transmission bit rate. Thus, there is a need for high-quality low-bit-rate audio coding that can make efficient use of limited resources. Low-bit-rate audio coding methods generally fall under two categories.
1) Coding that takes human auditory characteristics into account: Coding noise, if generated, will be either inaudible or, at most, barely perceptible.
2) Elimination of redundancies in audio-signal data by using predictive coding of waveforms and statistical techniques: This is called "lossless coding" if all of the original signal can be regenerated from the received data.
Compared with method 2), coding method 1) gives a substantially higher level of bit compression. Low-bit-rate audio coding methods can also be divided into time-domain coding and frequency-domain coding types. While time-domain coding processes audio-signal waveforms in the time domain, frequency-domain coding decomposes the audio-signal waveform into frequency components that then become the object of processing. Table 1 shows current audio coding techniques and their applications.
| Table 1: Examples of audio coding technologies and applications |

| Coding technologies | | Applications |
|---|---|---|
| Time-domain coding | Instantaneous companding coding | Transmission line for FM broadcasting |
| | Near-instantaneous companding coding | A-mode audio for BS television |
| | Predictive coding | G.722 |
| Frequency-domain coding | Sub-band coding | MPEG-1 Layer I and Layer II audio coding |
| | Transform coding | AAC audio coding, AC-3 audio coding |

Instantaneous companding coding, near-instantaneous companding coding, and predictive coding are time-domain coding methods. Near-instantaneous companding coding is used in the A-mode audio coding system of current BS analog television, while near-instantaneous differential companding coding, which combines near-instantaneous companding and differential PCM coding, is the audio coding method of the Hi-Vision MUSE system.
Two types of frequency-domain coding, sub-band coding and transform coding, are used for coding audio signals. A coding scheme in the frequency domain enables human auditory properties to be incorporated in a very efficient way. This type of audio coding requires a high computation rate, but advances in signal-processing devices such as digital signal processors (DSPs) are making real-time processing possible. Current frequency-domain audio coding systems include the ATRAC system for MiniDiscs (MDs) and the MPEG-2 AAC and MPEG-2 BC audio coding systems used in BS digital broadcasting and CS digital broadcasting.
| 3. MPEG-1 audio coding system |
MPEG-1 audio coding operating at 128 kbps per channel achieves a near-CD level of quality. By exploiting the properties of human perception, MPEG-1 can maintain the sound quality of the original signal despite a compression ratio of about 6:1.
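The ratio of about 6:1 can be checked with a little arithmetic. The sketch below assumes a 48-kHz, 16-bit linear PCM source for one channel (the studio format mentioned later in this article); the function names are illustrative only.

```python
def pcm_bit_rate_kbps(sampling_rate_hz, bits_per_sample, channels=1):
    """Raw linear-PCM bit rate in kbit/s, ignoring any framing overhead."""
    return sampling_rate_hz * bits_per_sample * channels / 1000


def compression_ratio(pcm_kbps, coded_kbps):
    """How many times smaller the coded stream is than the PCM source."""
    return pcm_kbps / coded_kbps


pcm = pcm_bit_rate_kbps(48_000, 16)        # one channel: 768.0 kbit/s
print(pcm, compression_ratio(pcm, 128))    # 768.0 6.0
```

With a 44.1-kHz source the same calculation gives about 5.5:1, so the quoted figure is best read as approximate.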
(1) Human auditory property: masking
Masking is the effect by which a fainter but distinctly audible signal (B) becomes inaudible when a louder signal (A) occurs simultaneously. Figure 1 shows the masking threshold derived from the masking effect. In the figure, the horizontal axis represents log-frequency and the vertical axis masking level; all signals with a level below the threshold are not audible. Changing the frequency of the louder sound results in masking properties in which the threshold generally shifts to the left or right on the graph. Changing the signal level, on the other hand, results in masking properties in which the threshold generally shifts upward or downward.
(2) Coding scheme
The MPEG-1 audio coding scheme uses frequency-component decomposition and a psychoacoustic model incorporating the masking effect to determine adaptive bit allocation. This system divides the input audio signal into 32 equally spaced sub-band signals and performs processing for each sub-band. In parallel with this frequency decomposition, a psychoacoustic model computes the global masking threshold, below which noise is masked by the frequency components of the audio signal and is therefore not audible. Figure 2 shows an example of calculating a global masking threshold for an input audio signal. One procedure for calculating a global masking threshold based on a short-time Fast Fourier Transform (ST-FFT) is given below.
1) Calculation of frequency components: Transform the audio signal into frequency spectral components by using an ST-FFT.
2) Calculation of individual masking thresholds: For each spectral component, calculate an individual masking threshold by translating the masking template shown in Figure 1 upward or downward or to the left or right.
3) Calculation of global masking threshold: Combine the masking thresholds computed for each spectral component and the absolute threshold (the minimum level at which sound is audible) to determine the global masking threshold shown in Figure 2.
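The three steps above can be sketched in a few lines of Python. This is a toy illustration, not the psychoacoustic model of the standard: the template shape (offset and slopes in dB per octave), the calibration of a full-scale sine to 96 dB, the use of every spectral bin as a masker, and the power summation of thresholds are all assumptions made here for illustration. The absolute threshold uses Terhardt's well-known approximation of the threshold in quiet.

```python
import cmath
import math


def spectrum_db(frame, full_scale_db=96.0):
    """Step 1: Hann-windowed DFT levels in dB (a full-scale sine maps to full_scale_db)."""
    n = len(frame)
    windowed = [x * 0.5 * (1 - math.cos(2 * math.pi * i / n)) for i, x in enumerate(frame)]
    levels = []
    for k in range(1, n // 2):
        acc = sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        amplitude = abs(acc) * 4 / n  # undo Hann gain (0.5) and one-sided DFT gain (n/2)
        levels.append(full_scale_db + 20 * math.log10(max(amplitude, 1e-10)))
    return levels


def absolute_threshold_db(freq_hz):
    """Threshold in quiet (Terhardt's approximation, dB SPL)."""
    f = freq_hz / 1000.0
    return 3.64 * f ** -0.8 - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2) + 1e-3 * f ** 4


def individual_threshold_db(freq_hz, masker_hz, masker_db,
                            offset_db=14.0, lower_slope=27.0, upper_slope=12.0):
    """Step 2: template shifted left/right (masker_hz) and up/down (masker_db).

    offset_db and the slopes (dB per octave) are illustrative values only.
    """
    octaves = math.log2(freq_hz / masker_hz)
    slope = upper_slope if octaves >= 0 else lower_slope
    return masker_db - offset_db - slope * abs(octaves)


def global_masking_threshold(frame, sample_rate):
    """Step 3: power-sum all individual thresholds and the absolute threshold."""
    n = len(frame)
    levels = spectrum_db(frame)
    freqs = [k * sample_rate / n for k in range(1, n // 2)]
    thresholds = []
    for f in freqs:
        power = 10 ** (absolute_threshold_db(f) / 10)
        for masker_hz, masker_db in zip(freqs, levels):
            power += 10 ** (individual_threshold_db(f, masker_hz, masker_db) / 10)
        thresholds.append(10 * math.log10(power))
    return freqs, levels, thresholds


# A 3-kHz tone at half of full scale, 256 samples at 48 kHz.
frame = [0.5 * math.sin(2 * math.pi * 3000 * i / 48000) for i in range(256)]
freqs, levels, thresholds = global_masking_threshold(frame, 48000)
```

Near the tone the computed threshold rises far above the threshold in quiet, and far from it the threshold falls back toward the absolute threshold, which is the qualitative behaviour shown in Figure 2.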
| Figure 1: Masking |
| Figure 2: Audio signal and masking threshold |
Noise and other audio signals below this global masking threshold are made inaudible by the audio signal itself. Since perception of sound below the global masking threshold is not possible, the decomposed components below this threshold need not be coded or transmitted. In short, a significantly higher compression rate can be obtained with a perceptual audio coding method.
MPEG-1 audio coding features three layers, called Layer I, Layer II, and Layer III, that make tradeoffs between complexity and coding efficiency. The following describes the Layer II coding scheme. Figure 3 shows a block diagram of an MPEG-1 Layer II audio coding system. In this system, one frame covers a fixed interval of 24 ms at a 48-kHz sampling rate.
1) Decomposition of audio signal: Decompose the input audio signal into 32 equally spaced sub-band signals by using an analysis filter bank. One frame consists of 36 samples per sub-band.
2) Calculation of scale factor: For each sub-band, calculate the scale factor from the maximum absolute value of 12 consecutive sub-band samples. Each sub-band therefore has three scale factors per frame.
3) Dynamic bit allocation: Using a global masking threshold computed as in Figure 2, allocate optimal quantization bits for each sub-band so as to make quantization noise inaudible.
4) Quantization and coding: Requantize the audio frequency components according to the allocated quantization bits.
5) Multiplexing: Assemble the bit stream, which consists of quantized and coded spectrum components, scale factors, and bit allocation data.
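The frame arithmetic, the scale-factor step, and the idea behind dynamic bit allocation can be sketched as follows. This is a simplified illustration under stated assumptions: real Layer II quantizes scale factors and bit allocations to table entries, whereas this sketch uses raw peak values and a plain greedy loop driven by an assumed signal-to-mask ratio (SMR) per sub-band, with each extra bit taken to lower quantization noise by about 6.02 dB.

```python
SUBBANDS = 32
SAMPLES_PER_SUBBAND = 36            # per frame in Layer II
SAMPLE_RATE_HZ = 48_000

frame_samples = SUBBANDS * SAMPLES_PER_SUBBAND        # 1152 input samples
frame_ms = 1000 * frame_samples / SAMPLE_RATE_HZ      # 24.0 ms


def scale_factors(subband_samples):
    """Peak magnitude of each group of 12 consecutive samples (3 groups per frame)."""
    return [max(abs(s) for s in subband_samples[i:i + 12])
            for i in range(0, len(subband_samples), 12)]


def allocate_bits(smr_db, bit_budget, max_bits=15):
    """Greedy allocation: keep giving one more bit per sample to the sub-band
    whose quantization noise is furthest above its masking threshold."""
    bits = [0] * len(smr_db)
    while bit_budget >= SAMPLES_PER_SUBBAND:
        # Noise-to-mask ratio after the bits granted so far.
        nmr = [smr - 6.02 * b if b < max_bits else float("-inf")
               for smr, b in zip(smr_db, bits)]
        worst = max(range(len(nmr)), key=nmr.__getitem__)
        if nmr[worst] <= 0:          # noise is already masked everywhere
            break
        bits[worst] += 1
        bit_budget -= SAMPLES_PER_SUBBAND
    return bits


# Four hypothetical sub-bands; the third is already below the masking threshold.
print(allocate_bits([30, 12, -5, 20], bit_budget=36 * 20))   # [5, 2, 0, 4]
```

A sub-band whose signal lies below the masking threshold receives no bits at all, which is where most of the saving comes from.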
The decoder reverses the encoder procedure by using inverse quantization and a synthesis filter bank to reconstruct the broadband audio signal. Subjective listening tests have shown that the sound quality of an MPEG-1 audio coding system operating at 128 kbps/channel is comparable to that of the original signal, even for critical audio materials.
| Figure 3: MPEG-1 Layer II audio coding system |
A two-channel stereo signal can play back audio with presence. That is, multiple sound sources can be localized on the horizontal plane by playback of one signal on the left speaker and another signal on the right speaker. By using the human auditory property of 'sound-image' localization perception, a two-channel stereo signal can be coded more efficiently than a dual mono signal.
(1) Human auditory property: perception of 'sound-image' localization
| Figure 4: Perception of sound arrival direction and frequency range of direction cues |
Humans perceive the direction of a 'sound image' based on the difference in sound intensity and sound arrival time at the left and right ears. A 'sound image' is sensed to be in the direction of the louder sound and of the sound that arrives first. For example, if sound intensity and arrival time are identical at both ears, a sound will be perceived as coming from straight ahead. Figure 4 shows the frequency range in which differences in sound intensity and in arrival time affect sound-image localization. In the high frequency range, differences in sound-pressure level and differences in the timing of the sound energy envelope between the ears affect localization of the sound image.
(2) MPEG-1 intensity stereo audio coding
Intensity stereo coding is a simplified approximation of directional transform coding based on the perception of sound-image localization. In this scheme, the two channels are combined so that only one channel of sub-band samples has to be transmitted. Direction information, moreover, is transmitted by coding independent scale factor values for the left and right channels. As a result, only the energy envelope is transmitted separately for each channel. In short, intensity stereo coding can be performed without affecting localization of the sound image. This coding tool is specified as an option in MPEG-1 audio coding.
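The idea can be illustrated for a single sub-band with a short sketch. It assumes, for illustration only, that the shared signal is the plain sum of the two channels normalised to its peak and that each scale factor is the channel's peak magnitude; the function names are hypothetical and no quantization is modelled.

```python
def intensity_encode(left, right):
    """Combine one sub-band of L and R into shared samples plus two scale factors."""
    combined = [l + r for l, r in zip(left, right)]
    peak = max(abs(s) for s in combined) or 1.0      # avoid dividing by zero
    shared = [s / peak for s in combined]            # one set of samples to transmit
    scale_left = max(abs(s) for s in left)           # per-channel level (direction cue)
    scale_right = max(abs(s) for s in right)
    return shared, scale_left, scale_right


def intensity_decode(shared, scale_left, scale_right):
    """Both outputs share one waveform and differ only in level."""
    return [scale_left * s for s in shared], [scale_right * s for s in shared]


base = [0.1, -0.5, 1.0, 0.25]
left = [0.8 * x for x in base]       # source panned toward the left
right = [0.2 * x for x in base]
shared, sl, sr = intensity_encode(left, right)
decoded_left, decoded_right = intensity_decode(shared, sl, sr)
```

When the two channels are scaled copies of one waveform, as in this amplitude-panned example, the decoder output matches the input; when they differ in waveform, only their levels are preserved, which is the approximation the scheme relies on.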
| 5. MPEG-2 AAC audio coding |
In Japan, MPEG-2 Advanced Audio Coding (AAC) is applied to BS digital broadcasting. It is also the standard for audio coding in Japanese digital broadcasting services, including digital terrestrial television broadcasting, 2.6-GHz band digital satellite audio broadcasting, digital terrestrial audio broadcasting, and wideband CS digital broadcasting.
MPEG-2 AAC supports various audio configurations, from mono to multi-channel stereo audio services. The AAC scheme has been designed to reduce the bit rate of the MPEG-1 audio coding scheme for equivalent quality audio.
The AAC system has three profiles: Main profile, Low Complexity (LC) profile, and Scalable Sampling Rate (SSR) profile. The Main profile enables encoding of the highest sound quality at the same bit rate as MPEG-1. The LC profile can be decoded with lower complexity than the Main profile, although at slightly degraded audio quality. The SSR profile allows decoding according to several audio bandwidths. Of these, the LC profile is used in Japan's digital broadcasting.
Advanced Audio Coding can achieve near-CD quality at half the bit rate of MPEG-2 BC (see section 6). For example, a multi-channel stereo signal can be given a near-CD quality encoding at 320 kbps, while a 2-channel stereo-audio signal can be coded at near-CD quality at 128 kbps/2ch using the Main profile and at 144 kbps/2ch using the LC and SSR profiles.
In contrast to MPEG-1 and MPEG-2 BC coding, which use sub-band coding, AAC uses a Modified Discrete Cosine Transform (MDCT) and compresses the resulting components by using human auditory properties like masking. A block diagram of an AAC system is shown in Figure 5.
In AAC, the MDCT transforms 2048 audio samples into 1024 MDCT frequency coefficients. Between adjacent blocks, a 50% window overlap is used to eliminate distortion at block boundaries, as shown in Figure 6-(1). Also, for sudden transient audio signals, eight short blocks, each consisting of 256 samples, are used to eliminate so-called "pre-echo" distortion (Figure 6-(2)).
These MDCT coefficients are quantized with the number of quantization bits necessary to make quantization noise fall under the global masking threshold and are then processed by Huffman coding. The Main profile, moreover, includes a tool for predicting MDCT coefficients. The prediction signal is used as needed to decrease redundancy in audio having little spectral variation.
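Why a transform that maps 2048 samples to only 1024 coefficients can still be inverted becomes clearer with a small experiment. The sketch below uses a block of 16 samples (8 coefficients) instead of 2048, a sine window, and a direct O(N²) evaluation; these choices, the scaling convention, and the function names are assumptions made for illustration and are not taken from the AAC specification. Each inverse transform alone contains aliasing, but overlap-adding neighbouring blocks with 50% overlap cancels it.

```python
import math


def sine_window(length):
    """Sine window; satisfies w[n]**2 + w[n + length // 2]**2 == 1."""
    return [math.sin(math.pi * (n + 0.5) / length) for n in range(length)]


def mdct(block, window):
    """2*M windowed samples -> M coefficients."""
    m = len(block) // 2
    return [sum(window[n] * block[n]
                * math.cos(math.pi / m * (n + 0.5 + m / 2) * (k + 0.5))
                for n in range(2 * m))
            for k in range(m)]


def imdct(coeffs, window):
    """M coefficients -> 2*M windowed samples that still contain aliasing."""
    m = len(coeffs)
    return [window[n] * (2.0 / m)
            * sum(coeffs[k] * math.cos(math.pi / m * (n + 0.5 + m / 2) * (k + 0.5))
                  for k in range(m))
            for n in range(2 * m)]


def analyse_and_resynthesise(signal, m):
    """Transform overlapping blocks and overlap-add the inverse transforms."""
    window = sine_window(2 * m)
    tail = (-len(signal)) % m
    padded = [0.0] * m + list(signal) + [0.0] * (tail + m)
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * m + 1, m):    # hop of M = 50% overlap
        coeffs = mdct(padded[start:start + 2 * m], window)
        for n, value in enumerate(imdct(coeffs, window)):
            out[start + n] += value
    return out[m:m + len(signal)]


signal = [math.sin(0.3 * i) + 0.5 * math.cos(1.1 * i) for i in range(64)]
restored = analyse_and_resynthesise(signal, m=8)
max_error = max(abs(a - b) for a, b in zip(signal, restored))   # rounding error only
```

In a real coder the coefficients are quantized between the forward and inverse transforms, so the reconstruction is no longer exact; the overlap then serves to keep the resulting error from appearing as discontinuities at block boundaries.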
| Figure 5: MPEG-2 AAC audio coding |
 |
| Figure 6: Window function of AAC coding |
| 6. MPEG-2 BC and AC-3 audio coding |
(1) Multi-channel audio systems
| Figure 7: Loudspeaker arrangement for multi-channel stereo audio |
Multi-channel audio systems provide improved presence and stereophonic sound compared with conventional 2-channel stereo systems. They complement the high-quality video signals of high-definition television.
The basic loudspeaker arrangement of a multi-channel audio system is shown in Figure 7. In addition to the left (L) and right (R) channels, there are a center channel (C) and left and right surround channels (Ls and Rs). The center channel gives a stable location to the sound images and the surround channels enhance presence. These features hold true even if listeners are seated off-center. This channel configuration is called "3/2 stereo" and is recommended in ITU-R Rec. BS.775, which includes an option for a low-frequency enhancement signal reproduced by a subwoofer, as in the 5.1-channel configuration of cinema sound. Multi-channel audio signals can be efficiently coded using schemes like MPEG-2 AAC, MPEG-2 BC, and AC-3.
(2) MPEG-2 BC audio coding
Japan has adopted MPEG-2 BC as the audio coding system for its current CS digital television broadcasting systems. MPEG-2 BC is an extension of MPEG-1 coding and features backward/forward compatibility with MPEG-1. "Backward compatible" means that an MPEG-1 two-channel decoder can decode an MPEG-2 BC multi-channel encoded stream, and "forward compatible" means that an MPEG-2 BC multi-channel decoder can decode an MPEG-1 two-channel encoded stream.
A block diagram of an MPEG-2 BC audio coding system is shown in Figure 8. To satisfy the requirement of backward compatibility, the system first downmixes the original multi-channel signals to yield a two-channel stereo signal and then encodes the matrixed signals with an MPEG-1 encoder. While the additional channels for multi-channel audio are encoded by a similar scheme, the following actions make encoding more efficient:
1) Perform dynamic transmission channel switching in order to provide greater orthogonality between the downmixed signals and the additional signals.
2) Detect signals that do not contribute to the localization of sound images and transmit these components as a single channel.
3) Reduce inter-channel redundancy by adaptive multi-channel prediction.
4) Transmit the high-frequency components of the center channel in the left and right channels so that they form a phantom source at the location of the center loudspeaker.
The two-channel audio stream is transmitted in the MPEG-1 main data field and the additional multi-channel audio stream is transmitted in the auxiliary data field.
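The downmix step that underlies backward compatibility can be sketched per sample as below. The gain of 1/√2 (about 0.707) for the center and surround channels is an illustrative assumption, as are the function names; the overall attenuation that a practical encoder applies to avoid overload, and the effect of lossy coding on each channel, are not modelled.

```python
import math

G = 1 / math.sqrt(2)     # assumed downmix gain for C, Ls and Rs (about 0.707)


def downmix(l, r, c, ls, rs):
    """3/2 stereo -> compatible two-channel pair carried in the MPEG-1 part."""
    lo = l + G * c + G * ls
    ro = r + G * c + G * rs
    return lo, ro


def dematrix(lo, ro, c, ls, rs):
    """Multi-channel decoder: recover L and R from the compatible pair
    and the additional channels carried alongside it."""
    return lo - G * c - G * ls, ro - G * c - G * rs


lo, ro = downmix(0.5, -0.25, 0.1, 0.2, -0.3)       # what a two-channel decoder plays
l, r = dematrix(lo, ro, 0.1, 0.2, -0.3)            # what a multi-channel decoder rebuilds
```

A two-channel decoder simply plays `lo` and `ro` and ignores the rest, while a multi-channel decoder subtracts the additional channels to get back the original left and right signals.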
 |
| Figure 8: MPEG-2 BC audio coding |
(3) AC-3 audio coding
The United States' digital terrestrial television (DTV) broadcasting system uses AC-3 audio coding. The AC-3 audio coding system encodes a multi-channel stereo signal at bit rates of 320 to 384 kbps. A block diagram of an AC-3 audio encoder is shown in Figure 9.
This coding system employs transform coding using the MDCT and compresses the transformed audio components using human auditory properties like the masking phenomenon. The block size of the transform is 512 samples, and the 512-point MDCT outputs real-valued spectral coefficients. The transform is performed with window functions that overlap by 50% between adjacent blocks, as shown in Figure 10. In addition, the system switches to a transform block length of 256 samples (half the normal size) for transient audio signals.
The exponents of the MDCT coefficients represent the spectral envelope of the audio signal. The exponents of adjacent blocks are coded by differential PCM coding because the spectral envelope changes relatively little between adjacent blocks. The mantissas of the MDCT coefficients, on the other hand, are coded based on quantization bits allocated using auditory properties such as masking. If the total number of bits for coding the individual audio channels would exceed the number of transmission bits available, a coupling channel is used, which is equivalent to multi-channel intensity stereo coding. In this case, only one channel with added intensity information is transmitted for the MDCT coefficients concerned.
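The exponent and mantissa split, and the reason differential coding of exponents pays off, can be illustrated with a short sketch. It assumes coefficients of magnitude below 1, an exponent range of 0 to 24, and a plain first-difference code; which sequence of exponents the real format takes differences along, and how the differences are packed, are not modelled here, and the function names are hypothetical.

```python
import math


def split_coefficient(value, max_exponent=24):
    """Write value as mantissa * 2**(-exponent), with 0.5 <= |mantissa| < 1
    whenever the exponent stays inside the range 0..max_exponent."""
    if value == 0.0:
        return 0.0, max_exponent
    _, exp2 = math.frexp(value)                  # value = m * 2**exp2, 0.5 <= |m| < 1
    exponent = min(max(-exp2, 0), max_exponent)
    return value * 2 ** exponent, exponent


def dpcm_encode(exponents):
    """First exponent as-is, then differences between successive exponents."""
    return [exponents[0]] + [b - a for a, b in zip(exponents, exponents[1:])]


def dpcm_decode(deltas):
    """Rebuild the exponents by accumulating the differences."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out


coefficients = [0.3, 0.2, 0.11, 0.05, 0.04]      # a smoothly decaying envelope
exponents = [split_coefficient(c)[1] for c in coefficients]
deltas = dpcm_encode(exponents)                  # small values when the envelope is smooth
```

Because a smooth envelope yields differences that stay close to zero, they can be represented with far fewer bits than the exponents themselves, leaving more of the bit budget for the mantissas.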
An AC-3 serial coded audio bit stream is made up of synchronization frames containing six coded Audio Blocks (AB), each consisting of 256 samples, as shown in Figure 10.
 |
| Figure 9: AC-3 audio coding |
 |
| Figure 10: Window function and frame structure of AC-3 |
| 7. Standards for audio coding in broadcasting |
International standards related to broadcasting are recommended by ITU-R.
(1) Audio studio standard
ITU-R Recommendation BS.646-1 specifies 16 quantization bits and a sampling frequency of 48 kHz for a broadcasting studio's audio signal.
(2) Low-bit-rate audio coding standards for broadcasting
ITU-R Recommendation BS.1115 gives a low-bit-rate audio coding standard for 2-channel stereo signals. This recommendation specifies the audio coding systems and operational bit rates for four broadcasting applications: contribution, distribution, emission, and commentary. Coding standards and bit rates are shown in Table 2. For emission, this recommendation prescribes MPEG-1 Layer II at a bit rate of 128 kbps/ch. For contribution and distribution, it specifies MPEG-1 Layer II at a bit rate of at least 180 kbps/ch, and for commentary, MPEG-1 Layer III at a bit rate of at least 60 kbps/ch.
MPEG has standardized MPEG-2 BC and MPEG-2 AAC coding for multi-channel audio, and ITU-R Recommendation BS.1196 specifies MPEG-2 BC and AC-3 as audio coding systems for terrestrial digital television broadcasting. The audio coding systems adopted by various regions are shown in Table 3.
| Table 2: 2-channel audio coding standards for broadcasting applications |

| | Emission | Distribution | Contribution | Commentary |
|---|---|---|---|---|
| Sampling frequency (kHz) | 48 (32) | 32, 48 | 48 | 32 (48) |
| No. of input/output quantization bits (bit) | 16 | 16 | 18 | 14 |
| Compression system | MPEG-1 Layer II | MPEG-1 Layer II | MPEG-1 Layer II | MPEG-1 Layer III |
| Audio bit rate per channel (kb/s) | 128 | 180* | 180* | 60* |

*: minimum

| Table 3: Audio coding used in digital television broadcasting and DVDs |

| System / Region | Japan | United States | Europe |
|---|---|---|---|
| Digital television broadcasting | CS: MPEG-2 BC, MPEG-2 AAC; BS: MPEG-2 AAC | AC-3 (DTV system) | MPEG-2 BC (DVB system) |
| DVD | AC-3 | AC-3 | MPEG-2 BC |

References
* ISO/IEC 11172-3: Information Technology - Coding of Moving Pictures and Associated Audio for digital storage media up to about 1.5 Mbit/s, Audio (1993)
* ISO/IEC 13818-7: Information Technology - Generic Coding of Moving Pictures and Associated Audio, Advanced Audio Coding (1997)
* D. G. Kirby and K. Watanabe: Formal subjective testing of MPEG-2 NBC multichannel audio coding algorithm, AES 102nd Convention, preprint 4418 (1997.3)
* M. Bosi et al.: ISO/IEC MPEG-2 Advanced Audio Coding, JAES, 45, 10 (1997.10)
* ITU-R Rec. BS.775-1: Multichannel Stereophonic Sound System with and without Accompanying Picture (1994)
* ISO/IEC 13818-3: Information Technology - Generic Coding of Moving Pictures and Associated Audio, Audio (1994)
* M. F. Davis: The AC-3 Multichannel Coder, 95th AES Convention, preprint 3774 (1995)
(Mr. Kaoru Watanabe) |