Technologies and Services of Digital Broadcasting (2)

Source Coding
-High-efficiency coding of video signals-
"Technologies and Services of Digital Broadcasting" (in Japanese, ISBN4-339-01162-2) is published by CORONA publishing co., Ltd. Copying, reprinting, translation, or retransmission of this document is prohibited without the permission of the authors and the publishers, CORONA publishing co., Ltd. and NHK.

1. Fundamentals of high-efficiency coding

Digitizing a television signal in its original form produces a huge amount of information. For example, digitizing the NTSC signal currently used in Japanese broadcasting results in a data rate of about 100 Mbps, while a standard-definition television component signal and an HDTV production-standard signal result in about 216 Mbps and 1.2 Gbps, respectively. Transmitting this amount of information, which is at least 1000 times that of an audio signal, requires a broadband transmission path, and recording it requires large-capacity storage equipment.
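As a quick check of the 216 Mbps figure, the following sketch computes the bit rate of a standard-definition component signal from the ITU-R BT.601 sampling parameters (13.5 MHz for luminance, 6.75 MHz for each of the two color-difference signals, 8 bits per sample); Python is used here purely for illustration.

```python
# Bit-rate check for the 216 Mbps figure quoted above, using the
# ITU-R BT.601 sampling parameters for a standard-definition
# component signal (4:2:2, 8 bits per sample).
luma_hz = 13.5e6          # luminance sampling frequency
chroma_hz = 6.75e6        # each of the two color-difference signals
bits_per_sample = 8

rate = (luma_hz + 2 * chroma_hz) * bits_per_sample
print(f"{rate / 1e6:.0f} Mbps")   # -> 216 Mbps
```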
A picture signal, however, possesses multidimensional redundancy. In addition to the strong two-dimensional correlation between adjacent picture elements (pels) over space, there is also correlation along the time axis. A good degree of data compression can be expected by using such statistical properties of a picture to eliminate redundant sections.
Against the above background, high-efficiency coding has been researched for many years with the aim of achieving efficient transmission or storage of picture signals (especially television signals) by reducing their redundancy. The following overviews the properties of video signals that underpin the design of coding systems and describes the framework for compression processing that exploits those properties.

[1] Statistical properties of video signals
A great deal of data has been presented in relation to the statistical properties of television signals and other kinds of picture signals, and models of such signals have been constructed on the basis of that data.

(1) Amplitude distribution
Figure 1: Difference-signal distribution between adjacent pels
Figure 2: Measurement of difference-signal amplitude distribution between frames
Due to the ease of measuring the amplitude distribution of a picture signal, many examples of such measurements can be found. It is intuitively clear, however, that such measurements are extremely dependent on the picture itself and can vary greatly according to shooting conditions. For a television signal, moreover, the non-linearity of the picture tube is corrected (gamma correction) at the camera, which means that the amplitude distribution in camera output will be somewhat different from the optical intensity distribution of the original picture. Rigorous modeling is therefore difficult to achieve, but it is possible to use a uniform distribution as a first-order approximation showing the aggregate average of many pictures.
Picture coding is often carried out using difference signals in the picture instead of amplitude itself. This is because the high correlation of the picture signal means that difference in amplitude between (spatially or temporally) adjacent pels is frequently small. Figure 1 shows an example of a difference-signal distribution between adjacent pels and Fig. 2 shows an example of the amplitude distribution of the difference signal in the temporal direction, i.e., between frames. As shown by these figures, the amplitude density function of the difference signal is nearly a double-sided exponential distribution (Laplace distribution) as given by Eq. (1).
p(x) = \frac{\alpha}{2} e^{-\alpha |x|}    (1)
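As an illustration of how such a distribution is measured, the following minimal sketch histograms the differences between horizontally adjacent pels and compares the result with the Laplace density of Eq. (1); the synthetic image (with Laplace-distributed increments) is an illustrative stand-in for real picture data.

```python
import numpy as np

# Sketch of the measurement behind Fig. 1: estimate the amplitude
# distribution of differences between horizontally adjacent pels and
# compare it with the Laplace density of Eq. (1).
rng = np.random.default_rng(0)
image = np.cumsum(rng.laplace(0.0, 2.0, size=(256, 256)), axis=1)

diff = (image[:, 1:] - image[:, :-1]).ravel()
alpha = 1.0 / np.abs(diff).mean()        # ML estimate of the Laplace parameter

bins = np.arange(-10.5, 11.5)            # unit-width bins centred on integers
hist, _ = np.histogram(diff, bins=bins, density=True)
centres = (bins[:-1] + bins[1:]) / 2
model = (alpha / 2.0) * np.exp(-alpha * np.abs(centres))  # Eq. (1)
for c, h, m in zip(centres[8:13], hist[8:13], model[8:13]):
    print(f"diff={c:+.0f}  measured={h:.4f}  Laplace={m:.4f}")
```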
(2) Correlation function and power spectrum
When the statistical properties of picture information g(x, y) are invariable with respect to screen position, that picture information is called "spatially stationary." The autocorrelation function for stationary two-dimensional picture information is defined as follows.
R(\tau_x, \tau_y) = E\left[\, g(x, y)\, g(x + \tau_x,\, y + \tau_y) \,\right]    (2)
This indicates the strength of the spatial correlation between two points separated by a displacement of (\tau_x, \tau_y).
Kretzmer was the first to actually measure a picture's autocorrelation function using optical means. Results of that measurement are shown in Fig. 3. Many subsequent measurements exhibited similar behavior. As shown by the figure, the correlation between adjacent pels is normally high despite the fact that correlation differs according to picture content. This characteristic can be approximated one-dimensionally as follows.
R(\tau) = \sigma^2 e^{-\alpha |\tau|}    (3)
It can also be expressed two-dimensionally as
R(\tau_x, \tau_y) = \sigma^2 \exp\left(-\alpha \sqrt{\tau_x^2 + \tau_y^2}\right)    (4)
or
R(\tau_x, \tau_y) = \sigma^2 \exp\left(-\alpha |\tau_x| - \beta |\tau_y|\right)    (5)
Equation (4) is called the "circularly symmetric model" and Eq. (5) the "separable model." We point out here that in ordinary pictures, horizontal and vertical correlation is somewhat stronger than diagonal correlation, reflecting man-made structures and natural features like the horizon. For this reason, the separable model of Eq. (5) is considered to be closer to reality.
For the separable model, we consider samples on a grid and express the correlation function between pels separated by m pels horizontally and n pels vertically as follows, where the intervals between sample points are taken to be x_0 horizontally and y_0 vertically.
R(m, n) = \sigma^2 \rho_x^{|m|} \rho_y^{|n|}, \quad \rho_x = e^{-\alpha x_0}, \; \rho_y = e^{-\beta y_0}    (6)
A random signal process that satisfies a condition like this is called a first-order Markov process.
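The first-order Markov property can be checked numerically: the sketch below generates a one-dimensional AR(1) sequence with coefficient rho = 0.95 (a common stand-in for a line of picture samples) and confirms that the measured correlation between samples m apart follows rho^|m|, as in Eq. (6).

```python
import numpy as np

# AR(1) ("first-order Markov") sequence: x[i] = rho * x[i-1] + noise,
# scaled so the stationary variance is 1. Its autocorrelation at lag m
# should follow rho**|m|, matching the separable model of Eq. (6).
rng = np.random.default_rng(1)
rho, n = 0.95, 200_000
noise = rng.standard_normal(n) * np.sqrt(1.0 - rho**2)
x = np.empty(n)
x[0] = rng.standard_normal()
for i in range(1, n):
    x[i] = rho * x[i - 1] + noise[i]

for m in (1, 2, 4, 8):
    measured = np.corrcoef(x[:-m], x[m:])[0, 1]
    print(f"lag {m}: measured {measured:.4f}, model rho**{m} = {rho**m:.4f}")
```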
While the autocorrelation function expresses the properties of picture information in the spatial domain, the power spectral density (referred to simply as "power spectrum" below) expresses picture properties in the spatial frequency domain. The two-dimensional power spectrum of a picture is denoted as S(\omega_x, \omega_y) and defined as follows.
S(\omega_x, \omega_y) = \lim_{X, Y \to \infty} \frac{1}{4XY}\, E\!\left[ \left| \int_{-X}^{X}\!\int_{-Y}^{Y} g(x, y)\, e^{-j(\omega_x x + \omega_y y)}\, dx\, dy \right|^2 \right]    (7)
It is known that this power spectrum and the correlation function constitute a Fourier transform pair according to the Wiener-Khintchine theorem. This means that a signal with high autocorrelation, like a picture signal, concentrates its power in the low-frequency domain. Applying the Fourier transform to the separable model of Eq. (5), with parameters \sigma^2, \alpha, and \beta, the two-dimensional power spectrum becomes
S(\omega_x, \omega_y) = \sigma^2 \cdot \frac{2\alpha}{\alpha^2 + \omega_x^2} \cdot \frac{2\beta}{\beta^2 + \omega_y^2}    (8)
The above deals with two-dimensional signals like still pictures, but there is also a three-dimensional autocorrelation function derived by Connor and Limb that deals with moving picture signals. They created a model of a moving picture signal that considers object speed as well as the camera's aperture and accumulation (integration) effects, and derived the autocorrelation based on that model. They also showed that as object speed increases, spatial correlation along the direction of motion increases but temporal correlation decreases.
Miyahara has also measured the spectrum of the inter-frame difference signal. It was found that the autocorrelation function of this signal derived from that measurement deviates slightly from the exponential function of Eq. (3).

Figure 3: Results of measuring the autocorrelation function

[2] Redundancy and compression
As described in section [1], a picture signal generally has high inter-pel correlation and includes many redundant components. Removing these redundant components should make it possible to represent a signal in an optimal fashion and compress information. Redundancy based on the statistical properties described above is called "statistical redundancy," and how to reduce it is the basis for designing a high-efficiency coding process.
Another form of redundancy targeted for reduction by high-efficiency coding is "visual redundancy." This kind of redundancy is based on human visual characteristics such as the following.
- Visibility with respect to resolution in a diagonal direction is lower than that in the horizontal or vertical direction
- Visibility with respect to high-frequency components is lower than that of low-frequency components
- Visibility with respect to moving objects is lower than that of stationary objects
Visual redundancy is used in analog bandwidth-reduction techniques as exemplified by MUSE (Multiple Sub-Nyquist Sampling Encoding). In digital data-compression techniques, statistical redundancy and visual redundancy are cleverly reduced to achieve high compressibility.

[3] Basic coding/decoding process
The basic picture coding/decoding process, including picture input and output, is shown in Fig. 4. This process begins with an analog/digital converter (A/D) converting the picture signal output from a photoelectric converter, such as a camera, into a two-dimensional or three-dimensional digital data array. Each element of this array represents a sample of the picture in question and is called a "picture element," or "pel" for short. This data array includes many redundant components, which the source coder removes as far as possible. The data stream is then passed through a channel coder that adds fixed, predefined redundancy such as error-correction codes to increase the reliability of transmission or of the storage medium. Note that high-efficiency coding here generally means source coding.

Figure 4: Basic steps of the picture coding/decoding process

The source-coding process is shown in Fig. 5. The first step in the process is signal conversion. Making use of video characteristics (statistical redundancy), this step converts the signal into data having a distribution as biased as possible, decreases entropy, and extracts only information that must be coded. In principle, this process is not accompanied by distortion. Next, quantization approximates the extracted data to a finite number of discrete levels and decreases entropy even further. This process, on the other hand, is irreversible and accompanied by distortion. It is common, however, to make use of visual characteristics (visual redundancy) to manipulate the data in such a way as to decrease the amount of subjective distortion as much as possible. Finally, in code allocation, the system allocates optimal binary codes to the symbols output by quantization so that the average code length of coded output approaches the entropy of those symbols. This is a reversible process without distortion.
It should be mentioned here that the constituent elements in the above processes are not designed independently of each other, and that there are techniques for combining two constituent elements.

Figure 5: Source-coding process
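The following sketch runs a toy version of the three stages of Fig. 5, using previous-pel prediction as the distortion-free signal conversion, a uniform quantizer, and first-order entropy as a stand-in for the code-allocation stage; the correlated test signal and the step size of 4 are illustrative assumptions.

```python
import numpy as np

# Toy run of the three stages of Fig. 5. Stage 1 (signal conversion)
# uses previous-pel differences: lossless, but it biases the value
# distribution. Stage 2 (quantization) is lossy. First-order entropy
# stands in for the average code length of stage 3 (code allocation).
def entropy(symbols):
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(2)
signal = np.cumsum(rng.integers(-4, 5, size=100_000)) + 128  # correlated source

converted = np.diff(signal)           # stage 1: reversible conversion
quantized = np.round(converted / 4)   # stage 2: irreversible, step size 4
print("entropy of raw samples      :", round(entropy(signal), 2), "bits")
print("after signal conversion     :", round(entropy(converted), 2), "bits")
print("after quantization (step 4) :", round(entropy(quantized), 2), "bits")
```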

2. Elemental technologies

[1] Predictive coding
As described above, a picture signal exhibits high correlation, and the difference between adjacent pels is generally small. It is therefore more efficient to transmit differences with respect to what has already been transmitted than to transmit the signal directly. This kind of coding system is called "predictive coding" (differential pulse-code modulation (DPCM) is a near synonym).
To be more specific, a predictive-coding system stores previously coded pel values in a memory within the coder and computes a prediction value x' for the pel x to be coded. The system then quantizes the prediction error x - x', allocates a code to the result, and transmits the code. On the receiver side, the system first reconstructs the prediction-error signal from the transmitted codes and then adds it to the prediction formed from the already decoded picture signal to obtain the new pel value. The function that calculates prediction values is called a "predictor."
In moving picture coding, it is effective to perform prediction based on pels of the prior field or prior frame. These processes are called "inter-field prediction" and "inter-frame prediction," respectively. Frequently, only a small part of the picture displays motion, for example, a scene showing an airplane flying across the sky. In such cases it is highly efficient to extract the moving areas from the magnitude of the inter-frame prediction-error signal and to transmit a prediction-error signal only for those areas (a method called "conditional replenishment"). With this method, however, coding efficiency deteriorates rapidly as the moving areas increase. To combat this problem, the amount of motion (motion vector) of a moving object can be estimated by some method and inter-frame prediction performed after compensating for that motion. This method is called "motion-compensated (MC) prediction." Motion-estimation methods include the block-matching method and the gradient method, with the former finding widespread use at present.

Figure 6: Example of predictive coding (intra-field prediction)

Figure 7: Principle of motion-compensation prediction
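A minimal sketch of the block-matching method follows: for one block of the current frame it exhaustively searches a small window of the previous frame for the displacement minimizing the sum of absolute differences (SAD). The block size, search range, and synthetic frames are illustrative assumptions.

```python
import numpy as np

# Exhaustive block matching: find the displacement (dx, dy) into the
# previous frame that minimises the SAD for one 16x16 block of the
# current frame. Frames here are synthetic; a global shift of the
# content lets the search recover a known motion vector.
def block_match(prev, cur, bx, by, bsize=16, srange=7):
    block = cur[by:by + bsize, bx:bx + bsize]
    best = (0, 0, np.inf)
    for dy in range(-srange, srange + 1):
        for dx in range(-srange, srange + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + bsize > prev.shape[0] or x + bsize > prev.shape[1]:
                continue                       # candidate outside the frame
            sad = np.abs(block - prev[y:y + bsize, x:x + bsize]).sum()
            if sad < best[2]:
                best = (dx, dy, sad)
    return best

rng = np.random.default_rng(3)
prev = rng.integers(0, 256, size=(64, 64)).astype(float)
cur = np.roll(prev, shift=(2, -3), axis=(0, 1))   # content moves down 2, left 3
print(block_match(prev, cur, bx=24, by=24))       # -> (3, -2, 0.0)
```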

[2] Orthogonal transform coding
In terms of the frequency domain, the fact that an ordinary picture signal exhibits high correlation means that power concentrates in low-frequency components. Accordingly, coding and transmitting the magnitudes (coefficients) of only low-frequency components where power concentrates should make it possible to achieve an overall reduction in the amount of information.
A method called "transform coding" linearly transforms a correlated signal into a signal having as little correlation as possible, and uses the fact that power concentrates in specific transform coefficients to optimize the amount of information allocated to each coefficient. This process results in an overall reduction in coded information.
In actuality, it is common to divide the image into blocks of appropriate sizes and to transmit the picture signal after performing an orthogonal transformation on each block and quantizing the coefficients. Various proposals have been made for the transform bases (waveforms) used in orthogonal transform coding, two well-known ones being the Hadamard transform, which is easy to implement in hardware, and the Karhunen-Loève (KL) transform, which is the most efficient in principle (though not easy to implement). In recent years, however, the discrete cosine transform (DCT) has found widespread use, as it achieves characteristics close to those of the KL transform for ordinary picture signals. Its adoption has also been spurred by the growing ease of implementing DCT processing in LSI devices as circuit technologies continue to progress.

(1) DCT
Figure 8: Transform bases
Figures 8(a) and 8(b) show transform bases for DCT of block-length 8 and those for the KL transform of a picture signal approximating a first-order Markov model and having an adjacent-pel correlation of 0.95. As can be seen, these two sets of transform bases are in near agreement. It can also be theoretically shown that the KL transform approaches DCT as the correlation coefficient approaches 1.0 in the first-order Markov model.
Figure 9 shows an example of DCT coding for a one-dimensional sequence of video-signal values. In this process, the video signal is transformed into an array of component magnitudes (coefficients) corresponding to each of the DCT transform bases. In general, the components for the bases expressing low frequency, which are positioned towards the top of the figure, have large magnitudes. A relatively large amount of information is therefore allocated to these components while a relatively small amount of information is allocated to high-frequency components. This makes it possible to represent a video signal with less information overall than is used to represent the original signal sequence. On the decoding side, the video signal is reconstructed by performing a linear summation of the coefficient-weighted transform bases received from the coding side.
Because a picture signal has high correlation in both the horizontal and vertical directions, it is common practice to perform a two-dimensional DCT in these directions using an 8-pel-horizontal by 8-pel-vertical block.
The DCT exhibits nearly optimal characteristics in the sense that it concentrates power in certain coefficients (coefficients corresponding to low-frequency bases), as described above. On the other hand, DCT can also be a cause of picture degradation due to features like block distortion and mosquito noise, as described below.

Figure 9: Example of DCT coding (1 dimensional process)
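The energy-compaction property can be seen directly with a few lines of code: the sketch below builds the orthonormal 8-point DCT-II matrix, applies the separable two-dimensional transform to a smooth (highly correlated) stand-in block, and reports how much of the power falls in the low-frequency corner.

```python
import numpy as np

# Orthonormal 8-point DCT-II matrix and a separable 2-D transform of a
# smooth 8x8 block; most of the power lands in the low-frequency corner.
N = 8
k, i = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
C = np.sqrt(2.0 / N) * np.cos((2 * i + 1) * k * np.pi / (2 * N))
C[0] /= np.sqrt(2.0)                      # DC row scaling: C is orthonormal

x, y = np.meshgrid(np.arange(N), np.arange(N))
block = 128 + 40 * np.cos(np.pi * (x + y) / 16.0)   # smooth stand-in block

coeff = C @ block @ C.T                   # 2-D DCT: rows, then columns
power = coeff**2
share = power[:2, :2].sum() / power.sum()
print(f"power in the 2x2 low-frequency corner: {100 * share:.1f}%")
```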

(2) LOT
Figure 10: LOT basis area
Coding that performs transformations in units of blocks, as in DCT, can be a source of visual degradation called "block distortion," in which block structures become visible in the decoded picture when an insufficient amount of information is allocated to the transform coefficients. To prevent this from happening, methods like the lapped orthogonal transform (LOT) and the wavelet transform are being studied. The former, which uses bases that overlap the transform bases of adjacent blocks (Fig. 10), is described first.
In LOT, N*N transform coefficients are determined from the pel values of a 2N*2N block, and overlapping of this block with adjacent blocks therefore occurs for N/2 pels at the top, bottom, left, and right. On the decoding side, a 2N*2N block of pels is reconstructed, and the overlapping sections are added on. This process ensures continuity between adjacent blocks and reduces block distortion.
In deriving the bases used in LOT, the bases of overlapped adjacent blocks are assumed to be orthogonal DCT bases. Establishing this condition maintains the key DCT feature of concentrating power. Nevertheless, with LOT and the wavelet transform described below, block-wise discontinuities may occur in the signal targeted for transformation due to block-by-block processing for motion compensation or other reasons. To solve this problem, overlapped motion compensation has been proposed.

(3) Wavelet transform
In DCT coding, the existence of discontinuities like sharp edges within a block can result in the appearance of ringing artifacts called "mosquito noise" or "corona effects" around those edges. This kind of visual degradation occurs because a local change affects the entire transform block due to the use of transform bases that have the same length from low to high frequencies. The wavelet transform, on the other hand, features a short transform basis length for high-frequency components compared with that for low-frequency components. This lowers the effect of local changes with respect to high frequencies and suppresses the occurrence of mosquito noise.
In a frequency analysis such as the Fourier transform, it cannot be known where in the signal a change has occurred. The wavelet transform, however, uses transform bases that are localized in time and therefore allows analysis in both the frequency and temporal domains.
In general, "wavelets" are expressed as a set of functions, as in Eq. (9), obtained by scaling by a factor a (scale transform) and shifting by b (shift transform) a "basic wavelet" function \psi(t) that is localized in both time and frequency.
\psi_{a,b}(t) = \sqrt{a}\, \psi\left(a(t - b)\right)    (9)
Equation (9) can be used to extract the time and frequency components of a signal in accordance with parameters a and b, in an operation called the "wavelet transform." As a becomes larger, that is, as frequency becomes higher, the width of the basis waveform becomes smaller, so temporal resolution increases while frequency resolution decreases (Fig. 11).

Figure 11: Wavelet transform

Figure 12 compares the transform bases of the wavelet transform with those of DCT. As mentioned above, the wavelet transform can suppress the spread of high-frequency mosquito noise in areas near object edges or the like where the signal changes sharply. Moreover, in a manner similar to LOT, adjacent transform bases overlap in a wavelet transform and block distortion will not, in principle, appear.

Figure 12: Comparison of transform bases between the wavelet transform and DCT

The wavelet transform can be implemented as a filter bank that recursively divides the low-frequency band, as shown in Fig. 13. In other words, at each stage a high-frequency band is extracted by filtering, while the low-frequency components are downsampled 2:1 and passed to another round of filtering. This downsampling is equivalent to stretching the preceding basis function by a factor of two, and repeating the operation extracts components in a stepwise manner from high frequencies down to low frequencies.

Figure 13: Implementation of the wavelet transform by a filter bank
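A minimal realization of this filter bank, using the Haar filters as the simplest analysis pair, is sketched below; at each level it splits off a high band and downsamples the low band 2:1 before recursing, as in Fig. 13.

```python
import numpy as np

# Dyadic wavelet decomposition as a recursive two-band filter bank
# (Fig. 13), using the Haar pair as the simplest analysis filters.
def analyse(signal, levels):
    """Split off a high band, downsample the low band 2:1, recurse."""
    bands = []
    low = np.asarray(signal, dtype=float)
    for _ in range(levels):
        high = (low[0::2] - low[1::2]) / np.sqrt(2.0)  # high band, decimated
        low = (low[0::2] + low[1::2]) / np.sqrt(2.0)   # low band, decimated
        bands.append(high)
    bands.append(low)                # final coarse approximation
    return bands

x = np.arange(16, dtype=float)       # a simple ramp signal
for band in analyse(x, levels=3):
    print(np.round(band, 3))
```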

[3] Vector quantization
Vector quantization is a method that represents a group of pels in vector form and performs an optimal quantization of that vector. In this method, coding efficiency near the theoretical limit can be achieved by increasing the dimension of the vector. Doing so, however, rapidly increases the amount of coding processing, so a vector dimension of 16 corresponding to a group of 4 horizontal pels * 4 vertical pels is typically chosen as a realistic compromise.
The coding procedure is as follows. The system first divides the image into blocks of 4*4 and compares each block with a previously prepared set of standard pel patterns called a codebook. Then, for each block, the system sends an index value corresponding to the pattern that best approximates that block. On the decoding side, the system reproduces the pattern corresponding to the received index value using the same codebook as the coding side.
Figure 14 compares vector quantization with scalar quantization, which quantizes each pel individually. In particular, Fig. 14(a) shows the result of performing scalar quantization on two pels individually, while Fig. 14(b) shows the manner of quantization that can be achieved by performing a vector quantization that treats x0 and x1 as a pair in the form of a two-dimensional vector. Dividing the signal space multidimensionally in this way can reduce overall quantization distortion.

Figure 14: Comparison of vector quantization and scalar quantization
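The coding procedure above amounts to a nearest-neighbour search in 16-dimensional space. The sketch below encodes an image as one codebook index per 4*4 block and decodes it again; the small random codebook is an illustrative stand-in for a trained one (see the LBG sketch below).

```python
import numpy as np

# Vector quantization of 4x4 blocks: flatten each block to a
# 16-dimensional vector and transmit only the index of the nearest
# codebook entry. The random codebook is a stand-in for a trained one.
rng = np.random.default_rng(4)
codebook = rng.integers(0, 256, size=(64, 16)).astype(float)   # 64 patterns

def vq_encode(image):
    h, w = image.shape
    blocks = image.reshape(h // 4, 4, w // 4, 4).swapaxes(1, 2).reshape(-1, 16)
    dists = ((blocks[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)          # one index per 4x4 block

def vq_decode(indices, h, w):
    blocks = codebook[indices].reshape(h // 4, w // 4, 4, 4)
    return blocks.swapaxes(1, 2).reshape(h, w)

image = rng.integers(0, 256, size=(32, 32)).astype(float)
indices = vq_encode(image)
recon = vq_decode(indices, 32, 32)
print(indices.shape, recon.shape)        # 64 indices for a 32x32 picture
```

With a 64-entry codebook, each 4*4 block costs log2 64 = 6 bits, i.e., 0.375 bits per pel, illustrating how the index rate is set by the codebook size.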

On the other hand, deriving a method for designing a compact and highly general codebook and reducing the amount of quantization computation and required memory are important issues in vector quantization. As a means of dealing with the former issue, the so-called Linde-Buzo-Gray (LBG) algorithm developed by Linde et al. is well known. As for the latter issue, it has already been mentioned that, because of processing and memory restrictions, a 16-dimensional 4*4 vector is currently considered to be the upper limit. Even at this dimension, however, a huge number of computations is needed to ensure good-quality reproduction. Various solutions to this problem have been proposed. One is the "tree-search algorithm," which uses a hierarchically arranged codebook corresponding to a stepwise division of the signal space to reduce the amount of computation needed to reach the output vector at the final node. Another is "mean-separation normalized vector quantization," which performs vector quantization after first subjecting the input picture signal to mean separation and normalization, thereby converting it to a standard probability distribution.
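The LBG design loop is essentially the k-means iteration shown below: assign each training vector to its nearest code vector, then replace each code vector by the centroid of its partition. The splitting initialization and empty-cell handling of the full algorithm are omitted for brevity.

```python
import numpy as np

# LBG codebook design, reduced to its k-means core: assign each
# training vector to the nearest code vector, then move each code
# vector to the centroid of its partition.
def lbg(training, size, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    codebook = training[rng.choice(len(training), size, replace=False)].copy()
    for _ in range(iters):
        d = ((training[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        nearest = d.argmin(axis=1)
        for j in range(size):
            members = training[nearest == j]
            if len(members):
                codebook[j] = members.mean(axis=0)   # centroid update
    return codebook

rng = np.random.default_rng(5)
training = rng.normal(128, 30, size=(5000, 16))   # stand-in 4x4 block vectors
print(lbg(training, size=64).shape)               # -> (64, 16)
```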

[4] Subband coding
The subband coding technique first divides the input signal into a number of separate bands through the use of filters and then codes the picture of each band individually. Using subbands in this way makes it possible to allocate an appropriate amount of information to each band and to control the frequency distribution of the distortion. In other words, subband coding allows for coding control that considers not only the power spectrum distribution of the picture signal but also the characteristics of human vision. In addition, the ability to form a hierarchy in the frequency domain means that subband coding can serve as a resolution-conversion technique when transmitting picture signals between systems of different resolutions and as a technique for achieving graceful degradation.
An example of a subband-coding procedure is shown in Fig. 15. On the coding side, the system divides the input signal into separate bands using analysis filters and performs downsampling in accordance with the bandwidths in question. On the decoding side, the system synthesizes the picture after performing upsampling.

Figure 15: Example of a subband-coding procedure
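For a single two-band split, the Fig. 15 chain can be written out directly. The sketch below uses the Haar analysis/synthesis pair, for which 2:1 downsampling followed by upsampling and synthesis reconstructs the input exactly (real systems use longer filters such as QMF or SSKF, as noted below).

```python
import numpy as np

# One two-band split of the Fig. 15 chain with the Haar pair: analysis
# filtering with 2:1 downsampling, then upsampling with synthesis
# filtering. For this pair the chain reconstructs the input exactly.
x = np.arange(16, dtype=float)

low = (x[0::2] + x[1::2]) / 2.0       # analysis low band, downsampled 2:1
high = (x[0::2] - x[1::2]) / 2.0      # analysis high band, downsampled 2:1

y = np.empty_like(x)                  # upsampling + synthesis in one step
y[0::2] = low + high
y[1::2] = low - high
print(np.allclose(x, y))              # True: perfect reconstruction
```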

In addition to one-dimensional division, subband division is also capable of two-dimensional division in the horizontal and vertical directions as well as three-dimensional division that includes the temporal direction. Two-dimensional division, however, is the most common.
To raise coding efficiency, DPCM, an orthogonal transform, or other coding methods may be used to code the divided signals. Alternatively, further division of these signals can itself eliminate spatial redundancy, allowing a configuration that removes the need for an orthogonal transform or the like. In this case, either all bands are divided further or only the low-frequency signals are divided recursively. A technique that adaptively changes the method of division according to picture content has also been proposed. Coding by band division alone produces none of the block distortion seen in orthogonal transform coding. Similar to LOT, however, it is not highly compatible with adaptive processing in block units, and some measure to deal with this problem is required.
Types of subband filters include the quadrature mirror filter (QMF) and the symmetric short kernel filter (SSKF).

Figure 16: Two-dimensional subband division

[5] Entropy coding
Entropy coding shortens the average code length of a source output having a finite number of symbols. It does this by using occurrence probabilities to allocate short code words to symbols that occur frequently. This is an information-preserving (lossless) type of coding that uses only statistical properties.

(1) Huffman coding
Given the occurrence frequencies of the objects to be coded (here, the output values of quantization), Huffman coding minimizes the average code length by allocating to each symbol a code of optimal length for its frequency of occurrence.
Taking orthogonal transform coding as an example, the quantized values of low-frequency coefficients tend to be large, while there is a high probability that the quantized value of a high-frequency coefficient is zero. With this in mind, we start from the DC component of a two-dimensional DCT and extract quantization levels by zigzag scanning toward the high-frequency components in the horizontal and vertical directions, as shown in Fig. 17. As a result, relatively large values appear consecutively at the beginning of the scan, while runs of zeroes are highly probable in the high-frequency area. It is therefore efficient in many cases to perform "run-length" coding, which codes the length of each run of zeroes. At this point, Huffman coding can be performed in two ways. The first handles symbols representing non-zero levels and symbols representing run lengths in the same coding space and allocates Huffman codes according to the occurrence frequency of both types of symbols. The second treats the combination of a run length and the following non-zero level as one symbol and allocates Huffman codes according to the occurrence frequency of such combined symbols (two-dimensional Huffman coding).

Figure 17: Zigzag scanning
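The chain just described, zigzag scan, run-length symbols, and Huffman code construction, can be sketched compactly as follows; the sample coefficient block and the end-of-block symbol are illustrative assumptions patterned after common DCT coders.

```python
import heapq
from collections import Counter

# Zigzag scan of an 8x8 block of quantized coefficients, conversion to
# (zero-run, level) symbols, and Huffman code construction over those
# symbols (the "two-dimensional Huffman" variant described above).
def zigzag(block):
    order = sorted(((y, x) for y in range(8) for x in range(8)),
                   key=lambda p: (p[0] + p[1],
                                  p[0] if (p[0] + p[1]) % 2 else -p[0]))
    return [block[y][x] for y, x in order]

def run_level(seq):
    symbols, run = [], 0
    for v in seq[1:]:                  # the DC value is coded separately
        if v == 0:
            run += 1
        else:
            symbols.append((run, v))   # run of zeroes + following level
            run = 0
    symbols.append("EOB")              # end-of-block after the last level
    return symbols

def huffman_codes(symbols):
    counts = Counter(symbols)
    heap = [(c, n, {s: ""}) for n, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    uid = len(heap)
    while len(heap) > 1:               # merge the two rarest subtrees
        c1, _, t1 = heapq.heappop(heap)
        c2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in t1.items()}
        merged.update({s: "1" + w for s, w in t2.items()})
        heapq.heappush(heap, (c1 + c2, uid, merged))
        uid += 1
    return heap[0][2]

block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[1][0], block[0][3] = 52, 8, -6, 3
syms = run_level(zigzag(block))
print(syms)                            # [(0, 8), (0, -6), (3, 3), 'EOB']
print(huffman_codes(syms))             # each symbol gets a 2-bit code here
```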

(2) Arithmetic coding
The arithmetic coding technique divides the interval [0, 1] on the number line into subintervals whose lengths are proportional to the occurrence probabilities of the symbols, and maps the symbol stream targeted for coding onto its corresponding subinterval. This division is repeated for each successive symbol, and the binary fraction expressing the coordinate of a point within the finally obtained interval is taken to be the code of the symbol stream. The longer the stream, the closer the average code length approaches the entropy.
The main purpose of arithmetic coding is to code binary data, but it can also be applied to multi-value information sources (like pictures) after they are converted into binary data. Arithmetic coding designed expressly for multi-value sources is also being studied.
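The interval-narrowing operation itself is only a few lines, as the sketch below shows for a hypothetical two-symbol source: each symbol selects the subinterval proportional to its probability, and any binary fraction inside the final interval identifies the whole stream.

```python
from fractions import Fraction

# Interval narrowing for a hypothetical two-symbol source: each symbol
# selects the subinterval of [0, 1) proportional to its probability;
# any binary fraction inside the final interval codes the whole stream.
probs = {"a": Fraction(3, 4), "b": Fraction(1, 4)}

def encode_interval(stream):
    low, width = Fraction(0), Fraction(1)
    for s in stream:
        offset = Fraction(0)
        for symbol, p in probs.items():    # locate the subinterval of s
            if symbol == s:
                low += width * offset
                width *= p
                break
            offset += p
    return low, low + width

lo, hi = encode_interval("aaba")
print(f"final interval: [{lo}, {hi})")     # [27/64, 135/256)
```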

(Dr. Yoshiaki Shishikui)