22.2 Multichannel Sound Reproduction System for Home Use

Kentaro Matsui

NHK has developed a 22.2 multichannel sound (22.2 sound) system for 8K Super Hi-Vision (8K), an ultra high-definition TV. The system consists of 24 spatially arranged audio channels including two low frequency effect (LFE) channels for reproducing three-dimensional spatial sound. To respond to various viewing circumstances of 8K in homes, we have also developed several reproduction methods to reproduce 22.2 sound with fewer loudspeakers. In this paper, we propose binaural reproduction of 22.2 sound with 12 loudspeakers integrated into a flat panel display, which makes it possible for us to experience 22.2 sound without installing 24 discrete loudspeakers.

1. Introduction

We are conducting research on 22.2 sound, the sound system for 8K. 22.2 sound consists of 22 audio channels surrounding the listeners and two LFE channels and is capable of reproducing an immersive and realistic three-dimensional sound field that conveys a sense of reality to the listeners¹⁾. It is currently undergoing international standardization in anticipation of its use in broadcasting.

How people watch television depends heavily on increasingly diverse consumer lifestyles²⁾, and in many cases it would not be easy to set up 24 individual loudspeakers in a typical living room. It is therefore important to provide options for reproducing 22.2 sound with fewer loudspeakers.

One technology that could resolve this issue is called binaural reproduction, which makes use of the characteristics of the human auditory system to localize a sound image^*1 so that it is perceived to come from an arbitrary location. This technology can be used to synthesize sound images for channels in directions where there is no loudspeaker, meaning that 22.2 sound could be reproduced using fewer loudspeakers.

This article gives an overview of binaural reproduction and reports on methods for estimating the head-related transfer function (HRTF), which includes features related to how humans perceive the localization of sound images and is fundamental to binaural reproduction. We also report on binaural reproduction of 22.2 sound using a loudspeaker frame that has been developed by Science & Technology Research Laboratories (STRL).

2. Binaural Reproduction

2.1 Binaural reproduction over headphones

An HRTF is defined as an acoustic transfer function that describes how a given sound wave from a specific position reaches the entrance to the listener's external ear canal in a free sound field^*2. HRTFs for the left and right ears contain many directional properties related to how humans perceive the localization of sound images, such as interaural time and level differences and spectral cues^*3 in the frequency characteristics. The sound image can be localized in an arbitrary direction by measuring the left and right HRTFs and adding their properties to a sound source. This method is mathematically formulated in the time domain as a convolution of an audio signal and head-related impulse responses (HRIRs), the time-domain equivalent of HRTFs. An audio signal generated this way is called a binaural signal, and reproducing such a binaural signal using headphones is called binaural reproduction.
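The convolution described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the processing used at STRL: the function name `binauralize` and the toy HRIR pair (a pure delay and attenuation at the far ear) are assumptions made for the example, and real HRIRs would come from measurement.

```python
import numpy as np

def binauralize(signal, hrir_left, hrir_right):
    """Convolve a mono signal with a left/right HRIR pair.

    Returns an array of shape (2, len(signal) + len(hrir) - 1):
    row 0 is the left-ear signal, row 1 the right-ear signal.
    Both HRIRs are assumed to have the same length.
    """
    left = np.convolve(signal, hrir_left)
    right = np.convolve(signal, hrir_right)
    return np.stack([left, right])

# Toy HRIR pair (hypothetical, NOT measured data): a source on the left
# reaches the right ear 30 samples later and at half the amplitude.
hrir_l = np.zeros(64)
hrir_l[0] = 1.0
hrir_r = np.zeros(64)
hrir_r[30] = 0.5

rng = np.random.default_rng(0)
mono = rng.standard_normal(1000)          # stand-in for an audio channel
binaural = binauralize(mono, hrir_l, hrir_r)
```

With this toy pair the interaural time and level differences are the only cues; measured HRIRs would additionally carry the spectral cues mentioned above.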

2.2 Binaural reproduction over loudspeakers

The left and right driver units of headphones can provide audio signals directly to each ear. Binaural reproduction over headphones can be realized by reproducing the audio signals convolved with the left and right HRIRs through these driver units (Figure 1). In contrast, in the case of binaural reproduction over loudspeakers, each loudspeaker propagates audio signals to both ears, a phenomenon called acoustic crosstalk (Figure 2). To suppress this unwanted phenomenon, audio signal processing is necessary to ensure that only the intended audio signal propagates to each ear. This compensatory processing is called crosstalk cancellation. Figure 3 shows a schematic diagram illustrating binaural reproduction over two stereophonic loudspeakers. In the figure, G represents the acoustic transfer functions from each loudspeaker to the left and right ears, X represents the HRTFs for the directions to be presented to the left and right ears, and H represents the crosstalk cancellation controller. The relationship between input audio signal u and output audio signals y can be expressed by the following equations.

The controller, H, synthesizes the intended audio signal, wherein the HRTF for the intended direction is applied to the input audio signal, at the locations of the listener’s ears. As a result, the relation between the input and output audio signals is given by:

Thus, the controller, H, is designed to be the inverse of G, so

The positions of the listener's ears are the targets for the controller and are called the control points. There are several approaches to designing the controller. At STRL, we have developed convolution-based processing in the time domain³⁾ as well as processing in the frequency domain using singular value decomposition⁴⁾. Currently, we are using frequency-domain processing, which requires less computing time and load. In this case, the controller takes the form of a set of finite impulse response (FIR) filters by computing the inverses of the matrices for each discrete frequency bin^*4 and transforming them into the time domain through an inverse Fourier transform.
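As a rough sketch of the frequency-domain design for the two-loudspeaker case, the following Python code inverts a 2×2 plant matrix at every frequency bin and transforms the result into FIR filters. Everything numerical here is an assumption for illustration: the FFT length `N`, the toy delay-and-gain paths built by the hypothetical helper `tf`, and a plain matrix solve in place of the singular value decomposition mentioned above.

```python
import numpy as np

N = 1024  # FFT length, also the FIR filter length (assumed)

def tf(delay, gain):
    """Frequency response of a pure delay-and-gain path (toy model)."""
    h = np.zeros(N)
    h[delay] = gain
    return np.fft.rfft(h)

# G[k] is the 2x2 plant at bin k: rows are ears, columns are loudspeakers.
G = np.empty((N // 2 + 1, 2, 2), dtype=complex)
G[:, 0, 0] = G[:, 1, 1] = tf(0, 1.0)   # same-side paths
G[:, 0, 1] = G[:, 1, 0] = tf(8, 0.6)   # crosstalk paths

# X[k] holds the target HRTFs (left ear, right ear) of the virtual source.
X = np.stack([tf(0, 1.0), tf(30, 0.5)], axis=-1)[..., None]   # (bins, 2, 1)

# Invert bin by bin, H = G^{-1} X, then return to the time domain.
H = np.linalg.solve(G, X)                        # (bins, 2, 1)
h = np.fft.irfft(H[..., 0], n=N, axis=0).T       # (2 loudspeakers, N taps)
```

Feeding the input signal through `h[0]` and `h[1]` to the two loudspeakers should then produce the target responses at the ears of this toy plant. A measured plant is generally not this well behaved, which is one reason robustness of the inverse matters.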

The above description is for binaural reproduction using two loudspeakers, but the method can be easily extended to reproduction using three or more loudspeakers. The number of control points that can be handled increases in proportion to the number of loudspeakers, so using multiple loudspeakers for binaural reproduction is useful for expanding the listening area. In such cases, Equation (2) becomes:

Here, p is the number of control points, and q is the number of loudspeakers.

Figure 1: Binaural reproduction over headphones

Figure 3: Binaural reproduction over two stereophonic loudspeakers

3. Multi-directional simultaneous estimation of HRTFs

As will be discussed in Section 4, we are studying binaural reproduction using multiple loudspeakers placed inside a frame around the display. Measuring HRTFs takes longer as the number of loudspeakers increases, so to shorten the measurement time, we developed a multi-directional simultaneous estimation method that is based on system identification theory^*5.

Here, we assume that the HRIR for each direction can be approximated using an n^th-order FIR model. Also, the set of HRIRs for m directions is viewed as a MISO (Multiple Input Single Output) system wherein the measurement signals from the m directions are regarded as inputs and the signal recorded at the entrance to the external ear canal is regarded as an output. Generally, an m-input, one-output, n^th-order FIR model is given by:

Here, y(k) is the output at discrete time k, and w(k) is Gaussian noise. x_i(k) is the input vector from the i^th direction, composed of the inputs u_i(k), and θ_i is the parameter vector of the FIR model for the i^th direction. These quantities can be expressed as:

Aligning Equation (6) for each time k = 1, 2, …, N gives:

Here, X_i is the matrix with rows that are the input vectors, x_i(k), given by the following equation.

This can be simplified to

so Equation (8) can be written as

The parameter, θ, that satisfies this input/output relation is estimated using the least-squares method. A detailed derivation is given in reference⁵⁾, but the parameter estimate, θ̂, that minimizes the evaluation function based on one-step-ahead prediction^*6, or

is a least-squares estimate, given by:

In Equation (12), ‖·‖ is the 2-norm (Euclidean norm). The procedure for estimating HRTFs for multiple directions is as follows. First, measure the responses at the entrance to the left and right external ear canals by radiating the measurement signals from m directions. Then, use Equation (13) to estimate the parameters of the FIR models for each direction.
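The least-squares step can be illustrated with a small simulation. The sketch below is written under several assumptions that are not taken from the article: each direction is modelled with `n` FIR taps, the inputs are independent random binary signals, the "true" responses in `theta_true` are synthetic, and `np.linalg.lstsq` is used as the solver instead of forming the normal equations explicitly. The helper `regressor_matrix` is a hypothetical name.

```python
import numpy as np

def regressor_matrix(u, n):
    """Rows are x(k) = [u(k), u(k-1), ..., u(k-n+1)], zero initial state."""
    N = len(u)
    X = np.zeros((N, n))
    for j in range(n):
        X[j:, j] = u[:N - j]
    return X

rng = np.random.default_rng(1)
m, n, N = 3, 16, 2000          # directions, FIR taps, samples (all assumed)

# Synthetic decaying impulse responses standing in for the HRIRs.
theta_true = rng.standard_normal((m, n)) * np.exp(-0.3 * np.arange(n))
u = rng.choice([-1.0, 1.0], size=(m, N))    # one binary input per direction

# Stack the per-direction regressors side by side: X = [X_1 ... X_m].
X = np.hstack([regressor_matrix(u[i], n) for i in range(m)])

# Single recorded output: sum of all directions plus measurement noise.
y = X @ theta_true.ravel() + 0.01 * rng.standard_normal(N)

# Least-squares estimate, i.e. the theta minimizing ||y - X theta||_2.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
theta_hat = theta_hat.reshape(m, n)         # one row of taps per direction
```

All `m` responses are recovered from one recording, which is the point of the simultaneous method: the measurement time does not grow with the number of directions in the way one-at-a-time measurement does.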

For Equation (13) to have a solution, R must be a positive-definite matrix^*7. Measurement signals that satisfy this condition can be designed using the following procedure⁶⁾.

i) Create the signal m(k) from a pseudo-random binary sequence (PRBS) of period T, and use it as the input for the first direction.

Here, period T must satisfy.

ii) For the input for the second direction, circularly shift the input for the first direction, u₁(k), l samples.

iii) Similarly thereafter, the input for the i^th direction, u_i(k), is acquired by shifting the input for the (i-1)^th direction, u_{i-1}(k), l samples.
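Steps i) to iii) can be sketched as follows. The values of `m`, `n` and `l`, the 7-bit shift register, and the spacing condition checked by the `assert` (shift at least as long as the FIR model, and all shifts fitting within one period) are assumptions chosen for this example rather than the conditions stated in the article. The Gram matrix of the regressors is used here as a stand-in for R.

```python
import numpy as np

def prbs(order=7, taps=(7, 6)):
    """One period (2**order - 1 samples) of a maximal-length binary sequence."""
    state = [1] * order
    bits = []
    for _ in range(2 ** order - 1):
        bits.append(state[-1])
        feedback = state[taps[0] - 1] ^ state[taps[1] - 1]
        state = [feedback] + state[:-1]
    return np.array(bits)

m, n, l = 4, 20, 25                 # directions, FIR taps, shift (assumed)
u1 = 1.0 - 2.0 * prbs()             # map bits {0, 1} to levels {+1, -1}
T = len(u1)                         # period of the sequence
assert n <= l and m * l <= T        # assumed spacing condition

# u_i is u_1 circularly shifted by (i - 1) * l samples.
U = np.stack([np.roll(u1, i * l) for i in range(m)])

# Regressors over one steady-state period: column (i, j) is u_i delayed by j.
X = np.column_stack([np.roll(U[i], j) for i in range(m) for j in range(n)])
R = X.T @ X
min_eig = np.linalg.eigvalsh(R).min()   # positive means R is positive definite
```

Because every column of `X` is a distinct circular shift of the same maximal-length sequence, the columns are nearly orthogonal, which keeps `R` well away from singularity in this example.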

4. Binaural reproduction of 22.2 sound with a loudspeaker frame

4.1 Flat panel display integrated loudspeaker frame

We expect that, in ordinary living rooms used for 8K viewing, the number of loudspeakers and the places to put them will be limited by room size and furniture²⁾. Accordingly, we have been conducting research on a loudspeaker frame that is integrated into a flat panel display (FPD). Figure 4 shows a 12-loudspeaker frame we developed for an 85-inch liquid crystal display (LCD). The frame has five dynamic loudspeaker units equally spaced along each of the top and bottom edges and one on each of the left and right edges. This arrangement corresponds to a mapping of the front channels in the 22.2 sound system. Each loudspeaker unit in the frame is housed in its own cavity to prevent mutual interference and cross modulation. Each loudspeaker unit has high input tolerance, low distortion characteristics, and a maximum sound pressure level of 92 dB, even though its diaphragm is small (7 cm diameter). Also, as shown in Figure 5, the loudspeaker unit has a compact corrugated edge design, adapted from NHK's monitor loudspeaker system, which suppresses antiresonances at large amplitudes and achieves approximately a 20 dB reduction in distortion in the midrange frequencies compared with conventional units of the same diameter.

Two subwoofer units are on the left and right sides. These units improve three-dimensional spatial impressions related to ambience and envelopment by reproducing the LFE channels and also improve low frequency characteristics by reproducing the bass content of 22.2 sound.

Figure 5: Corrugated edge structure reducing harmonic distortion

4.2 Binaural reproduction with a flat panel display integrated loudspeaker frame

The front channels in the 22.2 sound system, except for the FLc, FRc and FC channels^*8, are each assigned to one of the 12 loudspeaker units on the frame. The three excepted channels are synthesized as phantom sound images using the pair-wise amplitude panning method^*9. Our recent research revealed that a vertical pair of loudspeakers gives better directional stability for frontal phantom sound images than a horizontal pair, so we use the loudspeaker units above and below to synthesize these three channels. The side and back channels are synthesized at their specific channel positions through binaural reproduction using the 12 loudspeaker units. The listening position is at a distance of 1.5 times the display height, 1.5 m from the display. In this case, Equation (4) is underdetermined^*10, so the solution is not unique. Thus, when designing the controller, we adopt the least-norm solution^*13, which minimizes the condition number^*11 of the computed inverse matrix and makes the controller more robust^*12.
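To illustrate what a least-norm solution means for an underdetermined system, the sketch below solves one frequency bin with two control points and twelve loudspeakers. The plant matrix `G` and target `X` are random, made-up values rather than measured HRTFs, and the closed form G^H (G G^H)^{-1} X is the standard Moore-Penrose expression for a full-row-rank matrix, used here as an assumption about how such a solution could be computed.

```python
import numpy as np

rng = np.random.default_rng(2)
p, q = 2, 12     # control points (ears), loudspeaker units

# Hypothetical plant at ONE frequency bin; real values would come from
# the measured HRTFs between the frame and the listening position.
G = rng.standard_normal((p, q)) + 1j * rng.standard_normal((p, q))
X = np.array([[1.0 + 0.0j], [0.2 - 0.1j]])    # target ear responses (made up)

# Least-norm solution H = G^H (G G^H)^{-1} X.
H = G.conj().T @ np.linalg.solve(G @ G.conj().T, X)

# Any other exact solution differs from H by a null-space vector of G.
null_vec = np.linalg.svd(G)[2][-1].conj().reshape(q, 1)   # G @ null_vec ~ 0
H_other = H + null_vec

norm_least = np.linalg.norm(H)
norm_other = np.linalg.norm(H_other)
```

Both `H` and `H_other` reproduce the target exactly, but the least-norm one drives the loudspeakers with less total gain, which is generally what makes it less sensitive to errors in `G`.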

The HRTFs used in designing the controller, namely the HRTFs relating each of the loudspeaker units in the frame to the listening position, were computed using the simultaneous estimation method discussed in Section 3. As shown in Figure 6, measurements were made in an anechoic chamber at STRL. Measurement signals were radiated from each of the 12 loudspeaker units in the frame and recorded using a dummy head placed at the listening position. Here, a PRBS of length 2¹⁷-1 samples, at a sampling frequency of 48 kHz, was used. The loudspeaker units were arranged around the frame, so the distance to the listening position was different for each unit. To incorporate delays resulting from these distance differences and reflections from the loudspeaker frame, the circular shift, l, in Equation (18), was set to 1,200 samples, longer than a general HRIR. The measured sound pressure level at the dummy head was 70 dB.

The HRTFs that were used to synthesize the intended audio signals, that is, the HRTFs from the sides and the back of the listening position, were measured using the same dummy head. Simultaneous estimation was difficult, so they were measured one at a time using a logarithmic time-stretched pulse (LogTSP)^*14 with a length of 2¹⁷ samples at a sampling frequency of 48 kHz.

As examples, the HRIRs from the left and right loudspeaker units in the middle of the vertical side to the left and right ears are shown in Figure 7, and their frequency amplitude responses (HRTF) are shown in Figure 8. We normalized the HRIRs so that the overall peak would be -2 dB (Full Scale) and windowed them with a 512-sample rectangular window. In Figure 7 and Figure 8, the terms "left" and "right" indicate that the measurement signals were radiated from the left and right loudspeaker units, respectively. Figure 9 shows the HRIRs from the SiL and SiR channels^*15, immediately lateral to the listening position, to the left and right ears, and Figure 10 shows the corresponding frequency amplitude responses.
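The normalization and windowing step might look like the following sketch. The function name `normalize_and_truncate` is hypothetical, the rectangular window is assumed to start at the first sample (the article does not say where it is placed), and the demo HRIRs are synthetic decaying noise.

```python
import numpy as np

def normalize_and_truncate(hrirs, peak_db=-2.0, length=512):
    """Scale a set of HRIRs to a common peak level, then truncate.

    The largest absolute sample across the WHOLE set is scaled to
    `peak_db` dB re full scale, so level differences between the
    individual responses are preserved. Truncation to `length` samples
    is equivalent to applying a rectangular window from sample 0.
    """
    hrirs = np.asarray(hrirs, dtype=float)
    target = 10.0 ** (peak_db / 20.0)
    scaled = hrirs * (target / np.max(np.abs(hrirs)))
    return scaled[..., :length]

# Synthetic stand-in: 2 loudspeakers x 2 ears x 1200 samples.
rng = np.random.default_rng(4)
decay = np.exp(-np.arange(1200) / 50.0)
raw = rng.standard_normal((2, 2, 1200)) * decay
out = normalize_and_truncate(raw)
peak_level_db = 20.0 * np.log10(np.max(np.abs(out)))
```

Normalizing by one common factor rather than per response matters here, because the interaural level differences are part of the localization cues.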

Figure 7: Example of HRIRs from the loudspeaker unit to the left and right ears

Figure 8: Example of HRTFs from loudspeaker unit to left and right ears

Figure 9: Example of HRIRs from the side to left and right ears

Figure 10: Example of HRTFs from the side to left and right ears

4.3 Controller performance evaluation

We conducted an experiment to evaluate the performance of the controller quantitatively. The 12-loudspeaker frame and dummy head were set at the same positions as the HRTF measurement in the previous section, and the controller was cascaded with the 12-loudspeaker frame. The intended audio signals were fed to the left or right input terminal of the controller, and the responses at the positions of the left and right ears of the dummy head were recorded. A unit impulse signal was used for the intended audio signal. However, since it is difficult to feed a unit impulse directly, a LogTSP was fed and the response was convolved with the inverse of the LogTSP to obtain the impulse response.
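The idea of recovering an impulse response by convolving the recorded response with the inverse of the excitation can be sketched as below. Note the substitution: instead of a LogTSP, the example uses a generic excitation with a flat magnitude spectrum and random phase, whose inverse filter is simply the complex conjugate of its spectrum. The system under test `h_true` is a made-up impulse with one echo, and the response is simulated by circular convolution.

```python
import numpy as np

N = 4096
rng = np.random.default_rng(3)

# Excitation with unit magnitude at every bin and random phase.
# This is a stand-in for the LogTSP, not a LogTSP itself.
phase = rng.uniform(-np.pi, np.pi, N // 2 + 1)
phase[0] = phase[-1] = 0.0              # keep DC and Nyquist bins real
S = np.exp(1j * phase)
s = np.fft.irfft(S, n=N)

# Hypothetical system under test: a delayed impulse plus a weaker echo.
h_true = np.zeros(N)
h_true[40] = 1.0
h_true[55] = -0.3

# One period of the response to the periodically repeated excitation
# (circular convolution of excitation and system).
r = np.fft.irfft(np.fft.rfft(s) * np.fft.rfft(h_true), n=N)

# Apply the inverse filter: for |S| = 1 the inverse spectrum is conj(S).
h_est = np.fft.irfft(np.fft.rfft(r) * np.conj(S), n=N)
```

Spreading the excitation energy over time in this way is what makes the measurement practical: a loudspeaker cannot radiate a true unit impulse at a useful level.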

The measured impulse responses are shown in Figure 11, and the corresponding frequency amplitude responses in Figure 12. The terms "left" and "right" in these figures indicate that a unit impulse signal was fed from the left or right input terminal; therefore, unit impulse signals with some delay should be observed at the ipsilateral ears and silent signals at the contralateral ears. In Figure 12, we see that the responses observed at the ipsilateral ears approximate the envelope of the intended allpass properties to some extent, and that the crosstalk occurring at the contralateral ears is suppressed to -15 dB or less at almost all frequencies. In both cases, the performance of crosstalk cancellation deteriorates at low and high frequencies. This is mainly because these frequencies are outside the frequency range of the loudspeaker units.

We evaluated the stability of the controller by using the condition number of the inverse matrices computed during the design of the controller as an index. The condition number is an objective measure of sensitivity to noise or numerical errors. A system of equations having a large condition number is considered to be ill-conditioned and is susceptible to small amounts of noise or errors introduced in the computation process. Here, we varied the number and layout of the loudspeaker units used for binaural reproduction, as shown in Figure 13, and computed the condition number for each discrete frequency bin. Figure 14 shows the condition number as a function of frequency and the number of loudspeaker units. The condition number decreases roughly in inverse proportion to the number of loudspeaker units, and the peaks seen in the reproduction using two units are gradually suppressed as units are added. These results imply that by increasing the number of loudspeakers used for binaural reproduction, the stability of the controller and synthesized phantom sound images can be improved.
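A per-bin condition number of this kind can be computed as in the sketch below. The geometry, the free-field point-source model in the hypothetical helper `plant`, and the three layouts are all assumptions made for illustration. They are not the measured HRTFs or the layouts of Figure 13, so the numbers this produces should not be expected to match Figure 14.

```python
import numpy as np

C = 343.0  # speed of sound in m/s

def plant(speakers, ears, freq):
    """Free-field point-source transfer matrix, shape (ears, speakers)."""
    k = 2 * np.pi * freq / C
    r = np.linalg.norm(ears[:, None, :] - speakers[None, :, :], axis=-1)
    return np.exp(-1j * k * r) / r

# Hypothetical geometry in metres (x: right, y: up, z: towards listener).
w, h = 1.9, 1.1
xs = np.linspace(-w / 2, w / 2, 5)
top = np.stack([xs, np.full(5, h / 2), np.zeros(5)], axis=1)
bottom = np.stack([xs, np.full(5, -h / 2), np.zeros(5)], axis=1)
sides = np.array([[-w / 2, 0.0, 0.0], [w / 2, 0.0, 0.0]])
layouts = {
    2: sides,
    6: np.vstack([sides, top[[0, -1]], bottom[[0, -1]]]),
    12: np.vstack([sides, top, bottom]),
}
ears = np.array([[-0.09, 0.0, 1.5], [0.09, 0.0, 1.5]])

# 2-norm condition number (ratio of extreme singular values) per frequency.
freqs = np.linspace(100.0, 8000.0, 80)
cond = {q: np.array([np.linalg.cond(plant(spk, ears, f)) for f in freqs])
        for q, spk in layouts.items()}
```

Plotting each entry of `cond` against `freqs` gives one curve per layout, which is the kind of comparison used to judge how the loudspeaker count affects the conditioning of the inverse.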

Figure 11: Impulse responses measured at control point

Figure 12: Frequency amplitude responses measured at control point

Figure 13: Loudspeaker units arrangement used to evaluate stability of the controller

Figure 14: Condition number vs. frequency and number of loudspeaker units

5. Conclusion

We described a method for reproducing 22.2 sound in the home on a loudspeaker frame integrated into a flat panel display and gave an overview of this system. We showed this method's effectiveness by conducting experimental measurements on a prototype 12-loudspeaker frame. We also described a method for multi-directional simultaneous estimation of HRTFs, which are the basis for binaural reproduction.

Currently, we can handle one listening position, having given priority to developing a stable system, but we will try to extend the method to multiple listening positions or a broad listening area. We will also continue studying issues related to implementation, including reducing the amount of computation.

Part of this research was conducted in cooperation with the Adachi Lab in the Science and Technology Department of Keio University. We would like to express our thanks to Prof. Shuichi Adachi and his students for their assistance in advancing this research.

This article has been amended and corrected based on the following papers appearing in Journal of the Acoustical Society of Japan and ITE Journal.

K. Ishikawa, Y. Tokuzumi, I. Maruta, S. Adachi, K. Matsui, and A. Ando: "Multidirectional simultaneous estimation of head-related transfer function in a 3-dimensional space by system identification," Journal of the Acoustical Society of Japan, Vol. 69, No. 7, pp. 321-330 (2013)

K. Matsui, S. Oishi, T. Sugimoto, S. Oode, Y. Nakayama, H. Okubo, H. Sato, K. Mizuno, Y. Morita and S. Adachi: "Binaural Reproduction of 22.2 Multichannel Sound with Flat Panel Display-Integrated Loudspeaker Frame for Home Use," ITE Journal, Vol. 68, No. 10, pp. J447-J456 (2014)