Objective perceptual audio quality measurement methods^*1

Kaoru Watanabe

In digital broadcasting, video and audio signals are broadcast at low bit rates^*2 by employing efficient compression schemes. Audio signals are compressed using so-called perceptual audio coding, whereby the noise components caused by compression are controlled so that they fall in frequency bands where they have little effect on the perceived quality. Subjective evaluation is the ideal method for assessing the audio quality of these encoded signals, but it is a time-consuming process that is unsuitable for real-time audio quality assessment. We therefore require an objective measurement method that can quantitatively estimate the perceptual audio quality from the physical features of the audio signal. The PEAQ (Perceptual Evaluation of Audio Quality) measurement method recommended by the ITU-R can measure perceptual audio quality accurately by using psychoacoustic models that simulate phenomena such as auditory masking. In this paper we present the types of objective audio quality measurement methods, describe the PEAQ measurement algorithm, and outline several approaches to improving objective audio quality measurement methods.

1. Introduction

Since the introduction of digital BS TV broadcasting in December 2000, the number of TV sets capable of receiving these broadcasts has reached approximately 51 million (as of late March 2009). In December 2003, digital terrestrial TV broadcasting was started in the Tokyo, Nagoya and Osaka regions. Today these broadcasts are transmitted from the regional centers of every administrative district in Japan, and have gradually extended to approximately 48 million viewers (as of March 2009). Furthermore, the 1 seg service was set up in April 2006 to allow TV programs to be viewed on mobile terminals. This service rapidly became popular and has reached an audience of approximately 53 million viewers (as of February 2009). The radio bands used for digital broadcasting are a limited resource, so low-bit-rate coding that efficiently compresses the video and audio signals is employed for digital TV broadcasting. With the spread of the internet and mobile phones, there is also rapid growth in music distribution models where compressed audio signals are downloaded and transferred to mobile audio players.

These audio signals are compressed and coded using so-called perceptual audio coding, whereby the noise components caused by compression are controlled so that they fall in frequency bands where they have little effect on the perceived audio quality, owing to the threshold of audibility and masking phenomena. Even within perceptual audio coding, however, the quality of the decoded audio signals can differ depending on the compression scheme employed in the audio encoder. Even when compression is performed with the same type of coding scheme at the same bit rate, differences in audio quality can arise from differences in the bit allocation methods employed in the encoders. In order to provide high-quality audio services within limited transmission bandwidths, it is important to carry out quality design at the service planning stage and quality management once the service is operational. So far, the most reliable results have been obtained by performing carefully designed subjective evaluation tests. However, subjective evaluation tests are time-consuming and expensive. A strict subjective evaluation is the ideal method when selecting important systems such as international audio coding standards, but it is not suitable for real-time audio quality monitoring. Therefore we need an objective measurement method that uses physical features to quantitatively estimate the perceptual audio quality¹⁾²⁾. Metrics such as the SN (signal to noise) ratio and total harmonic distortion have traditionally been used as objective quality measures for audio signals. However, since perceptual audio coding places the noise in frequency bands where its perceptual effect is small, it is difficult to measure the perceptual quality of coded audio signals with metrics such as the SN ratio. Studies have therefore been carried out on objective audio quality measurement methods that can be applied to low-bit-rate audio coding such as perceptual coding, and the PEAQ measurement method has been recommended by the ITU-R (International Telecommunication Union – Radiocommunication Sector).

In this paper we first discuss the classification of objective measurement methods and briefly introduce objective measurement methods for telephony speech. Next, after explaining the auditory system and sound perception, on which objective measurement of perceptual audio quality is based, we introduce the PEAQ measurement method and some examples of its applications. Finally, we discuss new prospects for objective measurement methods, such as their extension to multi-channel stereo audio signals.

2. Classification of objective sound quality measurement methods

Sound can be broadly classified into two categories: high-fidelity audio and speech in a telephone environment. For this reason, there are different sound quality evaluation criteria for these two categories. For example, evaluation of high-fidelity audio must be applicable to all kinds of sound, whereas telephony is generally concerned only with speech content. High-fidelity audio is principally evaluated based on the perceived quality in listening tests, whereas for telephony speech it is also important to consider the talking quality^*3 and the interaction quality of the conversation^*4. Even in compression coding schemes, which are the typical applications of objective measurement methods, the coding mechanisms differ between high-fidelity audio and telephony speech signals; for example, the latter may exploit models of human voice production. In terms of the applications of objective measurement methods, the main needs in the field of high-fidelity audio are to ascertain the coding performance before starting up a service and to provide accurate information for bit-rate settings, audio equipment and the like. In the field of telephony speech, especially for IP telephony services^*5, there is a wide range of applications such as ascertaining the performance of IP telephony networks, devices and equipment, designing IP telephony circuits, and real-time monitoring and management for the optimal operation of telephony networks.

Table 1 shows a technical classification of objective sound quality measurement methods, organized by their input signals and application fields³⁾. These include models where the audio signals themselves are used as the input signals (media layer models) and models where the coded bitstreams are used as input (bitstream layer models). Media layer models that use audio signals as input can be sub-categorized into (a) full-reference models that use the original audio signal and the decoded audio signal, (b) non-reference models that use only the decoded signal, and (c) reduced-reference models that use the decoded signal and features of the original audio signal.

For IP telephony network applications, models include packet layer models which use the header information of packets such as RTP (Real-time Transport Protocol)^*6 and RTCP (RTP Control Protocol)^*7, and parametric models which use quality design and management parameters of networks and terminals, such as the coding bit rate and packet loss. There are also hybrid models that aim to expand the applicable range or improve the measurement accuracy by combining these two types of model.

In the high-fidelity audio field, studies have progressed on full-reference objective measurement methods, and little work has been done on other measurement models. On the other hand, various objective measurement methods for telephony speech have been studied for application to IP telephony networks, including full-reference and non-reference methods, packet layer models and parametric models. Therefore we will first introduce some objective measurement methods for telephony speech quality, taking IP telephony as an example⁴⁾.

Table 1: Technical classification of objective sound quality measurement methods
Model type	Sub-type	Input signal	Main purposes
Media layer models	Full-reference	Original sound, processed sound (signal from device under test)	Ascertaining performance of equipment, etc.; optimizing system parameters
	Non-reference	Processed sound (signal from device under test)	In-service quality management
	Reduced-reference	Processed sound (signal from device under test), features of original sound	In-service quality management
Packet layer models		Packet header information (RTP etc.)	In-service quality management
Parametric models		Quality design & management parameters	Network quality design; in-service quality management
Bitstream layer models		Coded bitstream (before decoding)	In-service quality management
Hybrid models		Combination of the above	In-service quality management

3. Objective measurement methods for telephone-quality audio

As a full-reference media layer model, the ITU-T (International Telecommunication Union – Telecommunication Standardization Sector) has recommended the PESQ (Perceptual Evaluation of Speech Quality) measurement method⁵⁾. This standard is based on the PSQM (Perceptual Speech Quality Measure) measurement method, which estimates perceptual speech quality from a temporally continuous distortion component⁶⁾, with improvements that allow it to also handle intermittent distortion effects such as packet loss in IP telephony. The basic concept of the PSQM method is that, like the PEAQ method for audio signals described below, it estimates the perceptual quality of the decoded signals using an auditory perception model. Specifically, it measures the frequency spectrum in each critical band for both the original speech signal and the speech signal processed by the IP telephony system, and after normalizing the loudness, it estimates the perceptual speech quality of the IP telephony signal from the difference (distortion) between these two spectra. However, the auditory model used in the PSQM method is optimized for speech, and does not match auditory models obtained from auditory perception experiments. Because this full-reference method uses the original speech signal as an input, it can estimate how the perceptual speech quality depends on the speech source and is suitable for ascertaining the quality of equipment and services. In operational telephony systems, however, it is difficult to acquire the original speech signal, which makes this approach unsuitable for in-service quality management.

As a non-reference media layer model, the ITU-T has recommended the SEAM (Single-Ended Assessment Model) measurement method⁷⁾. In non-reference models, the original speech signal is not available for identifying where distortion occurs in the processed speech signal. The SEAM method therefore uses knowledge about the human voice production mechanism to extract distortion that is more likely to originate from coding artifacts or network issues than from the original human voice, and the perceptual speech quality is estimated based on this extracted distortion. Specifically, it works by extracting vocal tract information together with a feature of unnaturalness. It also extracts features that contribute to speech quality impairment, such as strong additive noise, interference, muting and temporal clipping. Intermediate-stage impairment quantities are extracted from these features, and the perceptual quality of the processed speech is estimated in combination with the additional signal characteristics.

Since the media layer model uses speech signals as its input, it is computationally expensive for the real-time monitoring of IP telephony networks. Therefore a packet layer model has been proposed for real-time speech quality management. ITU-T recommendation P.564⁸⁾ calculates distortion parameters such as network packet loss and packet delays from RTP and RTCP packets, which convey the header information of audio packets, and estimates the speech quality based on these parameters. However, the current recommendation P.564 does not specify any particular measurement method, and instead specifies the minimum criteria for objective speech quality assessment models that predict the impact of observed IP network impairments. Since this model estimates the quality only from the limited information in the packet headers, it is difficult for it to reflect how the speech quality depends on the speech source.
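As a rough illustration of one packet layer input parameter, the sketch below estimates the packet loss ratio from the 16-bit RTP sequence numbers of the received packets. It assumes in-order arrival without duplicates, and the function name `packet_loss_ratio` is hypothetical. It is not the P.564 procedure, which as noted above does not prescribe a specific method.

```python
def packet_loss_ratio(seq_numbers):
    """Estimate the fraction of lost packets from received RTP sequence numbers.

    Assumptions (illustrative only): packets arrive in order, there are no
    duplicates, and fewer than 65536 packets are lost in a row.
    """
    if not seq_numbers:
        raise ValueError("at least one received packet is required")
    expected = 1  # the first received packet
    for prev, cur in zip(seq_numbers, seq_numbers[1:]):
        # RTP sequence numbers are 16 bits wide and wrap from 65535 to 0.
        expected += (cur - prev) % 65536
    received = len(seq_numbers)
    return 1.0 - received / expected


# The sequence wraps around and packet number 1 is missing.
loss = packet_loss_ratio([65534, 65535, 0, 2, 3])
print(f"estimated packet loss: {loss:.1%}")
```

A practical monitor would also need to handle reordering, duplicates and delay variation, which this sketch deliberately ignores.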

The parametric model is used for the optimal design of IP telephony networks. In this model, the quality is broken down into a series of factors that affect the speech quality of IP telephony networks (signal attenuation and delay, circuit echo, etc.), and the overall quality is estimated from the evaluation scores for each factor, which are stored in advance in a database. One parametric model is the E-model⁹⁾ recommended by the ITU-T. The E-model takes input parameters originating from terminal factors, environment factors and network factors, and the output evaluation value R is obtained as a function of these parameters. In network design, the design parameters of the terminals and the telephone network are varied so as to optimize the network based on the R value. Accordingly, all the parameters to be evaluated in the telephone circuit network must be known in advance. This method is sometimes also used for in-service quality management.
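The way the E-model combines impairment factors into a single rating can be sketched as follows. The additive form R = Ro - Is - Id - Ie,eff + A and the conversion from R to a mean opinion score follow ITU-T Recommendation G.107 and are not given in this paper; the function names and the input values in the usage lines are made up for illustration.

```python
def e_model_rating(ro, i_s, i_d, ie_eff, a=0.0):
    """Transmission rating R = Ro - Is - Id - Ie,eff + A.

    ro:     basic signal-to-noise ratio term
    i_s:    impairments occurring simultaneously with the speech signal
    i_d:    impairments caused by delay
    ie_eff: effective equipment impairment (codec, packet loss)
    a:      advantage factor
    """
    return ro - i_s - i_d - ie_eff + a


def r_to_mos(r):
    """Map a transmission rating R to an estimated mean opinion score."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7.0e-6


# Hypothetical design point: moderate delay and a lossy low-bit-rate codec.
rating = e_model_rating(ro=94.8, i_s=1.4, i_d=5.0, ie_eff=11.0)
print(round(rating, 1), round(r_to_mos(rating), 2))
```

In a design loop, terminal and network parameters would be varied and the resulting R compared against a target value.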

In packet layer models, since only the packet header information is used to estimate the quality, it is impossible to estimate the perceptual quality depending on the speech source. On the other hand, in media layer models, the packets have to be fully decoded before estimating the speech quality, so the computational load is large. A bit-stream layer model estimates the quality by parsing the coded bit-stream in order to take into consideration the effects of the speech source. In hybrid models which combine these techniques, improvements can be expected in the applicable range of objective speech quality measurement methods and the accuracy of quality estimation.

4. Objective perceptual audio quality measurement methods, and their applications

In the perceptual audio coding techniques used for digital broadcasting and portable audio players, audio signals are compressed to low bit rates by exploiting auditory properties such as auditory masking. To accurately estimate the perceptual audio quality from the audio signal, the way in which the auditory model is applied is of great importance. In the field of auditory psychology, many researchers have studied and modeled various auditory properties such as loudness perception, frequency perception and masking. We therefore first present an overview of auditory mechanisms, auditory masking, critical bands and auditory filters. We also introduce binaural perception in relation to the directional localization of sound. After that, we discuss the PEAQ measurement method and some examples of its applications.

4.1 Auditory system and sound perception

4.1.1 Auditory mechanisms

The human auditory system can be anatomically divided into the peripheral stage, the intermediate stage, and the central stage in the cerebral cortex¹⁰⁾¹¹⁾. As shown in Fig. 1, the peripheral stage consists of the outer ear, which directs sound through the auditory canal to the eardrum; the middle ear, which acts as a mechanical transformer converting sound pressure into vibrations; and the inner ear, which transduces these mechanical vibrations into nerve impulses. The outer ear comprises the external pinna and the auditory canal. The pinna helps to collect sound and also contributes to determining the direction of sounds. Due to resonance in the auditory canal, the sound pressure level in front of the eardrum is raised by about 10 dB over the frequency range of 3-3.5 kHz. In the middle ear, the eardrum and auditory ossicles (hammer, anvil and stirrup) transmit the sound vibrations of the auditory canal efficiently into the lymph fluid of the inner ear. The middle ear also shapes the spectral components of the sound vibrations, and its transmission characteristic is best in the central frequency region of 500-4,000 Hz. The threshold of audibility corresponds well with this frequency response, so it is thought that the threshold of audibility depends on factors such as the transmission characteristics of the middle ear and the resonance of the auditory canal.

The inner ear functions as a mechanism for transforming vibrations to properly coded neural impulses. Vibrations at the base of the stirrup bone cause vibrations in the lymph fluid, which are transmitted to the cochlea where they produce traveling waves along the basilar membrane. The basilar membrane becomes wider towards the end of the cochlea, so that the maximum vibration occurs in different parts of the membrane at different frequencies. Approximately 15,000 hair cells are arranged on the surface of the basilar membrane. These are the peripheral ends of the auditory nerves. There are two types of hair cells — inner hair cells and outer hair cells. A single inner hair cell is connected to about 20 auditory nerves via synapses (nerve cell junctions). The vast majority of nerve fibers that transmit information from the cochlea to the auditory center are connected to these inner hair cells. On the other hand, the outer hair cells are connected to the basilar membrane via about 6 auditory nerves and synapses, and are thought to be related to the high sensitivity and acute resonance characteristics of the cochlea.

The structure and function of the auditory periphery as described above have been extensively studied, but much about the intermediate stage and the central stage in the cerebral cortex is still unknown. Figure 2 shows a simplified representation of the afferent^*8 pathways from the cochlea through the auditory nerve system to the auditory center of the cerebral cortex. The afferent pathway is the route from the cochlea to the cerebral cortex via the cochlear nucleus, superior olivary nucleus, inferior colliculus and medial geniculate body. Nerve fibers emanating from the cochlea enter the cochlear nucleus. As they approach the auditory center, the nerve fiber bundles follow an increasing number of paths, with some crossing to the opposite side and others remaining on the same side. Also, some fibers connect to a single higher-level neuron, while others bypass the next neuron and connect directly to neurons at even higher levels.

Figure 1: Schematic illustration of the auditory anatomy

Figure 2: Afferent nerve paths to the auditory cortex of the cerebral cortex

4.1.2 Auditory perception¹²⁾¹³⁾

(a) Critical bands and auditory filters
Auditory masking is the perceptual phenomenon in which a listener, in the presence of one stimulus, cannot respond to another, usually lower-level, stimulus. In some cases the lower-level sounds become completely inaudible, while in others they become more difficult to hear. The latter case is called partial masking. Masking is defined as the phenomenon whereby the threshold of audibility of a sound is raised in the presence of other sounds, or as the amount of this increase, and the threshold of audibility produced by masking is called the masking threshold. The sound that causes the masking is called the masker, and the masked sound is called the maskee. The maskee is sometimes simply referred to as the signal.

Let us assume that a pure tone A of frequency ƒa is presented together with band noise B of center frequency ƒa and bandwidth Δƒ. If the level at which signal A becomes just audible is determined for various bandwidths Δƒ while the power density of masker B is kept constant, then as shown in Fig. 3, the masking threshold of signal A rises with Δƒ up to a certain bandwidth and is more or less constant above it. This phenomenon can be well explained if we assume that when humans detect a pure tone A in the middle of band noise B, they use a bandpass filter centered on the frequency of the pure tone A, so that the threshold is determined by the amount of noise passing through the filter. This bandpass filter function of the auditory system is called auditory filtering, and the filter band is called the critical band.
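The band-widening result can be reproduced with an idealized power-spectrum model, sketched below under the assumption of a rectangular auditory filter: the threshold tracks the noise power that falls inside the critical band. The function name, the detection offset `k_db` and the 160 Hz critical bandwidth used in the example are illustrative assumptions, not values from this paper.

```python
import math


def masked_threshold_db(noise_density_db, bandwidth_hz, critical_bw_hz, k_db=0.0):
    """Masked threshold of a tone centered in band noise (idealized model).

    noise_density_db: noise level per 1 Hz of bandwidth, in dB.
    Only noise inside the critical band contributes to masking, so the
    effective bandwidth is capped at critical_bw_hz.
    """
    effective_bw = min(bandwidth_hz, critical_bw_hz)
    noise_power_db = noise_density_db + 10.0 * math.log10(effective_bw)
    return noise_power_db + k_db


# Threshold versus noise bandwidth at a fixed noise density.
for bw in (20, 40, 80, 160, 320, 640):
    print(bw, round(masked_threshold_db(40.0, bw, critical_bw_hz=160.0), 1))
```

Below the critical bandwidth the threshold grows by about 3 dB per doubling of the noise bandwidth; beyond it the printed values stop changing, which is the flat region of Fig. 3.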

(b) Masking and nerve stimulation patterns
The masking pattern is the result of determining the masking threshold of signal A while moving the frequency of signal A and keeping masker B fixed. Figure 4 shows the masking pattern of a tonal stimulus in the presence of narrow-band noise centered at 1 kHz at various noise levels. Figure 5 shows the same masking pattern, but with the horizontal axis normalized to the Bark scale, which numbers the critical bands starting from the lowest frequency. The spread of the masking pattern is asymmetric, with a much shallower slope on the high-frequency side than on the low-frequency side. As the masker level increases, the slope on the high-frequency side of the masking pattern becomes flatter. In other words, these results show that as the masker level increases, signals at frequencies above the masker are more likely to be masked.
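The asymmetry and level dependence of the masking pattern can be illustrated with a two-slope (triangular) approximation on the Bark scale. The slopes used below, a fixed 27 dB/Bark on the low-frequency side and 24 + 230/f - 0.2 L dB/Bark on the high-frequency side, are a commonly cited approximation attributed to Terhardt and are an assumption here, not data read from Figs. 4 and 5; the function name is hypothetical.

```python
def masking_spread_db(delta_bark, masker_level_db, masker_freq_hz):
    """Level of the masking pattern relative to its peak, in dB (<= 0).

    delta_bark: distance from the masker on the Bark scale
                (negative = below the masker, positive = above it).
    The approximation is only meaningful for moderate masker levels,
    where the upper slope stays positive.
    """
    if delta_bark < 0:
        lower_slope = 27.0  # dB per Bark, independent of level
        return lower_slope * delta_bark
    # The upper slope becomes shallower as the masker level rises.
    upper_slope = 24.0 + 230.0 / masker_freq_hz - 0.2 * masker_level_db
    return -upper_slope * delta_bark


# Pattern of a 1 kHz masker, two Bark above and below, at several levels.
for level in (40, 60, 80):
    below = masking_spread_db(-2.0, level, 1000.0)
    above = masking_spread_db(2.0, level, 1000.0)
    print(level, round(below, 1), round(above, 1))
```

The value two Bark below the masker is the same at every level, while the value two Bark above it rises with the masker level, mirroring the flattening described above.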

Cases where the signal and masker are present simultaneously are called simultaneous masking, and cases where masking occurs without the signal and masker being present simultaneously are called temporal masking. Temporal masking includes post-masking (forward masking), where a signal is masked by a masker presented at an earlier time, and pre-masking (backward masking), where the signal is masked by a masker presented after it.

The activity pattern of auditory nerves caused by a particular sound is called the nerve excitation pattern. Simultaneous masking can be explained by a model in which signal A adds to the nerve excitation already caused by the masker, and the signal cannot be heard unless this increase in excitation can be detected. In this sense, the masking pattern is a psychophysical measure of how far the nerve excitation spreads when the masker stimulates the auditory system.

(c) Binaural perception
Binaural hearing makes it possible to perceive the direction of a sound source in space. It also makes it possible to perceive the localization blur of a sound. In general, if a sound source is positioned in front of a listener towards the right side, then the sounds reaching the right ear will be louder and arrive earlier than those reaching the left ear. It is therefore expected that differences in the level and timing of the sounds at the two ears are used as cues for directional perception, and this has been confirmed experimentally. For frequencies below approximately 1.5 kHz, the time difference is the principal cue for direction perception, while for higher frequencies the level difference is the principal cue. Furthermore, it has been suggested that sound localization blur is related to the inter-aural cross correlation^*9.
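The three binaural cues mentioned above can be computed from a pair of ear signals roughly as follows. This is a minimal pure-Python sketch: the function name, the sign conventions and the use of the maximum (rather than the maximum absolute value) of the normalized cross-correlation are assumptions made for illustration.

```python
import math


def interaural_cues(left, right, max_lag):
    """Return (ild_db, itd_samples, iacc) for two equal-length signals.

    ild_db:      level difference, positive when the left ear is louder.
    itd_samples: lag with the largest cross-correlation, positive when the
                 right-ear signal is delayed relative to the left.
    iacc:        maximum of the normalized cross-correlation over the lags.
    """
    n = len(left)
    energy_l = sum(x * x for x in left)
    energy_r = sum(x * x for x in right)
    ild_db = 10.0 * math.log10(energy_l / energy_r)
    norm = math.sqrt(energy_l * energy_r)
    best_lag, best_corr = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        corr = sum(left[i] * right[i + lag]
                   for i in range(n) if 0 <= i + lag < n) / norm
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return ild_db, best_lag, best_corr


# A short burst that reaches the right ear 3 samples later at half amplitude,
# as would happen for a source located towards the left.
burst = [0.0, 1.0, -0.5, 0.25, 0.0]
left = burst + [0.0] * 5
right = [0.0] * 3 + [0.5 * x for x in burst] + [0.0] * 2
print(interaural_cues(left, right, max_lag=5))
```

Dividing the lag by the sampling frequency gives the time difference in seconds; in practice the search range is usually limited to about one millisecond.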

Figure 3: Masking of a pure tone by band noise

Figure 4: Masking pattern of narrow-band noise

Figure 5: Masking pattern of narrow-band noise normalized by critical bands

4.2 PEAQ method¹⁴⁾¹⁵⁾

The PEAQ method is a full-reference objective measurement method which measures the perceptual quality of the audio signal from the device under test using two signals: the original audio signal and a signal from the device under test, such as a low-bit-rate coding system. The measurement concept of the PEAQ method is shown in Fig. 6. The Peripheral Ear Model in this figure transforms the incoming sound into a basilar membrane representation by modeling the outer, middle and inner ear, including the abovementioned masking, nerve excitation patterns, critical bands and auditory filters. On the other hand, since there are many unknown aspects regarding how to model the intermediate auditory stage and the auditory cortex, these are replaced with a cognitive model in the form of a neural network. To select the neural network structure and the model output values, a training process was performed so that the objectively measured values matched the subjective evaluation data, and the most suitable candidates were selected as the PEAQ standard method.

The PEAQ method specifies a basic version for real-time applications and an advanced version that can obtain maximum accuracy, although both versions have the same basic concept. The measurement method is summarized here.

(1) The original audio signal and the signal from the device under test are each input to the Peripheral Ear Model (an algorithm that models the auditory system), and their respective outputs are obtained. The following discussion assumes the Peripheral Ear Model employed in the basic version.
(2) From the output values of the Peripheral Ear Model for the original audio signal and the signal from the device under test, model output values (MOVs) expressing the perceptual salience of audible degradations are determined. The basic version has 11 MOVs, while the advanced version has 5.
(3) The MOVs are input to the neural network cognitive model, and the specified neural network coefficient values are used to determine the audio quality of the signal from the device under test.
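The final step, mapping the MOVs to a quality grade with a neural network, can be sketched as below. The single hidden layer with sigmoid activations, the choice of three hidden nodes, the 0 to -4 output range and every weight value are assumptions for illustration only; the coefficients actually specified in the recommendation are not reproduced here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def estimate_odg(movs, mov_min, mov_max, w_hidden, b_hidden, w_out, b_out,
                 odg_min=-4.0, odg_max=0.0):
    """One-hidden-layer network mapping MOVs to an objective difference grade."""
    # Scale every MOV to roughly [0, 1] using per-MOV limits.
    scaled = [(m - lo) / (hi - lo) for m, lo, hi in zip(movs, mov_min, mov_max)]
    # Hidden layer: weighted sum of the scaled MOVs followed by a sigmoid.
    hidden = [sigmoid(b + sum(w * s for w, s in zip(weights, scaled)))
              for weights, b in zip(w_hidden, b_hidden)]
    # Output layer: a distortion index, squashed into the grade range.
    distortion_index = b_out + sum(w * h for w, h in zip(w_out, hidden))
    return odg_min + (odg_max - odg_min) * sigmoid(distortion_index)


# Hypothetical configuration with 11 MOVs, as in the basic version.
n_movs, n_hidden = 11, 3
odg = estimate_odg(
    movs=[0.5] * n_movs,
    mov_min=[0.0] * n_movs, mov_max=[1.0] * n_movs,
    w_hidden=[[0.1] * n_movs for _ in range(n_hidden)],
    b_hidden=[0.0] * n_hidden,
    w_out=[-1.0] * n_hidden, b_out=0.5,
)
print(round(odg, 2))
```

Because the output sigmoid is bounded, the estimated grade always stays inside the chosen range regardless of the MOV values.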

The basic version of the auditory model performs calculations using a 2,048-point FFT. This Peripheral Ear Model is shown in Fig. 7. The functions of this model are summarized below:

(1) An FFT computation is used to transform the audio signal into frequency components spaced 23.4 Hz apart (assuming a 48 kHz sampling frequency), which are weighted by multiplying them by the frequency responses of the outer and middle ear.
(2) The frequency components are grouped into bands one quarter of a critical band wide, corresponding to the frequency analysis functions of the inner ear.
(3) Physiological noise, such as that caused by blood flow, is added to the frequency components.
(4) Nerve excitation patterns are calculated by considering the spread of masking in time caused by a preceding masker and the spread of simultaneous masking along the frequency axis. These results constitute the auditory model outputs.
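The arithmetic behind the first two steps above can be sketched as follows. The 48 kHz sampling frequency and 2,048-point FFT come from the text; the frequency-to-Bark mapping z = 7 asinh(f/650) and the 80 Hz to 18 kHz analysis range are assumptions recalled from descriptions of the basic version and should be checked against the recommendation. The function names are hypothetical.

```python
import math

SAMPLE_RATE_HZ = 48_000
FFT_SIZE = 2_048


def hz_to_bark(freq_hz):
    """Frequency-to-Bark mapping z = 7 * asinh(f / 650) (assumed here)."""
    return 7.0 * math.asinh(freq_hz / 650.0)


def quarter_bark_band(freq_hz, f_low_hz=80.0):
    """Index of the quarter-Bark band containing freq_hz (freq_hz >= f_low_hz)."""
    return int((hz_to_bark(freq_hz) - hz_to_bark(f_low_hz)) / 0.25)


# Spacing of the FFT frequency components.
bin_spacing_hz = SAMPLE_RATE_HZ / FFT_SIZE
# Number of quarter-Bark bands needed to cover the assumed analysis range.
n_bands = math.ceil((hz_to_bark(18_000.0) - hz_to_bark(80.0)) / 0.25)
print(bin_spacing_hz, n_bands)

# Band index for the center of every FFT bin inside the analysis range.
bands = [quarter_bark_band(k * bin_spacing_hz)
         for k in range(FFT_SIZE // 2 + 1)
         if 80.0 <= k * bin_spacing_hz <= 18_000.0]
```

At low frequencies only one or two FFT bins fall into each band, whereas at high frequencies a single band collects dozens of bins, which reflects the widening of the critical bands.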

Figure 6: Principle of PEAQ measurement method

Figure 7: PEAQ Basic Version Peripheral Ear Model

4.3 Estimation accuracy and application examples of the PEAQ method

The PEAQ method has been standardized to estimate the perceptual audio quality of high-quality mono and stereophonic audio signals only, and its estimates correspond to the results of the subjective evaluation method intended for audio systems with small impairments¹⁶⁾. Figure 8(a) is a scatter plot showing the relationship between the SN ratio and the subjective difference grade (SDG) obtained by subjective evaluation, and Fig. 8(b) is a scatter plot showing the relationship between the objective difference grade (ODG) of the PEAQ method and the SDG. Each point in Fig. 8 represents the audio quality result for one of various low-bit-rate audio coding techniques such as MP3, AAC and ATRAC (as used by MiniDisc players)¹⁵⁾. The SN ratios are calculated by regarding the difference between the original audio and the coded audio as noise. As this figure shows, the SN ratio is not suitable for measuring the audio quality of low-bit-rate audio coders, while the PEAQ method produces quality estimates that agree well with the subjective evaluation results. However, there are still some deviations between the subjective and objective difference grades, so it is inappropriate to substitute the PEAQ method for subjective evaluation when selecting important systems such as international standard coding schemes.
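The SN ratio used for Fig. 8(a) can be computed along the following lines. This is a minimal sketch that assumes the two signals are already time-aligned and of equal length; the function name `sn_ratio_db` is hypothetical.

```python
import math


def sn_ratio_db(original, coded):
    """SN ratio in dB, treating (coded - original) as the noise signal."""
    signal_power = sum(x * x for x in original)
    noise_power = sum((c - x) ** 2 for x, c in zip(original, coded))
    if noise_power == 0.0:
        return math.inf  # identical signals: no measurable noise
    return 10.0 * math.log10(signal_power / noise_power)


# Toy example: every sample of the coded signal is off by 0.1.
original = [1.0, -1.0, 1.0, -1.0]
coded = [1.1, -0.9, 1.1, -0.9]
print(round(sn_ratio_db(original, coded), 1))
```

The measure weights every error sample equally, wherever it falls in frequency or time, which is why it cannot tell masked coding noise from audible noise.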

One application of the PEAQ measurement method is the designation of audio quality for memory audio in a standard of the Japan Electronics and Information Technology Industries Association (JEITA)¹⁷⁾¹⁸⁾. In memory audio equipment, a higher bit rate improves the audio quality but shortens the recording time. Conversely, increasing the recording time requires a lower bit rate. In order to accurately designate the performance of memory audio equipment for consumers, both the coding performance (audio quality) and the recording time should be reported together in the specifications. The PEAQ method is therefore used to indicate the coding performance of memory audio equipment. Since the perceptual audio quality varies depending on the audio material, the JEITA standard also specifies the audio materials that are to be used for audio quality designation, and prescribes the use of 8 types of audio source that are critical to encode, such as the SQAM^*10 castanet and triangle materials.
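The trade-off between bit rate and recording time is simple arithmetic, sketched below. The 1 GB capacity and the bit rates are hypothetical examples, and file-format overhead is ignored.

```python
def recording_time_hours(capacity_bytes, bit_rate_bps):
    """Recording time in hours for a given memory capacity and coding bit rate."""
    return capacity_bytes * 8 / bit_rate_bps / 3600.0


# Doubling the bit rate halves the recording time for the same memory.
for kbps in (64, 128, 256):
    print(kbps, round(recording_time_hours(1_000_000_000, kbps * 1000), 1))
```

Reporting a quality grade next to each recording time lets a consumer see what is given up when a longer recording mode is selected.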

Figure 8: (a) SN ratio and subjective difference grade (SDG), and (b) PEAQ measured value (ODG: Objective Difference Grade) and subjective difference grade (SDG)

5. New prospects of objective audio quality measurement methods

As mentioned above, when the PEAQ method was selected, the training process was performed using the subjective evaluation results of high-quality mono and stereophonic audio signals only, and the method that most closely matched the subjective evaluation results was chosen as the standard. In general, the reliable coverage of the PEAQ method is limited to the range of the subjective evaluation data used for training, and for other types of audio signals the estimation accuracy cannot be guaranteed. In other words, multi-channel audio signals and audio signals of AM or FM broadcasting quality were not included in the training process and are thus outside the coverage of the PEAQ measurement method, and in fact it has been pointed out that the method does not estimate quality accurately for these types of audio signals.

With the spread of digital broadcasting and DVDs, many consumers are now enjoying 5.1 multi-channel stereophonic audio. Lower-bit-rate audio coding, such as that used in the 1 seg terrestrial digital broadcasting service for mobile phones, is also popular. Therefore, the ITU-R is working on extending the PEAQ method to accommodate multi-channel stereo signals. It is also tackling the issue of objective measurement methods applicable to AM or FM broadcasting audio quality. In this section we introduce the state of progress in objective audio quality measurement methods.

5.1 Extension to multi-channel audio

Subjective evaluation attributes for multi-channel stereo audio include the basic audio quality^*11, the front image quality and the impression of surround quality. The front image quality is related to the localization of the frontal sound sources, and the impression of surround quality is related to spatial impression, ambience, or special directional surround effects. As an objective measurement method for multi-channel sound, a method has been proposed in which the timbral quality, the front image quality and the surround quality are first obtained, after which the basic audio quality is estimated from a linear combination of these attributes¹⁹⁾. As mentioned above, humans use inter-aural level differences and inter-aural time differences as cues for the perception of sound direction. Therefore in the proposed scheme, a total of 22 features relating to sound quality are extracted, including spectral features and features of the inter-aural correlation coefficient in the front and 6 other directions ranging from 30 to 180°. From these extracted features, several features that are most strongly correlated with the front image quality are selected, and the front image quality is estimated from them. The timbral quality and the impression of surround quality are estimated in a similar way. Finally, the basic audio quality is estimated from the linear combination of these three attributes. According to the report, the training process was performed using a database of low-pass filtered audio signals and two-channel down-mixed audio signals from 5-channel stereo, for which it was possible to accurately estimate the perceptual audio quality. In the future, the applicability of this technique should be validated using perceptually coded multi-channel audio signals.

As another approach, methods have been proposed within the ITU-R for directly estimating the basic audio quality of multi-channel stereo²⁰⁾. The proposed scheme first convolves each channel of the multi-channel signal with the head-related transfer functions from the corresponding loudspeaker to each ear, and thereby calculates the binaural signals for the left and right ears. The inter-aural level differences, inter-aural time differences and inter-aural cross-correlation coefficients are calculated for both the original audio signal and the signal from the device under test, and the degradation of spatial image quality, such as the sense of direction and the sense of surround, is determined from the differences between the two. The degradation in spatial image quality and the audio quality degradation of the individual channels determined by the PEAQ method are then combined to estimate the basic audio quality of multi-channel stereo. This objective measurement method was verified using the MPEG Surround compression coding scheme, which was standardized for multi-channel stereo by MPEG in July 2006, and it was reported that a higher correlation was obtained than with the conventional PEAQ method. Another method based on a similar concept has also been proposed²¹⁾. Instead of using head-related transfer functions from the loudspeakers to each ear, it calculates the inter-channel level differences and the inter-channel time differences from combinations of the 5-channel audio signals themselves. From these it determines a cue coefficient for the degradation in spatial image quality, which is combined with the sound quality degradation obtained from the PEAQ method or a similar method to estimate the basic audio quality.
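
The binaural processing chain described above (convolution with head-related impulse responses, followed by extraction of inter-aural cues) can be sketched as follows. This is a simplified broadband illustration under stated assumptions, not the proposed ITU-R algorithm: the impulse responses are toy values, the cues are computed over the whole signal rather than per auditory band and time frame, and the sign convention for the time difference is chosen here for convenience.

```python
import math


def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def mix(parts):
    """Sum sequences of different lengths, zero-padding the shorter ones."""
    length = max(len(p) for p in parts)
    return [sum((p[n] for p in parts if n < len(p)), 0.0) for n in range(length)]


def binaural_signals(channels, hrirs):
    """channels: one signal per loudspeaker.
    hrirs: one (left_ear_ir, right_ear_ir) pair per loudspeaker."""
    left = mix([convolve(ch, h[0]) for ch, h in zip(channels, hrirs)])
    right = mix([convolve(ch, h[1]) for ch, h in zip(channels, hrirs)])
    length = max(len(left), len(right))
    left += [0.0] * (length - len(left))
    right += [0.0] * (length - len(right))
    return left, right


def ild_db(left, right):
    """Inter-aural level difference in dB (positive: left ear is louder)."""
    e_left = sum(x * x for x in left)
    e_right = sum(x * x for x in right)
    return 10.0 * math.log10(e_left / e_right)


def iacc_and_itd(left, right, fs, max_lag):
    """Return (IACC, ITD in seconds). IACC is the maximum of the normalized
    cross-correlation over lags in [-max_lag, max_lag]; a positive ITD means
    the right-ear signal lags the left-ear signal."""
    norm = math.sqrt(sum(x * x for x in left) * sum(x * x for x in right))
    if norm == 0.0:
        raise ValueError("both ear signals must have non-zero energy")
    best_corr, best_lag = -math.inf, 0
    for lag in range(-max_lag, max_lag + 1):
        corr = sum(
            left[n] * right[n + lag]
            for n in range(len(left))
            if 0 <= n + lag < len(right)
        ) / norm
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return best_corr, best_lag / fs


def cue_degradation(ref_lr, test_lr, fs, max_lag):
    """Absolute cue differences between reference and test binaural signals."""
    ref_iacc, ref_itd = iacc_and_itd(*ref_lr, fs, max_lag)
    tst_iacc, tst_itd = iacc_and_itd(*test_lr, fs, max_lag)
    return {
        "ild": abs(ild_db(*ref_lr) - ild_db(*test_lr)),
        "itd": abs(ref_itd - tst_itd),
        "iacc": abs(ref_iacc - tst_iacc),
    }


# Toy example: one loudspeaker whose sound reaches the right ear
# two samples later and at half the amplitude.
left, right = binaural_signals([[1.0, 0.0, 0.0]], [([1.0], [0.0, 0.0, 0.5])])
iacc, itd = iacc_and_itd(left, right, fs=48000, max_lag=3)
```

The lag search range would normally be limited to roughly the largest physically possible inter-aural delay; the value of 3 samples above is only for the toy signal.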

However, it should be noted that these results for multi-channel stereo audio were verified using data whose quality lies in the range of AM or FM broadcasting quality.

5.2 Applicability with AM or FM audio quality

An objective measurement method has been proposed for estimating intermediate audio quality, such as AM or FM audio quality²²⁾. The method extracts 12 features: the 11 MOVs of the basic version of the PEAQ method plus a feature called the energy-equivalent threshold of the 2-4 kHz band, which is effective for highly impaired signals. It estimates perceptual audio quality through a linear combination of all 12 MOVs or of a subset of them. In a verification against the results of a subjective evaluation method suited to AM or FM audio quality, the objective measurements were reported to correlate with the subjective results.
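
How the weights of such a linear combination are obtained is not detailed here. As a minimal sketch, assuming they are fitted by ordinary least squares against subjective scores, the fit can be written as below. The training values are synthetic and use two features instead of 12 to keep the example short; the function names are hypothetical.

```python
def fit_linear(feature_rows, targets):
    """Ordinary least squares with an intercept, solved through the normal
    equations. Returns [bias, w_1, ..., w_k]."""
    rows = [[1.0] + list(r) for r in feature_rows]
    m = len(rows[0])
    # Normal equations: (X^T X) w = X^T y
    a = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    b = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(m)]
    # Gaussian elimination with partial pivoting
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("features are linearly dependent")
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            factor = a[r][col] / a[col][col]
            for c in range(col, m):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution
    w = [0.0] * m
    for i in range(m - 1, -1, -1):
        rest = sum(a[i][j] * w[j] for j in range(i + 1, m))
        w[i] = (b[i] - rest) / a[i][i]
    return w


def predict(w, features):
    """Apply the fitted linear combination to one feature vector."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], features))


# Synthetic data generated from score = -1 + 0.5 * x1 - 2 * x2.
train_x = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
train_y = [-1.0, -0.5, -3.0, -2.5, -2.0]
weights = fit_linear(train_x, train_y)
estimate = predict(weights, [2, 0])
```

Using only a subset of the features, as the method allows, corresponds to dropping columns from the training rows before fitting.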

A study has also been made of the applicability of the original PEAQ method and a modified PEAQ method to the prediction of intermediate audio quality. The modifications were made to aspects such as the simultaneous masking of the auditory model and the model coefficients. In this study, which evaluated MPEG-2 AAC coded material at bit rates from 32 kbit/s to 80 kbit/s for stereo signals, it was reported that the results of both methods tend to differ between the audio source categories of "speech, speech + background noise" and "music", and that both methods performed well for "speech, speech + background noise" material²³⁾. In one-seg broadcasting, on the other hand, audio signals are compressed by AAC+SBR low-bit-rate audio coding. When the audio quality of AAC+SBR coding was measured by the PEAQ method, it was reported that some 64 kbit/s stereo audio signals that were rated "very good" in subjective evaluation were measured close to the worst value of -4.0, which shows that it is difficult to measure the audio quality of such signals with the PEAQ method¹⁸⁾. Further developments are expected in objective measurement methods that remain suitable for coding methods such as AAC+SBR.

5.3 Improvement of audio quality measurement precision

A number of new full-reference objective measurement methods have been proposed with the aim of improving on the performance of the PEAQ measurement method. One of these proposals uses an auditory model similar to that of the PEAQ method, but for the cognitive model it uses a self-organizing map^*12 instead of a hierarchical neural network. The self-organizing map is combined with new model output values suited to it. It has been reported that this method performs better than the PEAQ method. In another proposal, the features are calculated using an auditory model that more accurately simulates the human peripheral auditory system, and the audio quality is estimated from the 5% correlation value^*13 of the time-series data or from the cross-correlation coefficient of the features of the original and processed audio signals²⁵⁾. This method has also been reported to improve performance.
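
For readers unfamiliar with self-organizing maps, the following is a minimal generic sketch of a one-dimensional map trained on feature vectors. It does not reproduce the cited proposal: the map size, the deterministic initialization, the decay schedules, and the two-dimensional toy feature vectors are all assumptions made for illustration.

```python
import math


def best_matching_unit(weights, x):
    """Index of the map node whose weight vector is closest to x."""
    dists = [sum((w - v) ** 2 for w, v in zip(node, x)) for node in weights]
    return dists.index(min(dists))


def train_som(data, n_nodes, epochs, lr0=0.5, sigma0=1.0):
    """Train a 1-D self-organizing map (n_nodes must be at least 2).
    Nodes start evenly spaced along the diagonal of the unit cube so that
    the result is deterministic."""
    dim = len(data[0])
    weights = [[i / (n_nodes - 1)] * dim for i in range(n_nodes)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data:
            progress = step / total_steps
            lr = lr0 * (1.0 - progress)                     # decaying learning rate
            sigma = max(sigma0 * (1.0 - progress), 0.1)     # shrinking neighbourhood
            bmu = best_matching_unit(weights, x)
            for i, node in enumerate(weights):
                # Gaussian neighbourhood centred on the best matching unit
                h = math.exp(-((i - bmu) ** 2) / (2.0 * sigma ** 2))
                for d in range(dim):
                    node[d] += lr * h * (x[d] - node[d])
            step += 1
    return weights


# Toy feature vectors forming two groups (e.g. mildly and strongly impaired items).
training_data = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
som = train_som(training_data, n_nodes=4, epochs=50)
node_mild = best_matching_unit(som, [0.15, 0.15])
node_strong = best_matching_unit(som, [0.85, 0.85])
```

After training, each input is represented by its best matching node, and a quality grade can then be associated with each node; how the cited method makes that association is not described here.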

6. Conclusion

We have described objective measurement methods for quantitatively estimating the perceptual audio quality of sounds based on the physical features of the audio signals. First, we classified objective sound quality measurement methods and introduced some objective measurement methods for telephony speech quality, taking IP telephony as an example. Next, we described the auditory mechanisms and perception processes that form the basis of objective measurement methods for the perceptual quality of audio signals, and we introduced the full-reference PEAQ method, which can objectively estimate the sound quality of high-quality coded audio signals, together with some applications of this algorithm. The PEAQ method makes it possible to obtain objective sound quality measurements that correlate with perceptual audio quality, which could not be measured with the traditional SN ratio. However, a certain deviation remains between the subjective difference grades and the objective difference grades. When selecting important systems such as an international standard audio coding scheme, it is therefore ultimately preferable to base judgments on subjective evaluation and to use objective measurement methods as additional information. Furthermore, broadcasting services such as 5.1-channel stereo and one-seg services are becoming widespread. At present, the ITU-R is studying how to extend the PEAQ method from mono and stereo signals to multi-channel stereo signals. We have also raised some issues regarding objective measurement methods for intermediate audio quality, which covers AM to FM audio quality. From the viewpoint of quality management and the accurate provision of information to audiences, it is expected that objective perceptual audio quality measurement methods for these types of audio signals will soon be standardized.
