NHK Laboratories Note No. 475

Application of Speech Rate Conversion Technology to Video Editing
-Allows up to 5 times normal speed playback while maintaining speech intelligibility-


by
Atsushi Imai, Nobumasa Seiyama, Takeshi Mishima, Tohru Takagi and Eiichi Miyasaka
Human Science Research Division
Abstract

    This paper describes an application of speech rate conversion technology to video editing. In video editing, it is common to search through the material at several times normal speed. The speech rate conversion technology maintains the original pitch and timbre of speech while playing it back at a faster rate, which is varied adaptively to permit fast listening in real time. In listening tests, users were able to comprehend speech played at up to 5 times normal speed, which was incomprehensible without the adaptive rate conversion (even when pitch-shifted to restore the original pitch). A prototype that maintains the intelligibility of speech during variable-rate playback has been applied to a non-linear editing system. The system, which can change the replay speed of MPEG-1 or MPEG-2 audio and video simultaneously, has been implemented on a personal computer.
1. Introduction
    Editing operators frequently search source material on the basis of a speech soundtrack, such as when editing parliamentary debates or street interviews. In such cases, many in the broadcasting field have wondered whether "rapid playback" might be possible.
    Conventional fast-forward playback makes the speech soundtrack incomprehensible, as a result of either a pitch shift corresponding to the replay rate or skipping. However, applying speech rate conversion technology to a non-linear editing system enables speech material played at up to 5 times normal speed to be understood. The rate conversion system maintains intelligibility by outputting speech at a rate that is varied adaptively while maintaining the original pitch and timbre.
    The device can also replay at a slow rate, leading to new types of special effect.

2. Characteristics
    Speech rate conversion technology [1], which converts input speech adaptively to a desired average output rate while maintaining the original pitch and timbre, has been applied to a prototype editing system. The rate conversion system has the following characteristics:
  1. Adaptive speech rate control takes into consideration the actual characteristics of each utterance (e.g. changes in pitch or power), enabling the speech topic to be understood even at high speed. In addition, superfluous pauses are eliminated from the speech signal so that time delay does not accumulate.

  2. Replays speech synchronized with video at 1/3 to 5 times normal speed, boosting intelligibility or (at slow speeds) leading to new special effects for program production. Presently, the system is in practical use in the NHK-TV language course programs "Chinese Conversation" and "French Conversation" in which it slows the conversations of native speakers by approximately half for hearing lessons.
    
3. Algorithm
    The adaptive speech rate control method was designed on the basis of the following three factors:
  1. The key to comprehensibility is that at least the beginning of each utterance be caught correctly by the listener [2];

  2. Portions where the pitch is elevated relative to neighboring portions carry particular significance, especially in Japanese;

  3. Portions where both the power and the pitch are lower, especially near the ends of sentences, might be less important for comprehension.

    The rate control system works by expanding or contracting the speech waveform through the insertion or deletion of pitch periods [1]. In silent or unvoiced portions, a pseudo-pitch is extracted from the peak of the calculated autocorrelation function, and sections of the corresponding length are repeated in the same way as in voiced portions [4].
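    The operations described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the assumed pitch search range (60 to 400 Hz) and the plain splicing without any smoothing at the joins are all assumptions made for the example.

```python
import numpy as np


def pseudo_pitch_period(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Lag (in samples) of the autocorrelation peak inside an assumed pitch range."""
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()
    # Autocorrelation for non-negative lags only.
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(x) - 1)
    return lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))


def insert_period(signal, start, period):
    """Expand the waveform by repeating signal[start:start+period] once."""
    signal = np.asarray(signal)
    section = signal[start:start + period]
    return np.concatenate([signal[:start + period], section, signal[start + period:]])


def delete_period(signal, start, period):
    """Contract the waveform by removing signal[start:start+period]."""
    signal = np.asarray(signal)
    return np.concatenate([signal[:start], signal[start + period:]])
```

    Because whole periods are inserted or removed, the local periodicity of the waveform (and hence the perceived pitch) is left unchanged while its duration changes.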
    The principle of the speech rate conversion method is shown in Figure 1.
    In this paper, speech rate control using the proposed adaptive speech rate function is called "adaptive S.R.C.", while control using a linear rate function that changes the speech rate uniformly is called "linear S.R.C."
    

3.1 Adaptive S.R.C. function
    The basic function R(t) gives the boosting rate by which rp, the reciprocal of the required replay rate, is multiplied.


(1)


where rs (>1.0) is the initial boosting rate at the beginning of a sentence (t=0), and re (=1.0) is the rate after T [s] (T=2.5).
    R(t) changes from rs to re continuously inside a breath group. Because the duration of an utterance cannot be predicted, we used a fixed time interval T=2500 ms based on the typical length of utterances in Japanese news speech.
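    As a rough illustration of how such a function behaves, the sketch below ramps from rs to re over T seconds. The linear shape of the ramp is purely an assumption made for this example, since the text only states that the change is continuous; the function names and the value rs=1.5 used in the usage note are likewise hypothetical.

```python
def boosting_rate(t, rs, re=1.0, T=2.5):
    """Boosting rate at time t [s] after the start of an utterance.

    Assumption: a linear ramp from rs to re over T seconds.
    """
    if t <= 0.0:
        return rs
    if t >= T:
        return re
    return rs + (re - rs) * (t / T)


def local_scale(t, rp, rs, re=1.0, T=2.5):
    """Local time-scale factor: rp (reciprocal of the replay rate) times R(t)."""
    return rp * boosting_rate(t, rs, re, T)
```

    For 5 times normal speed, rp = 0.2. With a hypothetical rs of 1.5, local_scale(0.0, 0.2, 1.5) gives 0.3 at the start of an utterance and falls to 0.2 once T seconds have elapsed, so the beginning of each utterance is compressed less than the remainder.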


Figure 1. Principle of the speech rate conversion method.


    When replaying at 1/rp times normal speed, pitch periods are inserted or deleted so that the length of the waveform l(n) [s] from the beginning of the utterance to the k-th pitch period is as follows.


(2)


Here, pl(k) is the length of the k-th pitch period.
The beginning of each utterance is detected on the basis of the duration of the preceding pause in the speech. The threshold is set at 200ms; the start of speech following a pause of at least this duration is taken to be the beginning of the utterance.
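The pause-based detection rule can be sketched as follows. The frame-wise speech/non-speech input, the 10 ms frame length and the function name are assumptions for the example; only the 200 ms threshold comes from the text. The sketch also treats the start of the recording as if it followed a long pause.

```python
def utterance_starts(is_speech, frame_ms=10, min_pause_ms=200):
    """Frame indices where speech begins after a pause of at least min_pause_ms.

    is_speech: sequence of booleans, one per frame (True = speech present).
    """
    min_pause_frames = -(-min_pause_ms // frame_ms)  # ceiling division
    starts = []
    pause_len = min_pause_frames  # assumption: recording start counts as a long pause
    for i, speech in enumerate(is_speech):
        if speech:
            if pause_len >= min_pause_frames:
                starts.append(i)
            pause_len = 0
        else:
            pause_len += 1
    return starts
```

With 10 ms frames, a 100 ms gap inside a sentence does not restart the utterance, whereas a 250 ms gap does.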


3.2 Adaptation
     R(t) realizes increased expansion, for a "slower impression", at the beginning of each utterance. Additionally, a partial slowdown at emphasized portions of the speech is implemented.
    A relation between movement of the fundamental frequency F0 and emphasized portions of speech has been described [3]. Therefore, we attempted to increase the degree of expansion in portions where F0 increases abruptly, as detected by comparing a current estimate of F0 with a moving average of the estimated F0 taken over a longer interval.
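    The comparison of the current F0 estimate with a longer-term moving average can be sketched as below. The window length (50 frames), the ratio threshold (1.2) and the convention that unvoiced frames carry F0 = 0 are assumptions chosen for illustration; the text does not give the values actually used.

```python
import numpy as np


def emphasized_frames(f0, window=50, ratio=1.2):
    """Flag voiced frames whose F0 exceeds `ratio` times the moving average
    of the preceding voiced frames (at most `window` frames back)."""
    f0 = np.asarray(f0, dtype=float)
    flags = np.zeros(len(f0), dtype=bool)
    for i, value in enumerate(f0):
        if value <= 0:  # unvoiced frame (assumed to be coded as 0)
            continue
        history = f0[max(0, i - window):i]
        history = history[history > 0]
        if history.size == 0:
            continue
        flags[i] = value > ratio * history.mean()
    return flags
```

    Frames flagged in this way would then be candidates for the additional expansion described next.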
    The adaptive S.R.C. applies an expansion rate R'(t) up to 10% higher than the default rate R(t) in emphasized portions (at time t0) for several pitch periods in [s] after t0.


(3)


The short-time power is used to judge when it is permissible to curtail a portion of speech in the event that time delay has been accumulated. Sections of speech with both low frequency and low power are considered to contribute less to the significance of the speech, especially near the ends of sentences. We attempted to curtail such sections even though they were not wholly silent so as not to accumulate time delay.


Figure 2. Example of 5 times normal speed conversion


Figure 3. Principle of the adaptive speech rate conversion

    An example is shown in Figure 2. Both waveforms were converted at 5 times normal speed. The upper waveform was converted by the linear S.R.C., and the lower by the adaptive S.R.C. It is clear that they have the same duration overall.
    An example of how the adaptive S.R.C. function varies over time is shown in Figure 3.


4. Listening Comprehension Test
    In this section, we compare the comprehensibility of speech converted at 3 to 5 times normal speed by the linear and the adaptive S.R.C.


4.1 Listening material
    "The Japanese Language Proficiency Test", an official language test, was used. In pre-testing of ordinary Japanese adults, most of them scored perfect. The test has two types of questions:


(1) A skit with 4 spoken choices (question Type 1)
(2) A skit with 4 spoken choices and 4 prepared pictures for selection (question Type 2)


Table 1. Speech rate conversion patterns used in the experiments.

Linear S.R.C.      Adaptive S.R.C.
3 times linear     3 times adaptive
4 times linear     4 times adaptive
5 times linear     5 times adaptive


4.2 Experimental speech
    The skits were converted by the various S.R.C. functions; the conversion patterns are shown in Table 1. Each skit was reproduced in all 6 patterns from the table.


4.3 Subjects
    Six Japanese adults in their 20s and 30s took part in the experiment.


4.4 Experimental procedure
    The experiment was carried out in a soundproof room, separately for each subject. The experimental materials were presented through a pair of loudspeakers (DS-A3).
Each skit was presented following a start signal.
The reply time after each question was 6 seconds.
Each skit was presented to each subject in a different conversion pattern, so that the content of individual skits would not bias the experimental results.
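One way to arrange such counterbalancing is a cyclic Latin square, sketched below. This is an assumption for illustration only; the actual assignment used in the experiment is not stated, and the pattern labels and function name are hypothetical.

```python
PATTERNS = [
    "3x linear", "4x linear", "5x linear",
    "3x adaptive", "4x adaptive", "5x adaptive",
]


def assign_patterns(n_subjects, n_skits, patterns=PATTERNS):
    """Return table[subject][skit] = conversion pattern, rotated cyclically
    so that each skit is heard in a different pattern by each subject."""
    n = len(patterns)
    return [
        [patterns[(subject + skit) % n] for skit in range(n_skits)]
        for subject in range(n_subjects)
    ]
```

With 6 subjects and 6 skits, every skit is heard once in each pattern and every subject hears each pattern once.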



Figure 4. Results of listening comprehension test.
Left: question type 1 (speech only).
Right: question type 2 (speech with picture).



4.5 Results
    The results are shown in Figure 4. Each bar in the figure is the average score for one pattern of Table 1, because the results differed little from one subject to another. The results also differed little between the two question types at up to 4 times normal speed.
    Speech converted by the adaptive method scored consistently higher at all 3 rates by comparison with the linear method, despite the fact that the duration of the speech was the same irrespective of method. The superiority of the adaptive method is still more marked at 4 times normal speed and above.


5. Discussion
    It is clear that the proposed adaptive speech rate conversion method is effective in boosting the comprehensibility of speech played at several times normal speed, for example when searching through broadcasting pre-edit video.
    The 'Type 2' experiments may be relevant to the problem of content retrieval with some incidental information such as a search category or target images. The results at 5 times normal speed suggest a possibility for practical applications such as searching video archives. Broadcasting stations typically maintain large volumes of video and audio material, from archives to current material, which are continuously searched and/or edited. According to broadcast editors, the steps of watching and listening for search, selection or confirmation are unavoidable, because the final judgement is made with the editor's own eyes and ears. The ability to speed up the search process would therefore be of great value.


6. Non-Linear Editing System
    A prototype that maintains the intelligibility of speech during variable-rate playback has been applied to a non-linear editing system. The system, which can change the replay speed of MPEG-1 or MPEG-2 audio and video simultaneously, has been implemented on a personal computer.

6.1 Features
    By combining the proposed S.R.C. method with the variable-rate playback function of MPEG, this equipment provides natural-quality sound at a rate synchronized to the changed rate of the picture.

Figure 5. Block diagram of the audio-video speech converter for MPEG data file.


6.2 System
    The block diagram of this system is shown in Figure 5.
    MPEG data can be taken either from a file or from the network. The variable-rate video is completely synchronized to the S.R.C. audio, because the video is composed according to the timing of the audio data synthesis.
    We checked the performance of this algorithm on a PC (Pentium III, 1 GHz). A prototype non-linear editing system is shown in Figure 6.
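    The idea of deriving the video timing from the audio synthesis can be sketched as follows. This is a simplified illustration under stated assumptions, not the structure of the actual system: it supposes that the audio converter records (output time, source time) pairs as it synthesizes audio, and that the video frame to display is looked up from them. The names `anchors`, `source_time` and `video_frame` are hypothetical.

```python
import bisect


def source_time(out_time, anchors):
    """Map an output (playback) time to a source time by linear interpolation.

    anchors: list of (out_time, src_time) pairs, increasing in both values,
    assumed to be logged by the audio converter during synthesis.
    """
    outs = [a[0] for a in anchors]
    i = bisect.bisect_right(outs, out_time) - 1
    i = max(0, min(i, len(anchors) - 2))  # clamp to a valid segment
    (o0, s0), (o1, s1) = anchors[i], anchors[i + 1]
    return s0 + (s1 - s0) * (out_time - o0) / (o1 - o0)


def video_frame(out_time, anchors, fps):
    """Index of the source video frame to show at the given output time."""
    return int(source_time(out_time, anchors) * fps)
```

    Because the adaptive conversion does not consume the source at a constant rate, a lookup of this kind keeps the picture aligned with the speech even though the overall replay rate is fixed.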


7. Future Activities
    At present, translation of overseas news bulletins usually involves going through the source material on a VTR, interpreting and transcribing the content before going on air. To support this work, we will continue the development of the variable rate audio-visual player. Another application example is at the retrieval terminal of a large-scale video and audio archiving system.
    The variable-rate playback function also has many potential applications other than in broadcast program production.



8. Conclusion and Remarks
    A new non-linear editing system with built-in speech rate conversion functionality has been developed. It was found that the adaptive speech rate conversion function enables the user to comprehend the approximate content of speech played at up to 5 times normal speed. This system, which can change the speed of MPEG audio and video simultaneously, has been implemented on a personal computer.
    It can be used not only in video and audio retrieval and editing equipment, but also to create new special effects for program production.

Figure 6. A prototype non-linear editing system.



References

[1] A. Nakamura, N. Seiyama, A. Imai, T. Takagi and E. Miyasaka, "A new approach to compensate degeneration of speech intelligibility for elderly listeners," IEEE Trans. Broadcast., vol.42, no.3, Sept. 1996.
[2] A. Imai, R. Ikezawa, N. Seiyama, A. Nakamura, T. Takagi, E. Miyasaka and K. Nakabayashi, "An adaptive speech rate conversion method for news programs without accumulating time delay," IEICE, vol.J83-A, no.8, pp.935-945, Aug. 2000.
[3] H. Hamada and J. Chiba, "Control method of prosodic features to emphasis keyword for text-to-speech synthesis," Procs. ASJ Spring Meeting, pp.279-280, Mar. 1992.
[4] T. Takagi, N. Seiyama and E. Miyasaka, "A method for pitch extraction of speech signals using autocorrelation function through multiple window-lengths," IEICE, vol.J80-A, no.9, pp.1341-1350, Sept. 1997.
[5] N. Seiyama, A. Imai, T. Mishima, T. Takagi and E. Miyasaka, "Development of high-quality real-time speech rate conversion system," IEICE, vol.J84-D-2, no.6, pp.918-926, Jun. 2001.
[6] N. Seiyama, A. Nakamura, A. Imai, T. Takagi and E. Miyasaka, "Portable speech rate conversion system," EUROSPEECH'95, vol.3, pp.1717-1720, 1995.
[7] K. Watanabe, "A study on the Effect of Slower Speech Rate Produced by the Speech Rate Converter," J. Otolaryngology. Japan, vol.99, pp.445-453, 1996.
[8] T. Takagi, N. Seiyama and E. Miyasaka, "A Method for Pitch Extraction of Speech Signals Using Autocorrelation Functions through Multiple Window Lengths," Electronics and Communications in Japan, Part 3, vol.83, no.2, pp.67-79, 2


Mr. Atsushi Imai
Atsushi Imai received a B.E. degree in Electrical Engineering from Saitama University, Saitama, Japan. He joined NHK in 1989. Since 1992, he has been with NHK Science and Technical Research Laboratories, where he has been engaged in research on speech perception and development of high quality speech rate conversion system.
Mr. Nobumasa Seiyama
Nobumasa Seiyama received a B.E. and a M.E. degree in Electrical Engineering from Waseda University, Tokyo, Japan. He joined NHK in 1989. Since then, he has been with the Science and Technical Research Laboratories, where he has been engaged in speech processing, voice conversion, and development of high quality speech rate conversion system.
Mr. Takeshi Mishima
Takeshi Mishima received B.E. and M.E. degrees from the Department of Electrical Engineering, Meiji University, Kanagawa, Japan, in 1991 and 1993, respectively. He joined NHK in 1993. Since 1998, he has been with NHK Science and Technical Research Laboratories, where he has been engaged in research on speech recognition and speech synthesis.
Mr. Tohru Takagi
Tohru Takagi received a B.E. and a M.E. degree in Electrical Engineering from University of Electro-Communications, Tokyo, Japan. He joined NHK in 1981. Since 1984, he has been with the Science and Technical Research Laboratories, where he has been engaged in speech synthesis, voice conversion, speech perception and development of high quality speech rate conversion system.
Mr. Eiichi Miyasaka
Eiichi Miyasaka received a Dr. Eng. degree in Electrical Engineering from Tohoku University, Sendai, Japan. He joined NHK in 1969. Since 1972, he has been with NHK Science and Technical Research Laboratories, where he has been engaged in research on auditory mechanisms from physiological, psychological and electrical points of view, multi-channel sound reproduction system for HDTV, development of high quality speech rate conversion system and so on.


Copyright 2001 NHK (Japan Broadcasting Corporation) All rights reserved. Unauthorized copy of the pages is prohibited.
