NHK Laboratories Note No. 471

Speech Recognition of Japanese News Commentary

by
Shinichi HOMMA, Akio KOBAYASHI, Shoei SATO, Toru IMAI, and Akio ANDO
(Human Science Research Division)
ABSTRACT
This paper describes improvements in the speech recognition of Japanese broadcast news commentary. Because news commentary speech differs from read speech in its linguistic and acoustic features, it yields lower word recognition accuracy. In this paper we apply rules representing the linguistic features of news commentary to news manuscripts, and generate word sequences for language model adaptation. We also use a large volume of transcriptions of news programs as training text. The acoustic models are speaker-adapted, and their structures are changed so as to recognize relatively short phonemes, because we found that the speech rate of news commentary is sometimes much faster than that of read speech. Furthermore, by using a decoder that can handle cross-word triphone models, we reduced the word error rate by 32%.
1. Introduction
    Recent years have seen increasing demand for the annotation of news programs with subtitles. In response, NHK (Japan Broadcasting Corporation) began subtitling news programs with a real-time speech recognition system in March 2000. At present, subtitling is limited to anchor announcers' read speech. Speech uttered in news commentary has different properties from anchor announcers' read speech, which degrades word accuracy.
    There are two types of news commentary. One involves an anchor commenting on the news alone using charts, graphs, scale models and other materials. The other involves an anchor and some reporters explaining the news in a conversational style with similar materials. We focus on the former type of news commentary in this paper (See Fig. 1).

Figure 1. News commentary focused on in this paper

    Just before the news program begins, the anchor announcer writes a manuscript, but it is rarely read exactly as written. Utterances in a news commentary therefore have some features of spontaneous speech. There have been some reports on the differences between spontaneous speech and read speech [1, 2]. The DARPA-sponsored Hub-4 benchmark tests also deal with spontaneous speech in broadcast news transcription. The lowest word error rate there reached 14.4%, which was 6.8% worse than that for speech without spontaneity [3], showing that the recognition of spontaneous speech is an ongoing problem.
    This paper is organized as follows. First we investigate and classify the linguistic features of news commentary by comparison of manuscripts and transcriptions, and construct some new language models. Second, we describe the acoustic features of news commentary by comparing examples with read speech, and construct some new acoustic models. Finally, we describe and discuss evaluation results using these language and acoustic models.

2. Linguistic features of news commentary
    We investigated the linguistic features of 12 previously-broadcast items of news commentary by comparing their transcriptions with the original manuscripts. The numbers of sentences and words are shown in Table 1.

Table 1: Comparison of manuscripts and transcriptions

                      Manuscripts   Transcriptions
Number of sentences   112           139
Number of words       2,927         3,140

    The results of the investigation were as follows.
  • Only 13% of the sentences were read exactly as written in the manuscripts. The other sentences were partially replaced by other words with the same meaning or by ad-libs.
  • Speech was sometimes non-fluent. Filled pauses were observed at the beginning of 32% of the sentences. Typical pause fillers were "de", "e", and "e:", which constituted 91% of all pause fillers. Hesitations and repetitions were also observed more frequently than in read speech.
  • Demonstrative pronouns were frequently used to point to materials such as charts. These expressions were rarely written in the manuscripts and were often produced impromptu.
  • Colloquial words (e.g. "chotto" or "zutto") were often observed in the transcriptions.
  • Expressions in the predicates were often replaced by colloquial phrases. Examples were as follows:

    ... shi mashita </s>  →  ... shita ndesu </s>
                             ... shita ndesu ne </s>
    ... shi masu ga ...   →  ... shita ndesu ...
                             ... shita ndesu keredomo ...

3. Language models for news commentary
3.1. Automatic generation of commentary phrases
    We applied 30 rules representing the linguistic features of news commentary to news manuscripts in order to generate word sequences. The rules take the form of the predicate replacements shown in Section 2: the left side of the arrow is a phrase in a manuscript, and the right side is a generated word sequence. To reflect the generated word sequences in trigrams, each rule also includes the words immediately before and after the generated word sequence.
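    As an illustration of how such rewrite rules can be applied, here is a minimal sketch in Python. The rule table and function are hypothetical examples modeled on the predicate replacements in Section 2; the paper's actual 30 rules are not listed here.

```python
# Minimal sketch of rule-based generation of commentary-style word
# sequences from manuscript sentences. The two rules below are
# illustrative, modeled on Section 2; the paper's 30 rules are not given.

# Each rule maps a manuscript phrase to one or more colloquial variants.
# Tokens are space-separated; "</s>" marks a sentence end.
RULES = {
    "shi mashita </s>": ["shita ndesu </s>", "shita ndesu ne </s>"],
    "shi masu ga": ["shita ndesu", "shita ndesu keredomo"],
}

def generate_variants(sentence):
    """Return commentary-style word sequences generated from one
    manuscript sentence, keeping the surrounding words so that the
    new sequences contribute full trigram contexts."""
    variants = []
    for source, targets in RULES.items():
        if source in sentence:
            for target in targets:
                variants.append(sentence.replace(source, target))
    return variants

# Example: one manuscript sentence yields two adapted word sequences.
for v in generate_variants("jishin ga hassei shi mashita </s>"):
    print(v)
```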



3.2. Language modeling
    For the training procedure, we use three corpora: long-term news scripts, news transcriptions, and short-term latest news scripts, as shown in Table 2. We can expect the news transcriptions to be effective in the recognition of news commentary because they include spontaneously uttered colloquial phrases.

Table 2: Periods and sizes of the corpora

Corpus                           Period                                   Example of size (Jul. 26 '00)
Long-term news scripts           Apr. 1 '91 to 4 hours before broadcast   1.91M sentences
News transcriptions              Jun. 1 '97 to Jul. 31 '97;               351K sentences
                                 Apr. 1 '98 to Sep. 30 '98;
                                 Apr. 1 '99 to 1 day before broadcast
Short-term latest news scripts   6 hours before broadcast to              665 sentences
                                 just before broadcast

    The language modeling procedure is as follows:

  1. Construct an N-gram language model using long-term news scripts and news transcriptions.
  2. Generate word sequences peculiar to news commentary using the rules (see Section 3.1).
  3. Add the word sequences to the short-term latest news scripts, and construct an N-gram language model.
  4. Construct an N-gram language model by linear interpolation of the language models of (1) and (3), using weights estimated by the EM algorithm on data different from the training data [4]. The ratio of the weights was 0.85 : 0.15, respectively. A sketch of this interpolation follows the list.
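    As an illustration of step 4, here is a minimal sketch of the linear interpolation with an EM-estimated weight. The per-token probability lists are an assumption made for brevity, not the paper's implementation.

```python
# Minimal sketch of linearly interpolating two N-gram language models,
# with the mixture weight estimated by EM on held-out data (step 4).
# The per-token probability lists are an illustrative simplification.

def interpolate(p_general, p_latest, lam):
    """P(w|h) = lam * P_general(w|h) + (1 - lam) * P_latest(w|h)."""
    return lam * p_general + (1.0 - lam) * p_latest

def estimate_weight(probs_general, probs_latest, iterations=20):
    """EM estimation of the weight of the general model, given the
    probabilities each model assigns to the same held-out tokens."""
    lam = 0.5  # initial guess
    for _ in range(iterations):
        # E-step: posterior that each token came from the general model.
        posteriors = [lam * pg / (lam * pg + (1.0 - lam) * pl)
                      for pg, pl in zip(probs_general, probs_latest)]
        # M-step: the new weight is the average posterior.
        lam = sum(posteriors) / len(posteriors)
    return lam

# Toy held-out probabilities under the models of steps (1) and (3):
pg = [0.020, 0.015, 0.050, 0.040]
pl = [0.002, 0.001, 0.008, 0.010]
lam = estimate_weight(pg, pl)
print(lam, interpolate(pg[0], pl[0], lam))  # the paper's weights: 0.85 : 0.15
```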

    This procedure is executed every day according to the broadcast programs. In this paper, we construct four kinds of language models, as shown in Table 3, in order to observe the effectiveness of the generated word sequences and of the news transcriptions added to the training data.

Table 3: Differences between language models

Language model   Generated word sequences   News transcriptions
LM-ADPT          Not used                   Not used
LM-RULE          Used                       Not used
LM-ADPT+         Not used                   Used
LM-RULE+         Used                       Used

3.3. Evaluation of the language models
    The evaluation test set consisted of 149 sentences (3,399 words) transcribed from 16 days of news programs broadcast between May and July 2000. We prepared 12 language models corresponding to the broadcast days of the test set data. Each vocabulary size was 20K. We used one fixed vocabulary for LM-ADPT and LM-RULE, and another for LM-ADPT+ and LM-RULE+.
    Table 4 shows the perplexity, the rate of hit trigrams (Hit), and the out-of-vocabulary (OOV) rate for each language model. The perplexities of LM-ADPT and LM-ADPT+ are higher than those of LM-RULE and LM-RULE+, respectively, because their shared vocabularies contain words from the generated sequences whose appearance probabilities are almost zero without the rules. Although this makes a direct comparison of the perplexities somewhat unfair, the generated word sequences and the addition of news transcriptions to the training data are clearly effective, especially for the rate of hit trigrams.
Table 4: Evaluation of the language models

Language model   Perplexity   Hit (%)   OOV (%)
LM-ADPT          361.1        67.8      0.75
LM-RULE          53.3         70.7      0.75
LM-ADPT+         56.4         72.2      0.67
LM-RULE+         41.8         73.6      0.67
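
    For reference, a minimal sketch of how the perplexity and trigram hit rate in Table 4 are typically computed over a test text. The toy model class is a stand-in; a real back-off trigram model would supply these two methods.

```python
import math

# Minimal sketch of test-set perplexity and trigram hit rate.
# TinyTrigramStub stands in for a real back-off trigram model.

class TinyTrigramStub:
    def __init__(self, trigram_probs):
        self.trigram_probs = trigram_probs  # {(w1, w2, w3): P(w3 | w1 w2)}

    def prob(self, history, w):
        # A real model backs off to bigram/unigram estimates; this toy
        # just returns a small floor probability for unseen events.
        return self.trigram_probs.get((*history, w), 1e-4)

    def has_trigram(self, history, w):
        return (*history, w) in self.trigram_probs

def evaluate(model, words):
    log_prob, hits = 0.0, 0
    for i, w in enumerate(words):
        history = tuple(words[max(0, i - 2):i])
        log_prob += math.log(model.prob(history, w))
        # A "hit" means the full trigram was seen in training,
        # i.e. no back-off was needed for this word.
        if len(history) == 2 and model.has_trigram(history, w):
            hits += 1
    perplexity = math.exp(-log_prob / len(words))
    return perplexity, 100.0 * hits / len(words)

model = TinyTrigramStub({("kyou", "no", "nyuusu"): 0.2})
print(evaluate(model, ["kyou", "no", "nyuusu"]))
```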


4. Acoustic features of news commentary
4.1. Evaluated data
    We collected news commentary speech and read speech to investigate the differences in their acoustic features; the data are summarized in Table 5. The table shows that news commentary has fewer words per sentence than read speech and that its speech rate tends to be faster.

Table 5: Evaluated data

                               Commentary speech   Read speech
Number of sentences            134                 45
Number of words                2,922               1,717
Number of words per sentence   21.8                38.2
Speech rate (mora/sec)         8.7                 8.4


4.2. Acoustic modeling and alignment
    Table 6 presents the acoustic model used as a baseline, which we denote AM-BASE. Figure 2 presents the structure of the HMM we used initially: a 3-state left-to-right model.

Table 6: Acoustic model (AM-BASE)

Sampling frequency                     16 kHz
Analysis window                        Hamming window, 25 msec
Frame period                           10 msec
Analysis parameters                    39 parameters (12 MFCCs with log power and their first- and second-order regression coefficients)
HMM                                    State-clustered 8-mixture triphone HMMs
Number of triphone models and states   5,648 models, 4,016 states
Training data                          102,877 sentences (315 hours)


Figure 2. Structure of HMM.


    We adapted the AM-BASE model to an anchor announcer using MAP estimation [5] and constructed an acoustic model called AM-ADPT. For the adaptation, we used 9.2 hours (3,262 sentences) of the anchor announcer's clean speech broadcast from March 27 to April 28, 2000, a period before that from which the test set was selected (May to July 2000). Of the 5,648 triphone models, 2,192 (39%) were adapted.
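    For illustration, a minimal sketch of the MAP update of a Gaussian mean in the spirit of [5]. The single-Gaussian view and the relevance factor tau are simplifying assumptions; the actual models are 8-mixture triphone HMMs.

```python
import numpy as np

# Minimal sketch of MAP adaptation of an HMM Gaussian mean [5].
# The adapted mean is a count-weighted blend of the speaker-independent
# prior mean and the adaptation-data statistics; tau is an illustrative
# prior weight (relevance factor), not a value from the paper.

def map_adapt_mean(prior_mean, frames, posteriors, tau=10.0):
    """prior_mean: (D,) speaker-independent mean.
    frames: (T, D) adaptation feature vectors (e.g. 39-dim MFCCs).
    posteriors: (T,) occupation probabilities of this Gaussian."""
    gamma = posteriors.sum()            # soft count of assigned frames
    weighted_sum = posteriors @ frames  # (D,) first-order statistics
    # Little data (small gamma): stay near the prior mean.
    # Much data (large gamma): move toward the adaptation-data mean.
    return (tau * prior_mean + weighted_sum) / (tau + gamma)

# Toy example with random stand-in statistics:
rng = np.random.default_rng(0)
mu = map_adapt_mean(rng.normal(size=39),
                    rng.normal(size=(500, 39)),
                    rng.uniform(size=500))
```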
    Next, we obtained phoneme alignments of the evaluation data shown in Table 5 using AM-ADPT. In news commentary we found a high frequency of phonemes with short duration (less than 5 frames), observed among vowels (/a/, /e/, /i/, /o/, /u/), syllabic nasals (/N/), semivowels (/w/, /y/, /r/), and nasal consonants (/m/, /n/). Figure 3 shows the distribution of the duration of vowels and syllabic nasals, whose frequencies were relatively high.

Figure 3. Distribution of duration of vowels and syllabic nasals by using AM-ADPT.
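
    For reference, a minimal sketch of how a duration distribution like Figure 3 can be tallied from forced-alignment output; the (phoneme, start, end) record format is a hypothetical example.

```python
from collections import Counter

# Minimal sketch of building a phoneme-duration histogram (in frames)
# from forced alignments. The (phoneme, start_frame, end_frame) tuple
# format is a hypothetical example of an aligner's output.

VOWELS_AND_SYLLABIC_N = {"a", "e", "i", "o", "u", "N"}

def duration_histogram(alignment):
    hist = Counter()
    for phoneme, start, end in alignment:
        if phoneme in VOWELS_AND_SYLLABIC_N:
            hist[end - start] += 1
    return hist

# Toy alignment of one short utterance:
hist = duration_histogram([("k", 0, 4), ("a", 4, 7), ("N", 7, 10),
                           ("d", 10, 12), ("e", 12, 13)])
short = sum(n for dur, n in hist.items() if dur < 5)
print(f"vowels/syllabic nasals shorter than 5 frames: {short}")
```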

    Based on this result, we constructed a new acoustic model called AM-SKIP, whose HMMs are permitted to skip states. We gave a small probability of 0.01 to the transitions shown by the dotted arrows in Figure 4 (see the sketch after the figure), carried out MAP adaptation using the same data used to construct AM-ADPT, and updated the output distributions. Table 7 presents the differences between the acoustic models AM-BASE, AM-ADPT, and AM-SKIP.
Figure 4. Structure of HMM with skips.
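
    As an illustration, a minimal sketch of adding the 0.01 skip transitions (the dotted arrows in Figure 4) to a 3-state left-to-right transition matrix. How the remaining probability mass is renormalized here is an assumption; the paper does not specify it.

```python
import numpy as np

# Minimal sketch of adding state-skip transitions to a 3-state
# left-to-right HMM. States 0-2 are emitting; state 3 is the exit.
# The 0.01 skip probability follows the paper; the renormalization of
# each row is an illustrative assumption.

SKIP = 0.01

def add_skips(trans):
    """trans: (4, 4) row-stochastic left-to-right transition matrix."""
    t = trans.copy()
    for i in range(2):          # states 0 and 1 may skip one state ahead
        t[i] *= (1.0 - SKIP)    # make room for the new arc
        t[i, i + 2] += SKIP     # the skip transition (dotted arrow)
    return t

# A plain left-to-right model: self-loop 0.6, advance 0.4.
base = np.array([[0.6, 0.4, 0.0, 0.0],
                 [0.0, 0.6, 0.4, 0.0],
                 [0.0, 0.0, 0.6, 0.4],
                 [0.0, 0.0, 0.0, 1.0]])
with_skips = add_skips(base)
assert np.allclose(with_skips.sum(axis=1), 1.0)  # rows stay stochastic
```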

Table 7: Differences between the acoustic models

Acoustic model   Speaker adaptation   State skips
AM-BASE          Not adapted          Not permitted
AM-ADPT          Adapted              Not permitted
AM-SKIP          Adapted              Permitted


    We then obtained phoneme alignments of the evaluation data again, this time using AM-SKIP. Figure 5 presents the resulting distribution of the duration of vowels and syllabic nasals. Compared with Figure 3, the frequency of phonemes with a duration of 3 frames declined, while that of phonemes shorter than 3 frames increased. Phonemes shorter than 3 frames accounted for 4.2% of the read speech and 8.6% of the news commentary speech.

Figure 5. Distribution of duration of vowels and syllabic nasals by using AM-SKIP.

    Table 8 presents a comparison of the duration and the acoustic log-likelihood (average per frame) of vowels and syllabic nasals. The average durations were close, but the standard deviation was larger for the commentary speech. The likelihood was also smaller for the news commentary. We can therefore say that a mismatch between the acoustic model and the news commentary was observed.

Table 8: Comparison of duration and acoustic likelihood

                                            Commentary speech   Read speech
Average duration (frames)                   6.5                 6.6
Standard deviation of duration (frames)     4.7                 4.1
Average acoustic log-likelihood per frame   -63.6               -59.2


5. Experiments
     We carried out transcription experiments with the language models and acoustic models described above, using our 2-pass decoder [6]. The first pass generates a word lattice time-synchronously by a Viterbi beam search, using a bigram language model and triphone HMMs. The second pass rescores the N-best (N=200) sentences with a trigram language model and outputs the best sentence. The evaluation speech data were the same as in the evaluation of the language models in Section 3.3 (149 sentences consisting of 3,399 words).
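     As an illustration of the second pass, a minimal sketch of N-best rescoring with a trigram language model. The scoring callback and language model weight are hypothetical stand-ins, not the actual decoder of [6].

```python
# Minimal sketch of the second pass: rescoring N-best hypotheses with a
# trigram language model. Each hypothesis carries the acoustic log
# score from the first pass; trigram_logprob and lm_weight are
# hypothetical stand-ins for the real decoder's components.

def rescore_nbest(nbest, trigram_logprob, lm_weight=10.0):
    """nbest: list of (words, acoustic_log_score) pairs.
    Returns the pair with the best combined score."""
    def combined(hyp):
        words, acoustic = hyp
        lm = sum(trigram_logprob(tuple(words[max(0, i - 2):i]), w)
                 for i, w in enumerate(words))
        return acoustic + lm_weight * lm
    return max(nbest, key=combined)

# Toy usage with a uniform stand-in trigram model:
toy_lm = lambda history, w: -2.0
best, _ = rescore_nbest([(["kyou", "no", "nyuusu"], -120.0),
                         (["kyou", "no", "nyuu", "su"], -118.0)], toy_lm)
print(best)
```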
     The results of the experiments are shown in Table 9, as improvements over the baseline obtained with the other language models and acoustic models. With LM-RULE+ and AM-SKIP, word accuracy improved by 5.1% without increasing the real-time factor. Furthermore, by using a decoder that can handle cross-word triphone models depending on the left context only [7], word accuracy improved by a further 0.3%, although the real-time factor roughly doubled. Analyzing the results in detail, we found that adding a large amount of news transcriptions to the language model training data was the most effective measure. On the other hand, the large volume of transcriptions cancelled out the effectiveness of the adopted rules, which work well when only a smaller volume of transcriptions is available.

Table 9: Word accuracy (ACC) and real time factor (RTF) in the experiments


6. Conclusions
     We investigated the features of news commentary and found a colloquial tendency among its linguistic features and short durations of some phonemes among its acoustic features. We applied rules representing the linguistic features of news commentary to news manuscripts, and generated word sequences for language model adaptation. We also added a large volume of transcriptions of news programs to the training data. For acoustic modeling, we used MAP estimation and changed the structure of the HMMs by allowing state skips. Furthermore, by using a decoder that can handle cross-word triphone models, we reduced the word error rate by 32%.


References

  1. Blaauw, E., "Phonetic differences between read and spontaneous speech", Proceedings of the International Conference on Spoken Language Processing, pp. 751-754, 1992.
  2. Eskenazi, M., "Changing speech styles: Strategies in read speech and casual and careful spontaneous speech", Proceedings of the International Conference on Spoken Language Processing, pp. 755-758, 1992.
  3. Pallett, D.S., Fiscus, J.G., Garofolo, J.S., Martin, A., and Przybocki, M., "1998 broadcast news benchmark test results: English and non-English word error rate performance measures", Proceedings of the DARPA Broadcast News Workshop, pp. 5-10, 1999.
  4. Kobayashi, A., Onoe, K., Imai, T., and Ando, A., "Time dependent language model for broadcast news transcription and its post-correction", Proceedings of the International Conference on Spoken Language Processing, pp. 2435-2438, 1998.
  5. Gauvain, J.L. and Lee, C.H., "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains", IEEE Transactions on Speech and Audio Processing, vol. 2, no. 2, pp. 291-298, 1994.
  6. Imai, T., Kobayashi, A., Sato, S., Tanaka, H., and Ando, A., "Progressive 2-pass decoder for real-time broadcast news captioning", Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp. 1937-1940, 2000.
  7. Homma, S., Imai, T., and Ando, A., "An examination of cross-word triphones: An implementation method of 1st-pass cross-word triphone models in 2-pass decoders", Proceedings of the Autumn Meeting of the Acoustical Society of Japan, 2-1-2, Sept. 1999 (in Japanese).
Mr. Shinichi Homma
Shinichi Homma received a B.E. degree in electronics and communication engineering from Waseda University, Tokyo, Japan, in 1992. He joined Japan Broadcasting Corporation (NHK) in 1992. Since 1998 he has been with the Science and Technical Research Laboratories, where he has been engaged in research on speech recognition for a news subtitling system.
Mr. Akio Kobayashi
Akio Kobayashi received a B.E. degree in electrical engineering in 1991 from Waseda University. He joined Japan Broadcasting Corporation (NHK) in 1991. Since 1996 he has been with the NHK Science and Technical Research Laboratories, where he is engaged in research on speech recognition.
Mr. Shoei Sato
Shoei Sato received a B.E. degree and an M.E. degree in 1993 from Tohoku University, Sendai, Japan. Since 1995 he has been with the NHK Science and Technical Research Laboratories, where he was engaged in research on digital satellite broadcasting systems. His current research interest is automatic speech recognition systems.
Dr. Toru Imai
Toru Imai received a B.E. degree in electrical engineering in 1987 and a Ph.D. degree in information and computer science in 1999 from Waseda University, Tokyo, Japan. He joined NHK in 1987. Since 1990 he has been with the NHK Science and Technical Research Laboratories, where he has been engaged in research on speech recognition.
Mr. Akio Ando
Akio Ando received the B.S. and M.S. degrees from Kyushu Institute of Design in 1978 and 1980, respectively. In 1980 he joined Japan Broadcasting Corporation (NHK). He has been with the Science and Technical Research Laboratories since August 1983, where he is currently a senior research engineer in the Human Science Research Division. His research interests include speech processing and pattern recognition.


Copyright 2001 NHK (Japan Broadcasting Corporation). All rights reserved. Unauthorized copying of these pages is prohibited.
