NHK Laboratories Note No. 464


Toru Imai, Akio Kobayashi, Shoei Sato,
Hideki Tanaka, and Akio Ando
(Human Science Research Division)

    This paper describes a speech recognition engine that progressively outputs the latest available results of words used for real-time closed captioning of Japanese broadcast news. The search engine called a progressive 2-pass decoder practically eliminates the disadvantage of conventional multiple-pass decoders that delay a decision until the end of a sentence. During the first pass of the search the proposed decoder periodically executes the second pass up to that time and detects a part of the final result of words. This method is not theoretically optimal but makes a quick decision with a negligible increase in word errors. In a recognition experiment on Japanese broadcast news, the decoder worked with an average decision delay of 554msec for each word and degraded word accuracy only by 0.22%.
    There is a great need for real-time captioning of broadcast news for the hearing impaired and elderly. Real-time captioning in English can be performed by stenocaptioners using a special keyboard that translates combinations of keys presenting phonetic sounds into spelled words. However, it is very difficult for Japanese to keep up with speech by such a device due to a great number of homonyms with ideograms, Kanji. To solve this problem and make captions for live broadcast news, we have developed a real-time large vocabulary speech recognition system followed by manual check and correction. NHK has started a subtitling service (see Fig. 1) by using the system at 7 oċ›£lock news on March 27, 2000. Such a service by speech recognition is probably the first one in the world. This paper describes a new search engine used in the system.

Fig. 1 NHK's subtitled news by speech recognition.

    To realize such a system and permit manual correction in real-time, high word accuracy is required. So far, we have developed two techniques especially designed for broadcast news: a time-dependent language model (TDLM) [1] adapted to recent broadcast news, and an automatic selection of acoustic training data from news anchors [2]. Besides the language and acoustic models, a search engine called a 2-pass decoder was developed with a single static tree lexicon and a pre-processed table of the maximum bigram probabilities in lower levels [3].
    Another requirement of such a system is a quick response or a short delay of displayed words from speech. Although the above 2-pass decoder or a multiple-pass search strategy [4] where various knowledge sources are applied at various stages on the entire sentence has computational advantages with high accuracy, they can not make a final decision until the end of the sentence. A one-pass decoder also has a similar problem as long as an initial common path [5] in a search space does not exist. Such a decision is too slow for application to real-time captioning. DARPA-sponsored Hub-4 benchmark tests deal with broadcast news transcription, but the current target is a system running not in real-time but in less than 10 times real-time [6]. We have changed the second pass procedure in our 2-pass decoder so that it makes a quick decision during speech while maintaining high accuracy.
    After describing the conventional 2-pass decoder in Section 2, this paper describes the new 2-pass decoder in Section 3. Evaluation results of the decoders on Japanese broadcast news is described in Section 4 with regard to word accuracy and decision delays from speech.

    Our conventional 2-pass decoder [3] executes the second pass just once at the end of a sentence as general multiple-pass decoders do [4]. The first pass using a bigram language model and triphone HMMs generates a word lattice time-synchronously. At a sentence-end the word lattice is recursively traced back to get N -best sentences. The second pass rescores the N -best sentences by a trigram language model to decide and output the best word sequence.
    In the first pass, the word-dependent N -best algorithm [7] is carried out with a modified Viterbi beam search. During the search top-n (n<<N ) paths which have different previous words are saved in each HMM state. If multiple paths reach the same state with the same previous word, only the best path is saved. At a word-end phoneme, its word-id, the word-end frame, scores and back-pointers to the previous word-ends for the N paths are saved as a word lattice, and only the best path propagates to following words. Pruning in the beam search is based on a global beam width of log-likelihood and a narrower word-end beam width.
    The first pass procedure runs on a tree-structured phoneme network [8] where initial phoneme sequences are shared by some words. Instead of copying the trees dynamically [8][9], we use a single and static tree (where n should be greater even if N =1 in order to avoid search errors). With some words sharing initial nodes of triphones, the tree-structured phoneme network can reduce the number of active nodes at the beginnings of words. However, bigram probabilities can not be applied until leaf nodes where word identity is known. To apply a language model earlier, the maximum bigram probability among words sharing a node is used for pruning [9]. It requires much computation to get the previous-word-dependent maximum bigram probability in each node whenever a list of shared words changes. For a small vocabulary, the maximum values can be pre-processed and saved in a table, but it requires huge memories for a larger vocabulary. Then we pre-process and save the table only for all nodes at frequently active lower levels (L)according to the size of the vocabulary.
    A problem with this decoder for real-time broadcast news captioning is that it makes a final decision at the end of each sentence. This means that closed captions are displayed with at least a sentence-long delay from speech.

Fig. 2 Comparison between the conventional 2-pass decoder and the new 2-pass decoder.

    Observing the most likely word sequence by the trace back in each frame during the first pass, we find that words close to the processed frame change frequently but earlier words do not. Then it is expected that a decision on earlier words without waiting for the sentence end does not lead to a significant increase of word errors in practical use. We apply this idea to our 2-pass decoder. The second pass is executed periodically on partial N -best word sequences while the first pass is executed as usual, and stable words are detected and progressively output as a part of the final result. Since such a decision does not guarantee the optimization for the entire utterance, we try to detect reliably stable words so that the method gives a quick response with maintaining high accuracy. Fig. 2 illustrates the difference between our conventional 2-pass decoder and the new progressive 2-pass decoder.
    The detailed procedure of the new second pass is as follows (see Fig. 3).
  1. Every t frames during the first pass the uncompleted word lattice is traced back from the word-end pointed by the node with the best score to get N -best word sequences up to that frame. The N -best word sequences do not include the premature word to which the best node belongs if the node is not a word-end. If there are words previously decided, word-end branches in that portion need not be expanded for the N -best.

  2. The N -best word sequences are rescored using a trigram language model to get the 1-best word sequence up to that frame.

  3. The current 1-best word sequence is compared with the previous 1-best word sequence obtained t frames before. If some words are repeated after the last decision, those words are regarded as likely to be correct and are decided as a part of the final result. If there are no repeated words, the 1-best word sequence is regarded as unstable and no decision is made. Since words close to the current frame are unstable, the latest M words are excluded from the target to be decided. Even if the best word in a pre-determined portion has been changed, it is ignored.

Fig. 3 Progressive 2-pass decoder.

    Note that the first pass is executed as in the conventional method and does not get any feedback from the second pass or the decided partial results. Just the second pass is executed many times periodically on part of the speech or the uncompleted word lattice. There is a trade-off between decision delays and word errors. They can be controlled by the parameters of the second-pass interval t and the decision margin of the latest M words. As t gets longer or M gets larger, fewer word errors are produced by this early decision method but with longer decision delays.
    More complicated methods could be used to realize the progressive 2-pass decoding: non-periodic execution of the second pass, tracing back from multiple active nodes, or regarding connections with pre-decided words, etc. However, this paper examines the degradation of word accuracy and decision delays from speech by the comparatively simple method mentioned above.

4.1 Conditions
    An experiment to transcribe NHK's Japanese broadcast news was performed to examine the proposed decoder. The evaluated speech data were 122 sentences (65 sentences for two anchormen and 57 sentences for an anchorwoman) consisting of 4,942 words in total. The average length of the sentences was 12.5sec. The evaluated speech data were acoustically analyzed into 39 parameters (12 MFCCs with log-power and their first- and second-order regression coefficients) every 10msec after digitization at 16 kHz and 16 bits with the Hamming window of 25msec width.
    Acoustic models of gender-dependent and speaker-independent state-clustered 8-mixtured triphone HMMs were trained from NHK news data (48.3 hours for males and 38.6 hours for females including the evaluated speakers) after the above acoustic analysis. The number of Japanese phonemes was 42. The number of logical triphone HMMs was 5,873, and the numbers of physical ones were 3,093 with 1,950 states for males and 2,894 with 1,656 states for females for the lexicon of 20K words. A bigram language model for the first pass and a trigram language model for the second pass were trained from NHK news scripts extending back over 7 years after Japanese morphological analysis. They were used with back-off smoothing in the recognition. In the decoder, the number of saved paths in each HMM state, n, was 4 at most. The number of rescored partial sentences, N , was 200 at most. The maximum bigram probabilities were pre-processed onto a table up to the level L=2 of the tree-structured phoneme network for the computational saving. The global beam width was 160 in log-likelihood while the word-end beam width was 110. The weighting of the language score was 14 without an insertion penalty. The perplexities of the evaluated data by the bigram model and the trigram model were 69.2 and 28.0, respectively. The rate of words outside of the vocabulary was 0.67%. The evaluation was executed on an Alpha-21264 500MHz machine with 768MB memory.
    We tried varying the second-pass interval t from 1 to 50 frames (10ms/frame-shift) and the decision margin M from 0 to 2 words. The decoder was evaluated by the recognition accuracy and the average decision delay (difference between the word-end frame and the decision frame).

4.2 Result
    Word accuracy for each condition is illustrated in Fig. 4. The average decision delay is illustrated in Fig. 5. For instance in the middle case of t=30 frames and M=1 word, the word accuracy was 96.54% and the average decision delay was 554msec (2.7sec at most). The word accuracy was only 0.22% lower than the baseline accuracy of 96.76% achieved by the conventional decoder. The degradation was negligible. The conventional decoder, which made a final decision at the end of sentence, showed an average decision delay of 7.2sec (37sec at most) as the baseline. As the interval or the margin was smaller, the decision delay decreased but the word accuracy also decreased. As the interval or the margin was larger, the word accuracy was close to the accuracy achieved by the conventional decoder though the delay became longer. When M was 2, far words from the processed frame could be decided and made very small degradations of word accuracy independent of the interval but with longer decision delays.
    In the off-line experiment where all the evaluated speech data were stored in the disk beforehand, the speech recognition was done in 0.7 times real-time. It was faster than the conventional decoder running at 0.8 times real-time. This indicates that the periodic short trace back takes less computation in total than the trace back on the entire sentence once. Actually, the number of N -best word sequences listed up every time was much smaller than the expected value for the entire sentence. From the viewpoint of word accuracy, N for a shorter portion is more valuable than for the sentence because it bears more word alternatives.
    In an on-line experiment where all the evaluated speech data were fed from tapes, the time lag between the displayed words and input speech was less than 1-2 seconds anytime from the beginning of the sentences. It can be concluded that the early decision is very useful for some real-time applications like closed captioning.

4.3 Comparison with a guaranteed method
    For comparison, we examined a conventional early decision method [5] that guarantees not to make any degradation of word accuracy. The method relies on an initial common path in a search space because that portion is supposed to be a part of the optimized path on the entire sentence (see Fig. 6). However, in an experiment on the broadcast news transcription, the method made early decisions only for 3.6% of words or 1-2 words at the beginning of sentences and the remaining part was decided by the ordinary trace back at the end of the sentences. It is because the purpose of the first pass in the multiple-pass search strategies is to generate possible word alternatives, and it obstructs the initial sole path. The possibility that an initial common path exists becomes lower as the search space becomes larger: larger vocabulary sizes, wider beam widths, or many histories (larger n ). In a practical application, our progressive 2-pass decoder would be one acceptable way of providing early decisions.

Fig. 6 Early decision not to degrade accuracy.
    This paper described the new progressive 2-pass decoder and reported its application to Japanese broadcast news transcription. The decoder has the advantage of progressively outputting the latest available words in contrast to other multiple-pass decoders which make a final decision at the end of speech only. The trade-off between decision delays and word errors can be controlled by the parameters of the second-pass interval and the decision margin of the latest words. In the experiment the proposed decoder continuously output words with very small delays with a negligible increase in word errors. The basic idea of a periodic early decision with some margins can be applied also to a one-pass decoder.

  1. A.Kobayashi, K.Onoe, T.Imai, and A.Ando, "Time Dependent Language Model for Broadcast News Transcription and Its Post-Correction,"Proceedings of International Conference on Spoken Language Processing, pp. 2435-2438, Dec. 1998.

  2. S.Sato, T.Imai, and A.Ando, "Automatic Selection of Training Utterances to Improve an Acoustic Model," Technical Report of the Institute of Electronics, Information and Communication Engineers, SP99-30, Jun. 1999 (in Japanese).

  3. T.Imai, K.Onoe, A.Kobayashi, and A.Ando, "A Decoder for Broadcast News Transcription," Proceedings of Autumn Meeting of the Acoustical Society of Japan, 3-1-12, Sept. 1998 (in Japanese).

  4. R .Schwartz, L.Nguyen, and J.Makhoul, "Multiple-Pass Search Strategies," Automatic Speech and Speaker Recognition Advanced Topics, Kluwer Academic Publishers, MA, pp. 429-456, 1996.

  5. P.F.Brown, J.C.Spohrer, and P.H.Hochschild, "Partial Traceback and Dynamic Programming," Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 1629-1632, May 1982.

  6. D.S.Pallett, J.G.Fiscus, J.S.Garofolo, A.Martin, and M.Przybocki, "1998 Broadcast News Benchmark Test Results: English and Non-English Word Error Rate Performance Measures," Proceedings of DARPA Broadcast News Workshop, pp. 5-10, March 1999.

  7. R. Schwartz and S. Austin, "A Comparison of Several Approximate Algorithms for Finding Multiple (N-Best) Sentence Hypotheses," Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 701-704, May 1991.

  8. H.Ney, R .Haeb-Umbach, B.-H.Tran, M.Oerder, "Improvements in Beam Search for 10000-Word Continuous Speech Recognition," Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 9-12, Mar. 1992.

  9. J.J.Odell, V.Valtchev, P.C.Woodland, and S.J.Young, "A One Pass Decoder Design for Large Vocabulary Recognition," Proceedings of Human Language Technology Workshop, pp. 405-410, Mar. 1994.

Copyright 2000 NHK (Japan Broadcasting Corporation) All rights reserved. Unauthorized copy of the pages is prohibited.