Simultaneous Subtitling of Live Broadcast Programs by Automatic Recognition of Re-spoken Speech
In December 2001, NHK began caption broadcasting of the program "Kohaku Utagassen". This was the first-ever implementation of superimposed captioning of an entertainment program based on speech recognition technology. The program is a music-based variety program broadcast every New Year's Eve, featuring many popular singers, comedy and other performances. Since it is the top-rated TV program in Japan, demand for caption broadcasting of this program for hearing-impaired viewers has been substantial.
The "re-speak" strategy was used to create the captions. A re-speaker, listening to the speech in the program, re-speaks the content of that speech into the speech recognizer, which automatically produces the caption script. The re-speaker can summarize the spoken content if necessary. The features of this captioning strategy are as follows:

It can be applied to programs with high levels of background noise because the re-speaker speaks in a silent studio.
Filled pauses and other hesitation sounds are generally not re-spoken, so they do not interfere with the speech recognition process.
Incomplete sentences that lack a subject, which are common in Japanese conversational speech, are completed with a subject by the re-speaker. This makes the captions easier to understand and improves speech recognition accuracy.

Figure 1 shows the block diagram of the captioning system. The speech recognizer consists of a language model and an acoustic model. The language model represents how likely each word sequence is to occur. It was trained on transcriptions of "Kohaku Utagassen" shows broadcast from 1994 through 2000, on transcriptions of two similar NHK music shows, "Kayou Concert" and "Pop-jam", broadcast from January 2000 through December 2001, and on the manuscripts for "Kohaku Utagassen" 2001. The content of the manuscripts was very close to the actual show, although the specific wording often differed.
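The idea of a language model that scores word sequences by probability of occurrence can be illustrated with a toy example. The sketch below is not the model used in the NHK system, whose structure is not described here; it assumes a simple bigram model with add-one smoothing, and the function names (train_bigram, log_prob) and the English sample sentences are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams in tokenized sentences (assumed toy model, not NHK's)."""
    history_counts = Counter()
    bigram_counts = Counter()
    vocab = {"<s>", "</s>"}
    for words in sentences:
        padded = ["<s>"] + list(words) + ["</s>"]
        vocab.update(words)
        # Every token except the last one serves as a history for the next.
        history_counts.update(padded[:-1])
        bigram_counts.update(zip(padded[:-1], padded[1:]))
    return history_counts, bigram_counts, len(vocab)


def log_prob(words, model):
    """Log-probability of a word sequence under add-one smoothing."""
    history_counts, bigram_counts, vocab_size = model
    padded = ["<s>"] + list(words) + ["</s>"]
    total = 0.0
    for prev, cur in zip(padded[:-1], padded[1:]):
        numerator = bigram_counts[(prev, cur)] + 1
        denominator = history_counts[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total


# Hypothetical training text standing in for show transcriptions.
corpus = [
    ["the", "next", "song", "is", "a", "ballad"],
    ["the", "next", "singer", "is", "ready"],
]
model = train_bigram(corpus)
likely = log_prob(["the", "next", "song"], model)
unlikely = log_prob(["song", "next", "the"], model)
```

In this toy setting, a word order that resembles the training text receives a higher score than a scrambled one, which is how training on closely related material such as past shows and manuscripts can help the recognizer prefer plausible captions.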
The acoustic model represents the re-speakers' voices as phonemes. It was adapted to each re-speaker for better recognition. Because "Kohaku Utagassen" is a long show, running over four hours, four re-speakers took turns re-speaking. The system achieved a recognition accuracy of more than 95% and provided captions with a delay of less than three seconds.
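The article does not state how recognition accuracy was measured. As an illustration only, the sketch below assumes a commonly used word accuracy definition, one minus the number of substitutions, deletions and insertions divided by the number of reference words, computed by edit distance. The function names and the sample word lists are hypothetical, and for Japanese the choice of token unit (words or characters) would itself be an assumption.

```python
def edit_distance(reference, hypothesis):
    """Minimum number of substitutions, deletions and insertions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_token != hyp_token)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_accuracy(reference, hypothesis):
    """Assumed metric: 1 - (S + D + I) / N, with N reference tokens."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    return 1.0 - edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: one substituted word out of four.
spoken = ["welcome", "to", "the", "show"]
recognized = ["welcome", "to", "a", "show"]
accuracy = word_accuracy(spoken, recognized)
```

Under this assumed definition, an accuracy above 95% would correspond to fewer than one error in every twenty reference words.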

Although the captions generated by the speech recognizer were broadcast as is, recognition errors included, almost all of the many fax responses received were extremely positive. Hearing-impaired viewers expressed delight at finally being able to enjoy the program together with their families. NHK plans to use the same system for caption broadcasting of the Winter Olympic Games from Salt Lake City in February 2002.

Figure 1: Block diagram of the captioning system