Simultaneous Subtitling of Live Broadcast Programs by Automatic Recognition of Re-spoken Speech
In December 2001, NHK began caption broadcasting of the program "Kohaku
Utagassen". This was the first-ever implementation of superimposed
captioning of an entertainment program based on speech recognition
technology. The program is a music-based variety program broadcast
every New Year's Eve, featuring many popular singers, comedy and
other performances. Since it is the top-rated TV program in Japan,
demand for caption broadcasting of this program for hearing-impaired
viewers has been substantial.
The "re-speak" strategy was used to create the captions. A re-speaker, listening to the speech in the program, re-speaks the content of that speech into the speech recognizer, which automatically produces the caption script. The re-speaker can summarize the spoken content if necessary. The features of this captioning strategy are as follows:
- It can be applied to programs with high levels of background noise, because the re-speaker speaks in a silent studio.
- Filled pauses and other sounds indicating hesitation are generally not re-spoken, and so do not interfere with the speech recognition process.
- Incomplete sentences without a subject, which often appear in Japanese conversational speech, are supplemented with a subject by the re-speaker. This makes the caption easier to understand and improves speech recognition accuracy.
Figure 1 shows the block diagram of the captioning system. The speech recognizer consists of a language model and an acoustic model. The language model represents how likely word sequences are to occur, expressed as probabilities. It was trained on transcriptions of "Kohaku Utagassen" shows broadcast from 1994 through 2000, on transcriptions of the similar NHK music shows "Kayou Concert" and "Pop-jam" broadcast from January 2000 through December 2001, and on the manuscripts for "Kohaku Utagassen" 2001. The content of the manuscripts was very close to that of the actual show, even though the specific wording often differed.
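As a rough illustration of what "probability of occurrence of word sequences" can mean, the sketch below trains a tiny bigram model on a toy corpus. The article does not state which type of language model NHK used, so the bigram form, the add-alpha smoothing, the function names, and the English example sentences are all assumptions made purely for illustration.

```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and word pairs over tokenized sentences (toy corpus)."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        unigrams.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return unigrams, bigrams


def bigram_prob(unigrams, bigrams, prev, word, vocab_size, alpha=1.0):
    """P(word | prev) with add-alpha smoothing (an assumed, simple choice)."""
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)


def sentence_prob(unigrams, bigrams, words, vocab_size):
    """Probability of a whole word sequence as a product of bigram terms."""
    tokens = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= bigram_prob(unigrams, bigrams, prev, word, vocab_size)
    return p


# Hypothetical training text standing in for past show transcriptions.
corpus = [
    ["the", "next", "singer", "is", "ready"],
    ["the", "next", "song", "is", "new"],
]
vocab = {w for s in corpus for w in s} | {"</s>"}
uni, bi = train_bigram(corpus)

# A word order resembling the training text scores higher than a scrambled one.
seen = sentence_prob(uni, bi, ["the", "next", "song", "is", "ready"], len(vocab))
scrambled = sentence_prob(uni, bi, ["ready", "song", "the", "is", "next"], len(vocab))
```

Training on text that closely resembles the broadcast, such as earlier shows and the manuscripts for the current one, raises the probability the model assigns to phrases that are likely to be spoken, which is presumably why those sources were chosen.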
The acoustic model represents the re-speakers' voices as phonemes. It
was adapted to each re-speaker for better recognition. Four re-speakers
took turns re-speaking, because "Kohaku Utagassen" was a long show,
over four hours. The system achieved a recognition accuracy of more
than 95% and provided captions with a delay of less than three seconds.
Despite the fact that the captions generated by the speech recognizer
were broadcast as is, including recognition errors, almost all
of the large number of FAX responses received were extremely positive.
Hearing-impaired viewers expressed delight at finally being able
to enjoy the program together with their families. NHK plans to
use the same system for caption broadcasting of the Winter Olympic
Games from Salt Lake City in February 2002.
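The article does not say how the recognition accuracy figure was measured. One common definition, assumed here, is word accuracy: one minus the number of substitutions, deletions and insertions divided by the number of words in the reference transcript. The sketch below computes it with a standard edit-distance recurrence; the function names and the example sentences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # reference word dropped
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(ref, hyp):
    """Word accuracy relative to the reference length (assumed definition)."""
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 1.0 - edit_distance(ref, hyp) / len(ref)


# Hypothetical reference (what was re-spoken) and recognizer output.
reference = "the next singer is ready to perform her new song tonight".split()
hypothesis = "the next singer is ready to perform a new song tonight".split()
accuracy = word_accuracy(reference, hypothesis)
```

Under this definition, an accuracy above 95% corresponds to fewer than one error per twenty reference words.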
Figure 1: Block diagram