No.31  2002/02

First "Kohaku Utagassen" Closed Captioning Service
-Live broadcast closed captioned by the automatic recognition of repeated content-

In December 2001, NHK began caption broadcasting of the program "Kohaku Utagassen". This was the first-ever implementation of superimposed captioning of an entertainment program based on speech recognition technology. "Kohaku Utagassen" is a variety show broadcast every New Year's Eve, featuring many popular singers, comedy and other performances. Since it is the top-rated TV program in Japan, demand for caption broadcasting of this program for hearing-impaired viewers has been substantial.
The 're-speak' strategy was used to create the captions. A 're-speaker', listening to the performer's speech during the program, utters the content of that speech into the speech recognizer, which automatically produces a caption script. The re-speaker can summarize the spoken content if necessary. The features of this strategy are as follows:
  • It can be applied to programs with high levels of background noise because the re-speaker speaks in a silent studio.
  • Filled pauses or sounds indicating hesitation are not, in general, re-spoken, and as such, do not interfere with the speech recognition process.
  • Incomplete sentences, such as those lacking a subject, are common in Japanese conversational speech; the re-speaker supplies the missing subject. This makes the caption easier to understand and improves speech recognition accuracy.

The speech recognizer consists of a language model and an acoustic model. The language model represents the relations among words in a sequence in terms of their probability of occurrence. It was trained on transcriptions of "Kohaku Utagassen" shows broadcast from 1994 through 2000 and of two similar NHK music shows, "Kayou Concert" and "Pop-jam", broadcast from January 2000 through December 2001, in addition to manuscripts for "Kohaku Utagassen" 2001. The contents of the manuscripts were very close to the real show, even though specific expressions often differed.
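As an illustration of what "probability of occurrence" of word sequences means, the following is a minimal sketch of a bigram language model with add-one smoothing. It is not the model NHK used; the article does not specify the model order or smoothing method. The function names and the toy English corpus are assumptions made purely for this example.

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences,
    adding sentence-boundary markers <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + list(words) + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """Estimate P(word | prev) with add-one (Laplace) smoothing
    over the vocabulary observed in training."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

# Hypothetical toy corpus standing in for past show transcriptions.
corpus = [
    ["the", "white", "team", "wins"],
    ["the", "red", "team", "sings"],
    ["the", "red", "team", "wins"],
]
unigrams, bigrams = train_bigram(corpus)

# A word pair seen more often in training gets a higher probability.
p_red = bigram_prob("the", "red", unigrams, bigrams)
p_white = bigram_prob("the", "white", unigrams, bigrams)
print(p_red > p_white)
```

Training on text close to the target program, as described above, raises the probability the model assigns to word sequences likely to be spoken during the show.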
The acoustic model represents the re-speakers' voices in terms of phonemes, the basic units of pronunciation. It was adapted to each re-speaker for better recognition. Because "Kohaku Utagassen" ran for more than four hours, the re-speakers had to take turns. The system achieved a recognition accuracy of more than 95% and provided captions with a delay of less than three seconds.
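The article does not say how recognition accuracy was measured. A common convention, assumed here only for illustration, is word accuracy: one minus the word-level edit distance between the reference and the recognized text, divided by the reference length. The sketch below uses hypothetical function names and placeholder tokens.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance: the minimum number of
    substitutions, deletions and insertions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def word_accuracy(ref, hyp):
    """Word accuracy = 1 - (edit distance / reference length)."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)

# Placeholder tokens: 20 reference words, one of them misrecognized.
reference = [f"w{i}" for i in range(20)]
recognized = list(reference)
recognized[7] = "error"

accuracy = word_accuracy(reference, recognized)
print(f"{accuracy:.2%}")
```

Under this assumed metric, an accuracy above 95% corresponds to fewer than one word error per twenty reference words.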

Although the captions generated by the speech recognizer were broadcast as is, including recognition errors, almost all of the many fax responses received were extremely positive. Hearing-impaired viewers expressed delight at finally being able to enjoy the program together with their families. NHK has just used this system for caption broadcasting of the Winter Olympic Games in Salt Lake City in February 2002.




Copyright 2002 NHK (Japan Broadcasting Corporation) All rights reserved. Unauthorized copy of the pages is prohibited.


STRL Newsletter NHK STRL