Human-friendly Broadcasting Service
Enhancing Closed Captioning
We have developed a prototype system that recognizes speech in program audio containing background noise and music. The system estimates the speech signal separately from the background music and other noise so that it can recognize the spoken words accurately.
Unscripted programs with multiple guests often contain inarticulate speech. The system has automatically built a database from aired programming that includes approximately 1,000 hours of inarticulate speech. Words pronounced inarticulately are recognized by estimating their accurate pronunciations from this database.
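The idea of mapping an inarticulate pronunciation back to an accurate one can be illustrated with a very small sketch. This is not the method used in the prototype, which the text does not describe in detail. It simply assumes that an utterance is already available as a sequence of phoneme symbols and picks the entry in a toy lexicon whose canonical pronunciation is closest by edit distance. The lexicon contents, the phoneme symbols, and the function names `edit_distance` and `closest_word` are all hypothetical.

```python
from typing import Dict, List, Tuple


def edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # delete pa
                            curr[j - 1] + 1,      # insert pb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Hypothetical toy lexicon: word -> canonical phoneme sequence.
LEXICON: Dict[str, List[str]] = {
    "probably": ["p", "r", "aa", "b", "ax", "b", "l", "iy"],
    "properly": ["p", "r", "aa", "p", "er", "l", "iy"],
    "possibly": ["p", "aa", "s", "ax", "b", "l", "iy"],
}


def closest_word(observed: List[str],
                 lexicon: Dict[str, List[str]] = LEXICON) -> Tuple[str, int]:
    """Return the lexicon word nearest to the observed phonemes, with its distance."""
    candidates = ((word, edit_distance(observed, phones))
                  for word, phones in lexicon.items())
    return min(candidates, key=lambda pair: pair[1])


# A slurred "probably" with one consonant dropped.
observed = "p r aa b ax l iy".split()
word, distance = closest_word(observed)
print(word, distance)
```

A system trained on a large database of inarticulate speech would learn which deviations are likely rather than treating every insertion, deletion, and substitution as equally costly, as this sketch does.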
We will continue to improve our speech recognition technology and apply it to programs with complex speech patterns.
*1 Re-speaker: A method where the words spoken by people appearing in a program are rephrased by another speaker (called a re-speaker) for the purpose of generating captions. It makes it possible to produce captions for programs with multiple speakers or large amounts of background noise.
*2 Acoustic model: A model that probabilistically estimates vowels and consonants from audio signals.
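To make "probabilistically estimates" concrete, the following minimal sketch turns per-phoneme scores for a single audio frame into a probability distribution with a softmax. It is only an illustration under stated assumptions: the scores are invented, the phoneme set is a toy one, and `phoneme_posteriors` is a hypothetical helper, not part of the system described here. A real acoustic model would compute such scores from the audio signal itself.

```python
import math
from typing import Dict


def phoneme_posteriors(scores: Dict[str, float]) -> Dict[str, float]:
    """Convert raw per-phoneme scores into probabilities with a softmax."""
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {p: math.exp(s - peak) for p, s in scores.items()}
    total = sum(exps.values())
    return {p: e / total for p, e in exps.items()}


# Invented scores for one frame; higher means a better match to the audio.
frame_scores = {"a": 2.0, "i": 0.5, "k": -1.0, "s": 0.1}

posteriors = phoneme_posteriors(frame_scores)
best = max(posteriors, key=posteriors.get)
print(best, round(posteriors[best], 3))
```

The output is a distribution over vowels and consonants rather than a single hard decision, which lets later stages of a recognizer weigh competing hypotheses.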