
Human-friendly Broadcasting Service


Speech Recognition for Live Captioning of Programs with Inarticulate Speech

Enhancing Closed Captioning


Closed captions of program audio are essential aids for viewers, especially the hearing impaired, but conventional methods require a “re-speaker*1” to create accurate captions, especially when content includes expressive and unscripted speech. We are working on applying speech recognition technology to produce live closed captions directly from program audio, developing ways to reduce background noise and improve the recognition of inarticulate speech.


●Reducing background noise

We have developed a prototype system that recognizes speech in program audio containing background noise and music. The system separates the speech signal from background music and other noise so that it can accurately recognize spoken words.
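The exhibit does not specify the separation method, but the idea of estimating the speech signal apart from background noise can be illustrated with a classic spectral-subtraction sketch: an average noise spectrum is estimated from a noise-only clip and subtracted from each frame of the noisy signal. The function name and parameters below are illustrative assumptions, not NHK's implementation.

```python
import numpy as np

def reduce_noise(signal, noise_clip, frame_len=512, hop=256, floor=0.1):
    """Illustrative spectral-subtraction noise reduction.

    Estimates an average magnitude spectrum from `noise_clip` (audio assumed
    to contain only background noise/music) and subtracts it from each frame
    of `signal`, keeping a small spectral floor to limit artifacts.
    """
    window = np.hanning(frame_len)
    # Average magnitude spectrum of the noise-only clip.
    noise_frames = [np.abs(np.fft.rfft(noise_clip[i:i + frame_len] * window))
                    for i in range(0, len(noise_clip) - frame_len + 1, hop)]
    noise_mag = np.mean(noise_frames, axis=0)

    out = np.zeros(len(signal))
    norm = np.zeros(len(signal))
    for i in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[i:i + frame_len] * window
        spec = np.fft.rfft(frame)
        mag, phase = np.abs(spec), np.angle(spec)
        # Subtract the noise estimate; clamp to a fraction of the original.
        clean_mag = np.maximum(mag - noise_mag, floor * mag)
        clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len)
        # Weighted overlap-add reconstruction.
        out[i:i + frame_len] += clean * window
        norm[i:i + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-8)
```

Real broadcast systems use far more sophisticated source-separation models, but the frame-wise "estimate the interference, then suppress it" structure is the same.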

●Recognizing inaccurate pronunciation

Unscripted programs with multiple guests are often full of inaccurately pronounced speech. The system has automatically built a database of approximately 1,000 hours of such speech from aired programs. Inarticulately pronounced words are recognized by estimating their accurate pronunciations from this database.

Future plans

We will continue to improve our speech recognition technology and apply it to programs with complex speech patterns.

*1 Re-speaker: A method in which the words spoken by people appearing in a program are rephrased by another speaker (called a re-speaker) for the purpose of generating captions. It makes it possible to produce captions for programs with multiple speakers or large amounts of background noise.

Speech recognition technology that works out inarticulate and complex speech audio