Automated Audio Descriptions for Live Sports Programs

A timing determination method for the insertion of audio descriptions

NHK STRL is conducting research on automated audio descriptions (AADs) to help visually impaired people enjoy live TV programs. In a live sports program, AADs fill the pauses between the announcer's comments with spoken information that the live commentary does not convey, such as the game score displayed on the screen (Fig. 1).
This technology uses real-time game data (describing, for example, who did what and when) that were originally generated for sports-related websites. A speech synthesizer converts these data into audio descriptions, which are delivered over the internet separately from the broadcast audio. In this way, people who have difficulty obtaining information from images can follow the basic progress of a sporting event in real time, without the broadcaster having to present the information verbally.
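The step of turning a game-data record into description text can be sketched as simple template filling. This is a minimal sketch under stated assumptions: the `GameEvent` fields and the English sentence template are hypothetical (the actual data format is not described here, and the real system targets Japanese), and the speech synthesis step that would voice the resulting text is omitted.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class GameEvent:
    """One record of real-time game data (hypothetical schema)."""
    game_clock: str          # when, e.g. "Game 3"
    player: str              # who
    action: str              # what, e.g. "wins the point"
    score: Tuple[int, int]   # score after the event (player A, player B)


def describe(event: GameEvent) -> str:
    """Fill a fixed (hypothetical) sentence template with the event fields."""
    a, b = event.score
    return f"{event.game_clock}: {event.player} {event.action}. The score is {a} to {b}."


event = GameEvent("Game 3", "Player A", "wins the point", (7, 5))
print(describe(event))  # Game 3: Player A wins the point. The score is 7 to 5.
```

In a full pipeline, the returned string would be passed to a speech synthesizer and the resulting audio streamed separately from the broadcast sound.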

Listen to AAD service example: table tennis
(A synthesized female voice delivers audio descriptions)

Note: Our AAD technology is designed for the Japanese language; this example demonstrates how the AAD would sound in English.

Technical Features

Because both the audio descriptions and the announcer's commentary need to be clearly audible, the system needs a timing determination method that inserts audio descriptions at appropriate moments without overlapping the commentary. We therefore focused on the fact that, in Japanese, the fundamental frequency (F0) of an announcer's commentary tends to fall from the start to the end of an utterance (Fig. 2).
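One simple way to quantify this falling tendency is the least-squares slope of the F0 contour over voiced frames; a negative slope means the pitch is falling. The sketch below assumes the F0 contour has already been extracted by a pitch tracker (not shown), uses a hypothetical helper `f0_slope`, marks unvoiced frames with 0, and runs on a synthetic contour rather than real commentary.

```python
from typing import Sequence


def f0_slope(times: Sequence[float], f0: Sequence[float]) -> float:
    """Least-squares slope (Hz per second) of an F0 contour.

    Unvoiced frames (F0 == 0) are ignored. Raises ValueError if fewer
    than two voiced frames are available.
    """
    pts = [(t, f) for t, f in zip(times, f0) if f > 0]
    if len(pts) < 2:
        raise ValueError("need at least two voiced frames")
    n = len(pts)
    mean_t = sum(t for t, _ in pts) / n
    mean_f = sum(f for _, f in pts) / n
    cov = sum((t - mean_t) * (f - mean_f) for t, f in pts)
    var = sum((t - mean_t) ** 2 for t, _ in pts)
    return cov / var


# Synthetic contour sampled every 100 ms; the third frame is unvoiced.
times = [i * 0.1 for i in range(6)]
f0 = [220, 215, 0, 205, 200, 195]
print(f"{f0_slope(times, f0):.1f} Hz/s")  # -50.0 Hz/s
```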

Based on this tendency, we built an AAD system that predicts the end of each utterance in the commentary from the change in F0 and inserts an audio description immediately after the utterance.

Fig. 2: Predicting audio description insertion timing by looking at the fundamental frequency of live commentary
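The insertion-timing idea can be illustrated with a streaming heuristic. The exact prediction algorithm is not given here, so this is a hypothetical sketch, not NHK's method: the reference pitch is the mean F0 of the first few voiced frames of an utterance; once F0 drops below a fraction of that reference, the utterance is predicted to be ending ("armed"); the first short pause after that point is taken as the insertion point. The function name `find_insertion_frame` and all parameter values are illustrative assumptions.

```python
from typing import List, Optional, Sequence


def find_insertion_frame(
    f0: Sequence[float],
    onset_frames: int = 5,
    drop_ratio: float = 0.8,
    min_pause_frames: int = 3,
) -> Optional[int]:
    """Return the frame index at which an audio description could start, or None.

    f0 holds one F0 value (Hz) per analysis frame; 0 marks an unvoiced or
    silent frame. Hypothetical heuristic, parameters chosen for illustration.
    """
    onset: List[float] = []   # F0 values of the first voiced frames
    reference = None          # mean onset F0, fixed once enough frames are seen
    armed = False             # True once F0 has fallen enough to predict the end
    pause = 0                 # length of the current run of unvoiced frames
    for i, value in enumerate(f0):
        if value > 0:
            pause = 0
            if reference is None:
                onset.append(value)
                if len(onset) == onset_frames:
                    reference = sum(onset) / onset_frames
            elif value < drop_ratio * reference:
                armed = True
        else:
            pause += 1
            if armed and pause >= min_pause_frames:
                return i + 1  # first frame after the short pause
    return None


# Flat onset at 200 Hz, falling pitch, then silence.
contour = [200] * 5 + [190, 180, 170, 150, 140] + [0] * 5
print(find_insertion_frame(contour))  # 13
```

Multiplying the returned index by the frame period (for example, an assumed 10 ms) gives the insertion time. The design point of arming first is that a very short pause threshold can then be used: a pause that follows a clear pitch fall is more likely to mark the end of an utterance than a breath in mid-sentence, so the description can start with little delay.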

We asked visually impaired people to evaluate how easy it was to listen to both the commentary and the AADs, and confirmed that the method was effective in making them easier to listen to.
We plan to conduct trials at sporting events to further improve the system. The AAD system can be used not only for live sports programs but also for other types of programs, giving it a wide range of applications.
Whether this AAD technology can be applied to other languages depends on each language's intonation characteristics. To predict the end of an utterance, F0 may be as effective for some languages as it is for Japanese, whereas voice loudness may be more effective for others.
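For a language where loudness is the better cue, a similar "utterance is ending" signal could be derived from frame-level signal level instead of F0. The sketch below is hypothetical: `frame_rms_db` and `first_falling_frame` are illustrative helpers, the input is assumed to be mono samples scaled to [-1, 1], and the 6 dB drop and -60 dBFS silence floor are assumed values, not figures from this research.

```python
import math
from typing import List, Optional, Sequence


def frame_rms_db(samples: Sequence[float], frame_len: int) -> List[float]:
    """RMS level (dBFS) of each non-overlapping frame; -inf for digital silence."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        levels.append(20 * math.log10(rms) if rms > 0 else float("-inf"))
    return levels


def first_falling_frame(
    levels_db: Sequence[float],
    onset_frames: int = 3,
    drop_db: float = 6.0,
    silence_db: float = -60.0,
) -> Optional[int]:
    """Index of the first active frame at least drop_db below the onset level.

    Frames at or below silence_db count as pauses and are skipped.
    """
    active = [(i, lv) for i, lv in enumerate(levels_db) if lv > silence_db]
    if len(active) <= onset_frames:
        return None
    reference = sum(lv for _, lv in active[:onset_frames]) / onset_frames
    for i, lv in active[onset_frames:]:
        if lv <= reference - drop_db:
            return i
    return None


# Three loud frames, one quieter frame, then silence (4 samples per frame).
samples = [0.5, -0.5] * 6 + [0.2, -0.2] * 2 + [0.0] * 4
print(first_falling_frame(frame_rms_db(samples, 4)))  # 3
```

The falling-loudness frame would play the same role as the F0 fall in the Japanese system, arming the detector before a pause is accepted as the end of an utterance.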