4.3  Speech transcription technology

  The transcription of speech in video footage is indispensable for program production. A system that produces transcriptions efficiently is needed to deliver accurate programs to viewers more swiftly. In FY 2018, we worked to increase the accuracy of speech recognition with the aim of realizing a transcription production system based on speech recognition, and we began developing a transcription interface that can handle the transcription of live content in addition to recorded video footage.


Speech recognition technology for transcription assistance

  When the speech of a broadcast program is recognized for closed captioning, word sequences are usually estimated using only the speech as input. However, it is difficult to recognize the conversational speech in video footage targeted for transcription assistance from the audio information alone, because the speech is not always recorded under favorable recording conditions and contains many informal phrases. To address this problem, we researched a recognition technique that uses video information in addition to audio information, exploiting the fact that footage used by broadcasters also contains pictures. We developed a new method that utilizes the intermediate-layer features of a deep neural network (DNN) trained for object recognition in images, together with video captions generated by DNNs, for training a language model that represents sequences of multiple words. This method improved the accuracy of the language model(1).
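  The following is a minimal sketch of this kind of image-conditioned language model, written in PyTorch for illustration only; it is not the model described above, and the layer sizes, the source of the intermediate-layer feature, and the fusion by concatenation are all assumptions.

# Illustrative sketch: a language model whose next-word prediction is
# conditioned on an image feature vector, assumed to come from an
# intermediate layer of a pretrained object-recognition CNN.
import torch
import torch.nn as nn

class ImageConditionedLM(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256, image_feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Project the CNN intermediate-layer feature into the word-embedding space.
        self.image_proj = nn.Linear(image_feat_dim, embed_dim)
        # The image feature is concatenated with each word embedding.
        self.lstm = nn.LSTM(embed_dim * 2, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids, image_feat):
        # word_ids: (batch, seq_len); image_feat: (batch, image_feat_dim)
        w = self.embed(word_ids)                         # (B, T, E)
        v = self.image_proj(image_feat).unsqueeze(1)     # (B, 1, E)
        v = v.expand(-1, w.size(1), -1)                  # repeat over time steps
        h, _ = self.lstm(torch.cat([w, v], dim=-1))
        return self.out(h)                               # next-word logits

# Toy usage with random word IDs and a random "CNN feature" vector.
model = ImageConditionedLM()
logits = model(torch.randint(0, 10000, (2, 12)), torch.randn(2, 2048))
print(logits.shape)  # torch.Size([2, 12, 10000])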
  Since speech in video footage contains unclear sentence structures as well as many hesitations and repetitions, the recognition result lacks readability if displayed as is. It is therefore necessary to rewrite the speech recognition result in various ways, such as by inserting appropriate punctuation marks and deleting unnecessary words, but rule-based automatic formatting does not improve readability sufficiently. To address this problem, we constructed a sentence formatting model that formats speech recognition results while correcting them, using encoder-decoder networks based on DNNs. Using the closed captions of broadcast programs as the correct answer, we conducted experiments to see how much this model reduces the rate of disagreement with the closed captions(2). We also conducted comparative experiments using training data with different rates of disagreement with the closed captions, and the results provided criteria for selecting training data suited for sentence formatting.
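  As an illustration of this encoder-decoder approach, the sketch below shows a toy sequence-to-sequence formatter trained with closed captions as the reference output; it is a simplified GRU-based stand-in, and the vocabulary size, dimensions, and training setup are assumptions rather than those of our actual model.

# Illustrative sketch: an encoder-decoder that maps a raw ASR word sequence
# to a formatted sequence (punctuation inserted, fillers removed).
import torch
import torch.nn as nn

class Seq2SeqFormatter(nn.Module):
    def __init__(self, vocab_size=8000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.src_embed = nn.Embedding(vocab_size, embed_dim)
        self.tgt_embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # src_ids: raw recognition result; tgt_ids: formatted reference,
        # shifted right for teacher forcing during training.
        _, h = self.encoder(self.src_embed(src_ids))   # final encoder state
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), h)
        return self.out(dec_out)                       # logits per target position

# Toy training step: the closed caption serves as the "correct answer".
model = Seq2SeqFormatter()
src = torch.randint(0, 8000, (4, 30))   # e.g. "er today uh the weather is ..."
tgt = torch.randint(0, 8000, (4, 25))   # e.g. "Today, the weather is ..."
loss = nn.CrossEntropyLoss()(model(src, tgt[:, :-1]).transpose(1, 2), tgt[:, 1:])
loss.backward()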
  Footage targeted for transcription also contains telephone speech, for which there is strong demand for transcription assistance, as with other materials. However, it was difficult to recognize telephone speech with the speech recognition system we previously developed because the frequency bandwidth of telephone speech materials is restricted. We therefore began an effort to convert our existing training speech data into simulated telephone speech and use it for training an acoustic model for telephone speech.
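  One common way to simulate telephone speech from wideband training data is to downsample it and band-limit it to the telephone passband, as sketched below; the sampling rates and filter settings are illustrative assumptions, not the exact channel simulation used in our system.

# Illustrative sketch: convert wideband speech into telephone-bandwidth speech.
import numpy as np
from scipy.signal import butter, sosfilt, resample_poly

def simulate_telephone(waveform, orig_sr=16000, phone_sr=8000):
    # Downsample wideband speech (e.g. 16 kHz) to the telephone sampling rate.
    narrow = resample_poly(waveform, up=phone_sr, down=orig_sr)
    # Band-limit to an approximate telephone channel (300-3400 Hz).
    sos = butter(4, [300, 3400], btype='bandpass', fs=phone_sr, output='sos')
    return sosfilt(sos, narrow)

# Toy usage with one second of random "speech".
telephone_like = simulate_telephone(np.random.randn(16000))
print(telephone_like.shape)  # (8000,)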


Transcription interface

  We continued to develop an interface that allows the user to refer to speech recognition results efficiently and to modify recognition errors as necessary with minimal operations. In FY 2017, we developed a modification interface for recorded content. In FY 2018, we developed a real-time transcription system for live content such as live broadcast programs and transmitted footage(3)(4).
  For the prompt transcription of live content, this system employs real-time speech recognition processing and HTTP Live Streaming (HLS) delivery technology that enables previewing arbitrary parts of the live content (Figure 4-4). Because the transcription of live content is likely to involve cooperative work by multiple people, the system also has a function to immediately deliver modified parts and corrected characters to all terminals.
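  The sketch below illustrates one way such immediate delivery of corrections to all terminals could be realized, using a WebSocket relay built with the third-party Python websockets library; the message format, host, and port are hypothetical, and this is not the actual implementation of our system.

# Illustrative sketch: relay each correction to every connected terminal.
# Requires the websockets package (version 10.1 or later assumed).
import asyncio
import json
import websockets

CONNECTED = set()

async def handler(websocket):
    # Each transcription terminal opens one WebSocket connection.
    CONNECTED.add(websocket)
    try:
        async for message in websocket:
            # A hypothetical correction message, e.g. {"segment_id": 12, "text": "..."}:
            # relay it to every terminal so all views update at once.
            correction = json.loads(message)
            websockets.broadcast(CONNECTED, json.dumps(correction))
    finally:
        CONNECTED.discard(websocket)

async def main():
    async with websockets.serve(handler, "localhost", 8765):
        await asyncio.Future()  # run until interrupted

if __name__ == "__main__":
    asyncio.run(main())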
  To verify the effectiveness of this system, we introduced two sets of the system at news program production sites where live content is frequently transcribed and began verification experiments.



Figure 4-4. Real-time transcription system

 

[References]
(1) A. Hagiwara, H. Ito, M. Ichiki, T. Kobayakawa, T. Mishima and S. Sato: "Language model utilizing image features for automatic speech recognition," Autumn Meeting of the Acoustical Society of Japan, 2-Q-13, pp. 1071-1074 (2018) (in Japanese)
(2) H. Ito, A. Hagiwara, T. Kobayakawa, T. Mishima, A. Kobayashi and S. Sato: "Transcription formation using encoder-decoder network," Autumn Meeting of the Acoustical Society of Japan, 2-Q-13, pp. 1021-1022 (2018) (in Japanese)
(3) T. Mishima, M. Ichiki, A. Hagiwara, H. Ito, T. Kobayakawa and S. Sato: "A Real-Time Transcription System Using Speech Recognition," ITE Annual Convention, 21D-4 (2018) (in Japanese)
(4) A. Hagiwara, H. Ito, T. Kobayakawa, T. Mishima and S. Sato: "Development of Transcription System Using Speech Recognition for Video Footage," IPSJ SIG Technical Report, Vol. 2018-SLP-124, No. 5, pp. 1-6 (2018) (in Japanese)