English Abstract
This paper addresses the problems caused by multiple-speaker situations that frequently arise in meetings, with the aim of improving automatic speech recognition (ASR). Speakers in such situations talk in widely varying ways: people do not strictly follow grammar when speaking, tend to stutter, and often use personal idioms and idiosyncratic expressions. Nevertheless, the existing language models employed in automatic transcription of meeting recordings rarely account for these facts; instead, they assume that all speakers participating in a meeting share the same speaking style and word-usage behavior. Consequently, a single language model is built from the manual transcripts of utterances compiled holistically from multiple speakers. To relax this assumption, we augment both the training phase and the prediction phase of language modeling with additional information cues that accommodate speaker-related characteristics, through speaker adaptation of the language model. To this end, two disparate scenarios for the prediction phase, 'known speakers' and 'unknown speakers,' are considered when developing methods to extract speaker-related information cues that aid the training of language models. Extensive experiments on automatic transcription of Mandarin and English meeting recordings show that the proposed language models, together with the different speaker-adaptation mechanisms, achieve good performance gains over the baseline neural-network-based language model compared in this study.
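To make the speaker-adaptation idea concrete, the sketch below shows one simple way a speaker-related cue can be injected into a neural language model: conditioning each prediction step on a learned speaker embedding, which corresponds to the 'known speakers' scenario. This is only an illustrative assumption under a PyTorch setting; the abstract does not commit to this exact architecture, and names such as SpeakerAdaptedLM are hypothetical. In the 'unknown speakers' scenario, the per-speaker embedding table would be replaced by a representation inferred from the utterance itself.

```python
# Minimal sketch of a speaker-adapted neural language model (PyTorch).
# Assumption: speaker conditioning is done by concatenating a learned
# speaker embedding to each word embedding; the paper's actual
# adaptation mechanisms may differ.
import torch
import torch.nn as nn

class SpeakerAdaptedLM(nn.Module):  # hypothetical name, for illustration
    def __init__(self, vocab_size, num_speakers,
                 word_dim=128, spk_dim=32, hidden=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        # One embedding per meeting participant ("known speakers" case).
        self.spk_emb = nn.Embedding(num_speakers, spk_dim)
        self.rnn = nn.LSTM(word_dim + spk_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, word_ids, speaker_ids):
        # word_ids: (batch, seq_len); speaker_ids: (batch,)
        w = self.word_emb(word_ids)                   # (B, T, word_dim)
        s = self.spk_emb(speaker_ids)                 # (B, spk_dim)
        s = s.unsqueeze(1).expand(-1, w.size(1), -1)  # repeat over time steps
        h, _ = self.rnn(torch.cat([w, s], dim=-1))    # speaker-conditioned states
        return self.out(h)                            # next-word logits

# Usage: one speaker id per utterance conditions every prediction step.
model = SpeakerAdaptedLM(vocab_size=10000, num_speakers=8)
logits = model(torch.randint(0, 10000, (4, 20)), torch.randint(0, 8, (4,)))
```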