Chinese Abstract
In recent years, the Lattice-free Maximum Mutual Information (LF-MMI) objective for discriminative training has led to major breakthroughs in automatic speech recognition (ASR). Although LF-MMI achieves the best results in the supervised setting, in the semi-supervised setting the seed model often performs poorly because the labeled data are limited; moreover, since LF-MMI is a discriminative criterion, it is sensitive to the correctness of the transcripts. This thesis applies two lines of thought to semi-supervised training. First, we introduce negative conditional entropy (NCE) weighting and word lattices. The former minimizes the conditional entropy over lattice paths, which is equivalent to taking a weighted average of the MMI objective over possible reference transcripts; this reweighting integrates naturally into MMI training while modeling uncertainty, with the aim of training the model without a confidence-based filter. The latter uses word lattices as supervision, which preserves a larger hypothesis space than the previously used one-best results and thus increases the possibility of finding the reference transcript. Second, drawing on the concept of ensemble learning, we let weak learners correct one another's errors, via hypothesis-level combination and frame-level combination. Experimental results show that both NCE weighting and lattice supervision reduce the word error rate (WER), that model combination significantly improves performance at every stage, and that combining the two achieves a WER recovery rate (WRR) of 60.8%.
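The equivalence between NCE weighting and a posterior-weighted average of MMI references can be sketched as follows. The notation (X_u for the acoustic features of an unlabeled utterance u, W for a word-sequence hypothesis in its decoded lattice, and P(W | X_u) for the lattice posterior under the current model) is assumed here for illustration and is not taken verbatim from the thesis.

\begin{align*}
\mathcal{F}_{\mathrm{MMI}}(W; X_u) &= \log P(W \mid X_u)
  = \log \frac{p(X_u \mid W)\,P(W)}{\sum_{W'} p(X_u \mid W')\,P(W')},\\
\mathcal{F}_{\mathrm{NCE}}(X_u) &= -H(W \mid X_u)
  = \sum_{W} P(W \mid X_u)\,\log P(W \mid X_u)
  = \sum_{W} P(W \mid X_u)\,\mathcal{F}_{\mathrm{MMI}}(W; X_u).
\end{align*}

Maximizing \mathcal{F}_{\mathrm{NCE}} thus amounts to running MMI with every lattice path treated as a candidate reference transcript, each weighted by its posterior, which is the "weighted average" interpretation used above.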
English Abstract
In recent years, the so-called Lattice-free Maximum Mutual Information (LF-MMI) criterion has been proposed and applied with good success to supervised training of state-of-the-art acoustic models in various automatic speech recognition (ASR) applications. However, when moving to the scenario of semi-supervised acoustic model training, the seed models used for LF-MMI often show inadequate competence due to the limited amount of manually labeled training data. Furthermore, LF-MMI shares a common deficiency of discriminative training criteria: it is sensitive to the accuracy of the transcripts of the training utterances. This thesis sets out to explore two novel extensions of semi-supervised training in conjunction with LF-MMI. First, we capitalize on negative conditional entropy (NCE) weighting and utilize word lattices for supervision in the semi-supervised setting. The former aims to minimize the conditional entropy of a lattice, which is equivalent to a weighted average of the MMI objective over all possible reference transcripts; minimizing the lattice entropy is thus a natural extension of the MMI objective that also models uncertainty, so that no confidence-based filtering is required. The latter preserves more cues in the hypothesis space by using word lattices instead of one-best results as supervision, thereby increasing the possibility of recovering the reference transcripts of the training utterances. Second, we draw on notions from ensemble learning to develop two disparate combination methods, namely hypothesis-level combination and frame-level combination, so that the acoustic models can correct one another's errors. The experimental results on a meeting transcription task show that the addition of NCE weighting, as well as the utilization of word lattices for supervision, can significantly reduce the word error rate (WER) of the ASR system, while the model combination approaches can also considerably improve the performance at various stages. Finally, the fusion of the aforementioned two kinds of extensions achieves a WER recovery rate (WRR) of 60.8%.
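As a minimal illustration of the frame-level combination idea mentioned above, the sketch below averages the per-frame output posteriors of several acoustic models before they are used to derive supervision for untranscribed data. All names here (frame_level_combination, per_model_log_posteriors, and so on) are illustrative assumptions rather than the implementation used in the thesis.

import numpy as np

def frame_level_combination(per_model_log_posteriors, weights=None):
    """Combine per-frame log-posteriors from several acoustic models.

    per_model_log_posteriors: list of arrays, each of shape (T, S),
        holding log P(state | frame) for T frames and S pdf-ids.
    weights: optional per-model interpolation weights (default: uniform).

    Returns an array of shape (T, S) with the combined log-posteriors,
    which could then be used to build supervision for semi-supervised
    LF-MMI training on the untranscribed utterances.
    """
    stacked = np.stack(per_model_log_posteriors)           # (M, T, S)
    if weights is None:
        weights = np.full(len(per_model_log_posteriors),
                          1.0 / len(per_model_log_posteriors))
    weights = np.asarray(weights).reshape(-1, 1, 1)        # (M, 1, 1)
    # Weighted average in the probability domain: log-sum-exp of the
    # log-posteriors shifted by the log-weights.
    combined = np.logaddexp.reduce(stacked + np.log(weights), axis=0)
    # Renormalize so that each frame's posteriors sum to one.
    combined -= np.logaddexp.reduce(combined, axis=1, keepdims=True)
    return combined

Hypothesis-level combination, by contrast, would merge the decoded outputs (one-best transcripts or lattices) of the individual models rather than their frame posteriors; the choice between the two trades off granularity against the cost of decoding with every model.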