Title
結合鑑別式訓練與模型合併於半監督式語音辨識之研究
English Title
Leveraging Discriminative Training and Model Combination for Semi-supervised Speech Recognition
Authors: 羅天宏 (Tien-Hong Lo), 陳柏琳 (Berlin Chen)
Abstract (translated from the Chinese)
In recent years, the discriminative training objective Lattice-Free Maximum Mutual Information (LF-MMI) has achieved major breakthroughs in automatic speech recognition (ASR). Although LF-MMI attains state-of-the-art results in the supervised setting, in the semi-supervised setting the seed model often performs poorly because labeled data is limited; moreover, since LF-MMI is a discriminative criterion, it is sensitive to the correctness of the training transcripts. This paper pursues two lines of attack for semi-supervised training. First, we introduce negative conditional entropy (NCE) weighting and word lattices. NCE weighting minimizes the conditional entropy over lattice paths, which is equivalent to a weighted average over the MMI reference transcripts; the weights integrate naturally into MMI training while modeling transcript uncertainty, so that the model can be trained even without a confidence-based filter. Using word lattices as supervision, rather than only the one-best recognition result, preserves a larger hypothesis space and thereby increases the chance of covering the reference transcript. Second, borrowing from ensemble learning, we let weak learners correct one another's errors through hypothesis-level combination and frame-level combination. Experimental results show that both NCE weighting and lattice supervision reduce the word error rate (WER), that model combination yields significant gains at every stage, and that combining the two achieves a WER recovery rate (WRR) of 60.8%.
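The frame-level combination described in the abstract above amounts to interpolating per-frame acoustic-state posteriors across several models. The paper's exact interpolation scheme is not given here, so the function below is only a generic sketch (NumPy, uniform weights by default); the variable names are illustrative, not from the paper.

```python
import numpy as np

def frame_level_combine(posteriors, weights=None):
    """Combine per-frame state posteriors from several acoustic models.

    posteriors: list of (num_frames, num_states) arrays, one per model,
                all defined over the same frame/state grid.
    weights:    optional per-model interpolation weights (default: uniform).
    """
    stacked = np.stack(posteriors)                      # (num_models, T, S)
    if weights is None:
        weights = np.full(len(posteriors), 1.0 / len(posteriors))
    combined = np.tensordot(weights, stacked, axes=1)   # weighted sum -> (T, S)
    # Renormalize so each frame is still a valid probability distribution.
    return combined / combined.sum(axis=1, keepdims=True)
```

With uniform weights this reduces to simple posterior averaging; a real system might instead combine in the log domain or tune the weights on held-out data.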
English Abstract
In recent years, the Lattice-Free Maximum Mutual Information (LF-MMI) criterion has been applied with good success to supervised training of state-of-the-art acoustic models in various automatic speech recognition (ASR) applications. However, when moving to the scenario of semi-supervised acoustic model training, the seed models for LF-MMI often show inadequate competence due to the limited amount of manually labeled training data; LF-MMI also shares a common deficiency of discriminative training criteria, namely sensitivity to the accuracy of the transcripts of the training utterances. This paper sets out to explore two novel extensions of semi-supervised training in conjunction with LF-MMI. First, we capitalize on negative conditional entropy (NCE) weighting and utilize word lattices for supervision. The former minimizes the conditional entropy of a lattice, which is equivalent to a weighted average over all possible reference transcripts; minimizing the lattice entropy is a natural extension of the MMI objective that also models transcript uncertainty. The latter preserves more cues in the hypothesis space by using word lattices instead of one-best results, increasing the possibility of recovering the reference transcripts of the training utterances. Second, we draw on notions from ensemble learning to develop two distinct combination methods, namely hypothesis-level combination and frame-level combination, which enhance the error-correcting capability of the acoustic models. Experimental results on a meeting transcription task show that both NCE weighting and word-lattice supervision significantly reduce the word error rate (WER) of the ASR system, while the model combination approaches considerably improve performance at various stages. Finally, fusing the two kinds of extensions achieves a WER recovery rate (WRR) of 60.8%.
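The WER recovery rate quoted above is conventionally defined as the fraction of the gap between the seed model (labeled data only) and the oracle model (all transcripts available) that semi-supervised training closes. A minimal sketch, with hypothetical WER values chosen only for illustration (not taken from the paper):

```python
def wer_recovery_rate(wer_seed, wer_semi, wer_oracle):
    """WRR = (WER_seed - WER_semi) / (WER_seed - WER_oracle).

    wer_seed:   WER of the seed model trained on the labeled subset only.
    wer_semi:   WER after semi-supervised training on untranscribed data.
    wer_oracle: WER of a model trained with all transcripts available.
    """
    return (wer_seed - wer_semi) / (wer_seed - wer_oracle)

# Hypothetical example: a seed model at 30.0% WER and an oracle at 20.0%
# imply that a semi-supervised model at 23.92% recovers 60.8% of the gap.
```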
Pages: 19-34
Keywords: Automatic Speech Recognition; Discriminative Training; Semi-supervised Training; Model Combination; LF-MMI
Journal: 中文計算語言學期刊 (International Journal of Computational Linguistics & Chinese Language Processing)
Issue: December 2018 (Vol. 23, No. 2)
Publisher: The Association for Computational Linguistics and Chinese Language Processing (中華民國計算語言學學會)
Previous article in this issue: 使用長短期記憶類神經網路建構中文語音辨識器之研究 (A Study on Mandarin Speech Recognition Using Long Short-Term Memory Neural Networks)
Next article in this issue: 結合鑑別式訓練聲學模型之類神經網路架構及優化方法的改進 (Improvements on Neural Network Architectures and Optimization Methods for Discriminatively Trained Acoustic Models)
 
