Chinese Abstract
"近年來,電腦輔助發音訓練(Computer assisted pronunciation training, CAPT)系統的需求日益上升。然而,現階段基於端對端(End-to-End)類神經網路架構之系統在錯誤發音檢測(Mispronunciation detection)的效能仍未臻完美,其原因是此類系統的內部模型本質上仍是屬於自動語音辨識(Automatic speech recognition, ASR)模型。ASR目的是儘量正確地辨識出語者所說內容,縱使其發音是有偏誤的;而CAPT目的恰巧相反,是要能儘量正確地偵測出語者的錯誤發音。有鑒於此,本論文基於CAPT任務通常會有文本提示的特殊性,嘗試將文本提示資訊融入於端對端模型架構。我們研究使用兩個編碼器(Encoders)分別處理發音特徵以及文本特徵,並以分層式注意力機制(Hierarchical attention mechanism, HAN)來動態地結合不同編碼器產生特徵表示。本論文在一套華語學習者語料庫進行一系列實驗;透過不同評估準則所獲得結果顯示,我們所提出的方法較現有方法有較佳的錯誤發音檢測效能。" |
English Abstract
In recent years, there has been a growing demand for computer assisted pronunciation training (CAPT) systems, which can be leveraged to automatically assess the pronunciation quality of L2 learners. However, current CAPT systems built on end-to-end (E2E) neural network architectures still fall short of expectations in detecting mispronunciations. This is partly because most of their model components are designed and optimized for automatic speech recognition (ASR) rather than specifically tailored for CAPT. Unlike ASR, which aims to recognize the utterance of a given speaker (even when poorly pronounced) as correctly as possible, CAPT instead seeks to detect pronunciation errors as accurately as possible. In view of this, we develop an E2E neural CAPT method that employs two disparate encoders to generate embeddings of an L2 speaker's test utterance and of the canonical pronunciations in the corresponding text prompt, respectively. The outputs of the two encoders are fed into a decoder through a hierarchical attention mechanism (HAM), so as to enable the decoder to focus more on detecting mispronunciations. A series of experiments conducted on an L2 Mandarin Chinese speech corpus demonstrates the effectiveness of our method in terms of different evaluation metrics, compared with several state-of-the-art E2E neural CAPT methods.
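The abstract describes, at a high level, a two-encoder architecture whose outputs are fused by hierarchical attention before reaching the decoder. The exact formulation is not given here, so the following is only a minimal NumPy sketch of the general idea, assuming simple dot-product attention: a decoder query first attends within each encoder's output separately, and a second attention layer then weighs the two resulting context vectors against each other, letting the decoder dynamically balance acoustic evidence against the canonical text prompt.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys):
    """Dot-product attention: weight each key vector by its match to the
    query and return the weighted sum (context vector) and the weights."""
    scores = keys @ query              # (T,)
    weights = softmax(scores)          # (T,), sums to 1
    return weights @ keys, weights     # context: (d,)

def hierarchical_attention(query, speech_enc, text_enc):
    """Two-level (hierarchical) attention, as a sketch of the HAM idea:
    first attend within each encoder's output sequence, then attend over
    the two resulting context vectors to fuse them dynamically."""
    c_speech, _ = attend(query, speech_enc)      # speech encoder context
    c_text, _ = attend(query, text_enc)          # text-prompt encoder context
    contexts = np.stack([c_speech, c_text])      # (2, d)
    fused, level2_weights = attend(query, contexts)
    return fused, level2_weights
```

Here `speech_enc` (shape `(T_speech, d)`) and `text_enc` (shape `(T_text, d)`) stand in for the two encoders' output sequences, and `query` for a decoder state; all names are hypothetical. The second-level weights indicate, per decoding step, how much the decoder relies on the observed speech versus the canonical pronunciation sequence.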