Abstract (Chinese)
"仰賴深度學習(Deep Learning)的發展,近年來許多研究發現相位(Phase)資訊在語音增強(Speech Enhancement, SE)中至關重要。亦有學者發現,透過時域單通道語音增強技術,可以有效地去除雜訊,進而顯著提升語音辨識的精確度。啟發於此,本研究從時域及頻域面分別探討兩種考慮相位資訊的語音增強技術,並提出多視角注意力機制語音增強模型、融合時域及頻域兩者特徵運用於語音辨識中。我們藉由Aishell-1中文語料庫評估這些語音增強技術,透過使用各種雜訊源,模擬不同的雜訊狀態作為訓練及測試,進而驗證所提出的新方法皆優於基於其他時域及頻域的方法。具體而言,當測試於訊噪比為-5dB、5dB、15dB的三種環境下,使用新提出之方法中重新訓練(Retraining)之聲學模型(Acoustic Model, AM),與基於時域的方法相比較,在已知雜訊的測試集,分別使相對字錯誤率下降3.4%、2.5%及1.6%;而在未知雜訊的測試集,則使相對字錯誤率分別下降了3.8%、4.8%及2.2%。" |
Abstract (English)
Recently, many studies have found that phase information is crucial in Speech Enhancement (SE), and time-domain single-channel speech enhancement techniques have proven effective for noise suppression and robust Automatic Speech Recognition (ASR). Inspired by this, this research investigates two recently proposed SE methods that consider phase information in the time domain and the frequency domain of speech signals, respectively. Going one step further, we propose a novel multi-view attention-based speech enhancement model, which harnesses the synergistic power of the aforementioned time-domain and frequency-domain SE methods and can be applied equally well to robust ASR. To evaluate the effectiveness of our proposed method, we use various noise datasets to create synthetic test data and conduct extensive experiments on the Aishell-1 Mandarin speech corpus. The evaluation results show that our proposed method is superior to current state-of-the-art time-domain and frequency-domain SE methods. Specifically, compared with the time-domain method, our method achieves 3.4%, 2.5% and 1.6% relative character error rate (CER) reductions at three signal-to-noise ratios (SNRs), -5 dB, 5 dB and 15 dB, respectively, on the test set with seen noise types, while the corresponding CER reductions on the test set with unseen noise types are 3.8%, 4.8% and 2.2%, respectively.
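To make the idea of attention-based fusion of two feature views concrete, the following is a minimal numpy sketch, not the thesis's actual model: it assumes each SE front-end produces a per-frame feature matrix, scores each view per frame with a (here fixed, in practice learnable) scoring vector, and fuses the views with softmax attention weights. All function and variable names are illustrative assumptions.

```python
import numpy as np

def multi_view_attention_fusion(time_feat, freq_feat, w):
    """Fuse two feature views with per-frame attention weights.

    time_feat, freq_feat: (T, D) frame-level features from the
        time-domain and frequency-domain SE front-ends (assumed shapes).
    w: (D,) scoring vector; fixed here for illustration, learnable in practice.
    Returns the fused (T, D) features and the (2, T) attention weights.
    """
    views = np.stack([time_feat, freq_feat])            # (2, T, D)
    scores = views @ w                                  # (2, T): one score per view per frame
    # Softmax over the view axis gives per-frame weights that sum to 1.
    e = np.exp(scores - scores.max(axis=0, keepdims=True))
    alpha = e / e.sum(axis=0, keepdims=True)            # (2, T)
    fused = (alpha[..., None] * views).sum(axis=0)      # (T, D) weighted combination
    return fused, alpha

# Toy usage with random "features" standing in for SE outputs.
rng = np.random.default_rng(0)
t_feat = rng.standard_normal((4, 8))
f_feat = rng.standard_normal((4, 8))
fused, alpha = multi_view_attention_fusion(t_feat, f_feat, np.ones(8))
```

The softmax ensures the model can emphasize whichever view is more reliable frame by frame, which is the intuition behind combining complementary time-domain and frequency-domain representations.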