Abstract (Chinese)
"仰賴深度學習(Deep Learning)的發展,近年來許多研究發現相位(Phase)資訊在語音增強(Speech Enhancement, SE)中至關重要。亦有學者發現,透過時域單通道語音增強技術,可以有效地去除雜訊,進而顯著提升語音辨識的精確度。啟發於此,本研究從時域及頻域面分別探討兩種考慮相位資訊的語音增強技術,並提出多視角注意力機制語音增強模型、融合時域及頻域兩者特徵運用於語音辨識中。我們藉由Aishell-1中文語料庫評估這些語音增強技術,透過使用各種雜訊源,模擬不同的雜訊狀態作為訓練及測試,進而驗證所提出的新方法皆優於基於其他時域及頻域的方法。具體而言,當測試於訊噪比為-5dB、5dB、15dB的三種環境下,使用新提出之方法中重新訓練(Retraining)之聲學模型(Acoustic Model, AM),與基於時域的方法相比較,在已知雜訊的測試集,分別使相對字錯誤率下降3.4%、2.5%及1.6%;而在未知雜訊的測試集,則使相對字錯誤率分別下降了3.8%、4.8%及2.2%。" |
Abstract (English)
Recently, many studies have found that phase information is crucial in Speech Enhancement (SE), and time-domain single-channel speech enhancement techniques have proven effective for noise suppression and robust Automatic Speech Recognition (ASR). Inspired by this, this research investigates two recently proposed SE methods that consider phase information in the time domain and the frequency domain of speech signals, respectively. Going one step further, we propose a novel multi-view attention-based speech enhancement model, which harnesses the synergistic power of the aforementioned time-domain and frequency-domain SE methods and can be applied equally well to robust ASR. To evaluate the effectiveness of our proposed method, we use various noise datasets to create synthetic test data and conduct extensive experiments on the Aishell-1 Mandarin speech corpus. The evaluation results show that our proposed method is superior to current state-of-the-art time-domain and frequency-domain SE methods. Specifically, compared with the time-domain method, our method achieves 3.4%, 2.5% and 1.6% relative character error rate (CER) reductions at three signal-to-noise ratios (SNRs), -5 dB, 5 dB and 15 dB, respectively, on the test set with seen noise types, while the corresponding CER reductions on the test set with unseen noise types are 3.8%, 4.8% and 2.2%, respectively.
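To make the idea of attention-based fusion of two feature views concrete, the following is a minimal numpy sketch, not the thesis's actual model: it assumes each SE front-end produces a per-frame feature matrix, scores each view per frame with a (here fixed, in practice learnable) scoring vector, and fuses the views with softmax attention weights. All function and variable names are illustrative assumptions.

```python
import numpy as np

def multi_view_attention_fusion(time_feat, freq_feat, w):
    """Fuse two feature views with per-frame attention weights.

    time_feat, freq_feat: (T, D) frame-level features from the
        time-domain and frequency-domain SE front-ends (assumed shapes).
    w: (D,) scoring vector; fixed here for illustration, learnable in practice.
    Returns the fused (T, D) features and the (2, T) attention weights.
    """
    views = np.stack([time_feat, freq_feat])            # (2, T, D)
    scores = views @ w                                  # (2, T): one score per view per frame
    # Softmax over the view axis gives per-frame weights that sum to 1.
    e = np.exp(scores - scores.max(axis=0, keepdims=True))
    alpha = e / e.sum(axis=0, keepdims=True)            # (2, T)
    fused = (alpha[..., None] * views).sum(axis=0)      # (T, D) weighted combination
    return fused, alpha

# Toy usage with random "features" standing in for SE outputs.
rng = np.random.default_rng(0)
t_feat = rng.standard_normal((4, 8))
f_feat = rng.standard_normal((4, 8))
fused, alpha = multi_view_attention_fusion(t_feat, f_feat, np.ones(8))
```

The softmax ensures the model can emphasize whichever view is more reliable frame by frame, which is the intuition behind combining complementary time-domain and frequency-domain representations.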