Real-Time Mandarin Speech Synthesis System

鄭安傑; 陳嘉平

Title (original)	即時中文語音合成系統
Parallel title	Real-Time Mandarin Speech Synthesis System
Authors	鄭安傑, 陳嘉平
Abstract	This thesis studies and implements a real-time Mandarin speech synthesis system. The system uses a model that converts a text sequence into a mel-spectrogram sequence, followed by a vocoder that converts the mel-spectrogram into synthesized speech. We implement the sequence-to-sequence conversion model with Tacotron2 and pair it with several vocoders: Griffin-Lim, World-Vocoder, and WaveGlow. The WaveGlow neural vocoder, which implements an invertible encoding-decoding function, stands out in both synthesis speed and speech quality. We build the system on the 12-hour single-speaker Biaobei (標貝) corpus. In speech quality, the system using the WaveGlow vocoder achieves a MOS of 4.08, slightly below the 4.41 of real speech and far above the other two vocoders (average 2.93). In processing speed, on a GeForce RTX 2080 Ti GPU, the system using the WaveGlow vocoder produces 10 seconds of 48 kHz speech in only 1.4 seconds, so it is a real-time system.
Pages	53-61
Keywords	text-to-speech (TTS), Tacotron2, WaveGlow
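Journal	中文計算語言學期刊 (International Journal of Computational Linguistics and Chinese Language Processing)
Issue	December 2019 (Vol. 24, No. 2)
Publisher	中華民國計算語言學學會 (Association for Computational Linguistics and Chinese Language Processing)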
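
The abstract's real-time claim rests on simple arithmetic: 1.4 seconds of processing for 10 seconds of 48 kHz audio. A minimal Python sketch of that check follows, assuming the common definition of real-time factor (processing time divided by audio duration). The function name `real_time_factor` is illustrative and does not come from the paper.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration.

    A value below 1.0 means synthesis finishes faster than playback.
    This definition is an assumption; the abstract does not name the metric.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Figures reported in the abstract: 1.4 s of compute for 10 s of 48 kHz speech.
SYNTHESIS_SECONDS = 1.4
AUDIO_SECONDS = 10.0
SAMPLE_RATE_HZ = 48_000

rtf = real_time_factor(SYNTHESIS_SECONDS, AUDIO_SECONDS)
samples = int(AUDIO_SECONDS * SAMPLE_RATE_HZ)      # number of output samples
throughput = samples / SYNTHESIS_SECONDS           # samples generated per second

print(f"RTF = {rtf:.2f}")                          # RTF = 0.14
print(f"throughput = {throughput:,.0f} samples/s")
```

An RTF of 0.14 means synthesis runs roughly seven times faster than playback on the reported GPU. Figures on other hardware would differ.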