Real-Time Single-Speaker Taiwanese-Accented Mandarin Speech Synthesis System

王奕雯; 陳嘉平

Title	單語者台灣腔中文即時語音合成系統
Parallel title	Real-Time Single-Speaker Taiwanese-Accented Mandarin Speech Synthesis System
Authors	王奕雯, 陳嘉平
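Chinese abstract (English translation)	This paper studies a real-time single-speaker Taiwanese-accented Mandarin speech synthesis system. Architecturally, it uses a synthesizer that maps a text sequence to a Mel-spectrogram sequence, followed by a vocoder that maps the Mel spectrogram to a speech signal. First, we try the GST Tacotron-2 synthesizer with the Griffin-Lim vocoder, combined with different datasets, including Beijing-accented and Taiwanese-accented Mandarin corpora, and with different training methods, including transfer learning and ensemble learning, and run experiments on three system settings. Next, using the Tacotron-2 plus Griffin-Lim architecture with Mandarin corpora, we test whether or not to use a pretrained model, running experiments on two further system settings. Finally, from these five system settings we select the one with the highest MOS and replace its vocoder, Griffin-Lim, with WaveGlow to evaluate the effect of the two vocoders on MOS. The datasets we use are the 12-hour single-speaker Mandarin Biaobei corpus, a 4.5-hour single-speaker Mandarin personally recorded corpus, a 2.2-hour single-speaker Mandarin National Education Radio corpus, and the 24-hour single-speaker English LJSpeech corpus. The system with the highest final MOS uses a Tacotron-2 model pretrained on the Biaobei corpus and then further trained on the National Education Radio corpus, coupled with a WaveGlow model pretrained on the LJSpeech corpus and then further trained on the Biaobei corpus. It reaches an MOS of 4.32 and needs only 1.3 seconds to generate 10 seconds of 48 kHz speech, so it qualifies as a real-time speech synthesis system.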
English abstract	In this paper, we study a real-time single-speaker Taiwanese-accented Mandarin speech synthesis system. The system uses an end-to-end sequence-to-sequence model that maps a text sequence to a Mel-spectrogram sequence, and a vocoder that maps the Mel-spectrogram sequence to a synthesized speech waveform. We first use the GST Tacotron-2 sequence-to-sequence model with the Griffin-Lim vocoder. The system is trained with several datasets, such as a Mainland-accented Mandarin corpus and a Taiwanese-accented Mandarin corpus, and with different training methods, including transfer learning and ensemble learning. Three experiments were carried out in this stage. Next, we use Tacotron-2 and Griffin-Lim with the same datasets and experiment with model pretraining. Two experiments were carried out in this stage. Finally, the system setting with the highest MOS in these experiments is selected, and its Griffin-Lim vocoder is replaced by a WaveGlow vocoder. The datasets we use include the 12-hour Biaobei Mandarin corpus, a 4.5-hour personally recorded Mandarin corpus, a 2.2-hour National Education Radio Mandarin corpus, and the 24-hour LJSpeech English corpus. The system with the highest MOS is configured as follows: Tacotron-2 is pretrained with the Biaobei corpus and then trained with the National Education Radio corpus, and the WaveGlow vocoder is pretrained with the LJSpeech corpus and then trained with the Biaobei corpus. This system achieves an MOS of 4.32 and generates 10 seconds of 48 kHz speech in 1.3 seconds.
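Pages	1-15
Keywords	Tacotron-2, GST Tacotron-2, Griffin-Lim, WaveGlow, Transfer Learning, Ensemble Learning, Pretrained Model
Journal	ROCLING論文集 (ROCLING Proceedings)
Issue	2020
Publisher	中華民國計算語言學學會 (Association for Computational Linguistics and Chinese Language Processing)
Previous article in this issue	A Math Word Problem Answering System Built with a Non-Probabilistic Model Based on Linguistic Theory
Next article in this issue	Construction of an Adaptive Chinese Dimensional Sentiment Lexicon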
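
The real-time claim in the abstract comes down to simple arithmetic: 10 seconds of 48 kHz speech generated in 1.3 seconds. The sketch below works that out as a real-time factor (synthesis time divided by audio duration). The function name, the "below 1.0 means real time" criterion, and the sample-count line are illustrative assumptions. They are not taken from the paper, which reports only the two timing figures.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Figures reported in the abstract: 10 s of 48 kHz speech generated in 1.3 s.
rtf = real_time_factor(synthesis_seconds=1.3, audio_seconds=10.0)

# Assumed criterion (not from the paper): faster than playback means real time.
is_real_time = rtf < 1.0

# Number of waveform samples the vocoder must emit for 10 s at 48 kHz.
samples = int(10.0 * 48_000)

print(f"real-time factor: {rtf:.2f}, real time: {is_real_time}, samples: {samples}")
```

Under these assumptions the real-time factor is about 0.13, meaning the system produces audio several times faster than it would take to play it back.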
