English Abstract
In recent years, speech synthesis systems have become able to generate speech of high quality. However, multi-speaker text-to-speech (TTS) systems still require a large amount of speech data for each target speaker. In this study, we construct a multi-speaker TTS system that alleviates this problem by incorporating two sub-modules into a neural network-based speech synthesis system. The first module adds a speaker embedding to the encoder of the end-to-end TTS framework, using only a small amount of speech data from each training speaker. Two speaker embedding methods, namely speaker verification embedding and voice conversion embedding, are compared to decide which is better suited to the personalized TTS system. The second module replaces the conventional post-net, which is typically adopted to enhance the output spectrum sequence, with a post-filter network that further improves the quality of the generated speech. Experimental results show that adding the speaker embedding to the encoder is effective, and the resulting utterances are indeed perceived as the target speaker. Moreover, the post-filter network not only improves speech quality but also enhances the speaker similarity of the generated utterances. The constructed TTS system can generate an utterance of the target speaker in fewer than two seconds. In the future, other features such as prosody information will be incorporated to further improve the performance of the TTS framework.