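Abstract (translated from Chinese) |
Most voice conversion models rely on vocoders based on the traditional source-filter model to extract speech parameters from the speech signal and to synthesize speech. However, because of the many theoretical assumptions underlying traditional vocoders, neither the naturalness of speech converted with them nor its similarity to the target speaker can be improved further. In deep learning, WaveNet is currently one of the most successful speech generation techniques and produces speech that is more natural than that of earlier methods. The WaveNet vocoder, an extension of WaveNet, can generate speech of higher quality than traditional vocoders and has gradually been adopted by voice conversion research teams abroad. Voice conversion models developed by domestic research teams have so far mostly relied on traditional vocoders. This thesis introduces the WaveNet vocoder into several voice conversion models recently proposed domestically, in order to evaluate its potential for these models. In the experiments, we compare the results of three voice conversion models, each used with a traditional vocoder and with the WaveNet vocoder. The models compared are 1) the variational auto-encoder (VAE), 2) the variational auto-encoder combined with a generative adversarial network, and 3) the cross-domain variational auto-encoder (cross-domain VAE, CDVAE). The experimental results show that, with the WaveNet vocoder, similarity to the target speaker improves significantly for all three models. In terms of naturalness, only the VAE-based voice conversion model improves significantly with the WaveNet vocoder. |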
Abstract (English) |
Most voice conversion models rely on vocoders based on the source-filter model to extract speech parameters and synthesize speech. However, the naturalness of the converted speech and its similarity to the target speaker are limited by the many theoretical assumptions and constraints of traditional vocoders. In the field of deep learning, a network structure called WaveNet is one of the state-of-the-art techniques in speech synthesis and can generate speech samples of considerably higher naturalness than earlier methods. One extension of WaveNet is the WaveNet vocoder. Because it can synthesize speech of higher quality than traditional vocoders, it has gradually been adopted by several voice conversion research teams abroad. In this work, we study the combination of the WaveNet vocoder with voice conversion models recently developed by domestic research teams, in order to evaluate the potential of applying the WaveNet vocoder to these models and to introduce it to the domestic speech processing research community. In the experiments, we compare the converted speech generated by three voice conversion models using a traditional WORLD vocoder and the WaveNet vocoder, respectively. The compared voice conversion models are 1) the variational auto-encoder (VAE), 2) the variational autoencoding Wasserstein generative adversarial network (VAW-GAN), and 3) the cross-domain variational auto-encoder (CDVAE). Experimental results show that, with the WaveNet vocoder, the similarity between the converted speech and the target speech improves significantly for all three models. As for naturalness, only the VAE-based model improves significantly with the WaveNet vocoder. |