English Abstract |
This paper studies a real-time, single-speaker, Taiwanese-accented Mandarin speech synthesis system. The system uses an end-to-end sequence-to-sequence model to map a text sequence to a Mel spectrogram sequence, and a vocoder to map the Mel spectrogram sequence to a synthesized speech waveform. In the first stage, we use the GST Tacotron-2 sequence-to-sequence model with the Griffin-Lim vocoder. The system is trained on several datasets, including a Mainland-accented Mandarin corpus and a Taiwanese-accented Mandarin corpus, and with different training methods, including transfer learning and ensemble learning; three experiments were carried out in this stage. In the second stage, we use Tacotron-2 with Griffin-Lim on the same datasets and experiment with model pretraining; two experiments were carried out in this stage. Finally, the system setting with the highest mean opinion score (MOS) in these experiments is selected, and its Griffin-Lim vocoder is replaced by the WaveGlow vocoder. The datasets we use are the 12-hour Biaobei Mandarin corpus, a 4.5-hour personally recorded Mandarin corpus, the 2.2-hour National Education Radio Mandarin corpus, and the 24-hour LJSpeech English corpus. The configuration that achieved the highest MOS is as follows: Tacotron-2 is pretrained on the Biaobei corpus and then trained on the National Education Radio corpus, while the WaveGlow vocoder is pretrained on the LJSpeech corpus and then trained on the Biaobei corpus. This system achieves a MOS of 4.32 and generates 10 seconds of 48 kHz speech in 1.3 seconds. |