Chinese abstract |
In this paper, the implementation of a high-performance Mandarin TTS system is presented. The system is composed of four main parts: text analysis (TA), prosodic information generation (PIG), a waveform table (WT) of 411 base-syllables, and PSOLA-based waveform synthesis (PSOLA). In TA, a statistical-model-based method is first employed to automatically tag the input text to obtain the word sequence and the associated part-of-speech (POS) sequence. A lexicon containing about 80,000 words is used in the tagging process. Then the corresponding base-syllable sequence is found and used to retrieve the basic waveform sequence from WT. Some linguistic features used in PIG are also extracted in TA. In PIG, a four-layer recurrent neural network (RNN) is employed to generate prosodic information including the pitch contour, energy level, initial duration, and final duration of each syllable, as well as the inter-syllable pause duration. Finally, in PSOLA the basic waveform sequence is modified using the prosodic information to generate the output synthetic speech. The whole system is implemented in software on a PC/AT 486 with a 16-bit Sound Blaster add-on card. Only 3.2 Mbytes of memory are required. It can synthesize speech in real time for any input Chinese text. Informal listening tests by many native Chinese speakers living in Taiwan confirmed that the synthetic speech sounded very fluent and natural. |
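
The data flow between the four parts named in the abstract (TA, PIG, WT, PSOLA) can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the class and function names (`Prosody`, `AnalyzedText`, `synthesize`, and the `toy_*` helpers), the units, and the data representations are all assumptions introduced here, and the toy stages merely stand in for the statistical tagger, the four-layer RNN, and the PSOLA modification step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Prosody:
    """Per-syllable prosodic targets listed in the abstract (units are assumed)."""
    pitch_contour: List[float]  # representation of the contour is an assumption
    energy_level: float
    initial_duration: float     # duration of the syllable initial
    final_duration: float       # duration of the syllable final
    pause_duration: float       # inter-syllable pause after this syllable


@dataclass
class AnalyzedText:
    """Output of text analysis (TA); field names are hypothetical."""
    words: List[str]
    pos_tags: List[str]
    base_syllables: List[str]
    features: List[dict]        # linguistic features, one entry per syllable


def synthesize(
    text: str,
    analyze: Callable[[str], AnalyzedText],
    predict_prosody: Callable[[List[dict]], List[Prosody]],
    wave_table: Dict[str, List[float]],
    modify: Callable[[List[float], Prosody], List[float]],
) -> List[float]:
    """Wire the four stages together: TA -> PIG -> WT lookup -> waveform modification."""
    analyzed = analyze(text)                                   # TA
    prosody = predict_prosody(analyzed.features)               # PIG (an RNN in the paper)
    units = [wave_table[s] for s in analyzed.base_syllables]   # WT lookup
    if len(units) != len(prosody):
        raise ValueError("expected one prosody record per syllable")
    samples: List[float] = []
    for unit, target in zip(units, prosody):
        samples.extend(modify(unit, target))                   # PSOLA in the paper
    return samples


# Toy stand-ins so the sketch runs end to end; none of these model the real stages.
def toy_analyze(text: str) -> AnalyzedText:
    syllables = text.split()  # assumes space-separated romanized syllables
    return AnalyzedText(
        words=syllables,
        pos_tags=["X"] * len(syllables),
        base_syllables=syllables,
        features=[{"position": i} for i in range(len(syllables))],
    )


def toy_prosody(features: List[dict]) -> List[Prosody]:
    return [Prosody([1.0], 0.5, 0.1, 0.2, 0.0) for _ in features]


def toy_modify(unit: List[float], target: Prosody) -> List[float]:
    # Only rescales amplitude; real PSOLA also changes pitch and duration.
    return [target.energy_level * x for x in unit]


toy_table = {"ni": [1.0, -1.0], "hao": [0.5, 0.5, -0.5]}
samples = synthesize("ni hao", toy_analyze, toy_prosody, toy_table, toy_modify)
```

Passing the stages in as callables keeps the sketch independent of any particular tagger, prosody model, or waveform-modification routine, which mirrors the modular four-part description in the abstract.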