English Abstract |
In this paper, a new HMM structure is proposed that works with a limited training corpus to improve the fluency of synthetic speech. Spectral fluency improves because this HMM structure can model the context-dependent spectral characteristics of a speech unit. In addition, instead of using a decision tree to cluster contexts, knowledge of phoneme articulation is used to cluster contexts and reduce the enormous number of context combinations. To evaluate the proposed HMM structure, we construct three Mandarin speech synthesis systems, each using a different HMM structure, for comparison. In all three systems, the prosodic parameters are generated with the same previously studied ANN modules, whereas the spectral coefficients are generated with the HMM structure adopted by each system. For synthesizing the signal waveform, all three systems use the previously studied harmonic plus noise model (HNM). According to the results of listening tests, the speech synthesized by the system using the proposed HMM structure is indeed more fluent than the speech synthesized by the other two systems. In addition, average spectral distances are measured between recorded sentences and synthetic sentences. The results show that the proposed HMM structure also yields a smaller average spectral distance than the other two HMM structures. |