English abstract |
This paper proposes a method for detecting the prosodic phrase structure of Mandarin speech. A recurrent neural network (RNN) first classifies each input frame of an utterance into one of three broad classes: syllable initial, syllable final, or silence. The RNN outputs then drive a finite-state machine (FSM) that segments the input utterance into four types of segment: three stable types, I (initial), F (final), and S (silence), and one transition type, T (transition). Modeling features are then extracted from the vicinities of the F-segments and used to model the prosodic states of the inter-F-segment intervals. Two prosodic-state modeling schemes are studied. The first uses vector quantization (VQ) to encode the modeling features and directly classify the inter-F-segment intervals into 8 prosodic states. The second uses an RNN, trained with relevant linguistic features as output targets, whose hidden-layer outputs implicitly represent the prosodic status; prosodic states are then obtained by vector-quantizing these hidden-layer outputs. Experimental results show that these prosodic states admit linguistically meaningful interpretations. |
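
The segmentation step described in the abstract can be illustrated with a small sketch. The abstract does not specify the FSM's states or transition rules, so the code below is only a simplified stand-in: it assumes each frame carries three RNN scores in the order (initial, final, silence), labels a frame as a stable class when its top score reaches a hypothetical `threshold`, labels it T otherwise, and merges runs of identical labels into segments. The function names, the threshold value, and the sample scores are all illustrative assumptions.

```python
from itertools import groupby

# Stable classes in the assumed order of the RNN outputs.
CLASSES = ("I", "F", "S")  # initial, final, silence


def label_frame(scores, threshold=0.7):
    """Return the stable label if its score reaches the threshold, else 'T'.

    The threshold rule is an assumption, not the paper's FSM.
    """
    best = max(range(len(CLASSES)), key=lambda k: scores[k])
    return CLASSES[best] if scores[best] >= threshold else "T"


def segment(frames, threshold=0.7):
    """Merge per-frame labels into (label, start, end) runs; end is exclusive."""
    labels = [label_frame(s, threshold) for s in frames]
    segments = []
    pos = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, pos, pos + length))
        pos += length
    return segments


# Made-up scores for seven frames: silence, ambiguous, initial, ambiguous, final.
frames = [
    (0.1, 0.1, 0.8),
    (0.1, 0.1, 0.8),
    (0.5, 0.1, 0.4),
    (0.9, 0.05, 0.05),
    (0.5, 0.5, 0.0),
    (0.1, 0.9, 0.0),
    (0.1, 0.9, 0.0),
]
print(segment(frames))
```

In this toy input the two ambiguous frames become T-segments that separate the S-, I-, and F-segments. A real FSM would likely add duration constraints or hysteresis so that a single noisy frame does not split a stable segment.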