English Abstract |
This prototype system demonstrates a novel method of word segmentation based on corpus statistics. Because its central technique is unsupervised training on a large corpus, we refer to the approach as unsupervised word segmentation. The approach is general in scope and can be applied to both Mandarin Chinese and Taiwanese. In this prototype, we illustrate its use in segmenting the Taiwanese Bible, which is written in Hanzi and Romanized characters. The procedure involves five steps: 1. Computing the mutual information (MI) between characters A and B, whether Hanzi or Romanized; if A and B have a relatively high MI, we lean toward treating AB as a word. 2. Using a greedy method to form words of 2 to 4 characters in the input sentences. 3. Building an N-gram model from the results of the first-round word segmentation. 4. Segmenting words based on the N-gram model. 5. Iterating between steps 3 and 4: building the N-gram model and segmenting words. |
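Steps 1 and 2 of the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula or the details of the greedy method, so everything below is an assumption made for illustration: MI is taken to be pointwise mutual information over adjacent characters with a base-2 logarithm, the greedy pass runs left to right, and the names `pointwise_mi`, `greedy_segment`, `threshold`, and `max_len` are hypothetical. Latin letters stand in for Hanzi and Romanized units in the toy corpus.

```python
import math
from collections import Counter


def pointwise_mi(sentences):
    """Map each adjacent pair (a, b) to log2(P(ab) / (P(a) * P(b))).

    Assumption: "mutual information" is read here as pointwise MI
    estimated from raw unigram and bigram counts.
    """
    unigrams = Counter()
    bigrams = Counter()
    for s in sentences:
        unigrams.update(s)
        bigrams.update(zip(s, s[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    mi = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        mi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return mi


def greedy_segment(sentence, mi, threshold=1.0, max_len=4):
    """Greedy left-to-right segmentation (one possible reading of step 2).

    A word keeps growing while the MI of the next adjacent pair is at
    least `threshold` and the word is shorter than `max_len` characters.
    Characters that attach to nothing are left as single-character tokens.
    """
    words = []
    i = 0
    n = len(sentence)
    while i < n:
        j = i + 1
        while (j < n and j - i < max_len
               and mi.get((sentence[j - 1], sentence[j]), float("-inf")) >= threshold):
            j += 1
        words.append(sentence[i:j])
        i = j
    return words


# Toy corpus: "a" is always followed by "b"; all other pairs vary.
corpus = ["abcd", "abdc", "cabd", "dabc", "cdab", "dcab"]
mi = pointwise_mi(corpus)
print(greedy_segment("cabd", mi))  # ['c', 'ab', 'd']
```

In this toy corpus the pair `ab` scores about 2.4 bits while every other pair scores below 1 bit, so only `ab` is merged. The threshold of 1.0 is arbitrary; in practice it would need tuning on the actual corpus, and rare pairs are known to receive inflated pointwise MI scores, which is one reason a later N-gram re-segmentation pass (steps 3 to 5) can help.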