English Abstract |
Research on Chinese part-of-speech tagging has been very active recently. However, several problems must be addressed before a successful tagger can be realized, among them word definition, segmentation, the lexicon, the tag set, tagging guidelines, and tagged corpora. We propose machine-clustered word classes as an alternative to part-of-speech tags for use in class n-gram models. Chinese characters and words are automatically clustered into a predefined number of classes using a simulated annealing approach. The 1991 United Daily text corpus of approximately 10 million characters is used to collect character and word collocation statistics. We present and discuss some preliminary experimental results, which we consider promising and interesting. |
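To make the clustering idea concrete, the following is a minimal sketch of simulated-annealing clustering for a class bigram model. It is not the authors' implementation: the abstract does not state the objective function, move set, or cooling schedule, so the sketch assumes a maximum-likelihood class bigram objective, single-item reassignment moves, and geometric cooling. The names `log_likelihood` and `anneal` and the parameters `steps`, `t0`, and `cooling` are hypothetical.

```python
import math
import random
from collections import Counter


def log_likelihood(bigrams, assign):
    """Log-likelihood of bigram counts under a class bigram model.

    Assumed model: P(w2 | w1) = P(c2 | c1) * P(w2 | c2), with
    maximum-likelihood estimates taken from the same counts.
    """
    trans = Counter()    # class-to-class transition counts
    left = Counter()     # how often each class appears as history
    right_c = Counter()  # how often each class appears as prediction
    right_w = Counter()  # how often each item appears as prediction
    for (w1, w2), n in bigrams.items():
        c1, c2 = assign[w1], assign[w2]
        trans[(c1, c2)] += n
        left[c1] += n
        right_c[c2] += n
        right_w[w2] += n
    ll = sum(n * math.log(n / left[c1]) for (c1, _), n in trans.items())
    ll += sum(n * math.log(n / right_c[assign[w]]) for w, n in right_w.items())
    return ll


def anneal(tokens, num_classes, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Cluster items into num_classes classes by simulated annealing.

    Assumptions (not from the source): one item is moved per step,
    and the temperature decays geometrically after each proposed move.
    """
    rng = random.Random(seed)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = sorted(set(tokens))
    assign = {w: rng.randrange(num_classes) for w in vocab}
    current = log_likelihood(bigrams, assign)
    best, best_assign = current, dict(assign)
    temp = t0
    for _ in range(steps):
        w = rng.choice(vocab)
        old, new = assign[w], rng.randrange(num_classes)
        if new == old:
            continue
        assign[w] = new
        candidate = log_likelihood(bigrams, assign)
        delta = candidate - current
        # Always accept improvements; accept worse moves with
        # probability exp(delta / temp), which shrinks as temp falls.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current = candidate
            if current > best:
                best, best_assign = current, dict(assign)
        else:
            assign[w] = old  # reject the move
        temp *= cooling
    return best_assign, best


if __name__ == "__main__":
    # Toy input; the items could equally be characters or segmented words.
    toy = "the cat sat on the mat the dog sat on the rug".split()
    classes, score = anneal(toy, num_classes=3)
    print(classes, score)
```

The sketch recomputes the likelihood from scratch after every move, which is only practical for toy input. A corpus on the scale of 10 million characters would need incremental count updates.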