English Abstract |
Unknown words are, in general, the main factor that degrades word segmentation performance. To recognize regular unknown words derived from highly productive morphemes, this paper proposes a set of 17 morphological rules. In addition, an unknown word model is proposed to handle unknown words of irregular form, such as proper names. With these unknown word resolution procedures, error reduction rates of 78.34% at the word level and 81.87% at the sentence level are obtained in the task of smoothing technical manuals. To examine the procedures on a more general task, a newspaper corpus is also tested, and error reduction rates of 40.15% at the word level and 34.78% at the sentence level are observed. |