Abstract
This paper presents a stochastic model to tackle the problem of Chinese named
entity recognition. In this research, we unify the component tokens of named entities
and their contexts into a generalized role set, which is analogous to a
part-of-speech (POS) tag set. The
probabilities of role emission and transition are estimated from a role-labeled
data set, which is derived from a hand-corrected corpus after word segmentation
and POS tagging. Given an input string, Viterbi role tagging is applied to the
tokens produced by the initial segmentation pass. Then
named entities are identified and classified through maximum matching on the best
role sequence. In addition, named entity recognition using the role model is
integrated with the unified class-based bigram model for word
segmentation. Thus, named entity candidates can be further selected in the final
process of Chinese lexical analysis. Evaluations on one month of news from the
People’s Daily and on the MET-2 data set demonstrate that the
role model achieves competitive performance in Chinese named entity
recognition. We then investigate the relationship between named entity recognition and
Chinese lexical analysis through comparative experiments on a 1,105,611-word
corpus. We find that, on the one hand, Chinese named entity
recognition substantially contributes to the performance of lexical analysis; on the
other hand, the subsequent process of word segmentation greatly improves the
precision of Chinese named entity recognition. We have applied the role model to
named entity identification in our Chinese lexical analysis system, ICTCLAS,
which is free software and available at the Open Platform of Chinese NLP
(www.nlp.org.cn). ICTCLAS ranked first, with a word segmentation precision of
97.58%, in a recent official evaluation held by the National 973
Fundamental Research Program of China.
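
The role tagging step summarized above can be illustrated with a minimal sketch. It assumes a first-order hidden Markov model over roles and a toy role set of three labels (SURNAME, GIVEN, OTHER). The role names, the probability tables, the example sentence, and the helper names viterbi_roles and extract_names are all hypothetical and are not taken from the paper or from ICTCLAS. The name extraction shown is a simplified pattern match over the best role sequence, standing in for the maximum matching step.

```python
import math

def viterbi_roles(tokens, roles, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable role sequence for tokens under a first-order HMM."""
    if not tokens:
        return []

    def lp(p):
        # Log-probability with a small floor so unseen events do not yield log(0).
        return math.log(max(p, floor))

    # best[i][r]: best log-probability of any role path ending in role r at token i.
    best = [{r: lp(start_p.get(r, 0.0)) + lp(emit_p[r].get(tokens[0], 0.0))
             for r in roles}]
    back = [{}]
    for i in range(1, len(tokens)):
        best.append({})
        back.append({})
        for r in roles:
            prev, score = max(
                ((q, best[i - 1][q] + lp(trans_p[q].get(r, 0.0))) for q in roles),
                key=lambda pair: pair[1],
            )
            best[i][r] = score + lp(emit_p[r].get(tokens[i], 0.0))
            back[i][r] = prev

    # Follow back-pointers from the best final role.
    path = [max(roles, key=lambda r: best[-1][r])]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

def extract_names(tokens, path):
    """Collect the longest SURNAME GIVEN+ spans from a tagged role sequence."""
    names, i = [], 0
    while i < len(path):
        if path[i] == "SURNAME":
            j = i + 1
            while j < len(path) and path[j] == "GIVEN":
                j += 1
            if j > i + 1:
                names.append("".join(tokens[i:j]))
                i = j
                continue
        i += 1
    return names

# Toy model (all numbers are made up for illustration).
ROLES = ["OTHER", "SURNAME", "GIVEN"]
start_p = {"OTHER": 0.8, "SURNAME": 0.2}
trans_p = {
    "OTHER": {"OTHER": 0.7, "SURNAME": 0.3},
    "SURNAME": {"OTHER": 0.05, "GIVEN": 0.95},
    "GIVEN": {"OTHER": 0.7, "GIVEN": 0.3},
}
emit_p = {
    "OTHER": {"记者": 0.4, "报道": 0.4, "张": 0.1, "华": 0.1},
    "SURNAME": {"张": 0.9, "华": 0.1},
    "GIVEN": {"华": 0.8, "张": 0.2},
}

tokens = ["记者", "张", "华", "报道"]  # output of a hypothetical initial segmentation
roles_path = viterbi_roles(tokens, ROLES, start_p, trans_p, emit_p)
person_names = extract_names(tokens, roles_path)
```

In this toy setting the tagger assigns OTHER, SURNAME, GIVEN, OTHER to the four tokens, and the extraction step recovers 张华 as a person name candidate. A real system would estimate the tables from a role-labeled corpus and use a much richer role set.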