英文摘要 |
This paper proposes a novel feature for conditional random field (CRF) model in Chinese word segmentation system. The system uses a conditional random field as machine learning model with one simple feature called term contributed boundaries (TCB) in addition to the“BIEO”character-based label scheme. TCB can be extracted from unlabeled corpora automatically, and segmentation variations of different domains are expected to be reflected implicitly. The dataset used in this paper is the closed training task in CIPS-SIGHAN-2010 bakeoff, including simplified and traditional Chinese texts. The experiment result shows that TCB does improve“BIEO”tagging domain-independently about 1% of the F1 measure score. |