A Pragmatic Chinese Word Segmentation Approach Based on Mixing Models

Jiang, Wei; Guan, Yi; Wang, Xiao-long

Title	A Pragmatic Chinese Word Segmentation Approach Based on Mixing Models
Authors	Jiang, Wei; Guan, Yi; Wang, Xiao-long
Abstract	This paper presents a pragmatic approach to Chinese word segmentation based on mixing language models. Chinese word segmentation comprises several hard sub-tasks, each of which poses different difficulties. The authors apply a suitable language model to each sub-task so as to exploit the strengths of each model. First, a class-based trigram is adopted for basic word segmentation, with the Absolute Discount Smoothing algorithm applied to overcome data sparseness. The Maximum Entropy Model (ME) is also used to identify Named Entities. Second, the authors propose applying rough sets, average mutual information, and related techniques to extract special features. Finally, some features are extended by combining the word cluster with the thesaurus. The authors’ system participated in the Second International Chinese Word Segmentation Bakeoff and achieved F-measures of 96.7 and 97.2 in the PKU and MSRA open tests, respectively.
Pages	393-415
Keywords	Word segmentation, N-Gram, Maximum entropy model, Rough sets, Word cluster, Machine learning
Journal	中文計算語言學期刊
Issue	December 2006 (Vol. 11, No. 4)
Publisher	中華民國計算語言學學會
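
The abstract names Absolute Discount Smoothing as the remedy for data sparseness in the n-gram model. The sketch below is only an illustration of that general technique, not the authors' implementation: it assumes a plain word bigram rather than the class-based trigram described in the abstract, interpolates with a unigram distribution, and uses a discount of 0.75, a value chosen here for illustration. The function name `absolute_discount_bigram` and the toy sentence are likewise hypothetical.

```python
from collections import Counter


def absolute_discount_bigram(tokens, d=0.75):
    """Build p(word | prev) with interpolated absolute discounting.

    Assumption: a word bigram stands in for the class-based trigram
    mentioned in the abstract; d is an illustrative discount, 0 < d < 1.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    # How often each word occurs as a left context (last token has no successor).
    context_counts = Counter(tokens[:-1])
    # Number of distinct words observed after each context.
    followers = Counter(prev for (prev, _word) in bigrams)
    total = len(tokens)

    def prob(word, prev):
        p_unigram = unigrams[word] / total
        c_prev = context_counts[prev]
        if c_prev == 0:
            # Context never seen: fall back to the unigram estimate.
            return p_unigram
        # Subtract a fixed discount d from every seen bigram count ...
        discounted = max(bigrams[(prev, word)] - d, 0.0) / c_prev
        # ... and hand the freed probability mass to the unigram model.
        backoff_weight = d * followers[prev] / c_prev
        return discounted + backoff_weight * p_unigram

    return prob


# Toy, pre-segmented sentence used only to exercise the function.
tokens = "我 爱 北京 我 爱 上海".split()
p = absolute_discount_bigram(tokens)
print(p("北京", "爱"), p("上海", "我"))
```

Because every seen bigram gives up the same amount `d`, an unseen pair such as ("我", "上海") still receives a small non-zero probability, while the distribution over the vocabulary for a given context still sums to one.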