A Realistic and Robust Model for Chinese Word Segmentation

Huang Chu-Ren; Yo Ting-Shuo; Petr Simon; Hsieh Shu-Kai

Title	A Realistic and Robust Model for Chinese Word Segmentation
Authors	Huang Chu-Ren, Yo Ting-Shuo, Petr Simon, Hsieh Shu-Kai
Abstract	A realistic Chinese word segmentation tool must adapt to textual variations with minimal training input, yet be robust enough to yield reliable segmentation results for all variants. Various lexicon-driven approaches to Chinese segmentation achieve high F-scores yet require massive training for any variation. Text-driven approaches can be easily adapted to domain and genre changes yet have difficulty matching the high F-scores of the lexicon-driven approaches. In this paper, we refine and implement an innovative text-driven word boundary decision (WBD) segmentation model proposed in earlier work. The WBD model treats word segmentation simply and efficiently as a binary decision on whether to realize the natural textual break between two adjacent characters as a word boundary. The WBD model allows simple and quick preparation of training data by converting characters into contextual vectors for learning the word boundary decision. Machine learning experiments with four different classifiers show that training with 1,000 vectors and training with 1 million vectors achieve comparable and reliable results. In addition, when applied to SIGHAN Bakeoff 3 competition data, the WBD model produces OOV recall rates that are higher than all published results. Unlike all previous work, our OOV recall rate is comparable to our own F-score. Both experiments support the claim that the WBD model is a realistic model for Chinese word segmentation, as it can be easily adapted to new variants with robust results. We conclude by discussing the linguistic ramifications and future implications of the WBD approach.
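Pages	1-14
Keywords	segmentation
Journal	ROCLING Proceedings (ROCLING論文集)
Issue	2008
Publisher	The Association for Computational Linguistics and Chinese Language Processing (中華民國計算語言學學會)
Previous article in this issue	Search and Application of Easily Confused Chinese Characters with Similar Forms and Pronunciations (形音相近的易混淆漢字的搜尋與應用)
Next article in this issue	Implementation of a Toneless Pinyin Input Method for Mandarin and Taiwanese (國台語無聲調拼音輸入法實作)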
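
The abstract describes segmentation as a binary decision at each break between two adjacent characters, learned from contextual vectors. The abstract does not say how those vectors are built or which classifiers are used, so the following Python sketch is only an illustration under stated assumptions: the context is taken to be a fixed window of characters on each side of the break, and the names boundary_samples, segment, PAD, and the window parameter are all hypothetical. It shows how a segmented sentence could be turned into labelled training samples, and how a sequence of boundary decisions could be turned back into words.

```python
from typing import List, Tuple

PAD = "#"  # hypothetical padding symbol for positions beyond the text edges


def boundary_samples(
    words: List[str], window: int = 2
) -> List[Tuple[Tuple[str, str], int]]:
    """Turn one segmented sentence into (context, label) pairs.

    One sample is produced per break between two adjacent characters.
    The label is 1 if the break is a word boundary, otherwise 0.
    The context (an assumption, not the paper's feature set) is the
    `window` characters to the left and right of the break.
    """
    text = "".join(words)

    # Character offsets at which a word ends (sentence end excluded).
    cuts = set()
    pos = 0
    for word in words[:-1]:
        pos += len(word)
        cuts.add(pos)

    padded = PAD * window + text + PAD * window
    samples = []
    for i in range(1, len(text)):
        left = padded[i : i + window]                # text[i - window : i]
        right = padded[i + window : i + 2 * window]  # text[i : i + window]
        samples.append(((left, right), 1 if i in cuts else 0))
    return samples


def segment(text: str, decisions: List[int]) -> List[str]:
    """Rebuild words from one binary decision per break in `text`."""
    words, start = [], 0
    for i, is_boundary in enumerate(decisions, start=1):
        if is_boundary:
            words.append(text[start:i])
            start = i
    words.append(text[start:])
    return words


if __name__ == "__main__":
    gold = ["中文", "斷詞", "模型"]
    for context, label in boundary_samples(gold):
        print(context, label)
    print(segment("".join(gold), [0, 1, 0, 1, 0]))
```

In a full system, a trained classifier would supply the decisions passed to segment; here they are written by hand so that the round trip from words to labels and back is easy to follow.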
