English Abstract
In a Chinese sentence, there are no word delimiters, such as blanks, between the words. It is therefore essential to identify word boundaries before processing Chinese text. Traditional approaches tend to use dictionary lookup, morphological rules, and heuristics to identify word boundaries. Such approaches may not scale to a large system because of the complicated linguistic phenomena involved in Chinese morphology and syntax. In this paper, the various features available in a sentence are used to construct a generalized word segmentation model; various probabilistic models for word segmentation are then derived from the generalized model. In general, the likelihood measure adopted in a probabilistic model does not provide a scoring mechanism that directly reflects the true ranks of the candidate segmentation patterns. To enhance the baseline models, a robust adaptive learning algorithm is proposed to adjust the parameters of the baseline models so as to increase their discrimination power and robustness. Simulation shows that cost-effective word segmentation can be achieved under various contexts with the proposed models. A word recognition rate of 99.39% and a sentence recognition rate of 97.65% on the testing corpus can be achieved by incorporating word length information into a context-independent word model and applying the robust adaptive learning algorithm in the segmentation process. Since not all lexical items can be found in the system dictionary in real applications, the performance of most word segmentation methods in the literature may degrade significantly when unknown words are encountered. This 'unknown word problem' is also examined in this paper, and an error recovery mechanism based on the segmentation model is proposed. Preliminary experiments show that the error rates introduced by unknown words can be reduced significantly.
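To make the probabilistic formulation concrete, the following is a minimal sketch of dictionary-based segmentation under a context-independent (unigram) word model: among all candidate segmentations, dynamic programming selects the word sequence maximizing the product of word probabilities. The toy lexicon and its probabilities are hypothetical illustrations, not the paper's actual dictionary or parameter estimates, and the sketch omits the word-length information and adaptive learning described above.

```python
import math

# Hypothetical toy lexicon with unigram word probabilities (illustrative
# values only; not taken from the paper).
LEXICON = {
    "中": 0.02, "国": 0.02, "中国": 0.05,
    "人": 0.04, "人民": 0.03, "民": 0.01,
}

def segment(sentence):
    """Pick the segmentation maximizing the sum of log word probabilities
    (equivalently, the product of probabilities) via dynamic programming."""
    n = len(sentence)
    best = [float("-inf")] * (n + 1)  # best[i]: best log-prob of sentence[:i]
    best[0] = 0.0
    back = [0] * (n + 1)              # back[i]: start index of the last word
    for i in range(1, n + 1):
        for j in range(max(0, i - 4), i):  # consider words up to 4 characters
            w = sentence[j:i]
            if w in LEXICON and best[j] + math.log(LEXICON[w]) > best[i]:
                best[i] = best[j] + math.log(LEXICON[w])
                back[i] = j
    # Recover the best segmentation by following the back-pointers.
    words, i = [], n
    while i > 0:
        words.append(sentence[back[i]:i])
        i = back[i]
    return list(reversed(words))

print(segment("中国人民"))  # → ['中国', '人民']
```

Under this model the two-word reading ['中国', '人民'] outscores any character-by-character reading because the multiword entries carry higher probability; the unknown-word problem discussed above arises precisely when a correct word is absent from such a lexicon.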