English Abstract
Rather than providing Chinese word segmentation directly, some open-source toolkits allow users to train segmentation models from segmented text. The resulting models can perform quite well when high-quality training data are available; in practice, however, sufficient high-quality training data are hard to obtain. We report an exploration of using parallel corpora and various lexicons, together with techniques for identifying unknown words and near synonyms, to automatically generate training data for such open-source software. Our current experiments achieved promising segmentation results. Although the results fell short of outperforming well-known Chinese segmenters, we believe the proposed approach offers a viable alternative for users of open-source software to generate their own training data.