Using Contextual Information in Clustering Methods for Chinese Word Disambiguation (基於語境特徵及分群模型之中文多義詞消歧)

李右元; 周子皓; 劉昭麟

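Title	基於語境特徵及分群模型之中文多義詞消歧 (literally: Chinese word sense disambiguation based on contextual features and clustering models)
Parallel title	Using Contextual Information in Clustering Methods for Chinese Word Disambiguation
Authors	李右元, 周子皓, 劉昭麟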
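Chinese abstract	Ambiguous words are extremely common in language. In the past, looking up the senses of an ambiguous word and how each is used meant consulting a traditional dictionary, but space constraints keep such dictionaries from listing every sense, and they offer few example sentences. Although technological progress has brought digitized dictionaries and retrieval systems, some problems remain, so humanities scholars still spend considerable effort telling senses apart by manual reading. This study applies clustering models to a large volume of vectorized Chinese text, compares the models by purity score to find the most suitable one, and selects a moderate number of example sentences for users' reference. The experiments use manually labeled example sentences as the basis for scoring. The results show that, for ambiguous words of the homonymy type, the macro-average, weighted-average, and accuracy all reach 0.85 or higher.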
English abstract	We present preliminary results on using clustering methods to find sentences that are useful for learning ambiguous words. First, we search for sentences that contain an ambiguous word (the target word, henceforth). To make the extracted sentences useful for learning the target word, we guide the clustering methods to separate sentences that carry different senses of the target word into different clusters. We influence a clustering method by providing example sentences that carry specific senses of the target word. In machine learning terms, we label a sentence with the sense that the target word has in that sentence. Two sample labeled sentences for the ambiguous word “bank” follow. 1. “financial institution”: Mr. Black deposited the money in the Citi bank. 2. “place”: Along the bank of the Charles River, you may see the MIT campus. Assume that we can collect a large number of sentences that contain the target word, and that we need sentences that use a specific sense of it. Assume also that we are willing to label a few of these sentences as described above. A clustering algorithm may then employ the labeled sentences to build clusters of sentences for our needs. The algorithm may use the labeled sentences as informative seeds for initializing the clusters. In addition, when we select (unlabeled) sentences from the clustered sentences as the final output, the labeled sentences may guide the selection of sentences with the “correct” senses. If a cluster has many labeled sentences of a specific sense, the (unlabeled) sentences in this cluster might have the same label as those sample sentences. Furthermore, when selecting (unlabeled) sentences from this cluster for output, we may prefer those that are closer to the sample sentences.
Assume that we find thousands of sentences that use a target word, provide a certain number of labeled sentences to guide a clustering algorithm, cluster the thousands of sentences into tens of clusters, and select just tens of sentences from these clusters. If our clustering methods are good and we select sentences from a cluster conservatively, we may achieve high precision in the final selection of unlabeled sentences for the target word. Empirical evaluations reported in this paper show promising results. Not surprisingly, we found it easier to achieve good results for homonymy than for polysemy. We hope our methods can be useful for building corpora for learning ambiguous words.
Pages	281-295
Keywords	lexical ambiguity, polysemy, homonymy, clustering, word vector, sentence vector
Journal	ROCLING論文集 (ROCLING Proceedings)
Issue	2019
Publisher	中華民國計算語言學學會 (Association for Computational Linguistics and Chinese Language Processing)
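
The procedure outlined in the abstracts (seed clusters with a few labeled sentences, score a clustering by purity, then conservatively pick the unlabeled sentences nearest the labeled ones) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstracts do not name the clustering algorithm, so a seeded k-means with one cluster per sense is assumed here, and the function names (`seeded_kmeans`, `purity`, `select_sentences`), the `per_cluster` parameter, and the 2-D toy vectors are all hypothetical stand-ins for real sentence vectors.

```python
import math
from collections import Counter


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def seeded_kmeans(vectors, seeds, iterations=20):
    """K-means whose initial centroids come from labeled sentences.

    vectors: list of sentence vectors.
    seeds:   {sentence index: sense label} for the few labeled sentences.
    Assumption: one cluster per sense (the paper uses more clusters).
    """
    senses = sorted(set(seeds.values()))
    centroids = [
        centroid([vectors[i] for i, s in seeds.items() if s == sense])
        for sense in senses
    ]
    assignment = []
    for _ in range(iterations):
        new_assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(v, centroids[k]))
            for v in vectors
        ]
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:  # keep the old centroid if a cluster empties
                centroids[k] = centroid(members)
    return assignment


def purity(assignment, gold):
    """Sum of each cluster's majority-class count, divided by N."""
    clusters = {}
    for a, g in zip(assignment, gold):
        clusters.setdefault(a, Counter())[g] += 1
    return sum(c.most_common(1)[0][1] for c in clusters.values()) / len(gold)


def select_sentences(vectors, assignment, seeds, per_cluster=2):
    """Name each cluster by majority vote of its labeled sentences and
    return the unlabeled sentences closest to those labeled sentences."""
    picks = {}
    for k in sorted(set(assignment)):
        labeled = [i for i in seeds if assignment[i] == k]
        if not labeled:  # conservative: no labeled evidence, no output
            continue
        sense = Counter(seeds[i] for i in labeled).most_common(1)[0][0]
        anchors = [vectors[i] for i in labeled if seeds[i] == sense]
        unlabeled = [
            i for i, a in enumerate(assignment) if a == k and i not in seeds
        ]
        unlabeled.sort(key=lambda i: min(math.dist(vectors[i], a) for a in anchors))
        picks.setdefault(sense, []).extend(unlabeled[:per_cluster])
    return picks


# Toy data: 2-D points standing in for sentence vectors of "bank".
vectors = [
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.4, 0.2],  # financial sense
    [5.0, 5.0], [5.2, 4.9], [4.8, 5.3], [4.5, 4.6],  # riverside sense
]
seeds = {0: "financial institution", 4: "place"}
gold = ["financial institution"] * 4 + ["place"] * 4

assignment = seeded_kmeans(vectors, seeds)
score = purity(assignment, gold)
picks = select_sentences(vectors, assignment, seeds)
```

On this well-separated toy data the two senses fall into separate clusters, so the purity is 1.0. With real sentence vectors, lowering `per_cluster` trades coverage for the higher precision that the abstract associates with conservative selection.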