Using Contextual Information in Clustering Methods for Chinese Word Disambiguation
Authors: 李右元, 周子皓, 劉昭麟
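Ambiguous words are extremely common in language. In the past, looking up the senses of an ambiguous word and how each sense is used meant consulting a traditional dictionary. Because of space limits, however, such dictionaries do not include every sense, and they provide few example sentences. Even though technological progress has produced digital dictionaries and retrieval systems, some of these problems remain. As a result, scholars in the humanities must spend considerable effort distinguishing senses by manual reading. In this study, we apply clustering models to a large amount of vectorized Chinese text, use purity scores to identify the most suitable model, and select an appropriate number of example sentences for users' reference. The experiments use manually labeled example sentences as the basis for scoring. The results show that, for ambiguous words of the homonymy type, the macro-average, weighted-average, and accuracy scores all reach 0.85 or higher.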
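As an illustration of the purity score mentioned above, the following minimal sketch assumes the standard definition of purity: each cluster is credited with its most frequent gold label, and the credited counts are divided by the total number of items. The function name `purity` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def purity(cluster_ids, gold_labels):
    """Return the purity of a clustering against manually assigned labels.

    cluster_ids[i] is the cluster of sentence i; gold_labels[i] is its sense.
    """
    if len(cluster_ids) != len(gold_labels):
        raise ValueError("cluster_ids and gold_labels must have equal length")
    if not cluster_ids:
        raise ValueError("at least one item is required")

    # Count how often each gold label occurs inside each cluster.
    counts_by_cluster = {}
    for cluster, label in zip(cluster_ids, gold_labels):
        counts_by_cluster.setdefault(cluster, Counter())[label] += 1

    # Credit each cluster with the size of its majority label.
    majority_total = sum(max(c.values()) for c in counts_by_cluster.values())
    return majority_total / len(gold_labels)


# Toy example: two clusters, each with one sentence of the "wrong" sense.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_labels = ["finance", "finance", "river", "river", "river", "finance"]
toy_purity = purity(toy_clusters, toy_labels)  # 4 of 6 items match the majority
```

A higher purity means that clusters are dominated by a single sense. Purity tends to increase as the number of clusters grows, so it is most informative when the compared models use a similar number of clusters.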
We present preliminary results on using clustering methods to find sentences that are useful for learning ambiguous words. First, we search for sentences that contain an ambiguous word (henceforth, the target word). To make the extracted sentences useful for learning the target word, we guide the clustering methods to separate sentences that carry different senses of the target word into different clusters. We influence a clustering method by providing example sentences that carry specific senses of the target word. In machine learning terms, we label a sentence with the sense that the target word has in that sentence. Two sample labeled sentences for the ambiguous word “bank” follow.
1. “financial institution”: Mr. Black deposited the money in the Citi bank.
2. “place”: Along the bank of the Charles River, you may see the MIT campus.
Assume that we can collect a large number of sentences that contain the target word, and that we need sentences that use a specific sense of it. Assume also that we are willing to label a few of these sentences as described above. A clustering algorithm can then use the labeled sentences to build clusters that suit our needs, for example by treating them as informative seeds for initializing the clusters. The labeled sentences can also guide the selection of unlabeled sentences from the clusters as the final output. If a cluster contains many labeled sentences of a specific sense, the unlabeled sentences in that cluster are likely to carry the same sense. Furthermore, when choosing which unlabeled sentences in that cluster to output, we may prefer those that are closer to the labeled sample sentences.
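The seeding and selection ideas described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the method evaluated in the paper: it assumes sentences are already represented as vectors, uses a simple seeded k-means in which labeled sentences stay pinned to their sense, and ranks unlabeled sentences by Euclidean distance to the nearest labeled sentence. The names `seeded_kmeans` and `select_examples`, the two-dimensional toy vectors, and the fixed iteration count are all hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def seeded_kmeans(vectors, seeds, iterations=10):
    """Cluster vectors into one cluster per sense, initialized from seeds.

    seeds maps each sense to the indices of its labeled sentences
    (at least one index per sense). Returns {sentence index: sense}.
    """
    senses = sorted(seeds)
    pinned = {i: sense for sense in senses for i in seeds[sense]}
    # Initialize each centroid as the mean of that sense's labeled sentences.
    centroids = {s: mean([vectors[i] for i in seeds[s]]) for s in senses}
    assignment = {}
    for _ in range(iterations):
        for i, vector in enumerate(vectors):
            if i in pinned:
                assignment[i] = pinned[i]  # labeled sentences keep their sense
            else:
                assignment[i] = min(
                    senses, key=lambda s: distance(vector, centroids[s])
                )
        for s in senses:
            members = [vectors[i] for i in assignment if assignment[i] == s]
            centroids[s] = mean(members)
    return assignment


def select_examples(vectors, seeds, assignment, per_sense=2):
    """Pick, per sense, the unlabeled sentences closest to a labeled one."""
    labeled = {i for indices in seeds.values() for i in indices}
    selected = {}
    for sense, seed_indices in seeds.items():
        candidates = [
            i for i in assignment if assignment[i] == sense and i not in labeled
        ]
        candidates.sort(
            key=lambda i: min(distance(vectors[i], vectors[j]) for j in seed_indices)
        )
        selected[sense] = candidates[:per_sense]
    return selected


# Toy sentence vectors; indices 0 and 3 play the role of labeled sentences.
toy_vectors = [
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
    [5.0, 5.0], [5.2, 4.9], [4.0, 4.2],
    [1.5, 1.0],
]
toy_seeds = {"financial institution": [0], "place": [3]}
toy_assignment = seeded_kmeans(toy_vectors, toy_seeds)
toy_selection = select_examples(toy_vectors, toy_seeds, toy_assignment)
```

Keeping `per_sense` small corresponds to selecting conservatively: only the unlabeled sentences nearest to a labeled example are returned, which trades coverage for a better chance that the selected sentences carry the intended sense.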
Suppose that we find thousands of sentences that use a target word, provide a certain number of labeled sentences to guide a clustering algorithm, cluster the thousands of sentences into tens of clusters, and select just tens of sentences from these clusters. If our clustering methods are good and we select sentences from each cluster conservatively, we may achieve high precision in the final selection of unlabeled sentences for the target word. Empirical evaluations reported in this paper show promising results. Not surprisingly, we found it easier to achieve good results for homonymy than for polysemy. We hope our methods will be useful for building corpora for learning ambiguous words.
Keywords: lexical ambiguity, polysemy, homonymy, clustering, word vector, sentence vector
