English Abstract |
We present preliminary results on using clustering methods to find sentences that are useful for learning ambiguous words. First, we search for sentences that contain an ambiguous word (henceforth, the target word). To make the extracted sentences useful for learning the target word, we guide the clustering methods so that sentences carrying different senses of the target word fall into different clusters. We influence a clustering method by providing example sentences that carry specific senses of the target word. In machine learning terms, we label a sentence with the sense that the target word has in that sentence. Two sample labeled sentences for the ambiguous word “bank” follow. 1. “financial institution”: Mr. Black deposited the money in the Citi bank. 2. “place”: Along the bank of the Charles River, you may see the MIT campus. Suppose that we can collect a large number of sentences that contain the target word, and that we need those that use a specific sense of it. Suppose also that we are willing to label a few of these sentences as described above. A clustering algorithm may then use the labeled sentences to build clusters of sentences that meet our needs. The algorithm may use the labeled sentences as informative seeds for initializing the clusters. In addition, when we select (unlabeled) sentences from the clusters as the final output, the labeled sentences may guide us toward sentences of the “correct” sense. If a cluster contains many labeled sentences of a specific sense, the (unlabeled) sentences in that cluster are likely to carry the same sense. Furthermore, when choosing which (unlabeled) sentences in the cluster to output, we may prefer those that are closer to the labeled sentences. 
Suppose that we find thousands of sentences that use a target word, provide a certain number of labeled sentences to guide a clustering algorithm, cluster the thousands of sentences into tens of clusters, and select just tens of sentences from these clusters. If our clustering methods are good and we select sentences from each cluster conservatively, we may achieve high precision in the final selection of unlabeled sentences for the target word. Empirical evaluations reported in this paper show promising results. Not surprisingly, we found it easier to achieve good results for homonymy than for polysemy. We hope our methods can be useful for building corpora for learning ambiguous words. |
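The seeded clustering and conservative selection outlined in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the features, the similarity measure, or the clustering algorithm, so the bag-of-words context vectors, cosine similarity, k-means-style loop with one cluster per sense, and the names `context_vector`, `seeded_clusters`, `select_sentences`, `top_k`, and `min_sim` are all hypothetical choices made for this example.

```python
import math
from collections import Counter

PUNCT = '.,;:!?"\'()'


def context_vector(sentence, target):
    """Bag-of-words counts of the words around the target (assumed features)."""
    tokens = (tok.strip(PUNCT).lower() for tok in sentence.split())
    return Counter(tok for tok in tokens if tok and tok != target)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def seeded_clusters(labeled, unlabeled, target, iterations=5):
    """Cluster unlabeled sentences, one cluster per labeled sense.

    labeled is a list of (sense, sentence) pairs; the labeled sentences
    initialize the centroids and always remain part of their cluster.
    """
    seeds = {}
    for sense, sentence in labeled:
        seeds.setdefault(sense, Counter()).update(context_vector(sentence, target))
    vectors = [(s, context_vector(s, target)) for s in unlabeled]
    centroids = {sense: Counter(vec) for sense, vec in seeds.items()}
    clusters = {sense: [] for sense in seeds}
    for _ in range(iterations):
        # Assignment step: each sentence joins the most similar centroid.
        clusters = {sense: [] for sense in seeds}
        for sentence, vec in vectors:
            best = max(centroids, key=lambda sense: cosine(vec, centroids[sense]))
            clusters[best].append((sentence, vec))
        # Update step: centroid = seed counts plus member counts
        # (a sum is enough because cosine ignores vector length).
        centroids = {}
        for sense, seed_vec in seeds.items():
            total = Counter(seed_vec)
            for _, vec in clusters[sense]:
                total.update(vec)
            centroids[sense] = total
    return seeds, clusters


def select_sentences(labeled, unlabeled, target, top_k=10, min_sim=0.5):
    """Keep only the cluster members that are close to the labeled seeds."""
    seeds, clusters = seeded_clusters(labeled, unlabeled, target)
    selected = {}
    for sense, members in clusters.items():
        scored = [(cosine(vec, seeds[sense]), sentence) for sentence, vec in members]
        scored.sort(reverse=True)
        selected[sense] = [s for sim, s in scored if sim >= min_sim][:top_k]
    return selected
```

Calling `select_sentences(labeled, unlabeled, "bank")` with a few `(sense, sentence)` pairs and a list of raw sentences returns, for each sense, the unlabeled sentences ranked by similarity to the seeds. Raising `min_sim` or lowering `top_k` makes the selection more conservative, trading recall for the higher precision the abstract aims at.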