Chinese Abstract |
There is a need to measure word similarity when processing natural languages,
especially when using generalization, classification, or example-based approaches.
Usually, measures of similarity between two words are defined according to the
distance between their semantic classes in a semantic taxonomy. Taxonomy-based
approaches are largely semantic and do not consider syntactic
similarities. However, real applications require both semantic and syntactic
similarities, weighted differently. Word similarity based on context vectors is
a mixture of syntactic and semantic similarities.
In this paper, we propose using only syntactically related co-occurrences as context
vectors and adopt information-theoretic models to solve the problems of data
sparseness and characteristic precision. The probabilistic distribution of
co-occurrence context features is derived by parsing the contextual environment
of each word, and all the context features are adjusted according to their IDF
(inverse document frequency) values. The agglomerative clustering algorithm is
applied to group similar words according to their similarity values. It turns out
that words with similar syntactic categories and semantic classes are grouped
together. |
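
The pipeline summarized above can be illustrated with a minimal sketch. It is not the implementation used in this paper. The toy feature lists, the `relation:head` feature names, the use of cosine similarity, the average-linkage criterion, and the stopping threshold of 0.2 are all assumptions made for illustration. The abstract only states that context features come from parsing, are adjusted by IDF, and are clustered agglomeratively.

```python
import math
from collections import Counter

# Hypothetical toy data: each word maps to the syntactic co-occurrence
# features that a parser might extract for it (names are illustrative).
contexts = {
    "cat":  ["subj:eat", "subj:sleep", "mod:small", "obj:feed"],
    "dog":  ["subj:eat", "subj:sleep", "mod:big", "obj:feed"],
    "run":  ["adv:quickly", "subj_of:dog", "adv:slowly"],
    "walk": ["adv:quickly", "subj_of:cat", "adv:slowly"],
}

def idf_weights(contexts):
    """IDF of each feature, treating each word's feature set as one 'document'."""
    n = len(contexts)
    df = Counter(f for feats in contexts.values() for f in set(feats))
    return {f: math.log(n / d) for f, d in df.items()}

def weighted_vector(feats, idf):
    """Relative frequency of each feature, scaled by its IDF value."""
    counts = Counter(feats)
    total = sum(counts.values())
    return {f: (c / total) * idf[f] for f, c in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors (an assumed measure)."""
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def agglomerate(vectors, threshold):
    """Average-linkage agglomerative clustering; stops below the threshold."""
    clusters = [[w] for w in sorted(vectors)]

    def link(a, b):
        pairs = [(x, y) for x in a for y in b]
        return sum(cosine(vectors[x], vectors[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        best, i, j = max(
            (link(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

idf = idf_weights(contexts)
vectors = {w: weighted_vector(feats, idf) for w, feats in contexts.items()}
clusters = agglomerate(vectors, threshold=0.2)
print(clusters)
```

On this toy data the two noun-like words and the two verb-like words share no features with each other, so they end up in separate clusters. A real system would replace the hand-written feature lists with parser output and might use a different similarity measure or linkage criterion.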