English Abstract |
Previous research on automatic thesaurus construction has mostly focused on extracting relevant terms for each term of concern from a small-scale, domain-specific corpus. This study instead uses the Web as a rich and dynamic corpus for estimating term associations. Beyond extracting relevant terms, we are interested in finding concept-level information for each term of concern. For a single term, our idea is to submit it to Web search engines to retrieve its relevant documents, and then to cluster those documents with a proposed Greedy-EM-based document clustering algorithm, which also determines an appropriate number of relevant concepts for the term. The keywords with the highest weighted log likelihood ratio in each cluster are then treated as the label(s) of the associated concept cluster for the term of concern. Initial experiments show the potential of the proposed approach for finding relevant concepts for unknown terms. |
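
The cluster-labeling step can be illustrated with a minimal sketch. The abstract does not define the weighted log likelihood ratio, so this assumes the commonly used form WLLR(t, c) = P(t|c) · log(P(t|c) / P(t|¬c)) with add-one smoothing. The function name `wllr_labels`, its parameters, and the toy documents are hypothetical, and the clusters are taken as given rather than produced by the Greedy-EM-based algorithm.

```python
import math
from collections import Counter


def wllr_labels(clusters, top_k=1, alpha=1.0):
    """Pick the top_k label keywords per cluster by weighted log likelihood ratio.

    clusters: list of clusters; each cluster is a list of tokenized documents.
    alpha:    additive smoothing constant (an assumption, not from the abstract).
    """
    # Term counts inside each cluster, and over all clusters together.
    counts = [Counter(tok for doc in cluster for tok in doc) for cluster in clusters]
    total = Counter()
    for c in counts:
        total.update(c)
    vocab = sorted(total)
    v = len(vocab)
    n_all = sum(total.values())

    labels = []
    for c in counts:
        n_in = sum(c.values())
        n_out = n_all - n_in
        scores = {}
        for t in vocab:
            # Smoothed P(t | cluster) and P(t | all other clusters).
            p_in = (c[t] + alpha) / (n_in + alpha * v)
            p_out = (total[t] - c[t] + alpha) / (n_out + alpha * v)
            scores[t] = p_in * math.log(p_in / p_out)
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(vocab, key=lambda t: (-scores[t], t))
        labels.append(ranked[:top_k])
    return labels


# Toy documents retrieved for the query term "jaguar", already split into two clusters.
toy_clusters = [
    [["jaguar", "car", "engine"], ["car", "dealer", "jaguar"]],
    [["jaguar", "cat", "jungle"], ["cat", "prey", "jaguar"]],
]
print(wllr_labels(toy_clusters))  # [['car'], ['cat']]
```

Note that the query term itself scores zero here because it is equally likely in both clusters, which is why a ratio-based score suits labeling: it favors terms that distinguish one concept cluster from the others.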