English Abstract |
Search engines return thousands of pages per query. Many of them are relevant to the “query words” but not interesting to the “users”, because the query terms carry different domain-specific meanings. Re-classifying the returned documents according to the domain-specific meanings of the query terms would therefore be most effective. A cross-domain entropy (CDE) measure is proposed to extract characteristic domain-specific words (DSWs) for each node of existing hierarchical web document trees. Domain-specific class models are built from the respective DSWs. These class models are then used to classify new documents directly into the hierarchy, instead of using hierarchical clustering techniques. High accuracy can be achieved with very few domain-specific words. With only the top 5-10% of DSWs and a maximum-entropy-based classifier, 99% accuracy is observed when classifying documents of a news web site into 63 domains. The precision and recall of the extracted domain-specific words are also higher than those of words extracted with the conventional TF-IDF term weighting method. |
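
The abstract does not give the formula for the CDE measure, so the following is only a minimal sketch of one plausible reading: a term whose occurrences are concentrated in a few domains has low entropy across domains and is a candidate domain-specific word. The function names (`cross_domain_entropy`, `top_domain_words`), the use of raw, unnormalized term counts, the tie-breaking rule, and the toy data are all assumptions made for illustration and are not taken from the original work.

```python
import math
from collections import Counter


def cross_domain_entropy(term_counts_by_domain):
    """Entropy (in bits) of each term's distribution over domains.

    term_counts_by_domain maps a domain name to a Counter of term counts.
    Assumption: P(domain | term) is estimated from raw counts, with no
    correction for domain size.
    """
    totals = Counter()
    for counts in term_counts_by_domain.values():
        totals.update(counts)

    entropy = {}
    for term, total in totals.items():
        h = 0.0
        for counts in term_counts_by_domain.values():
            c = counts.get(term, 0)
            if c > 0:
                p = c / total
                h -= p * math.log2(p)
        entropy[term] = h
    return entropy


def top_domain_words(term_counts_by_domain, domain, fraction=0.1):
    """Keep the lowest-entropy fraction of the terms seen in one domain."""
    entropy = cross_domain_entropy(term_counts_by_domain)
    domain_counts = term_counts_by_domain[domain]
    terms = [t for t, c in domain_counts.items() if c > 0]
    # Low entropy first; among ties, prefer terms more frequent in the domain.
    terms.sort(key=lambda t: (entropy[t], -domain_counts[t]))
    keep = max(1, int(len(terms) * fraction))
    return terms[:keep]


# Hypothetical toy corpus: two domains sharing the ambiguous term "game".
sample = {
    "sports": Counter({"pitcher": 8, "game": 4}),
    "finance": Counter({"stock": 6, "game": 4}),
}
sample_entropy = cross_domain_entropy(sample)
sample_sports_words = top_domain_words(sample, "sports", fraction=0.5)
```

In this toy corpus, "game" is spread evenly over both domains and so receives the highest entropy, while "pitcher" and "stock" each occur in a single domain and receive zero entropy. The selected words could then serve as the feature set for any classifier; the abstract reports using a maximum-entropy-based one.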