English abstract |
To address the large number of duplicated webpages on the Internet, this paper proposes a near-duplicate webpage detection algorithm and system that combine multi-feature fingerprint cluster detection with document similarity detection. In this scheme, multi-feature fingerprint cluster detection is applied first to ensure the precision and efficiency of the algorithm; for the small portion of documents that this first stage fails to recall, a near-duplicate detection algorithm based on document similarity is then applied to guarantee the recall rate. The scheme improves both precision and recall while maintaining a good balance with performance. |