English Abstract |
Obtaining useful information is key to the success of an information system, and web crawlers are one way to collect large amounts of it. A fast and efficient distributed crawler architecture is a common solution. In recent years, containerization technology has become a hot topic. Containers are lightweight, conserve system resources, can quickly set up an execution environment, and reduce maintenance costs; as a result, many large international companies are now using or moving toward containerization. This study therefore builds a distributed crawler system based on containerization technology. The system targets book e-commerce platforms: the crawlers run in containers and retrieve book information from web pages. A distributed crawler architecture differs from a traditional one in that it must manage multiple crawling tasks executing at the same time. The system uses container scheduling and load balancing to decide which node executes each task. As a result, the distributed crawler system is faster and more efficient than a traditional single-threaded or multi-threaded crawler system. In addition, this study applies the book data collected by the distributed crawler system to a book price comparison platform, so that users can find the site offering the largest discount on a book. |