Improving Webpage Capture Performance with Containerization Technology

陳志達; 郭柏均

Title	以容器化技術提升網頁擷取效能
Parallel title	Improving Webpage Capture Performance with Containerization Technology
Authors	陳志達; 郭柏均
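Chinese abstract (translated)	Data is an important asset in the modern era, and one way to obtain large volumes of it is through a fast, efficient distributed crawler architecture. In recent years containerization has become a popular topic: containers are lightweight, conserve system resources, allow execution environments to be set up quickly, and reduce maintenance costs, and many large international companies are already using containerization or moving toward it. This study therefore builds a distributed crawler system on containerization technology, targeting book e-commerce platforms; the crawlers run inside containers and extract book data from web pages. A distributed crawler differs from a traditional architecture in that it can manage multiple crawling tasks at the same time, so the distributed architecture is faster and more efficient than a traditional single-threaded or multi-threaded crawler system. In addition, the book data collected by the distributed crawler system is applied to a book price comparison platform, so that users can find the site offering the largest discount on a given book.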
English abstract	How to obtain useful information is key to the success of an information system. One way to obtain a large amount of information is through web crawlers, and a common solution is a fast and efficient distributed crawler architecture. In recent years, containerization technology has become a hot topic. Containers are lightweight, save system resources, allow an execution environment to be set up quickly, and reduce maintenance costs; many large international companies are therefore using or moving toward containerization technology. This study builds a distributed crawler system based on containerization technology. The targets of the crawler system are book e-commerce platforms: the crawler executes in a container and retrieves book information from web pages. The difference between the distributed crawler architecture and the traditional crawler architecture is that a distributed crawler must manage multiple crawling tasks that are executing at the same time. The system decides which node executes each task through container scheduling and load balancing. As a result, the distributed crawler architecture is faster and more efficient than a traditional single-threaded or multi-threaded crawler system. In addition, this study applies the book data collected by the distributed crawler system to a book price comparison platform, so that users can find the purchase site offering the highest discount on a book.
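Pages	37-55
Keywords	containerization technology, web crawler, distributed crawler, load balancing, task scheduling, Container, Containerization, Web Crawler, Distributed Crawler, Load Balance, Task Scheduler
Journal	資訊與管理科學
Issue	July 2020 (Vol. 13, No. 1)
Publisher	資訊與管理科學期刊編輯委員會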
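
The abstract says the system decides which node executes each crawling task through container scheduling and load balancing, but this record does not describe the algorithm. The following is only a minimal sketch of one common approach (assigning each task to the currently least-loaded node), not the authors' method. The function name `assign_tasks`, the node names, and the `books.example` URLs are hypothetical and chosen purely for illustration.

```python
import heapq
from typing import Dict, List, Tuple


def assign_tasks(urls: List[str], nodes: List[str]) -> Dict[str, List[str]]:
    """Assign each crawl task to the node that currently holds the fewest tasks.

    Illustrative least-loaded policy only; the paper's actual scheduler
    is not specified in this record.
    """
    if not nodes:
        raise ValueError("at least one node is required")
    # Heap entries are (task_count, node_index, node_name). The index breaks
    # ties so that equally loaded nodes are picked in their listed order.
    heap: List[Tuple[int, int, str]] = [
        (0, index, name) for index, name in enumerate(nodes)
    ]
    heapq.heapify(heap)
    plan: Dict[str, List[str]] = {name: [] for name in nodes}
    for url in urls:
        count, index, name = heapq.heappop(heap)  # least-loaded node
        plan[name].append(url)
        heapq.heappush(heap, (count + 1, index, name))
    return plan


if __name__ == "__main__":
    # Hypothetical task list: five listing pages of a book store.
    tasks = [f"https://books.example/page/{n}" for n in range(1, 6)]
    for node, assigned in assign_tasks(tasks, ["crawler-1", "crawler-2"]).items():
        print(node, len(assigned))
```

In a containerized deployment each node name would correspond to a crawler container, and a real scheduler would typically weigh measured load (CPU, memory, queue depth) rather than a simple task count.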