Chinese Abstract |
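With the spread of mobile devices, local search has become an emerging and popular service. To offer complete service, however, local search must let users accurately find nearby points of interest (POIs), such as restaurants, hotels, bus stops, karaoke venues, libraries, and pharmacies, covering places for dining, clothing, lodging, transportation, education, and entertainment. To this end, we need to build a complete POI database for users to query. Moreover, as the Internet has flourished, more and more users share travel experiences or POI information on their blogs or social networks, while more businesses and organizations set up official web pages that describe themselves in detail. As pages of this kind accumulate, the Internet as a whole has become the largest source of POI information. In this paper we propose a POI construction system based on Web information. The system has two main parts. The first is the crawling of address-bearing pages (ABPs), whose purpose is to use the addresses in web pages to find candidate POIs, along with POI-related descriptive information that can be used for retrieval. The second is the POI extraction system. Using a Chinese organization name recognition model and a Chinese address recognition model, both trained with conditional random fields (CRF) as the learning algorithm, it finds every address and organization name that appears in a page, then pairs addresses with organization names to form POI records, and finally extracts the related information for each POI. |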
English Abstract |
With the growing popularity of mobile devices, local search has become a popular new service. Supporting it requires a comprehensive POI (point of interest) database. In recent years, the Web has become the largest source of POI data. As Internet use has spread, people share their travel experiences and information about POIs they have visited on social networks, in their blogs, and even in check-in posts. In addition, many companies and organizations publish information about their businesses on their own websites, which yields a large number of POIs. In this paper, we propose a system that constructs a POI database from the immense data on the Web. Our system consists of two parts: a query-based crawler and a POI extraction system. The goal of the query-based crawler is to collect address-bearing pages (ABPs) from the Web, since an address is a good indicator of a POI. The second part is the POI extraction system. We use CRFs (conditional random fields) to train a Chinese postal address recognition model and a Chinese organization name recognition model. After extracting addresses and POI names from ABPs with these two CRF models, we learn a model that pairs an address with a POI name to form a POI. Finally, we extract the associated information for each POI to build a complete POI record. |
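
The extraction stage described above can be illustrated with a minimal sketch. It assumes the two CRF models emit per-character BIO labels (`B-ORG`, `I-ORG`, `B-ADDR`, `I-ADDR`, `O`), which is a common convention but is not stated in the abstract. The names `Span`, `decode_bio`, and `pair_nearest` are hypothetical. The abstract says the pairing is done by a learned model; the nearest-distance rule below is only a simple stand-in for that step, not the authors' method.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Span:
    kind: str   # "ORG" or "ADDR" (assumed label names)
    start: int  # index of the first character
    end: int    # index one past the last character
    text: str


def decode_bio(chars: Sequence[str], labels: Sequence[str]) -> List[Span]:
    """Turn per-character BIO labels into labelled spans."""
    spans: List[Span] = []
    kind, start = None, 0
    # A trailing "O" sentinel closes any span still open at the end.
    for i, lab in enumerate(list(labels) + ["O"]):
        if kind is not None and lab != "I-" + kind:
            spans.append(Span(kind, start, i, "".join(chars[start:i])))
            kind = None
        if lab.startswith("B-"):
            kind, start = lab[2:], i
    return spans


def pair_nearest(spans: Sequence[Span]) -> List[Tuple[str, str]]:
    """Pair each address with the closest organization name.

    Stand-in heuristic (assumption): the source learns a pairing model instead.
    """
    orgs = [s for s in spans if s.kind == "ORG"]
    pairs: List[Tuple[str, str]] = []
    if not orgs:
        return pairs
    for addr in (s for s in spans if s.kind == "ADDR"):
        # Character gap between the two spans; 0 if they touch or overlap.
        best = min(
            orgs,
            key=lambda o: max(o.start - addr.end, addr.start - o.end, 0),
        )
        pairs.append((best.text, addr.text))
    return pairs


# Made-up example sentence and hand-written labels, for illustration only.
text = "好味餐廳位於台北市中正路1號"
labels = ["B-ORG"] + ["I-ORG"] * 3 + ["O", "O"] + ["B-ADDR"] + ["I-ADDR"] * 7
pois = pair_nearest(decode_bio(text, labels))
```

In this sketch `pois` holds (organization name, address) tuples; a fuller system would then attach the associated information for each pair to form the final POI record.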