Chinese Abstract |
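With the rapid development of information technology and the Internet, techniques for extracting needed information from natural language (Information Extraction) have become increasingly important. This study targets the "Food" board of PTT, the largest bulletin board system (BBS) in Taiwan, and develops a method that automatically extracts restaurant-related information from articles and determines the restaurant type, so that restaurant information can be obtained more quickly and conveniently. The work consists of three parts. The first part is the extraction of restaurant-related information: articles on the PTT Food board are retrieved with PTT Crawler and formatted, specific article titles are selected by keyword matching, and regular expressions are used to extract the restaurant name, telephone number, address, and URL contained in the article body. The second part uses article titles as the source for extracting restaurant types (e.g., coffee, shabu-shabu, Taiwanese cuisine): 10,000 titles were randomly selected, and the restaurant types implied in them were manually labeled. Finally, WIDM NER TOOL, developed by the WIDM laboratory and built on Conditional Random Fields (CRF), was used to run supervised and semi-supervised learning experiments; the results show that this approach achieves fairly good performance on restaurant type extraction. |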
English Abstract |
In this study, we aim to develop a system that automatically extracts restaurant types from the FOOD board of PTT, the largest BBS site in Taiwan. The work is divided into three parts. The first part is pre-processing, in which we crawl articles from the PTT FOOD board and extract the title, restaurant name, telephone number, address, and URL via keyword matching and regular expressions. The second part is restaurant type labeling from title data, for which we used WIDM NER TOOL to train a model for restaurant type extraction. The last part is the experiments: we randomly selected 10,000 titles for manual labeling and testing, used the labeled data for supervised learning, and added unlabeled data for semi-supervised learning. The results show that this method performs fairly well on restaurant type extraction. |
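
The pre-processing step described above (keyword matching on titles, regular expressions on the article body) might look roughly like the following Python sketch. The abstract does not give the actual keywords or patterns, so the title tag `[食記]`, the field labels `餐廳名稱`, `電話`, and `地址`, and the function names `is_review_title` and `extract_restaurant_info` are all assumptions made for illustration, not the authors' implementation.

```python
import re

# Hypothetical title keywords; the abstract does not list the real ones.
TITLE_KEYWORDS = ("[食記]",)

# Hypothetical field patterns, assuming posts label each field explicitly.
# Both the ASCII colon and the full-width colon are accepted.
FIELD_PATTERNS = {
    "name": re.compile(r"餐廳名稱\s*[::]\s*(.+)"),
    "phone": re.compile(
        r"電話\s*[::]\s*(\(?0\d{1,2}\)?[-\s]?\d{3,4}[-\s]?\d{3,4})"
    ),
    "address": re.compile(r"地址\s*[::]\s*(.+)"),
    "url": re.compile(r"(https?://\S+)"),
}


def is_review_title(title):
    """Keep a title only if it contains one of the assumed keywords."""
    return any(keyword in title for keyword in TITLE_KEYWORDS)


def extract_restaurant_info(body):
    """Return the first match for each field, or None when a field is absent."""
    info = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(body)
        info[field] = match.group(1).strip() if match else None
    return info


sample_body = (
    "餐廳名稱:好吃咖啡\n"
    "電話:(02)2345-6789\n"
    "地址:台北市大安區某路1號\n"
    "網址:http://example.com/cafe\n"
)
sample_info = extract_restaurant_info(sample_body)
```

Returning `None` for a missing field, rather than raising an error, keeps the sketch tolerant of posts that omit a field, which is likely common in free-form forum articles.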