Title
基於已知名稱搜尋結果的網路實體辨識模型建立工具
Parallel title
A Tool for Web NER Model Generation Using Search Snippets of Known Entities
Authors 黃雅筠, 張嘉惠, 周建龍
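Chinese abstract (translated)
Named entity recognition (NER) research has so far focused on person, location, and organization names in formal text such as news reports, and has paid comparatively little attention to informal Web text. As a result, existing recognition modules perform poorly on Web content, and recognizing named entities in Web pages requires retraining the module. Training a model, however, is very costly in time and labor. It involves preparing a large amount of training data up front and collecting and labeling answers by hand. To improve recognition performance, the data must also be segmented appropriately, its symbols unified, and its text normalized, and features must be designed and a dictionary of known keywords prepared. The work is tedious and complex, and it must be repeated for each language or recognition topic. This paper aims to reduce the effort and time that NER work requires. Training data are labeled automatically from the search results of given known entity names, and an NER module is produced by combining this labeling with the tri-training semi-supervised framework from Chou and Chang's 2014 study on recognizing Chinese person names in Web pages. Experiments confirm that the tool can be applied to named entity recognition across languages and entity types. Performance reaches 86.1% for Chinese organization names, 80.3% for Japanese organization names, and 83.2% for English organization names, and 84.5% for Chinese location names, a different recognition topic. For longer named entities, Chinese and English addresses, performance reaches 97.2% and 94.8%, respectively.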
English abstract
Named entity recognition (NER) is of vital importance in information extraction and natural language processing. Current NER models are trained mainly on journalistic documents such as news articles. Because they have not been trained on informal documents, their performance drops on Web documents, which may lack sentence structure and contain colloquial expressions; state-of-the-art NER systems therefore do not work well on Web documents. Users who want to recognize named entities in Web documents have to retrain a new model, which is labor intensive and time consuming. The preparatory work includes preparing a large set of training data, labeling named entities, selecting an appropriate segmentation, unifying symbols, normalizing text, designing features, preparing a dictionary, and so on. Moreover, this work must be repeated for each language or recognition type. In this research, we propose an NER model generation tool for effective Web entity extraction. We propose a semi-supervised learning approach to NER model training via automatic labeling and tri-training, which makes use of unlabeled data and structured resources containing known named entities. Experiments confirm that the tool can be applied to different languages and to various types of named entities. In Chinese organization name extraction, the generated model achieves an 86.1% F1 score on 38,692 sentences with 16,241 distinct names, while Japanese organization name, English organization name, Chinese location name, Chinese address, and English address recognition reach F1 scores of 80.3%, 83.2%, 84.5%, 97.2%, and 94.8%, respectively.
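The automatic labeling step mentioned in the abstract could look roughly like the sketch below. The abstract does not say how known entity names are matched against search snippets, so the exact longest-match rule, the whitespace tokenization, and the BIO tag scheme are all assumptions, and the names `label_snippet` and `known_entities` are hypothetical. This is an illustration of the general idea, not the authors' implementation.

```python
from typing import List, Tuple


def label_snippet(
    tokens: List[str],
    known_entities: List[List[str]],
    tag: str = "ORG",
) -> List[Tuple[str, str]]:
    """Assign BIO tags to a tokenized snippet by exact match against known names.

    Assumption: an entity is labeled only when its token sequence appears
    verbatim in the snippet; longer names are tried before shorter ones.
    """
    labels = ["O"] * len(tokens)
    # Try longer names first so a long name is not split by a shorter one.
    entities = sorted(known_entities, key=len, reverse=True)
    i = 0
    while i < len(tokens):
        matched = 0
        for entity in entities:
            n = len(entity)
            if n and tokens[i:i + n] == entity:
                matched = n
                break
        if matched:
            labels[i] = f"B-{tag}"
            for j in range(i + 1, i + matched):
                labels[j] = f"I-{tag}"
            i += matched
        else:
            i += 1
    return list(zip(tokens, labels))


if __name__ == "__main__":
    snippet = "I visited National Central University yesterday".split()
    names = [["National", "Central", "University"], ["Central"]]
    for token, label in label_snippet(snippet, names):
        print(token, label)
```

Sentences labeled this way could then serve as seed training data for a semi-supervised procedure such as tri-training, which the abstract names but does not detail.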
Pages 148-163
Keywords Named Entity Recognition, Co-Training, Tri-Training
Journal ROCLING Proceedings (ROCLING論文集)
Issue 2015
Publisher The Association for Computational Linguistics and Chinese Language Processing (中華民國計算語言學學會)
 
