

Title
應用直方圖均化於統計式未知詞萃取之研究
Parallel title
Histogram Equalization for Statistical Unknown Word Extraction
Authors 陳弈璁, 林伯慎
Abstract (translated from Chinese)
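As lifestyles change and information spreads ever faster, new things and new ideas keep appearing, and new words naturally multiply. Learning and recognizing new words is therefore an important capability for any natural language processing system that is to keep pace with the times. This paper uses a statistical machine learning method that combines statistical features with different characteristics to train a word classifier, which is then used for word extraction and verification. However, natural language processing is applied in a very wide range of settings, and the corpora used for training or testing differ in domain and size, so statistics-based methods suffer from a mismatch between the feature distributions of the training set and the test set. We propose applying histogram equalization to normalize the description length gain (DLG) feature values so that the feature distributions of the test set and the training set match each other, which resolves the shifts in feature value range and the distribution differences caused by corpora of different sizes or domains. This makes the statistical word extraction method of this paper more general and applicable to word extraction in different domains. We tested on the traditional Chinese corpora of SIGHAN2; combining four statistical features and normalizing the feature distributions gives the best word verification performance. For the corpora provided by the CKIP group of the Institute of Information Science, Academia Sinica, and by City University of Hong Kong, the F-measure reaches 68.43% and 71.40%, respectively. When we applied this method to extracting unknown words in a novel domain, we found that it can extract unknown words that have salient statistical properties but are hard to extract from semantic structure information, such as the proper nouns 「海角7號」 ("Cape No. 7") and 「金融海嘯」 ("Financial Tsunami"). Conversely, because the method uses no semantic structure rules, it is comparatively weak at extracting unknown words that are personal names, place names, or organization names. We also observed that the unknown words extracted by the statistical method of this paper and those extracted by two word segmentation systems are strongly complementary, so combining these methods appropriately lets each offset the other's weaknesses.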
English abstract
As people's lives change and information spreads ever faster, new things and concepts appear quickly, and new words emerge every day. It is therefore important for natural language processing systems to identify new words. This paper uses a machine learning scheme for Chinese word extraction that combines various statistical features. Because natural language applications span broad areas, however, the statistical characteristics of the training and testing domains are likely to be mismatched, which inevitably degrades word extraction performance. This paper proposes using histogram equalization for feature normalization in statistical approaches. The scheme compensates for the mismatch between the feature distributions of the training set and the testing set when they differ in size or domain, which makes statistical unknown word extraction more robust in novel domains. The scheme was tested on the corpora provided by SIGHAN2. The best results, F-measures of 68.43% and 71.40% for the CKIP corpus and the HKCU corpus respectively, are achieved with four features combined with normalization and histogram equalization. When applied to unknown word extraction in a novel domain, the scheme can identify proper nouns such as “Cape No. 7” (海角七號) and “Financial Tsunami” (金融海嘯), which are not easy to extract with approaches based on semantic characteristics. The scheme is less effective at extracting new terms such as the names of people, places, and organizations, in which semantic structure is prominent. Compared with the unknown words extracted by two Chinese word segmentation systems, the scheme proves complementary to those approaches, and combining approaches with different capabilities is promising.
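The abstract does not spell out the equalization procedure, so the following is only a minimal sketch of one common way to do histogram equalization against a reference distribution: each test-set feature value is replaced by the training-set value that sits at the same empirical cumulative rank. The function name `equalize_to_reference` and the sample values are hypothetical and are not taken from the paper.

```python
from bisect import bisect_right


def equalize_to_reference(test_values, train_values):
    """Map test feature values onto the training distribution by CDF matching.

    Assumption: a nearest-rank inverse CDF is used; the paper may use a
    different binning or interpolation scheme.
    """
    train_sorted = sorted(train_values)
    test_sorted = sorted(test_values)
    n_test, n_train = len(test_sorted), len(train_sorted)
    mapped = []
    for value in test_values:
        # Number of test values <= value, i.e. the empirical CDF times n_test.
        rank = bisect_right(test_sorted, value)
        # Index of the training value at the same cumulative rank, computed
        # with integer arithmetic as ceil(rank * n_train / n_test) - 1.
        idx = (rank * n_train + n_test - 1) // n_test - 1
        mapped.append(train_sorted[idx])
    return mapped


# Hypothetical feature values: the test corpus yields a much larger range
# than the training corpus, as can happen when corpus sizes differ.
train_features = [0.0, 1.0, 2.0, 3.0]
test_features = [100.0, 300.0, 200.0, 400.0]
print(equalize_to_reference(test_features, train_features))
```

After the mapping, the test values lie in the training range and keep their relative order, so a classifier trained on the training-set features sees inputs on a familiar scale.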
Pages 364-378
Keywords Unknown Word Extraction, Machine Learning, Multilayer Perceptrons, Chinese Word Extraction, Histogram Equalization
Publication ROCLING Proceedings (ROCLING論文集)
Issue 2010
Publisher The Association for Computational Linguistics and Chinese Language Processing (中華民國計算語言學學會)
 
