中文計算語言學期刊 (International Journal of Computational Linguistics and Chinese Language Processing)

A Posteriori Individual Word Language Models for Vietnamese Language
Authors: Le Quan Ha, Tran Thi Thu Van, Hoang Tien Long, Nguyen Huu Tinh, Nguyen Ngoc Tham, Le Trong Ngoc
Abstract: We show that the enormous growth in disk storage capacity in recent years can be used to build individual word-domain statistical language models, one for each significant word of a language that contributes to the context of the text. Each of these word-domain language models is a precise domain model for the relevant significant word; when combined appropriately, they provide a highly specific domain language model for the text following a cache, even a short cache. Our individual word probability and frequency models have been constructed and tested for Vietnamese and English. For English, we employed the Wall Street Journal corpus of 40 million word tokens; for Vietnamese, we used the QUB corpus of 6.5 million tokens. Our testing methods used a priori and a posteriori approaches. Finally, we explain the adjustment of a previously exaggerated prediction of the potential power of a posteriori models. In tests of 14 kinds of individual word language models, we obtained perplexity improvements over a baseline global trigram weighted average model of (i) between 33.9% and 53.34% for Vietnamese and (ii) between 30.78% and 44.5% for English. For both languages, the best a posteriori model is the a posteriori weighted frequency model, with a perplexity improvement of 44.5% for English and 53.34% for Vietnamese. In addition, five Vietnamese a posteriori models were tested and achieved word-error-rate (WER) reductions of 9.9% to 16.8% over a Katz trigram model using the same Vietnamese speech decoder.
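The abstract reports its results as percentage perplexity improvements over a baseline model. As a minimal sketch of how such a figure is usually derived, assuming the standard definition of perplexity (the inverse geometric mean of the per-token probabilities a model assigns to the test text), the Python below computes perplexity and the relative improvement. The function names and the toy probabilities are illustrative assumptions and do not come from the paper.

```python
import math


def perplexity(token_probs):
    """Perplexity of a test text, given the probability the model
    assigned to each token (all probabilities must be > 0)."""
    log_sum = sum(math.log2(p) for p in token_probs)
    return 2 ** (-log_sum / len(token_probs))


def relative_improvement(baseline_ppl, model_ppl):
    """Percentage reduction in perplexity relative to the baseline."""
    return 100.0 * (baseline_ppl - model_ppl) / baseline_ppl


# Toy per-token probabilities (hypothetical, for illustration only).
baseline_probs = [0.10, 0.05, 0.20, 0.10]
adapted_probs = [0.20, 0.10, 0.25, 0.20]

baseline_ppl = perplexity(baseline_probs)
adapted_ppl = perplexity(adapted_probs)
improvement = relative_improvement(baseline_ppl, adapted_ppl)
print(f"baseline={baseline_ppl:.2f} adapted={adapted_ppl:.2f} "
      f"improvement={improvement:.1f}%")
```

Under this reading, a reported improvement of 50% would mean the adapted model's perplexity is half that of the baseline on the same test text.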
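Pages: 103-125
Keywords: A posteriori; Stop words; Individual word language models; Frequency models
Journal: 中文計算語言學期刊 (International Journal of Computational Linguistics and Chinese Language Processing)
Issue: June 2010 (Vol. 15, No. 2)
Publisher: 中華民國計算語言學學會 (The Association for Computational Linguistics and Chinese Language Processing)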



