A Way to Extract Unknown Words Without Dictionary from Chinese Corpus and Its Applications

林義証; 余明興; 黃世陽; 吳明哲

Title	A Way to Extract Unknown Words Without Dictionary from Chinese Corpus and Its Applications
Authors	林義証、余明興、黃世陽、吳明哲
Abstract	We propose a method for detecting unknown words in a corpus. We call these unknown words Chinese frequent strings (CFS). Such a string may be a combination of common Chinese words that are defined in a traditional dictionary, and it appears more than once in Chinese text. The proposed method detects these strings automatically, without using any lexicon and without word segmentation. We retrieved 55,518 Chinese frequent strings (up to 13-grams in characters) from a corpus of 536,171 characters. To show that the retrieved strings are useful, we applied them to Chinese phoneme-to-character and character-to-phoneme tasks. The test corpus contains manually tagged phonetic symbols for each character. Accuracy is 96.5% on the phoneme-to-character test and 99.7% on the character-to-phoneme test. We also ran a mean opinion score (MOS) test on the determination of prosodic segments. The MOS score is 4.66 relative to the prosodic segments in spontaneous speech, which shows that the retrieved strings are helpful for this purpose as well.
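Pages	217-226
Keywords	unknown words, phoneme-to-character, character-to-phoneme, prosodic segment
Journal	ROCLING Proceedings (ROCLING論文集)
Issue	1998
Publisher	Graduate Institute of Guidance and Counseling, National Kaohsiung Normal University (國立高雄師範大學輔導與諮商研究所)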
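Previous article in this issue	Corpus-based Evaluation of Language Processing Systems Using Information Restoration Model
Next article in this issue	應用隱藏式馬可夫模型於口述對話系統之研究 (A Study on Applying Hidden Markov Models to Spoken Dialogue Systems)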
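
The abstract says that frequent strings are found without a lexicon or word segmentation, but this record does not describe the algorithm itself. The following is only a minimal sketch of one naive way to do it, not the authors' method: count every character n-gram up to 13 characters and keep those that occur more than once. The function names, the minimum length of 2 characters, and the rule for dropping substrings of longer strings with the same count are all assumptions made for illustration.

```python
from collections import Counter


def frequent_strings(text, max_len=13, min_count=2):
    """Count character n-grams of length 2..max_len; keep those seen >= min_count times.

    Assumption: the lower bound of 2 characters and the brute-force counting
    are illustrative choices, not taken from the paper.
    """
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {s: c for s, c in counts.items() if c >= min_count}


def drop_subsumed(freq):
    """Drop a string when a longer kept string contains it with the same count.

    Assumption: this filtering rule is hypothetical; it only reduces
    redundant substrings in the output.
    """
    kept = {}
    for s, c in freq.items():
        if not any(s != t and s in t and c == freq[t] for t in freq):
            kept[s] = c
    return kept


if __name__ == "__main__":
    sample = "自然語言處理很有趣，自然語言處理很實用"
    for string, count in drop_subsumed(frequent_strings(sample)).items():
        print(string, count)
```

Brute-force counting is quadratic in practice and would be slow on a corpus of 536,171 characters; a suffix array or similar index would be the usual replacement, but that is outside what this record states.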
