The Pilot Study and Model Construction for Word Segmentation in Taiwan Hakka

葉秋杏; 賴惠玲; 劉吉軒

Title	臺灣客語斷詞前導研究與模型建立
Parallel title	The Pilot Study and Model Construction for Word Segmentation in Taiwan Hakka
Authors	葉秋杏, 賴惠玲 (Hui-Ling Lai), 劉吉軒
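Chinese abstract (translated)	Word segmentation plays a key role in natural language processing and information retrieval. In the pilot segmentation study for the Taiwan Hakka Corpus, the Mandarin word segmentation system Stanford CoreNLP was applied to Hakka texts through Mandarin-Hakka correspondence conversion in order to segment and tag them. Performance was unsatisfactory, however, because many words are hard to translate between Mandarin and Hakka and because vocabulary and pronunciation also differ among the subdialects of Taiwan Hakka. The Taiwan Hakka Corpus therefore proposes a segmentation model dedicated to Hakka: it builds a Hakka lexicon whose entries are listed separately for the six accents, and it is designed around a longest-word-first (maximum matching) algorithm and a dynamic programming algorithm. Evaluation results show that lexicon lookup and word frequency statistics (via the longest-word-first algorithm and an N-gram language model) complement each other, and that both segmentation performance and segmentation accuracy improve markedly.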
English abstract	Word segmentation plays a key role in natural language processing and data retrieval queries. The pilot study employed Stanford CoreNLP, a word segmentation system for Chinese, to segment and tag Hakka texts in the Taiwan Hakka Corpus. The performance was unsatisfactory, however, because many words have no straightforward translation between Mandarin and Hakka and because the six dialects of Hakka differ in lexicon and pronunciation. For these reasons, a tailor-made Hakka segmentation model was constructed that incorporates a Hakka lexicon covering the six accents and applies the Maximum Matching Algorithm (MM) and a dynamic programming algorithm. The segmentation evaluation results show that combining lexicon lookup and word frequency statistics (with the Maximum Matching Algorithm and an N-gram Language Model) significantly improves both segmentation performance and accuracy.
Pages	44-53
Keywords	Hakka word segmentation, Maximum Matching Algorithm (MM), N-gram Language Model, word frequency statistics, Taiwan Hakka Corpus
Journal	ROCLING Proceedings (ROCLING論文集)
Issue	202310 (2023 issue)
Publisher	中華民國計算語言學學會 (The Association for Computational Linguistics and Chinese Language Processing)
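
The two techniques named in the abstract, lexicon lookup with maximum matching (longest word first) and frequency-based segmentation solved by dynamic programming, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `max_match` and `unigram_dp`, the `max_len` limit, the unknown-character penalty, and the toy lexicon of Latin placeholder strings are all assumptions made for illustration, and the language model is simplified to a unigram (N = 1) model. How the paper actually combines the two components is not stated in the abstract.

```python
import math

def max_match(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest lexicon entry."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no lexicon entry matches.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens

def unigram_dp(text, freq, max_len=4):
    """Dynamic programming: choose the split with the highest total
    log-probability under a unigram model built from word counts."""
    total = sum(freq.values())
    unk = math.log(1 / (total * 10))  # assumed penalty for an unknown character
    best = [0.0] + [-math.inf] * len(text)  # best[i]: best score for text[:i]
    back = [0] * (len(text) + 1)            # back[i]: start of the last word in text[:i]
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in freq:
                score = math.log(freq[piece] / total)
            elif end - start == 1:
                score = unk
            else:
                continue  # unknown multi-character strings are not candidates
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    # Walk the back-pointers from the end of the text to recover the words.
    tokens, end = [], len(text)
    while end > 0:
        tokens.append(text[back[end]:end])
        end = back[end]
    return tokens[::-1]

# Toy counts over placeholder strings (hypothetical, not Hakka data).
freq = {"ab": 50, "cd": 50, "abc": 5, "d": 5}
print(max_match("abcd", set(freq)))  # ['abc', 'd']
print(unigram_dp("abcd", freq))      # ['ab', 'cd']
```

On this toy input the greedy longest-match pass commits to the rare entry `abc`, while the frequency-based pass prefers the two common entries. That is one way in which word frequency statistics can complement pure lexicon lookup.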