| English abstract |
Word segmentation plays a key role in natural language processing and in data retrieval queries. A pilot study used Stanford CoreNLP, which provides a word segmentation system for Chinese, to segment and tag the Hakka texts in the Taiwan Hakka Corpus. The results were unsatisfactory, however, because Mandarin and Hakka expressions often lack straightforward translation correspondences and because vocabulary and pronunciation vary across the six dialects of Hakka. For these reasons, a tailor-made Hakka segmentation model was constructed: it incorporates a Hakka lexicon covering the six accents and applies the Maximum Matching (MM) algorithm together with a dynamic programming algorithm. The segmentation evaluation results show that combining lexicon lookup with word frequency statistics (the Maximum Matching algorithm with an N-gram language model) significantly improves both segmentation performance and accuracy. |
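
The two techniques named in the abstract can be illustrated with a minimal sketch. This is not the authors' system: the function names, the toy lexicon, and the word counts below are hypothetical, and the language model is simplified to a unigram (N = 1) model with an assumed fixed penalty for unknown single characters. The first function does greedy forward maximum matching against a lexicon; the second uses dynamic programming to pick the segmentation with the highest total log-probability under the word counts.

```python
import math

# Hypothetical toy word counts for illustration only; not taken from the Taiwan Hakka Corpus.
TOY_COUNTS = {"臺灣": 50, "客家": 40, "客家話": 30, "話語": 5, "語料": 20, "語料庫": 25}


def forward_max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching: take the longest lexicon word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


def unigram_dp_segment(text, counts, max_len=4):
    """Dynamic programming over a unigram model: maximize the sum of word log-probabilities."""
    total = sum(counts.values())
    # Assumed penalty for an out-of-lexicon single character (a crude smoothing choice).
    unknown_logp = math.log(1.0 / (total * 10))
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best score for text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word in that best split
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in counts:
                logp = math.log(counts[word] / total)
            elif end - start == 1:
                logp = unknown_logp
            else:
                continue  # unknown multi-character strings are not considered words
            score = best[start] + logp
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the text to recover the words.
    tokens = []
    end = n
    while end > 0:
        start = back[end]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1]


if __name__ == "__main__":
    sentence = "臺灣客家話語料庫"
    print(forward_max_match(sentence, set(TOY_COUNTS)))
    print(unigram_dp_segment(sentence, TOY_COUNTS))
```

On this toy input both functions return the same three words. In general the greedy matcher commits to the longest word at each position, whereas the dynamic programming pass compares all lexicon-consistent splits, which is one plausible reading of why the abstract pairs lexicon lookup with frequency statistics.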