English Abstract |
In this paper, we propose a method for POS tagging Taiwanese that uses a Taiwanese-Mandarin translation dictionary of more than 60 thousand entries and 10 million words of Mandarin training data. The literary written Taiwanese corpora are available in both Romanization script and Han-Romanization mixed script, and their genres include prose, fiction, and drama. We follow the tagset drawn up by CKIP. We develop a word alignment checker to help align words across the two scripts. We then look up each word in the Taiwanese-Mandarin translation dictionary to find the corresponding Mandarin candidate words, select the most suitable Mandarin word using an HMM probabilistic model built from the Mandarin training data, and finally tag the word using an MEMM classifier. We achieve an accuracy of 91.49% on Taiwanese POS tagging and analyze the errors. We also obtain preliminary Taiwanese training data. |