English Abstract
This paper investigates several issues in the application of class-based Chinese language models, especially the SA-class bigram model, in which the word classes are automatically clustered by simulated annealing. The studied issues include (1) using test-set perplexity as a quality measure for evaluating the performance of language models across domains, subdomains, and character codings; (2) applying the SA-class bigram model to different applications: OCR postprocessing, syllable-to-character conversion, and linguistic decoding for speech recognition; (3) comparing the model with other language models: least-word, word-frequency, inter-word character bigram, and word bigram; and (4) deciding the appropriate number of classes based on corpus size. The experimental results show that test-set perplexity is indeed a good measure for the performance evaluation of language models, and that the SA-class bigram model is not only theoretically plausible but also practically feasible, delivering high performance with lower resource requirements.
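As an illustrative sketch (not taken from the paper), the test-set perplexity used as the quality measure above is conventionally computed as 2 raised to the negative average log2-probability the model assigns to the test tokens. The function and the uniform toy model below are hypothetical, assuming a smoothed bigram model that assigns nonzero probability to every pair:

```python
import math

def perplexity(bigram_prob, test_tokens):
    """Compute test-set perplexity of a bigram language model.

    bigram_prob: function (prev_word, word) -> probability (assumed smoothed, > 0)
    test_tokens: list of tokens, beginning with a sentence-start marker
    """
    log_prob = 0.0
    n = 0
    # Accumulate log2 P(w_i | w_{i-1}) over all adjacent token pairs.
    for prev, word in zip(test_tokens, test_tokens[1:]):
        log_prob += math.log2(bigram_prob(prev, word))
        n += 1
    # Perplexity = 2^(-average log2 probability per token).
    return 2 ** (-log_prob / n)

# Toy usage: a hypothetical uniform model over a 10-word vocabulary.
uniform = lambda prev, word: 0.1
tokens = ["<s>", "a", "b", "c"]
print(perplexity(uniform, tokens))  # ~10: uniform 1/10 predictions
```

A lower perplexity means the model is, on average, less "surprised" by the test text, which is why it serves as a cross-domain quality measure for the language models compared in the paper.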