English Abstract
This paper describes a new approach to constructing a class-based language model and reports an estimate of an upper bound on the entropy of Chinese obtained with the model. A class-based n-gram model built on an existing machine-readable thesaurus is shown to lower the cross entropy between the language model and a balanced corpus of 300,000 words. The cross entropy between the corpus and the proposed language model is 12.66 bits per word, or 3.88 bits per byte, which is better than that of another class-based language model, the inter-word character bigram model, by 0.6 bit per word. In the course of estimating the entropy, we found that unknown words account for a disproportionately large share of the entropy and are the major bottleneck in obtaining lower entropy, and hence better language models, for tasks such as OCR and speech recognition.
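To make the reported figures concrete, the following Python sketch shows how per-word cross entropy can be computed under a class-based bigram model and converted to bits per byte. It is illustrative only: the function names, the dictionary-based probability tables, and the choice of a two-byte Chinese encoding for the byte count are assumptions, not details taken from the paper.

```python
import math

def class_bigram_prob(prev_word, word, word2class, class_bigram, word_given_class):
    """Class-based bigram: P(w_i | w_{i-1}) = P(c_i | c_{i-1}) * P(w_i | c_i).
    All three distributions are hypothetical lookup tables supplied by the caller."""
    prev_c, c = word2class[prev_word], word2class[word]
    return class_bigram[(prev_c, c)] * word_given_class[(word, c)]

def cross_entropy_bits_per_word(corpus, word2class, class_bigram, word_given_class):
    """Cross entropy H = -(1/N) * sum_i log2 P(w_i | w_{i-1}) over a word-segmented test corpus."""
    total = 0.0
    for prev_word, word in zip(corpus, corpus[1:]):
        p = class_bigram_prob(prev_word, word, word2class, class_bigram, word_given_class)
        total += -math.log2(p)
    return total / (len(corpus) - 1)

def bits_per_byte(bits_per_word, corpus, encoding="big5"):
    """Convert bits/word to bits/byte using the average encoded word length.
    The Big5 encoding here is an assumption; any two-byte Chinese encoding
    gives roughly 3.3 bytes per word, matching 12.66 / 3.88."""
    avg_bytes = sum(len(w.encode(encoding)) for w in corpus) / len(corpus)
    return bits_per_word / avg_bytes
```

Under these assumptions, a per-word cross entropy of 12.66 bits divided by an average word length of about 3.3 bytes yields roughly 3.88 bits per byte, consistent with the figures quoted above.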