Abstract
Cluster-based n-gram modeling is a variant of standard word-based n-gram modeling that exploits similarities between words. In this paper, we present an empirical study of clustering techniques for Asian language modeling. Clustering is used both to improve the performance (i.e., reduce the perplexity) of language models and to compress them. Experimental results are presented for cluster-based trigram models on a Japanese newspaper corpus and on a heterogeneous Chinese corpus. While most previous research on word clustering has focused on how to obtain the best clusters, we concentrate on the best way to use the clusters. Experimental results show that some of the novel techniques we present work much better than previous methods, achieving a size reduction of more than 40% at the same level of perplexity.
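As background for readers unfamiliar with the approach, the following is a minimal sketch of a common cluster-based trigram factorization and of the perplexity measure used for evaluation; the notation (w_i for words, C(w) for a word's cluster) is standard in the literature and is an assumption here, not necessarily the exact model variant studied in this paper.

% A common cluster-based trigram factorization: each word w is mapped
% to a cluster C(w); the next word is predicted via its cluster history,
% then via the word's probability within its own cluster.
\[
  P(w_i \mid w_{i-2}, w_{i-1}) \approx
  P\bigl(C(w_i) \mid C(w_{i-2}), C(w_{i-1})\bigr)\,
  P\bigl(w_i \mid C(w_i)\bigr)
\]
% Perplexity of a model P on a test corpus w_1 ... w_N (lower is better):
\[
  \mathrm{PP} = P(w_1, \ldots, w_N)^{-1/N}
  = \exp\!\Bigl(-\tfrac{1}{N}\sum_{i=1}^{N} \ln P(w_i \mid w_{i-2}, w_{i-1})\Bigr)
\]

Because cluster n-gram tables are indexed over clusters rather than the full vocabulary, such factorizations can shrink the model substantially, which is the source of the size reduction reported above.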