Abstract
Recently, many new techniques have been proposed for language modeling, such as the maximum entropy (ME) model, the maximum entropy Markov model (MEMM), and the conditional random field (CRF). However, the n-gram model remains a staple in practical applications, so improving its performance is well worth studying. This paper enhances the traditional n-gram model by relaxing the stationarity hypothesis on the Markov chain and exploiting word positional information. It is assumed that the probability of the current word is determined not only by the history words but also by the word's position in the sentence. On this basis, the non-stationary n-gram model (NS n-gram model) is proposed. Several related issues are discussed in detail, including the definition of the NS n-gram model, the representation of word positional information, and the estimation of the conditional probability. In addition, three smoothing approaches are proposed to address the data sparseness problem of the NS n-gram model, and several smoothing algorithms are presented within each approach. In the experiments, the NS n-gram model is evaluated on the pinyin-to-character conversion task, the core technique of Chinese text input methods. Experimental results show that the NS n-gram model significantly outperforms the traditional n-gram model by exploiting word positional information, and that the proposed smoothing techniques effectively alleviate the data sparseness problem, yielding a substantial reduction in error rate.
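For intuition, the relaxation can be sketched as follows, where $w_{i-n+1}^{i-1}$ denotes the $n-1$ history words and $\mathrm{pos}(i)$ is a hypothetical placeholder for whatever positional representation the model adopts (the exact form is defined in the paper):

$$P\bigl(w_i \mid w_1^{i-1}\bigr) \;\approx\; P\bigl(w_i \mid w_{i-n+1}^{i-1},\, \mathrm{pos}(i)\bigr),$$

in contrast to the traditional n-gram approximation $P\bigl(w_i \mid w_1^{i-1}\bigr) \approx P\bigl(w_i \mid w_{i-n+1}^{i-1}\bigr)$, whose conditional probability is independent of where in the sentence the word occurs.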