English Abstract |
Chinese spell check is an important component of many NLP applications, including word processors, search engines, and automatic essay rating. However, compared to spell checkers for alphabetic languages (e.g., English or French), Chinese spell checkers are more difficult to develop, because the Chinese writing system does not mark word boundaries and because errors may be introduced by various Chinese input methods. Chinese spell check involves automatically detecting and correcting typos, which roughly correspond to misspelled words in English. Liu et al. (2011) show that people tend to unintentionally produce typos that sound similar to the intended word (e.g., *措折 [cuo zhe] for 挫折 [cuo zhe]) or look similar to it (e.g., *固難 [gu nan] for 困難 [kun nan]). Methods for spell check can be broadly classified into two types: rule-based methods (Ren et al., 2001; Jiang et al., 2012) and statistical methods (Hung & Wu, 2009; Chen, 2010). Rule-based methods use knowledge resources such as a dictionary to decide whether a word is a typo. Statistical methods tend to use a large monolingual corpus to build a language model that validates correction hypotheses. Consider the sentence “心是很重要的。” [xin shi hen zhong yao de], which is correct. A rule-based model is nevertheless likely to flag “心” and “是” as an error for the word “心事”, which has an identical pronunciation. In statistical methods, “心” and “是” form a bigram with high frequency in a monolingual corpus, so we may determine that “心是” is not a typo after all. In this paper, we propose a model that combines rule-based and statistical approaches. Probable errors, proposed by the rule-based detection module, are verified using a statistical machine translation (SMT) model. Our model treats spell check and correction as a kind of translation, in which typos are translated into correctly spelled words according to the translation probability and the language model probability. |
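The combination of translation probability and language model probability described above can be illustrated with a minimal sketch. It is not the authors' implementation: the confusion sets, probability tables, and function names (`candidates`, `tm_logprob`, `lm_logprob`, `correct`) are hypothetical, the numbers are toy values chosen for illustration, and a character-bigram table stands in for a language model trained on a large monolingual corpus.

```python
import math

# All tables below are hypothetical toy values, not taken from the paper.
# Confusion sets: characters a writer may have intended for the observed one.
CONFUSION = {"措": ["挫"], "固": ["困"], "是": ["事"]}
# Translation model: P(observed character | intended character).
TRANSLATION_PROB = {("措", "挫"): 0.3, ("固", "困"): 0.3, ("是", "事"): 0.3}
KEEP_PROB = 0.9      # probability that a character was typed as intended
# Language model: toy character-bigram probabilities.
BIGRAM_PROB = {"挫折": 0.01, "困難": 0.02, "心是": 0.005, "心事": 0.004}
UNSEEN_PROB = 1e-6   # fallback for bigrams missing from the toy table


def lm_logprob(text):
    """Sum bigram log-probabilities over adjacent character pairs."""
    return sum(
        math.log(BIGRAM_PROB.get(text[i:i + 2], UNSEEN_PROB))
        for i in range(len(text) - 1)
    )


def tm_logprob(observed, candidate):
    """Log-probability of observing `observed` when `candidate` was intended."""
    total = 0.0
    for obs_char, cand_char in zip(observed, candidate):
        if obs_char == cand_char:
            total += math.log(KEEP_PROB)
        else:
            total += math.log(TRANSLATION_PROB[(obs_char, cand_char)])
    return total


def candidates(text):
    """Yield the input itself plus every single-character substitution."""
    yield text
    for i, char in enumerate(text):
        for replacement in CONFUSION.get(char, []):
            yield text[:i] + replacement + text[i + 1:]


def correct(text):
    """Pick the candidate with the best combined translation and LM score."""
    return max(candidates(text), key=lambda c: tm_logprob(text, c) + lm_logprob(c))


if __name__ == "__main__":
    for observed in ["措折", "固難", "心是"]:
        print(observed, "->", correct(observed))
```

With these toy numbers, the two typos are replaced because the corrected bigrams are far more probable under the language model, while “心是” is left unchanged because its bigram score outweighs the substitution to “心事”.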