This paper reveals some important properties of Chinese frequent strings (CFSs)
and their applications in Chinese natural language processing (NLP). We have
previously proposed a method for extracting CFSs that contain unknown words from
a Chinese corpus [Lin and Yu 2001]. We found that CFSs contain many 4-character strings,
3-word strings, and longer n-grams. With a traditional language model (LM), such
information can only be derived from an extremely large corpus. In contrast to
using a traditional LM, we can achieve high precision and efficiency by using
CFSs to solve Chinese toneless phoneme-to-character conversion and to correct
Chinese spelling errors with a small training corpus. An accuracy rate of 92.86%
was achieved for Chinese toneless phoneme-to-character conversion, and an
accuracy rate of 87.32% was achieved for Chinese spelling error correction. We
also attempted to assign syntactic categories to CFSs. The accuracy rate for
assigning syntactic categories to the CFSs was 88.53% for outside testing when the
syntactic categories of the highest level were used.