Chinese abstract |
It is shown that, for a large corpus, Zipf's law does not hold at all ranks for
either words in English or characters in Chinese. The observed frequency falls
below the frequency predicted by Zipf's law for English words at ranks greater
than about 5,000 and for Chinese characters at ranks greater than about 1,000. However,
when single words or characters are combined with n-gram words or
characters in one list and ranked by frequency, the frequency of tokens in the
combined list approximately follows Zipf's law, with a slope close to -1 on a log-log
plot for all n-grams, down to the lowest frequencies in both languages. This
behaviour is also found for English 2-byte and 3-byte word fragments. It occurs
only when all n-grams are included, even semantically incomplete ones.
Previous theories do not predict this behaviour, possibly because conditional
probabilities of tokens have not been properly represented. |
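
The combined-list procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy sentence, the choice of n = 1 to 3, and the use of an ordinary least-squares fit to estimate the log-log slope are all assumptions made for illustration. Every contiguous n-gram is counted, including semantically incomplete ones, and all counts go into one shared table before ranking.

```python
from collections import Counter
import math


def combined_ngram_counts(tokens, max_n):
    """Count every contiguous n-gram (n = 1..max_n) in one shared table.

    No n-gram is filtered out, so semantically incomplete ones are kept.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def rank_frequency(counts):
    """Return frequencies in descending order; rank r is at index r - 1."""
    return sorted(counts.values(), reverse=True)


def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    This fitting method is an assumption of the sketch, not taken from the text.
    """
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy input (hypothetical); a real test needs a large corpus.
tokens = "the cat sat on the mat and the cat ran".split()
counts = combined_ngram_counts(tokens, max_n=3)
freqs = rank_frequency(counts)
slope = loglog_slope(freqs)
```

For Chinese text, `tokens` would be a list of characters rather than words. A slope near -1 is only meaningful on a large corpus; the toy sentence here just exercises the code path.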