Sentence tokenization is the process of mapping sentences from character strings into strings of tokens. This paper sets out to study longest tokenization, a rich family of tokenization strategies that follows the general principle of maximum tokenization. The objectives are to deepen the knowledge and understanding of the principle of maximum tokenization in general, and to establish the notion of longest tokenization in particular. The main results are as follows: (1) Longest tokenization, which takes a token n-gram as its tokenization object and seeks to maximize the object length in characters, is a natural generalization of the Chen and Liu Heuristic in the family of maximum tokenizations. (2) Longest tokenization is a rich family of distinct tokenization strategies whose members include many widely used maximum tokenization strategies, such as forward maximum tokenization, backward maximum tokenization, forward-backward maximum tokenization, and shortest tokenization. (3) Theoretically, longest tokenization is a proper subclass of critical tokenization, since the essence of maximum tokenization is fully captured by the latter. (4) Practically, longest tokenization behaves the same as shortest tokenization, since the essence of length-oriented maximum tokenization is captured by the latter. These results are obtained through both mathematical examination and corpus investigation.
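To make the contrast among these strategies concrete, the following is a minimal sketch, not taken from the paper, of two members of the maximum tokenization family over a hypothetical toy lexicon: forward maximum tokenization (greedy longest-prefix matching) and shortest tokenization (a tokenization with the fewest tokens, found by dynamic programming). The lexicon, the input string, and all function names are illustrative assumptions.

```python
# Illustrative sketch (assumed lexicon and input, not from the paper):
# two length-oriented members of the maximum tokenization family.

from functools import lru_cache

LEXICON = {"a", "ab", "abc", "b", "bc", "c", "cd", "d"}
MAX_TOKEN_LEN = max(map(len, LEXICON))


def forward_maximum_tokenization(s: str) -> list[str]:
    """Greedily take the longest lexicon word at each position."""
    tokens, i = [], 0
    while i < len(s):
        for j in range(min(len(s), i + MAX_TOKEN_LEN), i, -1):
            if s[i:j] in LEXICON:
                tokens.append(s[i:j])
                i = j
                break
        else:  # no lexicon word starts here; fall back to one character
            tokens.append(s[i])
            i += 1
    return tokens


def shortest_tokenization(s: str) -> list[str]:
    """A tokenization with the fewest tokens, via memoized recursion."""

    @lru_cache(maxsize=None)
    def best(i: int) -> list[str] | None:
        if i == len(s):
            return []
        candidates = []
        for j in range(i + 1, min(len(s), i + MAX_TOKEN_LEN) + 1):
            if s[i:j] in LEXICON:
                rest = best(j)
                if rest is not None:
                    candidates.append([s[i:j]] + rest)
        return min(candidates, key=len) if candidates else None

    return best(0) or list(s)


print(forward_maximum_tokenization("abcd"))  # ['abc', 'd']
print(shortest_tokenization("abcd"))         # ['ab', 'cd'] (ties possible)
```

On this toy input both strategies return two tokens, though the tokens themselves differ; this mirrors, in miniature, the paper's finding that length-oriented maximum tokenization strategies largely coincide in practice.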