Sentence tokenization is the process of mapping sentences from character strings into strings of tokens. This paper sets out to study longest tokenization, a rich family of tokenization strategies that follows the general principle of maximum tokenization. The objectives are to deepen the knowledge and understanding of the principle of maximum tokenization in general, and to establish the notion of longest tokenization in particular. The main results are as follows: (1) Longest tokenization, which takes a token n-gram as its tokenization object and seeks to maximize the object length in characters, is a natural generalization of the Chen and Liu Heuristic in the family of maximum tokenizations. (2) Longest tokenization is a rich family of distinct tokenization strategies whose members include many widely used maximum tokenization strategies, such as forward maximum tokenization, backward maximum tokenization, forward-backward maximum tokenization, and shortest tokenization. (3) Theoretically, longest tokenization is a proper subclass of critical tokenization, since the essence of maximum tokenization is fully captured by the latter. (4) Practically, longest tokenization behaves the same as shortest tokenization, since the essence of length-oriented maximum tokenization is captured by the latter. These results are obtained through both mathematical examination and corpus investigation.
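To make the contrast among these strategies concrete, the following is a minimal sketch, not taken from the paper, of two members of the maximum tokenization family over a hypothetical toy lexicon: forward maximum tokenization (greedy longest-prefix matching) and shortest tokenization (a tokenization with the fewest tokens, found by dynamic programming). The lexicon, the input string, and all function names are illustrative assumptions.

```python
# Illustrative sketch (assumed lexicon and input, not from the paper):
# two length-oriented members of the maximum tokenization family.

from functools import lru_cache

LEXICON = {"a", "ab", "abc", "b", "bc", "c", "cd", "d"}
MAX_TOKEN_LEN = max(map(len, LEXICON))


def forward_maximum_tokenization(s: str) -> list[str]:
    """Greedily take the longest lexicon word at each position."""
    tokens, i = [], 0
    while i < len(s):
        for j in range(min(len(s), i + MAX_TOKEN_LEN), i, -1):
            if s[i:j] in LEXICON:
                tokens.append(s[i:j])
                i = j
                break
        else:  # no lexicon word starts here; fall back to one character
            tokens.append(s[i])
            i += 1
    return tokens


def shortest_tokenization(s: str) -> list[str]:
    """A tokenization with the fewest tokens, via memoized recursion."""

    @lru_cache(maxsize=None)
    def best(i: int) -> list[str] | None:
        if i == len(s):
            return []
        candidates = []
        for j in range(i + 1, min(len(s), i + MAX_TOKEN_LEN) + 1):
            if s[i:j] in LEXICON:
                rest = best(j)
                if rest is not None:
                    candidates.append([s[i:j]] + rest)
        return min(candidates, key=len) if candidates else None

    return best(0) or list(s)


print(forward_maximum_tokenization("abcd"))  # ['abc', 'd']
print(shortest_tokenization("abcd"))         # ['ab', 'cd'] (ties possible)
```

On this toy input both strategies return two tokens, though the tokens themselves differ; this mirrors, in miniature, the paper's finding that length-oriented maximum tokenization strategies largely coincide in practice.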