Electronically available multilingual information can be divided into two major
categories: (1) alphabetic language information (English-like alphabetic languages)
and (2) ideographic language information (Chinese-like ideographic languages).
The information available in non-English alphabetic languages as well as in
ideographic languages (especially Japanese and Chinese) has been growing
rapidly in recent years. Because of the ideographic nature of Japanese and
Chinese, compounded by the several encoding standards in use,
efficient processing (representation, indexing, retrieval, etc.) of such information
has become a tedious task. In this paper, we propose a Han Character (Kanji) oriented
Interlingua model of indexing and retrieving Japanese and Chinese information.
We report the results of mono- and cross-language information retrieval on a
Kanji space where documents and queries are represented as Kanji-oriented
vectors. We also employ a dimensionality reduction technique to compute
a Kanji Conceptual Space (KCS) from the initial Kanji space, which can facilitate
conceptual retrieval of both mono- and cross-language information for these
languages. Similar indexing approaches for multiple European languages, through
term association (e.g., latent semantic indexing) or through conceptual mapping
(using a lexical ontology such as WordNet), are being intensively explored. The
Interlingua approach investigated here for Japanese and Chinese and
the term (or concept) association model investigated for European languages
are similar, and the two approaches can be easily integrated. Therefore, the proposed
Interlingua model can pave the way for handling multilingual information access
and retrieval efficiently and uniformly.
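
As a rough illustration of the Kanji space idea described above, the following sketch represents each document and query as a vector of Han character counts and ranks documents by cosine similarity. It is not the model evaluated in the paper: the raw count weighting, the restriction to the basic CJK Unified Ideographs block, the toy documents, and the names `is_han`, `kanji_vector`, `cosine`, and `rank` are all assumptions made for illustration, and the dimensionality reduction step that yields the KCS is omitted.

```python
import math
from collections import Counter


def is_han(ch):
    # Assumption: only the basic CJK Unified Ideographs block (U+4E00..U+9FFF)
    # counts as a Han character; kana, Latin letters, and punctuation are dropped.
    return "\u4e00" <= ch <= "\u9fff"


def kanji_vector(text):
    # Sparse vector: Han character -> raw count (assumed weighting scheme).
    return Counter(ch for ch in text if is_han(ch))


def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    # Score every document against the query in the shared Kanji space.
    q = kanji_vector(query)
    scored = [(doc_id, cosine(q, kanji_vector(text))) for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy collection mixing Japanese and (traditional) Chinese text.
DOCS = {
    "ja_research": "東京大学で情報検索を研究する",
    "zh_research": "大學研究資訊檢索技術",
    "ja_weather": "今日は天気が良い",
    "zh_weather": "今天天氣很好",
}

if __name__ == "__main__":
    for doc_id, score in rank("研究", DOCS):
        print(f"{doc_id}\t{score:.3f}")
```

Because the query and both research documents share the characters 研 and 究, a single query retrieves the Japanese and the Chinese document without translation. Characters whose forms differ between the two writing systems (for example 検 and 檢) do not match in this raw space, which is one motivation for an association-based reduced space such as the KCS.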