Abstract |
This paper describes our initial attempt to design and develop a bilingual reading
comprehension corpus (BRCC). Reading comprehension (RC) is a task that conventionally evaluates the
reading ability of an individual. An RC system can automatically analyze a passage
of natural language text and generate an answer for each question based on
information in the passage. The RC task can be used to drive advances in the
natural language processing (NLP) technologies embedded in automatic RC systems.
Furthermore, an RC system presents a novel paradigm of information search, when
compared to the predominant paradigm of text retrieval in search engines on the
Web. Previous work on automatic RC typically involved English-only language
learning materials (Remedia and CBC4Kids) designed for children/students, which
included stories, human-authored questions, and answer keys. These corpora are
important for supporting empirical evaluation of RC performance. In the present
work, we attempted to utilize RC as a driver for NLP techniques in both English
and Chinese. We sought parallel English and Chinese learning materials and
incorporated annotations deemed relevant to the RC task. We measured the
comparative levels of difficulty among the three corpora by means of the baseline
bag-of-words (BOW) approach. Our results show that the BOW approach achieves
better RC performance on BRCC (67%) than on Remedia (29%) and
CBC4Kids (63%). This reveals that BRCC has the highest degree of word overlap
between questions and passages among the three corpora, which artificially
simplifies the RC task. This result suggests that additional effort should be devoted
to authoring questions with varying grades of difficulty in order for BRCC to
better support RC research across the English and Chinese languages. |
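
The BOW baseline mentioned above can be illustrated with a minimal sketch. It assumes the common formulation in which the answer is the passage sentence that shares the most distinct words with the question; the paper's exact preprocessing (stemming, stop-word removal, tie-breaking) is not stated in this abstract, so those details are assumptions. The function names `tokenize` and `bow_answer` are hypothetical, and the tokenizer handles English only; Chinese text would first need word segmentation.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of distinct word tokens.

    Assumption: no stemming or stop-word removal is applied.
    """
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def bow_answer(question, sentences):
    """Return the sentence with the largest word overlap with the question.

    Ties are broken in favor of the earliest sentence (an assumption).
    """
    q_words = tokenize(question)
    return max(sentences, key=lambda s: len(q_words & tokenize(s)))


# Toy passage, invented for illustration only (not taken from any corpus).
passage = [
    "Tom went to the market on Monday.",
    "He bought three apples.",
    "Then he walked home.",
]

print(bow_answer("Where did Tom go on Monday?", passage))
```

Because the score depends only on shared words, a corpus whose questions closely echo the wording of the passage is easier for this baseline, which is consistent with the interpretation of the BRCC result given above.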