English Abstract |
One of the simplest and most accurate Chinese word segmentation techniques is maximal matching. However, its performance depends on the coverage of its word list, which is usually derived from a general dictionary. When maximal matching is applied directly to technical articles instead of general news articles, the error rate rises significantly, from 1.2% (as reported in the literature) to 15%. This is an important problem in two respects. First, domain-specific terms are usually not readily available in electronic form. These terms have to be entered manually by experts, or they can be detected automatically from thematic corpora. Second, if corpus analysis is used to supply information for the design and development of text processing systems, that analysis depends on correct word segmentation of the corpora of technical articles. In this paper, we propose combining the maximal-matching and bigram techniques in Chinese word segmentation to detect words in thematic corpora, so that each technique compensates for the other's shortcomings. The Hong Kong Basic Law was selected as a representative technical article for evaluation because it contains a fair number of technical terms, compound nouns and names. The segmentation performance of the maximal-matching, bigram and combined techniques is compared. The combined technique achieved a 33% improvement in segmentation performance and identified 33% of the terms in the Basic Law. |
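
To illustrate why maximal matching depends on word-list coverage, here is a minimal sketch of greedy forward maximal matching. It is not the paper's implementation: the function name `max_match`, the `max_len` parameter, and the tiny word list are all assumptions made for illustration.

```python
def max_match(text, lexicon, max_len=4):
    """Greedy forward maximal matching (illustrative sketch only)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical word list; a real system would load a full dictionary.
lexicon = {"香港", "基本法", "特別", "行政區"}
print(max_match("香港特別行政區基本法", lexicon))
```

If the domain term 基本法 is removed from the word list, the same function falls back to single characters for that span, which is the kind of coverage failure the abstract describes for technical text.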