English Abstract
With the evolution of human life and the accelerating spread of information, new things and concepts emerge quickly, and new words appear every day. It is therefore important for natural language processing systems to identify new words. This paper adopts a Chinese word extraction scheme based on machine learning approaches that combine various statistical features. However, because natural language applications span a broad range of domains, a mismatch in statistical characteristics between the training and testing domains is quite likely, and such a mismatch inevitably degrades word extraction performance. This paper proposes a scheme that applies histogram equalization to feature normalization in statistical approaches. Through this scheme, the mismatch between the feature distributions of the training set and the testing set, whether of different sizes or from different domains, can be compensated, making statistical approaches to unknown word extraction more robust in novel domains. The scheme was tested on the corpora provided by SIGHAN2. The best results, F-measures of 68.43% and 71.40% on the CKIP corpus and the HKCU corpus respectively, were achieved with four features under normalization and histogram equalization. When applied to unknown word extraction in a novel domain, the scheme proves capable of identifying new words such as "Cape No. 7" (海角七號) and "Financial Tsunami" (金融海嘯), which are difficult to extract with approaches based on semantic characteristics. The scheme is less effective at extracting new terms whose semantic structures are prominent, such as the names of people, places, and organizations. A comparison with the unknown word extraction results of two Chinese word segmentation systems shows that the scheme is complementary to other approaches, and that combining approaches with different capabilities is promising.
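To make the normalization step concrete, the sketch below shows one common form of histogram equalization for features: each test-set value is passed through the test set's empirical CDF and then through the training set's quantile function, so that the test-domain distribution is mapped onto the training-domain one. This is a minimal sketch under that assumption; the function name, the synthetic data, and the use of pointwise-mutual-information scores as the example feature are illustrative and not taken from the paper.

```python
import numpy as np

def histogram_equalize(train_feature: np.ndarray,
                       test_feature: np.ndarray) -> np.ndarray:
    """Map test-set feature values onto the training-set distribution."""
    # Empirical CDF of each test value within the test set, in (0, 1].
    ranks = np.searchsorted(np.sort(test_feature), test_feature, side="right")
    cdf = ranks / len(test_feature)
    # Inverse CDF of the training set: the training quantile at each CDF value.
    return np.quantile(train_feature, cdf)

# Hypothetical example: a statistical feature (here, PMI-like scores) whose
# distribution differs between the training domain and a novel test domain.
rng = np.random.default_rng(0)
train_pmi = rng.normal(loc=2.0, scale=1.0, size=10_000)  # training domain
test_pmi = rng.normal(loc=3.5, scale=2.0, size=5_000)    # novel domain
normalized = histogram_equalize(train_pmi, test_pmi)
# After equalization the test features follow the training statistics
# (mean ≈ 2.0, std ≈ 1.0), so a classifier trained on the original domain
# sees feature values in the range it was trained on.
print(normalized.mean(), normalized.std())
```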