English Abstract |
In machine translation systems, a machine-translated manual is usually processed concurrently by several posteditors, so maintaining consistent terminology across posteditors is very important. If all the terminology used in the manual can be entered into the dictionary before machine translation, consistency is maintained automatically, which is a major advantage of machine translation over human translation. However, since new compounds are created every day, a dictionary prepared long ago cannot list them exhaustively. To ensure that subsequent parsing and translation are correct, new compounds must be extracted from the text and entered into the dictionary every time a new manual is to be translated. Having a human inspect the entire text for compounds is too costly and time-consuming, so automatic compound extraction from the manual is an important problem. Traditional systems encode sets of rules to extract compounds from the corpus. The problem with the rule-based approach is that it assigns no preferences to the candidates, so not every compound obtained is desirable, and it is unclear whether one candidate is more likely to be a compound than another. The human effort required remains high because the lexicographer has to search the entire candidate list to find the preferred compounds. This paper therefore proposes a new method that extracts compounds automatically using mutual information and relative frequency count as features. The method tests every n-gram (n = 2 or 3 in this paper) formed in the manual against these features to decide whether it is a compound. The n-grams that pass the test are then listed in order of significance for lexicographers to enter into the dictionary. A significant reduction in postediting time has been observed in our test. |
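
As a rough illustration of the kind of test the abstract describes, the sketch below scores adjacent word pairs (n = 2) by pointwise mutual information and a simple relative frequency, then lists the survivors in descending order of score. It is not the paper's implementation: the function name `rank_bigram_candidates`, the thresholds `min_count` and `min_mi`, the definition of relative frequency as the bigram count divided by the total number of bigrams, and the sort order are all assumptions made for this example.

```python
import math
from collections import Counter


def rank_bigram_candidates(tokens, min_count=2, min_mi=0.0):
    """Rank adjacent word pairs as compound candidates.

    Hypothetical sketch: thresholds and the relative-frequency
    definition are assumptions, not values taken from the paper.
    """
    n = len(tokens)
    if n < 2:
        return []

    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total_bigrams = n - 1

    ranked = []
    for (w1, w2), count in bigrams.items():
        # Skip pairs that are too rare to give a reliable estimate.
        if count < min_count:
            continue
        p_xy = count / total_bigrams
        p_x = unigrams[w1] / n
        p_y = unigrams[w2] / n
        # Pointwise mutual information: log2 of P(x, y) / (P(x) * P(y)).
        mi = math.log2(p_xy / (p_x * p_y))
        # Assumed definition: share of all bigrams taken by this pair.
        rel_freq = count / total_bigrams
        if mi >= min_mi:
            ranked.append(((w1, w2), mi, rel_freq))

    # Highest mutual information first, relative frequency as tie-breaker.
    ranked.sort(key=lambda item: (item[1], item[2]), reverse=True)
    return ranked


tokens = "hard disk is a hard disk in a box".split()
candidates = rank_bigram_candidates(tokens)
for pair, mi, rel_freq in candidates:
    print(" ".join(pair), round(mi, 2), round(rel_freq, 2))
```

Extending the sketch to trigrams would require a three-way association measure; the abstract does not say which one the paper uses, so that case is left out here.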