English Abstract |
The digitization of Taiwan Hakka data is greatly complicated by the many rare characters, missing characters, and character variants found in Taiwan Hakka texts, and is further hindered by inconsistencies between the writing practices of non-governmental Hakka dictionaries and governmental standards for the Hakka writing system. This study describes how the Taiwan Hakka Corpus Project carried out character correction to ensure the Corpus’s usefulness and robustness. First, we demonstrate the types of character correction performed in our text cleaning process: converting Hakka spellings into characters, unifying different forms of the same word, deleting redundant or repeated characters, filling in missing characters, swapping reversed characters, and correcting characters that are similar in shape but different in meaning. Second, we investigate situations in which rare characters cannot be displayed properly and provide a solution for each. These are situations in which rare characters in Hakka texts have been replaced by (1) Hakka spellings, (2) phonetic or semantic loan characters, (3) unintended glyphs such as squares or symbols (i.e., missing characters), or (4) decomposed character components. Finally, we address issues related to multiple codes for the same character and to character variants in Hakka texts. |