| English Abstract |
In this research, we used the training data from the ROCLING 2023 shared task, Chinese Multi-genre Named Entity Recognition in the Healthcare Domain, which comprises the Chinese HealthNER Corpus (Lee and Lu, 2021) and the ROCLING 2022 CHNER Dataset (Lee et al., 2022), together with the test set (Lee et al., 2023). The objective was to address named entity recognition in the Chinese healthcare domain. We first preprocessed the training set. In it, we found sentences with identical structural patterns whose named entity annotations were ambiguous or erroneous. Prioritizing data validation, we manually excluded the erroneous entries. In specialized domains such as medicine, domain-specific terminology and proprietary names are often annotated within a sentence as a single merged label rather than as separate labels. We therefore applied the 'Entity Relationship Construction and Merging Strategies' approach to consolidate related named entities. Next, we computed the occurrence frequencies of sentences and entities. We then extracted the sparsely labeled data and applied two data augmentation techniques: GPT-based paraphrasing, and entity replacement that preserves sentence structure. These steps produced an augmented training set. Finally, we fine-tuned several state-of-the-art BERT-based models to obtain a model suitable for the ROCLING shared task. |