Abstract
Access to sufficient, high-quality data is essential for effectively training and validating machine learning classifiers. This study investigates class mapping as a data fusion strategy for enriching training data for research classification. Two versions of the Australian and New Zealand Standard Research Classification, ANZSRC 2008 FoR and ANZSRC 2020 FoR, are used to organize 179,431 documents from eight institutional repositories into plain and mapped datasets. Each dataset is divided into subsets corresponding to the division, group, and field levels of the classification schemes. Between 49% and 63% of documents are successfully mapped between the two schemes. Classifiers based on Support Vector Machines (SVM), SciBERT, ModernBERT-base, and ModernBERT-large are trained to assess the effect of this data fusion approach on classification performance. All models improve at all three levels. ModernBERT-large achieves the largest gains, with validation F1 score improvements of 1.0% and 2.5% at the division level, 4.4% and 2.2% at the group level, and 9.9% and 11.5% at the field level. An emergent ability is also observed: performance on non-augmented classes improves with ModernBERT-large but not with ModernBERT-base. Overall, this study demonstrates that class mapping effectively enriches training datasets and enhances classification performance, and it underscores the importance of model size and architecture. These findings offer a practical, scalable strategy for improving machine learning performance in research classification tasks.