| English abstract |
Previous work has demonstrated that multilingual fine-tuning of a pretrained multilingual speech representation model can improve speech recognition accuracy when extremely little target language data is available. In this paper we show that fine-tuning on labeled speech data from multiple languages that share common phonological traits, preprocessed by attaching a language identifier to each speech sample, is competitive with monolingual fine-tuning even when a moderate amount of target language data is available. To further improve the performance of our system, we apply self-training using unlabeled speech data. Our results indicate that fine-tuning a speech recognition model jointly on a combination of multilingual data and pseudo-labeled data outperforms using either of the two augmentation techniques alone. We also find that models fine-tuned on multilingual data with language identifiers produce better results even when explicit information about language identity is not provided at inference time. |