Cross-Era Learning Ability of Pre-trained Language Models on Chinese

Chin-Tung Lin; Wei-Yun Ma

Title	Cross-Era Learning Ability of Pre-trained Language Models on Chinese
Parallel title	HanTrans: An Empirical Study on Cross-Era Transferability of Chinese Pre-trained Language Model
Authors	Chin-Tung Lin, Wei-Yun Ma
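Chinese abstract	In recent years, pre-trained language models have become the prevailing approach in natural language processing. BERT (Bidirectional Encoder Representations from Transformers) is the representative example: its masked language modeling (MLM) objective is widely used to pre-train large language models, so that models perform well on downstream tasks after fine-tuning. In the pre-training corpora, however, Traditional Chinese makes up only a small share compared with Simplified Chinese, and text in ancient Chinese (Old, Middle, Early Modern, and so on) is especially scarce. As a result, no suitable large pre-trained model has been available for natural language processing of ancient Chinese. We therefore train and release a family of BERT models built specifically for ancient Chinese. Compared with the original Chinese BERT models, our pre-trained language models lower the perplexity on ancient Chinese. We also develop word segmentation and part-of-speech tagging models for different eras and examine how well they transfer to corpora from other eras. Finally, we reduce the dimensionality of the personal-pronoun word embeddings from the models of different eras and observe how they vary across eras. Our code is released at https://github.com/ckiplab/han-transformers.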
English abstract	Pre-trained language models have recently dominated most downstream tasks in NLP. In particular, Bidirectional Encoder Representations from Transformers (BERT) is the most iconic pre-trained language model, and its masked language modeling (MLM) objective is an indispensable part of existing pre-trained language models. Models that perform well on downstream tasks benefit directly from the large training corpus used in the pre-training stage. However, the share of modern Traditional Chinese in these training corpora is small, and ancient Chinese is still missing from the pre-training stage. We address this problem by transforming annotated ancient Chinese data into a BERT-style training corpus, and we release a pre-trained Oldhan Chinese BERT model for the NLP community. Our model outperforms the original BERT model by significantly reducing perplexity in masked language modeling (MLM), and our fine-tuned models improve F1 scores on word segmentation and part-of-speech tagging. We then comprehensively study the zero-shot cross-era ability of the BERT model. Finally, we visualize and investigate personal pronouns in the embedding space of ancient Chinese records from four eras. We have released our code at https://github.com/ckiplab/han-transformers.
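Pages	164-173
Keywords	Han Chinese, pre-trained language models, zero-shot cross-era learning, Chinese Language Model, Zero-shot Cross-Era Transfer Learning
Journal	ROCLING Proceedings (ROCLING論文集)
Issue	December 2022 (2022 issue)
Publisher	The Association for Computational Linguistics and Chinese Language Processing (中華民國計算語言學學會)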
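
The abstract compares models by their perplexity under masked language modeling. The sketch below shows one common way such a score can be defined (often called pseudo-perplexity): mask each token in turn, take the probability the model assigns to the original token, and exponentiate the negative mean log-probability. This record does not state the paper's exact procedure, so treat this as an illustration only. The scorer here is a hypothetical smoothed unigram counter standing in for a real masked language model, and the two tiny corpora are made up for the example.

```python
import math
from collections import Counter

def pseudo_perplexity(tokens, masked_prob):
    """exp(-mean log p(token_i | sentence with position i masked))."""
    if not tokens:
        raise ValueError("need at least one token")
    total_log_prob = 0.0
    for i, gold in enumerate(tokens):
        masked = list(tokens)
        masked[i] = "[MASK]"  # hide one position at a time
        total_log_prob += math.log(masked_prob(masked, i, gold))
    return math.exp(-total_log_prob / len(tokens))

def make_unigram_scorer(corpus, vocab):
    """Hypothetical stand-in for a masked LM: add-one smoothed character
    frequencies. It ignores the context; a real model would use it."""
    counts = Counter(ch for sentence in corpus for ch in sentence)
    total = sum(counts.values())
    def score(masked_tokens, position, gold):
        return (counts[gold] + 1) / (total + len(vocab))
    return score

# Made-up toy data: one "classical" and one "modern" training corpus.
classical_corpus = ["學而時習之", "不亦說乎"]
modern_corpus = ["我們今天去學校", "你們在做什麼"]
vocab = set("".join(classical_corpus + modern_corpus))

classical_scorer = make_unigram_scorer(classical_corpus, vocab)
modern_scorer = make_unigram_scorer(modern_corpus, vocab)

test_sentence = list("學而時習之")
ppl_classical = pseudo_perplexity(test_sentence, classical_scorer)
ppl_modern = pseudo_perplexity(test_sentence, modern_scorer)
print(f"classical scorer: {ppl_classical:.2f}")
print(f"modern scorer:    {ppl_modern:.2f}")
```

In this toy setup the scorer built from classical text gives the lower perplexity on the classical sentence, which is the kind of gap the abstract reports between an era-specific model and a general Chinese model. To score a real model, `masked_prob` would be replaced by a function that queries that model for the probability of the original token at the masked position.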