Beyond Rheumatology: GPT-4-turbo’s Superior Performance in Medical Examinations
Author: 連育誠
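Objectives: The potential applications of large language models (LLMs) are being progressively explored and evaluated across professional fields. This study examines how well Generative Pre-trained Transformer 3.5-turbo (GPT-3.5-turbo), Generative Pre-trained Transformer 4 (GPT-4), and Generative Pre-trained Transformer 4-turbo (GPT-4-turbo) handle questions from the Taiwan Internal Medicine Board Examination, in order to assess the suitability of these models for the medical domain; the preliminary investigation focuses on rheumatology.
Methods: The study first analyzed 73 rheumatology questions from the internal medicine board examinations of five consecutive years (2018 to 2022) to assess preliminary performance. It then analyzed 146 questions from the 2022 internal medicine board examination to validate the evaluation of the LLMs. Performance was measured with two approaches, direct question answering in Chinese and zero-shot Chain-of-Thought (CoT) reasoning, and the effect of each approach was evaluated separately.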
Objectives: Large Language Models (LLMs) are increasingly being evaluated for their potential use in specialized domains. This study investigates the abilities of Generative Pre-trained Transformer 3.5-turbo (GPT-3.5-turbo), Generative Pre-trained Transformer 4 (GPT-4), and Generative Pre-trained Transformer 4-turbo (GPT-4-turbo) within the medical field by testing their performance on questions from the Taiwan Internal Medicine Board Examination, with an initial focus on rheumatology.
Methods: The study evaluated baseline performance by analyzing 73 rheumatology questions taken from five consecutive examination years (2018-2022). This evaluation was then broadened to include a larger set of 146 internal medicine questions from the year 2022 to generalize the findings. Performance was assessed using direct queries in Chinese and the application of zero-shot Chain-of-Thought (CoT) reasoning.
Results: On the rheumatology questions, CoT reasoning did not significantly improve GPT-3.5-turbo's performance, which remained at an average score of around 62. In contrast, the GPT-4 variants excelled: GPT-4 with direct queries and GPT-4-turbo with CoT both achieved an outstanding average score of 96.5. When the evaluation was broadened to questions across the subspecialties of internal medicine, GPT-4-turbo notably showed significantly better performance with the CoT method.
Conclusions: The study highlights the superior performance of GPT-4 models in interpreting and responding to medical examination questions. It specifically underscores the potential of GPT-4-turbo, in conjunction with CoT reasoning, to optimize the utilization of LLMs in rheumatology and potentially other medical domains, indicating its robust capability in meeting the linguistic and conceptual challenges presented in medical examinations.
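The two prompting conditions named in the Methods, direct queries and zero-shot CoT, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the trigger phrase, the answer format, the helper names (`build_prompt`, `extract_choice`, `score`), and the reading of a "score" as the percentage of correct answers are not taken from the study, which used Chinese prompts whose wording is not given here. No model is called; the replies are hand-written stand-ins.

```python
import re

# Hypothetical zero-shot CoT trigger; the study's actual Chinese wording is not given.
COT_TRIGGER = "Let's think step by step."


def build_prompt(question, options, use_cot=False):
    """Format one multiple-choice question as a direct or zero-shot CoT prompt."""
    lines = [question]
    lines += [f"({label}) {text}" for label, text in sorted(options.items())]
    if use_cot:
        # Zero-shot CoT: no worked examples, only a cue to reason before answering.
        lines.append(COT_TRIGGER)
        lines.append("Finish with a line of the form 'Answer: X'.")
    else:
        lines.append("Reply with the letter of the correct option only.")
    return "\n".join(lines)


def extract_choice(reply):
    """Return the chosen option letter (A-E) from a model reply, or None."""
    tagged = re.findall(r"answer:\s*\(?([A-E])(?![a-z])", reply, flags=re.IGNORECASE)
    if tagged:
        return tagged[-1].upper()  # last tagged answer wins
    bare = re.fullmatch(r"\s*\(?([A-E])\)?\.?\s*", reply, flags=re.IGNORECASE)
    return bare.group(1).upper() if bare else None


def score(replies, answer_key):
    """Percentage of replies whose extracted choice matches the answer key."""
    correct = sum(extract_choice(r) == k for r, k in zip(replies, answer_key))
    return 100.0 * correct / len(answer_key)


# Hand-written stand-in replies, not real model output.
replies = ["B", "Reasoning...\nAnswer: C", "Answer: (A)", "not sure"]
answer_key = ["B", "C", "D", "A"]
print(score(replies, answer_key))  # 2 of 4 correct -> 50.0
```

Keeping prompt construction separate from answer extraction lets the same question set be scored under both conditions with only the `use_cot` flag changed.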
Pages: 25-39
Keywords: Large Language Models (LLM); Generative Pre-trained Transformer (GPT); Chain of thought reasoning; Medical license examination (MLE); Medical board review questions
Journal: 中華民國風濕病雜誌
Issue: June 2024 (Vol. 38, No. 1)
Publisher: 中華民國風濕病醫學會
