Chinese Journal of Medical Education ›› 2026, Vol. 46 ›› Issue (4): 315-320. DOI: 10.3760/cma.j.cn115259-20250729-00851

• Medical Education Evaluation •

Evaluating the capability of large language models in answering questions on radiologic contrast agents

Ma Xiaowen1, Zuo Xu2, Gong Jing1, Gu Yajia1   

  1. Department of Radiology, Fudan University Shanghai Cancer Center & Department of Oncology, Shanghai Medical College, Fudan University, Shanghai 200032, China;
  2. Global Medical Services (China), Pharmaceutical Diagnostics, GE Healthcare, Shanghai 200020, China
  • Received: 2025-07-29 Online: 2026-04-01 Published: 2026-03-27
  • Contact: Gu Yajia, Email: cjr.guyajia@vip.163.com

Abstract: Objective To evaluate the capability of large language models (LLMs) in answering questions related to radiologic contrast agents. Methods In March 2025, this study employed a multifactorial repeated-measures design to develop a test question bank and evaluation system for radiologic contrast agents. DeepSeek-R1, DeepSeek-V3, GPT-4, Phi-4, and Llama-3.3 were selected to answer the test items, and performance before and after integration with a contrast agent knowledge base was compared using repeated-measures analysis of variance (ANOVA). Results Before integration with the contrast agent knowledge base, the scores of the five models were DeepSeek-R1 (78.94±3.96), DeepSeek-V3 (76.11±3.31), GPT-4 (75.92±2.02), Phi-4 (55.78±2.18), and Llama-3.3 (66.58±4.04), with statistically significant differences among models (P<0.05). After integration, the scores were DeepSeek-R1 (75.89±2.65), DeepSeek-V3 (79.64±1.97), GPT-4 (77.97±2.19), Phi-4 (73.78±3.49), and Llama-3.3 (80.22±2.71), again with statistically significant differences (P<0.05). On the objective multiple-choice questions, DeepSeek-R1 achieved a perfect score without the knowledge base, while Llama-3.3 attained a perfect score after integration. On the subjective questions, DeepSeek-V3 and GPT-4 scored above 36 without the knowledge base, whereas DeepSeek-R1, DeepSeek-V3, and Llama-3.3 all exceeded 36 after integration. Conclusions All five LLMs demonstrated the ability to answer basic questions on radiologic contrast agents, and the contrast agent knowledge base had a notable impact on their performance. DeepSeek-R1 performed best without the knowledge base, while Llama-3.3 showed the greatest improvement after integration.
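The repeated-measures ANOVA named in the Methods treats each test item as a repeated "subject" scored under every model, so item-to-item difficulty is removed from the error term before the model effect is tested. A minimal sketch of that F-statistic computation, using made-up placeholder scores and model names (not the study's data or code):

```python
# Illustrative only: one-way repeated-measures ANOVA F statistic,
# the test family used to compare LLM scores on a shared item set.
# All scores below are hypothetical placeholders.

def rm_anova_f(conditions):
    """conditions: dict mapping condition name (model) -> list of scores,
    one per repeated 'subject' (test item, same items for every model).
    Returns (F, df_treatment, df_error)."""
    names = list(conditions)
    k = len(names)                     # number of conditions (models)
    n = len(conditions[names[0]])      # number of subjects (items)
    scores = [x for m in names for x in conditions[m]]
    gm = sum(scores) / (k * n)         # grand mean
    ss_total = sum((x - gm) ** 2 for x in scores)
    # between-conditions (model) sum of squares
    ss_treat = n * sum((sum(conditions[m]) / n - gm) ** 2 for m in names)
    # between-subjects (item) sum of squares, removed from the error term
    ss_subj = k * sum(
        (sum(conditions[m][i] for m in names) / k - gm) ** 2 for i in range(n)
    )
    ss_error = ss_total - ss_treat - ss_subj
    df1, df2 = k - 1, (k - 1) * (n - 1)
    return (ss_treat / df1) / (ss_error / df2), df1, df2

# Hypothetical per-item scores for two models on the same four items:
f, df1, df2 = rm_anova_f({"model_a": [1, 2, 3, 4], "model_b": [2, 4, 4, 6]})
print(round(f, 2), df1, df2)  # → 27.0 1 3
```

With two conditions, repeated-measures ANOVA reduces to a paired t-test (F = t²), which gives a quick independent check of the arithmetic in the placeholder example.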

Key words: Radiology, Large language models, Contrast agents, Ability evaluation
