Chinese Journal of Medical Education ›› 2026, Vol. 46 ›› Issue (4): 315-320. DOI: 10.3760/cma.j.cn115259-20250729-00851

• Medical Education Evaluation •

Evaluating the capability of large language models in answering questions on radiologic contrast agents

Ma Xiaowen1, Zuo Xu2, Gong Jing1, Gu Yajia1   

  1. Department of Radiology, Fudan University Shanghai Cancer Center & Department of Oncology, Shanghai Medical College, Fudan University, Shanghai 200032, China;
  2. Global Medical Services (China), Pharmaceutical Diagnostics, GE Healthcare, Shanghai 200020, China
  • Received: 2025-07-29 Online: 2026-04-01 Published: 2026-03-27
  • Contact: Gu Yajia, Email: cjr.guyajia@vip.163.com

Abstract: Objective To evaluate the capability of large language models (LLMs) in answering questions related to radiologic contrast agents. Methods In March 2025, this study employed a multifactorial repeated-measures design to develop a test question bank and evaluation system for radiologic contrast agents. DeepSeek-R1, DeepSeek-V3, GPT-4, Phi-4, and Llama-3.3 were selected to answer the test items, and performance before and after integration with a contrast agent knowledge base was compared using repeated-measures analysis of variance (ANOVA). Results Before integration with the contrast agent knowledge base, the scores of the five models were DeepSeek-R1 (78.94±3.96), DeepSeek-V3 (76.11±3.31), GPT-4 (75.92±2.02), Phi-4 (55.78±2.18), and Llama-3.3 (66.58±4.04), with statistically significant differences among models (P<0.05). After integration, the scores were DeepSeek-R1 (75.89±2.65), DeepSeek-V3 (79.64±1.97), GPT-4 (77.97±2.19), Phi-4 (73.78±3.49), and Llama-3.3 (80.22±2.71), again with statistically significant differences (P<0.05). On the objective multiple-choice questions, DeepSeek-R1 achieved a perfect score without the knowledge base, while Llama-3.3 attained a perfect score after integration. On the subjective questions, DeepSeek-V3 and GPT-4 scored above 36 without the knowledge base, whereas DeepSeek-R1, DeepSeek-V3, and Llama-3.3 all exceeded 36 after integration. Conclusions All five LLMs demonstrated the ability to answer basic questions on radiologic contrast agents, and the contrast agent knowledge base had a notable impact on their performance. DeepSeek-R1 performed best without the knowledge base, while Llama-3.3 showed the greatest improvement after integration.
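The repeated-measures ANOVA named in the Methods treats each test item as a repeated "subject" scored under every model, so item-to-item difficulty is removed from the error term before the model effect is tested. A minimal sketch of that F-statistic computation, using made-up placeholder scores and model names (not the study's data or code):

```python
# Illustrative only: one-way repeated-measures ANOVA F statistic,
# the test family used to compare LLM scores on a shared item set.
# All scores below are hypothetical placeholders.

def rm_anova_f(conditions):
    """conditions: dict mapping condition name (model) -> list of scores,
    one per repeated 'subject' (test item, same items for every model).
    Returns (F, df_treatment, df_error)."""
    names = list(conditions)
    k = len(names)                     # number of conditions (models)
    n = len(conditions[names[0]])      # number of subjects (items)
    scores = [x for m in names for x in conditions[m]]
    gm = sum(scores) / (k * n)         # grand mean
    ss_total = sum((x - gm) ** 2 for x in scores)
    # between-conditions (model) sum of squares
    ss_treat = n * sum((sum(conditions[m]) / n - gm) ** 2 for m in names)
    # between-subjects (item) sum of squares, removed from the error term
    ss_subj = k * sum(
        (sum(conditions[m][i] for m in names) / k - gm) ** 2 for i in range(n)
    )
    ss_error = ss_total - ss_treat - ss_subj
    df1, df2 = k - 1, (k - 1) * (n - 1)
    return (ss_treat / df1) / (ss_error / df2), df1, df2

# Hypothetical per-item scores for two models on the same four items:
f, df1, df2 = rm_anova_f({"model_a": [1, 2, 3, 4], "model_b": [2, 4, 4, 6]})
print(round(f, 2), df1, df2)  # → 27.0 1 3
```

With two conditions, repeated-measures ANOVA reduces to a paired t-test (F = t²), which gives a quick independent check of the arithmetic in the placeholder example.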

Key words: Radiology, Large language models, Contrast agents, Ability evaluation
