Chinese Journal of Medical Education ›› 2026, Vol. 46 ›› Issue (4): 315-320.DOI: 10.3760/cma.j.cn115259-20250729-00851

• Medical Education Assessment •

Evaluating the capability of large language models in answering questions on radiologic contrast agents

Ma Xiaowen1, Zuo Xu2, Gong Jing1, Gu Yajia1   

  1. Department of Radiology, Fudan University Shanghai Cancer Center & Department of Oncology, Shanghai Medical College, Fudan University, Shanghai 200032, China;
    2Global Medical Services (China), Pharmaceutical Diagnostics, GE Healthcare, Shanghai 200020, China
  • Received:2025-07-29 Online:2026-04-01 Published:2026-03-27
  • Contact: Gu Yajia, Email: cjr.guyajia@vip.163.com

Abstract: Objective To evaluate the capability of large language models (LLMs) in answering questions related to radiologic contrast agents. Methods In March 2025, this study used a multifactorial repeated-measures design to develop a test question bank and an evaluation system for radiologic contrast agents. Five models (DeepSeek-R1, DeepSeek-V3, GPT-4, Phi-4, and Llama-3.3) answered the test items, and performance before and after integration with a contrast agent knowledge base was compared using repeated-measures analysis of variance (ANOVA). Results Before accessing the contrast agent knowledge base, the scores of the five models were as follows: DeepSeek-R1 (78.94±3.96), DeepSeek-V3 (76.11±3.31), GPT-4 (75.92±2.02), Phi-4 (55.78±2.18), and Llama-3.3 (66.58±4.04), with statistically significant differences among models (P<0.05). After integration, the scores were as follows: DeepSeek-R1 (75.89±2.65), DeepSeek-V3 (79.64±1.97), GPT-4 (77.97±2.19), Phi-4 (73.78±3.49), and Llama-3.3 (80.22±2.71), again with statistically significant differences (P<0.05). In multiple-choice questions, DeepSeek-R1 achieved a perfect score without the knowledge base, while Llama-3.3 attained a perfect score after integration. For subjective questions, DeepSeek-V3 and GPT-4 scored above 36 without the knowledge base, whereas DeepSeek-R1, DeepSeek-V3, and Llama-3.3 exceeded 36 after integration. Conclusions All five LLMs demonstrated the ability to answer basic questions on radiologic contrast agents, and the contrast agent knowledge base had a notable impact on their performance: DeepSeek-R1 performed best without the knowledge base, while Llama-3.3 showed the greatest improvement after integration.
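The repeated-measures comparison described in the Methods can be sketched as a one-way repeated-measures ANOVA over item-level scores. This is a minimal illustrative sketch only: the function and the per-item scores below are invented for demonstration and are not the study's data or its actual analysis code.

```python
# Minimal one-way repeated-measures ANOVA sketch (hypothetical data).
# Rows are "subjects" (test items); columns are conditions
# (e.g. a model's score without vs. with the knowledge base).

def repeated_measures_anova(scores):
    """Return (F statistic, df_condition, df_error) for a balanced
    within-subjects design given scores[item][condition]."""
    n = len(scores)        # number of subjects (test items)
    k = len(scores[0])     # number of conditions
    grand = sum(sum(row) for row in scores) / (n * k)
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    row_means = [sum(row) / k for row in scores]
    ss_cond = n * sum((m - grand) ** 2 for m in col_means)
    ss_subj = k * sum((m - grand) ** 2 for m in row_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_cond - ss_subj   # residual (item x condition)
    df_cond, df_err = k - 1, (k - 1) * (n - 1)
    f_stat = (ss_cond / df_cond) / (ss_err / df_err)
    return f_stat, df_cond, df_err

# Invented per-item scores for one model, without vs. with the knowledge base
scores = [[70, 78], [66, 75], [72, 80], [60, 74]]
f, df1, df2 = repeated_measures_anova(scores)
print(f"F({df1}, {df2}) = {f:.2f}")
```

The resulting F statistic would then be compared against the F distribution with (df_cond, df_err) degrees of freedom to obtain the P values reported in the Results.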

Key words: Radiology, Large language models, Contrast agents, Ability evaluation
