Chinese Journal of Medical Education ›› 2026, Vol. 46 ›› Issue (4): 315-320.DOI: 10.3760/cma.j.cn115259-20250729-00851

• Medical Education Assessment •

Evaluating the capability of large language models in answering questions on radiologic contrast agents

Ma Xiaowen1, Zuo Xu2, Gong Jing1, Gu Yajia1   

  1. Department of Radiology, Fudan University Shanghai Cancer Center & Department of Oncology, Shanghai Medical College, Fudan University, Shanghai 200032, China;
    2Global Medical Services (China), Pharmaceutical Diagnostics, GE Healthcare, Shanghai 200020, China
  • Received:2025-07-29 Online:2026-04-01 Published:2026-03-27
  • Contact: Gu Yajia, Email: cjr.guyajia@vip.163.com

Abstract: Objective To evaluate the capability of large language models (LLMs) in answering questions related to radiologic contrast agents. Methods In March 2025, this study used a multifactorial repeated-measures design to develop a test question bank and an evaluation system for radiologic contrast agents. Five models (DeepSeek-R1, DeepSeek-V3, GPT-4, Phi-4, and Llama-3.3) answered the test items, and performance before and after integration with a contrast agent knowledge base was compared using repeated-measures analysis of variance (ANOVA). Results Before accessing the contrast agent knowledge base, the scores of the five models were as follows: DeepSeek-R1 (78.94±3.96), DeepSeek-V3 (76.11±3.31), GPT-4 (75.92±2.02), Phi-4 (55.78±2.18), and Llama-3.3 (66.58±4.04), with statistically significant differences among models (P<0.05). After integration, the scores were as follows: DeepSeek-R1 (75.89±2.65), DeepSeek-V3 (79.64±1.97), GPT-4 (77.97±2.19), Phi-4 (73.78±3.49), and Llama-3.3 (80.22±2.71), again with statistically significant differences (P<0.05). In multiple-choice questions, DeepSeek-R1 achieved a perfect score without the knowledge base, while Llama-3.3 attained a perfect score after integration. For subjective questions, DeepSeek-V3 and GPT-4 scored above 36 without the knowledge base, whereas DeepSeek-R1, DeepSeek-V3, and Llama-3.3 exceeded 36 after integration. Conclusions All five LLMs demonstrated the ability to answer basic questions on radiologic contrast agents, and the contrast agent knowledge base had a notable impact on their performance: DeepSeek-R1 performed best without the knowledge base, while Llama-3.3 showed the greatest improvement after integration.
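The repeated-measures comparison described in the Methods can be sketched as a one-way repeated-measures ANOVA over item-level scores. This is a minimal illustrative sketch only: the function and the per-item scores below are invented for demonstration and are not the study's data or its actual analysis code.

```python
# Minimal one-way repeated-measures ANOVA sketch (hypothetical data).
# Rows are "subjects" (test items); columns are conditions
# (e.g. a model's score without vs. with the knowledge base).

def repeated_measures_anova(scores):
    """Return (F statistic, df_condition, df_error) for a balanced
    within-subjects design given scores[item][condition]."""
    n = len(scores)        # number of subjects (test items)
    k = len(scores[0])     # number of conditions
    grand = sum(sum(row) for row in scores) / (n * k)
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    row_means = [sum(row) / k for row in scores]
    ss_cond = n * sum((m - grand) ** 2 for m in col_means)
    ss_subj = k * sum((m - grand) ** 2 for m in row_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_cond - ss_subj   # residual (item x condition)
    df_cond, df_err = k - 1, (k - 1) * (n - 1)
    f_stat = (ss_cond / df_cond) / (ss_err / df_err)
    return f_stat, df_cond, df_err

# Invented per-item scores for one model, without vs. with the knowledge base
scores = [[70, 78], [66, 75], [72, 80], [60, 74]]
f, df1, df2 = repeated_measures_anova(scores)
print(f"F({df1}, {df2}) = {f:.2f}")
```

The resulting F statistic would then be compared against the F distribution with (df_cond, df_err) degrees of freedom to obtain the P values reported in the Results.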

Key words: Radiology, Large language models, Contrast agents, Ability evaluation
