Chinese Journal of Medical Education, 2024, Vol. 44, Issue (8): 561-569. DOI: 10.3760/cma.j.cn115259-20240524-00520

• Special Article •


Exploratory practice of generative large language models in the construction of medical item banks

Jiang Zhehan1, Feng Shicong2, Wang Weimin1   

  1. Institute of Medical Education, Peking University, Beijing 100191, China;
    2. Master's degree candidate in Medical Education (enrolled 2023), Graduate School of Education, Peking University, Beijing 100871, China
  • Received: 2024-05-24  Online: 2024-08-01  Published: 2024-07-31
  • Contact: Wang Weimin, Email: wwm@bjmu.edu.cn
  • Supported by:
    Project of the Health Human Resources Development Center, National Health Commission, P. R. China (202110-335); National Natural Science Foundation of China Youth Project (72104006); Key Reform Project of the National Medical Examinations Centre, National Health Commission, P. R. China, during the 14th Five-Year Plan Period (2022-21)


Abstract: Item development in health professions education is time-consuming and heavily reliant on content experts. Large language models (LLMs) offer a new way to reduce this burden, but their output quality depends largely on the prompt. This article aims to help medical educators use LLMs effectively for item development by improving item quality through prompt engineering. Using "postoperative bile leakage" as an example, it demonstrates the effects of several prompt engineering strategies: Zero-shot, Few-shot, Chain of Thought (CoT), CoT with Self-Consistency (CoT-SC), and Tree of Thoughts (ToT). The analysis shows that Zero-shot and Few-shot prompting are straightforward to apply but limited in item diversity and depth. In contrast, strategies that incorporate explicit "thought" steps can guide an LLM through drafting, refining, comparing, and finalizing an item, thereby raising its quality. Although prompt refinement yields notable gains in item-writing efficacy, substantial room remains for exploring and optimizing prompt design, and continued work on prompt engineering promises to raise the standard of item bank development in medical education.
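The article itself presents no code, so the sketch below is only a rough illustration of the strategies named in the abstract: it builds zero-shot, few-shot, and chain-of-thought prompts for the "postoperative bile leakage" example and adds a simple self-consistency step that samples several candidates and has a reviewer pass pick the best. It assumes an OpenAI-compatible chat API via the `openai` Python package; the model name, prompt wording, and the `generate_item` helper are hypothetical and are not taken from the article.

```python
# Illustrative sketch only: the prompts, model choice, and helper below are
# hypothetical, not the authors' actual materials. Assumes the `openai`
# Python package and an API key in the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

TOPIC = "postoperative bile leakage"

# Zero-shot: a bare instruction with no examples.
ZERO_SHOT = (
    f"Write one single-best-answer multiple-choice question on {TOPIC} "
    "for a medical licensing examination, with five options (A-E) and "
    "the correct answer marked."
)

# Few-shot: prepend a worked example so the model imitates its format.
FEW_SHOT = (
    "Example item:\n"
    "A 45-year-old woman develops fever and right-upper-quadrant pain "
    "3 days after laparoscopic cholecystectomy. ... (options A-E, answer)\n\n"
    + ZERO_SHOT
)

# Chain of thought: ask the model to draft, critique, and revise in steps,
# mirroring the draft/refine/compare/finalize stages described above.
COT = (
    f"Think step by step. First list the key clinical features of {TOPIC}, "
    "then draft a clinical vignette, then write five options with one best "
    "answer, then check the item for cues and flaws, and output the final item."
)

def generate_item(prompt: str, n: int = 1) -> list[str]:
    """Return n candidate items for a prompt (n > 1 enables self-consistency)."""
    response = client.chat.completions.create(
        model="gpt-4o",          # hypothetical choice; the article names no model
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8,         # some randomness so repeated samples differ
        n=n,
    )
    return [choice.message.content for choice in response.choices]

# CoT with self-consistency: sample several reasoning paths, then have a
# reviewer pass select the most defensible item instead of trusting one sample.
candidates = generate_item(COT, n=3)
review = "\n\n".join(f"Candidate {i+1}:\n{c}" for i, c in enumerate(candidates))
best = generate_item(
    "You are an item-writing reviewer. Compare the candidate questions below "
    "and return the single best one, corrected if needed.\n\n" + review
)[0]
print(best)
```

Tree of Thoughts would extend this pattern further by branching and evaluating intermediate drafts at each step, rather than comparing complete candidates only at the end.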

Key words: Artificial intelligence, Generative large language models, Prompt engineering, Medical test questions, Item bank construction, Assessment development
