• J Gen Intern Med · Oct 2024

    Comparing Scoring Consistency of Large Language Models with Faculty for Formative Assessments in Medical Education.

    • Radhika Sreedhar, Linda Chang, Ananya Gangopadhyaya, Peggy Woziwodzki Shiels, Julie Loza, Euna Chi, Elizabeth Gabel, and Yoon Soo Park.
    • University of Illinois College of Medicine, Chicago, IL, USA. sreedhar@uic.edu.
    • J Gen Intern Med. 2024 Oct 14.

    Background: The Liaison Committee on Medical Education requires that medical students receive individualized feedback on their self-directed learning skills. Pre-clinical students are asked to complete multiple spaced critical appraisal assignments; however, individualized feedback requires significant faculty time. As large language models (LLMs) can score assignments and generate feedback, we explored their use in grading formative assessments through validity and feasibility lenses.

    Objective: To explore the consistency and feasibility of using an LLM to assess and provide feedback for formative assessments in undergraduate medical education.

    Design and Participants: This was a cross-sectional study of pre-clinical students' critical appraisal assignments at the University of Illinois College of Medicine (UICOM) during the 2022-2023 academic year.

    Intervention: An initial sample of ten assignments was used to develop a prompt. For each student entry, the de-identified assignment and prompt were provided to ChatGPT 3.5, and its scoring was compared to the existing faculty grade.

    Main Measures: Differences in scoring of individual items between ChatGPT and faculty were assessed. Scoring consistency, measured as inter-rater reliability (IRR), was calculated as percent exact agreement. A chi-squared test was used to determine whether scores differed significantly. Psychometric characteristics including internal-consistency reliability, area under the precision-recall curve (AUCPR), and cost were studied.

    Key Results: In this cross-sectional study, faculty-graded assignments from 111 pre-clinical students were compared with those graded by ChatGPT, and the scoring of individual items was comparable. Overall agreement between ChatGPT and faculty was 67% (OR = 2.53, P < 0.001); mean AUCPR was 0.69 (range 0.61-0.76). ChatGPT's internal-consistency reliability was 0.64, and its use resulted in a fivefold reduction in faculty time, a potential savings of 150 faculty hours.

    Conclusions: This study of the psychometric characteristics of ChatGPT demonstrates a potential role for LLMs in assisting faculty with assessing and providing feedback on formative assignments.

    © 2024. The Author(s), under exclusive licence to Society of General Internal Medicine.
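    The consistency metrics named in the abstract (percent exact agreement, chi-squared test, AUCPR, internal-consistency reliability) are standard and straightforward to reproduce. Below is a minimal sketch, not the study's actual code, of how they might be computed; the simulated 0/1 item scores, the 8-item rubric size, and the use of scipy/scikit-learn are all assumptions for illustration.

    ```python
    # Minimal sketch (NOT the authors' code) of the psychometric checks
    # described in the abstract, run on hypothetical 0/1 item scores.
    import numpy as np
    from scipy.stats import chi2_contingency
    from sklearn.metrics import average_precision_score

    rng = np.random.default_rng(0)

    # Hypothetical data: 111 students x 8 rubric items, scored 0/1 by
    # faculty and by the LLM (real scores would come from the assignments).
    n_students, n_items = 111, 8
    faculty = rng.integers(0, 2, size=(n_students, n_items))
    llm = np.where(rng.random((n_students, n_items)) < 0.67,
                   faculty, 1 - faculty)  # ~67% agreement by construction

    # Inter-rater reliability as percent exact agreement.
    agreement = (faculty == llm).mean()

    # Chi-squared test on the 2x2 contingency table of faculty vs LLM scores.
    table = np.array([[np.sum((faculty == a) & (llm == b))
                       for b in (0, 1)] for a in (0, 1)])
    chi2, p, _, _ = chi2_contingency(table)

    # AUCPR, treating faculty scores as ground truth and LLM scores
    # as predictions.
    aucpr = average_precision_score(faculty.ravel(), llm.ravel())

    # Cronbach's alpha as the LLM's internal-consistency reliability.
    item_var = llm.var(axis=0, ddof=1).sum()
    total_var = llm.sum(axis=1).var(ddof=1)
    alpha = n_items / (n_items - 1) * (1 - item_var / total_var)

    print(f"exact agreement: {agreement:.2f}, chi2 p = {p:.3g}, "
          f"AUCPR = {aucpr:.2f}, alpha = {alpha:.2f}")
    ```

    On real data, agreement and AUCPR would be computed per rubric item (the abstract reports a range of 0.61-0.76 across items) rather than pooled as in this sketch.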
