
Neurosurgery Oral Board Exam: GPT-4 surpasses GPT-3.5 and Google Bard in performance

A study posted on the medRxiv preprint server has brought to light the superior performance of GPT-4, the latest language model from OpenAI, over GPT-3.5 and Google Bard on a neurosurgery oral board exam. The study was carried out by researchers in the United States who assessed the performance of the three general-purpose Large Language Models (LLMs) on higher-order questions representative of the American Board of Neurological Surgery (ABNS) oral board examination.

The ABNS neurosurgery oral board examination is considered a more rigorous assessment than its written counterpart and is taken by doctors two to three years after residency graduation. It comprises three sessions of 45 minutes each, and its pass rate has not exceeded 90% since 2018. The study assessed the performance of GPT-3.5, GPT-4, and Google Bard on a 149-question module imitating the neurosurgery oral board exam.

All three LLMs assessed in this study have previously shown the capability to pass medical board exams composed of multiple-choice questions. However, no prior study had tested or compared the performance of multiple LLMs on predominantly higher-order questions from a high-stakes medical subspecialty such as neurosurgery.

The study found that GPT-4 attained a score of 82.6% on the 149-question module, outperforming GPT-3.5 (ChatGPT), which scored 62.4%. GPT-4 also performed better in the Spine subspecialty, scoring 90.5% compared to GPT-3.5's 64.3%. Google Bard generated correct responses to 44.2% of the questions, while GPT-3.5 and GPT-4 never declined to answer a text-based question.

The study findings underscore the urgent need for neurosurgeons to stay informed about emerging LLMs and their varying performance levels for potential clinical applications. As the AI domain advances, neurosurgical trainees may come to rely on LLMs for board preparation, using them to gain new clinical insights and as a conversational aid for rehearsing challenging clinical scenarios likely to appear on the boards.

However, greater trust in LLM systems must still be established, so rigorous validation of their performance on increasingly higher-order and open-ended scenarios should continue; this would help ensure the safe and effective integration of these LLMs into clinical decision-making processes. The study also highlights the importance of methods to quantify and understand hallucinations; ultimately, only LLMs that minimize and recognize hallucinations are likely to be incorporated into clinical practice.

The study findings also suggest that multiple-choice examination formats might become obsolete in medical education, while verbal assessments will gain importance. Furthermore, the study notes that GPT-4 showed reduced rates of hallucination and could navigate challenging concepts such as declaring medical futility. However, it struggled in other scenarios, such as factoring in patient-level characteristics like frailty.

In conclusion, the study posted on the medRxiv preprint server has shown that GPT-4 outperforms GPT-3.5 and Google Bard on a simulated neurosurgery oral board exam. The findings underscore the need for rigorous validation of language models on increasingly higher-order and open-ended scenarios, and for neurosurgeons to stay informed about emerging language models and their varying performance levels for potential clinical applications.


