Temporal Evolution of Large Language Models For Response Evaluation Criteria in Solid Tumors-Based Response Evaluation After Locoregional Therapy
1Department of Radiology, Recep Tayyip Erdogan University, Training and Research Hospital, Rize, Türkiye
2Department of Radiology, Università degli Studi di Udine, Udine, Italy
3School of Medicine, Guilan University of Medical Sciences, Rasht, Gilan, Iran
4Department of Diagnostic and Interventional Radiology, Frankfurt University Hospital, Frankfurt, Germany
Eur Arch Med Res 2025; 41(3): 146-153 DOI: 10.14744/eamr.2025.24861

Abstract

Objective: This study aimed to assess how accurately various large language models (LLMs) assign Response Evaluation Criteria in Solid Tumors (RECIST) categories from tumor measurements in computed tomography (CT) reports of patients with hepatocellular carcinoma (HCC), obtained before and after transcatheter arterial chemoembolization.
Materials and Methods: Ninety-three patients were included after the exclusion criteria were applied. RECIST assessments were performed using Bard, Bing, and ChatGPT-4 in 2023 and using their updated versions (ChatGPT-4, Gemini, and Copilot) in 2025. Evaluations were based on RECIST categories determined from baseline and follow-up measurements of the longest tumor diameters on contrast-enhanced CT scans. Zero-shot prompting was used for the LLM inputs. LLM-generated RECIST classifications were compared with radiologist reports. Model performance was assessed in both years, and changes over time were analyzed.
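
For readers unfamiliar with the categories, the following is a minimal sketch of how a RECIST 1.1 target-lesion category can be derived from a baseline and a follow-up sum of longest diameters. It is an illustration under stated assumptions, not the study's method: the function name `classify_recist` is hypothetical, the baseline is treated as the nadir because only two time points are available, and new lesions, non-target lesions, and lymph node short-axis rules are ignored.

```python
def classify_recist(baseline_mm: float, followup_mm: float) -> str:
    """Return a simplified RECIST 1.1 target-lesion category.

    Assumptions (not taken from the study): inputs are sums of longest
    diameters in millimetres, and the baseline doubles as the nadir
    because only two time points are compared.
    """
    if baseline_mm <= 0:
        raise ValueError("baseline sum of diameters must be positive")
    if followup_mm < 0:
        raise ValueError("follow-up sum of diameters cannot be negative")

    # Complete response: all target lesions have disappeared.
    if followup_mm == 0:
        return "CR"

    absolute_change = followup_mm - baseline_mm
    relative_change = absolute_change / baseline_mm

    # Progressive disease: at least a 20% increase over the nadir
    # AND an absolute increase of at least 5 mm.
    if relative_change >= 0.20 and absolute_change >= 5:
        return "PD"

    # Partial response: at least a 30% decrease from baseline.
    if relative_change <= -0.30:
        return "PR"

    # Stable disease: neither PR nor PD criteria are met.
    return "SD"


if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    for baseline, followup in [(50, 0), (50, 30), (50, 62), (10, 13), (50, 45)]:
        print(baseline, followup, classify_recist(baseline, followup))
```

The 5 mm absolute threshold matters for small lesions: in the hypothetical pair (10, 13) the diameter grows by 30%, yet the category remains stable disease because the increase is only 3 mm.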

Results: ChatGPT-4 (in both 2023 and 2025) and Copilot (2025) achieved perfect scores for accuracy, precision, recall, and F1 score (all 1.000). Gemini improved significantly, with accuracy rising from 0.581 in 2023 (as Bard) to 0.989 in 2025. Bing’s accuracy also increased from 0.839 to 1.000 after it was updated to Copilot. Cohen’s kappa showed moderate agreement between ChatGPT-4 and Bing in 2023 (κ=0.612, p<0.001) and perfect agreement between ChatGPT-4 and Copilot in 2025 (κ=1.000). McNemar’s test showed no significant change for ChatGPT-4 between 2023 and 2025 (p=1.000), whereas Gemini and Copilot improved significantly (p<0.0001 and p=0.0003, respectively).
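
The statistics reported above can be computed from paired category labels. The sketch below is a hedged illustration using only the Python standard library. The function names and the toy labels are hypothetical and do not reproduce the study's data, and the exact binomial form of McNemar's test is an assumption, since the abstract does not state which variant was used.

```python
from math import comb


def accuracy(truth, pred):
    """Fraction of cases where the predicted category matches the reference."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    expected = sum(
        (rater_a.count(lab) / n) * (rater_b.count(lab) / n) for lab in labels
    )
    if expected == 1:  # both raters always use the same single label
        return 1.0
    return (observed - expected) / (1 - expected)


def mcnemar_exact(truth, pred_old, pred_new):
    """Two-sided exact McNemar p-value on correct/incorrect status.

    Only discordant pairs count: cases where exactly one of the two
    model versions is correct.
    """
    old_only = sum(o == t and n != t for t, o, n in zip(truth, pred_old, pred_new))
    new_only = sum(o != t and n == t for t, o, n in zip(truth, pred_old, pred_new))
    discordant = old_only + new_only
    if discordant == 0:
        return 1.0
    k = min(old_only, new_only)
    tail = sum(comb(discordant, i) for i in range(k + 1)) / 2 ** discordant
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    truth = ["PR", "SD", "PD", "CR", "SD", "PR"]
    old = ["PR", "PR", "PD", "SD", "SD", "SD"]
    new = list(truth)
    print("accuracy old:", accuracy(truth, old))
    print("accuracy new:", accuracy(truth, new))
    print("kappa old vs new:", cohen_kappa(old, new))
    print("McNemar p:", mcnemar_exact(truth, old, new))
```

With a cohort the size of the one described here, the exact form is a reasonable default because the number of discordant pairs can be small. A chi-square approximation would be an alternative when discordant pairs are numerous.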

Conclusion: LLMs demonstrate strong potential in RECIST evaluation from CT reports in HCC patients, and ongoing improvements suggest they may increasingly aid radiological assessments in the future.