General-purpose large language models outperform specialized clinical AI tools on medical benchmarks

Source Domain: www.nature.com

Specialist clinical artificial intelligence (AI) tools developed for medical practice are performing worse than general-purpose large language models (LLMs).
Evaluations indicate that general-purpose LLMs outperform clinical AI tools across various metrics, including medical knowledge, agreement with expert clinicians, and practical clinical use.
A study comparing clinical AI tools (OpenEvidence and UpToDate Expert AI) against leading general-purpose LLMs (OpenAI GPT-5.2, Google Gemini 3.1 Pro Preview, and Anthropic Claude Opus 4.6) found that the general-purpose LLMs answered MedQA questions more accurately.
Frontier LLMs also scored higher than specialist clinical AI tools in evaluations of practical, real-world clinical queries, performing better on clinical correctness, completeness, safety, and clarity.
The study suggests that current proprietary clinical AI tools may not need to be overhauled but could instead benefit from faster iteration cycles, larger training corpora, and greater alignment; the long-term value of domain-specific tuning also remains a consideration.
The study has limitations: the clinical tools lack public APIs, which affected the evaluation method, and the benchmarks may be contaminated.
Independent evaluation is essential because industry-developed benchmarks might favor their developers' own systems, although real-world clinical query benchmarks evaluated by clinicians provide critical additional data.
The study highlights the need for rigorous, independent evaluation of generative models for real-world medical tasks and suggests a future direction towards hospital-specific LLMs that can leverage institutional data to deliver more useful medical recommendations.
