General-purpose large language models outperform specialized clinical AI tools on medical benchmarks

https://www.nature.com/articles/s41591-026-04431-5

Publish Date: 2026-06-12 06:03:00

Source Domain: www.nature.com

  • Specialist clinical artificial intelligence (AI) tools, developed specifically for medical practice, perform worse than general-purpose large language models (LLMs).
  • Evaluations indicate that general-purpose LLMs outperform clinical AI tools across various metrics, including medical knowledge, agreement with expert clinicians, and practical clinical use.
  • A study comparing clinical AI tools (OpenEvidence and UpToDate Expert AI) against leading general-purpose LLMs (OpenAI GPT-5.2, Google Gemini 3.1 Pro Preview, and Anthropic Claude Opus 4.6) found general-purpose LLMs to have superior accuracy in answering MedQA questions.
  • Frontier LLMs also scored higher than specialist clinical AI tools in evaluations of practical, real-world clinical queries, performing better on clinical correctness, completeness, safety, and clarity.
  • The study suggests that current proprietary clinical AI tools may not need to be overhauled but may instead benefit from faster iteration cycles, larger training corpora, and greater alignment, although the long-term value of domain-specific tuning also needs to be weighed.
  • The study has limitations: the clinical tools lack public APIs, which affected the evaluation method, and the benchmarks may be contaminated.
  • Independent evaluation is essential because industry-developed benchmarks might favor the developers' own systems, although real-world clinical query benchmarks evaluated by clinicians provide critical additional data.
  • The study highlights the need for rigorous, independent evaluation of generative models for real-world medical tasks and suggests a future direction towards hospital-specific LLMs that can leverage institutional data to deliver more useful medical recommendations.
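
The two kinds of comparison summarized above (accuracy on multiple-choice questions such as MedQA, and clinician ratings of correctness, completeness, safety, and clarity) can be illustrated with a minimal sketch. This is not the study's evaluation code. The system names, answer key, predictions, and the 1-5 rating scale below are all hypothetical values invented for illustration, and none of the numbers are results from the study.

```python
from statistics import mean

# Hypothetical toy data, invented for illustration only (NOT study results).
ANSWER_KEY = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}

PREDICTIONS = {
    "system_x": {"q1": "A", "q2": "C", "q3": "B", "q4": "A"},
    "system_y": {"q1": "A", "q2": "B", "q3": "B", "q4": "A"},
}

# The four rubric dimensions named in the summary.
RUBRIC_DIMENSIONS = ("correctness", "completeness", "safety", "clarity")

# Hypothetical clinician ratings on an assumed 1-5 scale, one dict per response.
RATINGS = [
    {"correctness": 4, "completeness": 3, "safety": 5, "clarity": 4},
    {"correctness": 2, "completeness": 3, "safety": 5, "clarity": 2},
]


def accuracy(predictions, answer_key):
    """Fraction of questions in the key answered correctly.

    A question with no prediction counts as wrong.
    """
    correct = sum(
        1 for qid, gold in answer_key.items() if predictions.get(qid) == gold
    )
    return correct / len(answer_key)


def mean_rubric_scores(ratings):
    """Average each rubric dimension over a list of per-response ratings."""
    return {dim: mean(r[dim] for r in ratings) for dim in RUBRIC_DIMENSIONS}


if __name__ == "__main__":
    for name, preds in PREDICTIONS.items():
        print(name, accuracy(preds, ANSWER_KEY))
    print(mean_rubric_scores(RATINGS))
```

Counting unanswered questions as wrong is a design choice made here for simplicity; an actual evaluation might track refusals or unparseable answers separately.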