AI is failing ‘Humanity’s Last Exam’. So what does that mean for machine intelligence?

https://theconversation.com/ai-is-failing-humanitys-last-exam-so-what-does-that-mean-for-machine-intelligence-274620

Publish Date: 2026-01-29 18:54:00

Source Domain: theconversation.com

New AI Benchmark “Humanity’s Last Exam” Released: Published in Nature, this benchmark features 2,500 questions designed to identify AI’s current limitations across diverse academic fields, compiled with input from nearly 1,000 international experts.
Benchmark Highlights AI’s Current Weaknesses: Early results showed that leading AI models scored very poorly, achieving only about 2.7% to 8% accuracy, which indicates that the questions were beyond the capabilities of those models.
Why the Benchmark Doesn’t Indicate AI Is Nearing Human Intelligence: The test measures specific knowledge and performance rather than true understanding. A high score does not mean a model has evolved to think like a human, only that it has learned patterns from test questions.
Human and Machine Intelligence Are Different: Human intelligence results from continuous learning from experiences, while AI processes information based on patterns from training data. AI lacks the intrinsic understanding humans possess.
AI’s Purpose Isn’t to Mimic Human Learning: Rather than learning as humans do, AI models are optimized for specific benchmark tasks, which focuses them on the exact types of questions the test contains. This isn’t indicative of a broader, human-like intellect.
Score Improvements Show Optimization, Not Superintelligence: The improvement in AI models’ scores on this exam over time demonstrates targeted optimization rather than progress toward human-like or general intelligence.
Real-World Relevance of Benchmarks: The benchmark’s questions skew toward STEM fields, so it is less predictive for tasks such as writing and communication. Real utility should be evaluated on the specific tasks relevant to the intended use.
Practical Advice on AI Adoption: For professionals using or considering AI tools, benchmark scores shouldn’t sway decisions since specialized tasks may not align with exam questions. It’s better to create custom tests assessing how AI performs the specific tasks required.
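The advice to build custom tests can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a method described in the article: `TaskCase`, `evaluate`, and `toy_model` are hypothetical names, the keyword checks are deliberately crude, and `toy_model` is a stand-in for whatever AI tool is being assessed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TaskCase:
    """One task from your own workflow plus a rule for judging the output."""
    prompt: str
    check: Callable[[str], bool]  # returns True if the output is acceptable


def evaluate(model: Callable[[str], str], cases: List[TaskCase]) -> float:
    """Return the fraction of cases whose output passes its check."""
    if not cases:
        raise ValueError("need at least one test case")
    passed = sum(1 for case in cases if case.check(model(case.prompt)))
    return passed / len(cases)


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in: replace with a call to the tool under evaluation.
    if "refund" in prompt.lower():
        return "We are sorry for the trouble. Your refund will be processed."
    return "Thank you for your message."


# Hypothetical cases drawn from a customer-support workflow.
cases = [
    TaskCase(
        prompt="Draft a reply to a customer asking for a refund.",
        check=lambda out: "refund" in out.lower(),
    ),
    TaskCase(
        prompt="Draft a reply to a customer reporting a late delivery.",
        check=lambda out: "delivery" in out.lower(),
    ),
]

score = evaluate(toy_model, cases)
print(f"Pass rate on custom tasks: {score:.0%}")
```

In practice the checks would likely be human review or a task-specific rubric rather than keyword matching. The point of the sketch is only that the pass rate is measured on the tasks you actually need done, not on exam questions.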