AI benchmarks are broken. Here’s what we need instead.

Traditional AI benchmarks evaluate machines on isolated tasks, often oversimplifying complex real-world applications where AI will interact with human teams and organizational workflows.
Current benchmarking methods do not accurately reflect how AI performs in real-world environments, leading to unmet expectations and wasted resources after deployment.
When a model that scored well on benchmarks fails to deliver the expected results in actual use, it may be abandoned, wasting the time, effort, and money invested in it.
Current benchmarks also create regulatory blind spots: because they do not capture real-world AI use and its systemic impacts, oversight built on them risks being misaligned with practical realities.
To better predict AI’s real-world performance, a new benchmarking approach called HAIC benchmarks (Human–AI, Context-Specific Evaluation) needs to be adopted. These benchmarks evaluate AI’s performance over time within teams and workflows, emphasizing long-term impact, systemic effects, and error detectability.
HAIC benchmarks aim to shift the focus from task-level accuracy to broader organizational outcomes, examining how AI integrates into the collaboration, decision-making processes, and workflows of human teams.
Adopting HAIC benchmarks means evaluating AI as an integral part of human teams and workflows, recognizing the necessity of continuous, longitudinal assessments, and looking at broader systemic consequences rather than just isolated task performance.
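As a rough illustration of the difference between task-level and team-level evaluation, the sketch below tracks three quantities per time period instead of a single accuracy score. It is not a HAIC implementation: the record fields, the metric names, and the sample data are all hypothetical, chosen only to show what longitudinal, team-level measurement with an error-detectability signal might look like.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseRecord:
    """One case handled by a human-AI team (all fields are hypothetical)."""
    period: int            # e.g. month since deployment
    ai_correct: bool       # was the AI's output correct in isolation?
    error_caught: bool     # if the AI erred, did the team notice?
    team_outcome_ok: bool  # was the final team decision acceptable?


def summarize_by_period(records):
    """Return {period: metrics} so trends over time are visible."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record.period].append(record)

    summary = {}
    for period, cases in sorted(grouped.items()):
        errors = [c for c in cases if not c.ai_correct]
        caught = sum(c.error_caught for c in errors)
        summary[period] = {
            # What a traditional, isolated-task benchmark would report.
            "task_accuracy": sum(c.ai_correct for c in cases) / len(cases),
            # Share of AI errors the human team noticed (None if no errors).
            "error_detectability": caught / len(errors) if errors else None,
            # Share of cases where the team's final decision was acceptable.
            "team_outcome_rate": sum(c.team_outcome_ok for c in cases) / len(cases),
        }
    return summary


# Made-up sample data, for illustration only.
records = [
    CaseRecord(1, True, False, True),
    CaseRecord(1, False, True, True),
    CaseRecord(2, True, False, True),
    CaseRecord(2, False, False, False),
]
for period, metrics in summarize_by_period(records).items():
    print(period, metrics)
```

In this made-up data the task accuracy is identical in both periods, yet the team outcome worsens in the second period because an AI error goes unnoticed. A single accuracy score cannot show that kind of drift.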