Scientists built the hardest AI test ever and the results are surprising

ScienceDaily
Researchers created "Humanity's Last Exam" (HLE), a 2,500-question test, because current AI benchmarks are too easy.

Summary

As advanced AI models began achieving near-ceiling scores on existing academic benchmarks such as MMLU, nearly 1,000 researchers from around the world developed a more rigorous assessment called "Humanity's Last Exam" (HLE). The 2,500-question exam spans specialized fields such as ancient languages and advanced mathematics, with questions written to require deep, verifiable human expertise and to resist simple internet searches. Any question that leading AI models could already answer was removed so the test would remain challenging.

Early results showed even top models struggling: GPT-4o scored just 2.7%, and the best models reached only 40-50% accuracy. Dr. Tung Nguyen of Texas A&M noted that HLE measures depth and context beyond pattern recognition, and emphasized that accurate assessment tools are crucial for policymakers to understand AI's true capabilities and risks. The exam is intended as a durable benchmark, with most questions kept private to prevent memorization, and its results highlight the gap that remains between current AI and genuine human expertise.

(Source: ScienceDaily)