Humanity's Last Exam Challenges AI
Analysis based on 7 articles · First reported Feb 25, 2026 · Last updated Mar 13, 2026
The development of Humanity's Last Exam provides a new, more challenging benchmark for AI systems, which could influence investment and development strategies in the AI industry. The low scores of leading models like OpenAI's GPT-4o and Anthropic's Claude highlight areas where current AI technology still falls short, potentially tempering expectations for near-term human-level AI.
A global group of nearly 1,000 researchers, including Dr. Tung Nguyen of Texas A&M University, developed 'Humanity's Last Exam' (HLE), a 2,500-question assessment designed to probe the limits of advanced AI systems. The benchmark spans diverse academic fields, including ancient languages, the natural sciences, and mathematics, with questions specifically crafted to exceed the current capabilities of AI. Leading models such as OpenAI's GPT-4o and Anthropic's Claude scored very low, while Google's Gemini 3.1 Pro performed better but still demonstrated significant gaps compared to human intelligence. The exam aims to provide a durable, transparent tool for evaluating AI progress and identifying areas where human expertise remains unique.