This event is archived. Final snapshot from when the story concluded.
Tech AI benchmark development

Humanity's Last Exam Challenges AI

Analysis based on 7 articles · First reported Feb 25, 2026 · Last updated Mar 13, 2026

Sentiment: 20 · Attention: 4 · Articles: 7 · Market Impact: Direct

The development of Humanity's Last Exam provides a new, more challenging benchmark for AI systems, one that could influence investment and development strategies across the AI industry. The low scores of leading models such as OpenAI's GPT-4o and Anthropic's Claude highlight areas where current AI technology still falls short, potentially tempering expectations of imminent human-level AI.

Artificial intelligence · Technology · Education

A global group of nearly 1,000 researchers, including Dr. Tung Nguyen of Texas A&M University, developed Humanity's Last Exam (HLE), a 2,500-question assessment designed to probe the limits of advanced AI systems. The benchmark spans diverse academic fields, including ancient languages, the natural sciences, and mathematics, with questions specifically crafted to exceed the current capabilities of AI. Some leading models, such as OpenAI's GPT-4o and one version of Anthropic's Claude, scored in the low single digits, while Google's Gemini 3.1 Pro and a different Claude model performed better but still showed significant gaps relative to human intelligence. The exam aims to provide a durable, transparent tool for evaluating AI progress and for identifying areas where human expertise remains unique.
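As a rough illustration of what these headline percentages mean on a 2,500-question exam, the arithmetic below converts each reported accuracy into an approximate count of correctly answered questions (the scores are those reported in this summary; the helper function is ours, not part of the benchmark):

```python
# Convert HLE accuracy percentages into approximate counts of
# correctly answered questions, out of the exam's 2,500 items.

TOTAL_QUESTIONS = 2500

def correct_answers(accuracy_pct: float, total: int = TOTAL_QUESTIONS) -> int:
    """Approximate number of questions answered correctly."""
    return round(total * accuracy_pct / 100)

# Reported scores from the summary above.
scores = {
    "GPT-4o": 2.7,
    "Claude (lower-scoring version)": 4.1,
}

for model, pct in scores.items():
    print(f"{model}: {pct}% = about {correct_answers(pct)} of {TOTAL_QUESTIONS} questions")

# The top-scoring models were reported in a 40-50 percent range.
low, high = correct_answers(40), correct_answers(50)
print(f"Top models: between {low} and {high} questions correct")
```

Even the strongest reported scores leave over a thousand questions unanswered, which is the gap relative to human expert performance that the exam is meant to expose.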

70 · Tung Nguyen helped write and refine exam questions
60 · OpenAI's GPT-4o scored low on the new AI benchmark
60 · One version of Anthropic's Claude scored low on the new AI benchmark
60 · Google's Gemini 3.1 Pro achieved a high score on the new AI benchmark
60 · Another version of Anthropic's Claude achieved a high score on the new AI benchmark
50 · OpenAI had models tested on the new benchmark
50 · Anthropic had models tested on the new benchmark
50 · Google had a model tested on the new benchmark
Tung Nguyen, an instructional associate professor at Texas A&M University, was a key contributor to Humanity's Last Exam, helping to write and refine many questions, especially in mathematics and computer science.
Importance 70 Sentiment 20
GPT-4o, an OpenAI model, scored 2.7 percent on Humanity's Last Exam, indicating its current limitations in handling complex, specialized human knowledge.
Importance 60 Sentiment -10
One version of Anthropic's Claude scored 4.1 percent on Humanity's Last Exam, demonstrating its struggle with the new benchmark.
Importance 60 Sentiment -10
Google's Gemini 3.1 Pro achieved an accuracy between 40 and 50 percent on Humanity's Last Exam, making it one of the most capable systems tested so far.
Importance 60 Sentiment 10
Another version of Anthropic's Claude reached an accuracy between 40 and 50 percent on Humanity's Last Exam, positioning it among the top-performing AI systems.
Importance 60 Sentiment 10
OpenAI's models, including GPT-4o and o1, were tested against Humanity's Last Exam; o1 performed slightly better than GPT-4o but still showed significant gaps compared to human intelligence.
Importance 50 Sentiment 0
Two versions of Anthropic's Claude were evaluated with Humanity's Last Exam, with one of them showing stronger performance.
Importance 50 Sentiment 0

About NewsDesk

NewsDesk is a news intelligence platform that converts raw news articles into structured data. It tracks events, entities, and the relationships between them, with sentiment and attention metrics derived from thousands of articles. Pages on this site are daily static snapshots from the platform's live database. For real-time tracking, search, and alerts, the full dashboard is at app.newsdesk.dev.