This event is archived. Final snapshot from when the story concluded.
Tech AI benchmark development

Humanity's Last Exam Challenges AI

Analysis based on 7 articles · First reported Feb 25, 2026 · Last updated Mar 13, 2026

Sentiment: 20 · Attention: 4 · Articles: 7 · Market Impact: Direct

The development of Humanity's Last Exam provides a new, more challenging benchmark for AI systems, one that could influence investment and development strategies across the AI industry. The low scores of leading models such as OpenAI's GPT-4o and Anthropic's Claude highlight areas where current AI technology still falls short, potentially tempering expectations of imminent human-level AI.

Artificial intelligence · Technology · Education

A global group of nearly 1,000 researchers, including Dr. Tung Nguyen of Texas A&M University, developed Humanity's Last Exam (HLE), a 2,500-question assessment designed to probe the limits of advanced AI systems. The benchmark spans diverse academic fields, including ancient languages, the natural sciences, and mathematics, with questions specifically crafted to exceed the current capabilities of AI. Some leading models, such as OpenAI's GPT-4o and one version of Anthropic's Claude, scored in the low single digits, while Google's Gemini 3.1 Pro and a different Claude model performed better but still showed significant gaps relative to human intelligence. The exam aims to provide a durable, transparent tool for evaluating AI progress and for identifying areas where human expertise remains unique.
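As a rough illustration of what these headline percentages mean on a 2,500-question exam, the arithmetic below converts each reported accuracy into an approximate count of correctly answered questions (the scores are those reported in this summary; the helper function is ours, not part of the benchmark):

```python
# Convert HLE accuracy percentages into approximate counts of
# correctly answered questions, out of the exam's 2,500 items.

TOTAL_QUESTIONS = 2500

def correct_answers(accuracy_pct: float, total: int = TOTAL_QUESTIONS) -> int:
    """Approximate number of questions answered correctly."""
    return round(total * accuracy_pct / 100)

# Reported scores from the summary above.
scores = {
    "GPT-4o": 2.7,
    "Claude (lower-scoring version)": 4.1,
}

for model, pct in scores.items():
    print(f"{model}: {pct}% = about {correct_answers(pct)} of {TOTAL_QUESTIONS} questions")

# The top-scoring models were reported in a 40-50 percent range.
low, high = correct_answers(40), correct_answers(50)
print(f"Top models: between {low} and {high} questions correct")
```

Even the strongest reported scores leave over a thousand questions unanswered, which is the gap relative to human expert performance that the exam is meant to expose.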

70 · Tung Nguyen helped write and refine exam questions
60 · OpenAI's GPT-4o scored low on the new AI benchmark
60 · One version of Anthropic's Claude scored low on the new AI benchmark
60 · Google's Gemini 3.1 Pro achieved a high score on the new AI benchmark
60 · Another version of Anthropic's Claude achieved a high score on the new AI benchmark
50 · OpenAI had models tested on the new benchmark
50 · Anthropic had models tested on the new benchmark
50 · Google had a model tested on the new benchmark
Tung Nguyen, an instructional associate professor at Texas A&M University, was a key contributor to Humanity's Last Exam, helping to write and refine many questions, especially in mathematics and computer science.
Importance 70 Sentiment 20
GPT-4o, an OpenAI model, scored 2.7 percent on Humanity's Last Exam, indicating its current limitations in handling complex, specialized human knowledge.
Importance 60 Sentiment -10
One version of Anthropic's Claude scored 4.1 percent on Humanity's Last Exam, demonstrating its struggle with the new benchmark.
Importance 60 Sentiment -10
Google's Gemini 3.1 Pro achieved an accuracy between 40 and 50 percent on Humanity's Last Exam, making it one of the most capable systems tested so far.
Importance 60 Sentiment 10
Another version of Anthropic's Claude reached an accuracy between 40 and 50 percent on Humanity's Last Exam, positioning it among the top-performing AI systems.
Importance 60 Sentiment 10
OpenAI's models, including GPT-4o and o1, were tested against Humanity's Last Exam; o1 performed slightly better than GPT-4o but still showed significant gaps compared to human intelligence.
Importance 50 Sentiment 0
Two versions of Anthropic's Claude were evaluated with Humanity's Last Exam, with one of them showing stronger performance.
Importance 50 Sentiment 0

About NewsDesk

NewsDesk is a news intelligence platform that converts raw news articles into structured data. It tracks events, entities, and the relationships between them, with sentiment and attention metrics derived from thousands of articles. Pages on this site are daily static snapshots from the platform's live database. For real-time tracking, search, and alerts, the full dashboard is at app.newsdesk.dev.