Nature Study: AI Agents Fail at Complex Scientific Tasks

While the tech industry celebrates AI agents as the next revolution, an analysis published in Nature delivers a reality check: the best AI agents achieve only about half the performance of PhD-level experts on complex tasks.

The findings come from the Stanford AI Index Report 2026, released this week by the Institute for Human-Centered AI. Nature picked up the results in a detailed analysis.

What Was Tested

The researchers pitted current AI agents (systems that can autonomously execute actions and handle multi-step workflows) against human experts. On simple tasks, agents now perform respectably. But once a task demands real domain expertise, creative problem-solving, and the synthesis of information from multiple sources, they fall significantly behind.

This matters because AI agents are being positioned as game-changers everywhere right now. OpenAI has Codex, Anthropic has Claude Code Routines, Google is building agents into Gemini. The investment is enormous.

The Nuance in the Data

What the study also shows: AI tools are highly useful despite their limitations. Researchers who work with AI support publish three times as many papers and receive nearly five times as many citations as colleagues without AI assistance.

There’s a catch, though: research focus is simultaneously narrowing. When everyone uses the same AI tools, they tend to ask similar questions and choose similar methods. More output, less diversity.

My Take

This study matches what I see in practice. AI agents are fantastic for clearly defined, repeatable tasks. But as soon as you venture into unknown territory — real research, real creativity, real problem-solving — you still need a human setting the direction.

This isn’t an argument against agents. It’s an argument for seeing them as what they are: extremely powerful tools that amplify humans but don’t replace them. At least not yet.

The Stanford AI Index Report 2026 also found that Anthropic currently leads in model performance, followed by xAI, Google, and OpenAI. But that’s a different story.

Sources: