Claude Outperforms Scientist Panels: Anthropic's New Bioinformatics Benchmark

Anthropic published a research paper yesterday that caught my attention: BioMysteryBench, a benchmark that tests AI models on real-world bioinformatics problems.

What Is BioMysteryBench?

The benchmark consists of 99 questions from various fields of bioinformatics. What makes it special: the questions are based on real, messy datasets — not clean textbook examples. Domain experts designed the questions to target objective, verifiable properties of the data.

Questions are split into two categories: ‘human-solvable’ (problems that experts can solve) and ‘human-difficult’ (problems that even experienced bioinformaticians struggle with).

How Does Claude Perform?

The latest generation of Claude reliably solves the majority of human-solvable problems. But here’s the exciting part: on a meaningful fraction of ‘human-difficult’ tasks, Claude outperforms panels of five domain experts.

Claude Mythos Preview achieves a 30% solve rate on the hardest tasks. That might not sound like much — until you consider that human expert panels often land at zero on these same problems.
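
To make the "solve rate" figure concrete, here is a minimal sketch of how a per-category solve rate could be tallied. The function name, the record layout, and the sample results are all invented for illustration; none of them come from the paper or its data.

```python
from collections import defaultdict


def solve_rates(results):
    """Return {category: fraction solved} for (category, solved) pairs."""
    totals = defaultdict(int)
    solved = defaultdict(int)
    for category, is_solved in results:
        totals[category] += 1
        if is_solved:
            solved[category] += 1
    return {category: solved[category] / totals[category] for category in totals}


# Hypothetical results, made up for this example only.
example_results = [
    ("human-solvable", True),
    ("human-solvable", True),
    ("human-solvable", False),
    ("human-difficult", True),
    ("human-difficult", False),
    ("human-difficult", False),
    ("human-difficult", False),
]

rates = solve_rates(example_results)
for category, rate in rates.items():
    print(f"{category}: {rate:.0%}")
```

The point of splitting by category is that a single overall score would hide exactly the comparison that matters here: how a model does on the questions where expert panels struggle.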

On the related CompBioBench (developed by Genentech and Roche), Claude Opus 4.6 scores 81% overall and 69% on the hardest questions.

Why This Matters

This isn’t another ‘AI beats humans at trivia’ moment. Bioinformatics problems require genuine scientific analysis: interpreting data, forming hypotheses, applying statistical methods. If Claude can keep up with or outperform experts here, it changes the research landscape.

The models are improving with each generation: in some areas they no longer merely keep pace with experts but pull ahead. For researchers, this suggests Claude is evolving from a tool into something closer to a colleague.

Sources: Anthropic Research - BioMysteryBench