Who actually checks whether an AI model lives up to what the lab promises? On May 29, OpenAI published a playbook for trustworthy third-party evaluations: a set of ground rules for how independent teams should probe frontier models for their capabilities and risks. It sounds dry at first, but it’s exactly the kind of homework that’s slowly becoming standard.
What It’s About
OpenAI works with a range of external organizations that bring deep expertise in specific risk areas. The idea: instead of measuring only internally, the lab lets independent labs in with their own methods. The testing is open-ended, and the outside teams arrive at their own assessment.
For GPT-5, OpenAI had already coordinated a broad set of external evaluations covering long-horizon autonomy, deception and oversight subversion, wet-lab planning feasibility, and offensive cybersecurity. The new playbook distills the lessons from those evaluations into reusable ground rules.
Embedded in the Frontier Governance Framework
The playbook doesn’t stand alone: it complements OpenAI’s Frontier Governance Framework. That framework covers risk assessment and mitigation across several areas (cyber offense, CBRN risks, harmful manipulation, loss of control), plus model reporting, security risk management, incident response, and external expert input.
In short: OpenAI is trying to make its own safety promise more verifiable. Not “trust us,” but “here’s how you can check.”
Why It Matters
I think these governance topics get underrated, because they don’t shine like a new model does. But this is exactly where it’s decided whether you can trust the whole system. If independent evaluations become the norm, and ideally comparable across labs, we as users get a much better basis for judging what the labs claim.
The parallel is interesting: Anthropic has recently emphasized independent testing and shared standards in much the same way, especially around cybersecurity and agents. Two labs that compete head to head are moving in the same direction on transparency. That’s a good development, because in the end everyone benefits when labs don’t just show off their own benchmarks and third parties get to measure with their own yardstick.
Sources: OpenAI: A shared playbook for trustworthy third party evaluations, OpenAI: Frontier Governance Framework