Ornith: Open-Source Models That Build Their Own Scaffolding

Most coding models get an instruction and dive in. Ornith does it differently: it builds a scaffold first. The new open-source family from the DeepReinforce research collective takes your task, turns it into what it calls a scaffold — a learnable object — and then lets your harness build an agent from it to do the actual work.

What the scaffold does

Instead of executing the instruction directly, Ornith uses the scaffold to design the architecture for the job: reasoning sequences, memory organization, debugging strategy, tool invocation order, execution planning. The harness interprets that scaffold and generates the agent. When the job is done, the scaffold is deleted. When a new task comes up, Ornith builds a fresh one.
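To make that lifecycle concrete, here is a minimal Python sketch of the build, interpret, run, discard loop. Everything in it is hypothetical: the Scaffold fields simply mirror the list above, and build_scaffold, make_agent and run_task are stand-in names for illustration, not Ornith's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Scaffold:
    """Hypothetical per-task plan; the fields mirror the list in the text."""
    task: str
    reasoning_steps: List[str]
    memory_layout: str
    debugging_strategy: str
    tool_order: List[str]


def build_scaffold(task: str) -> Scaffold:
    # Stand-in for the model call; a real model would generate these values.
    return Scaffold(
        task=task,
        reasoning_steps=["read the task", "plan changes", "apply changes", "verify"],
        memory_layout="one note per file touched",
        debugging_strategy="rerun tests after each change",
        tool_order=["search", "edit", "test"],
    )


def make_agent(scaffold: Scaffold) -> Callable[[], List[str]]:
    # Stand-in for the harness: turns the scaffold into something runnable.
    def agent() -> List[str]:
        log = [f"task: {scaffold.task}"]
        log += [f"tool: {name}" for name in scaffold.tool_order]
        return log

    return agent


def run_task(task: str) -> List[str]:
    scaffold = build_scaffold(task)  # 1. the model designs a scaffold for this task
    agent = make_agent(scaffold)     # 2. the harness builds an agent from it
    result = agent()                 # 3. the agent does the actual work
    # 4. nothing is kept: scaffold and agent go out of scope on return,
    #    so the next task starts from a freshly built scaffold
    return result
```

Calling run_task("fix flaky test") returns a small log of the task and the tools in scaffold order. The point is only the shape: one scaffold per task, never reused.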

The clever bit: model and scaffold are optimized jointly. That’s meant to stop the model from losing the plot on long, complex jobs — the classic failure mode once a task gets too big. The scaffold is built from rules the model learned during training through deep reinforcement learning.

Four variants for four situations

Ornith comes in four sizes, all built on the open-source Gemma 4 and Qwen 3.5 models:

9B Dense: runs on a laptop, good for small scripts and single-file cleanups.
31B Dense: needs a workstation with as much as 48 GB of VRAM, but can take in whole multi-file repositories.
35B MoE: suited to continuous integration patching and code review in the cloud.
397B MoE: the flagship, which the team pitches as an Opus 4.7 competitor. Needs a GPU cluster.
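If you want that mapping in a script, a lookup table is enough. This is a sketch only: the sizes and situations come from the list above, while the dictionary keys and the pick_variant name are made up for illustration and are not official model identifiers.

```python
# Sizes and situations are taken from the list above; the keys and the
# function name are hypothetical labels chosen for this sketch.
VARIANTS = {
    "laptop": "9B Dense",
    "workstation": "31B Dense",
    "cloud_ci": "35B MoE",
    "gpu_cluster": "397B MoE",
}


def pick_variant(situation: str) -> str:
    """Return the Ornith size the article associates with a situation."""
    try:
        return VARIANTS[situation]
    except KeyError:
        options = ", ".join(sorted(VARIANTS))
        raise ValueError(
            f"unknown situation {situation!r}; expected one of: {options}"
        ) from None
```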

In the team’s own tests, Ornith-1.0-397B beat Claude Opus 4.7 on Terminal-Bench 2.1 — a benchmark for LLMs in terminal environments — scoring 77.5 to 70.3.

My take

Two things stand out here. First, the scaffold concept cleanly separates what should be built from the strategy for how to build it, and makes that strategy explicitly optimizable. That’s a different way of thinking from the usual ‘one model that does everything’.

Second, an open model beating a current Anthropic flagship on a benchmark is a signal. Benchmarks aren’t the whole story: a single score tells you little about day-to-day work, and this one comes from the team’s own tests. But the speed at which open source is catching up remains one of the most interesting developments of the year. All four variants are on Hugging Face if you want to try them.

Sources: DevOps.com: Ornith Models Automate Agentic Coding With Self-Scaffolding · DeepReinforce: Ornith 1.0 · Ornith on Hugging Face