Most evaluations of coding models treat code generation as a single-shot problem: give the model a description, get back a function. That's useful, but it's not how real software engineering works. Real engineering is iterative, contextual, and messy.
The gap between benchmarks and practice
Single-function generation benchmarks measure something real, but they systematically undervalue the skills that matter most in practice: understanding a large existing codebase, running tests and interpreting failures, proposing a fix, verifying it, and iterating.
A model that scores well on function-level benchmarks may fail outright when dropped into a 50,000-line repository and asked to fix a real GitHub issue. That gap is what we're working to close.
Agentic trajectories as training data
To train Nico 2.5 for agentic scenarios, we needed training data that looked like agentic work: sequences of observations, actions, and outcomes spread across multiple turns, with tool use, file navigation, and test execution woven in.
We generated and collected these trajectories from multiple sources, including existing open datasets of agent-based software engineering sessions and synthetically generated debugging and development trajectories built from our own code corpus. Each trajectory represents the full arc of solving a real or realistic engineering problem.
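To make the shape of this data concrete, here is a minimal sketch of how one trajectory could be represented. The class names, field names, and the sample task are hypothetical and chosen for illustration; they are not the schema used in training.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One turn of agentic work (hypothetical schema)."""
    observation: str  # what the agent saw, e.g. a file listing or test output
    action: str       # what the agent did, e.g. a tool call
    outcome: str      # what happened as a result


@dataclass
class Trajectory:
    """A multi-turn attempt at one engineering task (hypothetical schema)."""
    task: str
    steps: List[Step] = field(default_factory=list)

    def add(self, observation: str, action: str, outcome: str) -> None:
        """Append one observation/action/outcome turn."""
        self.steps.append(Step(observation, action, outcome))

    def actions(self) -> List[str]:
        """Return the actions in the order they were taken."""
        return [step.action for step in self.steps]


# An invented three-turn session: search, run tests, patch.
trajectory = Trajectory(task="Fix off-by-one error in pagination")
trajectory.add("issue text", "search_code('paginate')", "found utils/paging.py")
trajectory.add("utils/paging.py contents", "run_tests()", "1 test failed")
trajectory.add("failing test output", "apply_patch('paging.diff')", "all tests passed")
```

Storing every intermediate turn, including the failed test run, keeps the process in the record and not just the final patch.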
What the model needs to learn
Agentic capability requires a different kind of generalization than single-shot generation. The model must learn when to search for more context before proposing a change. It must learn to treat a failing test as a signal rather than a problem to route around. It must learn to recognize when its previous action made things worse, not better, and course-correct.
These are metacognitive skills — skills about the process of solving problems, not just the solutions themselves. Training them requires data that captures the process, not just the outcome.
Evaluation on real-world benchmarks
We evaluate agentic performance on SWE-bench, a benchmark constructed from real GitHub issues submitted to major open-source repositories. The model must resolve the issue — not just describe a fix, but produce a patch that passes the associated tests.
SWE-bench is hard. It requires exactly the capabilities we described above: navigating an unfamiliar codebase, understanding the context of a bug, and producing a diff that actually works. We believe it's the most honest publicly available measure of practical coding model capability.
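The pass/fail criterion can be sketched as follows. The parameter names `fail_to_pass` and `pass_to_pass` mirror the test lists published with SWE-bench instances; the functions themselves are a simplified, hypothetical illustration and not the benchmark's actual harness.

```python
from typing import Dict, List


def is_resolved(results: Dict[str, bool],
                fail_to_pass: List[str],
                pass_to_pass: List[str]) -> bool:
    """Return True if the patch fixes the target tests without regressions.

    `results` maps a test name to whether it passed after the patch was
    applied. A test missing from `results` is treated as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(results.get(name, False) for name in required)


def resolve_rate(outcomes: List[bool]) -> float:
    """Fraction of issues resolved; 0.0 for an empty list."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

Under this criterion, a patch that fixes the reported bug but breaks a previously passing test does not count as a resolution.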
The agentic scaffolding
The model doesn't operate alone in an agentic context — it operates within a scaffold that gives it access to tools: file reading, code search, directory listing, test execution, and patch application. The scaffold also manages the context window carefully, summarizing and pruning when necessary.
We designed the scaffold to be minimal: the model should do the reasoning, and the scaffold should only provide the interface. Our evaluation methodology reflects the same design choice.
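As a rough illustration of what "minimal" could mean here, the sketch below dispatches tool calls and trims the context, and does nothing else. Every name in it (`Scaffold`, `call`, `max_messages`, the sample tools) is an assumption made for this example, and the pruning rule is a simple stand-in: it drops old entries, whereas the scaffold described above also summarizes.

```python
from typing import Callable, Dict, List


class Scaffold:
    """Hypothetical minimal scaffold: exposes tools and trims context."""

    def __init__(self, task: str,
                 tools: Dict[str, Callable[[str], str]],
                 max_messages: int = 6) -> None:
        if max_messages < 2:
            raise ValueError("max_messages must be at least 2")
        self.tools = tools
        self.max_messages = max_messages
        self.context: List[str] = [task]  # the task always stays first

    def call(self, tool: str, argument: str) -> str:
        """Run a tool chosen by the model and record the result."""
        if tool not in self.tools:
            result = f"error: unknown tool {tool!r}"
        else:
            result = self.tools[tool](argument)
        self.context.append(f"{tool}({argument}) -> {result}")
        self._prune()
        return result

    def _prune(self) -> None:
        """Keep the task plus the most recent entries within the budget."""
        if len(self.context) > self.max_messages:
            recent = self.context[-(self.max_messages - 1):]
            self.context = [self.context[0]] + recent


# Stub tools for illustration; real ones would touch the repository.
tools = {
    "list_dir": lambda path: "a.py b.py",
    "read_file": lambda path: f"<contents of {path}>",
}
scaffold = Scaffold("Fix the reported issue", tools, max_messages=3)
scaffold.call("list_dir", ".")
scaffold.call("read_file", "a.py")
scaffold.call("read_file", "b.py")
```

Nothing in the class decides which tool to call or in what order; that choice stays with the model.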
