Building Training Data at Scale

The quality of a language model depends enormously on the quality of its training data. For Nico 2.5, we spent more time on data than on any other part of the project. Here's what we learned.

Starting from real software

We began by collecting a large corpus of open-source repositories — only those with permissive licenses, and only those with sufficient activity to indicate real-world use. Our filters weren't just about stars or forks; we cared about commit history, test coverage, and the presence of genuine issue-tracking activity.

The goal was a dataset that looked like a real engineering team's output, not a graveyard of one-day projects and tutorial repos.
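
As an illustration only, a repository filter along these lines might look like the sketch below. The `RepoStats` fields, the license allow-list, and the thresholds are all assumptions made for the example, not values from the actual pipeline.

```python
from dataclasses import dataclass

# Hypothetical allow-list of SPDX identifiers; the real list is not given in this post.
PERMISSIVE_LICENSES = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC"}


@dataclass
class RepoStats:
    """Assumed per-repository metadata gathered before filtering."""
    license_id: str          # SPDX license identifier
    commit_count: int        # proxy for commit history depth
    has_tests: bool          # crude proxy for test coverage
    closed_issue_count: int  # proxy for genuine issue-tracking activity


def keep_repo(repo: RepoStats, min_commits: int = 200, min_closed_issues: int = 20) -> bool:
    """Return True when a repo passes the license and activity filters.

    The default thresholds are placeholders chosen for illustration.
    """
    if repo.license_id not in PERMISSIVE_LICENSES:
        return False
    return (
        repo.commit_count >= min_commits
        and repo.has_tests
        and repo.closed_issue_count >= min_closed_issues
    )
```

Note that stars and forks do not appear anywhere in the predicate: a popular but inactive repo fails, while a modestly starred project with a long commit history passes.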

Supervised pairs, not raw code

Raw code isn't a training signal — it's a corpus. What we wanted were instruction-response pairs: a description of a task on one side, and the correct code to accomplish it on the other.

We extracted these pairs from commit history. A commit message describes what changed; the diff shows how. An issue title and description define the problem; the pull request that closes it provides the solution. At scale, this turns millions of development decisions into a structured supervised dataset.
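
The commit half of that idea can be sketched in a few lines. This is a simplified illustration: the `Commit` and `Pair` types are hypothetical, and a real extractor would also need to link issues to the pull requests that close them.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Commit:
    """Hypothetical minimal commit record."""
    message: str
    diff: str


@dataclass
class Pair:
    instruction: str  # what should change
    response: str     # how it was changed


def pairs_from_commits(commits: Iterable[Commit]) -> Iterator[Pair]:
    """Yield one instruction-response pair per usable commit.

    Commits with an empty message or an empty diff are skipped, since
    they carry no usable instruction or no usable response.
    """
    for commit in commits:
        instruction = commit.message.strip()
        if instruction and commit.diff.strip():
            yield Pair(instruction=instruction, response=commit.diff)
```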

We generated over 12 million such pairs, spanning bug fixes, new feature implementations, refactors, UI components, code reviews, and debugging sessions.

The quality problem

Not all commits are created equal. A commit that says “fix stuff” and shuffles four unrelated files is worthless as a training example. A commit that says “fix off-by-one error in pagination when results are empty” and changes three lines of clearly related code is exactly what we want.

We built a quality scoring system that evaluated pairs across several dimensions: the specificity of the instruction, the coherence of the diff, the presence of associated tests, and signals from issue labels and author history. This let us rank 12 million pairs and select a much smaller, denser set for training.
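
To make the idea concrete, here is a toy scorer covering three of those dimensions. The heuristics, the vague-word list, and the weights are assumptions invented for this sketch. It leaves out issue labels and author history entirely, and it should not be read as the scoring system itself.

```python
from dataclasses import dataclass

# Hypothetical list of words that add little information to a commit message.
VAGUE_WORDS = {"fix", "stuff", "update", "wip", "misc", "changes"}


@dataclass
class Candidate:
    message: str
    changed_files: list[str]
    touches_tests: bool


def specificity(message: str) -> float:
    """Score in [0, 1]: more informative words means a more specific instruction."""
    words = message.lower().split()
    informative = [w for w in words if w not in VAGUE_WORDS]
    return min(len(informative) / 8.0, 1.0)


def coherence(changed_files: list[str]) -> float:
    """Score in [0, 1]: changes confined to one directory score highest."""
    if not changed_files:
        return 0.0
    dirs = {path.rsplit("/", 1)[0] if "/" in path else "" for path in changed_files}
    return 1.0 / len(dirs)


def quality_score(c: Candidate) -> float:
    """Weighted sum of the three signals; the weights are illustrative only."""
    return (
        0.4 * specificity(c.message)
        + 0.4 * coherence(c.changed_files)
        + 0.2 * (1.0 if c.touches_tests else 0.0)
    )
```

Under these toy heuristics, the "fix stuff" commit spread across four unrelated directories scores far below the pagination fix confined to a single directory with a test attached.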

Category balance matters

One early mistake was letting the dataset be dominated by whatever was easiest to extract. Bug fixes from large projects are abundant and valuable, but a model trained only on bug fixes will underperform on greenfield code generation and UI work.

We set explicit targets for each category and then upsampled and synthesized data where the natural distribution fell short. Frontend component generation required the most synthetic augmentation — real-world UI code in open-source repos tends to be project-specific and hard to generalize from.
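
The bookkeeping behind per-category targets is simple arithmetic. The sketch below, with made-up category names and shares, computes how many extra examples each category would need in order to hit its target share of a fixed-size training set.

```python
def shortfall(counts: dict[str, int], target_share: dict[str, float], total: int) -> dict[str, int]:
    """Return how many examples each category still needs.

    `counts` holds the naturally extracted examples per category,
    `target_share` the desired fraction of the final dataset, and
    `total` the intended dataset size. Categories already at or above
    target need zero extra examples.
    """
    return {
        category: max(0, round(share * total) - counts.get(category, 0))
        for category, share in target_share.items()
    }


# Hypothetical numbers, for illustration only.
natural = {"bug_fix": 700, "feature": 250, "ui": 50}
targets = {"bug_fix": 0.4, "feature": 0.3, "ui": 0.3}
needed = shortfall(natural, targets, total=1000)
```

In this made-up example the UI category has by far the largest gap, which mirrors where upsampling and synthesis had to do the most work.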

What we'd do differently

The most painful lessons came from underestimating the cost of pipeline bugs at scale. A subtle error in how data was accumulated across batch processing runs led to significant duplication early on — we caught it, but it added weeks to the timeline.

If we were starting over, we'd instrument the pipeline more aggressively from the start: checksums on every intermediate artifact, per-batch deduplication, and more thorough spot-checking of samples at each stage.
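
A minimal sketch of the first two safeguards, using only the Python standard library, is shown below. The record shape and function names are assumptions for illustration. The key point is that the `seen` set persists across batches, so records that accumulate again in a later run are dropped.

```python
import hashlib
import json


def record_key(record: dict) -> str:
    """Stable content hash of one record, independent of key order."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def artifact_checksum(records: list[dict]) -> str:
    """SHA-256 over an intermediate artifact, for comparison between runs."""
    digest = hashlib.sha256()
    for record in records:
        digest.update(record_key(record).encode("ascii"))
    return digest.hexdigest()


def dedupe_batch(batch: list[dict], seen: set[str]) -> list[dict]:
    """Drop records already seen in this batch or in any earlier batch.

    `seen` is mutated in place and should be shared across all batches.
    """
    kept = []
    for record in batch:
        key = record_key(record)
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept
```

Recording `artifact_checksum` for each stage's output makes silent changes between runs visible. A checksum that differs when the inputs have not changed is a cue to inspect that stage before training on its output.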