Real-world code is the best possible foundation for training a coding model. It reflects the patterns, idioms, and edge cases that actually appear in production software. But it has systematic gaps — and closing those gaps requires synthesis.
What natural data misses
Open-source repositories contain an enormous amount of code, but the distribution of that code is highly skewed. Popular frameworks appear thousands of times. Obscure but important patterns appear rarely or not at all. Subtle bugs, the kind that are hardest to catch and most valuable to recognize, are rare almost by construction in code that has been merged and shipped, because review and testing exist to filter them out.
More fundamentally, natural data skews toward what works. Production code is mostly correct code, so a model that has only seen correct code will be poorly calibrated on incorrect code, which is exactly what it encounters when debugging or reviewing code.
Systematic bug generation
One of the most valuable synthetic data techniques we use is systematic bug injection: taking known-correct code and programmatically introducing realistic bugs, then training the model to identify and fix them.
The key word is “realistic.” A synthetic bug that looks nothing like anything a real engineer would write is not useful training data — the model learns to recognize the synthetic pattern, not the real one. We draw our mutation patterns from empirical analysis of real bugs in our training corpus, so the synthetic examples reflect the distribution of real errors.
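As a rough illustration of what programmatic bug injection can look like, the sketch below uses Python's standard `ast` module to turn a correct `<` comparison into `<=`, a classic off-by-one error. The class name `ComparisonFlip`, the helper `inject_bug`, and the choice of this particular mutation are assumptions made for the example, not a description of the actual pipeline. A realistic system would presumably hold many mutation types and sample among them according to how often each appears in real bugs.

```python
import ast


class ComparisonFlip(ast.NodeTransformer):
    """Hypothetical mutation: rewrite one `<` comparison as `<=` (off-by-one)."""

    def __init__(self):
        self.applied = False

    def visit_Compare(self, node):
        # Visit children first so nested comparisons are considered too.
        self.generic_visit(node)
        is_simple_lt = len(node.ops) == 1 and isinstance(node.ops[0], ast.Lt)
        if not self.applied and is_simple_lt:
            node.ops = [ast.LtE()]
            self.applied = True  # inject exactly one bug per example
        return node


def inject_bug(source: str):
    """Return a buggy/fixed training pair, or None if the mutation does not apply."""
    tree = ast.parse(source)
    mutator = ComparisonFlip()
    mutated = mutator.visit(tree)
    if not mutator.applied:
        return None
    # ast.unparse requires Python 3.9 or newer.
    return {"buggy": ast.unparse(mutated), "fixed": source}


CORRECT = '''
def first_n(items, n):
    out = []
    i = 0
    while i < n:
        out.append(items[i])
        i += 1
    return out
'''

pair = inject_bug(CORRECT)
print(pair["buggy"])
```

The mutated loop now reads one element past `n`, which is the kind of small, plausible error a reviewer might miss. Keeping the original source as the `fixed` side gives the model both the bug and its repair.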
Filling category gaps
Not all coding tasks appear equally in open-source code. Frontend component generation, in particular, is underrepresented in natural data — real-world UI code is often project-specific, tightly coupled to design systems that aren't public, and hard to generalize from.
We generate synthetic frontend examples from scratch: we take a component description, generate a clean implementation of it, and use the description and implementation together as a training example. This gives us the coverage we need without the noise of project-specific context.
Quality control for synthetic data
The risk with synthetic data is obvious: garbage in, garbage out. Synthetic examples that are inconsistent, low quality, or subtly wrong can degrade model performance in hard-to-diagnose ways.
We apply the same quality scoring to synthetic examples as to natural ones — evaluating the coherence of the instruction, the correctness of the response, and the usefulness of the pairing for training. Examples that don't meet the bar are discarded rather than included with lower weight.
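The discard-rather-than-downweight policy can be sketched as a hard filter over scored examples. Everything specific here is an assumption for illustration: the `Scores` fields simply mirror the three criteria named above, the 0 to 1 scale and the `bar` value of 0.7 are invented, and requiring every dimension to clear the bar is one possible reading of "meeting the bar".

```python
from dataclasses import dataclass


@dataclass
class Scores:
    """Hypothetical per-example scores, each assumed to lie in [0, 1]."""
    instruction_coherence: float
    response_correctness: float
    pairing_usefulness: float


def passes_bar(scores: Scores, bar: float = 0.7) -> bool:
    # An example passes only if its weakest dimension clears the bar.
    weakest = min(
        scores.instruction_coherence,
        scores.response_correctness,
        scores.pairing_usefulness,
    )
    return weakest >= bar


def filter_examples(scored, bar: float = 0.7):
    """Keep examples that pass; failing examples are dropped, not down-weighted."""
    return [example for example, scores in scored if passes_bar(scores, bar)]


scored = [
    ("example_a", Scores(0.9, 0.8, 0.75)),
    ("example_b", Scores(0.95, 0.4, 0.9)),  # coherent but incorrect response
]
kept = filter_examples(scored)
print(kept)
```

Using the minimum rather than an average means a fluent but incorrect example cannot be rescued by high scores elsewhere, which matches the concern about subtly wrong examples.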
The right balance
Our final training mix combines natural and synthetic data in proportions that reflect both the relative quality of each and the gaps we're trying to close. Natural data dominates for high-frequency tasks where real examples are plentiful and high quality. Synthetic data fills in for underrepresented tasks and for capability types — like debugging and error analysis — where natural data is structurally sparse.
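One simple way to picture this is a fill-the-gap policy: for each task, use natural examples up to a target count and let synthetic data cover the shortfall. The sketch below is only an illustration of that idea. The function `plan_mix`, the task names, and all the counts are hypothetical, and a real mix would also have to account for relative data quality.

```python
def plan_mix(natural_counts, target_per_task):
    """Plan per-task example counts: natural data first, synthetic for the rest."""
    plan = {}
    for task, target in target_per_task.items():
        natural = min(natural_counts.get(task, 0), target)
        plan[task] = {"natural": natural, "synthetic": target - natural}
    return plan


# Hypothetical numbers: plentiful natural data for one task, sparse for another.
natural_counts = {"crud_endpoints": 900, "debugging": 50}
target_per_task = {"crud_endpoints": 500, "debugging": 400}

mix = plan_mix(natural_counts, target_per_task)
print(mix)
```

With these made-up counts, the high-frequency task is served entirely by natural data, while most of the debugging examples are synthetic.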
Getting this balance right is more art than science today, and it's an area where we expect significant progress across the field over the next few years.
