Synthetic pretraining

In 2026, it might finally be time to talk about the data: LLM training is switching to synthetic.

Generated data is not a new training source and was already commonplace in post-training to make up for the lack of instruction and conversation samples. Yet, the fast scaling of synthetic sources toward trillions of tokens alters many fundamental concepts, practices and norms of model training. Suddenly, data is compute (and a significant part of the compute budget), data is plastic, data is the model to come — and models make data. Suddenly, there is no longer a clear-cut distinction between pre-/mid-/post-training, as a new emerging category of synthetic pretraining seems poised to absorb everything.

This is a complex process: "synthetic" now encompasses a wide range of methods that are more or less remote from organic training data. Synthetic adoption also varies in scope, from very targeted inclusion of reasoning-rich data to elicit specific capabilities, to full-synthetic training. Major datasets are now readily available, including Nemotron-CC and our own SYNTH, and are inspiring a wider range of research in the open. And yet academic discourse remains constantly lagging: model collapse is only partly getting reassessed, on the basis of dated and rudimentary data pipelines, largely disconnected from current practices.

At this point, it becomes necessary to have a better vision of the synthetic turn: why it happened, where it is leading us, and to what extent it brings up more fundamental questions that should have been raised sooner: what is LLM data? And what should it be?

What is synthetic pretraining?

Synthetic pre-training stems from a straightforward question: should synthetic data remain a late adaptation, or be reinjected early on in training to continuously elicit fundamental capacities? This is a simple change (after all, we're just moving data up in the training curriculum), but a very consequential one: mid-training and post-training were always cut off from decisions over architecture design. The base model was already trained and most of the compute had already been spent. At best, weights could be adjusted with a few merges or tokenizer extensions, but nothing dramatic.

In contrast, synthetic pre-training is a simultaneous space of data and model innovation. With appropriate data, many advanced reasoning capacities can be elicited and assessed very early in training, thus lowering ablation costs: "synthetic tasks eliminate the noise, randomness, and data contamination of real-world datasets, enabling clean, controlled, apples-to-apples architectural comparisons". Even more crucially, data can be iteratively reshaped to better fit a given architecture; conversely, data findings can advocate for changes of model internals. Synthetic pretraining is an invitation to start thinking in an integrated model/data space, as some of the major challenges of LLM research clearly require integrative solutions: enhancing layer connections and composability, extending the exploitable context length, strengthening attention paths…
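
To make that concrete, here is a minimal sketch, with entirely invented names rather than any lab's actual harness, of a synthetic key-value recall task whose difficulty is fully controlled, so that two candidate architectures can be compared on identical, contamination-free data:

```python
# Minimal sketch: a synthetic key-value recall task with explicit difficulty
# knobs (number of pairs, vocabulary size), usable for apples-to-apples
# architecture ablations. All names are illustrative.
import random

def make_recall_sample(n_pairs: int, vocab_size: int = 1000, seed: int | None = None):
    """Build one sample: key->value pairs followed by a query key.

    The target is the value associated with the queried key; there is no
    noise or contamination from web data.
    """
    rng = random.Random(seed)
    keys = rng.sample(range(vocab_size), n_pairs)
    values = [rng.randrange(vocab_size) for _ in keys]
    query = rng.choice(keys)
    target = values[keys.index(query)]
    prompt = " ".join(f"K{k}=V{v}" for k, v in zip(keys, values)) + f" ? K{query}"
    return prompt, f"V{target}"

if __name__ == "__main__":
    for i in range(3):
        prompt, answer = make_recall_sample(n_pairs=4, seed=i)
        print(prompt, "->", answer)
```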

Synthetic pre-training assumes that human data is imperfect and not an end goal in itself, but something that can be deconstructed and reconstructed. It is inherently associated with a structuralist hypothesis.

Deduplication is one axis along which we already deconstruct data, but we need many more.

Where does it come from?

Full synthetic training is not a new idea: it was already briefly attempted in 2023-2024, and it's now instructive to revisit what worked and what failed.

Phi 1.5 from Microsoft was the first model of the series to be exclusively trained on synthetic data. A few months later, Cosmo-1B provided an initial reproduction in the open. A major motivation was increasing data and parameter efficiency: "In a nutshell we build phi-1.5, a 1.3 billion parameter model trained on a dataset of 30 billion tokens, which achieves common sense reasoning benchmark results comparable to models ten times its size that were trained on datasets more than ten times larger." And both research teams quickly backtracked: the next version of Phi was built on the now familiar mix of quality pretraining data and synthetic augmentation.

Unusually, Phi started as an ambitious program of data research, stemming from TinyStories: "we explore the improvement that can be obtained along a different axis: the quality of the data". It was an agglutinative process: rather than removing less performant data from the webcrawl, the generative pipelines expanded in scope. Now I feel they underestimated the full extent of this undertaking: synthetic pretraining requires reconstructing an entire training environment from scratch. It's tempting at first to focus on standardized benchmarks and use them as a continuous objective target, but then you miss out on multiple capabilities that are not measured yet tacitly expected by future model users. This is the core reason why unstructured web pretraining remained so prevalent until now: it's an effective way to offload data research, ensuring that models will at least cover, more or less well, a range of generative behaviors.

Furthermore, to make it happen, you have to allocate significant time and effort to a research axis historically neglected in LLM development. Even today, large ML conferences remain outright hostile to data submissions (and consequently, these are astonishingly rare). The complexity of data work also almost naturally requires data sharing, which until recently was out of reach: capable models (including the OpenAI ones used by Microsoft) prevented data reuse; most data seeds were of unclear provenance, usually large webcrawls, and while fair use might cover training, it does not extend to reusability. Consequently, synthetic pipelines remained very crude, limited to straight generation with a handful of hardcoded prompts.

And yet, this changed over the past year. As I anticipated, a driving factor has been the development of mid-training, which has grown to be the one space of data experimentation in LLM research. In contrast with the initial experiments over full synthetic training, this is a relatively smooth process where labs suddenly find themselves tricked into caring about the data simply by unrolling more and more data issues. With mid-training growing to encompass an ever more sizable part of the compute budget, it's now becoming an open question whether training should be exclusively focused on synthetic data.

In the later stages of the Physics of Language Models series, Zeyuan Allen-Zhu convincingly demonstrates that "all model architecture fails at simplest 2-hop reasoning", even at 8B scale over 1 trillion tokens, simply because the basic syllogistic relationship (X is born in the same year as Y) is never properly learned from unstructured web-based data. Scaling might be a very costly way around this fundamental data limitation, as extremely sparse weights seem naturally prone to capture very sparse patterns and rules. But this is still ineffective and, ultimately, risky — maybe many of the rumored "pretraining accidents" just stem from it: data and learning design badly adjusted to the capacities desired in the first place. Instead, hop-reasoning and other logical constructs should simply be learned throughout training.
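
A toy sketch of what making the 2-hop relation explicit can look like (names and templates are invented here; this is not Allen-Zhu's actual setup): atomic birth-year facts paired with questions that can only be answered by composing two of them.

```python
# Toy generator: birth-year facts plus a question that requires composing
# two facts. Entity names and templates are purely illustrative.
import random

PEOPLE = ["Alice Moreau", "Bob Tanaka", "Carla Diaz", "Deng Wei", "Elif Kaya"]

def make_two_hop_sample(rng: random.Random) -> str:
    a, b = rng.sample(PEOPLE, 2)
    year_a = rng.randint(1900, 2000)
    # Force roughly half of the pairs to share a birth year, so the relation
    # is actually observable and learnable.
    year_b = year_a if rng.random() < 0.5 else rng.randint(1900, 2000)
    facts = f"{a} was born in {year_a}. {b} was born in {year_b}."
    question = f"Was {a} born in the same year as {b}?"
    answer = "Yes." if year_a == year_b else "No."
    return f"{facts}\n{question}\n{answer}"

rng = random.Random(0)
print(make_two_hop_sample(rng))
```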

It might appear that some tasks are easily assimilable and some aren't (typically, anything involving spatial patterns will be challenging for auto-regressive models). This might either be fixed by more intensive data pipelines or prompt radical changes of architecture. Data generation becomes a neutralized scientific basis to build ideal playgrounds, oriented toward the acquisition and measurement of specific skills: "synthetic tasks eliminate the noise, randomness, and data contamination of real-world datasets, enabling clean, controlled, apples-to-apples architectural comparisons".

In short, to really move in the direction of synthetic pre-training, you have to start thinking in an integrated model/data space.

"A key advancement in the pre-training data of Kimi K2 over Kimi K1.5 is the introduction of a synthetic data generation strategy to increase token utility"

If language models are built on top of reasoning primitives, it follows that 1° reasoning always has to be trained; 2° existing datasets are sub-optimal, either lossy or outright defective; 3° any pre-training not involving a significant amount of data design and curation is just offloading reasoning discovery to something else.

In retrospect, it seems to me, the synthetic turn could only happen once data was somehow acknowledged as a key contributor to model learning capabilities. We took this turn very early at Pleias, almost as an accidental consequence of re-building a pre-training environment in the open and inherently focusing on data as a research axis. We might infer that frontier labs were initially drawn in by the growing disconnect between web-based (or book-based) training data and the many more specialized settings that could not be properly anticipated. After all, the mid-training division of OpenAI was originally tasked with model adaptations in specific domains like law.

What follows after this unlock is a slow build-up, as multiple key features of the pre-training data environment are re-engineered and, sometimes, dramatically improved in the process. We'll revisit this process in three interconnected parts: memorization, logical compilation and simulations.

Synthetic compilation 1: memory

In its original formulation, mid-training was about data quality filtering and, soon enough, formatting. From then on, it was relatively easy to take the jump to data rewriting or rephrasing. The original motivation might have been about data legibility: after all, many pretraining samples are deteriorated, especially due to digitization or web scraping artifacts. This was the starting point for REWIRE (from Meta): "The central hypothesis of our REWIRE framework is that web documents contain diverse content and knowledge, but the writing structure can make them not coherent or elaborate enough to serve as informative pre-training examples". Yet further research showed the benefit extended beyond cleaner data: models memorize through varied repetition.

The archetypal experiment here is Learning Facts at Scale with Active Reading (also from Meta), showing saturation of the hardest memorization benchmarks (SimpleQA) by an 8B model trained on a "diverse set of learning strategies, given a document we want to study" brought by the synthetic generation pipelines. The active reading dataset is not really a workable training environment, so further confirmation of viability and scalability really came from BeyondWeb and our own SYNTH, where we managed to build a self-sufficient pretraining environment out of a small yet high-quality, diverse sample of 56,000 Wikipedia articles.
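
For intuition, here is a heavily simplified sketch of this kind of expansion step; the strategy prompts are illustrative and `generate` stands in for whatever instruct model drives the pipeline, not the actual Active Reading or SYNTH code:

```python
# Sketch of an "active reading"-style expansion: one source document becomes
# several synthetic samples, one per learning strategy (varied repetition of
# the same facts). Prompts and names are illustrative.
from typing import Callable

STRATEGIES = {
    "paraphrase": "Rewrite the following article in a different style, keeping every fact:\n\n{doc}",
    "qa": "Write five question-answer pairs covering the key facts of this article:\n\n{doc}",
    "summary": "Summarize this article in three sentences:\n\n{doc}",
    "timeline": "Extract a dated timeline of events from this article:\n\n{doc}",
}

def expand_document(doc: str, generate: Callable[[str], str]) -> list[dict]:
    """Turn one source document into several synthetic training samples."""
    samples = []
    for name, template in STRATEGIES.items():
        completion = generate(template.format(doc=doc))
        samples.append({"strategy": name, "source": doc, "text": completion})
    return samples

if __name__ == "__main__":
    # Dummy generator; swap in a real model call.
    dummy = lambda prompt: f"<generated for: {prompt[:40]}...>"
    for s in expand_document("Ada Lovelace published the first algorithm in 1843.", dummy):
        print(s["strategy"], "->", s["text"])
```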

The really fundamental take at this point is that synthetic pipelines solve LLM memorization, as models can now selectively retain knowledge and information that matters for deployment. Realistically, this is mostly a rediscovery of internal research from frontier labs, which have all significantly stepped up their data pipelines. It's either confirmed, tacitly suggested or rumored that the current generation of models has been trained on a significant amount of synthetic data.

At this point, I would expect to see an increasing abstraction of knowledge and memory building. Until now, all the reference pipelines rely on texts commonly available online. Yet, to make a sizable impact on GDP, models will also need to perform on a wide range of unavailable inputs, for a variety of reasons: texts inaccessible due to privacy or security concerns, knowledge exclusively available in structured data or, in many cases, never verbalized and part of the internal culture of an organization. In many of these cases, we do have the recipes to produce data, in the form of guidelines, policies, or advice from people in the field. But actually creating it requires much more abstract pipelines — you have to imagine a SYNTH-like dataset no longer derived from Wikipedia but from Wikidata, structured input from interconnected knowledge graphs. The closest experiments in this direction come from Nemotron-CC, where document classification served to reinforce "correlations between advanced topics that are otherwise rarely observed in web-scale data".
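
A hypothetical sketch of what such an abstraction could look like: verbalizing knowledge-graph claims into seed statements that downstream pipelines can then expand (the templates are invented; the property identifiers follow Wikidata's convention):

```python
# Hypothetical seeding step: turn (property, value) claims about one entity
# into a short seed paragraph. Templates are invented for illustration;
# P569 = date of birth, P106 = occupation, P19 = place of birth on Wikidata.
TEMPLATES = {
    "P569": "{subject} was born on {value}.",
    "P106": "{subject} worked as a {value}.",
    "P19":  "{subject} was born in {value}.",
}

def verbalize(subject: str, claims: list[tuple[str, str]]) -> str:
    """Verbalize supported claims; unsupported properties are simply skipped."""
    sentences = [
        TEMPLATES[prop].format(subject=subject, value=value)
        for prop, value in claims
        if prop in TEMPLATES
    ]
    return " ".join(sentences)

print(verbalize("Ada Lovelace", [("P569", "10 December 1815"),
                                 ("P106", "mathematician"),
                                 ("P19", "London")]))
```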

Synthetic compilation 2: logic and pipelines

Now let's roll back and openly wonder if the current state of model training does not require a more significant reconsideration.

It is surprisingly overlooked, but transformers are efficient compilers: with the right training design, many formal rule systems can be integrated into the weights themselves. This process does not even require verbalization. In the transformer circuits experiments from Anthropic, basic math exercises are resolved even before tokens are generated, as standard operations are readily pre-parsed by attention graphs and pre-computed by the model's internal flows.

Math example from the transformer circuits experiments: a so-called non-reasoning model doing non-reasoning things

I would argue that any transformer model is in effect a "reasoning" model, building on top of multiple reasoning primitives that can be captured more or less efficiently, from more or less noisy data. Eric Michaud recently suggested that all scaling follows a quanta hypothesis: when you zoom in, models do not acquire capacities continuously but through sudden unlocks, as if model internals were suddenly correctly wired after a long search sequence. This whole process is obviously smoothed in the training loss, but also in standard evaluation sets, as they are already too noisy and measure complex composites of tasks.

Illustration of the quanta hypothesis: capacities are not learned linearly but through sudden sigmoids
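
A toy numerical illustration of the smoothing argument, with entirely arbitrary numbers: each quantum is acquired as a sharp sigmoid at its own point in training, yet the aggregate loss over many quanta with power-law weights looks like a smooth curve.

```python
# Toy illustration: many sharp per-quantum sigmoids aggregate into a smooth
# loss curve. Quanta count, weights and unlock times are arbitrary.
import math, random

random.seed(0)
N_QUANTA = 200
# Each quantum: (importance weight, training step at which it gets "unlocked")
quanta = [(1.0 / (k + 1) ** 1.5, random.uniform(0, 10)) for k in range(N_QUANTA)]
total_weight = sum(w for w, _ in quanta)

def loss(step: float) -> float:
    """Weighted share of quanta not yet acquired (each a sharp sigmoid)."""
    unlearned = sum(w / (1 + math.exp(20 * (step - t))) for w, t in quanta)
    return unlearned / total_weight

for step in range(0, 11, 2):
    print(f"step {step:2d}: loss = {loss(step):.3f}")
```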

Michaud does not go as far as suggesting that these quanta are explicitly connected to operationalized reasoning abilities: "we don't have a formal definition of what 'discreteness' means in the learning process". I'd still be tempted to make this jump. We now have multiple projects of logical hardwiring at Pleias in different domains, all requiring the integration of specific reasoning paths (and no, current frontier models don't manage them well). It becomes clear that, similarly to memorization, any formal rule-based system can be learned with high accuracy through systematic exposure to modelled exercises. And also that this process scales. We have multiple examples of state-of-the-art transformer models used to sort out complex inputs in biology, astronomy or physics without any verbalization. Training here is not simply done on raw data: it relies on clever arrangements with selective destruction or masking of the original sources, in effect integrating learning strategies into the data itself.

Adversarial learning strategy for [AION-1](https://arxiv.org/pdf/2510.17960), a foundational model in Astronomy
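
In toy form, and without claiming to reproduce AION-1's actual setup, the core move is to hide parts of a structured record so that reconstruction becomes the training objective:

```python
# Toy sketch of "integrating the learning strategy into the data itself":
# selectively destroy fields of a structured record and keep the hidden
# values as the reconstruction target. Field names are invented.
import random

def mask_fields(record: dict, mask_rate: float, rng: random.Random):
    """Hide a random subset of fields; return (corrupted input, reconstruction target)."""
    corrupted, target = {}, {}
    for key, value in record.items():
        if rng.random() < mask_rate:
            corrupted[key] = "<MASK>"
            target[key] = value          # the model must recover this
        else:
            corrupted[key] = value
    return corrupted, target

rng = random.Random(42)
observation = {"redshift": 0.023, "magnitude_g": 14.2, "class": "spiral galaxy"}
print(mask_fields(observation, mask_rate=0.5, rng=rng))
```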

Now logical hardwiring does not happen that differently in language models: you need to endlessly simulate, transform, reshape, hide, and complexify data. As early as 2020, it was immediately obvious that the only way to achieve this result was through synthetic pipelines. The earliest published LLM math prover, GPT-f, relied on "synthetic datasets allowing us to generate proofs for each of these domains at will while controlling precisely by how many proofs we augment our training sequence", as existing data is already too scarce and does not allow selective engineering of problems. The latest generations of provers did not proceed differently. One of the current state-of-the-art series of models, Seed-Prover, successfully tackles hard geometrical problems through indefinite synthetic replaying. They run a hard search on trees deduced from "geometry problems over more than the past 20 years": "in more than 7 days, the problem generation program found over 230 million unique problems". By comparison, the largest collection of all math problems (not just geometry), NuminaMath, falls little short of one million problems, and that includes many instances of poor or repetitive samples (for Lean auto-formalization, they only kept less than 10% of the original set).
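
The principle of generating problems at will with explicit difficulty knobs can be sketched in a few lines; real prover pipelines operate over formal proof trees, not templates like this one:

```python
# Hedged sketch of "generating problems at will while controlling difficulty":
# a trivial generator for linear equations ax + b = c with integer solutions,
# where the difficulty knob (coefficient range) is explicit.
import random

def make_linear_problem(rng: random.Random, coeff_max: int):
    a = rng.randint(2, coeff_max)
    x = rng.randint(-coeff_max, coeff_max)   # solution chosen first
    b = rng.randint(-coeff_max, coeff_max)
    c = a * x + b
    statement = f"Solve for x: {a}x + ({b}) = {c}"
    return statement, x

rng = random.Random(1)
for difficulty in (5, 50, 500):
    problem, solution = make_linear_problem(rng, coeff_max=difficulty)
    print(problem, "| x =", solution)
```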

Of course, you may wonder why logical hardwiring is needed at all. Spatial recognition problems are almost adversarial by nature for auto-regressive model architectures: could they not simply be delegated to external tools? Yes, to some extent, but this ignores the benefit of universal transformer integration. Once all subtasks, quanta and concepts inhabit the same latent space, you have a pool of indefinite creation-solving ability. Many elegant mathematical solutions involve looking at the problem from a completely different perspective, even radically reshaping the initial terms of the problem. If anything, current LLM provers may have too restrictive a search space: the entire ecosystem of benchmarks, synthetic problem generation and RL environments is structured around readily available high-school math problems.

This issue obviously compounds for industrial use cases that are still largely to come. The current situation is very analogous to what we see in LLM math. Symbolic routines are widely applied everywhere (from manufacturing to banking). Yet orchestration is largely hardcoded, time-consuming, prone to endless bitter lessons. And here, I feel, we have the recipe but not yet the ingredients: no evaluations, no initial seed set for synthetic generation and problem search.

Synthetic compilation 3: simulations

At this point we come to the common definition of "reasoning", the generated draft, which could be more properly described as just another dimension of reasoning.

This last stage of synthetic data generation is still scarcely documented in the open, but should be tangible to anyone interacting with the latest generations of models. There simply isn't any data that matches what Claude Code does: to achieve that, you need to actively deconstruct data and provide an unlearning process that can recreate complex pieces of code or writing that would never come to be without a wider environment.

A little-seen key innovation of Claude 4 was the introduction of "interleaved thinking", that is, the process "which enables Claude to think between tool calls and make more sophisticated reasoning after receiving tool results", so that Claude can "reason about the results of a tool call before deciding what to do next" and "chain multiple tool calls with reasoning steps in between". How did these traces come to be? Are they a spontaneous creation of RL runs? Here even Chinese labs don't say much, as I guess it gets too close to competitive advantage. We know that GLM 4.5 was trained on "large-scale synthetic agent trajectories". MiniMax M2.1 was trained on 100,000 programming environments with constant recursive synthetic feedback: "We are building an automated data flywheel: automatically discovering high-quality Issues and PRs from GitHub; using models to assess task difficulty and perform stratification; automatically augmenting tasks that the current model can easily solve to make them more challenging; and analyzing failure causes for failed cases to generate targeted training data."
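
What such a trajectory might look like as a training sample can be sketched schematically; this is an illustrative format, not Anthropic's or any other lab's actual trace schema:

```python
# Illustrative interleaved-thinking trajectory: reasoning segments alternate
# with tool calls and tool results, so the model learns to reason about each
# result before acting again. Roles, tools and contents are invented.
synthetic_trajectory = [
    {"role": "user",        "content": "Why is test_parser failing on CI?"},
    {"role": "thinking",    "content": "I should look at the failing test output first."},
    {"role": "tool_call",   "name": "run_tests", "arguments": {"target": "test_parser"}},
    {"role": "tool_result", "content": "AssertionError: expected 3 tokens, got 2"},
    {"role": "thinking",    "content": "The tokenizer drops the final token; check the loop bound."},
    {"role": "tool_call",   "name": "read_file", "arguments": {"path": "parser/tokenize.py"}},
    {"role": "tool_result", "content": "... for i in range(len(chars) - 1): ..."},
    {"role": "assistant",   "content": "The off-by-one in the range() call drops the last token; use range(len(chars))."},
]
```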

Yet we see LLM math research almost organically moving in the same direction, once the core reasoning primitives have been secured and the auto-formalization of common high school problems mostly solved. Instantaneous problem resolution is obviously bounded by sequential complexity. To solve any advanced math problem, you have to perform a large number of intermediary steps, decompose into sub-problems, demonstrations and lemmas, and come up with a strategy to orchestrate all this. Logical hardwiring gives you the initial reasoning primitives (or quanta) but won't scale up this far. For this you need something other than language or pattern-matching models: you need actual simulations of systems. Seed-Prover implemented some form of problem-solving orchestration, intermingling intermediary formalization with backtracking checks: "a proposer module accepts an unsolved problem and, optionally, some already proved lemmas as input, and generates 10–50 candidate conjectures about properties of the problem".
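
Reduced to its control flow, that propose-then-verify orchestration looks roughly like the sketch below, where `propose_conjectures` and `try_prove` stand in for model calls and a formal checker such as Lean; nothing here is the actual Seed-Prover implementation.

```python
# Rough control-flow sketch of propose-then-verify orchestration: propose
# candidate conjectures, keep only formally verified ones as lemmas, and
# retry the main goal with the growing lemma pool.
from typing import Callable

def prove_with_lemmas(problem: str,
                      propose_conjectures: Callable[[str, list[str]], list[str]],
                      try_prove: Callable[[str, list[str]], bool],
                      max_rounds: int = 5) -> tuple[bool, list[str]]:
    proved_lemmas: list[str] = []
    for _ in range(max_rounds):
        if try_prove(problem, proved_lemmas):         # backtracking check on the main goal
            return True, proved_lemmas
        for conjecture in propose_conjectures(problem, proved_lemmas):
            if try_prove(conjecture, proved_lemmas):  # keep only verified intermediate results
                proved_lemmas.append(conjecture)
    return False, proved_lemmas
```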

A handful of specialized papers highlights what will likely be the next stage of synthetic pretraining: highly structured environments. Toucan open-sourced a large dataset of "1.5 million trajectories synthesized from nearly 500 real-world MCP servers" in many domains.


Simulations contribute to blurring the lines between models and data. Interleaved thinking is already, by design, a form of model self-control, but the recursivity can extend much further: models could theoretically control their own prompts, set optimal strategies of context compression, or maybe even obtain tool-use access to model internals and manage the temperature of token generation paths.

It's hardly surprising that the actual breakthrough in LLM math came in the same period as Claude Code.

The few available large-scale datasets that have proven to elicit comparable capabilities, like Toucan, encompass millions of agent trajectories around simulated environments.

Thinking in the data/model design space

One of the surprising outcomes of SYNTH is that I have been brought to increasingly consider data as a model artifact. Obviously, synthetic training involves LLMs as data producers and, similarly to what we see on the deployment side, I expect synthetic pipelines to be eaten by the model layers. After all, what we build is mostly harnesses and control flows that implement formal verification or enforce data diversification, but all these processes could be pre-compiled in the weights themselves. On the reverse side, data is shaping model design. It now becomes clearly visible that some tasks are easily assimilable and some aren't and require considerably more strain on the data side — again, anything involving spatial patterns is at a loss with auto-regressive generation. The consequence is that frontier data and frontier model research are now converging on an increasingly shared understanding of the challenges that remain. To quote a few examples:

  • Knowledge connections. Learning compartmentalization seems like a key feature of generalizability: this is something we saw while training Baguettotron, as the deep-layer architecture provided better learning inertia and premature convergence between the different synthetic pipelines. For Nemotron

And yet, beyond leading labs, data is one of the few axes of improvement that is more commonly available. BeyondWeb showed "diminishing returns when increasing rephraser size beyond 3B parameters, with 8B LLMs achieving only marginal gains over 3B (…) given the right recipe, effective synthetic data generation doesn't necessarily require massive computational resources". We independently converged on the same observation, as SYNTH was built from the ground up exclusively with finetuned 8B and 12B models, with anticipated structured inputs and outputs. This is a downstream benefit of considering synthetic data not as the straight production of a model (obviously bounded by its internal capacity) but as a complex system involving additional seeding and formal verification (but more on this later on). It now becomes likely that synthetic data can be used to train upward, with small integrated specialized models producing training inputs for much larger models. DeepSeek-Prover-V2 combines "DeepSeek-V3 for lemma decomposition and a 7B prover model to complete the corresponding formal proof details", so that in effect most of the training data was written by the small model.
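
The train-upward pattern reduces to a simple loop, sketched below with placeholder function names (this is not the SYNTH or DeepSeek-Prover-V2 code): the quality bound comes from seeding and verification rather than from the generator's size.

```python
# Sketch of "train upward": a small specialized generator produces candidates
# from structured seeds, a verifier filters them, and only verified samples
# enter the training set of a larger model. Function names are placeholders.
from typing import Callable, Iterable

def build_training_set(seeds: Iterable[dict],
                       small_generate: Callable[[dict], str],
                       verify: Callable[[dict, str], bool]) -> list[dict]:
    accepted = []
    for seed in seeds:
        candidate = small_generate(seed)
        if verify(seed, candidate):   # e.g. exact-answer check, proof checker, schema match
            accepted.append({"seed": seed, "text": candidate})
    return accepted
```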

Physics of the Language Model is unsurprisingly