Training as we know it might end

This post was originally supposed to be about synthetic data. It ended up being about a vibe shift.

Over the last few months, evidence has accumulated that models are changing, and synthetic data is roughly at the center of it. Too many developments have suddenly challenged firmly held assumptions: the rise of tiny reasoners, the incredible data efficiency of reinforcement learning, the rapid expansion of agentified training runs into entire simulated environments. All this prompts me to radicalize Ilya Sutskever's prediction that pre-training as we know it will end: training as we know it will end.

Simply put, I would argue we are increasingly training literal "reasoning models". Not in the sense that they generate reasoning traces and drafts before an answer, but in the sense that they are made for reasoning: designed with tasks, heuristics, and sequences of actions in mind, and made possible through the design and maintenance of large-scale synthetic playgrounds.

It's a reasoning model? Always has been.

Reasoning models are not so much a new development as a change of focus. If attention is all you need, it's because it is, fundamentally, a soft logical machine, drawing correspondences, relationships, and patterns of meaning at every generation step.

Models improve through recurrence and diversity. In contrast with humans (but also with some alternative methods, including reinforcement learning), transformers are not data efficient. You need a wide range of examples to consolidate enough patterns to build up consistent logical relationships, even for things as trivial as inferring that one date comes before another. Through training at scale, reasoning transformers have been the "bitter" realization of the original symbolic logic projects, collecting all kinds of inference rules at scale and regenerating them frequently and fittingly enough to be usable.

The expansion of training data created something of a mirage: the illusion that models are trained on "everything" and that capabilities "emerge" almost magically through a sheer accumulation of compute and data. Yet all recent research points in the same direction: you get what you train on. Models trained on large-scale math datasets are better at math. Datasets filtered to perform better on educational MCQs give you better standard benchmarks.

So, again, the important part is what you want. Pretraining datasets gave us a fascinating, endless generator of intertextualities. Unfortunately, you don't really transform the world economy with that. If we look back at what happened over the last five years, we see a gigantic amount of fixes, hacks, and bolted-on patches to retroactively enforce preferred reasoning patterns. More recently, AI labs have started to deploy a growing range of methods to "denoise" web archives, filter for the "reasoning dense" segments and, increasingly, reformulate and retro-synthesize reasoning, building in effect a divergent synthetic duplicate of the web. We don't yet have background information on the 36 trillion token dataset used by Alibaba to train Qwen 3, but given the significant expansion I would expect a large share of it to be some form of generated augmentation.
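To make the idea concrete, here is a minimal sketch of what a "filter then retro-synthesize" pass could look like. Everything in it is illustrative: the keyword heuristic stands in for a trained reasoning-density classifier, and the `retro_synthesize` stub stands in for an LLM rewriting step; it is not the pipeline of any particular lab.

```python
# Minimal sketch of a "filter then retro-synthesize" pass over a web dump.
# The classifier and the rewriter are stand-ins (a keyword heuristic and a
# tagging function); a real pipeline would plug in a trained quality
# classifier and an LLM rewriting prompt here.

from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    text: str

REASONING_MARKERS = ("therefore", "because", "hence", "it follows that", "proof")

def reasoning_density(doc: Document) -> float:
    """Crude proxy score: share of sentences containing a reasoning marker."""
    sentences = [s for s in doc.text.split(".") if s.strip()]
    if not sentences:
        return 0.0
    hits = sum(any(m in s.lower() for m in REASONING_MARKERS) for s in sentences)
    return hits / len(sentences)

def retro_synthesize(doc: Document) -> Document:
    """Placeholder for the LLM step that rewrites a document into explicit
    reasoning traces; here it only tags the text."""
    return Document(doc.doc_id, "[rewritten as reasoning]\n" + doc.text)

def build_synthetic_duplicate(corpus: list[Document], threshold: float = 0.2) -> list[Document]:
    kept = [d for d in corpus if reasoning_density(d) >= threshold]
    return [retro_synthesize(d) for d in kept]

if __name__ == "__main__":
    corpus = [
        Document("a", "The sky is blue. Because of Rayleigh scattering, short wavelengths dominate."),
        Document("b", "Buy now. Limited offer. Click here."),
    ]
    for d in build_synthetic_duplicate(corpus):
        print(d.doc_id, d.text[:60])
```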

The most precious data is invisible.

Despite all these hacks, we have come to the point where there is an unsolvable hurdle: models are no longer used as language generators. They are increasingly performing a wide range of tasks and actions that force us to confront the limitations and biases of the training data. Web archives do not contain everything. They are a capture of a specific range of human activities that disregards, by design, what we are doing "offline". Typically, vision language models currently struggle to read clocks, gas gauges, or even most industrial instruments, simply because nearly all available images are product displays (with every setting switched off). Models continue to struggle at long context as they are mostly trained on web snippets shorter than 300 tokens. Now, if we think about all that actual agents could do (browse the web, manage corporate search infrastructure, process distress signals on railroads, assist with repairs), we are confronted with the reality that most of these tasks have never been scripted online.

Either the data exists but only in siloed environments, or it existed but as transient processes never meant to be archived, or it never existed in the first place. After all, if you ever wanted to automate your desk job, you would have to keep some kind of log of all the mundane, infra-ordinary tasks you perform day to day. Some might seem ridiculously trivial, not worth recording, but actual social and technical systems live on them.

While there is still little documentation in the open, it seems that large labs have allocated more and more resources to building "simulated" datasets, modeling specific series of unscripted actions if not entire controlled environments. OpenAI's Deep Research was "trained on new browsing datasets" and learned "how to reason through and synthesize a large number of websites to find specific pieces of information". The current version of o3 is likely based on expanded versions of this simulation, moving beyond search to include other tool uses (like processing images). These simulated datasets are not completely artificial and obviously include a range of "real" resources, either as processed examples or as sources of inspiration (what we call "seeding" in synthetic jargon). For the latest Pleias release we strived to provide one of the first open reproductions of a standard mid-training pipeline, especially stressing some unforeseen consequences, such as the need for large and diverse datasets to bring enough diversity and resiliency into synthetic data generation, and the reliance on intensive CPU pre-processing to model search tasks.

Synthetic pipeline of Pleias
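As an illustration of what "seeding" plus CPU-heavy pre-processing can mean in practice (and only as an illustration, not the actual Pleias pipeline), here is a toy generator of synthetic search episodes: real documents serve as seeds, a cheap inverted index is built on CPU, and each emitted sample pairs a query with retrieved snippets and an expected answer. The query/answer templating stands in for an LLM generation step.

```python
# Toy "seeded" search-episode generator: index seed documents on CPU, then
# emit (query -> retrieved snippets -> answer) samples for mid-training.

import json
import re
from collections import defaultdict

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(seeds: dict[str, str]) -> dict[str, set[str]]:
    """CPU pre-processing: map each token to the seed documents containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in seeds.items():
        for tok in set(tokenize(text)):
            index[tok].add(doc_id)
    return index

def retrieve(index, seeds, query: str, k: int = 2) -> list[str]:
    """Rank seed documents by simple token overlap with the query."""
    scores = defaultdict(int)
    for tok in tokenize(query):
        for doc_id in index.get(tok, ()):
            scores[doc_id] += 1
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return [seeds[d] for d in ranked]

def make_episode(index, seeds, query: str, answer: str) -> str:
    """Serialize one synthetic search episode as a training sample."""
    return json.dumps({
        "query": query,
        "retrieved": retrieve(index, seeds, query),
        "answer": answer,
    })

if __name__ == "__main__":
    seeds = {
        "doc1": "The Qwen 3 models were reportedly trained on 36 trillion tokens.",
        "doc2": "Transformers consolidate logical patterns through repeated exposure.",
    }
    index = build_index(seeds)
    print(make_episode(index, seeds, "how many tokens for Qwen 3", "36 trillion"))
```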

Now let's add it all up. We have pretraining datasets uneasily morphing into "reasoning datasets" by becoming more and more artificial. We have a new range of emulated datasets, designed from the ground up to make training feasible on invisible data and striving to become more and more organic. Finally, we have an accumulated collection of hacks, twists, and tips, a byproduct of the focus on scale and undocumented datasets. Is it not time for a radical simplification?

From datasets to playgrounds.

Tackling the difficult issue of assessing model architectures, The Physics of Language Models made a radical decision: the authors created "ideal" environments ("synthetic playgrounds") to test how skills and knowledge are assimilated by different architectures, from transformers to Mamba. The tasks retained here are decorrelated from natural language and rely on a formal notation. They still have a wide scope and include fundamental and desirable features of models in production, such as memorization, multi-hop retrieval, or hierarchical rules.

Physics of LLM
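For a sense of what such a playground can look like, here is a toy generator of multi-hop retrieval tasks in a formal notation. The task format is my own invention for illustration, not the one used in the paper.

```python
# Toy synthetic playground: each sample is a shuffled set of formal facts
# plus a question that requires following a chain of relations (multi-hop
# retrieval) before reading off an attribute value.

import random

def make_multihop_sample(rng: random.Random, n_entities: int = 6, hops: int = 2) -> dict:
    entities = [f"E{i}" for i in range(n_entities)]
    attrs = ["attr_a", "attr_b", "attr_c"]
    # A chain of "rel" facts ending on an entity with a known attribute value.
    chain = rng.sample(entities, hops + 1)
    facts = [f"rel({chain[i]}, {chain[i + 1]})" for i in range(hops)]
    attr = rng.choice(attrs)
    value = rng.randint(0, 99)
    facts.append(f"{attr}({chain[-1]}) = {value}")
    rng.shuffle(facts)
    question = f"follow rel from {chain[0]} for {hops} hops, then read {attr}"
    return {"facts": facts, "question": question, "answer": value}

if __name__ == "__main__":
    rng = random.Random(0)
    sample = make_multihop_sample(rng)
    print("\n".join(sample["facts"]))
    print("Q:", sample["question"], "-> A:", sample["answer"])
```

Because the generator controls every fact, it can produce an arbitrary number of samples at any difficulty (more hops, more distractor entities) and grade answers exactly.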

Synthetic playgrounds are presented as a scientific and experimental design: "We draw inspiration from the physical sciences, where idealized settings—such as friction-less planes or vacuum chambers—reveal first principles by stripping away confounding factors." I now believe this is the future paradigm of training. There is a natural correspondence between the playgrounds used to assess model capacities, the emulators used by emerging agent systems and, most importantly, the kind of complex pipelines recently used by DeepSeek to push the frontier of math-solving models.

Synthetic playgrounds should not be mistaken for model distillation. A fundamental feature is that pipelines and processes matter more than models and, counter-intuitively, stronger models are not necessarily stronger teachers. Typically, the actual synthetic data used by DeepSeek-Prover-V2 was generated by a generalist 7B model ("we use a smaller 7B model to handle the proof search for each subgoal, thereby reducing the associated computational burden"). The critical component of the pipeline lies in the decomposition of existing problems and solutions into a range of sub-goals, their transcription into Lean 4 formal proofs, and a recursive series of iterations and evaluations. Similarly, you can overcome the lack of long-context data by designing RL exercises leveraging plot following.
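The published description suggests a loop of decomposition, cheap proof search, and verification. Here is a schematic sketch of that loop; the decomposer, the small prover, and the verifier are trivial stand-ins (string splitting and length checks), whereas the real pipeline would use an LLM to propose subgoals, the 7B prover for proof search, and the Lean 4 checker for verification.

```python
# Schematic subgoal-decomposition loop: try to prove a goal cheaply; if that
# fails, split it into subgoals, solve them recursively, and compose the
# verified pieces. Every verified goal/proof pair is a synthetic sample.

from dataclasses import dataclass, field

@dataclass
class Goal:
    statement: str
    proof: str | None = None
    subgoals: list["Goal"] = field(default_factory=list)

def decompose(goal: Goal) -> list[Goal]:
    """Stand-in decomposer: split a conjunction-like statement into parts."""
    parts = [p.strip() for p in goal.statement.split(" and ")]
    return [Goal(p) for p in parts] if len(parts) > 1 else []

def small_prover(goal: Goal) -> str | None:
    """Stand-in for the cheap 7B proof search on a single subgoal."""
    return f"proof_of({goal.statement})" if len(goal.statement) < 40 else None

def verify(proof: str) -> bool:
    """Stand-in for the Lean 4 checker."""
    return proof.startswith("proof_of(")

def solve(goal: Goal, depth: int = 0, max_depth: int = 3) -> bool:
    attempt = small_prover(goal)
    if attempt and verify(attempt):
        goal.proof = attempt
        return True
    if depth >= max_depth:
        return False
    goal.subgoals = decompose(goal)
    if goal.subgoals and all(solve(sg, depth + 1, max_depth) for sg in goal.subgoals):
        goal.proof = "compose(" + ", ".join(sg.proof for sg in goal.subgoals) + ")"
        return True
    return False

if __name__ == "__main__":
    g = Goal("the set is finite and its cardinality equals the list length and the list has no duplicates")
    print(solve(g), g.proof)
```

Each verified (sub)goal with its proof then becomes a training sample, which is where the playground and the dataset blur into one another.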

Reasoning models to come.

One of the most surprising results from DeepSeek-Prover-V2 has been the solid performance of the smallest model, a 7-billion-parameter variant which even found a range of solutions missed by the 671-billion-parameter model:

The 7B model frequently employs Cardinal.toNat and Cardinal.natCast_inj to handle problems involving finite cardinalities (see examples in Appendix B), which are noticeably absent in the outputs generated by the 671B version. This technique appears to enable the model to effectively solve a subset of problems that require nuanced manipulation of cardinal values.

All this brings me to the main thought experiment: provided we could generate an unbounded amount of problems and solutions, could we skip the pretraining/post-training stages entirely and train a complex prover model directly on a simulated dataset? That is, not just checking abstract capabilities but using experimental playgrounds to build up a new range of frontier models, likely with considerably less compute, data, and overall expense?

If this ever holds up, we can draw a crude picture of reasoning models:

Synthetic playground

With the following implications:

  • There is only training. Distinctions between pre-training, mid-training, and post-training no longer hold, as we go straight to the actual point: designing models to perform better at common heuristics and reasoning tasks. Even reinforcement learning is just an expansion of the playground, except it is not only focused on "frozen" pre-generated data but also on "hot" variations generated on the fly by the model being trained (and, as shown by Absolute Zero, this could even extend to problem generation); see the sketch after this list.
  • Indefinite data yields indefinite improvement. This comes with the significant caveat that you need good synthetic data in the first place, and by that I mean high-quality, diverse, and complex. Still, a very attractive implication is that there is no data shortage (we are only bounded by the available resources) and capable models could be considerably smaller than they are now. There is probably a threshold of model size, but it is currently undefined and likely task-dependent.
  • Models should diversify. There is simply demand for specialization: search agents do not require the same capacities as infrastructure orchestrators. This can and should be assessed through repeated experimentation, and so, yes, even private "AI labs" should go back to being actual labs, focused on rigorous and transparent experimental design. We could even see a range of competitions inspired by the nanoGPT speedrun, based on pre-determined synthetic datasets.
  • Small models are no longer just toy models. Beyond the capability gains in the small range, agentification creates a new demand for fast inference in constrained environments. This factor helps put model interpretability at the forefront of future model development. We are no longer uneasily transferring observations to much larger models but simply working directly with the models put in production. Any improvement, typically better management of token flows or improved metrics of uncertainty, has direct commercial consequences.
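As promised above, here is a toy illustration of the "frozen" versus "hot" data distinction from the first bullet. The arithmetic task, the mutation rule, and every name in it are invented for illustration; the point is only the data flow, with batches mixing a pre-generated pool and fresh variations produced during training.

```python
# Toy batch sampler mixing "frozen" pre-generated problems with "hot"
# variations derived from problems the model recently failed.

import random

rng = random.Random(0)

def make_problem() -> dict:
    a, b = rng.randint(0, 99), rng.randint(0, 99)
    return {"prompt": f"{a}+{b}=", "answer": a + b}

frozen_pool = [make_problem() for _ in range(1000)]  # generated once, before training

def hot_variation(failed: dict) -> dict:
    """Generate a harder variation of a problem the model got wrong."""
    a, b = map(int, failed["prompt"].rstrip("=").split("+"))
    return {"prompt": f"{a * 10}+{b * 10}=", "answer": (a + b) * 10}

def next_batch(failed_recently: list[dict], size: int = 8) -> list[dict]:
    """Half the batch (on average) comes from fresh variations, the rest from the pool."""
    batch = []
    for _ in range(size):
        if failed_recently and rng.random() < 0.5:
            batch.append(hot_variation(failed_recently.pop()))
        else:
            batch.append(rng.choice(frozen_pool))
    return batch

if __name__ == "__main__":
    failures = [make_problem() for _ in range(3)]  # pretend the model missed these
    for sample in next_batch(failures, size=4):
        print(sample)
```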

This process is likely to be recursive, and it's not surprising we are only considering it in 2025. To properly exist, synthetic playgrounds almost require the emergence of small specialized agents, able to provide reasoned feedback at scale under the constraints of "bounded rationality". After all, what claim would AI have to automate many human activities if it were not first automating its own training process?