What's the deal with mid-training?

Mid-training is poised to become an AI buzzword in 2025. OpenAI has had a "mid-training" division since July, whose major contributions "include GPT4-Turbo and GPT-4o". xAI is setting one up. The teams behind Phi 3.5, Yi, and, most recently, Olmo devoted a lot of time, resources, and effort to mid-training their latest models.

What is mid-training exactly? It's not pre-training, it's not post-training, it's vaguely in-between. According to OpenAI, "the mid-training team does cross-cutting research, engineering, and execution, including activities classically associated with both pre-training and post-training."

Confused? That's normal. A cursory look on Google Scholar and arXiv yields very few hits (fewer than 50), and most of them are about a different kind of "mid-training" than what OpenAI has in mind. We find allusions to mid-training checkpoints, losses, and adjustments, all things that happen during pre-training, at a moment when the base model definitely does not exist yet. It's easy to dismiss mid-training as a bizarre avoidance of actual training — a bit like the "interquel" in Hollywood narratology. Neither a prequel nor a sequel, then what is it, if not the thing itself?

In short, I went through all the academic publications, intermediate model reports, and confusing corporate announcements about mid-training, so that you don't have to.

1. Where does it come from?

Seemingly, out of nowhere. In July, OpenAI quietly announced they had a "mid-training" division in two research job ads (now deleted, 1 and 2). Not a small thing either: the team improves "OpenAI's flagship models, adding new capabilities and making them more efficient. Recent examples of artifacts with major contributions from our team include GPT4-Turbo and GPT-4o". Ever since, it has been rumored that the mid-training team has also been instrumental in developing the o-series of models.

A closer look reveals that mid-training was first introduced not in 2024 but in 2020, for Bleurt, an encoder model assessing the quality of text generation. It could be anachronistically described as a "judge" model, one that "can model human judgment with a few thousand possibly biased training examples". Bleurt is initially based on Bert, retrained on millions of synthetically generated examples, and finally made available for fine-tuning on more specific tasks. To describe the creation of Bleurt, the authors coined the concept of "mid-training": "The model is then “warmed up” by exposing it to millions of sentence pairs (x, x̃), obtained by randomly perturbing sentences from Wikipedia (…) We denote this stage as mid-training."

I believe there is some form of institutional heritage at play here. The OpenAI mid-training team is based in London and run by ex-DeepMinders, including Jacob Menick, who originally posted the job ads on Twitter/X [profile now deleted].

If we persist in doing some OpenAI kremlinology, the mid-training team could be an offshoot of the internalized continuous pre-training program. In April 2024, OpenAI presented its new custom training service for vertical AI companies: "fully custom-trained models imbue new knowledge from a specific domain by modifying key steps of the model training process using novel mid-training and post-training techniques." Citing the use case of the legal tech Harvey, OpenAI states that models have been expanded with "10 billion tokens worth of data", which is the expected size to incorporate a whole new body of knowledge or a whole language.

The concept was formally reintroduced into the academic literature by Phi 3.5. We learn, very briefly, about "phi-3.5-mini and phi-3.5-MoE, which incorporate more multilingual and long-text data during mid-training" (2404.14219). There is no additional detail, but from what we can gather, (nearly?) all the models have been "mid-trained", and the MoE and mini variants simply incorporate a different recipe with more multilingual and long-text data.

The Yi report (2412.01253) includes the first standard definition to date, mixing multiple things we are going to delineate in the next section (data filtering, long context, multilingual support…): "In the mid-training stage, we focus on enhancing model capabilities and extending context length through gradual data distribution shifts."

Since 2025, the standard definition has been set by Allen AI in a lengthy section of their Olmo 2 report. It's also an opinionated definition that centers mid-training on late-stage curriculum learning and annealing: "In previous OLMo iterations, we also found that both learning rate schedule (OLMo 1; Groeneveld et al. 2024) and data mixture (OLMo-0424; Ai2 2024) play an important role. We refer to interventions at this stage of model development as mid-training." (2501.00656).

Now, let's zoom out. Beyond incidental issues of terminology, the rise of mid-training conveys two parallel vibe shifts:

  • Base and instruct training are blurred. Scheduling and annealing are now part of the standard approach to training (see the sketch after this list). Introducing instruct-like data and/or data filtered close to the expected use has repeatedly yielded performance gains. After spending a lot of time in specialized pretraining, I suspect that Chinchilla scaling laws still largely apply here. You can push a model's performance a lot on a given task simply by submitting many more examples of this task and/or better reasoned examples. Or, on the reinforcement learning side, by designing a setting where the model can "play" the task endlessly until we start to hit some saturation point.
  • Post-training has scaled. Compute plans, datasets, and team organization are being rebalanced in many labs. Some of the new reasoning models like o3 may even have been exclusively "post-trained" - the fast release iterations do suggest it. In short, post-training is the new pre-training or, maybe even, pre-training as we know it will end. While there have been multiple rumors of data walls and failed titanic runs in big labs (especially at Anthropic and xAI), performance gains seem to have been achieved mostly beyond base model training, through inference scaling, synthetic data, reinforcement learning, internal model manipulation (SAEs), and logits optimization.
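
To make the scheduling and annealing point from the first bullet a bit more concrete, here is a minimal sketch of the general pattern (my own illustration, not the Olmo 2 recipe; all hyperparameters are arbitrary): a cosine decay during base training, followed by a linear anneal to zero over a final phase run on the high-quality mix.

```python
# Minimal sketch of the "anneal on a high-quality mix" pattern (illustrative only,
# not the Olmo 2 implementation; hyperparameters are arbitrary).
import math

def lr_at(step: int, total_steps: int, anneal_steps: int,
          peak_lr: float = 3e-4, warmup_steps: int = 2000) -> float:
    anneal_start = total_steps - anneal_steps
    if step < warmup_steps:
        # linear warmup
        return peak_lr * step / warmup_steps
    if step < anneal_start:
        # cosine decay during base training, down to 10% of the peak
        progress = (step - warmup_steps) / (anneal_start - warmup_steps)
        return 0.1 * peak_lr + 0.9 * peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
    # final phase: linear anneal to zero while training on the mid-training mix
    return 0.1 * peak_lr * (total_steps - step) / anneal_steps

# e.g. a 100k-step run whose last 10% of steps run on the high-quality mix
print(lr_at(95_000, total_steps=100_000, anneal_steps=10_000))
```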

2. What is mid-training?

It's hard to say: definitions are currently in flux, as some post-training practices continue to scale and to be reintegrated into standard training.

There is at least one constant attribute of mid-training: it is done on a mid-range of dataset sizes. Following on the approach initiated by Meta of training well past the Chinchilla optimum, even tiny models are now routinely trained on trillions of tokens. Meanwhile, even the largest open instruct datasets available today hardly number beyond a few hundred million tokens. As we can see with the Olmo, Phi, and Yi reports, mid-training, regardless of its exact meaning, usually happens in between, with a common sweet spot between 10 and 300 billion tokens.

Data size does determine data work. With pretraining, it's all heavy pipelines, jobs commissioned through the night on Slurm, little maneuvers costing us 51 hours. With post-training, it's nearly instantaneous: you can just hit the Colab notebook (after all, the place where GPT got invented). Mid-training is lightly formalized: you may need to launch jobs, but they can generally be run on a personal computer or within a reasonable time (a few hours at most).

Even with this modicum of agreement, there is still a large amount of conceptual variation. Let's see.

2.1 Domain and language extension

It may not be the most discussed form of mid-training, but it seems to have been the original impetus: at OpenAI the mid-training team was likely set up to deal with custom training at a scale going beyond standard post-training.

Multilingual extension is a radical form of retraining, as it frequently involves changing model internals, especially the tokenizer. Phi 3.5 underwent a mid-training stage for better language support, among other things. Yet, as on many other aspects, the model report is short on details: "the exact composition of languages in the training corpus is not disclosed – in fact, no languages are explicitly listed, resulting in a lack of transparency regarding the data sources and training procedure" (2412.15450). The Yi report also singles out enhancing "multilingual capabilities for low-resource language", although more as part of a general data filtering/upsampling strategy (2412.01253).

These practices are commonly associated with continuous pre-training. The shift we are seeing here is mostly organizational: the enhancements are now bundled by the original model trainer. Since in some cases the original base model checkpoint is not even released, the semantics of continuation loses its value.

2.2 Long context extension

Models are now routinely released with different context lengths than the one used for training. The dataset used is generally unstructured, or can combine base-like training with instruct training. Until Olmo 2, all the early mentions of mid-training in model reports were centered on long context, with some variations. The Phi 3.5 report is relatively evasive but seems to mention "an initial long-context mid-training, followed by mixed long-short post-training with supervised fine-tuning (SFT) and direct preference optimization (DPO)" (2411.05232). The Yi report is more straightforward and clearly delineates a mid-training step coming just after the initial pre-training for "enhancing model capabilities and extending context length through gradual data distribution shift" (2412.01253). Long context extension furthermore integrates some filtering recipes, given that it's usually done on a quality sample. Phi 3.5 yielded disappointing results on long-context benchmarks, which was attributed to a "lack of high-quality long-context data in mid-training" (2404.14219).

While long context extension seems to fit well within the most common definition of mid-training (mid-range dataset size, generally unstructured data), it also blurs any distinction between pre/mid/post-training. Fundamental design decisions (like RoPE theta values) are now made in anticipation of context length extension, in order to optimize positional embeddings. Context length extension can even happen early in training (and then much later at the end): Llama 3 was initialized with a 4k context length and almost immediately scaled up to 8k. Consequently, we end up with weird schedules where long-context mid-training happens after post-training: "Stages 1 through 3, the process starts with alignment, pre-training, and supervised fine-tuning. In Stage 4, the model undergoes mid-training context extension" (2408.10188).
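
To make the RoPE theta point concrete, here is a small illustrative sketch (my own, not from any of the cited reports) of why context extension recipes raise the RoPE base: a larger theta stretches the rotation wavelengths, so positional embeddings rotate more slowly and stay distinguishable at longer distances. The specific values are assumptions for illustration.

```python
# Illustrative sketch: raising the RoPE base (theta) slows the per-position
# rotation, which is why long-context recipes typically increase it before or
# while training on longer sequences. Values are illustrative.
import numpy as np

def rope_frequencies(head_dim: int, theta: float) -> np.ndarray:
    # one rotation frequency per pair of dimensions, as in the RoPE formulation
    return 1.0 / (theta ** (np.arange(0, head_dim, 2) / head_dim))

freqs_base = rope_frequencies(head_dim=128, theta=10_000.0)    # common short-context base
freqs_long = rope_frequencies(head_dim=128, theta=500_000.0)   # a typical long-context base

# total angle rotated by the slowest dimension at position 32,768
print(32_768 * freqs_base[-1], 32_768 * freqs_long[-1])
```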

2.3 Quality training

In this sense, mid-training is nearly synonymous with the preparation of a specific dataset for late-stage self-supervised training. It's not clear whether Phi really uses mid-training in this sense, even though the model series popularized the related concept of annealing. The Yi report may be the first academic publication to associate mid-training with curriculum training, as they "implement an incremental upsampling strategy for high-quality data, emphasizing complex reasoning and multilingual capabilities for low-resource language" (2412.01253).

Olmo 2 differentiates "base model training" (or pre-training) from a "second stage, which we refer to as mid-training (5–10% of training FLOPs)". The model report also describes the first systematic mid-training dataset to date, Dolmino Mix 1124, as well as a series of new experiments and ablations. The overall goal is to "identify a mix of high-quality sources to improve performance across the entire development benchmark suite".

Mid-training quality datasets include:

  • Filtered selections from pre-training data, using a quality classifier. This is an approach pioneered in open LLM research by FineWeb-Edu, a subset of FineWeb (2406.17557). The classifiers are usually fairly small so that they can run at scale, either Bert- or fasttext-based models, with the overall idea of leveraging vocabulary proxies for educational/scientific content (see the sketch after this list). An emerging practice is to use larger decoder language models to further filter specific subsets requiring more reasoning-based assessment. For instance, at Pleias we created a quality code dataset from Common Corpus relying on a fine-tuned "judge" model, ArmoRM.
  • Curated datasets, generally not coming from a large web crawl but from specific large-scale collections (in Olmo 2's case, from math, encyclopedias, and scientific publications). Curation aims to improve specialized capacities and knowledge. It also acts as a quality proxy, as not only the actual content but also the format, structure, and overall reasoning quality of the text can be significantly better than even the filtered crawl. Our pretraining program at Pleias relies entirely on some form of curation, since we only use data that is either uncopyrighted or under an open license.
  • Instruction datasets. This is the part that most "blurs" the distinction between pre- and post-training, as these collections largely anticipate fine-tuning.
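
As an illustration of the classifier-filtering pattern in the first bullet, here is a minimal sketch using a fastText classifier. The model file and label name are placeholders (every project trains or reuses its own classifier); FineWeb-Edu itself relies on a small Bert-based scorer, but the filtering logic is the same.

```python
# Minimal sketch of classifier-based quality filtering. The model path and
# label name are placeholders, not a specific released artifact.
import fasttext

model = fasttext.load_model("quality_classifier.bin")  # hypothetical local classifier

def keep(document: str, threshold: float = 0.9) -> bool:
    # fastText predicts on a single line, so newlines are stripped first
    labels, probs = model.predict(document.replace("\n", " "), k=1)
    return labels[0] == "__label__high_quality" and probs[0] >= threshold

corpus = ["First raw document...", "Second raw document..."]  # stand-in for a crawl shard
filtered = [doc for doc in corpus if keep(doc)]
```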

In the end, the aggregated datasets fall within the mid-range dataset size, between 50B and 300B tokens.

2.4 Scaling synthetic data

"Synthetic" data is now ubiquituous in model training and yet not always properly defined. It uncovers a wide range of data augmentation and transformation that is rarely "pure" generation — to the point, I now favor "LLM data processing" in many instances.

In the Olmo 2 report, large scale synthetic data is an integral part of mid-training. Recipes are focused on math, but could be transferable elsewhere (actually roughly what we're doing at Pleias right now):

  • Straight generation from a more powerful model, acting here as a "teacher". This was probably the most widespread initial approach to synthetic data, as introduced by Phi. Yet there are downsides to it, especially with regard to overfitting (a bit excessively dramatized as "model collapse").
  • Question generation from answers, also frequently termed "backtranslation". The core idea is to leverage the fact that quality datasets already contain answers that could be "instructionized". Generating questions is not a hard task; generating sufficiently diverse and realistic questions in multiple languages (typically the kind users would actually write) is the actual challenge. There are multiple tricks, some coming from information retrieval, such as artificial variation/steering during generation (see the sketch after this list).
  • Answer generation with formal verification. For math and code the approach is relatively straightforward: run a solution and check whether it is internally consistent. But even "soft" sciences can occasionally be formalized. For RAG, one of the pipelines I use the most is citation reprint detection, with algorithms initially conceived for genomics.
  • Answer generation with LLM verification — a "judge" model. This brings us back to the original mid-training model, Bleurt, which also acted as an evaluation model. Pipelines here can become relatively complex, crossing different types of models and some form of formal assessment. In my experience, it is generally best to cut down an evaluation design into granular, well-defined tasks: an orchestration of smaller specialized models, each exclusively focused on its own set of problems to assess, is not only more performant but also more informative than feedback from a large teacher model.
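
In practice, backtranslation and judge-based verification are usually combined into a single pipeline; here is a compressed sketch of that pattern. The `generate` and `judge_score` callables are stand-ins for whatever generator and judge models are plugged in; they are assumptions, not a specific provider API.

```python
# Compressed sketch of the "backtranslation + judge" pattern. `generate` and
# `judge_score` are placeholders for the actual models, not a specific API.
from typing import Callable, Iterable

def instructionize(answers: Iterable[str],
                   generate: Callable[[str], str],
                   judge_score: Callable[[str, str], float],
                   min_score: float = 0.7,
                   styles: tuple = ("terse", "conversational", "non-native")) -> list:
    pairs = []
    for answer in answers:
        # artificial variation/steering to get diverse, realistic questions
        for style in styles:
            question = generate(
                f"Write a {style} user question that this passage would answer:\n{answer}"
            )
            # keep only the pairs the judge considers well matched
            if judge_score(question, answer) >= min_score:
                pairs.append({"question": question, "answer": answer})
    return pairs
```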

Once more, the key thing is size. You have to generate at scale, evaluate at scale, correct at scale. John David Pressman very aptly rephrased synthetic data as "distant writing", the mirror image of distant reading in cultural analytics. After trying to read content we can't "read" in the common sense, we are now writing texts we can neither realistically read nor write.

2.5 Reinforcement Learning?

I have yet to see any research paper including RL in mid-training. The few glimpses of information from internal OpenAI management are suggestive at best. Yet, like everything now, RL is scaling and is moving up (or earlier) in the overall training schedule.

Methods are getting more compute-heavy and data-intensive, especially following the chain-of-thought turn: RL is not necessarily just about assessing accurate answers, but accurate solutions. Reasoning traces are currently relatively brittle. As was singled out in the DeepSeek v3 report: "while the R1-generated data demonstrates strong accuracy, it suffers from issues such as overthinking, poor formatting, and excessive length." Other open research reproductions of reasoning models highlighted the same limitations. QwQ from Qwen exhibits weird ramblings and language switching and has only been released as a preview for now.

There is a range of solutions, but they all involve significantly more compute and data. Process Reward Models aim to assess chains of thought internally, step by step. More broadly, RL rather than inference seems to have become the space where search methods are actually usable: generations with fixed temperature and hyperparameters can be replaced with wider explorations of possible token trees.
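
To illustrate what "assessing solutions rather than answers" can look like, here is a minimal best-of-n sketch where each sampled chain of thought is scored step by step by a process reward model. The `sample_cot` and `prm_score` callables are placeholders for the sampler and reward model, not a specific implementation.

```python
# Minimal sketch of process-reward-based selection: sample several chains of
# thought, score every intermediate step, keep the best trajectory.
# `sample_cot` and `prm_score` are placeholders, not a specific implementation.
from typing import Callable

def best_of_n(prompt: str,
              sample_cot: Callable[[str], list],        # returns a list of reasoning steps
              prm_score: Callable[[str, list], float],  # scores a partial trajectory
              n: int = 8) -> list:
    candidates = [sample_cot(prompt) for _ in range(n)]

    def trajectory_score(steps: list) -> float:
        # min() penalizes a single bad step; mean() would be the softer alternative
        return min(prm_score(prompt, steps[: i + 1]) for i in range(len(steps)))

    return max(candidates, key=trajectory_score)
```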

Finally, the most striking point for me: the common strategies involved in RL are eerily close to synthetic data pipelines. To solve its scaling issue, RL is moving on from RLHF. "Judge" models are ubiquitous: they are equally used for synthetic answer filtering/checking and for designing the reward policy, with many opportunities for transfer and cross-pollination.

3. Wait, it's all mid-training?

What is the future of mid-training? Likely plain training.

Let's recap for a minute. What we have is:

  • Mid-training datasets of superior quality, ever growing in size. Could a model be only "mid-trained" at some point, and drop the previously defined "pre-training" stage that is slowly becoming the cheap, low-quality warm-up? We've indirectly made this kind of experiment at Pleias: the tiny 350-million-parameter model we released as part of our first series of SLMs was only trained on the high-quality filtered data we used for the last training cycles of the two other models.
  • Models that are more and more specialized. Big labs are driving the momentum here: reasoning models seem like a step back from artificial general intelligence. o1 has set a new SOTA for math but at the expense of many things (especially translation, text writing, etc.). We see the same trend in diffusion models, much more limited than their earlier counterparts for unusual artistic prompts. As a further blurring of pre/mid/post, we have been pioneering specialized pretraining for dedicated tasks, including OCR correction, RAG, and creative writing. In all these cases, the models are technically trained on large instruction sets, yet with the tokenizer, the model parameters, and the layer structure all designed with these tasks in mind.
  • Ripples in training space-time. In theory, one of the primary characteristics of mid-training is that it comes in between a phase called pre-training and a phase called instruction tuning. Or maybe not. As we have seen, context length extension is ubiquitous now. All pretraining data is bound to be synthetically processed. Even RL could be reinjected earlier in the schedule. It's part of a growing frustration in some open LLM research circles (especially around Entropix): post-training and, now, mid-training are reinventing things that were already part of the model's internals and could be better supported there with a few architecture twists. As I'm more focused on the data side, I'm always struck by the wide number of latent instructions and chains of thought already available in the wild, among the unstructured sets used for base training.

Until it eats everything, mid-training will trend. You still have some time to rebrand as a mid-training researcher/lab.