Actual LLM agents are coming. They will be trained, not prompted.

Agents are everywhere these days. And yet, the most consequential development in agentic LLM research has gone almost unnoticed.

In February 2025, OpenAI released DeepResearch, a specialized variant of o3 for web and document search. Thanks to "reinforcement learning training on these browsing tasks", DeepResearch gained the capacity to plan a search strategy, cross-reference sources and track down niche pieces of knowledge based on intermediary feedback. Claude 3.7 Sonnet seems to apply the same recipe successfully to code: the model alone outperforms existing orchestrations of past models on complex sequences of programming tasks.

In short, as William Brown puts it, "LLM agents can work for long multi-step tasks".

This advancement raises the question of what LLM agents really are. In December 2024, Anthropic unveiled a new definition: "systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks."

By contrast, the more common forms of agentic systems are classified as workflows, "where LLMs and tools are orchestrated through predefined code paths". The recently hyped Manus AI fits this definition exactly. All my tests over the weekend show the same fundamental limitations of workflow systems that were already apparent in the days of AutoGPT, and they are especially striking for search:

  • They can't plan and frequently get stuck in the middle of nowhere.
  • They can't memorize and struggle to stay on a task for more than 5-10 minutes.
  • They can't act effectively in the long run. Sequences of actions routinely fail due to compounding errors.

This post takes this new, stronger definition of LLM agents as its starting point. It does its best to summarize what we know so far, based on a mix of limited information from the big labs, emerging reproductions in the open research ecosystem and some personal speculation.

The bitter lesson of simple LLM agents.

The concept of agents is almost entirely at odds with how base language models work.

In classic agent research, agents inhabit constrained environments. You're in a maze: you can move in this direction, but not that one. You cannot fly, nor burrow under the ground, nor vanish into thin air. You're constrained by physics and, optionally, by the rules of the game. Any actual agent placed in this situation still enjoys some degree of freedom, as there is more than one way to solve a game. Yet every move has to be conceived with one thing in mind: winning, getting your final reward. Effective agents gradually memorize past moves and build up patterns and heuristics.

This process is called "search". It's a very fitting metaphor: the exploratory moves of an agent in a maze are a direct analog of the clicking patterns of a web user on a search engine. There is a decades-long literature on search; suffice it to say that Q*, the algorithm once rumored to be behind OpenAI's new o-series models (it remains unclear…), is an offshoot of A*, a search algorithm from 1968. One of the best recent examples of this process is the Pokemon training experiment done by PufferLib: we see the agents literally searching for the optimal paths, failing, going back and forth.

Pokemon RL experiment by PufferLib
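To make the maze analogy concrete, here is a minimal sketch of A*-style search on a toy grid. The grid, heuristic and costs are purely illustrative; nothing here is specific to Q* or to any LLM system.

```python
# Minimal A*-style search on a toy grid maze: explore, score candidate moves
# with cost-so-far plus a heuristic, and keep expanding the most promising path.
import heapq

GRID = [
    "S..#",
    ".#.#",
    ".#..",
    "...G",
]

def neighbors(pos):
    r, c = pos
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(GRID) and 0 <= nc < len(GRID[0]) and GRID[nr][nc] != "#":
            yield nr, nc

def a_star(start=(0, 0), goal=(3, 3)):
    """Best-first search guided by path length plus Manhattan distance to the goal."""
    frontier = [(0, start, [start])]
    visited = set()
    while frontier:
        _, pos, path = heapq.heappop(frontier)
        if pos == goal:
            return path
        if pos in visited:
            continue
        visited.add(pos)
        for nxt in neighbors(pos):
            estimate = len(path) + abs(goal[0] - nxt[0]) + abs(goal[1] - nxt[1])
            heapq.heappush(frontier, (estimate, nxt, path + [nxt]))
    return None

print(a_star())  # the shortest route from S to G, walls (#) avoided
```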

Base language models work almost exactly in reverse:

  • Agents memorize their environment. Base models don't: they can only react to information available inside their context window.
  • Agents are constrained by bounded rationality. Base models generate any probable text. While that may result in genuinely consistent reasoning, there is no hard guarantee, and models can deviate at any time for purely aesthetic reasons.
  • Agents can devise long-term strategies. If well conceived, they can plan several moves ahead or backtrack. Language models can perform a single reasoning task but quickly saturate on multi-hop reasoning. Overall, they are bound by the rules of text, not the rules of physics or of the game.

A simple way of reconciling LLMs and agentification would be to predefine their outputs through prepared prompts and rules. This is the approach retained by most agentic LLM systems, and it is bound to run into Richard Sutton's Bitter Lesson. The Bitter Lesson is sometimes mistaken for a guidebook to pretraining language models. It is really, primarily, about agents and the temptation to incorporate and hardcode knowledge into models. If you see a wall, avoid it, move in the other direction. If you see too many walls, backtrack. This is nice in the short term: you'll see immediate improvements and you won't have to run an algorithm forever to get them. In the long run, though, you are bound to end up with suboptimal solutions or get stuck in unexpected settings:

We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.

Now let's transpose this to the way LLMs are currently used in production. Workflows like Manus or your usual LLM wrapper are currently "building knowledge". They guide the model through a series of prepared prompts. It might be the most opportune solution in the short term: after all, you don't have to retrain a model. It is not the optimal one. In the end, what you create is a hybrid of generative AI and rule-based systems, that is, a set of "simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries."
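To make the contrast concrete, here is what a "workflow" looks like in code: the decision points are fixed in advance by the developer, and the model only fills in the blanks. This is a deliberately simplified sketch; `call_llm` is a hypothetical placeholder for any chat-completion API.

```python
# A workflow in Anthropic's sense: LLM calls orchestrated through a
# predefined code path. The model never gets to change the plan.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for your favorite LLM API")

def research_workflow(question: str) -> str:
    # Step 1: hardcoded decomposition prompt.
    plan = call_llm(f"Break this question into three search queries: {question}")
    # Step 2: hardcoded retrieval/summarization step, one call per query.
    notes = [call_llm(f"Summarize what you know about: {query}")
             for query in plan.splitlines()[:3]]
    # Step 3: hardcoded synthesis prompt. No backtracking, no change of plan:
    # if step 1 went wrong, every later step compounds the error.
    return call_llm(f"Write an answer to '{question}' using these notes: {notes}")
```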

Let's put it clearly. If Manus AI is unable to properly book a flight or advise on fighting a tiger bare-handed, it's not because it is badly conceived. It has just been bitten by the bitter lesson. Prompts can't scale. Hardcoded rules can't scale. You need to design systems, from the ground up, that can search, that can plan and that can act. You need to design actual LLM agents.

RL+Reasoning: the winning recipe

Again, a hard question, and there is little in the open: Anthropic, OpenAI, DeepMind and a handful of other labs know. So far we have to make do with scarce official information, informal rumors and some limited open research attempts.

  • Similarly to classic agents, LLM agents are trained with reinforcement learning. There is a maze: all the possible texts that could be written about a topic. There is a final way out, or "reward". The process of checking that the reward has been attained is handled by a verifier, which is the entire purpose of William Brown's new verifiers library. Verifiers currently work best on formal outcomes, like math equations or programming sequences. Yet, as Kalomaze showed, it is perfectly possible to build verifiers around outputs that are not strictly verifiable, by training ad-hoc classifiers. A major shift helps here: language models are better at evaluating than at creating. So even with a small LLM-as-judge you get a significant jump in performance and in overall reward design (see the sketch after this list).
  • LLM agents are trained through drafts: entire texts are generated and evaluated. This was not a straightforward choice, as research initially focused on expanding search to the entire token sequence. Computation constraints were a major factor, along with the recent breakthroughs in developing "reasoning" models, which could perhaps be more aptly called drafting models. A typical training sequence for a reasoning model involves letting the model come up with its own logical sequence, under the assumption that the drafts yielding a good answer are more likely to be correct. This can produce counter-intuitive outcomes (the best example being DeepSeek-R1-Zero occasionally switching between English and Chinese). Yet, in typical bitter lesson fashion, RL only cares about whatever works, not hesitating to take unorthodox or unplanned shortcuts if need be. Similarly to the classic agent lost in the maze, a language model has to find its way out through a pure reasoning exercise. No predefined prompting, no directions, just rewards and ways to get them: the bitter solution to the bitter lesson.
  • LLM drafts are organized into predefined structured data sections, to ease reward verification and, to some extent, the overall reasoning process. This is a form of rubric engineering that can either be managed directly as a reward function or, as I suspect is more common in big-lab training settings, through some initial phase of post-training.
  • LLM agents usually require being trained on a large number of drafts and over multiple steps. This is typically what happens for search: we do not evaluate the result of a search in one go but the ability of the model to access a resource, get a result, elaborate on it, get another resource, elaborate on it, change plan, backtrack, and so on. For this reason, the preferred method for training LLM agents is now GRPO from DeepSeek, especially in combination with text generation from vLLM. A few weeks ago I released a viral code notebook based on William Brown's work that managed to fit GRPO training on a single A100 GPU made available through Google Colab. The lowering of compute requirements is a major factor that will drive the democratization of RL and agentic design in the years to come.
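As a minimal illustration of verifiers and rubric engineering, here is a sketch of GRPO-style reward functions over structured drafts. The section tags, regex and reward values are illustrative assumptions, not the format of any specific lab or of the verifiers library.

```python
# Rubric-style rewards for RL on drafts: one function checks that the draft
# respects the structured sections, another verifies the final answer.
import re

DRAFT_PATTERN = re.compile(
    r"<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>",
    re.DOTALL,
)

def format_reward(draft: str) -> float:
    """Reward drafts that respect the predefined structured sections."""
    return 1.0 if DRAFT_PATTERN.search(draft) else 0.0

def answer_reward(draft: str, reference: str) -> float:
    """Verifier: compare the extracted answer against a known reference."""
    match = DRAFT_PATTERN.search(draft)
    if match is None:
        return 0.0
    return 2.0 if match.group("answer").strip() == reference.strip() else 0.0

def total_reward(draft: str, reference: str) -> float:
    # In GRPO, this scalar is computed for each draft in a group sampled from
    # the same prompt; a draft's advantage is its reward relative to the group.
    return format_reward(draft) + answer_reward(draft, reference)

print(total_reward("<think>12 * 7 = 84</think><answer>84</answer>", "84"))  # 3.0
```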

Wait… How do you scale that?

So much for the fundamental building blocks. From there to OpenAI DeepResearch and the other emerging agents able to process long sequences of actions, there is still some distance. Allow me to speculate a bit.

Open RL/reasoning research has mostly focused on math, since it turned out we have large collections of math exercises, some of them bundled in Common Crawl and extracted by HuggingFace with classifiers (that's FineMath). For many domains and, more specifically, for search, we don't have the data, because we need actual action sequences: logs, clicks, patterns. I used to work on log analysis in a not so remote past. Models (still using Markov chains, but, hey, this field changed fast…) were still routinely trained on the AOL search data leaked back in 2006 (!). There is at least one critical open dataset recently added to the field: Wikipedia clickstream, an anonymized set of paths from one Wikipedia article to another. Now let me ask you a simple question: is this set on HuggingFace? No. In fact, there is almost no actually agentic data on HuggingFace, in the sense of data that would empower planning capacities. The entire field is still operating under the assumption that LLMs are to be orchestrated with custom rule-based systems. I'm not sure that OpenAI or Anthropic have this data in sufficient quantity either. This is at least one area where legacy tech companies hold strong leverage, and there is no easy substitute: you can't buy the gigantic collection of Google user queries (unless it has been leaked somewhere on the dark web).

There is a way around this: generating the data directly through emulation or "simulation". Classic RL models do not need past examples: they extrapolate the constraints and overarching strategies through extensive and repeated search. Transposed to search, a typical reinforcement learning approach would not be dissimilar to game RL: let the model travel freely and reward it whenever it hits the right answer. This could be a very long journey. Say you need to locate the one very specific chemistry experiment buried in a forgotten Soviet paper from the 1960s. Through pure brute force, maybe helped by some variation in query language, the model finally stumbles on the right find. It can then aggregate all the factors that led to it, which makes similar finds more likely in the future.

Let's do some arithmetic. In a typical RL design, say GRPO, you can have 16 concurrent drafts, and I would not be surprised if models trained in big labs used many more. Each draft may successively browse at least 100 different pages. That's at least 1,600 potential queries, and this is just… one step. A complex RL training sequence would take maybe hundreds of thousands of steps (one of the reasons I think it is now verging on mid-training) across varied examples, especially for something as complex as generalist search capacities. What you're looking at is one training sequence requiring hundreds of millions of individual connections, possibly DDoSing a few preferred academic sources in the process. This is… not optimal. Bandwidth, rather than actual compute, becomes your primary constraint.
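Spelled out as back-of-the-envelope arithmetic, under the illustrative assumptions above (the step count is my own guess):

```python
# Cost of naively training a search agent against the live web.
concurrent_drafts = 16        # GRPO drafts sampled per training example
pages_per_draft = 100         # pages each draft may browse
steps = 200_000               # a long multi-step RL run (assumed)

queries_per_step = concurrent_drafts * pages_per_draft   # 1,600
total_queries = queries_per_step * steps                 # 320,000,000

print(f"{queries_per_step:,} queries per step, {total_queries:,} for the run")
```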

Game RL faces a similar constraint. That's why state-of-the-art methods like PufferLib wrap "environments to look like Atari from the perspective of learning libraries, without any loss of generality": RL models only need to see what they actually use. Applied to search, this could involve leveraging the big Common Crawl dumps and serving the data as if it were coming through the web, with URLs, API calls and other typical HTTP artifacts, while the data is actually already there, stored in local dataframes with fast querying capacities.
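Here is a minimal sketch of what such an emulator could look like: the agent sees something shaped like a search API, but everything is served from a local corpus. The class, its methods and the corpus layout are hypothetical, not an existing library.

```python
# A toy "web emulator" for search RL: search and fetch against local data,
# with no network traffic involved.
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    title: str
    text: str

class SearchEmulator:
    def __init__(self, corpus: list[Page]):
        self.corpus = corpus
        self.by_url = {page.url: page for page in corpus}

    def search(self, query: str, k: int = 5) -> list[dict]:
        """Return search-engine-like results: URL, title and a short snippet."""
        terms = query.lower().split()
        scored = sorted(
            self.corpus,
            key=lambda p: sum(p.text.lower().count(t) for t in terms),
            reverse=True,
        )
        return [{"url": p.url, "title": p.title, "snippet": p.text[:200]}
                for p in scored[:k]]

    def fetch(self, url: str) -> str:
        """Emulate browsing a page: return its full text from the local store."""
        page = self.by_url.get(url)
        return page.text if page else "404 Not Found"

# Toy usage; in real training the corpus would come from Common Crawl dumps.
emulator = SearchEmulator([Page("local://1", "Obscure chemistry", "a 1960s experiment…")])
print(emulator.search("1960s chemistry experiment"))
```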

So my expectation is that a typical LLM RL agent for search could be trained in this way (a simplified rollout loop is sketched after the list):

  • Creating a large emulation of web search, with a fixed dataset continuously "translated" into web-like responses for the model.
  • Cold-starting the model with some form of light SFT (like the SFT-RL-SFT-RL sequence from DeepSeek), maybe on whatever existing search patterns can be found. The overall idea is to pre-format the reasoning and output and speed up actual RL training: a form of predefined rubric engineering.
  • Preparing more or less complex queries with associated outcomes to serve as verifiers. My best guess is that this involves some complex synthetic pipeline with backtranslation from existing resources, or maybe just very costly annotations from PhD-level annotators.
  • Actual training in multi-step RL. The model is given a query, initiates a search, is sent results, can browse a page or rephrase its query, all of it in multiple steps. From the perspective of the model it is as if it were actually browsing the web, but all this data exchange is prepared in the background by the search emulator.
  • Maybe, once the model is good enough at search, doing another round of RL and SFT, this time more focused on writing the final synthesis. Once more, I expect this involves some complex synthetic pipeline where the output becomes the input: long original reports, cut down into little pieces, with some reasoning involved to tie them back together.
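Tying these pieces together, a single training rollout against the emulator could look roughly like the sketch below. The `policy` and `reward_fn` callables are hypothetical placeholders for the model being trained and for the query's verifier; the emulator is the one sketched earlier.

```python
# One multi-step rollout: the model searches, reads and answers; the verifier
# scores only the final synthesis. In GRPO, several rollouts of the same query
# would be sampled and their rewards compared within the group.

def rollout(policy, emulator, query: str, reward_fn, max_steps: int = 10):
    trajectory = [{"role": "user", "content": query}]
    for _ in range(max_steps):
        action = policy(trajectory)  # e.g. {"tool": "search", "arg": "..."}
        trajectory.append({"role": "assistant", "content": action})
        if action["tool"] == "search":
            observation = emulator.search(action["arg"])
        elif action["tool"] == "fetch":
            observation = emulator.fetch(action["arg"])
        else:  # "answer": the model commits to a final synthesis
            return trajectory, reward_fn(action["arg"])
        trajectory.append({"role": "tool", "content": observation})
    return trajectory, 0.0  # ran out of steps: no reward
```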

You won't prompt an agent

So we finally have an actual agent model. What does it change in practice compared with standard workflows or model orchestrations? Just better quality overall? Or a completely different paradigm?

Let's journey back to the Anthropic definition: LLM agents "dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks". I'm going to take, once more, one of the use cases I'm most familiar with: search.

There has been a lot of speculation about the death of RAG and its replacement by direct LLM use with long context. This did not happen, for multiple reasons: long context is computationally costly, not that accurate beyond relatively trivial lookups, and yields little traceability of inputs. Actual agentic search LLMs will not kill RAG either. What can realistically happen is that they automate it to a large extent, bundling away the complexity of vector stores, routing and reranking. A typical search process could unfold like this (a rough sketch in code follows the list):

  • The query is analyzed, decomposed, and some assumptions are made about user intent.
  • The user might be immediately prompted back if the query is unclear (OpenAI DeepResearch already does this).
  • The model can then either proceed with a generalist search or, if appropriate, move straight to more specialized research sources. The model has memorized standard API schemes and can call them directly. To save on inference time, the model would preferably rely on existing "emulated" versions of the web: APIs, sitemaps and the large ecosystem of the web of data.
  • Search sequences are learned and trained. The model can give up on a bad direction, or take an alternative path the way a professional knowledge worker would. Some of the most impressive results I've seen with OpenAI DeepResearch testify to this capacity: badly indexed sources are correctly located through a series of internal deductions.
  • The steps and processes are logged as internal reasoning traces and provide some level of explainability.
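Seen at inference time, the same loop might look like the sketch below. It mirrors the training rollout, except that the tool choice, the clarification step and the stopping decision all come from the model rather than from orchestration code (every name here is a hypothetical placeholder).

```python
# Inference-time agentic search: the trained model decides when to clarify,
# which source to query and when to stop; the code only executes its choices.

def agentic_search(agent, tools: dict, query: str, max_turns: int = 20) -> str:
    trace = [{"role": "user", "content": query}]
    for _ in range(max_turns):
        step = agent(trace)  # the model picks its own next action
        trace.append({"role": "assistant", "content": step})
        if step["action"] == "clarify":       # prompt the user back
            return step["question"]
        if step["action"] == "answer":        # final synthesis; trace doubles as a reasoning log
            return step["text"]
        tool = tools[step["action"]]          # e.g. "web_search", "wikidata_api"
        trace.append({"role": "tool", "content": tool(**step["args"])})
    return "Gave up after too many turns."
```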

In short, the search process is engineered directly by the model. The LLM agent takes the search infrastructure as it is and games its way through it to the best of its ability. There is no immediate need for additional data preparation. There is no need either to train users to interact with a generative AI system. As Tim Berners-Lee underlined more than a decade ago, "One way to think about [an agent] is that the program does, in each instance, exactly what the user would want it to do if asked specifically."

Now, to get a clearer view of actual LLM agents put into production, you can start to transpose this approach to other domains. An actual network engineering agent would similarly be able to interact directly with existing infrastructure: generating device configurations (routers, switches, firewalls) based on requirements, analyzing network topologies to suggest optimizations, or parsing error logs to identify the root causes of network issues. An actual financial agent would be trained to provide seamless and accurate translation between competing data standards (like ISO 20022 and MT103). None of these things is currently doable with a set of system prompts.

Currently, the only actors able to develop actual LLM agents are the big labs. They hold all the cards: the know-how, some of the data (or at least the synthetic recipes to make it) and an overall vision for turning their models into products. I'm not sure such technological concentration is a good thing, though it has been considerably helped by the reluctance of the funding ecosystem to consider actual model training as a real source of disruption and value creation in the long run.

I generally don't like to overhype things. Yet, given the large potential for disruption and value capture, I do believe it is fast becoming critical to democratize the training and deployment of actual LLM agents: opening up verifiers, GRPO training samples and, maybe soon, complex synthetic pipelines and emulators.

2025, the year of agents? It could still be. Let's see what we make of it.