Fine-tuning is how frontier labs build models as a product. Specialization through SFT and RL is what separates a general-purpose model from a competitive coding assistant, a healthcare reasoner, or a financial analyst. Increasingly, it's also how labs create the training data in the first place, by designing specialized generator models.
Yet outside frontier labs, there are few fine-tuning success stories. Most corporate and research teams with concrete use cases either give up or settle for prompting workarounds, unable to keep up with the infrastructure demands: provisioning GPU clusters, debugging distributed training, managing long runs that fail silently. Enter a new generation of managed fine-tuning-as-a-service platforms: Tinker, Nebius Token Factory, Together AI, and Prime Intellect offer integrated SFT and RL infrastructure so that teams can focus on models, data, and use cases rather than compute operations.
We set out to evaluate whether these platforms deliver on that promise, and specifically, whether they can support one of the most demanding fine-tuning workflows: large-scale synthetic data generation.
Three months ago we released SYNTH, a fully synthetic generalist training environment that allowed us to train state-of-the-art small models without any reliance on an "organic" dataset like FineWeb. Since publication, SYNTH has been adopted by several academic projects and industry labs, including Step-Fun for its deep-research model. Perhaps even more importantly, we saw significant demand from industrial partners for custom, domain-specific synthetic data generation. This exposed a gap in our own infrastructure: SYNTH was a generalist environment, focused on conversational use cases with relatively simple pipelines for memorization, retrieval, and basic interactions. Moving to specialized agents — able to work iteratively within domain-specific infrastructures, structured data, and custom ontologies — required a qualitative step up in our fine-tuning capabilities.
This agentic turn in our synthetic pipelines became our test case for evaluating the current fine-tuning platforms. It is a deliberately demanding benchmark: if a platform can support iterative training and deployment of specialist generator models that generate and solve complex tool sequences at scale, it can likely handle most post-training workflows. We evaluated the services along the following dimensions:
SYNTH data was originally generated using fine-tuned models. We found fine-tuning advantageous for the following reasons:
For SYNTH, we trained dense models, mostly from the Qwen 3 series, on the Jean Zay cluster using Hugging Face's TRL library. Moving to more agentic use cases proved more challenging for our selection of dense models: synthetic generation is no longer limited to verbalizing existing data seeds but must actively create and solve actual problems, produce structured data outputs and, increasingly, interact with consistent simulated environments. Our early tests with the same fine-tuning infrastructure as SYNTH showed that we had to discard a significant share of the synthetic generations to obtain usable agentic traces: up to 30% of tool calls contained faulty JSON that could not be parsed properly, and we further observed a high prevalence of so-called tortured reasoning, where the model fails to lead from a problem to a pre-defined solution because the correct intermediate path is too complex.
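As a concrete illustration of the filtering this implied, the sketch below drops agentic traces that contain unparseable tool-call JSON. The field names (`role`, `arguments`) are hypothetical placeholders, not our actual trace schema.

```python
import json

def _valid_json(payload) -> bool:
    """Return True if payload parses as JSON, False otherwise."""
    try:
        json.loads(payload)
        return True
    except (json.JSONDecodeError, TypeError):
        return False

def filter_tool_calls(traces):
    """Keep only traces whose tool-call arguments all parse as valid JSON.

    Each trace is a list of steps; steps with role "tool_call" carry an
    "arguments" string that must be parseable.
    """
    return [
        trace
        for trace in traces
        if all(
            _valid_json(step["arguments"])
            for step in trace
            if step.get("role") == "tool_call"
        )
    ]
```

At a 30% malformed-call rate, a filter like this discards a large share of generations, which is precisely what motivated the move to stronger generator models.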
This constraint changes the inference economics that initially favored fine-tuning smaller dense models, and the issue compounds as we use larger context windows with a larger pool of functions. If we have to throw away most of the synthetic generations and, more crucially, hit a ceiling of capability, it may be worth investing in a more powerful generator.
Over the past year, mixture-of-experts (MoE) models have increasingly offered the better trade-off between generation cost and data quality. In theory this is an optimal architecture for synthetic data generation, as inference speed is much more of a concern than memory footprint. Yet MoE models are noticeably harder to fine-tune:
For all these reasons, tuned MoEs remain scarcely used for large-scale synthetic data generation, big labs aside. These issues compound as, conversely, dense models drift further and further from the current state of the art. The best available base models below 12B parameters (mostly Qwen 3 and Gemma 3) were released more than 9 months ago, and Llama 3-based models are still commonly employed despite being nearly two years old.
A secondary motivation to rely on an integrated service lies on the inference side. Since the release of SYNTH we have seen early market signals for paid consumption of synthetic pipeline outputs: typically, expanding an entire dataset with thinking traces to ensure content memorization throughout model training. For now this demand is still very much emerging and hard to scale, and hosting the model we created for SYNTH ourselves would not be an economical solution.
Our training dataset includes 9,415 unique examples with the following components:
The dataset was generated by several frontier models (including DeepSeek V3.2 and MiniMax) with an additional internal curation process to ensure diversity, consistency, and general JSON conformity.
This is not the only model in our synthetic agentic pipeline, but it bundles the hardest components, those that require actual problem-generation skills. In parallel, we designed another fine-tuning dataset for interleaved thinking, which takes the correct expected output and deduces a sequence of thinking steps / tool calls / new thinking steps, similarly to Claude Code. Both models are designed from the outset to be integrated into a broader synthetic pipeline; typically, the Wikipedia seeds are pre-classified to fit into the expected environments. For future projects, we plan to increasingly move to structured data seeds coming from Wikidata rather than unstructured texts. Our overall plan is to move from simple memory/logic forms of synthetic compilation to complex emulated environments.
We adjusted our training data to the instruction style of the models. Otherwise, each training run was based on the exact same dataset.
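As a rough illustration of what this adjustment looks like, the sketch below renders one record from the shared dataset into a chat-style and a plain-text style. Both target formats are simplified stand-ins, not any provider's actual schema.

```python
def to_chat_format(example):
    """Render a (system, prompt, completion) record as a chat-style message list."""
    return {
        "messages": [
            {"role": "system", "content": example["system"]},
            {"role": "user", "content": example["prompt"]},
            {"role": "assistant", "content": example["completion"]},
        ]
    }

def to_text_format(example, eos="</s>"):
    """Flatten the same record into a single text field, ending with an
    end-of-sequence marker (the marker varies by model family)."""
    return {
        "text": f"{example['system']}\n\n{example['prompt']}\n\n"
                f"{example['completion']}{eos}"
    }
```

The underlying content stays identical across providers; only the serialization changes.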
To test the training frameworks on speed, price, and general user experience, we conducted multiple experiments using different model sizes. The function-calling dataset used for training is the same as described above for all runs, representing 30M tokens in total. The four experiments are divided into two main categories: mixture-of-experts (MoE) models (experiments 1.X) and dense models (experiments 2.X). Many major AI labs (OpenAI, Qwen, DeepSeek, Moonshot) have recently released strong open-source MoE models, making this class of models highly relevant for downstream training and applications. They are also harder to train than dense models, which benefit from extensive open-source resources. The training platforms are interesting precisely because they abstract away the model type: it makes no difference to users whether the underlying model is MoE or dense, removing the difficulties added by novel architectures.
The benchmark is designed to evaluate multiple training and deployment capabilities across LLM training services. It compares full fine-tuning and LoRA-based fine-tuning, and assesses whether inference can be executed directly on the trained model within the same platform. The four different models used are:
The experimental setup is standardized and a single fixed dataset is used across all runs. Core training parameters are held constant: context length of 16,384 tokens, batch size of 8, three training epochs, and a uniform learning rate across providers. An exception is experiment 1.1 (see Table 1) for Together AI, which only supports a batch size of 1. For LoRA experiments, a rank of 8 is used consistently. This controlled configuration enables direct comparison of training workflows, fine-tuning methods, and post-training inference support across frameworks. For experiment 1.1 the aim was to train a model using full fine-tuning, but as the option was only available on Nebius Token Factory, LoRA had to be chosen for the other frameworks. The remaining experiments all use LoRA because Tinker does not support full fine-tuning.
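For reference, the constant parameters above can be collected into a single configuration. The field names are illustrative, as each provider exposes them under different names; the learning rate value is not restated here.

```python
# Shared hyperparameters held constant across providers in this benchmark.
# A uniform learning rate was also fixed across providers (value omitted here).
TRAINING_CONFIG = {
    "max_seq_length": 16_384,  # context length in tokens
    "batch_size": 8,           # 1 for Together AI in experiment 1.1
    "epochs": 3,
    "lora_rank": 8,            # applies to LoRA experiments only
}
```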
Table 1: Time and cost metrics for Nebius, Tinker and Together AI training services. Red cells indicate that the run was not performed and that total cost is based on the service's official documentation. Bold text highlights the most competitive provider for each run (for both time and cost metrics). In Run 1.1, Nebius is the only provider offering full fine-tuning, so its performance cannot be fairly compared to the other providers.
** full finetune for Nebius, LoRA for others
The training framework that generated the most online attention was Thinking Machines' Tinker. Thinking Machines Lab was founded in February 2025 in San Francisco by former OpenAI CTO Mira Murati and a team of leading AI researchers; Tinker is their first publicly available product.
Tinker provides a low-level training abstraction that offloads distributed execution to the provider while leaving almost all model, training, and inference logic to the user. Users are required to write essentially the same code they would use for local training and inference; Tinker's primary contribution is managing distributed training across GPUs or nodes on its infrastructure. This design is beneficial for researchers who need flexibility for custom implementations and want to avoid the operational complexity of distributed systems. However, it does not simplify the training workflow itself. Users must still manage model code, training loops, data handling, and configuration. As a result, the platform is poorly suited for users who want a data-centric workflow where models are trained by supplying data and adjusting a small number of parameters.
Several limitations follow from this design choice. There is no cost or time estimation in the interface, and even retroactively, costs for different training runs remain unclear in the web interface. It also lacks any built-in support for Weights & Biases training supervision or Hugging Face model uploads. Furthermore, only LoRA fine-tuning is currently supported; full fine-tuning is not available, despite often yielding better performance. The platform does not support inference or deployment of trained models, and model availability is more limited than on competing services. Out-of-the-box performance is comparatively slow, as throughput optimization is left to the user rather than handled automatically by the platform. There are no built-in templates for common input formats such as chat-style fine-tuning, requiring additional custom code.
The main advantage is cost: in most cases, Tinker is significantly cheaper than the other platforms evaluated.
Prime Intellect (still in private beta) appears to offer similar functionality to Tinker: training runs on their infrastructure, with an interface to track training runs. It features a custom library for writing training code against their infrastructure without managing parallelism, so it currently falls in the code-only category with Tinker. Model availability is quite limited, which makes pricing comparison difficult as model overlap is sparse. For the few overlapping models, Prime Intellect's prices seem to fall between Tinker's and those of the following two services. As we have not included Prime Intellect in the general benchmark, training and inference speed and the general user experience will not be discussed. Overall, the product is in a less established state than the other alternatives included in the full comparison.
Together AI offers both a programmable API and a graphical interface, combining low-level flexibility with a higher-level managed workflow. This allows users to implement custom training logic and wrappers when needed, while also supporting straightforward fine-tuning directly through the interface.
Compared to Tinker, the interface is considerably more accessible and improves practical usability. Users can choose between full code-driven control and an interface-based approach, making the platform suitable for a wider range of users, from researchers requiring customization to practitioners seeking simple fine-tuning workflows.
The Together AI interface is intuitive: users can browse available models, test them in a playground, and access inference through both dedicated endpoints (billed per minute) and serverless endpoints (billed per token). However, serverless availability varies by model, and the documentation does not always reflect actual availability. Navigating the model collection through the documentation is sometimes laborious, so the Models tab in the interface should be preferred for accurate and up-to-date information.
Fine-tuning is streamlined via a dedicated interface tab: users upload data, select a model and parameters, and execute the job. Models can be retrieved directly from the platform or via the Hugging Face integration, which automatically uploads checkpoints or final models. The platform also integrates with Weights & Biases, allowing real-time monitoring of training progress. Fine-tuning jobs can be cloned from previous runs to simplify repeated experiments or slight variations. Before launching a job, the interface also displays the expected cost of the run, a very useful insight that Tinker does not provide.
Besides a quite extensive model collection, the Together AI service also allows for different types of training: LoRA and full fine-tuning for dense and MoE models. The availability of each method however is model dependent and not all combinations of model and fine-tuning type are available. There is also support for supervised fine-tuning (SFT) as well as reinforcement learning (RL), allowing for customization of model behavior to domain-specific data or preference-oriented alignment objectives. For our function calling usage we will only use and test SFT.
Performance benchmarking shows Together AI delivers high training speed: it is consistently and considerably faster than Tinker, and achieves a modest speed improvement over Nebius in experiment 2.1, with the two platforms otherwise reaching comparable speeds.
Cost analysis indicates Together AI is generally more expensive than both Tinker (except for the LLaMA 3) and Nebius (except for experiment 1.2 where the prices match). The evaluated platforms fall into two distinct tiers based on pricing. Tinker represents the lower-cost tier, offering substantially cheaper options than the other platforms. Together AI and Nebius occupy a higher-cost tier, with pricing generally comparable between them. Across the benchmarked runs, Nebius was observed to be slightly less expensive than Together AI. For experiment 1.1, cost and timing results can only be compared across Tinker and Together AI, both trained with LoRA, while Nebius was trained using full fine-tuning. Full fine-tuning is both more expensive and slower because it involves all of the model's parameters and not only a few low-rank matrices.
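As a sanity check on how these costs scale, a run over this dataset processes 30M tokens per epoch over three epochs, i.e. 90M trained tokens. The helper below computes run cost from a per-million-token price; the price argument is a placeholder, not a quote from any of the providers.

```python
# Back-of-the-envelope cost sketch for one training run on this dataset.
DATASET_TOKENS = 30_000_000  # 30M tokens, as stated above
EPOCHS = 3

def run_cost(price_per_m_tokens: float) -> float:
    """Total cost of one run at a hypothetical per-million-token price."""
    trained_tokens = DATASET_TOKENS * EPOCHS  # tokens seen during training
    return trained_tokens / 1_000_000 * price_per_m_tokens
```

The pricing-tier gap described above then translates directly into a multiplier on this figure, which is why the lower-cost tier matters for high-volume synthetic generation.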
Together AI supports serverless inference, billed per token, but availability is limited to certain models. Models without serverless support require dedicated endpoints, billed by runtime, which can be more expensive unless fully utilized. The lack of clear visibility into which models support which inference or training mode sometimes complicates planning.
Fine-tuned models can be deployed from the Models tab on dedicated endpoints or used directly in serverless mode (where available). The accepted template for serverless batch inference, however, is not flexible, so the inference process is not fully streamlined: users must supply input datasets in specific templates matching the model, rather than simply selecting a model and running predictions. Having to modify the dataset to run inference on a fine-tuned model makes inference less user-friendly.
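A minimal sketch of the kind of dataset rewriting this requires, assuming a simplified JSONL request schema (one request per line carrying a "model" field) rather than Together AI's exact batch format:

```python
import json

def retarget_batch_file(src_path, dst_path, model_id):
    """Rewrite a JSONL batch-inference file so every request targets the
    given fine-tuned model.

    The per-line schema assumed here is illustrative; real batch formats
    typically carry additional required fields.
    """
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            request = json.loads(line)
            request["model"] = model_id  # point the request at the tuned model
            dst.write(json.dumps(request) + "\n")
```

This extra preprocessing step is exactly the friction the paragraph above refers to: the data has to change to match the model, instead of the model being selected at run time.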
Nebius Token Factory also provides a flexible and efficient fine-tuning and multimodal inference environment that supports both advanced users and more workflow-oriented practitioners. Like Together AI, they offer high training performance, achieving speeds comparable to the fastest platforms in the comparison, and clearly outperforming lower-end alternatives on speed and user comfort. The platform supports multiple fine-tuning strategies, including LoRA and full fine-tuning, as well as SFT and RL-based approaches, enabling both domain adaptation and preference alignment.
The overall workflow supports easy training setup and has practical integrations with common ML tooling for tracking training progress (Weights & Biases) and uploading models to Hugging Face. Model selection and training accommodate a wide range of architectures, from small to large, for both dense and MoE models. Cost transparency and predictable performance place Nebius in the high-end tier of managed training platforms, with pricing that is competitive and generally slightly more favorable than comparable offerings.
The general user experience is pleasant, with useful guidance such as predicted training costs and estimated time to completion. Having both the opportunity to use code and add custom implementations or work directly with the interface adds the required flexibility for both beginners and seasoned LLM trainers. The model documentation is clear and coincides with the actual service provided (for the experiments conducted above), with an AI chat referencing documentation and explaining the service even for untrained users.
One-click deployment allows running serverless inference on custom LoRA models, with an inference speed of 120 tokens per second (for the Llama model of experiment 2.2). The input format for inference (and training) is also more flexible, accepting conversational, instruction, or plain-text inputs. Inference model selection is done directly in the browser, avoiding the need to change the dataset before running inference. Inference prices are comparable with those of the other providers, delivering fast inference at competitive pricing.
One feature that was particularly useful across iterations of the function-calling work was the Data Lab, which lets users analyse input and output datasets and validate the quality of the generated function-calling scenarios. Compatibility ranges from chat transcripts through logs to feedback streams, easing the process of data curation. For data-centered tasks, this feature eliminates the need to move data in and out of the platform between iterations of model training, seamlessly integrating the data into the training workflow. The Playground also allows users to quickly test custom models with manual inputs, for sampled validation or isolated requests.
The service makes it easy for anyone to generate and deploy custom models quickly, with low waiting times and affordable, usage-based pricing. Speed, efficiency, and personalization are offered together rather than treated as trade-offs. Inference is among the fastest available, thanks in part to speculative decoding, and the system remains stable and performant even with very large context windows of up to 131k tokens.
Beyond inference, the platform supports the full model lifecycle, including data ingestion and curation, supervised fine-tuning, RL/RFT, and model compression. The underlying infrastructure automatically manages parallelism, scheduling, and checkpointing, so users get high throughput and reproducible results without needing to handle operational complexity themselves. EU and US data compliance guarantees set the way for safe and private custom AI, and new models can be requested for specific use cases as the model lineup continues to evolve.
The main takeaway from our study is that fine-tuning services have matured enough to become a key component of the AI infrastructure layer. A few months ago, our initial testing revealed mostly works-in-progress. Now the core promise is actually delivered: abstracting away distributed training complexity so that practitioners can focus on models, data, and concrete use cases. Training speeds are faster than what we see on home clusters, costs are increasingly transparent, and the gap between local and managed training is narrowing in usability. Considering the high hidden costs of maintaining fine-tuning runs, keeping GPU systems updated, and absorbing the delays of long training runs, we see a valid business proposition that delivers real cost savings for any organization with significant and regular post-training needs.
Yet, as fine-tuning services consolidate, they are likely to expand into the data layer. In practice, this was simply our main bottleneck: not 'how do we tune the hyperparameters' but something far more basic, 'how can we look at the data?' This is where the platforms diverge most, in the extended ecosystem around the training run: data inspection and curation tooling, flexible inference for immediate evaluation of fine-tuned models (with LLMs-as-judges), seamless iteration between generation and training, and support for the full model lifecycle from data ingestion through deployment. Too often, the time gained from significantly faster runs was lost to compounding data frictions, as we could not easily inspect either the training sets or a sample of model outputs.
On this axis, the three platforms occupy distinct positions. Tinker offers the lowest cost and maximum low-level control, but delegates virtually all workflow design to the user — which might be suitable for teams with strong infrastructure expertise who primarily need cheap compute. Together AI provides a more balanced interface with good speed and the broadest model support so far, though friction remains around inference deployment and input format rigidity. Nebius Token Factory came closest to the integrated, data-centric workflow that our synthetic pipeline cycles require: the combination of Data Lab for dataset analysis and validation, one-click inference deployment of fine-tuned models, flexible input formats, and a generally coherent end-to-end experience made it the most practical platform for our iterative use case.
To date, the fine-tuning services market is mostly constrained by data availability: only users who have already sorted out the proper data model that meets their needs can effectively allocate resources to training runs. In our experience, the vast majority of high-value training use cases have not yet reached this point of data maturity, as data is too frequently non-existent, inaccessible or not modelled as a proper learning exercise for language/agentic model training. The ideal platform for this unaddressed market would require only general concepts and a broad description of the environment or situation to emulate.
Building SYNTH gave us a concrete preview of what full-stack data generation infrastructure actually requires: it's not just a handful of heuristic filters, but a complex workflow integrating specialized/constrained generator models, verification systems with formal checks and integrated LLMs, emulated systems and agentic loops. Platforms that acknowledge this shift early on and build toward data-centric rather than purely model-centric workflows will be best positioned for an actual democratization of model training.