I am co-founder and CTO of Pleias, a French-German AI lab focused on data research. We build open AI infrastructure with fully releasable artifacts, design specialized synthetic data pipelines, and pretrain small reasoning models. Our main releases include Common Corpus, the largest open multilingual dataset for LLM pre-training (~2 trillion tokens), accepted as an oral presentation at ICLR 2026, and SYNTH, the first fully autonomous dataset for synthetic pretraining. Our work has been reused by multiple research and industry leaders, including Nvidia, Anthropic, StepFun, IBM, and Elasticsearch.
My research focuses on open and synthetic pre-training environments, documented datasets, and the co-design of data and model architectures. I am particularly interested in how synthetic environments can replace web-scale crawling as the primary substrate for training reasoning systems. Before Pleias, I worked on infrastructure and datasets for cultural heritage and open science, at both the operational and policy levels.
Latest Publications
Pierre-Carl Langlais, P. Chizhov, C. Arnett, C. Rosas Hinostroza, M. Nee, E.K. Jones, I. Girard, D. Mach, A. Stasenko, I.P. Yamshchikov.
Pierre-Carl Langlais et al. 2025.
Pierre-Carl Langlais et al. 2024.
Datasets
Downstream Adoption
LLM Training
- Salamandra — Barcelona Supercomputing Center
- GPT-NL — Dutch national model
Multimodal
- Parakeet V2 / Canary 1B / MOSEL — Nvidia
- jina-vlm — Jina AI / Elasticsearch
- FineVideo
Explainability
- Transformer Circuits — Anthropic
Tooling
- Institutional Data Initiative — Harvard Law School Library
- APERTVS
Downstream Adoption
- Step-DeepResearch — StepFun (32B deep research agent)
Models
- Baguettotron (321M) — small reasoning model, trained on SYNTH. SOTA for sub-400M models (MMLU: 40%, GSM8K: 39%, HotPotQA: 51%)
- Monad (56.7M) — smallest viable generalist LM, trained on SYNTH. MMLU: ~30% at 56M params.
- Pleias-RAG (350M, 1B) — RAG with built-in citations, trained on synthetic RAG data
- Pleias 1.0 (350M–3B) — base models trained on Common Corpus
- OCRonos-Vintage (124M) — OCR correction for cultural heritage, trained on synthetic OCR data
Reports & Policy
Regular policy briefs on scientific publishing for EPRIST (2018–present). Contributions to the Petite encyclopédie de la science ouverte (French Ministry of Research, 2021–2022).
Awards & Grants
Selected Talks & Interventions
Contributions include the German-French Summit on Digital Sovereignty (Berlin), the Wikimedia “Knowledge is Human” Summit (British Library, London), the Wikimedia Roundtable “Collective Intelligence vs AI” (Lausanne), and ParanomAI (Geneva).