Pierre-Carl Langlais

also known as Alexander Doria

I am co-founder and CTO of Pleias, a French-German AI lab focused on data research. We build open AI infrastructure with fully releasable artifacts, design specialized synthetic data pipelines, and pre-train small reasoning models. Our main releases include Common Corpus, the largest open multilingual dataset for LLM pre-training (~2 trillion tokens), accepted as an oral presentation at ICLR 2026, and SYNTH, the first fully autonomous dataset for synthetic pre-training. Our work has been reused by multiple research and industry leaders, including Nvidia, Anthropic, StepFun, IBM, and Elasticsearch.

My research focuses on open and synthetic pre-training environments, documented datasets, and the co-design of data and model architectures. I am particularly interested in how synthetic environments can replace web-scale crawling as the primary substrate for training reasoning systems. Before Pleias, I worked on infrastructure and datasets for cultural heritage and open science, at both an operational and a policy level.

Latest Publications

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training — ICLR 2026 (Oral)
Pierre-Carl Langlais, P. Chizhov, C. Arnett, C. Rosas Hinostroza, M. Nee, E.K. Jones, I. Girard, D. Mach, A. Stasenko, I.P. Yamshchikov.
Even Small Reasoners Should Quote Their Sources: Introducing the Pleias-RAG Model Family
Pierre-Carl Langlais et al. 2025.
Pleias 1.0: The First Ever Family of Language Models Trained on Fully Open Data
Pierre-Carl Langlais et al. 2024.

All publications →

Datasets

Common Corpus 2T tokens · 517M documents
The largest open dataset for LLM pre-training. Six collections (Open Government, Open Culture, Open Science, Open Code, Open Web, Semantic Data). All data is uncopyrighted or under permissive licenses. A minimal loading sketch follows the adoption list below.
Downstream adoption

LLM Training

  • Salamandra — Barcelona Supercomputing Center
  • GPT-NL — Dutch national model

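A minimal sketch of streaming the corpus from the Hugging Face Hub. The repository id (PleIAs/common_corpus) and the text field name are assumptions based on the public release, not details stated on this page:

```python
from datasets import load_dataset

# Stream rather than download: the full corpus is ~2T tokens / 517M documents.
corpus = load_dataset("PleIAs/common_corpus", split="train", streaming=True)

# Inspect the first few documents.
for i, document in enumerate(corpus):
    print(document.get("text", "")[:200])  # first 200 characters of each document
    if i >= 4:
        break
```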

SYNTH 100B tokens · 79.6M documents
First open generalist synthetic dataset for training small reasoning models end-to-end. ~58K Wikipedia articles amplified through memorization, reasoning, creative writing, RAG, math, and translation exercises. Joint release with the AI Alliance.

Models

  • Baguettotron (321M) — small reasoning model, trained on SYNTH. SOTA for sub-400M models (MMLU: 40%, GSM8K: 39%, HotPotQA: 51%). A loading sketch follows this list.
  • Monad (56.7M) — smallest viable generalist LM, trained on SYNTH. MMLU: ~30% at 56M params.
  • Pleias-RAG (350M, 1B) — RAG with built-in citations, trained on synthetic RAG data
  • Pleias 1.0 (350M–3B) — base models trained on Common Corpus
  • OCRonos-Vintage (124M) — OCR correction for cultural heritage, trained on synthetic OCR data
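A minimal sketch of running one of these models with transformers. The Hub id (PleIAs/Baguettotron), the plain-text prompt, and the generation settings are assumptions; the model card defines the actual reasoning template:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PleIAs/Baguettotron"  # assumed Hub id; e.g. swap in the Monad repo for the 56.7M model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Plain-text prompt; the released models may expect a chat/reasoning template instead.
inputs = tokenizer("What is the capital of France?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

At these sizes (tens to hundreds of millions of parameters), the models run comfortably on CPU, which is part of the point of small reasoning models.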

Reports & Policy

EU AI Act. Attended model-builder policy meetings and provided feedback. Pleias is one of the first ~20 signatories of the EU AI Act's Code of Practice for general-purpose AI models.
AI Action Summit. Common Corpus was a key deliverable of the Summit (Grand Palais, Feb. 2025) and was contributed in-kind to the Current AI initiative.
Towards Best Practices for Open Datasets for LLM Training. Collaborative white paper with Mozilla Foundation and EleutherAI. 2025. arXiv
OA Diamond Study. Report for Coalition S / European Commission on non-commercial scientific journals (with J. Bosman, J.-E. Frantsvåg, B. Kramer, P. Mounier, V. Proudman). 2021. Zenodo
Étude critique des nouveaux modes d'éditorialisation de revues scientifiques en accès ouvert (a critical study of new editorial models for open-access scientific journals). Report for the French Ministry of Research. 2016. HAL

Regular policy briefs on scientific publishing for EPRIST (2018–present). Contributions to the Petite encyclopédie de la science ouverte (French Ministry of Research, 2021–2022).

Awards & Grants

Concours de l'Innovation (Bpifrance / ADEME). Project “Poliaris”, 2025. One of 62 early-stage innovative companies selected.
Mozilla Builders Accelerator. One of 14 projects selected, 2024. Mozilla
GENCI Grand Challenge. 150,000 H100 GPU hours on the Jean Zay supercomputer for training Pleias 1.0. One of 13 Grand Challenges, alongside Hugging Face, H Company, Owkin, and others. 2024. GENCI

Selected Talks & Interventions

“Open data flows: rethinking AI infrastructure after the synthetic turn” — Commons.AI / apidays, Paris, Dec. 2025.
“SYNTH: Training for reasoning with generalist synthetic environments” — Institute of Foundation Models, Nov. 2025.
“Cartographier la science” (Mapping science) — Colloque DGLFLF / OPALE, Ministère de la Culture, Paris, Nov. 2025.
“SYNTH: designing open synthetic environments with open data” — EPFL, Lausanne, Nov. 2025.
“SYNTH: Open Environments with Open Data” — AI Alliance Meetup, Ekimetrics, Paris, Sep. 2025. With Anastasia Stasenko.
“Pleias 1.0: open language models for the public good” — Jean Zay supercomputer inauguration, GENCI / IDRIS, May 2025.
“Optimising AI for Sensitive Industries” — Open Data Institute, London, Mar. 2025. Keynote.
“Designing frugal reasoning models” — X-IA #26, École Polytechnique, 2025.
“French ScienceCommons” — Académie des Sciences, Paris, 2025.
“Entropy is all you need? The quest for best tokens and the new physics of LLMs” — CERN Data Science Seminar, Nov. 2024.
“More than Open — Creating Ethical AI Datasets” — Datasets Convening, Amsterdam, Jun. 2024. Co-hosted by Mozilla and EleutherAI.

Also contributed to the German-French Summit on Digital Sovereignty (Berlin), the Wikimedia “Knowledge is Human” Summit (British Library, London), the Wikimedia Roundtable “Collective Intelligence vs AI” (Lausanne), and ParanomAI (Geneva).

Full list & podcasts →

Selected Press

B SMART — “Smart Tech” TV appearance, Oct. 2025.
Stephen Hurford, “How Small Open-Source AI Startups Are Taking on the Big Developers”, Global Venturing, May 2025.
Pascale Davies, “This French Start-Up Just Proved OpenAI Wrong”, Euronews, Apr. 2024.
Nikkei Digital Governance — Mickey Mouse copyright & generative AI, Mar. 2024.
Kate Knibbs, “Sex, Drugs, and AI Mickey Mouse”, Wired, Jan. 2024.

All press, podcasts & video →