David Heineman

Hey! I'm David 👋

I'm a research engineer at Meta FAIR, working on pretraining, data and evaluation for language models.

email * resume scholar github huggingface twitter

About Me

I study foundation models with a focus on methods: how we make decisions about data [1], how we draw conclusions from experiments [2], and how model behavior changes at scale [3]. I've developed new and efficient tools for measuring model behavior [4, 5], and built recipes for fully-open language models [6, 7]. Recently, I've been thinking about how methods from pretraining can help us build vision-language, speech-language and reasoning models.

Previously, I was part of the Open Language Model (OLMo) team at Ai2, advised by Kyle Lo and Jesse Dodge. Before that, I was an undergrad at Georgia Tech 🐝, fortunate to be advised by Wei Xu and work with Yao Dou and Mounica Maddela. I've spent a few summers at AWS and at Patientco, a healthcare startup. I enjoy reading, hiking, and making homebrew nitrogen cold brew.

Publications & Preprints [selected, all]

Olmo Hybrid: From Theory to Practice [models]

William Merrill*, Yanhong Li*, Tyler Romero*, Anej Svete*, Caia Costello*, ..., David Heineman, ..., Noah A. Smith, Hannaneh Hajishirzi, Ashish Sabharwal
preprint, 2026

Olmix: A Framework for Data Mixing Throughout LM Development [blog, code]

Mayee F. Chen, Tyler Murray, David Heineman, Matt Jordan, Hannaneh Hajishirzi, Christopher Ré, Luca Soldaini, Kyle Lo
ICML, 2026

Olmo 3 [blog, code, models, data]

Olmo Team (incl. David Heineman)
technical report, 2025

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces [code, leaderboard]

Mike A. Merrill*, Alexander G. Shaw*, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, ..., David Heineman, ..., Jesse Hu, Christopher Michael Rytting, Ryan Marten, Yixin Wang, Alex Dimakis, Andy Konwinski, Ludwig Schmidt
ICLR, 2026

Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation [code, data]

David Heineman, Valentin Hofmann, Ian Magnusson, Yuling Gu, Noah A. Smith, Hannaneh Hajishirzi, Kyle Lo, Jesse Dodge
NeurIPS, 2025 (Spotlight, Top 5%)

Fluid Language Model Benchmarking [code, models]

Valentin Hofmann, David Heineman, Ian Magnusson, Kyle Lo, Jesse Dodge, Maarten Sap, Pang Wei Koh, Chun Wang, Hannaneh Hajishirzi, Noah A. Smith
COLM, 2025 (Oral, Top 5%)

Establishing Task Scaling Laws via Compute-Efficient Model Ladders [code]

Akshita Bhagia*, Jiacheng Liu*, Alexander Wettig, David Heineman, Oyvind Tafjord, Ananya Harsh Jha, Luca Soldaini, Noah A. Smith, Dirk Groeneveld, Pang Wei Koh, Jesse Dodge, Hannaneh Hajishirzi
COLM, 2025

2 OLMo 2 Furious [blog, code, models, data]

Pete Walsh*, Luca Soldaini*, Dirk Groeneveld*, Kyle Lo*, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, ..., David Heineman, ..., Ali Farhadi, Noah A. Smith, Hannaneh Hajishirzi
COLM, 2025

Evaluating LLMs on Chinese Idiom Translation

Cai Yang, Yao Dou, David Heineman, Xiaofeng Wu, Wei Xu
COLM, 2025

DataDecide: How to Predict Best Pretraining Data with Small Experiments [code, models]

Ian Magnusson*, Nguyen Tai*, Ben Bogin*, David Heineman, Jena D. Hwang, Luca Soldaini, Akshita Bhagia, Jiacheng Liu, Dirk Groeneveld, Oyvind Tafjord, Noah A. Smith, Pang Wei Koh, Jesse Dodge
ICML, 2025

Improving Minimum Bayes Risk Decoding with Multi-Prompt [code]

David Heineman, Yao Dou, Wei Xu
EMNLP, 2024

Towards a Path Dependent Account of Category Fluency [code]

David Heineman, Reba Koenen, Sashank Varma
CogSci, 2024

Thresh: Unified, Customizable and Deployable Fine-Grained Text Evaluation [live tool]

David Heineman, Yao Dou, Wei Xu
EMNLP Demo, 2023

Edit-level Simplification Evaluation using SALSA 💃 [code/data, metric]

David Heineman, Yao Dou, Mounica Maddela, Wei Xu
EMNLP, 2023

LENS: A Learnable Evaluation Metric for Text Simplification [code/data, metric]

Mounica Maddela*, Yao Dou*, David Heineman, Wei Xu
ACL, 2023

* = equal contribution

Some past work

Gave a guest lecture at Georgia Tech's CS 7643 (Deep Learning) on language model pretraining methodology, data and evaluation [slides, video].
Winning submission to the Berghain challenge! [code] (🏆 1st of 1300 submissions)
Participated in Thinking Machines' Tinker Beta. I experimented with RL training with terminal environments for reproducing empirical findings in ACL papers [code].
Gave an invited talk at the Evaluating Agentic Systems social at NeurIPS 2025.
I'm trying a new system for keeping up with fresh papers in our field [code] that updates every morning. It might be helpful for others, so let me know if it is for you!
Contributed to Terminal-Bench [leaderboard, docs], a challenging benchmark for language model agents using the CLI. I believe tbench's tmux envs are a unique, new construct for our field!
A few mini-projects: a 500-line GRPO implementation; showing that LLM benchmark scores can improve by +2 pts on MATH simply by changing the vLLM version; a reproduction of branching factor; custom PyTorch kernels for Fast FFNs; and eval'ing LLMs on quant puzzles.
Spent Summer '24 in the first US cohort of Entrepreneurs First in South Park, SF 🌉 as part of a residency program. I briefly worked on a few ideas with RL for tool use, before moving to Ai2.
Maintaining the Thresh 🌾 platform, an all-purpose tool for fine-grained text generation evaluation [video], including an annotation tool builder and Python library.
Built a search engine [code] for ML / NLP conferences, indexed with ColBERT.
Wrote an LLM-based Rubik's Cube solver as a demonstration of explore/exploit behavior for reasoning (🏆 2nd place at AGI House open source hackathon).
Awarded the GT College of Computing Outstanding Undergraduate Research Award (1 of 3000+ CS students) for my undergraduate thesis work on fine-grained evaluation of LLMs.
Designed new programming assignments for CS 4650, Natural Language Processing as a teaching assistant (sampling algorithms & LLaMA fine-tuning with LoRA).
Built an air pollution complaint tracker and classifier [code] for the Georgia Environmental Protection Division (part of a larger collaboration at GT).
Awarded the PURA research grant to work on open problems in generation & evaluation (check out my Huggingface decoding visualizer extension).
Thoughts on approaching reasoning evaluation in LLMs using theories of human cognition.
pip install lens-metric - A simple library to evaluate text simplification using our LENS and LENS-SALSA LLMs on HuggingFace using only 5 lines of Python [demo].
Interned in AWS EC2 Enterprise Services, developing a prototype language model service, addressing problems in inference cost and deployment of open-source LLMs.
Earned 4th place in Georgia Tech's Wrek CTF (one of the largest greyhat hackathons in the southern US) [answers].
Helped lead Georgia Tech's CS 3510, Design and Analysis of Algorithms as a teaching assistant in Fall '21 and '22.
Interned at AWS CloudWatch Application Insights, where I built infrastructure to monitor and group telemetry data from processes running on EC2 instances to identify the root causes of problems on customers' AWS infrastructure.
Interned at Patientco (now part of Waystar), where I invented and deployed new sequence-based prediction models to predict when a patient pays their healthcare bill using their payment history (used to customize ~5% of U.S. healthcare bills).
Deployed an API to allow researchers to segment Twitter hashtags using a new segmentation model from Georgia Tech's NLP Lab.
In the pre-GPT-3 times, worked on methods for automatically grading student essays [code].

Recommendations

A few interesting corners of the internet I find worth checking out!

... to read

Bisk et al., Liberman

...

Katherine Lee et al.

Florian Ederer et al.

Nicholas Bloom et al.

... to `clone`

davidheineman/dotfiles

... to listen

Compressor Head

Generally Intelligent

Artem Kirsanov

Mary Wootters

Jia-Bin Huang

... to flip through

Games, Puzzles, and Computation by Erik Demaine

The Corrections by Jonathan Franzen

Naked Statistics by Charles Wheelan

Society Must be Defended by Michel Foucault

Oblivion by David Foster Wallace

I also enjoy trying new coffee shops. Here are some recommendations across Atlanta, which I visited during my undergrad, and a growing list across Seattle.

If you're looking for advice, feedback or just to chat about research, feel free to schedule a short (non-commercial) chat.

David Heineman
Last updated May 2026 [view source]

curl -s https://davidheineman.com/rick | bash