AI Does Science While You Sleep
Andrej Karpathy shipped a 3-file repo that lets an AI agent run ~100 ML experiments overnight on a single GPU. No human in the loop. This is how it works.
"Meat Computers" vs.
Agent Swarms
Karpathy opens the autoresearch README with a sci-fi vision: a future where autonomous swarms of AI agents conduct all frontier research, iterating through thousands of generations of experiments. Human researchers — the "meat computers" — are long gone from the loop.
That prologue isn't just flavor text. It frames the gap autoresearch is designed to close: traditional ML research is a serial, human-bottlenecked process. You hypothesize, write code, train a model, check the results, and repeat. But you also eat lunch, attend meetings, and sleep. Eight experiments on a good day.
What if you gave that loop to an AI agent and went to bed?
[Timeline comparison: a human researcher's 24 hours vs. an autoresearch agent's 24 hours]
Three Files. That's It.
The autoresearch repo is intentionally tiny. No distributed training, no complex configs, no external dependencies beyond PyTorch. The entire system lives in three files, each with a clearly defined role and owner.
prepare.py
Nobody edits. Fixed infrastructure: downloads data, trains a BPE tokenizer, provides the dataloader and evaluation harness. Set up once by the human, then locked.
train.py
Agent edits. The playground: the full GPT model, the Muon + AdamW optimizer, and the training loop. The AI agent modifies architecture, hyperparameters, batch size — everything is fair game.
program.md
You write this. The "research director": a markdown document that instructs the agent what to try. This is where the researcher's judgment and strategy live.
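To make the division of labor concrete, here is a hypothetical sketch of the kind of knobs train.py exposes that the agent is free to turn. The names and defaults below are illustrative, not the repo's actual code:

```python
from dataclasses import dataclass

# Illustrative only: a config of the sort an agent might edit in train.py.
# Field names and defaults are assumptions, not taken from the repo.
@dataclass
class TrainConfig:
    n_layer: int = 12        # transformer depth
    n_head: int = 12         # attention heads per layer
    n_embd: int = 768        # embedding width
    batch_size: int = 32     # sequences per step
    learning_rate: float = 6e-4
    weight_decay: float = 0.1

# The agent's "edit" is often just flipping one of these values:
baseline = TrainConfig()
experiment = TrainConfig(n_head=16, learning_rate=4e-4)
```

Because prepare.py is frozen, every experiment differs from the last only by edits like these, which is what keeps overnight runs comparable.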
Modify, Train, Evaluate, Repeat
Each iteration of the autoresearch loop follows a fixed protocol. The agent reads its instructions, makes a change, trains for exactly 5 minutes, and checks if the model got better. Then it loops.
Read Strategy
Agent reads program.md for context: what's been tried, what to explore next, constraints, and success criteria.
Edit Code
Agent modifies train.py — changes model architecture, tweaks learning rate, adjusts batch size, or tries a new optimizer schedule.
Train (5 min)
Training runs for a fixed 5-minute wall-clock budget, excluding startup and compilation. Every experiment gets the same time, regardless of architecture changes.
Evaluate
Agent checks val_bpb (validation bits per byte) against the previous best. Lower means the model compresses unseen text more efficiently.
Keep or Discard?
If val_bpb improved → keep the changes. If not → revert and try something different. Either way, loop back to step 2.
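The keep-or-revert protocol above is a greedy loop, and its control flow fits in a few lines. This is a minimal sketch, not the repo's actual driver; `run_experiment` stands in for "edit train.py, train for 5 minutes, report val_bpb":

```python
def keep_or_revert(run_experiment, n_iters, initial_bpb):
    """Greedy research loop: run an experiment each iteration, keep the
    change only if it lowers val_bpb, otherwise revert to the previous best.
    `run_experiment(step)` is a placeholder for one edit-train-eval cycle."""
    best = initial_bpb
    log = []
    for step in range(n_iters):
        candidate_bpb = run_experiment(step)
        kept = candidate_bpb < best
        if kept:
            best = candidate_bpb  # keep: this becomes the new baseline
        # if not kept, the edit is reverted and best stays unchanged
        log.append({"step": step, "val_bpb": candidate_bpb, "kept": kept})
    return best, log

# Stubbed usage: four experiments against a 1.45 baseline.
best, log = keep_or_revert(lambda s: [1.50, 1.30, 1.40, 1.20][s], 4, 1.45)
```

Run overnight, `log` becomes the experiment history the researcher reads in the morning.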
Why val_bpb?
val_bpb stands for validation bits per byte. It measures how efficiently the model compresses unseen text. Lower is better — fewer bits to represent each byte of real text.
Why this metric specifically? Because it's vocabulary-size-independent. If the agent changes the tokenizer architecture or vocabulary, experiments are still directly comparable. Without this property, an agent that expands vocab would appear to "improve" by gaming the metric.
Think of it like measuring fuel efficiency in miles-per-gallon regardless of whether you changed the engine, the tires, or the fuel type.
Programming the Research Director
Here's the paradigm shift: you're no longer writing Python. You're writing a markdown document that tells an AI what kind of research to do. program.md is essentially a system prompt for a research agent.
It contains context about the codebase, what's been tried, what to explore next, constraints, and success criteria. Karpathy's default version is intentionally barebones — the obvious next step is iterating on the research instructions themselves. Meta-research: optimizing the document that guides the optimizer.
The implication is profound: the researcher's job shifts from "writing code" to "writing strategy documents for AI agents." Research becomes management.
# Autoresearch Program
You are an autonomous ML researcher.
Your goal: minimize val_bpb on the
validation set.
## Constraints
- Only modify train.py
- Training budget: 5 minutes
- Keep changes incremental
## Strategy
Try architectural improvements,
optimizer tuning, and regularization.
Keep what works, revert what doesn't.
# Autoresearch Program v2
You are an autonomous ML researcher
working on a GPT language model.
## Context
- Base model: 124M params, 12 layers
- Current best val_bpb: 1.21
- Hardware: 1x H100 80GB
## Priority Queue
1. Try Mixture-of-Experts (2 experts)
2. Experiment with RoPE embeddings
3. Test gradient accumulation 2x→4x
4. Explore KV-cache optimizations
## Anti-patterns (don't repeat)
- Expanding vocab beyond 32K (OOM)
- Flash attention v1 (use v2)
- Batch size > 64 on this hardware
## Success criteria
Any improvement to val_bpb is worth
keeping. Log your reasoning.
You manage the agent. The agent manages the code. The code manages the GPU.
What Actually Happens Overnight
The agent makes measurable progress. Karpathy described waking up to a log of autonomous experiments, each one building on the last: "This is what post-AGI feels like... I didn't touch anything."
What kinds of changes does the agent make? Architecture tweaks (attention head count, layer depth), optimizer hyperparameters, batch size adjustments, learning rate schedules. Some experiments improve. Many don't. The agent discards failures and builds on successes — exactly like a human researcher, but at 10x the pace.
The Bigger Picture
Autoresearch is intentionally minimal — 3 files, 1 GPU, MIT license. But it demonstrates a principle that scales. The trajectory is clear: single agent on 1 GPU → multiple specialized agents → agent swarms across clusters → fully autonomous research organizations.
The human role shifts permanently: from doing research to designing the research process to eventually auditing research that happened without you.
This fits Karpathy's broader arc. Each project strips another layer of complexity, making the next frontier accessible to individuals with a single GPU.
The uncomfortable question: if an agent can run 100 experiments overnight on a single GPU, what happens when it gets 1,000 GPUs?
Try It Yourself
The code is, as Karpathy writes, "the story of how it all began."
curl -LsSf https://astral.sh/uv/install.sh | sh
git clone https://github.com/karpathy/autoresearch
cd autoresearch && uv sync
uv run prepare.py && uv run train.py
Requirements: Python 3.10+, uv package manager, NVIDIA GPU (or Apple Silicon via fork)
github.com/karpathy/autoresearch · nanochat · macOS fork