March 2026 — Dispatch

AI Does Science
While You Sleep

Andrej Karpathy shipped a 3-file repo that lets an AI agent run ~100 ML experiments overnight on a single GPU. No human in the loop. This is how it works.

3 files in repo · ~100 experiments per night · 5 min per experiment · 1 GPU required
Chapter I

"Meat Computers" vs.
Agent Swarms

Karpathy opens the autoresearch README with a sci-fi vision: a future where autonomous swarms of AI agents conduct all frontier research, iterating through thousands of generations of experiments. Human researchers — the "meat computers" — are long gone from the loop.

That prologue isn't just flavor text. It frames the gap autoresearch is designed to close: traditional ML research is a serial, human-bottlenecked process. You hypothesize, write code, train a model, check the results, and repeat. But you also eat lunch, attend meetings, and sleep. Eight experiments on a good day.

What if you gave that loop to an AI agent and went to bed?

Attribution
Released March 2026 by Andrej Karpathy, building on his nanochat project. MIT license. 3.3k stars and growing.

Human Researcher — 24 Hours

6am · 8am #1–2 · 10am #3 · 12pm lunch · 1pm #4–5 · 3pm mtg · 4pm #6–8 · 6pm done · 10pm sleep
~8 experiments / day

Autoresearch Agent — 24 Hours

6am #1–12 · 8am #13–24 · 10am #25–36 · 12pm #37–48 · 2pm #49–60 · 4pm #61–72 · 8pm #73–96 · 12am #97–120 · 6am #121+
~100+ experiments / night
Chapter II

Three Files. That's It.

The autoresearch repo is intentionally tiny. No distributed training, no complex configs, no external dependencies beyond PyTorch. The entire system lives in three files, each with a clearly defined role and owner.

🔒

prepare.py

Nobody edits

Fixed infrastructure. Downloads data, trains a BPE tokenizer, provides dataloader and evaluation harness. Set up once by the human, then locked.

🤖

train.py

Agent edits

The playground. Full GPT model, Muon + AdamW optimizer, training loop. The AI agent modifies architecture, hyperparameters, batch size — everything is fair game.

✍️

program.md

You write this

The "research director." A markdown document that instructs the agent what to try. This is where the researcher's judgment and strategy lives.

[Diagram: prepare.py (fixed infrastructure 🔒) feeds data to train.py (the agent's sandbox 🤖); program.md (your strategy ✍️) supplies the strategy.]
Chapter III

Modify, Train,
Evaluate, Repeat

Each iteration of the autoresearch loop follows a fixed protocol. The agent reads its instructions, makes a change, trains for exactly 5 minutes, and checks if the model got better. Then it loops.

1

Read Strategy

Agent reads program.md for context: what's been tried, what to explore next, constraints, and success criteria.

program.md
2

Edit Code

Agent modifies train.py — changes model architecture, tweaks learning rate, adjusts batch size, or tries a new optimizer schedule.

train.py
3

Train (5 min)

Training runs for a fixed 5-minute wall-clock budget, excluding startup and compilation. Every experiment gets the same time, regardless of architecture changes.

wall clock
4

Evaluate

Agent checks val_bpb (validation bits per byte) against the previous best. Lower means the model compresses unseen text more efficiently.

val_bpb
5

Keep or Discard?

If val_bpb improved → keep the changes. If not → revert and try something different. Either way, loop back to step 2.

✓ keep ✗ revert
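The five-step protocol can be sketched in a few lines of Python. Everything below is a hypothetical reconstruction, not the repo's actual code: `autoresearch_loop`, `fake_experiment`, and the noisy improvement curve are illustrative assumptions.

```python
import random

def autoresearch_loop(propose_and_train, n_experiments=100, seed=0):
    """Outer keep-or-revert loop. Each call to propose_and_train stands in
    for steps 1-4 (read strategy, edit train.py, train 5 min, evaluate);
    the comparison below is step 5 (keep or discard)."""
    random.seed(seed)
    best = float("inf")
    history = []
    for i in range(n_experiments):
        val_bpb = propose_and_train(i)   # returns this experiment's val_bpb
        kept = val_bpb < best            # lower is better
        if kept:
            best = val_bpb               # keep the edit to train.py
        # else: revert train.py to the last kept version
        history.append((i, val_bpb, kept))
    return best, history

def fake_experiment(i):
    # Toy stand-in for a real 5-minute training run: a slow downward
    # trend plus noise, so some experiments regress and get reverted.
    return 1.5 - 0.004 * i + random.uniform(-0.05, 0.05)

best, history = autoresearch_loop(fake_experiment, n_experiments=100)
print(f"best val_bpb after {len(history)} experiments: {best:.3f}")
```

The key property is that a regression never sticks: `best` only ever moves down, which is what produces the sawtooth-but-downward curve described later.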
Key insight
The 5-minute fixed budget isn't a limitation; it's a design choice. It makes every experiment directly comparable regardless of what the agent changed (model size, batch size, architecture). It also means autoresearch converges on the model best suited to your specific hardware.
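The fixed wall-clock budget can be expressed directly. This is a sketch under stated assumptions (a `step_fn` callback and a warmup phase to approximate the startup/compilation exclusion), not the repo's actual timing code.

```python
import time

def train_with_budget(step_fn, budget_s=300.0, warmup_steps=10):
    """Run training steps until a fixed wall-clock budget expires.
    The clock starts only after warmup_steps, so one-time costs such as
    startup and compilation aren't charged against the budget
    (an assumption about how the exclusion is implemented)."""
    for _ in range(warmup_steps):
        step_fn()                      # neither counted nor timed
    start = time.monotonic()
    steps = 0
    while time.monotonic() - start < budget_s:
        step_fn()
        steps += 1
    return steps

# tiny demo budget so the sketch runs instantly; a real run would use budget_s=300
steps = train_with_budget(lambda: time.sleep(0.001), budget_s=0.05, warmup_steps=2)
```

Because the budget is wall-clock rather than a step count, a bigger model simply takes fewer steps in its 5 minutes, which is exactly what keeps experiments comparable.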
~12 experiments per hour · ~100 experiments overnight
Chapter IV

Why val_bpb?

val_bpb stands for validation bits per byte. It measures how efficiently the model compresses unseen text. Lower is better — fewer bits to represent each byte of real text.

Why this metric specifically? Because it's vocabulary-size-independent. If the agent changes the tokenizer architecture or vocabulary, experiments are still directly comparable. Without this property, an agent that expands vocab would appear to "improve" by gaming the metric.

Think of it like measuring fuel efficiency in miles-per-gallon regardless of whether you changed the engine, the tires, or the fuel type.
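The metric itself is a one-line conversion. The sketch below assumes the validation loss is summed in nats (as PyTorch's `cross_entropy` reports it) and shows why dividing by raw bytes, not tokens, neutralizes tokenizer changes; the numbers are illustrative.

```python
import math

def val_bpb(total_loss_nats: float, total_bytes: int) -> float:
    """Bits per byte: total cross-entropy over the validation text,
    converted from nats to bits, divided by the text's raw byte length.
    Dividing by bytes (not tokens) makes the metric vocabulary-independent."""
    return total_loss_nats / math.log(2) / total_bytes

# Same 1000-byte text under two tokenizers: coarse (250 tokens) and
# fine (500 tokens). Per-token loss differs; total nats is what matters.
coarse = val_bpb(total_loss_nats=250 * 3.6, total_bytes=1000)  # 3.6 nats/token
fine = val_bpb(total_loss_nats=500 * 1.8, total_bytes=1000)    # 1.8 nats/token
print(round(coarse, 3), round(fine, 3))  # same bpb despite different vocab sizes
```

A per-token metric would have rewarded the coarser tokenizer for free; per-byte, the two runs score identically.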

Scale: 0.8 bpb excellent · 1.4 bpb decent · 2.0 bpb poor

1.30 bpb (good compression): the model needs only 1.30 bits to encode each byte of unseen text, like compressing a book to about 16% of its original size.

Experiment results (val_bpb): Run #1: 1.52 · Run #2: 1.38 · Run #3: 1.41 · Run #4: 1.21 · Run #5: 1.14 ✓ best
Chapter V

Programming the
Research Director

Here's the paradigm shift: you're no longer writing Python. You're writing a markdown document that tells an AI what kind of research to do. program.md is essentially a system prompt for a research agent.

It contains context about the codebase, what's been tried, what to explore next, constraints, and success criteria. Karpathy's default version is intentionally barebones — the obvious next step is iterating on the research instructions themselves. Meta-research: optimizing the document that guides the optimizer.

The implication is profound: the researcher's job shifts from "writing code" to "writing strategy documents for AI agents." Research becomes management.

# Autoresearch Program

You are an autonomous ML researcher.
Your goal: minimize val_bpb on the
validation set.

## Constraints
- Only modify train.py
- Training budget: 5 minutes
- Keep changes incremental

## Strategy
Try architectural improvements,
optimizer tuning, and regularization.
Keep what works, revert what doesn't.

# Autoresearch Program v2

You are an autonomous ML researcher
working on a GPT language model.

## Context
- Base model: 124M params, 12 layers
- Current best val_bpb: 1.21
- Hardware: 1x H100 80GB

## Priority Queue
1. Try Mixture-of-Experts (2 experts)
2. Experiment with RoPE embeddings
3. Test gradient accumulation 2x→4x
4. Explore KV-cache optimizations

## Anti-patterns (don't repeat)
- Expanding vocab beyond 32K (OOM)
- Flash attention v1 (use v2)
- Batch size > 64 on this hardware

## Success criteria
Any improvement to val_bpb is worth
keeping. Log your reasoning.
🧑‍💻
Human
writes program.md
🤖
AI Agent
reads program.md, edits train.py
🔹
GPU
runs experiments

You manage the agent. The agent manages the code. The code manages the GPU.
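If program.md really is a system prompt, the director's plumbing reduces to string assembly. A hypothetical sketch; the path handling, section names, and log format are all assumptions, not the repo's actual prompt layout.

```python
import os
import tempfile
from pathlib import Path

def build_prompt(program_path: str, log: list[str]) -> str:
    """Assemble the agent's context: the strategy document first, then a
    tail of the experiment log, then the task instruction (hypothetical)."""
    program = Path(program_path).read_text()
    history = "\n".join(log[-20:])  # only recent runs, to fit the context window
    return (f"{program}\n\n## Experiment log (most recent last)\n{history}\n\n"
            "Propose exactly one edit to train.py and justify it.")

# demo with a throwaway program.md rather than touching a real checkout
with tempfile.TemporaryDirectory() as d:
    demo = os.path.join(d, "program.md")
    Path(demo).write_text("# Autoresearch Program\nMinimize val_bpb.")
    prompt = build_prompt(demo, ["run #1: 1.52 bpb (kept)", "run #2: 1.38 bpb (kept)"])
```

Editing program.md changes every future prompt, which is why iterating on the strategy document is the natural next lever after iterating on train.py.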

Chapter VI

What Actually Happens
Overnight

The agent makes measurable progress. Karpathy described waking up to a log of autonomous experiments, each one building on the last: "This is what post-AGI feels like... I didn't touch anything."

What kinds of changes does the agent make? Architecture tweaks (attention head count, layer depth), optimizer hyperparameters, batch size adjustments, learning rate schedules. Some experiments improve. Many don't. The agent discards failures and builds on successes — exactly like a human researcher, but at 10x the pace.

[Chart: Overnight Progress, val_bpb (1.55 down to 0.95) across experiments #0–100, annotated "Larger model ✓", "Changed attention ✗", "Lower LR ✓"]
Pattern
The sawtooth pattern is characteristic: spikes upward when the agent tries something that doesn't work (immediately reverted), and gradual steps downward when improvements stick. Over 100 experiments, the trend is clearly downward — the model gets better.
Chapter VII

The Bigger Picture

Autoresearch is intentionally minimal — 3 files, 1 GPU, MIT license. But it demonstrates a principle that scales. The trajectory is clear: single agent on 1 GPU → multiple specialized agents → agent swarms across clusters → fully autonomous research organizations.

The human role shifts permanently: from doing research to designing the research process to eventually auditing research that happened without you.

This fits Karpathy's broader arc. Each project strips another layer of complexity, making the next frontier accessible to individuals with a single GPU.

micrograd (2020, autograd from scratch) → nanoGPT (2023, training from scratch) → nanochat (2025, the $100 ChatGPT) → autoresearch (2026, autonomous research). Increasingly autonomous →

The uncomfortable question: if an agent can run 100 experiments overnight on a single GPU, what happens when it gets 1,000 GPUs?

Mission Complete

Try It Yourself

The code is, as Karpathy writes, "the story of how it all began."

3 files · 1 GPU · MIT license
terminal
curl -LsSf https://astral.sh/uv/install.sh | sh
git clone https://github.com/karpathy/autoresearch
cd autoresearch && uv sync
uv run prepare.py && uv run train.py

Requirements: Python 3.10+, uv package manager, NVIDIA GPU (or Apple Silicon via fork)
github.com/karpathy/autoresearch · nanochat · macOS fork

Made with scrolly.to by Jerry Soer