AI Does Science While You Sleep
Andrej Karpathy shipped a 3-file repo that lets an AI agent run ~100 ML experiments overnight on a single GPU. No human in the loop. This is how it works.
"Meat Computers" vs.
Agent Swarms
Karpathy opens the autoresearch README with a sci-fi vision: a future where autonomous swarms of AI agents conduct all frontier research, iterating through thousands of generations of experiments. Human researchers — the "meat computers" — are long gone from the loop.
That prologue isn't just flavor text. It frames the gap autoresearch is designed to close: traditional ML research is a serial, human-bottlenecked process. You hypothesize, write code, train a model, check the results, and repeat. But you also eat lunch, attend meetings, and sleep. Eight experiments on a good day.
What if you gave that loop to an AI agent and went to bed?
[Timeline comparison: a human researcher's 24 hours vs. an autoresearch agent's 24 hours]
Three Files. That's It.
The autoresearch repo is intentionally tiny. No distributed training, no complex configs, no external dependencies beyond PyTorch. The entire system lives in three files, each with a clearly defined role and owner.
prepare.py
Nobody edits. Fixed infrastructure: downloads data, trains a BPE tokenizer, provides the dataloader and evaluation harness. Set up once by the human, then locked.
train.py
Agent edits. The playground: the full GPT model, the Muon + AdamW optimizer, and the training loop. The AI agent modifies architecture, hyperparameters, batch size — everything is fair game.
program.md
You write this. The "research director": a markdown document that instructs the agent what to try. This is where the researcher's judgment and strategy live.
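To make the division of labor concrete, here is a hypothetical sketch of the kind of knobs train.py exposes that the agent is free to turn. The names and defaults below are illustrative, not the repo's actual code:

```python
from dataclasses import dataclass

# Illustrative only: a config of the sort an agent might edit in train.py.
# Field names and defaults are assumptions, not taken from the repo.
@dataclass
class TrainConfig:
    n_layer: int = 12        # transformer depth
    n_head: int = 12         # attention heads per layer
    n_embd: int = 768        # embedding width
    batch_size: int = 32     # sequences per step
    learning_rate: float = 6e-4
    weight_decay: float = 0.1

# The agent's "edit" is often just flipping one of these values:
baseline = TrainConfig()
experiment = TrainConfig(n_head=16, learning_rate=4e-4)
```

Because prepare.py is frozen, every experiment differs from the last only by edits like these, which is what keeps overnight runs comparable.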
Modify, Train, Evaluate, Repeat
Each iteration of the autoresearch loop follows a fixed protocol. The agent reads its instructions, makes a change, trains for exactly 5 minutes, and checks if the model got better. Then it loops.
Read Strategy
Agent reads program.md for context: what's been tried, what to explore next, constraints, and success criteria.
Edit Code
Agent modifies train.py — changes model architecture, tweaks learning rate, adjusts batch size, or tries a new optimizer schedule.
Train (5 min)
Training runs for a fixed 5-minute wall-clock budget, excluding startup and compilation. Every experiment gets the same time, regardless of architecture changes.
Evaluate
Agent checks val_bpb (validation bits per byte) against the previous best. Lower means the model compresses unseen text more efficiently.
Keep or Discard?
If val_bpb improved → keep the changes. If not → revert and try something different. Either way, loop back to step 2.
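The keep-or-revert protocol above is a greedy loop, and its control flow fits in a few lines. This is a minimal sketch, not the repo's actual driver; `run_experiment` stands in for "edit train.py, train for 5 minutes, report val_bpb":

```python
def keep_or_revert(run_experiment, n_iters, initial_bpb):
    """Greedy research loop: run an experiment each iteration, keep the
    change only if it lowers val_bpb, otherwise revert to the previous best.
    `run_experiment(step)` is a placeholder for one edit-train-eval cycle."""
    best = initial_bpb
    log = []
    for step in range(n_iters):
        candidate_bpb = run_experiment(step)
        kept = candidate_bpb < best
        if kept:
            best = candidate_bpb  # keep: this becomes the new baseline
        # if not kept, the edit is reverted and best stays unchanged
        log.append({"step": step, "val_bpb": candidate_bpb, "kept": kept})
    return best, log

# Stubbed usage: four experiments against a 1.45 baseline.
best, log = keep_or_revert(lambda s: [1.50, 1.30, 1.40, 1.20][s], 4, 1.45)
```

Run overnight, `log` becomes the experiment history the researcher reads in the morning.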
Why val_bpb?
val_bpb stands for validation bits per byte. It measures how efficiently the model compresses unseen text. Lower is better — fewer bits to represent each byte of real text.
Why this metric specifically? Because it's vocabulary-size-independent. If the agent changes the tokenizer architecture or vocabulary, experiments are still directly comparable. Without this property, an agent that expands vocab would appear to "improve" by gaming the metric.
Think of it like measuring fuel efficiency in miles-per-gallon regardless of whether you changed the engine, the tires, or the fuel type.
Programming the Research Director
Here's the paradigm shift: you're no longer writing Python. You're writing a markdown document that tells an AI what kind of research to do. program.md is essentially a system prompt for a research agent.
It contains context about the codebase, what's been tried, what to explore next, constraints, and success criteria. Karpathy's default version is intentionally barebones — the obvious next step is iterating on the research instructions themselves. Meta-research: optimizing the document that guides the optimizer.
The implication is profound: the researcher's job shifts from "writing code" to "writing strategy documents for AI agents." Research becomes management.
# Autoresearch Program
You are an autonomous ML researcher.
Your goal: minimize val_bpb on the
validation set.
## Constraints
- Only modify train.py
- Training budget: 5 minutes
- Keep changes incremental
## Strategy
Try architectural improvements,
optimizer tuning, and regularization.
Keep what works, revert what doesn't.
# Autoresearch Program v2
You are an autonomous ML researcher
working on a GPT language model.
## Context
- Base model: 124M params, 12 layers
- Current best val_bpb: 1.21
- Hardware: 1x H100 80GB
## Priority Queue
1. Try Mixture-of-Experts (2 experts)
2. Experiment with RoPE embeddings
3. Test gradient accumulation 2x→4x
4. Explore KV-cache optimizations
## Anti-patterns (don't repeat)
- Expanding vocab beyond 32K (OOM)
- Flash attention v1 (use v2)
- Batch size > 64 on this hardware
## Success criteria
Any improvement to val_bpb is worth
keeping. Log your reasoning.
You manage the agent. The agent manages the code. The code manages the GPU.
What Actually Happens Overnight
The agent makes measurable progress. Karpathy described waking up to a log of autonomous experiments, each one building on the last: "This is what post-AGI feels like... I didn't touch anything."
What kinds of changes does the agent make? Architecture tweaks (attention head count, layer depth), optimizer hyperparameters, batch size adjustments, learning rate schedules. Some experiments improve. Many don't. The agent discards failures and builds on successes — exactly like a human researcher, but at 10x the pace.
The Bigger Picture
Autoresearch is intentionally minimal — 3 files, 1 GPU, MIT license. But it demonstrates a principle that scales. The trajectory is clear: single agent on 1 GPU → multiple specialized agents → agent swarms across clusters → fully autonomous research organizations.
The human role shifts permanently: from doing research to designing the research process to eventually auditing research that happened without you.
This fits Karpathy's broader arc. Each project strips another layer of complexity, making the next frontier accessible to individuals with a single GPU.
The uncomfortable question: if an agent can run 100 experiments overnight on a single GPU, what happens when it gets 1,000 GPUs?
Try It Yourself
The code is, as Karpathy writes, "the story of how it all began."
curl -LsSf https://astral.sh/uv/install.sh | sh
git clone https://github.com/karpathy/autoresearch
cd autoresearch && uv sync
uv run prepare.py && uv run train.py
Requirements: Python 3.10+, uv package manager, NVIDIA GPU (or Apple Silicon via fork)
github.com/karpathy/autoresearch · nanochat · macOS fork