Karpathy Open-Sources Autoresearch: 100 AI Experiments Overnight on One GPU

March 10, 2026 | Provided by Utku Ege Tuluk

Andrej Karpathy, former director of AI at Tesla and co-founder of OpenAI, has open-sourced autoresearch — a compact 630-line Python framework that lets AI agents autonomously design, run, and evaluate machine learning experiments on a single GPU. Released on March 7, 2026, the tool has already garnered over 8,000 GitHub stars and sparked a broader conversation about the future of automated AI research.

Intermediate

Visualization of a single GPU running autonomous AI experiments with radiating data streams — Illustration generated by AI

How It Works

The core idea behind autoresearch is elegant: instead of manually tweaking model code and hyperparameters, you hand the reins to an AI coding agent. The agent modifies the training code, runs a 5-minute training session, evaluates the result using validation bits-per-byte (val_bpb) as its single optimization metric, decides whether to keep or discard the change, and then repeats. Because every experiment gets the same fixed 5-minute training budget regardless of hardware, the loop runs at about 12 experiments per hour, so a single overnight run can yield approximately 100 completed experiments.
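The keep-or-discard loop described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the autoresearch repository: `run_search`, `propose`, `evaluate`, and `experiments_per_run` are hypothetical names, the toy functions stand in for the agent editing train.py and for a real 5-minute training run, and the 8-hour night is an assumption used only for the arithmetic.

```python
import math
import random
from dataclasses import dataclass, field

TRAIN_BUDGET_MINUTES = 5  # fixed per-experiment training budget, as stated in the article


def experiments_per_run(hours, overhead_minutes=0.0):
    """Number of fixed-budget experiments that fit into a run of the given length."""
    minutes_per_experiment = TRAIN_BUDGET_MINUTES + overhead_minutes
    return int(hours * 60 // minutes_per_experiment)


@dataclass
class SearchState:
    best_config: dict
    best_val_bpb: float
    history: list = field(default_factory=list)


def run_search(initial_config, propose, evaluate, num_experiments):
    """Greedy loop: a proposed change is kept only if it lowers val_bpb."""
    state = SearchState(dict(initial_config), evaluate(initial_config))
    for _ in range(num_experiments):
        candidate = propose(state.best_config)
        val_bpb = evaluate(candidate)
        kept = val_bpb < state.best_val_bpb  # lower bits-per-byte is better
        state.history.append((candidate, val_bpb, kept))
        if kept:
            state.best_config, state.best_val_bpb = candidate, val_bpb
    return state


# Toy stand-ins (hypothetical): a real setup would have the agent edit
# train.py and would read val_bpb from an actual training run.
rng = random.Random(0)


def toy_propose(config):
    new_config = dict(config)
    new_config["lr"] = config["lr"] * rng.choice([0.5, 2.0])
    return new_config


def toy_evaluate(config):
    # Synthetic score with its minimum at lr = 0.01.
    return 1.0 + abs(math.log10(config["lr"]) + 2.0)


if __name__ == "__main__":
    n = experiments_per_run(hours=8)  # assumes an 8-hour night and no overhead
    result = run_search({"lr": 1.0}, toy_propose, toy_evaluate, n)
    print(n, "experiments, best val_bpb:", round(result.best_val_bpb, 3))
```

With an 8-hour run and no per-experiment overhead, `experiments_per_run` gives 96, which is in line with the article's "approximately 100". Any setup or evaluation time outside the 5-minute budget would lower that count.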

The system is built around just three files:

prepare.py — One-time data preparation: downloads training data, trains a BPE tokenizer, and defines dataloaders and evaluation functions. This file is never modified by the agent.
train.py — The sole file the agent edits. It contains the complete GPT model definition, optimizer implementations (Muon + AdamW), and the training loop. Everything is fair game: architecture, hyperparameters, optimizer selection, and batch sizes.
program.md — A Markdown file that humans edit to give the agent its research objectives and constraints. As Karpathy puts it, “you are not touching any Python files like you normally would as a researcher. Instead, you are programming the program.md.”

Technical Requirements and Setup

Autoresearch is deliberately minimal. It requires Python 3.10+, the uv package manager, and a single NVIDIA GPU (tested on H100). Beyond PyTorch and a few small packages, there are no external dependencies: no distributed training and no complex configuration files. Setup takes about 2 minutes: install uv, sync dependencies, run prepare.py to download data and train the tokenizer, then launch experiments with uv run train.py.
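Before starting setup, the prerequisites listed above can be checked with a short script. This is a sketch that uses only the Python standard library and is not part of autoresearch. The `preflight` function is a hypothetical helper, and finding `nvidia-smi` on the PATH is only a rough proxy for a usable NVIDIA GPU.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # "Python 3.10+", per the requirements above


def preflight(version_info=sys.version_info, which=shutil.which):
    """Return a mapping of requirement name to True/False."""
    return {
        "python>=3.10": tuple(version_info[:2]) >= MIN_PYTHON,
        "uv on PATH": which("uv") is not None,
        # nvidia-smi ships with the NVIDIA driver; finding it does not
        # prove the GPU is usable from PyTorch.
        "nvidia-smi on PATH": which("nvidia-smi") is not None,
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(("ok      " if ok else "MISSING ") + name)
```

The two parameters exist only so the check can be exercised without the real tools installed. Calling `preflight()` with no arguments inspects the current machine.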

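The fixed 5-minute training budget is a key design choice. Because every run on a given machine gets the same wall-clock budget, experiments stay directly comparable no matter what the agent changes, whether model size, batch size, or architecture. The vocab-size-independent bits-per-byte metric keeps comparisons fair even when a change alters the vocabulary size or the architecture. One caveat: a faster GPU completes more training steps in the same 5 minutes, so scores from different hardware are not directly comparable.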
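To see why bits-per-byte does not depend on vocabulary size, it helps to write the metric out. The sketch below uses the standard definition, total cross-entropy in nats divided by ln 2 times the number of UTF-8 bytes in the evaluated text. Whether prepare.py computes val_bpb in exactly this way is an assumption, and the function names and the two tokenizers in the example are hypothetical.

```python
import math


def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed cross-entropy (in nats) over a text into bits per byte."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)


def bpb_from_mean_token_loss(mean_loss_nats, num_tokens, num_bytes):
    """Same metric, starting from the usual mean per-token loss."""
    return bits_per_byte(mean_loss_nats * num_tokens, num_bytes)


# Two hypothetical tokenizers applied to the same 1000-byte text.
# Coarse tokenizer: 250 tokens at 4 bits per token.
coarse = bpb_from_mean_token_loss(4 * math.log(2), num_tokens=250, num_bytes=1000)
# Fine tokenizer: 500 tokens at 2 bits per token.
fine = bpb_from_mean_token_loss(2 * math.log(2), num_tokens=500, num_bytes=1000)
```

The per-token losses differ by a factor of two, yet both models spend 1000 bits on the same 1000 bytes, so both score 1.0 bits per byte. A per-token loss would wrongly favor the fine tokenizer. Normalizing by bytes removes that bias.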

Community Reception and Debate

The release has sparked vigorous discussion in the AI community. Supporters see autoresearch as a paradigm shift — one commenter on Hacker News adapted the pattern for “adversarial protocol hardening,” discovering edge cases that 359 hand-written tests had missed. Others see it as a practical demonstration of how AI can automate any task with an objective, verifiable metric.

Critics, however, raise important questions. Some argue the improvements mostly come from hyperparameter tweaking rather than genuinely novel research directions, a concern echoed by Karpathy himself, who noted that current models feel “very ‘cagy’ and ‘scared’” when tackling open-ended research problems. Others invoke Goodhart’s Law, warning that optimizing a single metric without deeper understanding risks brute-force discovery masquerading as research.

The community has also rallied to extend the project. An MLX port already enables autoresearch to run natively on Apple Silicon without PyTorch or CUDA, broadening access to Mac users.

What This Means

Autoresearch represents a compelling proof of concept for automated ML research. While it won’t replace human researchers anytime soon — the tool excels at optimization within a well-defined search space, not at formulating novel hypotheses — it demonstrates how AI agents can dramatically accelerate the experimental grind that consumes much of a researcher’s time. For students and independent researchers with limited compute, the single-GPU design is particularly valuable: a night’s sleep becomes a 100-experiment research sprint.

The broader implication is clear: as AI coding agents improve, the bottleneck in ML research may shift from running experiments to asking the right questions.

Related Coverage

Effortless Git Repository Visualization with RenderGit — an earlier look at Karpathy’s open-source tooling contributions

