Autoresearch and Context Rot — How a Stateless Agent Loop Avoids Memory Problems (And Where It Breaks)
13th March 2026
The autoresearch pattern — where a coding agent runs hundreds of autonomous experiments to optimize code — produced a 53% speedup on Shopify’s 20-year-old Liquid codebase and a 69x speedup on a demo text processor. But there’s a fundamental flaw nobody talks about: the agent has no memory of failed experiments. Here’s exactly how the pattern works, where it breaks, and how Tobi Lütke’s team quietly fixed it.
What Autoresearch Actually Is
Strip away the naming and autoresearch is five files and a loop:
autoresearch.md ← instructions: "optimize text_processor.py, one change at a time"
text_processor.py ← the code being optimized (ONLY file agent edits)
test_text_processor.py ← 51 unit tests (correctness gate)
benchmark.py ← measures execution time (performance gate)
autoresearch.sh ← runs pytest + benchmark, prints one number
The loop:
while True:
    agent("make it faster")     # no history, no memory
    run("./autoresearch.sh")    # pytest + benchmark
    if worse:
        run("git revert")
That’s the entire “framework.” A shell script that runs tests and prints a number. The agent reads the number, decides if it improved, keeps or reverts. Then does it again with zero memory of the previous cycle.
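That pseudocode translates almost directly into runnable Python. A minimal sketch, not the actual harness: the `agent` callable and `./autoresearch.sh` are stand-ins for whatever agent and script you use, and a rejected edit is thrown away with `git restore` here, since the change hasn't been committed yet:

```python
import re
import subprocess

def parse_metric(output):
    """Pull out the one number the harness prints, e.g. 'METRIC combined_us=4220'."""
    m = re.search(r"METRIC combined_us=(\d+)", output)
    return int(m.group(1)) if m else None

def run_cycle(agent, best_us):
    """One stateless cycle: edit, measure, keep or revert. Returns the new best."""
    agent("make it faster")  # hypothetical agent call -- it gets no history
    result = subprocess.run(["./autoresearch.sh"], capture_output=True, text=True)
    metric = parse_metric(result.stdout)
    if result.returncode != 0 or metric is None or metric >= best_us:
        # tests failed or no improvement: discard the uncommitted edit
        subprocess.run(["git", "restore", "text_processor.py"])
        return best_us
    subprocess.run(["git", "commit", "-am", f"optimization: {metric}us"])
    return metric
```

The whole decision procedure is one comparison against a single number, which is exactly why the agent needs no memory to run it.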
How Data Flows Through the System
Every cycle is identical — the agent starts completely fresh:
CYCLE START (agent has zero memory)
═══════════════════════════════════
Step 1: Agent reads everything fresh
─────────────────────────────────────
┌─────────────────────┐
│ autoresearch.md │ "Optimize text_processor.py"
│ (56 lines) │ "One change at a time"
│ │ "Run ./autoresearch.sh"
└────────┬────────────┘
│ read tool
▼
┌─────────────────────┐
│ text_processor.py │ def sort_words(text):
│ (107 lines) │ words = text.split()
│ │ # BUBBLE SORT ← agent sees this
│ THIS IS THE ONLY │ for i in range(len(words)):
│ FILE AGENT EDITS │ for j in range(i+1, len(words)):
└────────┬────────────┘ if words[i] > words[j]:
│ read tool words[i], words[j] = ...
▼
┌──────────────────────────────────────────────────────┐
│ LLM │
│ │
│ System: [autoresearch.md instructions] │
│ Context: [text_processor.py code] │
│ │
│ "bubble sort is O(n²), sorted() is O(n log n) │
│ I'll replace it" │
└────────┬─────────────────────────────────────────────┘
│ edit tool
▼
Step 2: Agent makes ONE change
──────────────────────────────
BEFORE: AFTER:
┌──────────────────────┐ ┌──────────────────────┐
│ for i in range(...): │ ──► │ return sorted(words) │
│ for j in range(..):│ │ │
│ if words[i]>...: │ │ │
│ swap │ │ │
└──────────────────────┘ └──────────────────────┘
Step 3: Agent runs autoresearch.sh
──────────────────────────────────
┌──── autoresearch.sh ───────────────────────────────────┐
│ │
│ Step A: pytest │
│ ┌───────────────────────────────┐ │
│ │ test_text_processor.py │ │
│ │ (51 unit tests) │ │
│ │ 51 passed ✓ │── PASS ──► │
│ └───────────────────────────────┘ │ │
│ ▼ │
│ Step B: benchmark.py │
│ ┌───────────────────────────────┐ │
│ │ warmup × 3 │ │
│ │ measure × 10 (best of 10) │ │
│ │ combined_us=4220 │ │
│ └───────────────────────────────┘ │
│ │
│ echo "METRIC combined_us=4220" ◄── ALL THE AGENT │
│ exit 0 GETS BACK │
└─────────────────────────────────────────────────────────┘
│
│ tool result: "51 passed ✓ ... METRIC combined_us=4220"
▼
Step 4: LLM decides
────────────────────
"Tests passed ✓. combined_us went from 8500 → 4220.
That's a 50% improvement. I'll commit."
│ bash tool
▼
┌─────────────────┐
│ Git History │
│ │
│ abc123 sort_words: use sorted() — 4220µs ◄── NEW
│ def456 Initial setup — 8500µs
└─────────────────┘
How the Agent “Remembers” Without Memory
The next cycle, the agent reads the code fresh. It has zero memory of cycle 1. But it doesn’t need it — the code tells it what’s already been done:
CYCLE 2 (agent has ZERO memory of cycle 1)
═══════════════════════════════════════════
Agent reads text_processor.py:
def sort_words(text):
    return sorted(text.split())      ← ALREADY OPTIMIZED
Agent sees this. Skips it.
def word_frequency(text):
    counts = {}
    for w in text.split():
        found = False
        for k in counts:             ← O(n²) loop! Agent spots this.
            if k == w:
                counts[k] += 1
Agent doesn't REMEMBER cycle 1.
It SEES the result of cycle 1 in the code.
The code IS the memory of all successful optimizations.
This is externalized memory — instead of the agent storing state internally (conversation history), the state lives in the world (files, git, test output). Each cycle reads fresh state from disk.
The Context Rot Problem That Doesn’t Exist
Autoresearch avoids context rot entirely by design. Compare:
TYPICAL AGENT (context grows):
Turn 1: system_prompt + user_msg = 2K tokens
Turn 5: system_prompt + 5 turns + tool results = 15K tokens
Turn 20: system_prompt + 20 turns + tool results = 60K tokens
Turn 50: system_prompt + 50 turns + tool results = 150K tokens
↑ context rot zone
AUTORESEARCH (context stays flat):
Cycle 1: read brief + read code + run test = ~1,300 tokens
Cycle 50: read brief + read code + run test = ~1,300 tokens
Cycle 120: read brief + read code + run test = ~1,300 tokens
↑ always fresh
The insight: don’t manage context rot — avoid it by making every cycle read fresh state from disk instead of accumulating conversation history. The agent never had to remember experiment #1 while running experiment #120.
The Hole Nobody Talks About — Failed Experiments Have No Memory
Here’s what actually happens when we run 5 optimization cycles on already-optimized code. I tested this on a text processor that was already at 582µs:
CYCLE WHAT HAPPENED RESULT TRACE LEFT?
───── ───────────────────────────────────────── ───────── ───────────
1 collections.Counter for word_frequency WORSE ✗ NONE — reverted
2 str.translate table for caesar_cipher BETTER ✓ YES — in code + git
3 Compiled regex at module level WORSE ✗ NONE — reverted
4 str.split instead of regex BETTER ✓ YES — in code + git
5 Compiled regex at module level WORSE ✗ NONE — reverted
↑↑↑ EXACT SAME as cycle 3 ↑↑↑
Cycle 5 retried the exact same compiled regex idea that failed in cycle 3. No memory of the failure. Wasted cycle. The git log confirms no trace:
$ git log --oneline
2f6881e word_frequency: use str.split + strip instead of regex — 552→546µs
8d11221 caesar_cipher: use str.translate table — 22x faster (45→2µs)
24224c5 Optimize all remaining functions: set-based unique, str.find, ...
1b517f8 sort_words: replace bubble sort with sorted() — 73% faster
8d2cae4 word_frequency: replace O(n²) counting with dict.get — 85% faster
Failed attempts? NOT IN GIT. Reverted. Gone.
What Has Memory vs What Doesn’t
SUCCESSES (encoded in code) FAILURES (gone forever)
═════════════════════════════ ═══════════════════════
text_processor.py line 60: ??? Counter was slower
text.translate(table) ??? Compiled regex was slower
↑ agent sees this, won't ↑ agent has NO IDEA,
re-optimize caesar_cipher WILL retry these
Git log: Git log:
"caesar_cipher: str.translate" (nothing — reverted changes
"word_frequency: dict.get" leave no commit)
↑ successes recorded ↑ failures invisible
For micro-optimizations on already-optimized code where most attempts fail:
Unique ideas to try: ~20
Successful: ~8-10
Failed: ~10-12
In 120 cycles:
~10 successful (each tried once, kept)
~12 unique failures (first attempt)
~98 DUPLICATE RETRIES of those 12 failures ← wasted
~82% of cycles wasted after the easy wins are taken
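That arithmetic can be sanity-checked with a toy simulation. This is an illustrative model, not a measurement, and the parameters are the estimates above: successful ideas leave the candidate pool (the code now encodes them), failed ideas stay in the pool, and a memoryless agent redraws from whatever is left:

```python
import random

def wasted_cycles(cycles=120, wins=10, fails=12, seed=42):
    """Count duplicate retries of known-failed ideas by a memoryless agent."""
    rng = random.Random(seed)
    pool = [("win", i) for i in range(wins)] + [("fail", i) for i in range(fails)]
    tried_failures, wasted = set(), 0
    for _ in range(cycles):
        idea = rng.choice(pool)
        if idea[0] == "win":
            pool.remove(idea)            # kept in code: visible, never retried
        elif idea in tried_failures:
            wasted += 1                  # no record of the failure: retried
        else:
            tried_failures.add(idea)     # first attempt, not wasted yet
    return wasted
```

With the defaults, at most 10 cycles consume wins and at most 12 are first attempts at failures, so at least 98 of 120 cycles are duplicate retries, which is where the ~82% figure comes from.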
How Tobi Lütke’s Team Fixed It
Look closely at what Tobi actually used:
“He used Pi as the coding agent and released a new pi-autoresearch plugin in collaboration with David Cortés, which maintains state in an autoresearch.jsonl file.”
That autoresearch.jsonl is the fix. It’s a structured log of every experiment — both successes AND failures:
KARPATHY (original) TOBI (pi-autoresearch plugin)
═══════════════════ ══════════════════════════════
autoresearch.md ✓ autoresearch.md ✓
autoresearch.sh ✓ autoresearch.sh ✓
failures memory ✗ autoresearch.jsonl ✓ ← THE FIX
│
▼
{"experiment": 47, "change": "compiled regex for tag scanning", "status": "discard", "combined_us": 4200, "reason": "2% slower"}
{"experiment": 48, "change": "byteindex for tokenizer", "status": "keep", "combined_us": 3556, "reason": "40% faster tokenization"}
The agent reads the JSONL at the start of each cycle and knows what’s been tried, what worked, and what failed. That’s why the PR includes a “What did NOT work” section:
Failed approaches (recorded, not retried):
- Split-based tokenizer — 2.5x faster but can't handle edge cases
- Tag name interning via byte-based perfect hash — collision issues
- String#match for name extraction — +5K allocations
- while loops replacing each — YJIT optimizes each better
- Shared expression cache — leaks state, grows unboundedly
- TruthyCondition subclass — hurts YJIT polymorphism
These negative results weren't rediscovered 10 times each.
They were recorded in the JSONL, and the agent avoided retrying them.
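A minimal sketch of that memory layer, assuming the JSONL schema in the example records above (the helper names here are mine, not the plugin's):

```python
import json
from pathlib import Path

def record(log_path, experiment, change, status, combined_us, reason):
    """Append one experiment result -- keep or discard -- to the JSONL log."""
    entry = {"experiment": experiment, "change": change, "status": status,
             "combined_us": combined_us, "reason": reason}
    with Path(log_path).open("a") as f:
        f.write(json.dumps(entry) + "\n")

def failed_changes(log_path):
    """What the agent reads at cycle start: every discarded idea, never to retry."""
    path = Path(log_path)
    if not path.exists():
        return []
    entries = [json.loads(line) for line in path.read_text().splitlines() if line]
    return [e["change"] for e in entries if e["status"] == "discard"]
```

The append is the whole fix: a reverted change leaves no trace in code or git, but the `discard` record survives.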
The Trade-Off — Memory Costs Context Tokens
But the JSONL grows. And it has to fit in the context window:
CYCLE 1:
┌──────────────────────────────────────────────┐
│ Context window │
│ │
│ autoresearch.md ~500 tokens │
│ text_processor.py ~800 tokens │
│ autoresearch.jsonl ~0 tokens (empty) │
│ │
│ TOTAL: ~1,300 tokens │
└──────────────────────────────────────────────┘
CYCLE 50:
┌──────────────────────────────────────────────┐
│ Context window │
│ │
│ autoresearch.md ~500 tokens │
│ text_processor.py ~800 tokens │
│ autoresearch.jsonl ~15,000 tokens │ ← 50 × ~300 tokens each
│ │
│ TOTAL: ~16,300 tokens │
└──────────────────────────────────────────────┘
CYCLE 120:
┌──────────────────────────────────────────────┐
│ Context window │
│ │
│ autoresearch.md ~500 tokens │
│ text_processor.py ~800 tokens │
│ autoresearch.jsonl ~36,000 tokens │ ← 120 × ~300 tokens each
│ │
│ TOTAL: ~37,300 tokens │
└──────────────────────────────────────────────┘
At ~300 tokens per experiment, context limits hit at:
Claude (200K tokens): ~660 experiments before overflow
GPT-4 (128K tokens): ~420 experiments
Gemini (1M+ tokens): ~3,300 experiments
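The same back-of-envelope arithmetic, as code. The ~1,300 fixed tokens and ~300 tokens per experiment are the estimates from the diagrams above:

```python
def experiments_before_overflow(window, fixed=1_300, per_experiment=300):
    """How many JSONL experiment records fit before the context window fills."""
    return (window - fixed) // per_experiment

# The three windows quoted above
for name, window in [("Claude", 200_000), ("GPT-4", 128_000), ("Gemini", 1_000_000)]:
    print(name, experiments_before_overflow(window))
```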
Three Strategies When Memory Outgrows Context
STRATEGY 1: SUMMARIZE
─────────────────────
Keep last 20 experiments in full detail.
Summarize older ones:
SUMMARY (experiments 1-80):
- Regex compilation: no benefit (Python caches internally)
- StringScanner alternatives: byteindex wins, split doesn't
- Loop replacements: while beats each for <3 elements only
- Caching: integer to_s works, expression cache leaks
RECENT (experiments 81-100):
{"experiment": 81, "change": "...", "status": "keep", ...}
{"experiment": 82, "change": "...", "status": "discard", ...}
STRATEGY 2: CATEGORIZE
───────────────────────
Group by approach, not by order:
TOKENIZER approaches tried: 7 (3 kept, 4 failed)
ALLOCATION approaches tried: 5 (2 kept, 3 failed)
CACHING approaches tried: 4 (1 kept, 3 failed)
Failed list (don't retry):
- StringScanner#string= reset: slow
- TruthyCondition subclass: YJIT polymorphism
- shared expression cache: state leaks
STRATEGY 3: JUST TRUNCATE
─────────────────────────
Only keep the last N experiments.
Accept that very old failures might be retried.
Simplest. Works when N is large enough.
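Strategies 1 and 3 combine in a few lines. A sketch, assuming one JSON object per line as above: keep the last N records verbatim and collapse everything older into a single summary line the agent still sees:

```python
import json

def compact(jsonl_lines, keep_recent=20):
    """Truncate with a summary: full detail for recent experiments,
    one aggregate line carrying the do-not-retry list for the old ones."""
    entries = [json.loads(l) for l in jsonl_lines if l.strip()]
    old, recent = entries[:-keep_recent], entries[-keep_recent:]
    out = []
    if old:
        discarded = [e["change"] for e in old if e["status"] == "discard"]
        out.append(f"SUMMARY of {len(old)} older experiments -- "
                   f"do not retry: {'; '.join(discarded)}")
    return out + [json.dumps(e) for e in recent]
```

The key property: the failed-approach list never gets truncated away, because that is the part whose loss causes wasted cycles.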
The Space-Time Trade-Off
NO MEMORY WITH JSONL MEMORY
(Karpathy) (Tobi/pi-autoresearch)
══════════ ═════════════════════
Context size Small, constant Grows linearly with experiments
Cost/cycle ~$0.02 ~$0.02 → $0.15 by cycle 120
Wasted cycles ~40% ~5-10%
Total cost 120 × $0.02 = $2.40 Avg ~$0.08 × 120 = $9.60
Quality Retries failures Avoids failures, learns from history
blindly
Context
usage ↑
│
│ ╱ with JSONL memory
│ ╱ (grows, but fewer
│ ╱ wasted cycles)
│ ╱
│ ╱
│ ╱
│ ╱─────────────── without memory
│╱ (flat, but wastes cycles)
└──────────────────────►
0 120
Experiments
It’s the classic space-time trade-off applied to LLM context windows instead of RAM. You’re paying either way — in wasted compute or in context tokens. Tobi chose to pay in context, which gives better results at roughly the same cost.
The Five Anti-Rot Patterns
Autoresearch uses five patterns that eliminate context rot by avoiding context accumulation entirely:
| # | Pattern | What It Replaces | How |
|---|---|---|---|
| 1 | Tests replace documentation | “Make sure word_frequency handles duplicates” | assertEqual(word_frequency("the cat the")["the"], 2) — 51 tests = the spec |
| 2 | One metric replaces judgment | “Improve performance in a balanced way” | combined_us = lower is better — one number, no ambiguity |
| 3 | Git replaces memory | Agent remembers “I tried X, Y, Z” | git log shows all experiments, git revert = instant reset |
| 4 | Single file scope | Agent tracks which files depend on which | Only text_processor.py is editable. Everything else is off-limits |
| 5 | One change per cycle | Agent plans 10 optimizations, tracks progress | Try ONE thing → measure → keep or revert → repeat |
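Pattern 1 in practice: a test like the one the table quotes is the entire spec the agent gets for that behavior. A sketch with a plausible word_frequency, here the dict.get version the git log above shows the agent converging on:

```python
def word_frequency(text):
    """Count word occurrences -- the dict.get version from the commit history."""
    counts = {}
    for w in text.split():
        counts[w] = counts.get(w, 0) + 1
    return counts

def test_word_frequency_handles_duplicates():
    # One of the 51 tests: the correctness gate, stated as an assertion
    assert word_frequency("the cat the")["the"] == 2
```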
But pattern 3 is incomplete — git only stores successes (committed changes). Failed experiments are reverted and leave no trace. That’s the gap autoresearch.jsonl fills.
The Honest Scorecard
┌───────────────────────────────────────┬──────────────┬─────────────────────────────┐
│ Problem │ Handled? │ How │
├───────────────────────────────────────┼──────────────┼─────────────────────────────┤
│ Don't repeat successful optimizations │ Yes │ Code itself is the memory │
├───────────────────────────────────────┼──────────────┼─────────────────────────────┤
│ Don't repeat failed optimizations │ No* │ No memory mechanism │
├───────────────────────────────────────┼──────────────┼─────────────────────────────┤
│ Context rot from long conversations │ Yes │ Every cycle reads fresh │
├───────────────────────────────────────┼──────────────┼─────────────────────────────┤
│ Context rot from experiment history │ No* │ JSONL grows linearly │
├───────────────────────────────────────┼──────────────┼─────────────────────────────┤
│ Did Tobi fix the memory gap? │ Yes │ autoresearch.jsonl │
├───────────────────────────────────────┼──────────────┼─────────────────────────────┤
│ Did Tobi fix the growing JSONL? │ Unknown │ Likely summarization │
└───────────────────────────────────────┴──────────────┴─────────────────────────────┘
* Without pi-autoresearch plugin. With it, both are addressed.
What This Means for Agent Design
The autoresearch pattern reveals a fundamental tension in agent architecture:
STATELESS AGENT (autoresearch):
✓ No context rot — ever
✓ Simple — five files, one loop
✓ Scales to hundreds of cycles
✗ Retries failed approaches
✗ Can't learn from negative results
STATEFUL AGENT (typical chatbot):
✓ Remembers everything
✓ Learns from failures
✗ Context grows every turn
✗ Quality degrades after ~50% window fill
✗ Eventually hallucinates or ignores instructions
HYBRID (pi-autoresearch with JSONL):
✓ Remembers both successes and failures
✓ Context grows slowly (structured, not conversational)
✓ Can summarize old experiments
✗ Still bounded by context window
✗ More complex to implement
The hybrid approach — stateless agent loop + structured external memory — is emerging as the pattern that works at scale. The agent stays memoryless, but the world maintains state. Files are the memory. Git is the journal. Test output is the specification. And a JSONL log captures what the files and git can’t: what was tried and failed.
The Bottom Line
Autoresearch is not a clever context management strategy. It’s the absence of one — and that’s its genius. By making every cycle read fresh state from disk, it sidesteps the context rot problem entirely. The 53% Shopify speedup and 69x demo speedup came from brute force with a quality gate: pytest + a benchmark number.
But the pattern has a hole — failed experiments vanish. Tobi’s team recognized this and built autoresearch.jsonl as a structured memory layer. The fix is trivial (append experiment results to a file), but the insight is deep: code remembers what worked, but nothing remembers what didn’t work unless you build it.
The pattern is powerful not because it’s clever, but because it’s simple enough that the waste doesn’t matter. A shell script, a test suite, and a number. That’s the whole thing.