Akshay Parkhi's Weblog

Autoresearch and Context Rot — How a Stateless Agent Loop Avoids Memory Problems (And Where It Breaks)

13th March 2026

The autoresearch pattern — where a coding agent runs hundreds of autonomous experiments to optimize code — produced a 53% speedup on Shopify’s 20-year-old Liquid codebase and a 69x speedup on a demo text processor. But there’s a fundamental flaw nobody talks about: the agent has no memory of failed experiments. Here’s exactly how the pattern works, where it breaks, and how Tobi Lütke’s team quietly fixed it.

What Autoresearch Actually Is

Strip away the naming and autoresearch is five files and a loop:

autoresearch.md          ← instructions: "optimize text_processor.py, one change at a time"
text_processor.py        ← the code being optimized (ONLY file agent edits)
test_text_processor.py   ← 51 unit tests (correctness gate)
benchmark.py             ← measures execution time (performance gate)
autoresearch.sh          ← runs pytest + benchmark, prints one number

The loop:
  while True:
      agent("make it faster")      # no history, no memory
      run("./autoresearch.sh")     # pytest + benchmark
      if worse:
          run("git revert")

That’s the entire “framework.” A shell script that runs tests and prints a number. The agent reads the number, decides if it improved, keeps or reverts. Then does it again with zero memory of the previous cycle.
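Fleshed out only slightly, that loop fits in a dozen lines of Python. This is a sketch, not the actual harness: `propose` stands in for the whole agent-edits-then-benchmark step, and the commit/revert calls are reduced to comments.

```python
def autoresearch_loop(initial_us, propose, cycles=10):
    """Minimal sketch of the stateless keep-or-revert loop.

    `propose` stands in for the agent plus benchmark: given the current
    best combined_us, it returns the measured time after one candidate
    change. The loop keeps the change only if the number improved;
    otherwise it 'reverts' and the candidate leaves no trace.
    """
    best_us = initial_us
    log = []
    for cycle in range(cycles):
        candidate_us = propose(best_us)   # agent edits, ./autoresearch.sh runs
        kept = candidate_us < best_us     # lower combined_us is better
        if kept:
            best_us = candidate_us        # "git commit"
        # else: "git revert", nothing recorded anywhere
        log.append((cycle, candidate_us, kept))
    return best_us, log

# Deterministic demo: two improvements, two regressions
it = iter([8000, 9000, 4220, 5000])
best, log = autoresearch_loop(8500, lambda _: next(it), cycles=4)
assert best == 4220
assert [kept for _, _, kept in log] == [True, False, True, False]
```

Note what the log variable illustrates: the real loop has no such variable. Each cycle the agent starts from nothing, which is exactly where the trouble below comes from.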

How Data Flows Through the System

Every cycle is identical — the agent starts completely fresh:

CYCLE START (agent has zero memory)
═══════════════════════════════════

Step 1: Agent reads everything fresh
─────────────────────────────────────

  ┌─────────────────────┐
  │   autoresearch.md   │  "Optimize text_processor.py"
  │   (56 lines)        │  "One change at a time"
  │                     │  "Run ./autoresearch.sh"
  └────────┬────────────┘
           │ read tool
           ▼
  ┌─────────────────────┐
  │  text_processor.py  │  def sort_words(text):
  │  (107 lines)        │      words = text.split()
  │                     │      # BUBBLE SORT ← agent sees this
  │  THIS IS THE ONLY   │      for i in range(len(words)):
  │  FILE AGENT EDITS   │        for j in range(i+1, len(words)):
  └────────┬────────────┘          if words[i] > words[j]:
           │ read tool                 words[i], words[j] = ...
           ▼
  ┌──────────────────────────────────────────────────────┐
  │                        LLM                           │
  │                                                      │
  │  System: [autoresearch.md instructions]              │
  │  Context: [text_processor.py code]                   │
  │                                                      │
  │  "bubble sort is O(n²), sorted() is O(n log n)      │
  │   I'll replace it"                                   │
  └────────┬─────────────────────────────────────────────┘
           │ edit tool
           ▼

Step 2: Agent makes ONE change
──────────────────────────────

  BEFORE:                          AFTER:
  ┌──────────────────────┐        ┌──────────────────────┐
  │ for i in range(...): │   ──►  │ return sorted(words) │
  │   for j in range(..):│        │                      │
  │     if words[i]>...: │        │                      │
  │       swap           │        │                      │
  └──────────────────────┘        └──────────────────────┘

Step 3: Agent runs autoresearch.sh
──────────────────────────────────

  ┌──── autoresearch.sh ───────────────────────────────────┐
  │                                                         │
  │  Step A: pytest                                         │
  │  ┌───────────────────────────────┐                      │
  │  │  test_text_processor.py       │                      │
  │  │  (51 unit tests)              │                      │
  │  │  51 passed                ✓   │── PASS ──►           │
  │  └───────────────────────────────┘         │            │
  │                                            ▼            │
  │  Step B: benchmark.py                                   │
  │  ┌───────────────────────────────┐                      │
  │  │  warmup × 3                   │                      │
  │  │  measure × 10 (best of 10)    │                      │
  │  │  combined_us=4220             │                      │
  │  └───────────────────────────────┘                      │
  │                                                         │
  │  echo "METRIC combined_us=4220"  ◄── ALL THE AGENT     │
  │  exit 0                              GETS BACK          │
  └─────────────────────────────────────────────────────────┘
           │
           │ tool result: "51 passed ✓ ... METRIC combined_us=4220"
           ▼

Step 4: LLM decides
────────────────────

  "Tests passed ✓. combined_us went from 8500 → 4220.
   That's a 50% improvement. I'll commit."

           │ bash tool
           ▼
  ┌─────────────────────────────────────────────────────┐
  │  Git History                                        │
  │                                                     │
  │  abc123 sort_words: use sorted() — 4220µs  ◄── NEW  │
  │  def456 Initial setup — 8500µs                      │
  └─────────────────────────────────────────────────────┘

How the Agent “Remembers” Without Memory

The next cycle, the agent reads the code fresh. It has zero memory of cycle 1. But it doesn’t need it — the code tells it what’s already been done:

CYCLE 2 (agent has ZERO memory of cycle 1)
═══════════════════════════════════════════

  Agent reads text_processor.py:

    def sort_words(text):
        return sorted(text.split())  ← ALREADY OPTIMIZED
                                       Agent sees this. Skips it.

    def word_frequency(text):
        counts = {}
        for w in text.split():
            found = False
            for k in counts:         ← O(n²) loop! Agent spots this.
                if k == w:
                    counts[k] += 1

  Agent doesn't REMEMBER cycle 1.
  It SEES the result of cycle 1 in the code.

  The code IS the memory of all successful optimizations.

This is externalized memory — instead of the agent storing state internally (conversation history), the state lives in the world (files, git, test output). Each cycle reads fresh state from disk.
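The word_frequency rewrite the agent would make in cycle 2 can be written out concretely. The function bodies here are my reconstruction from the snippets above and the commit messages later in the post, not the actual file:

```python
def word_frequency_slow(text):
    # O(n²): scans every existing key for every word
    counts = {}
    for w in text.split():
        found = False
        for k in counts:
            if k == w:
                counts[k] += 1
                found = True
        if not found:
            counts[w] = 1
    return counts

def word_frequency_fast(text):
    # O(n): dict.get is a constant-time lookup per word
    counts = {}
    for w in text.split():
        counts[w] = counts.get(w, 0) + 1
    return counts

assert word_frequency_slow("the cat the") == {"the": 2, "cat": 1}
assert word_frequency_fast("the cat the") == {"the": 2, "cat": 1}
```

After this change lands, the next stateless cycle reads the fast version from disk and moves on. The optimization itself is the record that it happened.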

The Context Rot Problem That Doesn’t Exist

Autoresearch avoids context rot entirely by design. Compare:

TYPICAL AGENT (context grows):
  Turn 1:   system_prompt + user_msg                    = 2K tokens
  Turn 5:   system_prompt + 5 turns + tool results      = 15K tokens
  Turn 20:  system_prompt + 20 turns + tool results     = 60K tokens
  Turn 50:  system_prompt + 50 turns + tool results     = 150K tokens
                                                          ↑ context rot zone

AUTORESEARCH (context stays flat):
  Cycle 1:   read brief + read code + run test           = 500 tokens
  Cycle 50:  read brief + read code + run test           = 500 tokens
  Cycle 120: read brief + read code + run test           = 500 tokens
                                                           ↑ always fresh

The insight: don’t manage context rot — avoid it by making every cycle read fresh state from disk instead of accumulating conversation history. The agent never had to remember experiment #1 while running experiment #120.

The Hole Nobody Talks About — Failed Experiments Have No Memory

Here’s what actually happens when we run 5 optimization cycles on already-optimized code. I tested this on a text processor that was already at 582µs:

CYCLE   WHAT HAPPENED                             RESULT     TRACE LEFT?
─────   ─────────────────────────────────────────  ─────────  ───────────
  1     collections.Counter for word_frequency     WORSE ✗    NONE — reverted
  2     str.translate table for caesar_cipher      BETTER ✓   YES — in code + git
  3     Compiled regex at module level             WORSE ✗    NONE — reverted
  4     str.split instead of regex                 BETTER ✓   YES — in code + git
  5     Compiled regex at module level             WORSE ✗    NONE — reverted
        ↑↑↑ EXACT SAME as cycle 3 ↑↑↑

Cycle 5 retried the exact same compiled regex idea that failed in cycle 3. No memory of the failure. Wasted cycle. The git log confirms no trace:

$ git log --oneline
2f6881e word_frequency: use str.split + strip instead of regex — 552→546µs
8d11221 caesar_cipher: use str.translate table — 22x faster (45→2µs)
24224c5 Optimize all remaining functions: set-based unique, str.find, ...
1b517f8 sort_words: replace bubble sort with sorted() — 73% faster
8d2cae4 word_frequency: replace O(n²) counting with dict.get — 85% faster

Failed attempts? NOT IN GIT. Reverted. Gone.

What Has Memory vs What Doesn’t

SUCCESSES (encoded in code)              FAILURES (gone forever)
═════════════════════════════            ═══════════════════════

text_processor.py line 60:               ??? Counter was slower
  text.translate(table)                  ??? Compiled regex was slower
  ↑ agent sees this, won't              ↑ agent has NO IDEA,
    re-optimize caesar_cipher              WILL retry these

Git log:                                 Git log:
  "caesar_cipher: str.translate"           (nothing — reverted changes
  "word_frequency: dict.get"                leave no commit)
  ↑ successes recorded                    ↑ failures invisible
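The successful caesar_cipher change referenced above might look like this. Both function bodies are my reconstruction; the post only shows the commit message and the 22x figure:

```python
import string

def caesar_cipher_naive(text, shift):
    # per-character arithmetic in a Python-level loop
    out = []
    for ch in text:
        if "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif "A" <= ch <= "Z":
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)

def caesar_cipher_table(text, shift):
    # the kept optimization: build a translation table,
    # then let str.translate do one C-level pass
    lower, upper = string.ascii_lowercase, string.ascii_uppercase
    table = str.maketrans(
        lower + upper,
        lower[shift:] + lower[:shift] + upper[shift:] + upper[:shift],
    )
    return text.translate(table)

assert caesar_cipher_naive("Hello, World!", 3) == "Khoor, Zruog!"
assert caesar_cipher_table("Hello, World!", 3) == "Khoor, Zruog!"
```

Presumably the real version builds the table once per shift rather than per call, but either way the point stands: the table now sitting in text_processor.py is all the "memory" a later cycle needs.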

For micro-optimizations on already-optimized code where most attempts fail:

Unique ideas to try:     ~20
Successful:              ~8-10
Failed:                  ~10-12

In 120 cycles:
  ~10 successful (each tried once, kept)
  ~12 unique failures (first attempt)
  ~98 DUPLICATE RETRIES of those 12 failures  ← wasted

  ~82% of cycles wasted after the easy wins are taken
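The 98-duplicate figure falls out of a toy model I'm adding here, not something from the post. It is idealized: with memory it assumes zero retries, where the table later in the post estimates ~5-10% waste in practice.

```python
def wasted_cycles(total_cycles, n_success, n_fail, with_memory):
    # Successes self-record in the code, so each is tried once.
    # Failures leave no trace: without memory, every cycle after
    # the first attempts can only re-propose a failed idea.
    first_attempts = n_success + n_fail
    if with_memory or total_cycles <= first_attempts:
        return 0
    return total_cycles - first_attempts

assert wasted_cycles(120, 10, 12, with_memory=False) == 98
assert wasted_cycles(120, 10, 12, with_memory=True) == 0
```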

How Tobi Lütke’s Team Fixed It

Look closely at what Tobi actually used:

“He used Pi as the coding agent and released a new pi-autoresearch plugin in collaboration with David Cortés, which maintains state in an autoresearch.jsonl file.”

That autoresearch.jsonl is the fix. It’s a structured log of every experiment — both successes AND failures:

KARPATHY (original)                TOBI (pi-autoresearch plugin)
═══════════════════                ══════════════════════════════

autoresearch.md    ✓               autoresearch.md    ✓
autoresearch.sh    ✓               autoresearch.sh    ✓
failures memory    ✗               autoresearch.jsonl ✓  ← THE FIX
                                        │
                                        ▼
                                   {"experiment": 47,
                                    "change": "compiled regex for tag scanning",
                                    "status": "discard",
                                    "combined_µs": 4200,
                                    "reason": "2% slower"}

                                   {"experiment": 48,
                                    "change": "byteindex for tokenizer",
                                    "status": "keep",
                                    "combined_µs": 3556,
                                    "reason": "40% faster tokenization"}

The agent reads the JSONL at the start of each cycle and knows what’s been tried, what worked, and what failed. That’s why the PR includes a “What did NOT work” section:

Failed approaches (recorded, not retried):
  - Split-based tokenizer — 2.5x faster but can't handle edge cases
  - Tag name interning via byte-based perfect hash — collision issues
  - String#match for name extraction — +5K allocations
  - while loops replacing each — YJIT optimizes each better
  - Shared expression cache — leaks state, grows unboundedly
  - TruthyCondition subclass — hurts YJIT polymorphism

These negative results weren't rediscovered 10 times each.
They were recorded in the JSONL, and the agent avoided retrying them.
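A minimal version of that memory layer is a few lines of Python. The file name follows the post and the field names mirror the JSONL examples above; the real pi-autoresearch schema may differ:

```python
import json
import os

def record(path, experiment):
    # append one experiment per line: successes AND failures
    with open(path, "a") as f:
        f.write(json.dumps(experiment) + "\n")

def already_failed(path, change):
    # read the log fresh each cycle; never retry a "discard"
    if not os.path.exists(path):
        return False
    with open(path) as f:
        return any(
            e["change"] == change and e["status"] == "discard"
            for e in map(json.loads, f)
        )
```

Before proposing a change, the agent checks `already_failed("autoresearch.jsonl", idea)` and skips anything that has already been discarded. Append-only JSONL is a good fit here: no parsing state, no rewrites, and a crashed cycle can at worst leave one truncated trailing line.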

The Trade-Off — Memory Costs Context Tokens

But the JSONL grows. And it has to fit in the context window:

CYCLE 1:
┌──────────────────────────────────────────────┐
│ Context window                                │
│                                               │
│ autoresearch.md           ~500 tokens         │
│ text_processor.py         ~800 tokens         │
│ autoresearch.jsonl        ~0 tokens (empty)   │
│                                               │
│ TOTAL: ~1,300 tokens                          │
└──────────────────────────────────────────────┘

CYCLE 50:
┌──────────────────────────────────────────────┐
│ Context window                                │
│                                               │
│ autoresearch.md           ~500 tokens         │
│ text_processor.py         ~800 tokens         │
│ autoresearch.jsonl        ~15,000 tokens      │ ← 50 × ~300 tokens each
│                                               │
│ TOTAL: ~16,300 tokens                         │
└──────────────────────────────────────────────┘

CYCLE 120:
┌──────────────────────────────────────────────┐
│ Context window                                │
│                                               │
│ autoresearch.md           ~500 tokens         │
│ text_processor.py         ~800 tokens         │
│ autoresearch.jsonl        ~36,000 tokens      │ ← 120 × ~300 tokens each
│                                               │
│ TOTAL: ~37,300 tokens                         │
└──────────────────────────────────────────────┘

At ~300 tokens per experiment, context limits hit at:

Claude (200K tokens):    ~660 experiments before overflow
GPT-4 (128K tokens):     ~420 experiments
Gemini (1M+ tokens):     ~3,300 experiments
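The arithmetic behind those numbers, assuming the ~1,300 tokens of fixed overhead (brief + code) from the cycle-1 diagram:

```python
def experiments_before_overflow(window_tokens, per_experiment=300, overhead=1300):
    # fixed overhead = autoresearch.md + text_processor.py (~1,300 tokens)
    return (window_tokens - overhead) // per_experiment

assert experiments_before_overflow(200_000) == 662   # Claude: ~660
assert experiments_before_overflow(128_000) == 422   # GPT-4: ~420
```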

Three Strategies When Memory Outgrows Context

STRATEGY 1: SUMMARIZE
─────────────────────
Keep last 20 experiments in full detail.
Summarize older ones:

  SUMMARY (experiments 1-80):
  - Regex compilation: no benefit (Python caches internally)
  - StringScanner alternatives: byteindex wins, split doesn't
  - Loop replacements: while beats each for <3 elements only
  - Caching: integer to_s works, expression cache leaks

  RECENT (experiments 81-100):
  {"experiment": 81, "change": "...", "status": "keep", ...}
  {"experiment": 82, "change": "...", "status": "discard", ...}


STRATEGY 2: CATEGORIZE
───────────────────────
Group by approach, not by order:

  TOKENIZER approaches tried: 7 (3 kept, 4 failed)
  ALLOCATION approaches tried: 5 (2 kept, 3 failed)
  CACHING approaches tried: 4 (1 kept, 3 failed)

  Failed list (don't retry):
  - StringScanner#string= reset: slow
  - TruthyCondition subclass: YJIT polymorphism
  - shared expression cache: state leaks


STRATEGY 3: JUST TRUNCATE
─────────────────────────
Only keep the last N experiments.
Accept that very old failures might be retried.
Simplest. Works when N is large enough.
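Strategy 2 can be sketched in a few lines. The "category" field is my own addition for illustration; the JSONL examples earlier in the post don't carry one:

```python
import json
from collections import defaultdict

def categorize(jsonl_lines):
    # Compress a full experiment log into per-category tallies
    # plus an explicit don't-retry list of failed changes.
    tallies = defaultdict(lambda: {"kept": 0, "failed": 0})
    dont_retry = []
    for e in map(json.loads, jsonl_lines):
        bucket = tallies[e["category"]]
        if e["status"] == "keep":
            bucket["kept"] += 1
        else:
            bucket["failed"] += 1
            dont_retry.append(e["change"])
    return dict(tallies), dont_retry
```

The tallies plus the don't-retry list cost a near-constant number of tokens per category, while still blocking the duplicate retries that make the no-memory version wasteful.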

The Space-Time Trade-Off

                 NO MEMORY              WITH JSONL MEMORY
                 (Karpathy)             (Tobi/pi-autoresearch)
                 ══════════             ═════════════════════

Context size     Small, constant        Grows linearly with experiments
Cost/cycle       ~$0.02                 ~$0.02 → $0.15 by cycle 120
Wasted cycles    ~40%                   ~5-10%
Total cost       120 × $0.02 = $2.40   Avg ~$0.08 × 120 = $9.60
Quality          Retries failures       Avoids failures, learns from history
                 blindly


                        Context
                        usage ↑
                              │
                              │                    ╱ with JSONL memory
                              │                 ╱    (grows, but fewer
                              │              ╱        wasted cycles)
                              │           ╱
                              │        ╱
                              │     ╱
                              │  ╱─────────────── without memory
                              │╱                    (flat, but wastes cycles)
                              └──────────────────────►
                                0          120
                                    Experiments

It’s the classic space-time trade-off applied to LLM context windows instead of RAM. You’re paying either way — in wasted compute or in context tokens. Tobi chose to pay in context, which gives better results at roughly the same cost.

The Five Anti-Rot Patterns

Autoresearch uses five patterns that eliminate context rot by avoiding context accumulation entirely:

1. Tests replace documentation
   Replaces: "Make sure word_frequency handles duplicates"
   How: assertEqual(word_frequency("the cat the")["the"], 2) — 51 tests = the spec

2. One metric replaces judgment
   Replaces: "Improve performance in a balanced way"
   How: combined_us, lower is better — one number, no ambiguity

3. Git replaces memory
   Replaces: the agent remembering "I tried X, Y, Z"
   How: git log shows all experiments; git revert = instant reset

4. Single file scope
   Replaces: the agent tracking which files depend on which
   How: only text_processor.py is editable; everything else is off-limits

5. One change per cycle
   Replaces: the agent planning 10 optimizations and tracking progress
   How: try ONE thing → measure → keep or revert → repeat

But pattern 3 is incomplete — git only stores successes (committed changes). Failed experiments are reverted and leave no trace. That’s the gap autoresearch.jsonl fills.

The Honest Scorecard

┌───────────────────────────────────────┬──────────────┬─────────────────────────────┐
│ Problem                               │ Handled?     │ How                         │
├───────────────────────────────────────┼──────────────┼─────────────────────────────┤
│ Don't repeat successful optimizations │ Yes          │ Code itself is the memory   │
├───────────────────────────────────────┼──────────────┼─────────────────────────────┤
│ Don't repeat failed optimizations     │ No*          │ No memory mechanism         │
├───────────────────────────────────────┼──────────────┼─────────────────────────────┤
│ Context rot from long conversations   │ Yes          │ Every cycle reads fresh     │
├───────────────────────────────────────┼──────────────┼─────────────────────────────┤
│ Context rot from experiment history   │ No*          │ JSONL grows linearly        │
├───────────────────────────────────────┼──────────────┼─────────────────────────────┤
│ Did Tobi fix the memory gap?          │ Yes          │ autoresearch.jsonl          │
├───────────────────────────────────────┼──────────────┼─────────────────────────────┤
│ Did Tobi fix the growing JSONL?       │ Unknown      │ Likely summarization        │
└───────────────────────────────────────┴──────────────┴─────────────────────────────┘

* Without pi-autoresearch plugin. With it, both are addressed.

What This Means for Agent Design

The autoresearch pattern reveals a fundamental tension in agent architecture:

STATELESS AGENT (autoresearch):
  ✓ No context rot — ever
  ✓ Simple — five files, one loop
  ✓ Scales to hundreds of cycles
  ✗ Retries failed approaches
  ✗ Can't learn from negative results

STATEFUL AGENT (typical chatbot):
  ✓ Remembers everything
  ✓ Learns from failures
  ✗ Context grows every turn
  ✗ Quality degrades after ~50% window fill
  ✗ Eventually hallucinates or ignores instructions

HYBRID (pi-autoresearch with JSONL):
  ✓ Remembers both successes and failures
  ✓ Context grows slowly (structured, not conversational)
  ✓ Can summarize old experiments
  ✗ Still bounded by context window
  ✗ More complex to implement

The hybrid approach — stateless agent loop + structured external memory — is emerging as the pattern that works at scale. The agent stays memoryless, but the world maintains state. Files are the memory. Git is the journal. Test output is the specification. And a JSONL log captures what the files and git can’t: what was tried and failed.

The Bottom Line

Autoresearch is not a clever context management strategy. It’s the absence of one — and that’s its genius. By making every cycle read fresh state from disk, it sidesteps the context rot problem entirely. The 53% Shopify speedup and 69x demo speedup came from brute force with a quality gate: pytest + a benchmark number.

But the pattern has a hole — failed experiments vanish. Tobi’s team recognized this and built autoresearch.jsonl as a structured memory layer. The fix is trivial (append experiment results to a file), but the insight is deep: code remembers what worked, but nothing remembers what didn’t work unless you build it.

The pattern is powerful not because it’s clever, but because it’s simple enough that the waste doesn’t matter. A shell script, a test suite, and a number. That’s the whole thing.


Next: The Agent Loop Iceberg — 10 Hard Problems Hiding Beneath the Simple Loop

Previous: How Skills Work in AI Agents — From Lazy-Loading Instructions to LLM Attention Weights