Akshay Parkhi's Weblog


How Firecracker MicroVMs Power AgentCore Runtime — From 125ms Boot to Auto-Scaling AI Agents

11th March 2026

When AWS needed to run Lambda functions — millions of them, simultaneously, for strangers on the internet — containers weren’t isolated enough and full VMs were too slow. So they built Firecracker: a microVM that boots in ~125 milliseconds with ~5 MB of memory overhead, gives you hardware-level isolation, and lets you pack thousands of them onto a single server. Now Amazon Bedrock AgentCore Runtime uses the same technology to run AI agent tools. Here’s exactly how it all works.

The Problem: Containers Are Fast but Leaky, VMs Are Safe but Slow

When you run untrusted code (like an AI agent’s tool execution), you need isolation. The two traditional options both have problems:

CONTAINERS (Docker, etc.):
  ✅ Fast startup (~1 second)
  ✅ Low overhead (~10 MB)
  ❌ Share host OS kernel
  ❌ Kernel vulnerabilities = escape to host
  ❌ Not safe for running strangers' code

FULL VMs (EC2, VMware):
  ✅ Own kernel, strong isolation
  ✅ Hardware-level security (KVM/VT-x)
  ❌ Slow startup (30-60 seconds)
  ❌ Heavy overhead (hundreds of MB)
  ❌ Can't spin up thousands per second

FIRECRACKER microVM:
  ✅ Own kernel — hardware-level isolation via KVM
  ✅ Boots in ~125 milliseconds
  ✅ ~5 MB memory overhead
  ✅ 5 new microVMs per CPU core per second
  ✅ 36-core server → 180 new microVMs per second
  ✅ Safe enough for AWS Lambda (billions of invocations)

Firecracker is the sweet spot — it’s a Virtual Machine Monitor (VMM) purpose-built by Amazon for multi-tenant serverless workloads. It runs on top of Linux KVM, giving you real hardware virtualization, but strips away everything unnecessary from a traditional VM.

Firecracker Architecture — One Process, Dedicated Threads

Each microVM is a single Firecracker process on the host. Inside that process:

Physical Server (Host)
┌──────────────────────────────────────────────────────────────┐
│                                                              │
│  Firecracker Process 1 (microVM for Session A)               │
│  ┌────────────────────────────────────────────────────────┐  │
│  │  API Server Thread    ← REST API for configuration     │  │
│  │  vCPU Thread 1        ← runs guest code on CPU core    │  │
│  │  vCPU Thread 2        ← runs guest code on CPU core    │  │
│  │  VirtIO Device Thread ← handles network + disk I/O     │  │
│  │                                                        │  │
│  │  KVM isolation + seccomp + cgroups + jailer             │  │
│  └────────────────────────────────────────────────────────┘  │
│                                                              │
│  Firecracker Process 2 (microVM for Session B)               │
│  ┌────────────────────────────────────────────────────────┐  │
│  │  (completely separate process, own threads, own memory) │  │
│  └────────────────────────────────────────────────────────┘  │
│                                                              │
│  Firecracker Process 3 (microVM for Session C)               │
│  ┌────────────────────────────────────────────────────────┐  │
│  │  (completely separate process)                          │  │
│  └────────────────────────────────────────────────────────┘  │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Kill the process = kill the microVM. Clean. Simple.
No zombie state. No orphaned resources.

The key design decision: one vCPU = one thread. A microVM with 2 vCPUs has 2 vCPU threads. Each thread is pinned to a physical CPU core via cgroups, which prevents cache thrashing from core migration.
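Firecracker's jailer pins vCPU threads through the cpuset cgroup, which is host plumbing rather than a public API. The same pinning effect can be illustrated for an ordinary process with Linux's `sched_setaffinity` syscall, which Python exposes directly (Linux-only sketch, not Firecracker's actual mechanism):

```python
import os

# CPUs this process is currently allowed to run on.
original = os.sched_getaffinity(0)

# Pin this process to a single core (the lowest-numbered one we have).
# The kernel scheduler will no longer migrate us, so L1/L2 caches stay warm.
target = {min(original)}
os.sched_setaffinity(0, target)
assert os.sched_getaffinity(0) == target

# Restore the original mask.
os.sched_setaffinity(0, original)
```

Firecracker applies the same constraint per vCPU thread via cgroup files instead, so the pinning survives across the threads the process spawns.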

4 Layers of Security Isolation

Firecracker doesn’t rely on a single security boundary. It uses defense-in-depth with four layers:

Layer 1: KVM Virtualization (Hardware)
  └─ Intel VT-x / AMD-V hardware extensions
  └─ Guest runs in its own virtual address space
  └─ Guest CANNOT see host memory or other VMs
  └─ This is the same isolation that runs EC2

Layer 2: Seccomp Filters (System Calls)
  └─ Each Firecracker thread has its OWN seccomp profile
  └─ API thread: allowed to do network I/O
  └─ vCPU thread: allowed to do KVM operations
  └─ Blocks ALL unnecessary syscalls
  └─ Even if guest escapes KVM → seccomp blocks dangerous calls

Layer 3: Cgroups + Namespaces (Resources)
  └─ cpuset cgroup: pins microVM to specific CPU cores
  └─ cpu cgroup: limits CPU time quota
  └─ memory cgroup: caps memory usage
  └─ PID namespace: process isolation
  └─ Network namespace: network isolation

Layer 4: Jailer Process (Privilege Dropping)
  └─ Jailer starts with root privileges
  └─ Sets up cgroups, namespaces, seccomp
  └─ Creates chroot filesystem jail
  └─ DROPS all privileges
  └─ exec() into Firecracker (now unprivileged)
  └─ Firecracker never runs as root

The result: one microVM cannot see another’s memory, access another’s files, exceed its CPU quota, make unauthorized system calls, or escape to the host OS.

CPU Management — Pinning and Quotas

Firecracker uses two complementary CPU isolation mechanisms:

MECHANISM 1: CPU Pinning (cpuset cgroup)
  "This microVM can ONLY use CPU cores 4 and 5"

  Physical CPU cores:
  Core 0: [microVM-A]      ← pinned, can't migrate
  Core 1: [microVM-A]      ← pinned
  Core 2: [microVM-B]      ← pinned
  Core 3: [microVM-B]      ← pinned
  Core 4: [microVM-C]      ← pinned
  Core 5: (idle)

  Why pin? Moving between CPU cores causes:
    → L1/L2 cache misses (cold cache on new core)
    → NUMA penalties (memory might be on wrong socket)
    → Performance drops of 10-30%

MECHANISM 2: CPU Quota (cpu cgroup)
  "This microVM gets 50% of CPU time on its assigned cores"

  Core 0 timeline:
  ██░░██░░██░░██░░██░░
  ██ = microVM-A runs (50%)
  ░░ = microVM-B runs (50%)

  Fair sharing. No one microVM can hog the CPU.
  This is how "pay only for active CPU" works.

Important limitation: vCPU count is set BEFORE boot and cannot be changed on a running microVM. Maximum is 32 vCPUs per microVM. To get more CPU power, you create a NEW microVM — this is why scaling is horizontal, not vertical.
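The 50% quota above maps directly onto the cgroup-v2 `cpu.max` file, whose format is "<quota_us> <period_us>": the group may consume at most quota_us of CPU time in each period_us window. A small sketch of building that entry (the cgroup path and period length are operator choices, not fixed by Firecracker):

```python
def cpu_max_line(percent: float, period_us: int = 100_000) -> str:
    """Build a cgroup-v2 `cpu.max` entry granting `percent` of one core.

    With the default 100 ms period, 50% becomes "50000 100000": up to
    50 ms of CPU time in every 100 ms window.
    """
    quota_us = int(period_us * percent / 100)
    return f"{quota_us} {period_us}"

print(cpu_max_line(50))   # "50000 100000"
# An operator would write this string to /sys/fs/cgroup/<vm-group>/cpu.max.
```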

Memory — Hotplugging, Oversubscription, and the Balloon

Unlike CPUs, memory CAN be added to a running microVM without any downtime. This is called memory hotplugging:

STEP 1: microVM boots with 2 GB

  microVM memory map:
  ┌──────────────────────────────────┐
  │ 0 GB ─────────────────── 2 GB   │ ← usable memory
  └──────────────────────────────────┘

STEP 2: Agent needs more (e.g., analyzing a large PDF)

  Firecracker API call from HOST:
  PUT /machine-config { "mem_size_mib": 6144 }

STEP 3: New memory appears INSTANTLY inside the VM

  microVM memory map:
  ┌──────────────────────────────────┬─────────────────────────┐
  │ 0 GB ─────────────────── 2 GB   │ 2 GB ──────────── 6 GB  │
  │ (original)                       │ (hotplugged — NEW)      │
  └──────────────────────────────────┴─────────────────────────┘

  Guest Linux kernel detects: "New memory appeared!"
  Kernel adds it to the available memory pool.
  Agent continues running. Zero downtime.
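Following the article's description of the resize call, the host-side request can be sketched as below. This only constructs the request; Firecracker actually serves its REST API over a Unix domain socket on the host, and the socket path and the exact endpoint semantics depend on the deployed Firecracker version:

```python
import json

def machine_config_request(mem_size_mib: int) -> tuple[str, str, str]:
    """Build the (method, path, body) triple for the resize call above."""
    body = json.dumps({"mem_size_mib": mem_size_mib})
    return ("PUT", "/machine-config", body)

method, path, body = machine_config_request(6144)
print(method, path, body)  # PUT /machine-config {"mem_size_mib": 6144}
```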

The host also oversubscribes memory via demand-fault paging (physical pages are allocated only when the guest first touches them), and a virtio balloon device lets it reclaim memory the guest no longer needs:

Host server: 256 GB physical RAM
Each microVM: configured with 8 GB
Naive math: 256 / 8 = 32 microVMs max

But most microVMs only USE 2 GB at any time.
Firecracker only allocates USED pages.

256 GB / 2 GB actual usage = 128 microVMs on one server!

Like a hotel with 100 rooms selling 200 reservations
because ~50% of guests are no-shows.

RISK: If ALL 128 microVMs suddenly use 8 GB each:
  128 × 8 GB = 1,024 GB needed, only 256 GB available
  → Linux OOM killer terminates some VMs
  → Operator must set oversubscription ratio carefully
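The packing arithmetic above, and the worst-case shortfall the operator must plan for, can be checked in a few lines:

```python
def capacity(host_ram_gb: float, configured_gb: float, avg_used_gb: float):
    """Compare naive packing with demand-paged packing.

    Returns (naive_vm_count, demand_paged_vm_count, worst_case_need_gb).
    """
    naive = int(host_ram_gb // configured_gb)    # reserve the full configured size
    demand = int(host_ram_gb // avg_used_gb)     # only actually-used pages count
    worst_case_need = demand * configured_gb     # every VM spikes to its limit
    return naive, demand, worst_case_need

naive, packed, worst = capacity(256, 8, 2)
print(naive, packed, worst)  # 32 128 1024.0 — 4x the density, 4x the worst case
```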

  Resource         Can Hotplug?                Downtime?   Max
  ---------------  --------------------------  ----------  -------------------
  CPU (vCPUs)      NO — set before boot only   N/A         32 vCPUs
  Memory (RAM)     YES — add while running     Zero        Host limit
  Storage (disk)   YES — block device rescan   Zero        Host limit
  Network (NICs)   NO — set before boot only   N/A         Configured at start

I/O Rate Limiting — Token Bucket Algorithm

Each VirtIO device (network and disk) has configurable rate limiters to prevent one microVM from saturating shared resources:

Each rate limiter has TWO token buckets:

  Bucket 1: Operations per second (IOPS)
    Size: 1000 tokens (max burst)
    Refill: 500 tokens/second (sustained rate)
    Cost: 1 token per I/O operation

  Bucket 2: Bandwidth (bytes/second)
    Size: 100 MB (max burst)
    Refill: 50 MB/second (sustained rate)
    Cost: actual bytes transferred

How it works:
  Agent makes API call → costs 1 IOPS token + N bandwidth tokens
  Bucket has tokens? → request proceeds immediately
  Bucket empty? → request BLOCKS until tokens refill

Example: Agent tries 5000 API calls/second
  Bucket allows burst of 1000 → first 1000 go through
  Then throttled to 500/second sustained
  Other microVMs on the same host are protected
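The token-bucket mechanics above fit in a short class. This is a generic sketch of the algorithm, not Firecracker's Rust implementation:

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter, one per VirtIO device dimension.

    `size` caps the burst; `refill_rate` is the sustained tokens/second.
    Tokens accrue continuously, and a request proceeds only when the
    bucket holds at least `cost` tokens.
    """
    def __init__(self, size: float, refill_rate: float):
        self.size = size
        self.refill_rate = refill_rate
        self.tokens = size                 # start full: full burst available
        self.last = time.monotonic()

    def _refill(self, now: float) -> None:
        elapsed = now - self.last
        self.tokens = min(self.size, self.tokens + elapsed * self.refill_rate)
        self.last = now

    def try_consume(self, cost: float = 1.0) -> bool:
        """Non-blocking check: True if the request may proceed now."""
        self._refill(time.monotonic())
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                       # caller must wait for refill

# The IOPS limiter from the example: burst of 1000, sustained 500 ops/s.
iops = TokenBucket(size=1000, refill_rate=500)
allowed = sum(iops.try_consume() for _ in range(5000))
print(allowed)  # roughly the 1000-token burst passes; the rest are refused
```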

AgentCore Runtime — One Session, One MicroVM

Amazon Bedrock AgentCore Runtime uses Firecracker to run AI agent tools (Code Interpreter, Browser, custom tools) in isolated environments. The architecture is simple: one session = one microVM.

Agent sends tool call: "run this Python code"
                │
                ▼
┌──────────────────────────────────────────────────────┐
│  AgentCore Runtime                                    │
│                                                      │
│  1. Receives tool execution request                  │
│  2. Checks: does session "user-42" have a microVM?   │
│                                                      │
│  NO → Boot new Firecracker microVM (~125ms)          │
│       Install tool runtime (Python, browser, etc.)   │
│       Execute the tool                               │
│                                                      │
│  YES → Route to existing microVM                     │
│        Execute the tool                              │
│        State preserved (variables, files, cookies)   │
│                                                      │
│  Session idle → Terminate microVM                    │
│       Memory sanitized, filesystem destroyed         │
│       Resources returned to pool                     │
└──────────────────────────────────────────────────────┘
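The routing logic above can be sketched as a small session manager. This is a hypothetical illustration (the real AgentCore control plane is not public), with the ~125 ms boot and the sanitizing teardown reduced to stubs:

```python
import itertools

class SessionManager:
    """One session = one microVM: find-or-boot, route, terminate."""
    def __init__(self):
        self._ids = itertools.count(1)
        self.vms: dict[str, int] = {}      # session id -> microVM id

    def execute(self, session: str, tool_call: str) -> str:
        if session not in self.vms:
            # Stand-in for a ~125 ms Firecracker boot + tool runtime setup.
            self.vms[session] = next(self._ids)
        # Existing sessions reuse their microVM, so state is preserved.
        return f"vm-{self.vms[session]}: ran {tool_call}"

    def end_session(self, session: str) -> None:
        # Stand-in for terminate: memory sanitized, filesystem destroyed.
        self.vms.pop(session, None)

mgr = SessionManager()
print(mgr.execute("user-42", "python: df = read_csv(...)"))  # vm-1
print(mgr.execute("user-42", "python: df.describe()"))       # same vm-1
print(mgr.execute("user-99", "browser: open page"))          # fresh vm-2
```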

How AgentCore Auto-Scales — Horizontal, Not Vertical

AgentCore doesn’t make existing microVMs bigger (except memory hotplugging). It spins up MORE microVMs:

10:00 AM — 5 users chatting with agents:
  Server 1: [microVM-1] [microVM-2] [microVM-3]
  Server 2: [microVM-4] [microVM-5]

10:01 AM — Marketing campaign goes viral, 500 users arrive:
  Firecracker boots 495 new microVMs in ~3 seconds
  (5 per core per second × 36 cores = 180/sec)

  Server 1:  [vm1]  [vm2]  [vm3]  [vm4]  [vm5]  [vm6]  [vm7]  [vm8]
  Server 2:  [vm9]  [vm10] [vm11] [vm12] [vm13] [vm14] [vm15] [vm16]
  Server 3:  [vm17] [vm18] [vm19] [vm20] ... ← NEW servers added
  ...
  Server 50: [vm497] [vm498] [vm499] [vm500]

  microVM-1 through 5: STILL RUNNING, untouched, zero downtime
  microVM-6 through 500: NEW, booted in ~125ms each

2:00 PM — Traffic dies down, 3 users left:
  Server 1: [microVM-1] [microVM-2] [microVM-3]
  Servers 2-50: shut down, resources returned

  You paid for 500 microVMs at 10:01 AM.
  You paid for 3 microVMs at 2:00 PM.
  No pre-provisioning. No capacity planning.
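The boot-throughput arithmetic behind "~3 seconds" is easy to reproduce. A sketch, using the article's stated rate of 5 microVMs per core per second:

```python
def boot_time_seconds(new_vms: int, servers: int, cores_per_server: int = 36,
                      vms_per_core_per_sec: int = 5) -> float:
    """Time to boot `new_vms` microVMs at Firecracker's creation rate."""
    rate = servers * cores_per_server * vms_per_core_per_sec  # VMs/second
    return new_vms / rate

# 495 new microVMs on a single 36-core server: 495 / 180 = 2.75 s.
print(round(boot_time_seconds(495, servers=1), 2))  # 2.75
# Spread across more servers, the wall time shrinks proportionally.
print(round(boot_time_seconds(495, servers=5), 2))  # 0.55
```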

State Management Within Sessions

Within a session, the microVM preserves state across multiple tool executions:

Session "user-42" — microVM stays alive between calls:

  Call 1: "import pandas; df = pd.read_csv('data.csv')"
    → Python variables persist in memory
    → Files written to microVM filesystem persist

  Call 2: "df.describe()"
    → Same Python process, same variables
    → df is still loaded from Call 1

  Call 3: "df.to_csv('results.csv')"
    → Writes to same filesystem
    → Agent can download results.csv

For Browser sessions:
  → Cookies persist across page loads
  → Local storage maintained
  → Navigation history available
  → Login sessions stay active

Between sessions? Complete isolation. When a session ends, the microVM is terminated, the writable filesystem layer is destroyed, and all in-memory state is cleared. No data leaks between users.

How Parallel Tool Execution Works Inside a MicroVM

When an agent calls 4 tools in parallel, they run as threads inside the same microVM:

microVM: user-42 (2 vCPUs)
┌────────────────────────────────────────────────────────┐
│  Python ThreadPoolExecutor (4 threads)                  │
│                                                        │
│  Thread 1: get_weather("Tokyo")                        │
│    [CPU: 0.01s] [I/O wait: 2.0s] [CPU: 0.01s]         │
│                                                        │
│  Thread 2: get_weather("Paris")                        │
│    [CPU: 0.01s] [I/O wait: 2.0s] [CPU: 0.01s]         │
│                                                        │
│  Thread 3: get_population("Tokyo")                     │
│    [CPU: 0.01s] [I/O wait: 1.5s] [CPU: 0.01s]         │
│                                                        │
│  Thread 4: get_population("Paris")                     │
│    [CPU: 0.01s] [I/O wait: 1.5s] [CPU: 0.01s]         │
│                                                        │
│  Total CPU time billed: ~0.08s                         │
│  Total wall time: ~2.0s                                │
│  You pay for: ~0.08s of CPU                            │
│  I/O waiting: FREE (CPU serves other microVMs)         │
└────────────────────────────────────────────────────────┘

The microVM doesn't get "bigger" for parallel tools.
Threads share the same 2 vCPUs. But since agent tools
are I/O-bound (waiting for APIs), the CPU barely works.
4 threads or 40 threads — same ~0.08s of actual CPU.
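The wall-time-versus-CPU-time gap above is reproducible with a `ThreadPoolExecutor` and simulated I/O waits (shortened here so the sketch runs quickly; the tool names are placeholders):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_tool(name: str, io_wait: float) -> str:
    time.sleep(io_wait)               # stands in for waiting on an external API
    return f"{name}: done"

start = time.monotonic()
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [
        pool.submit(fake_tool, "get_weather(Tokyo)", 0.2),
        pool.submit(fake_tool, "get_weather(Paris)", 0.2),
        pool.submit(fake_tool, "get_population(Tokyo)", 0.15),
        pool.submit(fake_tool, "get_population(Paris)", 0.15),
    ]
    results = [f.result() for f in futures]
wall = time.monotonic() - start

# Four tools, but wall time tracks the slowest wait (~0.2 s),
# not the 0.7 s sum: the sleeps overlap, and the CPU is idle throughout.
print(len(results), round(wall, 2))
```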

CPU Billing — Pay Only for Active Computation

This is how AgentCore achieves cost efficiency. The physical CPU core is time-sliced between microVMs:

Traditional server (EC2):
  You rent 4 vCPUs for 1 hour = you pay for 4 CPU-hours
  ██░░░░░░░░░░██░░░░░░░░░░██░░░░░░░░░░
  ██ = actual work (5% of time)
  ░░ = idle, waiting for API responses (95% of time)
  You pay for 100% of the time. Waste: 95%.

AgentCore microVM:
  Physical CPU core serves MULTIPLE microVMs:
  ──────────────────────────────────────────
  ██             ██             ██          ← your agent (you pay)
    ▓▓▓▓▓▓▓▓▓▓▓▓  ▓▓▓▓▓▓▓▓▓▓▓▓            ← OTHER agents (they pay)

  Your microVM is "paused" during I/O wait.
  The CPU core runs someone else's workload.
  When your I/O completes, you get CPU back.
  You only pay for ██ time, not ░░ time.
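The two billing models above reduce to a one-line difference. This is a deliberately simple illustration, not AgentCore's published pricing formula:

```python
def billed_cpu_seconds(active: float, idle: float, model: str) -> float:
    """Billed CPU-seconds for a workload with `active` busy seconds
    and `idle` I/O-wait seconds, under two billing models."""
    if model == "rented_vm":
        return active + idle      # wall clock: you pay for the idle waits too
    if model == "active_cpu":
        return active             # I/O waits cost nothing; the core serves others
    raise ValueError(f"unknown model: {model}")

# One hour at 5% utilization, as in the diagram (180 s busy, 3420 s idle):
print(billed_cpu_seconds(180, 3420, "rented_vm"))   # 3600 seconds billed
print(billed_cpu_seconds(180, 3420, "active_cpu"))  # 180 seconds billed
```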

Memory Hotplugging for Agents — Why It Matters

Agent workloads are uniquely spiky. A single conversation can go from trivial to memory-intensive in one message:

Agent starts:     "What is 2+2?"              → needs 128 MB
Agent mid-task:   "Analyze this 50 MB PDF"     → needs 4 GB suddenly
Agent later:      "Summarize in one sentence"  → needs 500 MB

WITHOUT memory hotplugging:
  Option A: Start with 128 MB → crashes on PDF → bad UX
  Option B: Start with 4 GB  → wastes 3.8 GB for "2+2" → expensive

WITH memory hotplugging:
  Start with 128 MB          → cheap
  PDF arrives → hotplug to 4 GB (instant, zero downtime)
  You only pay for 4 GB during PDF analysis
  Session ends → all memory freed at once

This is how AgentCore achieves "pay only for what you use"
— start small, grow on demand, never pre-allocate for peak.
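The saving can be quantified in GB-seconds, assuming memory is billed per GB-second (an illustration, not AgentCore's actual price sheet). A hypothetical session with 60 s of light chat, 30 s of PDF analysis, and 60 s more chat:

```python
def gb_seconds(profile: list[tuple[float, float]]) -> float:
    """Memory cost of a session as GB-seconds, from (gb, seconds) phases."""
    return sum(gb * secs for gb, secs in profile)

static_4gb = gb_seconds([(4.0, 150)])                        # pre-allocate peak
hotplugged = gb_seconds([(0.125, 60), (4.0, 30), (0.5, 60)]) # grow on demand
print(static_4gb, hotplugged)  # 600.0 vs 157.5 — nearly 4x cheaper
```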

What AgentCore Manages vs. What You Manage

AgentCore Manages (You Don’t Touch)            You Manage (Your Responsibility)
---------------------------------------------  -----------------------------------
Physical server fleet                          User-to-session mapping logic
MicroVM placement and scheduling               Maximum sessions per user
CPU time-slicing between microVMs              Session lifecycle management
Memory hotplugging on demand                   Tool definitions and configurations
Network isolation between sessions             Agent logic and prompts
Health checks and session termination          Error handling in your application
Scaling servers up/down based on demand        Cost monitoring and budgets
Security (KVM + seccomp + cgroups + jailer)    Input validation before tool calls

The Complete Picture

┌─────────────────────────────────────────────────────────────────┐
│                    AgentCore Runtime Stack                       │
│                                                                 │
│  YOUR APPLICATION                                               │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Agent (LLM + prompts + tool definitions)                 │  │
│  │  "Analyze this CSV and plot the results"                  │  │
│  └──────────────────────┬────────────────────────────────────┘  │
│                         │ tool call                              │
│                         ▼                                       │
│  AGENTCORE RUNTIME                                              │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Session Manager                                          │  │
│  │  → Find or create microVM for this session                │  │
│  │  → Route tool execution to correct microVM                │  │
│  │  → Handle session lifecycle (create/extend/terminate)     │  │
│  └──────────────────────┬────────────────────────────────────┘  │
│                         │                                       │
│                         ▼                                       │
│  FIRECRACKER LAYER                                              │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐     │  │
│  │  │microVM-1│  │microVM-2│  │microVM-3│  │microVM-N│     │  │
│  │  │Session A│  │Session B│  │Session C│  │Session N│     │  │
│  │  │Code Intl│  │Browser  │  │Custom   │  │Code Intl│     │  │
│  │  └─────────┘  └─────────┘  └─────────┘  └─────────┘     │  │
│  │                                                           │  │
│  │  Security: KVM + seccomp + cgroups + namespaces + jailer  │  │
│  │  Resources: CPU pinning, memory hotplug, I/O rate limits  │  │
│  │  Scaling: horizontal (new VMs), ~125ms boot, ~5MB overhead│  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│  PHYSICAL INFRASTRUCTURE (managed by AWS)                       │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Server fleet auto-scales based on demand                 │  │
│  │  5 → 500 → 3 sessions: automatic, zero downtime           │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

The bottom line: Firecracker microVMs give you VM-level security with container-level speed. AgentCore Runtime builds on this to auto-scale AI agent tool execution — each session gets its own isolated environment that boots in 125 milliseconds, scales memory on demand without downtime, and costs you only for the CPU cycles your agent actually uses. No capacity planning, no idle resources, no security compromises.

