How Firecracker MicroVMs Power AgentCore Runtime — From 125ms Boot to Auto-Scaling AI Agents
11th March 2026
When AWS needed to run Lambda functions — millions of them, simultaneously, for strangers on the internet — containers weren’t isolated enough and full VMs were too slow. So they built Firecracker: a microVM that boots in ~125 milliseconds with ~5 MB of memory overhead, gives you hardware-level isolation, and lets you pack thousands of them onto a single server. Now Amazon Bedrock AgentCore Runtime uses the same technology to run AI agent tools. Here’s exactly how it all works.
The Problem: Containers Are Fast but Leaky, VMs Are Safe but Slow
When you run untrusted code (like an AI agent’s tool execution), you need isolation. The two traditional options both have problems:
CONTAINERS (Docker, etc.):
✅ Fast startup (~1 second)
✅ Low overhead (~10 MB)
❌ Share host OS kernel
❌ Kernel vulnerabilities = escape to host
❌ Not safe for running strangers' code
FULL VMs (EC2, VMware):
✅ Own kernel, strong isolation
✅ Hardware-level security (KVM/VT-x)
❌ Slow startup (30-60 seconds)
❌ Heavy overhead (hundreds of MB)
❌ Can't spin up thousands per second
FIRECRACKER microVM:
✅ Own kernel — hardware-level isolation via KVM
✅ Boots in ~125 milliseconds
✅ ~5 MB memory overhead
✅ 5 new microVMs per CPU core per second
✅ 36-core server → 180 new microVMs per second
✅ Safe enough for AWS Lambda (billions of invocations)
Firecracker is the sweet spot — it’s a Virtual Machine Monitor (VMM) purpose-built by Amazon for multi-tenant serverless workloads. It runs on top of Linux KVM, giving you real hardware virtualization, but strips away everything unnecessary from a traditional VM.
Firecracker Architecture — One Process, Dedicated Threads
Each microVM is a single Firecracker process on the host. Inside that process:
Physical Server (Host)
┌──────────────────────────────────────────────────────────────┐
│ │
│ Firecracker Process 1 (microVM for Session A) │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ API Server Thread ← REST API for configuration │ │
│ │ vCPU Thread 1 ← runs guest code on CPU core │ │
│ │ vCPU Thread 2 ← runs guest code on CPU core │ │
│ │ VirtIO Device Thread ← handles network + disk I/O │ │
│ │ │ │
│ │ KVM isolation + seccomp + cgroups + jailer │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ Firecracker Process 2 (microVM for Session B) │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ (completely separate process, own threads, own memory) │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ Firecracker Process 3 (microVM for Session C) │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ (completely separate process) │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────┘
Kill the process = kill the microVM. Clean. Simple.
No zombie state. No orphaned resources.
The key design decision: one vCPU = one thread. A microVM with 2 vCPUs has 2 vCPU threads. Each thread is pinned to a physical CPU core via cgroups, which prevents cache thrashing from core migration.
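Before boot, that API server thread is driven over a Unix socket with a handful of REST calls. Here is a minimal sketch of the call sequence a control plane would send to one Firecracker process — the endpoint names and fields come from Firecracker's public API, while the kernel and rootfs paths are hypothetical placeholders:

```python
import json

# Sketch of the pre-boot REST calls sent to one Firecracker process's
# API socket. Endpoints match Firecracker's public API; the image
# paths are hypothetical.
def boot_plan(vcpus: int, mem_mib: int) -> list[tuple[str, str, dict]]:
    """Return (method, path, body) tuples for a minimal microVM boot."""
    return [
        ("PUT", "/machine-config", {"vcpu_count": vcpus, "mem_size_mib": mem_mib}),
        ("PUT", "/boot-source", {
            "kernel_image_path": "/images/vmlinux.bin",   # hypothetical path
            "boot_args": "console=ttyS0 reboot=k panic=1",
        }),
        ("PUT", "/drives/rootfs", {
            "drive_id": "rootfs",
            "path_on_host": "/images/rootfs.ext4",        # hypothetical path
            "is_root_device": True,
            "is_read_only": False,
        }),
        ("PUT", "/actions", {"action_type": "InstanceStart"}),
    ]

for method, path, body in boot_plan(vcpus=2, mem_mib=2048):
    print(method, path, json.dumps(body))
```

The final `InstanceStart` action is what kicks off the ~125ms boot; everything before it is just configuring the process.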
4 Layers of Security Isolation
Firecracker doesn’t rely on a single security boundary. It uses defense-in-depth with four layers:
Layer 1: KVM Virtualization (Hardware)
└─ Intel VT-x / AMD-V hardware extensions
└─ Guest runs in its own virtual address space
└─ Guest CANNOT see host memory or other VMs
└─ The same hardware virtualization that isolates EC2 instances
Layer 2: Seccomp Filters (System Calls)
└─ Each Firecracker thread has its OWN seccomp profile
└─ API thread: allowed to do network I/O
└─ vCPU thread: allowed to do KVM operations
└─ Blocks ALL unnecessary syscalls
└─ Even if guest escapes KVM → seccomp blocks dangerous calls
Layer 3: Cgroups + Namespaces (Resources)
└─ cpuset cgroup: pins microVM to specific CPU cores
└─ cpu cgroup: limits CPU time quota
└─ memory cgroup: caps memory usage
└─ PID namespace: process isolation
└─ Network namespace: network isolation
Layer 4: Jailer Process (Privilege Dropping)
└─ Jailer starts with root privileges
└─ Sets up cgroups, namespaces, seccomp
└─ Creates chroot filesystem jail
└─ DROPS all privileges
└─ exec() into Firecracker (now unprivileged)
└─ Firecracker never runs as root
The result: one microVM cannot see another’s memory, access another’s files, exceed its CPU quota, make unauthorized system calls, or escape to the host OS.
CPU Management — Pinning and Quotas
Firecracker uses two complementary CPU isolation mechanisms:
MECHANISM 1: CPU Pinning (cpuset cgroup)
"This microVM can ONLY use CPU cores 4 and 5"
Physical CPU cores:
Core 0: [microVM-A] ← pinned, can't migrate
Core 1: [microVM-A] ← pinned
Core 2: [microVM-B] ← pinned
Core 3: [microVM-B] ← pinned
Core 4: [microVM-C] ← pinned
Core 5: (idle)
Why pin? Moving between CPU cores causes:
→ L1/L2 cache misses (cold cache on new core)
→ NUMA penalties (memory might be on wrong socket)
→ Performance drops of 10-30%
MECHANISM 2: CPU Quota (cpu cgroup)
"This microVM gets 50% of CPU time on its assigned cores"
Core 0 timeline:
██░░██░░██░░██░░██░░
██ = microVM-A runs (50%)
░░ = microVM-B runs (50%)
Fair sharing. No one microVM can hog the CPU.
This is how "pay only for active CPU" works.
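Both mechanisms map onto two cgroup v2 interface files. A small sketch of what an operator (or the jailer) would write into a microVM's cgroup — `cpuset.cpus` and `cpu.max` are the real cgroup v2 file names, while the numbers are illustrative:

```python
# The cgroup v2 knobs behind pinning and quotas. cpu.max holds
# "<quota_us> <period_us>": the microVM may run quota_us out of every
# period_us microseconds on its pinned cores.
def cgroup_settings(cores: list[int], quota_pct: int, period_us: int = 100_000) -> dict:
    quota_us = period_us * quota_pct // 100
    return {
        "cpuset.cpus": ",".join(str(c) for c in cores),  # pin to these cores
        "cpu.max": f"{quota_us} {period_us}",            # CPU time quota
    }

print(cgroup_settings(cores=[4, 5], quota_pct=50))
# e.g. {'cpuset.cpus': '4,5', 'cpu.max': '50000 100000'}
```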
Important limitation: vCPU count is set BEFORE boot and cannot be changed on a running microVM. Maximum is 32 vCPUs per microVM. To get more CPU power, you create a NEW microVM — this is why scaling is horizontal, not vertical.
Memory — Hotplugging, Oversubscription, and the Balloon
Unlike CPUs, memory CAN be added to a running microVM without any downtime. This is called memory hotplugging:
STEP 1: microVM boots with 2 GB
microVM memory map:
┌──────────────────────────────────┐
│ 0 GB ─────────────────── 2 GB │ ← usable memory
└──────────────────────────────────┘
STEP 2: Agent needs more (e.g., analyzing a large PDF)
Firecracker API call from HOST:
PUT /machine-config { "mem_size_mib": 6144 }
STEP 3: New memory appears INSTANTLY inside the VM
microVM memory map:
┌──────────────────────────────────┬─────────────────────────┐
│ 0 GB ─────────────────── 2 GB │ 2 GB ──────────── 6 GB │
│ (original) │ (hotplugged — NEW) │
└──────────────────────────────────┴─────────────────────────┘
Guest Linux kernel detects: "New memory appeared!"
Kernel adds it to the available memory pool.
Agent continues running. Zero downtime.
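The host-side call in STEP 2 can be sketched as a plain request tuple. Note the hedge: upstream Firecracker treats `/machine-config` as pre-boot configuration, so the ability to resize a *running* microVM this way depends on the build AgentCore uses — treat this as an illustration of the article's flow, not a guaranteed upstream API:

```python
import json

# Sketch of the hotplug call from STEP 2, assuming an API build that
# accepts live mem_size_mib updates as described above.
def hotplug_request(new_mem_mib: int) -> tuple[str, str, str]:
    body = json.dumps({"mem_size_mib": new_mem_mib})
    return ("PUT", "/machine-config", body)

method, path, body = hotplug_request(6144)
print(method, path, body)  # PUT /machine-config {"mem_size_mib": 6144}
```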
The host also uses memory oversubscription via demand-fault paging:
Host server: 256 GB physical RAM
Each microVM: configured with 8 GB
Naive math: 256 / 8 = 32 microVMs max
But most microVMs only USE 2 GB at any time.
Firecracker only allocates USED pages.
256 GB / 2 GB actual usage = 128 microVMs on one server!
Like a hotel with 100 rooms selling 200 reservations
because ~50% of guests are no-shows.
RISK: If ALL 128 microVMs suddenly use 8 GB each:
128 × 8 GB = 1,024 GB needed, only 256 GB available
→ Linux OOM killer terminates some VMs
→ Operator must set oversubscription ratio carefully
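The arithmetic above, including the worst case the OOM warning describes, fits in a few lines. Numbers mirror the example; a real operator would derive the ratio from observed usage, not a fixed 2 GB guess:

```python
# Oversubscription math from the example above: demand-fault paging
# means only pages a guest has touched consume physical RAM.
def capacity(host_ram_gb: int, configured_gb: int, avg_used_gb: int):
    naive = host_ram_gb // configured_gb          # if every page were backed
    oversubscribed = host_ram_gb // avg_used_gb   # pack by actual usage
    worst_case_gb = oversubscribed * configured_gb  # if everyone maxes out
    return naive, oversubscribed, worst_case_gb

naive, packed, worst = capacity(host_ram_gb=256, configured_gb=8, avg_used_gb=2)
print(naive, packed, worst)  # 32 microVMs naive, 128 packed, 1024 GB worst case
```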
| Resource | Can Hotplug? | Downtime? | Max |
|---|---|---|---|
| CPU (vCPUs) | NO — set before boot only | N/A | 32 vCPUs |
| Memory (RAM) | YES — add while running | Zero | Host limit |
| Storage (disk) | YES — block device rescan | Zero | Host limit |
| Network (NICs) | NO — set before boot only | N/A | Configured at start |
I/O Rate Limiting — Token Bucket Algorithm
Each VirtIO device (network and disk) has configurable rate limiters to prevent one microVM from saturating shared resources:
Each rate limiter has TWO token buckets:
Bucket 1: Operations per second (IOPS)
Size: 1000 tokens (max burst)
Refill: 500 tokens/second (sustained rate)
Cost: 1 token per I/O operation
Bucket 2: Bandwidth (bytes/second)
Size: 100 MB (max burst)
Refill: 50 MB/second (sustained rate)
Cost: actual bytes transferred
How it works:
Agent makes API call → costs 1 IOPS token + N bandwidth tokens
Bucket has tokens? → request proceeds immediately
Bucket empty? → request BLOCKS until tokens refill
Example: Agent tries 5000 API calls/second
Bucket allows burst of 1000 → first 1000 go through
Then throttled to 500/second sustained
Other microVMs on the same host are protected
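The bucket behavior above is easy to reproduce. Here is a minimal token bucket matching the description — a burst-sized bucket refilled at a sustained rate, where each operation spends a token or gets throttled (a real limiter would queue the I/O rather than reject it):

```python
import time

# Minimal token bucket: size = max burst, refill = sustained rate.
class TokenBucket:
    def __init__(self, size: float, refill_per_sec: float):
        self.size = size
        self.refill = refill_per_sec
        self.tokens = size              # start full: burst is allowed
        self.last = time.monotonic()

    def _top_up(self):
        now = time.monotonic()
        self.tokens = min(self.size, self.tokens + (now - self.last) * self.refill)
        self.last = now

    def try_consume(self, cost: float = 1.0) -> bool:
        """Spend tokens if available; Firecracker would block the I/O instead."""
        self._top_up()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                    # caller is throttled

# Burst of 1000 ops, sustained 500 ops/sec, as in the IOPS example.
iops = TokenBucket(size=1000, refill_per_sec=500)
granted = sum(iops.try_consume() for _ in range(5000))
print(granted)  # ≈1000: the burst passes immediately, the rest are throttled
```

Run over a full second, the same bucket would admit roughly 500 more operations as it refills — the sustained rate.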
AgentCore Runtime — One Session, One MicroVM
Amazon Bedrock AgentCore Runtime uses Firecracker to run AI agent tools (Code Interpreter, Browser, custom tools) in isolated environments. The architecture is simple: one session = one microVM.
Agent sends tool call: "run this Python code"
│
▼
┌──────────────────────────────────────────────────────┐
│ AgentCore Runtime │
│ │
│ 1. Receives tool execution request │
│ 2. Checks: does session "user-42" have a microVM? │
│ │
│ NO → Boot new Firecracker microVM (~125ms) │
│ Install tool runtime (Python, browser, etc.) │
│ Execute the tool │
│ │
│ YES → Route to existing microVM │
│ Execute the tool │
│ State preserved (variables, files, cookies) │
│ │
│ Session idle → Terminate microVM │
│ Memory sanitized, filesystem destroyed │
│ Resources returned to pool │
└──────────────────────────────────────────────────────┘
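The decision tree above is essentially a get-or-create on a session map. A sketch of that routing logic — `Runtime` and the dict-based `MicroVM` stand-in are hypothetical, since AgentCore's control plane is not public:

```python
# Sketch of one-session-one-microVM routing. boot_microvm is injected
# (in reality: a ~125ms Firecracker boot); the VM is a plain dict here.
class Runtime:
    def __init__(self, boot_microvm):
        self.boot = boot_microvm
        self.sessions = {}                 # session_id -> microVM

    def execute(self, session_id: str, tool_call):
        vm = self.sessions.get(session_id)
        if vm is None:                     # cold start: boot and remember
            vm = self.sessions[session_id] = self.boot()
        return tool_call(vm)               # warm path: in-VM state preserved

    def end_session(self, session_id: str):
        self.sessions.pop(session_id, None)  # terminate; state is destroyed

booted = []
rt = Runtime(boot_microvm=lambda: booted.append("vm") or {"files": {}})
rt.execute("user-42", lambda vm: vm["files"].setdefault("data.csv", "a,b"))
rt.execute("user-42", lambda vm: vm["files"]["data.csv"])  # same VM, file persists
print(len(booted))  # 1: the second call reused the session's microVM
```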
How AgentCore Auto-Scales — Horizontal, Not Vertical
AgentCore doesn’t make existing microVMs bigger (except memory hotplugging). It spins up MORE microVMs:
10:00 AM — 5 users chatting with agents:
Server 1: [microVM-1] [microVM-2] [microVM-3]
Server 2: [microVM-4] [microVM-5]
10:01 AM — Marketing campaign goes viral, 500 users arrive:
Firecracker boots 495 new microVMs in ~3 seconds
(5 per core per second × 36 cores = 180/sec)
Server 1: [vm1] [vm2] [vm3] [vm4] [vm5] [vm6] [vm7] [vm8]
Server 2: [vm9] [vm10] [vm11] [vm12] [vm13] [vm14] [vm15] [vm16]
Server 3: [vm17] [vm18] [vm19] [vm20] ... ← NEW servers added
...
Server 50: [vm497] [vm498] [vm499] [vm500]
microVM-1 through 5: STILL RUNNING, untouched, zero downtime
microVM-6 through 500: NEW, booted in ~125ms each
2:00 PM — Traffic dies down, 3 users left:
Server 1: [microVM-1] [microVM-2] [microVM-3]
Servers 2-50: shut down, resources returned
You paid for 500 microVMs at 10:01 AM.
You paid for 3 microVMs at 2:00 PM.
No pre-provisioning. No capacity planning.
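The "~3 seconds" claim in the timeline follows directly from the published boot throughput. A quick check of the arithmetic (5 boots per core per second is the figure cited above; server counts are illustrative):

```python
# Burst-absorption math: how long to boot N new microVMs at
# Firecracker's cited throughput of 5 boots/core/second.
BOOTS_PER_CORE_PER_SEC = 5

def burst_seconds(new_vms: int, cores_per_server: int, servers: int) -> float:
    rate = BOOTS_PER_CORE_PER_SEC * cores_per_server * servers
    return new_vms / rate

print(round(burst_seconds(new_vms=495, cores_per_server=36, servers=1), 2))
# 2.75 — one 36-core server absorbs the whole spike in under 3 seconds
```

Add a second server and the burst halves; this is why scaling out servers and scaling out microVMs compose so cleanly.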
State Management Within Sessions
Within a session, the microVM preserves state across multiple tool executions:
Session "user-42" — microVM stays alive between calls:
Call 1: "import pandas; df = pd.read_csv('data.csv')"
→ Python variables persist in memory
→ Files written to microVM filesystem persist
Call 2: "df.describe()"
→ Same Python process, same variables
→ df is still loaded from Call 1
Call 3: "df.to_csv('results.csv')"
→ Writes to same filesystem
→ Agent can download results.csv
For Browser sessions:
→ Cookies persist across page loads
→ Local storage maintained
→ Navigation history available
→ Login sessions stay active
Between sessions? Complete isolation. When a session ends, the microVM is terminated, the writable filesystem layer is destroyed, and all in-memory state is cleared. No data leaks between users.
How Parallel Tool Execution Works Inside a MicroVM
When an agent calls 4 tools in parallel, they run as threads inside the same microVM:
microVM: user-42 (2 vCPUs)
┌────────────────────────────────────────────────────────┐
│ Python ThreadPoolExecutor (4 threads) │
│ │
│ Thread 1: get_weather("Tokyo") │
│ [CPU: 0.01s] [I/O wait: 2.0s] [CPU: 0.01s] │
│ │
│ Thread 2: get_weather("Paris") │
│ [CPU: 0.01s] [I/O wait: 2.0s] [CPU: 0.01s] │
│ │
│ Thread 3: get_population("Tokyo") │
│ [CPU: 0.01s] [I/O wait: 1.5s] [CPU: 0.01s] │
│ │
│ Thread 4: get_population("Paris") │
│ [CPU: 0.01s] [I/O wait: 1.5s] [CPU: 0.01s] │
│ │
│ Total CPU time billed: ~0.08s │
│ Total wall time: ~2.0s │
│ You pay for: ~0.08s of CPU │
│ I/O waiting: FREE (CPU serves other microVMs) │
└────────────────────────────────────────────────────────┘
The microVM doesn't get "bigger" for parallel tools.
Threads share the same 2 vCPUs. But since agent tools
are I/O-bound (waiting for APIs), the CPU barely works.
4 threads or 40 threads — same ~0.08s of actual CPU.
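The thread diagram above translates directly into a `ThreadPoolExecutor`. In this sketch, `time.sleep` stands in for API latency, and the weather/population tools are hypothetical stand-ins — the point is that wall time tracks the slowest I/O wait, not the sum:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Four I/O-bound "tools" sharing one interpreter. During sleep (i.e.
# waiting on an API), the GIL is released and the CPU is free.
def fake_tool(name: str, io_seconds: float) -> str:
    time.sleep(io_seconds)
    return f"{name}: done"

calls = [("get_weather(Tokyo)", 0.2), ("get_weather(Paris)", 0.2),
         ("get_population(Tokyo)", 0.15), ("get_population(Paris)", 0.15)]

start = time.monotonic()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda c: fake_tool(*c), calls))
wall = time.monotonic() - start

print(len(results), round(wall, 1))  # 4 results in ~0.2s, not the 0.7s a sequential loop would take
```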
CPU Billing — Pay Only for Active Computation
This is how AgentCore achieves cost efficiency. The physical CPU core is time-sliced between microVMs:
Traditional server (EC2):
You rent 4 vCPUs for 1 hour = you pay for 4 CPU-hours
██░░░░░░░░░░██░░░░░░░░░░██░░░░░░░░░░
██ = actual work (5% of time)
░░ = idle, waiting for API responses (95% of time)
You pay for 100% of the time. Waste: 95%.
AgentCore microVM:
Physical CPU core serves MULTIPLE microVMs:
──────────────────────────────────────────
██ ██ ██ ← your agent (you pay)
▓▓▓▓▓▓▓▓▓▓▓▓ ▓▓▓▓▓▓▓▓▓▓▓▓ ← OTHER agents (they pay)
Your microVM is "paused" during I/O wait.
The CPU core runs someone else's workload.
When your I/O completes, you get CPU back.
You only pay for ██ time, not ░░ time.
Memory Hotplugging for Agents — Why It Matters
Agent workloads are uniquely spiky. A single conversation can go from trivial to memory-intensive in one message:
Agent starts: "What is 2+2?" → needs 128 MB
Agent mid-task: "Analyze this 50 MB PDF" → needs 4 GB suddenly
Agent later: "Summarize in one sentence" → needs 500 MB
WITHOUT memory hotplugging:
Option A: Start with 128 MB → crashes on PDF → bad UX
Option B: Start with 4 GB → wastes 3.8 GB for "2+2" → expensive
WITH memory hotplugging:
Start with 128 MB → cheap
PDF arrives → hotplug to 4 GB (instant, zero downtime)
You only pay for 4 GB during PDF analysis
Session ends → all memory freed at once
This is how AgentCore achieves "pay only for what you use"
— start small, grow on demand, never pre-allocate for peak.
What AgentCore Manages vs. What You Manage
| AgentCore Manages (You Don’t Touch) | You Manage (Your Responsibility) |
|---|---|
| Physical server fleet | User-to-session mapping logic |
| MicroVM placement and scheduling | Maximum sessions per user |
| CPU time-slicing between microVMs | Session lifecycle management |
| Memory hotplugging on demand | Tool definitions and configurations |
| Network isolation between sessions | Agent logic and prompts |
| Health checks and session termination | Error handling in your application |
| Scaling servers up/down based on demand | Cost monitoring and budgets |
| Security (KVM + seccomp + cgroups + jailer) | Input validation before tool calls |
The Complete Picture
┌─────────────────────────────────────────────────────────────────┐
│ AgentCore Runtime Stack │
│ │
│ YOUR APPLICATION │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Agent (LLM + prompts + tool definitions) │ │
│ │ "Analyze this CSV and plot the results" │ │
│ └──────────────────────┬────────────────────────────────────┘ │
│ │ tool call │
│ ▼ │
│ AGENTCORE RUNTIME │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Session Manager │ │
│ │ → Find or create microVM for this session │ │
│ │ → Route tool execution to correct microVM │ │
│ │ → Handle session lifecycle (create/extend/terminate) │ │
│ └──────────────────────┬────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ FIRECRACKER LAYER │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │microVM-1│ │microVM-2│ │microVM-3│ │microVM-N│ │ │
│ │ │Session A│ │Session B│ │Session C│ │Session N│ │ │
│ │ │Code Intl│ │Browser │ │Custom │ │Code Intl│ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │
│ │ │ │
│ │ Security: KVM + seccomp + cgroups + namespaces + jailer │ │
│ │ Resources: CPU pinning, memory hotplug, I/O rate limits │ │
│ │ Scaling: horizontal (new VMs), ~125ms boot, ~5MB overhead│ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ PHYSICAL INFRASTRUCTURE (managed by AWS) │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Server fleet auto-scales based on demand │ │
│ │ 5 → 500 → 3 sessions: automatic, zero downtime │ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
The bottom line: Firecracker microVMs give you VM-level security with container-level speed. AgentCore Runtime builds on this to auto-scale AI agent tool execution — each session gets its own isolated environment that boots in 125 milliseconds, scales memory on demand without downtime, and costs you only for the CPU cycles your agent actually uses. No capacity planning, no idle resources, no security compromises.