Akshay Parkhi's Weblog


I Ran 100 Parallel Tool Calls on AgentCore — The microVM Didn’t Break, But the LLM Did

12th March 2026

What happens when you fire 100 tool calls in parallel inside a single AgentCore microVM? Does the microVM crash? Does it run out of memory? Does the thread pool explode? I deployed an agent with 100 tools to Amazon Bedrock AgentCore Runtime and ran a scaling test from 5 to 100 parallel tool calls. Here’s exactly what happened.

The Test Setup

I created a Strands agent with 100 identical lightweight tools — each one sleeps for 100ms and returns a sensor reading. The agent is deployed to AgentCore Runtime, which runs it inside a Firecracker microVM with 2 vCPU and 8 GB RAM.

import random
import threading
import time

from strands import Agent, tool
from strands.models import BedrockModel
from bedrock_agentcore.runtime import BedrockAgentCoreApp

diagnostics = {"threads_seen": set()}

# Generate 100 tools programmatically. The default argument pins each tool's
# name at definition time — a bare closure would late-bind and give every
# tool the last loop value.
tools = []
for i in range(100):
    @tool(name=f"sensor_{i:03d}")
    def read_sensor(input_data: str, _name: str = f"sensor_{i:03d}") -> dict:
        """Read sensor data and return measurement."""
        time.sleep(0.1)  # Simulate 100ms I/O
        diagnostics["threads_seen"].add(threading.current_thread().name)
        return {
            "sensor_id": _name,
            "value": random.uniform(20, 30),
            "thread": threading.current_thread().name,
            "timestamp": time.time(),
        }
    tools.append(read_sensor)

agent = Agent(
    model=BedrockModel(model_id="anthropic.claude-sonnet-4-20250514"),
    tools=tools,
)

app = BedrockAgentCoreApp()

@app.entrypoint
def handler(payload):
    result = agent(payload["prompt"])
    return {
        "response": str(result),
        "diagnostics": {"unique_threads": len(diagnostics["threads_seen"])},
    }

The prompt tells the LLM to call ALL tools simultaneously. Strands’ ConcurrentToolExecutor (enabled by default) handles parallel execution via a thread pool.
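The executor's internals aren't shown here, but the effect of a capped worker pool on 100 sleeping tools can be sketched with the standard library alone (the cap of 6 matches what the diagnostics report later):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_tool(i: int) -> float:
    """Stand-in for one 100ms I/O-bound tool call."""
    time.sleep(0.1)
    return 20.0 + (i % 10)

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=6) as pool:
    results = list(pool.map(fake_tool, range(100)))
elapsed = time.perf_counter() - start

# 100 tools / 6 workers ≈ 17 waves of ~0.1s each → roughly 1.7s wall time
print(f"{len(results)} tools in {elapsed:.2f}s")
```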

The Scaling Test: 5 → 10 → 25 → 50 → 100 Tools

Each test invokes the agent with a prompt requesting N tools to be called in parallel. Here are the actual results from AgentCore Runtime:

Tools   Total Time   LLM Call #1 (decide)   LLM Call #2 (summarize)   Input Tokens   Output Tokens
    5      7.6s             3.48s                   4.07s                16,393            449
   10      8.7s             3.48s                   4.67s                17,076            693
   25     15.7s             4.75s                   9.85s                19,213          1,468
   50     22.8s             5.37s                  15.41s                22,407          2,338
  100     40.3s             4.32s                  31.66s                29,128          4,454

The microVM didn’t crash. No OOM. No throttling. Zero errors. But 100 tools took 40 seconds — 4x slower than running them sequentially (10s). That’s not what you’d expect from “parallel” execution.

Where Did 40 Seconds Go?

Timeline for 100-tool invocation (40s total):

0s        5s        10s       15s       20s       25s       30s       35s       40s
│─────────│─────────│─────────│─────────│─────────│─────────│─────────│─────────│

├─ LLM #1 ─┤
│ 5.2s     │
│ Read 100 tool schemas
│ Decide to call all 100
│ Output: 100 tool_use blocks
│          │
│          ├─ Tools ─┤
│          │ ~2s     │
│          │ 6 threads, 100 tools
│          │ 17 batches × 0.1s
│          │
│          │         ├───────────── LLM #2 ──────────────────────────────┤
│          │         │ 31 seconds                                        │
│          │         │ Read 100 tool results (16,971 tokens)             │
│          │         │ Generate summary (4,454 tokens)                   │
│          │         │ THIS is where all the time goes                   │
│          │         └───────────────────────────────────────────────────┘

The tool execution itself — all 100 tools — took about 2 seconds. The other 38 seconds was the LLM reading tool schemas and processing tool results.

Finding #1: Only 6 Threads, Not 100

The diagnostics showed unique_threads: 6. Despite requesting 100 parallel tools, the ConcurrentToolExecutor inside the microVM uses a capped thread pool. The CloudWatch logs confirmed sequential-looking execution:

22:11:59.649  Tool #37: sensor_036
22:11:59.886  Tool #38: sensor_037    ← 237ms gap
22:12:00.171  Tool #39: sensor_038    ← 285ms gap
22:12:00.468  Tool #40: sensor_039    ← 297ms gap

With 6 threads and 100 tools at 0.1s each, the expected wall time is 100 ÷ 6 × 0.1s ≈ 1.7s. The measured start_spread was 1.604s, a close match. The ~250ms gaps between consecutive starts in the log come from the ConcurrentToolExecutor's event-driven backpressure (await task_event.wait()), which adds overhead per tool dispatch.
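That back-of-envelope figure is easy to check:

```python
import math

workers, tools, latency_s = 6, 100, 0.1
waves = math.ceil(tools / workers)      # 17 dispatch waves through the pool
ideal_spread_s = waves * latency_s      # ~1.7s if dispatch overhead were zero
print(waves, ideal_spread_s)
```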

Finding #2: The LLM Is the Bottleneck, Not the Infrastructure

Look at how LLM Call #2 scales with tool count:

  5 tools →  4.07s   (8,233 input tokens)
 10 tools →  4.67s   (8,677 input tokens)
 25 tools →  9.85s  (10,050 input tokens)
 50 tools → 15.41s  (12,329 input tokens)
100 tools → 31.66s  (16,971 input tokens)

Each tool result adds ~90 tokens. 100 tools = ~9,000 extra tokens. The LLM processes these linearly — there’s no way to parallelize token ingestion. This is the fundamental scaling wall: tool execution is parallelizable, but LLM processing of tool results is not.
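The ~90-tokens-per-result figure falls straight out of a least-squares fit over the measurements above:

```python
# Input tokens for LLM Call #2 at each tool count (from the numbers above)
data = {5: 8233, 10: 8677, 25: 10050, 50: 12329, 100: 16971}

xs, ys = list(data), list(data.values())
mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
# Least-squares slope = marginal input tokens per extra tool result
slope = (sum((x - mean_x) * (y - mean_y) for x, y in data.items())
         / sum((x - mean_x) ** 2 for x in xs))
print(f"~{slope:.0f} tokens per tool result")
```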

Finding #3: CPU and Memory Barely Moved

From the CloudWatch billing metrics during the test:

CPU:    0.0137 vCPU-hours ≈ 49 vCPU-seconds
        → ~0.8 vCPU average during invocation
        → Barely using the allocated 2 vCPU (mostly I/O wait)

Memory: 0.0165 GB-hours ≈ 59 GB-seconds
        → ~1.0 GB average during invocation
        → Stable, no spike — well within the 8 GB allocation

Errors:     0
Throttles:  0

The microVM was mostly idle — waiting for the LLM API to respond. CPU spiked briefly during request serialization (building 100 tool_use blocks) and response parsing (deserializing 100 tool results), but those bursts were under 1 second each.

Finding #4: Python’s GIL Doesn’t Matter Here

I expected the GIL (Global Interpreter Lock) to be a problem with 100 threads. It wasn’t — because the work is I/O-bound, not CPU-bound:

Phase 1: Build 100 requests (CPU-bound, GIL contention)
  100 × json.dumps ≈ 50ms total
  GIL serializes this, but it's so fast it doesn't matter

Phase 2: Wait for 100 tool executions (I/O-bound, GIL released)
  All threads sleeping (time.sleep releases the GIL)
  No contention — this is what threads are good at

Phase 3: Parse 100 results (CPU-bound, GIL contention)
  100 × json.loads ≈ 30ms total
  Again serialized by GIL, again too fast to matter

With 2 vCPU, the second core is wasted for CPU-bound Python work (GIL only lets one thread run Python at a time). But since 99% of the time is spent in I/O wait (LLM API calls), this doesn’t matter in practice.
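A quick way to see the I/O-bound case: ten 100ms sleeps on ten threads finish in roughly one sleep's worth of wall time, because time.sleep releases the GIL while blocked.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def io_task(_: int) -> None:
    time.sleep(0.1)  # releases the GIL while blocked, so sleeps overlap

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    list(pool.map(io_task, range(10)))
elapsed = time.perf_counter() - start

print(f"10 x 100ms sleeps took {elapsed:.2f}s")  # ~0.1s, not 1.0s
```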

Finding #5: Thread Stack Memory Is Not the Killer (Yet)

Before running this test, I calculated that 100 threads at the typical 8 MB default stack size would reserve 800 MB of address space for thread stacks alone. But the actual memory stayed at ~1 GB because:

  • The thread pool was capped at 6 threads, not 100
  • 6 threads × 8 MB = 48 MB of thread stacks — manageable
  • Tools are queued and dispatched to the fixed pool, not given one thread each

If you bypassed the ConcurrentToolExecutor and spawned 100 raw threads, you’d hit the memory wall. The executor’s thread pool cap is a silent safety valve.
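The arithmetic behind that safety valve (8 MB is the common Linux default reservation; note this is virtual address space, not necessarily resident memory):

```python
STACK_MB = 8                      # typical default per-thread stack reservation

raw_threads_mb = 100 * STACK_MB   # one thread per tool: 800 MB reserved
pooled_mb = 6 * STACK_MB          # capped executor pool: 48 MB reserved
print(raw_threads_mb, pooled_mb)
```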

Finding #6: Network Was Trivial

Per LLM call data:
  Request:  ~2-20 KB (messages + tool_config)
  Response: ~1-10 KB (streamed tokens)

  100 concurrent tools:
    Outbound: 100 × 20 KB = 2 MB
    Inbound:  streaming over ~3 sec

    Bandwidth needed: ~5 Mbps (2 MB over ~3s)
    Available in microVM: ~1-5 Gbps (virtio-net → host TAP → AWS VPC ENI)

Network utilization: <0.1%

Network is never the bottleneck for agent workloads. The payloads are tiny compared to available bandwidth.

The Three Walls of Parallel Tool Scaling

Based on this test, here’s where things actually break as you increase parallel tools:

Parallel Tools   Wall 1: Thread Pool    Wall 2: LLM Processing   Wall 3: API Rate Limits
      5          Fine (6 threads)       Fast (4s)                No issue
     10          Fine (6 threads)       Fast (5s)                No issue
     25          Batched (5 batches)    Moderate (10s)           No issue
     50          Batched (9 batches)    Slow (15s)               Possible
    100          Batched (17 batches)   Very slow (32s)          Likely

Wall 1 (thread pool cap) is a design choice, not a bug. It prevents memory explosions from unbounded thread creation.

Wall 2 (LLM token processing) is the fundamental limit. Each tool result adds tokens the LLM must read sequentially. No infrastructure improvement can fix this — it’s inherent to how LLMs work.

Wall 3 (API rate limits) didn’t trigger in our test because the tools were local (sleep), not making LLM sub-calls. If each of the 100 tools called Bedrock’s invoke_model, you’d hit rate limits around 10-50 concurrent calls depending on your account tier.
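If your tools did make their own model sub-calls, a bounded semaphore is one way to stay under account limits. This is a sketch, not Strands machinery: call_bedrock is a stub standing in for a real invoke_model call, and the cap of 10 is a placeholder you would tune to your tier.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_INFLIGHT = 10                         # tune to your account's rate limits
_gate = threading.Semaphore(MAX_INFLIGHT)

def call_bedrock(payload: str) -> str:
    """Stub standing in for a real Bedrock invoke_model call."""
    return f"echo:{payload}"

def rate_limited_tool(payload: str) -> str:
    with _gate:                           # at most MAX_INFLIGHT concurrent calls
        return call_bedrock(payload)

with ThreadPoolExecutor(max_workers=100) as pool:
    out = list(pool.map(rate_limited_tool, (f"p{i}" for i in range(100))))
```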

When Parallel Tools Actually Help

Parallel execution wins when tool latency is high and tool count is moderate:

SCENARIO A: 5 tools, each takes 3 seconds (API calls, DB queries)
  Sequential: 5 × 3s = 15s
  Parallel:   max(3s) + LLM overhead = ~10s
  Speedup: 1.5x ✓

SCENARIO B: 100 tools, each takes 0.1 seconds (local computation)
  Sequential: 100 × 0.1s = 10s
  Parallel:   2s tools + 38s LLM overhead = 40s
  Speedup: 0.25x ✗ (4x SLOWER)

SCENARIO C: 10 tools, each takes 5 seconds (sub-agent LLM calls)
  Sequential: 10 × 5s = 50s
  Parallel:   max(5s) + LLM overhead = ~15s
  Speedup: 3.3x ✓✓

The sweet spot is 5-15 slow tools. More than that and LLM processing time dominates. Fewer than that and the overhead isn’t worth it.
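The speedup figures are just the ratio of the two wall times from each scenario:

```python
# (sequential_s, parallel_s) per scenario, from the estimates above
scenarios = {
    "A: 5 slow tools":   (5 * 3.0, 10.0),
    "B: 100 fast tools": (100 * 0.1, 40.0),
    "C: 10 sub-agents":  (10 * 5.0, 15.0),
}
speedups = {name: seq / par for name, (seq, par) in scenarios.items()}
for name, s in speedups.items():
    print(f"{name}: {s:.2f}x")
```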

Practical Recommendations for AgentCore

┌─────────────────────────────────────────────────────────────────┐
│  DO                                                             │
│                                                                 │
│  ✓ Use parallel tools for 5-15 slow operations (API calls,      │
│    database queries, sub-agent calls taking 1-5s each)          │
│  ✓ Keep tool schemas small — every token in the schema is       │
│    read by the LLM on every invocation                          │
│  ✓ Return minimal tool results — 50 tokens beats 500 tokens     │
│                                                                 │
│  DON'T                                                          │
│                                                                 │
│  ✗ Create 100 tools "just in case" — the LLM reads all schemas  │
│    even if it only calls 3                                      │
│  ✗ Use parallel execution for fast tools (<100ms) — the         │
│    overhead exceeds the benefit                                  │
│  ✗ Expect linear speedup — LLM processing is sequential         │
│                                                                 │
│  RESTRUCTURE INSTEAD                                            │
│                                                                 │
│  Instead of 100 tools → 1 tool that internally batches:         │
│                                                                 │
│  @tool                                                          │
│  def read_all_sensors(ids: list) -> dict:                       │
│      results = ThreadPoolExecutor(10).map(read_sensor, ids)     │
│      return {"readings": list(results)}                         │
│                                                                 │
│  LLM sees 1 tool schema, gets 1 result back.                   │
│  Internal parallelism without LLM token overhead.               │
└─────────────────────────────────────────────────────────────────┘
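A runnable, stdlib-only version of that restructuring sketch (the @tool decorator is omitted so it runs standalone; in a Strands agent you would add it back):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def read_sensor(sensor_id: str) -> dict:
    """One underlying read; in a real agent this would hit hardware or an API."""
    return {"sensor_id": sensor_id, "value": random.uniform(20, 30)}

def read_all_sensors(sensor_ids: list) -> dict:
    """The single tool the LLM sees: fans out internally, returns one result."""
    with ThreadPoolExecutor(max_workers=10) as pool:
        readings = list(pool.map(read_sensor, sensor_ids))
    return {"readings": readings}

out = read_all_sensors([f"sensor_{i:03d}" for i in range(100)])
print(len(out["readings"]))  # 100 readings behind one tool schema
```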

Why the LLM Is the Bottleneck — Autoregressive Decoding Explained

The 31-second LLM Call #2 wasn’t a rate limit, a timeout, or a bug. It’s how transformer models fundamentally work. To understand why, you need to know what happens inside the LLM when it receives 100 tool results.

The Agent Loop That Forces Two LLM Calls

The Anthropic/Bedrock tool-use protocol requires this exact sequence:

STEP 1: Agent sends to LLM (LLM Call #1)
  Input:  system_prompt + 100 tool schemas + user message
  Tokens: ~7,700 input
  LLM decides: "I need to call all 100 sensors"
  LLM generates: 100 tool_use blocks (~258 output tokens)
  Time: ~5s

STEP 2: SDK executes 100 tools locally
  ConcurrentToolExecutor runs them (6 threads, 17 batches)
  Time: ~1.6s

STEP 3: Agent sends to LLM AGAIN (LLM Call #2)    ← BOTTLENECK
  Input:  system_prompt + 100 tool schemas + user message
          + 100 tool_use blocks (from step 1)
          + 100 toolResult blocks (from step 2)
  Tokens: ~16,971 input
  LLM generates: summary (~4,231 output tokens)
  Time: ~31s

You cannot skip Step 3. The API requires tool results to be sent back to the LLM. The LLM doesn’t know the tools succeeded until you tell it. And once you tell it, it generates a human-readable response.
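The shape of what gets sent back in Step 3 can be sketched as follows (abbreviated, with illustrative values; the field names follow Bedrock's Converse-API toolUse/toolResult content blocks):

```python
# Abbreviated shape of the conversation the SDK sends on LLM Call #2
messages = [
    {"role": "user", "content": [{"text": "Read all 100 sensors"}]},
    {"role": "assistant", "content": [
        {"toolUse": {"toolUseId": f"t{i}", "name": f"sensor_{i:03d}", "input": {}}}
        for i in range(100)
    ]},
    {"role": "user", "content": [
        {"toolResult": {"toolUseId": f"t{i}",
                        "content": [{"json": {"value": 25.0}}]}}
        for i in range(100)
    ]},
]
print(len(messages[1]["content"]), len(messages[2]["content"]))
```

Every one of those 200 blocks becomes input tokens the model must re-read.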

Prefill vs Decode: Two Very Different Phases

When the LLM receives 16,971 input tokens plus needs to generate 4,231 output tokens, two distinct phases happen on the GPU:

PHASE 1: PREFILL (reading input — ~3 seconds)
┌──────────────────────────────────────────────────────────────┐
│  Read all 16,971 input tokens                                │
│  Process through ~80 transformer layers                      │
│  Each layer: every token attends to every other token        │
│  Computation: O(n²) where n = 16,971                         │
│  = ~288 MILLION attention computations PER LAYER             │
│  × 80 layers = ~23 BILLION computations                      │
│                                                              │
│  BUT: this runs in PARALLEL on the GPU                       │
│  All tokens processed simultaneously                         │
│  Result: ~3 seconds (fast, despite huge computation)         │
└──────────────────────────────────────────────────────────────┘

PHASE 2: DECODE (generating output — ~28 seconds)
┌──────────────────────────────────────────────────────────────┐
│  Generate tokens ONE AT A TIME, sequentially:                │
│                                                              │
│  Token 1 ("##"):                                             │
│    Attend to 16,971 input + 0 output = 16,971 tokens         │
│    Through 80 layers → output "##"                           │
│                                                              │
│  Token 2 (" SENSOR"):                                        │
│    Attend to 16,971 input + 1 output = 16,972 tokens         │
│    Through 80 layers → output " SENSOR"                      │
│                                                              │
│  Token 100 ("20.0"):                                         │
│    Attend to 16,971 + 99 = 17,070 tokens                     │
│    Must SCAN all 100 toolResult blocks to find minimum        │
│                                                              │
│  Token 4,231 ("."):                                          │
│    Attend to 16,971 + 4,230 = 21,201 tokens                  │
│    Through 80 layers → output "."                            │
│                                                              │
│  CANNOT be parallelized — token N depends on tokens 1..N-1   │
│  4,231 sequential steps × ~6.6ms each = ~28 seconds          │
└──────────────────────────────────────────────────────────────┘

Every single output token re-reads the entire context. When the LLM writes “minimum temperature: 20.0°C”, it scans all 100 tool results through attention across 17,000 tokens, 80 layers deep. It’s like reading 17 pages before writing each word — the book isn’t full (200K context available), but scanning 17 pages per word is slow.
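Plugging the measured numbers into a simple two-phase model reproduces the 31 seconds:

```python
output_tokens = 4231
prefill_s = 3.0                      # whole 16,971-token prompt read in parallel
ms_per_output_token = 6.6            # sequential decode rate observed above

decode_s = output_tokens * ms_per_output_token / 1000
total_s = prefill_s + decode_s       # ~31s, matching LLM Call #2
print(f"decode={decode_s:.1f}s total={total_s:.1f}s")
```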

Why More Quota Doesn’t Help

What quota increase fixes:
  Requests per minute:  ✓ more concurrent AGENTS (not tools within one agent)
  Tokens per minute:    ✓ more concurrent AGENTS

What quota increase does NOT fix:
  Time for LLM to read 17,000 input tokens:    still ~3s
  Time for LLM to generate 4,231 output tokens: still ~28s

  Token generation is sequential — one token at a time.
  More quota lets you run more requests simultaneously.
  It doesn't make a single request faster.

Current (1 agent, 100 tools):
  Agent → LLM: "here are 100 tool results" → LLM thinks 31s → response

With 10x quota (still 1 agent, 100 tools):
  Agent → LLM: "here are 100 tool results" → LLM STILL thinks 31s → response

Where the Time Actually Goes — The Breakdown

Component                         Time    % of Total   Can We Fix It?
LLM #1 prefill (read schemas)     2s          5%       No: must read tool schemas
LLM #1 decode (tool_use blocks)   3s          8%       Partially: fewer tools = fewer blocks
Tool execution (100 tools)        1.6s        4%       Already parallel, already fast
LLM #2 prefill (read results)     3s          8%       Yes: shorter tool results = fewer tokens
LLM #2 decode (summary)           28s        75%       YES: this is the bottleneck

75% of the time is the LLM generating its summary of 100 tool results. The fix isn’t more infrastructure — it’s less output.

The Four Ways to Reduce That 31 Seconds

1. CONSTRAIN OUTPUT (biggest win)
   System prompt: "Reply ONLY with JSON: {count, min, max, avg}. Nothing else."
   Current:  4,231 output tokens → 28s decode
   Fixed:    ~20 output tokens   → <1s decode
   Savings:  ~27 seconds

2. FEWER TOOL RESULTS (reduce input)
   Split: 10 agents × 10 tools instead of 1 agent × 100 tools
   Each agent: ~2,000 input tokens → ~5s total
   All 10 run in parallel → ~5s wall time (not 40s)

3. SMALLER TOOL RESULTS (reduce input tokens per result)
   Current: {"sensor_id": "sensor_042", "value": 25.3, "unit": "celsius", ...}
   Minimal: "042:25.3"
   100 results × ~60 fewer tokens = 6,000 fewer input tokens
   Saves ~3-4 seconds on prefill

4. FASTER MODEL (trade capability for speed)
   Claude Haiku: ~2ms/token vs Sonnet's ~7ms/token
   31s → ~10s. But less capable tool selection.
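Fix #1 as a concrete sketch: a system prompt (hypothetical wording) that forces a tiny machine-readable reply, plus the decode time it would save at the observed rate.

```python
# Hypothetical system prompt implementing fix #1
SYSTEM_PROMPT = (
    "After calling tools, reply ONLY with one JSON object: "
    '{"count": int, "min": float, "max": float, "avg": float}. '
    "No prose, no markdown, nothing else."
)

ms_per_output_token = 6.6
saved_s = (4231 - 20) * ms_per_output_token / 1000  # tokens avoided x decode rate
print(f"~{saved_s:.0f}s of decode time saved")
```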

The Surprising Conclusion

AgentCore’s Firecracker microVM handled 100 parallel tools without breaking a sweat — 0.8 vCPU average, 1 GB memory, zero errors. The infrastructure is not the bottleneck. The LLM is. Processing 100 tool schemas and 100 tool results costs ~29,000 tokens and 31 seconds of LLM time. The actual tool execution took 2 seconds.

The bottleneck isn’t context window size, API rate limits, CPU, memory, or network. It’s autoregressive decoding — the LLM generates tokens one at a time, and 4,231 tokens at ~6.6ms each equals 28 seconds. No amount of infrastructure scaling changes that. The fix is architectural: fewer tools with batch operations, constrained output, or splitting work across multiple agents.

If you’re designing an agent with many tools, the optimization target isn’t the runtime infrastructure — it’s minimizing the tokens the LLM has to process. Fewer tools with batch operations inside them will always outperform many tools called in parallel.

