A visual, simulation-driven explanation of the two phases of LLM inference, why they have fundamentally different hardware needs, and how separating them unlocks massive performance gains.
When you type "Explain quantum computing" into ChatGPT, it feels instant. But under the hood, the GPU does two fundamentally different jobs, one after the other:
That's it. Every single LLM inference call — whether it's GPT-4, Llama, DeepSeek, or Claude — goes through these exact two phases. They're not optional. But they're wildly different in how they use the GPU.
Think of it like reading a book vs writing a book. Reading (prefill) = you consume the entire input at once, building understanding. Writing (decode) = you produce words one at a time, each depending on everything before it. You can read fast because it's all there. Writing is slow because each word depends on the last.
Source: This two-phase architecture is described in the original Transformer paper "Attention Is All You Need" (Vaswani et al., 2017) [arxiv.org] and is the standard in all modern autoregressive LLMs.
During prefill, the model takes all your input tokens and processes them through every transformer layer simultaneously in a single forward pass. Here's what that means step by step:
["Explain", " quantum", " computing"]. For a typical prompt, this might be 100-4000 tokens.[seq_len, hidden_dim].softmax(QK^T / sqrt(d)) * V. For N tokens, this is O(N²) — quadratic in sequence length.Key insight: Prefill processes all N input tokens in parallel. The GPU's thousands of cores are fully utilized doing massive matrix multiplications. This makes prefill compute-bound — the bottleneck is raw FLOPS (floating-point operations per second), not memory.
For each token, at each layer, the model stores a Key vector and a Value vector. The total KV cache size is:

KV cache bytes = 2 (K and V) × layers × kv_heads × head_dim × seq_len × bytes_per_param
For Llama-3-70B with 4096 tokens in FP16:
| Parameter | Value |
|---|---|
| Layers | 80 |
| KV Heads (GQA) | 8 |
| Head dimension | 128 |
| Sequence length | 4,096 |
| Bytes per param (FP16) | 2 |
| Total KV cache | 2 × 80 × 8 × 128 × 4096 × 2 = ~1.34 GB per request |
With 50 concurrent users, that's 67 GB of KV cache alone — nearly an entire H100's memory just for cache, before the model weights even load.
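The table's arithmetic is easy to reproduce. A small sketch, using the same parameters and the decimal-GB convention the table appears to use:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_param=2):
    """KV cache for one request: 2 (K and V) per layer, per KV head,
    per head dimension, per token, times the width of the datatype."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_param

# Llama-3-70B with GQA, 4096-token prompt, FP16 (the numbers from the table above)
per_request = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=4096)
print(f"{per_request / 1e9:.2f} GB per request")        # ~1.34 GB
print(f"{50 * per_request / 1e9:.0f} GB for 50 users")   # ~67 GB
```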
Source: KV cache sizing analysis from "Efficient Memory Management for Large Language Model Serving with PagedAttention" (Kwon et al., vLLM paper, 2023) [arxiv.org]
Watch all input tokens get processed in parallel, building the KV cache:
After prefill, decode takes over. It generates the response one token at a time. Each step:

1. Feed only the single newest token through the model.
2. Read the entire KV cache (every K/V pair from the prompt and all tokens generated so far) to compute attention.
3. Append the new token's K and V vectors to the cache.
4. Sample the next token, then repeat until the response is done.
Key insight: Each decode step does very little computation (just 1 token through the model) but reads a huge amount of data (the entire KV cache + all model weights). This makes decode memory-bandwidth-bound — the bottleneck is how fast you can read from GPU memory (HBM), not raw FLOPS.
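Continuing the toy NumPy sketch from the prefill section (same caveats: single head, random numbers standing in for real embeddings), a decode step looks like this. The new token contributes one tiny matmul, while the entire cache has to be read back and then grows by one entry:

```python
import numpy as np

def decode_step(x_new, K_cache, V_cache, Wq, Wk, Wv):
    """Toy single-head decode step: one new token attends over the whole cache."""
    q = x_new @ Wq                                   # [1, d]: a tiny amount of compute
    k, v = x_new @ Wk, x_new @ Wv
    K = np.concatenate([K_cache, k])                 # cache grows by one entry...
    V = np.concatenate([V_cache, v])                 # ...and ALL of it must be read back
    scores = q @ K.T / np.sqrt(x_new.shape[-1])      # [1, t]: the last token sees everything
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V, K, V

rng = np.random.default_rng(0)
d = 64
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
K_cache = rng.standard_normal((8, d))                # pretend prefill already cached 8 tokens
V_cache = rng.standard_normal((8, d))
for _ in range(4):                                   # strictly sequential: no parallelism here
    x_new = rng.standard_normal((1, d))              # stand-in for the last generated token
    out, K_cache, V_cache = decode_step(x_new, K_cache, V_cache, Wq, Wk, Wv)
print(K_cache.shape)                                 # (12, 64): the cache grew by 4
```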
Imagine a librarian writing a report. For each new sentence (1 token), they must walk to the library (KV cache), read every single book they've referenced so far (all cached K/V pairs), walk back to their desk, write one sentence, then repeat. The walking (memory reads) takes 10x longer than the writing (compute). The library keeps growing with each sentence.
Arithmetic intensity = FLOPs / Bytes read from memory. It measures how much compute you do per byte of data loaded.
Prefill: high arithmetic intensity. Processing N tokens means N×N attention matrices. Lots of compute per memory read. GPU cores stay busy. Compute utilization: 60-80%.
Decode: low arithmetic intensity. Processing 1 token means reading the full KV cache + weights but doing minimal math. GPU cores sit idle waiting for data. Compute utilization: 5-15%.
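A back-of-the-envelope sketch makes the gap concrete. It only counts the attention matmuls and the K/V reads for a single layer (illustrative numbers, not a full roofline model), but the ratio tells the story: in this simplified model, intensity is roughly the number of new tokens being processed.

```python
def attention_intensity(new_tokens, cached_tokens, hidden=8192, bytes_per=2):
    """Rough FLOPs-per-byte of one attention layer (illustrative, not a full roofline)."""
    total = new_tokens + cached_tokens
    flops = 2 * new_tokens * total * hidden * 2      # QK^T plus attention-times-V matmuls
    bytes_read = total * hidden * 2 * bytes_per      # K and V read from HBM
    return flops / bytes_read

print(f"prefill (4096 new, 0 cached): {attention_intensity(4096, 0):,.0f} FLOPs/byte")
print(f"decode  (1 new, 4096 cached): {attention_intensity(1, 4096):,.1f} FLOPs/byte")
```

For reference, an H100 offers on the order of 1,000 dense FP16 TFLOPS against roughly 3.35 TB/s of HBM bandwidth, so anything much below about 300 FLOPs per byte leaves the compute units waiting on memory. Decode sits far below that line; prefill sits far above it.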
Source: Analysis from "Splitwise: Efficient generative LLM inference using phase splitting" (Patel et al., 2024, Microsoft Research) [arxiv.org]. Also see "DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving" (Zhong et al., 2024) [arxiv.org]
Watch tokens get generated one at a time. Notice how GPU compute is barely used while memory bandwidth maxes out:
Now you understand: prefill wants max compute (FLOPS), decode wants max memory bandwidth. When you run both on the same GPU, they fight:
- Prefill blocks decode. A long prefill (e.g., an 8K-token prompt) takes hundreds of milliseconds. During that time, all decode iterations for other users are paused. Users see "stuttering": tokens flowing, then freezing, then flowing again.
- Decode wastes compute. Decode uses ~10% of GPU compute. The other 90% of FLOPS are wasted, but you can't use them for more prefill because decode is occupying the GPU's memory bandwidth.
- KV cache crowds out prefill. Each active decode holds its KV cache in GPU memory. As more users join, cache eats the memory needed for prefill's compute buffers. You hit OOM faster.
- Conflicting scaling strategies. Prefill benefits from high tensor parallelism (more GPUs = faster). Decode benefits from high batch size (more users per GPU = better utilization). You can't optimize for both simultaneously.
It's like forcing a sprinter and a marathon runner to share the same pair of legs. The sprinter (prefill) needs explosive power in short bursts. The marathon runner (decode) needs steady endurance over a long distance. They have completely different training regimens. Making them share the same body means neither performs at their peak.
Source: Interference between prefill and decode is analyzed in "Sarathi: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills" (Agrawal et al., 2023) [arxiv.org] and NVIDIA Dynamo's design documentation [github.com]
The insight is simple: don't run prefill and decode on the same GPU. Give each its own dedicated hardware, optimized for its specific needs.
Source: Performance numbers from NVIDIA Dynamo benchmarks [nvidia.com], Splitwise paper [arxiv.org], and DistServe [arxiv.org]. Single-node ~30% improvement per GPU comes from NVIDIA Dynamo design docs [github.com]
Now the sprinter and marathon runner each get their own body. The sprinter (prefill GPUs) does explosive bursts, finishes, and starts the next sprint immediately. The marathon runner (decode GPUs) maintains a steady pace with many users simultaneously. A relay handoff (NIXL KV transfer) connects them. Both perform at their peak.
Watch 8 requests processed under both strategies. Pay attention to total completion time and GPU utilization.
AGGREGATED — 4 GPUs each doing both prefill + decode:
DISAGGREGATED — 2 prefill GPUs + 2 decode GPUs:
Doesn't transferring the KV cache from the prefill GPU to the decode GPU add overhead? Yes, and this is the key tradeoff: disaggregation adds a KV transfer step that doesn't exist in aggregated serving. So when is it worth it? The table below shows approximate transfer times for typical cache sizes over common interconnects:
| KV cache size | NVLink (900 GB/s) | InfiniBand (50 GB/s) | RoCE (25 GB/s) |
|---|---|---|---|
| 500 MB (short prompt) | ~0.6 ms | ~10 ms | ~20 ms |
| 1.5 GB (medium prompt) | ~1.7 ms | ~30 ms | ~60 ms |
| 5 GB (long prompt) | ~5.6 ms | ~100 ms | ~200 ms |
The rule: Disaggregation is beneficial when the time saved by specialization (better utilization of both prefill and decode GPUs) exceeds the time added by KV transfer. With NVLink (same node), the transfer cost is negligible (~1ms). Cross-node via InfiniBand, it's still worthwhile for most workloads. This is exactly what NVIDIA's NIXL library optimizes.
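The table rows are just size divided by link bandwidth. A quick sketch that reproduces the 1.5 GB row and applies the rule above (the `prefill_time_saved_ms` figure is a made-up placeholder, not a measured number):

```python
def transfer_ms(kv_cache_gb, link_gb_per_s):
    """Time to move a KV cache over an interconnect, ignoring protocol overhead."""
    return kv_cache_gb / link_gb_per_s * 1000

links = {"NVLink": 900, "InfiniBand": 50, "RoCE": 25}       # GB/s, as in the table
for name, bw in links.items():
    print(f"1.5 GB over {name}: {transfer_ms(1.5, bw):.1f} ms")

# The rule: disaggregate only if the transfer costs less than specialization saves.
prefill_time_saved_ms = 40          # placeholder: gain from dedicated prefill GPUs
worth_it = transfer_ms(1.5, links["InfiniBand"]) < prefill_time_saved_ms
print("disaggregate?", worth_it)    # True: 30 ms transfer < 40 ms saved
```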
When NOT to disaggregate: Very short prompts (<100 tokens) with very short responses (<50 tokens) on slow networks. The transfer overhead may exceed the benefit. NVIDIA Dynamo's Planner can dynamically decide whether to disaggregate or keep aggregated based on current conditions.
Source: Transfer latency analysis from NIXL documentation [github.com] and "Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving" (Qin et al., 2024) [arxiv.org]
Dynamo isn't just disaggregation. It's a complete system with 5 components working together:
Instead of random load balancing, the router maintains a global radix tree of cached token prefixes across all GPUs. When a request arrives, it finds the GPU with the most overlapping cached tokens and routes there — saving prefill time by reusing existing KV cache.
Result: 3x faster TTFT, 2x lower average latency.
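Here is a toy version of the idea, with a plain longest-common-prefix scan standing in for the global radix tree, and made-up worker names and token IDs:

```python
def common_prefix_len(a, b):
    """Number of leading tokens two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(request_tokens, worker_caches):
    """Pick the worker whose cached sequences overlap the request the most."""
    def best_overlap(cached_seqs):
        return max((common_prefix_len(request_tokens, s) for s in cached_seqs), default=0)
    return max(worker_caches, key=lambda w: best_overlap(worker_caches[w]))

worker_caches = {                                   # token IDs already cached on each GPU
    "gpu-0": [[101, 7, 42, 9, 13]],                 # a different conversation
    "gpu-1": [[101, 7, 55, 3, 8, 21, 30]],          # shares a long prefix with the request
}
request = [101, 7, 55, 3, 8, 99]                    # new turn of the same conversation
print(route(request, worker_caches))                # gpu-1: only the last tokens need prefill
```

A production router weighs more than prefix overlap (load, cache capacity, and so on), but the core decision is the same: send the request where the most prefill work is already done.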
NIXL handles the handoff itself: it automatically selects the fastest available transport (NVLink > InfiniBand > RoCE > TCP) and performs the KV cache transfer with zero-copy operations where possible.
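Conceptually, the selection is just a priority list. A trivial sketch of that idea (not NIXL's actual API):

```python
def pick_transport(available):
    """Prefer the fastest path both endpoints support (priority order from the text above)."""
    for transport in ("nvlink", "infiniband", "roce", "tcp"):
        if transport in available:
            return transport
    raise RuntimeError("no usable transport between prefill and decode workers")

print(pick_transport({"roce", "tcp"}))           # roce
print(pick_transport({"nvlink", "infiniband"}))  # nvlink
```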
When GPU memory fills up, KVBM automatically moves cold KV cache to CPU RAM → SSD → remote storage, and fetches it back when needed. This means you can serve 10x more concurrent users than GPU memory alone would allow.
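A crude sketch of the tiering idea, using an LRU policy and a hypothetical `TieredKVCache` class (not KVBM's actual interface): cold blocks get demoted a level when a tier fills up, and a hit anywhere promotes the block back to GPU memory.

```python
from collections import OrderedDict

class TieredKVCache:
    """Crude sketch of memory tiering: demote least-recently-used blocks downward."""
    def __init__(self, gpu_blocks, cpu_blocks):
        self.tiers = {"gpu": OrderedDict(), "cpu": OrderedDict(), "ssd": OrderedDict()}
        self.capacity = {"gpu": gpu_blocks, "cpu": cpu_blocks, "ssd": float("inf")}

    def put(self, block_id, data, tier="gpu"):
        self.tiers[tier][block_id] = data
        self.tiers[tier].move_to_end(block_id)           # mark as most recently used
        if len(self.tiers[tier]) > self.capacity[tier]:  # over budget: demote the coldest block
            cold_id, cold_data = self.tiers[tier].popitem(last=False)
            self.put(cold_id, cold_data, {"gpu": "cpu", "cpu": "ssd"}[tier])

    def get(self, block_id):
        for tier in ("gpu", "cpu", "ssd"):               # fetch and promote back to GPU
            if block_id in self.tiers[tier]:
                data = self.tiers[tier].pop(block_id)
                self.put(block_id, data, "gpu")
                return data
        return None

cache = TieredKVCache(gpu_blocks=2, cpu_blocks=2)
for i in range(5):
    cache.put(f"req-{i}", data=b"kv-block")              # old requests spill to CPU, then SSD
print({tier: list(blocks) for tier, blocks in cache.tiers.items()})
```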
The Planner monitors TTFT (time to first token) and ITL (inter-token latency) against SLA targets. If TTFT is too slow, it allocates more GPUs to prefill. If ITL is too slow, it adds decode GPUs. If load drops, it scales down.
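The decision logic can be sketched in a few lines. The SLO numbers below are made-up placeholders and the real Planner presumably weighs richer signals, but the shape is the same:

```python
def plan(ttft_ms, itl_ms, ttft_slo_ms=200, itl_slo_ms=50, headroom=0.5):
    """Sketch of SLA-driven scaling: grow the pool missing its target,
    shrink when both metrics have comfortable headroom."""
    if ttft_ms > ttft_slo_ms:
        return "add prefill GPU"           # prompts are queueing: first token too slow
    if itl_ms > itl_slo_ms:
        return "add decode GPU"            # the token stream is stuttering
    if ttft_ms < ttft_slo_ms * headroom and itl_ms < itl_slo_ms * headroom:
        return "scale down"                # load dropped: release hardware
    return "hold"

print(plan(ttft_ms=350, itl_ms=30))        # add prefill GPU
print(plan(ttft_ms=120, itl_ms=80))        # add decode GPU
print(plan(ttft_ms=60,  itl_ms=15))        # scale down
```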
All of this works with SGLang, TensorRT-LLM, or vLLM. You pick the engine; Dynamo provides the orchestration.
Source: NVIDIA Dynamo architecture documentation [github.com] and NVIDIA Dynamo product page [nvidia.com]
| | Prefill | Decode |
|---|---|---|
| What it does | Processes entire prompt at once | Generates tokens one at a time |
| Bottleneck | Compute (FLOPS) | Memory bandwidth (GB/s) |
| GPU utilization | 60-90% compute | 5-15% compute, 80%+ bandwidth |
| Parallelism | All tokens in parallel | Strictly sequential |
| Duration | One-time (milliseconds) | Ongoing (seconds) |
| Output | KV cache + first token | All subsequent tokens |
| Optimization | More FLOPS, higher TP | Higher batch size, more bandwidth |
| Metric | TTFT (time to first token) | ITL (inter-token latency) |
Bottom line: Prefill and decode are fundamentally different workloads that happen to be part of the same inference call. Running them on the same GPU is a compromise. Disaggregation removes that compromise, giving each phase dedicated hardware optimized for its specific bottleneck. The cost is KV cache transfer, which modern interconnects (NVLink, InfiniBand) make negligible. This is why disaggregated serving — and NVIDIA Dynamo — represent the future of LLM inference at scale.
Every claim on this page can be verified from the sources cited inline above.
Built as an educational resource. All data sourced from published research and official documentation.