LLM Inference Explained

Prefill vs Decode, and Why Disaggregation Matters

A visual, simulation-driven explanation of the two phases of LLM inference, why they have fundamentally different hardware needs, and how separating them unlocks massive performance gains.

Chapter 1

What actually happens when you send a prompt to an LLM?

When you type "Explain quantum computing" into ChatGPT, it feels instant. But under the hood, the GPU does two fundamentally different jobs, one after the other:

The Two Phases of Every LLM Request
Your prompt ("Explain quantum computing") → PREFILL (process all input tokens at once) → KV Cache (stored in GPU memory) → DECODE (generate tokens one-by-one) → Response ("Quantum computing uses...")

That's it. Every single LLM inference call — whether it's GPT-4, Llama, DeepSeek, or Claude — goes through these exact two phases. They're not optional. But they're wildly different in how they use the GPU.

Think of it like reading a book vs writing a book. Reading (prefill) = you consume the entire input at once, building understanding. Writing (decode) = you produce words one at a time, each depending on everything before it. You can read fast because it's all there. Writing is slow because each word depends on the last.

Source: The autoregressive Transformer architecture that gives rise to these two phases was introduced in "Attention Is All You Need" (Vaswani et al., 2017) [arxiv.org]; the prefill/decode split is standard across all modern autoregressive LLMs.

Chapter 2

Prefill: Processing your entire prompt in one shot

What happens technically

During prefill, the model takes all your input tokens and processes them through every transformer layer simultaneously in a single forward pass. Here's what that means step by step:

1. Tokenization
Your text "Explain quantum computing" gets split into tokens: ["Explain", " quantum", " computing"]. For a typical prompt, this might be 100-4000 tokens.
2. Embedding lookup
Each token is converted to a high-dimensional vector (e.g., 8192 dimensions for a 70B model). All tokens are batched into a matrix of shape [seq_len, hidden_dim].
3. Self-Attention (the expensive part)
At each layer, every token attends to every other token. This produces Query, Key, and Value matrices. The attention computation is softmax(QK^T / sqrt(d)) * V. For N tokens, this is O(N²) — quadratic in sequence length.
4. KV Cache storage
The Key and Value vectors for every token at every layer are saved to GPU memory. This is the "KV cache" — it's what lets decode skip recomputing everything.
5. Output: first token prediction
After the final layer, a logit vector is produced and sampled to get the first generated token. This is "Time To First Token" (TTFT).

Key insight: Prefill processes all N input tokens in parallel. The GPU's thousands of cores are fully utilized doing massive matrix multiplications. This makes prefill compute-bound — the bottleneck is raw FLOPS (floating-point operations per second), not memory.
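To make the parallelism concrete, here is a minimal NumPy sketch of a single-layer, single-head prefill pass (toy dimensions, random weights, no real model): all N prompt tokens go through the projections and the N×N causal attention in one shot, and the resulting K and V matrices are what gets kept as the KV cache.

```python
import numpy as np

def prefill(x, Wq, Wk, Wv):
    """Toy single-head prefill: attend over all N prompt tokens in one pass.

    x: [N, d] embedded prompt tokens; Wq/Wk/Wv: [d, d] projection weights.
    Returns the attention output plus the (K, V) tensors kept as the KV cache.
    """
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                  # one big matmul per projection
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # [N, N]: the O(N^2) part
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)          # causal mask: token i attends to 0..i only
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V, (K, V)                        # output [N, d], KV cache for this layer

rng = np.random.default_rng(0)
N, d = 9, 64                                          # 9 prompt tokens, toy hidden size
x = rng.standard_normal((N, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out, kv_cache = prefill(x, Wq, Wk, Wv)
print(out.shape, kv_cache[0].shape)                   # (9, 64) (9, 64)
```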

The math behind KV cache size

For each token, at each layer, the model stores a Key vector and a Value vector. The total KV cache size is:

KV Cache = 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_param

For Llama-3-70B with 4096 tokens in FP16:

Parameter | Value
Layers | 80
KV heads (GQA) | 8
Head dimension | 128
Sequence length | 4,096
Bytes per param (FP16) | 2
Total KV cache | 2 × 80 × 8 × 128 × 4,096 × 2 = ~1.34 GB per request

With 50 concurrent users, that's 67 GB of KV cache alone — nearly an entire H100's memory just for cache, before the model weights even load.
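For reference, a tiny helper that implements the formula above (the function name and defaults are just for illustration):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_param=2):
    """KV cache = 2 (K and V) x layers x KV heads x head_dim x seq_len x bytes/param."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_param

# Llama-3-70B-style configuration (GQA with 8 KV heads), 4,096-token prompt, FP16
per_request = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=4096)
print(f"{per_request / 1e9:.2f} GB per request")        # 1.34 GB
print(f"{50 * per_request / 1e9:.1f} GB for 50 users")  # 67.1 GB
```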

Source: KV cache sizing analysis from "Efficient Memory Management for Large Language Model Serving with PagedAttention" (Kwon et al., vLLM paper, 2023) [arxiv.org]

Simulation: Watch prefill process your tokens

Prefill Simulation

Watch all input tokens get processed in parallel, building the KV cache:

[Interactive simulation: the nine input tokens ("Explain quantum computing in simple terms for a beginner") are processed in parallel, each adding a K,V entry to the KV cache, with live gauges for GPU compute utilization and memory bandwidth usage.]
Chapter 3

Decode: Generating tokens one at a time

What happens technically

After prefill, decode takes over. It generates the response one token at a time. Each step:

1. Take the last generated token
Only the single new token goes through the model (not the whole sequence again).
2. Compute attention against the ENTIRE KV cache
The new token's Query vector attends to all previous Keys and Values stored in the cache. This means reading the entire KV cache from GPU memory for every single token.
3. Append new K,V to cache
The new token's Key and Value are added to the cache, making it one entry longer.
4. Predict next token
Sample from the output logits to get the next token. Then repeat from step 1.

Key insight: Each decode step does very little computation (just 1 token through the model) but reads a huge amount of data (the entire KV cache + all model weights). This makes decode memory-bandwidth-bound — the bottleneck is how fast you can read from GPU memory (HBM), not raw FLOPS.

Imagine a librarian writing a report. For each new sentence (1 token), they must walk to the library (KV cache), read every single book they've referenced so far (all cached K/V pairs), walk back to their desk, write one sentence, then repeat. The walking (memory reads) takes 10x longer than the writing (compute). The library keeps growing with each sentence.
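Continuing the toy NumPy sketch from the prefill chapter, a single decode step might look like the following: one token's worth of matmuls, but an attention read over the entire (and growing) KV cache.

```python
import numpy as np

def decode_step(x_new, kv_cache, Wq, Wk, Wv):
    """Toy single-head decode step: one new token attends to the whole cache.

    x_new: [1, d] embedding of the last generated token.
    kv_cache: (K, V), each [T, d], accumulated during prefill and earlier decode steps.
    """
    K, V = kv_cache
    q = x_new @ Wq                                   # tiny compute: a single token
    K = np.concatenate([K, x_new @ Wk], axis=0)      # cache grows by one entry
    V = np.concatenate([V, x_new @ Wv], axis=0)
    scores = q @ K.T / np.sqrt(q.shape[-1])          # [1, T+1]: reads the ENTIRE cache
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V, (K, V)                             # output [1, d], updated cache

rng = np.random.default_rng(0)
d = 64
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
cache = (rng.standard_normal((9, d)), rng.standard_normal((9, d)))   # pretend prefill output
x = rng.standard_normal((1, d))
for _ in range(8):                                   # generate 8 tokens, strictly one at a time
    x, cache = decode_step(x, cache, Wq, Wk, Wv)     # stand-in for "embed the sampled token"
print(cache[0].shape)                                # (17, 64): 9 prefill + 8 decoded entries
```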

Why decode is slow: the arithmetic intensity problem

Arithmetic intensity = FLOPs / Bytes read from memory. It measures how much compute you do per byte of data loaded.

Prefill

High arithmetic intensity. Processing N tokens means N×N attention matrices. Lots of compute per memory read. GPU cores stay busy. Compute utilization: 60-80%.

Decode

Low arithmetic intensity. Processing 1 token means reading the full KV cache + weights but doing minimal math. GPU cores are idle waiting for data. Compute utilization: 5-15%.

Prefill: ~N² FLOPs, reads ~N data → intensity ~ N (high)
Decode: ~N FLOPs, reads ~N data → intensity ~ 1 (low)
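A back-of-envelope calculation makes the gap concrete. The sketch below uses a deliberately crude cost model (FLOPs ≈ 2 × parameters × tokens, bytes ≈ weights read once plus KV cache read once, activations and attention FLOPs ignored), so treat the outputs as orders of magnitude rather than measurements. For scale, an H100's roofline crossover sits at roughly a few hundred FLOPs per byte (around 1,000 TFLOPS of FP16 over ~3.35 TB/s of HBM bandwidth), so prefill lands comfortably above it and decode far below.

```python
def arithmetic_intensity(tokens, params=70e9, bytes_per_param=2, kv_cache_bytes=0.0):
    """Rough FLOPs-per-byte estimate for one forward pass over `tokens` tokens.

    FLOPs ~ 2 * params * tokens (one multiply-add per weight per token);
    bytes ~ model weights read once + KV cache read once.
    """
    flops = 2 * params * tokens
    bytes_read = params * bytes_per_param + kv_cache_bytes
    return flops / bytes_read

# Prefill: 4,096 tokens share one pass over the weights -> high intensity
print(f"prefill: ~{arithmetic_intensity(4096):.0f} FLOPs/byte")                      # ~4096
# Decode: 1 token, and it must also stream a ~1.34 GB KV cache -> ~1 FLOP/byte
print(f"decode:  ~{arithmetic_intensity(1, kv_cache_bytes=1.34e9):.1f} FLOPs/byte")  # ~1.0
```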

Source: Analysis from "Splitwise: Efficient generative LLM inference using phase splitting" (Patel et al., 2024, Microsoft Research) [arxiv.org]. Also see "DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving" (Zhong et al., 2024) [arxiv.org]

Simulation: Watch decode generate tokens

Decode Simulation

Watch tokens get generated one at a time. Notice how GPU compute is barely used while memory bandwidth maxes out:

[Interactive simulation: the response tokens ("Quantum computing uses qubits instead of classical bits") are generated one at a time while the KV cache grows from its 9 prefill entries, with live gauges for GPU compute utilization and memory bandwidth usage.]
Chapter 4

The conflict: Why mixing them on one GPU is terrible

Now you understand: prefill wants max compute (FLOPS), decode wants max memory bandwidth. When you run both on the same GPU, they fight:

Problem 1: Prefill starves decode

A long prefill (e.g., 8K token prompt) takes hundreds of milliseconds. During that time, ALL decode iterations for other users are paused. Users see "stuttering" — tokens flowing, then freezing, then flowing again.

Problem 2: Decode wastes prefill capacity

Decode uses ~10% of GPU compute. The other 90% of FLOPS are wasted. But you can't use them for more prefill because decode is occupying the GPU's memory bandwidth.

Problem 3: KV cache memory pressure

Each active decode holds KV cache in GPU memory. As more users join, cache eats memory needed for prefill's compute buffers. You hit OOM faster.

Problem 4: Can't optimize for both

Prefill benefits from high tensor parallelism (more GPUs = faster). Decode benefits from high batch size (more users per GPU = better utilization). You can't optimize for both simultaneously.

It's like forcing a sprinter and a marathon runner to share the same pair of legs. The sprinter (prefill) needs explosive power in short bursts. The marathon runner (decode) needs steady endurance over a long distance. They have completely different training regimens. Making them share the same body means neither performs at their peak.

Source: Interference between prefill and decode is analyzed in "Sarathi: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills" (Agrawal et al., 2023) [arxiv.org] and NVIDIA Dynamo's design documentation [github.com]

Chapter 5

The solution: Disaggregated serving

The insight is simple: don't run prefill and decode on the same GPU. Give each its own dedicated hardware, optimized for its specific needs.

Disaggregated Architecture
Request → Router → Prefill GPU (compute-optimized) → KV Transfer (via NIXL) → Decode GPU (bandwidth-optimized) → Tokens

How it works, step by step

1. Request arrives at the Router
The router checks which prefill GPU is available and which decode GPU has the most relevant KV cache already loaded.
2. Prefill GPU processes the prompt
A dedicated compute-optimized GPU runs the full forward pass, generating the KV cache. Since it ONLY does prefill, it can run at max FLOPS with no decode interference.
3. KV cache transfers via NIXL
The KV cache (potentially GBs of data) is moved from the prefill GPU to a decode GPU using the fastest available transport: NVLink (900 GB/s), InfiniBand (400 Gb/s), or PCIe.
4. Decode GPU generates tokens
A bandwidth-optimized GPU takes over. It loads the transferred KV cache and generates tokens one by one. Since it ONLY does decode, it can batch many users together for better bandwidth utilization.
5. Prefill GPU is immediately free
While decode is running (could be seconds), the prefill GPU is already processing the NEXT request. No idle time. Max throughput.
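The control flow is easier to see in code. The sketch below is a deliberately simplified, in-process stand-in: the worker classes, the queue-based "router", and the placeholder KV blob are all hypothetical, and the NIXL transfer is reduced to handing a Python object from one worker to the other.

```python
from queue import Queue

class PrefillWorker:
    """Hypothetical stand-in for a compute-optimized prefill engine."""
    def run(self, prompt_tokens):
        kv_cache = {"layers": 80, "tokens": len(prompt_tokens)}   # placeholder KV blob
        first_token = "<tok>"                                     # sampled after prefill
        return kv_cache, first_token

class DecodeWorker:
    """Hypothetical stand-in for a bandwidth-optimized decode engine."""
    def run(self, kv_cache, first_token, max_new_tokens):
        tokens = [first_token]
        for _ in range(max_new_tokens - 1):      # each step reads the whole (growing) cache
            tokens.append("<tok>")
        return tokens

def serve(prompt_tokens, prefill_pool, decode_pool, max_new_tokens=128):
    p = prefill_pool.get()                   # 1. router picks a free prefill GPU
    kv, tok0 = p.run(prompt_tokens)          # 2. compute-bound prefill builds the KV cache
    prefill_pool.put(p)                      # 5. prefill GPU is immediately free for the NEXT request
    d = decode_pool.get()                    # 3. "transfer" the KV cache to a decode GPU
    out = d.run(kv, tok0, max_new_tokens)    # 4. bandwidth-bound decode generates the response
    decode_pool.put(d)
    return out

prefill_pool, decode_pool = Queue(), Queue()
prefill_pool.put(PrefillWorker())
decode_pool.put(DecodeWorker())
print(len(serve(list(range(100)), prefill_pool, decode_pool)))    # 128 generated tokens
```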

Why this is faster: the numbers

~90% prefill GPU compute utilization (was ~60%)
~85% decode GPU bandwidth utilization (was ~40%)
2x throughput per GPU (single node)
15x total throughput (GB200 NVL72)

Source: Performance numbers from NVIDIA Dynamo benchmarks [nvidia.com], Splitwise paper [arxiv.org], and DistServe [arxiv.org]. Single-node ~30% improvement per GPU comes from NVIDIA Dynamo design docs [github.com]

Now the sprinter and marathon runner each get their own body. The sprinter (prefill GPUs) does explosive bursts, finishes, and starts the next sprint immediately. The marathon runner (decode GPUs) maintains a steady pace with many users simultaneously. A relay handoff (NIXL KV transfer) connects them. Both perform at their peak.

Chapter 6

Simulation: Aggregated vs Disaggregated

Watch 8 requests processed under both strategies. Pay attention to total completion time and GPU utilization.

Head-to-Head Comparison

[Interactive simulation: an aggregated setup (4 GPUs, each doing both prefill and decode) runs side by side with a disaggregated setup (2 prefill GPUs + 2 decode GPUs), with live readouts for aggregated time, disaggregated time, and speedup.]
Chapter 7

But wait — doesn't transferring KV cache add latency?

Yes. This is the key tradeoff. Disaggregation adds a KV transfer step that doesn't exist in aggregated serving. So when is it worth it?

KV cache size | NVLink (900 GB/s) | InfiniBand (50 GB/s) | RoCE (25 GB/s)
500 MB (short prompt) | ~0.6 ms | ~10 ms | ~20 ms
1.5 GB (medium prompt) | ~1.7 ms | ~30 ms | ~60 ms
5 GB (long prompt) | ~5.6 ms | ~100 ms | ~200 ms

The rule: Disaggregation is beneficial when the time saved by specialization (better utilization of both prefill and decode GPUs) exceeds the time added by KV transfer. With NVLink (same node), the transfer cost is negligible (~1ms). Cross-node via InfiniBand, it's still worthwhile for most workloads. This is exactly what NVIDIA's NIXL library optimizes.

When NOT to disaggregate: Very short prompts (<100 tokens) with very short responses (<50 tokens) on slow networks. The transfer overhead may exceed the benefit. NVIDIA Dynamo's Planner can dynamically decide whether to disaggregate or keep aggregated based on current conditions.
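A toy version of that decision rule: transfer time is modeled as size divided by link bandwidth (protocol overhead and setup latency ignored), and the specialization savings is a made-up input here rather than something measured the way Dynamo's Planner would.

```python
def kv_transfer_ms(kv_cache_gb, link_gb_per_s):
    """Transfer time = size / bandwidth (protocol overhead and setup latency ignored)."""
    return kv_cache_gb / link_gb_per_s * 1000

def should_disaggregate(kv_cache_gb, link_gb_per_s, estimated_savings_ms):
    """Disaggregate only if the specialization win outweighs the added KV transfer cost."""
    return kv_transfer_ms(kv_cache_gb, link_gb_per_s) < estimated_savings_ms

# 1.5 GB cache over NVLink (~900 GB/s) vs RoCE (~25 GB/s), assuming ~50 ms saved by specialization
print(f"NVLink: {kv_transfer_ms(1.5, 900):.1f} ms")              # ~1.7 ms
print(f"RoCE:   {kv_transfer_ms(1.5, 25):.0f} ms")               # ~60 ms
print(should_disaggregate(1.5, 900, estimated_savings_ms=50))    # True
print(should_disaggregate(1.5, 25, estimated_savings_ms=50))     # False
```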

Source: Transfer latency analysis from NIXL documentation [github.com] and "Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving" (Qin et al., 2024) [arxiv.org]

Chapter 8

How NVIDIA Dynamo implements all of this

Dynamo isn't just disaggregation. It's a complete system with 5 components working together:

Dynamo's 5 Components
Frontend: OpenAI-compatible API (Rust)
Router: KV-aware radix tree
Workers: SGLang / TensorRT-LLM / vLLM
NIXL: KV transfer engine
KVBM: 4-tier memory management
Planner: SLA-based autoscaling

1. Router: KV-aware request routing

Instead of random load balancing, the router maintains a global radix tree of cached token prefixes across all GPUs. When a request arrives, it finds the GPU with the most overlapping cached tokens and routes there — saving prefill time by reusing existing KV cache.

Result: 3x faster TTFT, 2x lower average latency.
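A stripped-down sketch of the routing idea (plain prefix matching over per-worker lists instead of a global radix tree, with hypothetical worker names):

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(request_tokens, worker_caches):
    """Pick the worker whose cached sequences overlap the request prefix the most.

    worker_caches: {worker_id: list of cached token sequences}. A production router
    keeps these in a global radix tree instead of scanning flat lists.
    """
    best_worker, best_overlap = None, -1
    for worker, cached in worker_caches.items():
        overlap = max((shared_prefix_len(request_tokens, seq) for seq in cached), default=0)
        if overlap > best_overlap:
            best_worker, best_overlap = worker, overlap
    return best_worker, best_overlap

caches = {
    "gpu-0": [[1, 2, 3, 4, 5, 6]],           # already holds a long matching prefix
    "gpu-1": [[9, 9, 9]],
}
print(route([1, 2, 3, 4, 7, 8], caches))     # ('gpu-0', 4): 4 prompt tokens skip prefill
```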

2. NIXL: Optimized KV cache transfer

Automatically selects the fastest transport (NVLink > InfiniBand > RoCE > TCP) and performs the KV cache transfer with zero-copy operations where possible.

3. KVBM: 4-tier memory hierarchy

When GPU memory fills up, KVBM automatically moves cold KV cache to CPU RAM → SSD → remote storage, and fetches it back when needed. This means you can serve 10x more concurrent users than GPU memory alone would allow.
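The mechanism is essentially least-recently-used offload across tiers. Here is a toy two-tier version of the idea (the class and its methods are invented for illustration; the real KVBM manages four tiers and block-granular KV data):

```python
from collections import OrderedDict

class TieredKVStore:
    """Toy 2-tier sketch: hot KV blocks stay in 'GPU' memory, LRU blocks spill to a
    slower 'CPU' tier. (The real KVBM has four tiers: GPU -> CPU RAM -> SSD -> remote.)"""

    def __init__(self, gpu_capacity_blocks):
        self.gpu = OrderedDict()              # block_id -> data, ordered by recency
        self.cpu = {}
        self.capacity = gpu_capacity_blocks

    def put(self, block_id, data):
        self.gpu[block_id] = data
        self.gpu.move_to_end(block_id)        # mark as most recently used
        while len(self.gpu) > self.capacity:
            cold_id, cold = self.gpu.popitem(last=False)   # evict the LRU block
            self.cpu[cold_id] = cold                       # offload, don't discard

    def get(self, block_id):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        data = self.cpu.pop(block_id)         # fetch back from the slower tier
        self.put(block_id, data)              # promote to GPU (may evict another block)
        return data

store = TieredKVStore(gpu_capacity_blocks=2)
for i in range(4):
    store.put(i, f"kv-block-{i}")
print(list(store.gpu), list(store.cpu))       # [2, 3] [0, 1]
print(store.get(0), list(store.cpu))          # block 0 back on GPU, block 2 offloaded
```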

4. Planner: Dynamic GPU allocation

Monitors TTFT (time to first token) and ITL (inter-token latency) against SLA targets. If TTFT is too slow, it allocates more GPUs to prefill. If ITL is too slow, it adds decode GPUs. If load drops, it scales down.
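A toy version of that control loop, with invented thresholds and a simple plus-or-minus-one policy standing in for the Planner's real profiling-based decisions:

```python
def plan(ttft_ms, itl_ms, ttft_slo_ms=200, itl_slo_ms=30,
         prefill_gpus=2, decode_gpus=2, min_gpus=1):
    """Compare observed TTFT/ITL against SLO targets and adjust each pool independently."""
    if ttft_ms > ttft_slo_ms:
        prefill_gpus += 1                 # prompts are queueing: add prefill capacity
    elif ttft_ms < 0.5 * ttft_slo_ms and prefill_gpus > min_gpus:
        prefill_gpus -= 1                 # plenty of headroom: scale down
    if itl_ms > itl_slo_ms:
        decode_gpus += 1                  # token stream too slow: add decode capacity
    elif itl_ms < 0.5 * itl_slo_ms and decode_gpus > min_gpus:
        decode_gpus -= 1
    return prefill_gpus, decode_gpus

print(plan(ttft_ms=350, itl_ms=25))   # (3, 2): prefill is the bottleneck
print(plan(ttft_ms=80,  itl_ms=45))   # (1, 3): decode is the bottleneck
```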

5. Backend agnostic

All of this works with SGLang, TensorRT-LLM, or vLLM. You pick the engine; Dynamo provides the orchestration.

Source: NVIDIA Dynamo architecture documentation [github.com] and NVIDIA Dynamo product page [nvidia.com]

Summary

The complete picture

 | Prefill | Decode
What it does | Processes entire prompt at once | Generates tokens one at a time
Bottleneck | Compute (FLOPS) | Memory bandwidth (GB/s)
GPU utilization | 60-90% compute | 5-15% compute, 80%+ bandwidth
Parallelism | All tokens in parallel | Strictly sequential
Duration | One-time (milliseconds) | Ongoing (seconds)
Output | KV cache + first token | All subsequent tokens
Optimization | More FLOPS, higher tensor parallelism | Higher batch size, more bandwidth
Metric | TTFT (time to first token) | ITL (inter-token latency)

Bottom line: Prefill and decode are fundamentally different workloads that happen to be part of the same inference call. Running them on the same GPU is a compromise. Disaggregation removes that compromise, giving each phase dedicated hardware optimized for its specific bottleneck. The cost is KV cache transfer, which modern interconnects (NVLink, InfiniBand) make negligible. This is why disaggregated serving — and NVIDIA Dynamo — represent the future of LLM inference at scale.

References

Sources and further reading

Every claim in this page can be verified from the sources cited inline at the end of each chapter: the Transformer, vLLM/PagedAttention, Splitwise, DistServe, Sarathi, and Mooncake papers, plus the NVIDIA Dynamo and NIXL documentation.

Built as an educational resource. All data sourced from published research and official documentation.