A visual, simulation-driven explanation of the two phases of LLM inference, why they have fundamentally different hardware needs, and how separating them unlocks massive performance gains.
When you type "Explain quantum computing" into ChatGPT, it feels instant. But under the hood, the GPU does two fundamentally different jobs, one after the other:
That's it. Every single LLM inference call — whether it's GPT-4, Llama, DeepSeek, or Claude — goes through these exact two phases. They're not optional. But they're wildly different in how they use the GPU.
Think of it like reading a book vs writing a book. Reading (prefill) = you consume the entire input at once, building understanding. Writing (decode) = you produce words one at a time, each depending on everything before it. You can read fast because it's all there. Writing is slow because each word depends on the last.
Source: This two-phase architecture is described in the original Transformer paper "Attention Is All You Need" (Vaswani et al., 2017) [arxiv.org] and is the standard in all modern autoregressive LLMs.
During prefill, the model takes all your input tokens and processes them through every transformer layer simultaneously in a single forward pass. Here's what that means step by step:
["Explain", " quantum", " computing"]. For a typical prompt, this might be 100-4000 tokens.[seq_len, hidden_dim].softmax(QK^T / sqrt(d)) * V. For N tokens, this is O(N²) — quadratic in sequence length.Key insight: Prefill processes all N input tokens in parallel. The GPU's thousands of cores are fully utilized doing massive matrix multiplications. This makes prefill compute-bound — the bottleneck is raw FLOPS (floating-point operations per second), not memory.
For each token, at each layer, the model stores a Key vector and a Value vector. The total KV cache size is:

KV cache bytes = 2 (K and V) × layers × kv_heads × head_dim × seq_len × bytes_per_param
For Llama-3-70B with 4096 tokens in FP16:
| Parameter | Value |
|---|---|
| Layers | 80 |
| KV Heads (GQA) | 8 |
| Head dimension | 128 |
| Sequence length | 4,096 |
| Bytes per param (FP16) | 2 |
| Total KV cache | 2 × 80 × 8 × 128 × 4096 × 2 = ~1.34 GB per request |
With 50 concurrent users, that's 67 GB of KV cache alone — nearly an entire H100's memory just for cache, before the model weights even load.
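The table's arithmetic is easy to reproduce. A small sketch, using the same parameters and the decimal-GB convention the table appears to use:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_param=2):
    """KV cache for one request: 2 (K and V) per layer, per KV head,
    per head dimension, per token, times the width of the datatype."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_param

# Llama-3-70B with GQA, 4096-token prompt, FP16 (the numbers from the table above)
per_request = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=4096)
print(f"{per_request / 1e9:.2f} GB per request")        # ~1.34 GB
print(f"{50 * per_request / 1e9:.0f} GB for 50 users")   # ~67 GB
```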
Source: KV cache sizing analysis from "Efficient Memory Management for Large Language Model Serving with PagedAttention" (Kwon et al., vLLM paper, 2023) [arxiv.org]
Watch all input tokens get processed in parallel, building the KV cache:
After prefill, decode takes over. It generates the response one token at a time. Each step:

1. Feed only the single newest token through the model.
2. Read the entire KV cache (every K/V pair from the prompt and all tokens generated so far) to compute attention.
3. Append the new token's K and V vectors to the cache.
4. Sample the next token, then repeat until the response is done.
Key insight: Each decode step does very little computation (just 1 token through the model) but reads a huge amount of data (the entire KV cache + all model weights). This makes decode memory-bandwidth-bound — the bottleneck is how fast you can read from GPU memory (HBM), not raw FLOPS.
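Continuing the toy NumPy sketch from the prefill section (same caveats: single head, random numbers standing in for real embeddings), a decode step looks like this. The new token contributes one tiny matmul, while the entire cache has to be read back and then grows by one entry:

```python
import numpy as np

def decode_step(x_new, K_cache, V_cache, Wq, Wk, Wv):
    """Toy single-head decode step: one new token attends over the whole cache."""
    q = x_new @ Wq                                   # [1, d]: a tiny amount of compute
    k, v = x_new @ Wk, x_new @ Wv
    K = np.concatenate([K_cache, k])                 # cache grows by one entry...
    V = np.concatenate([V_cache, v])                 # ...and ALL of it must be read back
    scores = q @ K.T / np.sqrt(x_new.shape[-1])      # [1, t]: the last token sees everything
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V, K, V

rng = np.random.default_rng(0)
d = 64
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
K_cache = rng.standard_normal((8, d))                # pretend prefill already cached 8 tokens
V_cache = rng.standard_normal((8, d))
for _ in range(4):                                   # strictly sequential: no parallelism here
    x_new = rng.standard_normal((1, d))              # stand-in for the last generated token
    out, K_cache, V_cache = decode_step(x_new, K_cache, V_cache, Wq, Wk, Wv)
print(K_cache.shape)                                 # (12, 64): the cache grew by 4
```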
Imagine a librarian writing a report. For each new sentence (1 token), they must walk to the library (KV cache), read every single book they've referenced so far (all cached K/V pairs), walk back to their desk, write one sentence, then repeat. The walking (memory reads) takes 10x longer than the writing (compute). The library keeps growing with each sentence.
Arithmetic intensity = FLOPs / Bytes read from memory. It measures how much compute you do per byte of data loaded.
Prefill: high arithmetic intensity. Processing N tokens means N×N attention matrices. Lots of compute per memory read. GPU cores stay busy. Compute utilization: 60-80%.
Decode: low arithmetic intensity. Processing 1 token means reading the full KV cache + weights but doing minimal math. GPU cores sit idle waiting for data. Compute utilization: 5-15%.
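A back-of-the-envelope sketch makes the gap concrete. It only counts the attention matmuls and the K/V reads for a single layer (illustrative numbers, not a full roofline model), but the ratio tells the story: in this simplified model, intensity is roughly the number of new tokens being processed.

```python
def attention_intensity(new_tokens, cached_tokens, hidden=8192, bytes_per=2):
    """Rough FLOPs-per-byte of one attention layer (illustrative, not a full roofline)."""
    total = new_tokens + cached_tokens
    flops = 2 * new_tokens * total * hidden * 2      # QK^T plus attention-times-V matmuls
    bytes_read = total * hidden * 2 * bytes_per      # K and V read from HBM
    return flops / bytes_read

print(f"prefill (4096 new, 0 cached): {attention_intensity(4096, 0):,.0f} FLOPs/byte")
print(f"decode  (1 new, 4096 cached): {attention_intensity(1, 4096):,.1f} FLOPs/byte")
```

For reference, an H100 offers on the order of 1,000 dense FP16 TFLOPS against roughly 3.35 TB/s of HBM bandwidth, so anything much below about 300 FLOPs per byte leaves the compute units waiting on memory. Decode sits far below that line; prefill sits far above it.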
Source: Analysis from "Splitwise: Efficient generative LLM inference using phase splitting" (Patel et al., 2024, Microsoft Research) [arxiv.org]. Also see "DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving" (Zhong et al., 2024) [arxiv.org]
Watch tokens get generated one at a time. Notice how GPU compute is barely used while memory bandwidth maxes out:
Now you understand: prefill wants max compute (FLOPS), decode wants max memory bandwidth. When you run both on the same GPU, they fight:
- Prefill blocks decode. A long prefill (e.g., an 8K-token prompt) takes hundreds of milliseconds. During that time, all decode iterations for other users are paused. Users see "stuttering": tokens flowing, then freezing, then flowing again.
- Decode wastes compute. Decode uses ~10% of GPU compute. The other 90% of FLOPS are wasted, but you can't use them for more prefill because decode is occupying the GPU's memory bandwidth.
- KV cache crowds out prefill. Each active decode holds its KV cache in GPU memory. As more users join, cache eats the memory needed for prefill's compute buffers. You hit OOM faster.
- Conflicting scaling strategies. Prefill benefits from high tensor parallelism (more GPUs = faster). Decode benefits from high batch size (more users per GPU = better utilization). You can't optimize for both simultaneously.
It's like forcing a sprinter and a marathon runner to share the same pair of legs. The sprinter (prefill) needs explosive power in short bursts. The marathon runner (decode) needs steady endurance over a long distance. They have completely different training regimens. Making them share the same body means neither performs at their peak.
Source: Interference between prefill and decode is analyzed in "Sarathi: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills" (Agrawal et al., 2023) [arxiv.org] and NVIDIA Dynamo's design documentation [github.com]
The insight is simple: don't run prefill and decode on the same GPU. Give each its own dedicated hardware, optimized for its specific needs.
Source: Performance numbers from NVIDIA Dynamo benchmarks [nvidia.com], Splitwise paper [arxiv.org], and DistServe [arxiv.org]. Single-node ~30% improvement per GPU comes from NVIDIA Dynamo design docs [github.com]
Now the sprinter and marathon runner each get their own body. The sprinter (prefill GPUs) does explosive bursts, finishes, and starts the next sprint immediately. The marathon runner (decode GPUs) maintains a steady pace with many users simultaneously. A relay handoff (NIXL KV transfer) connects them. Both perform at their peak.
Watch 8 requests processed under both strategies. Pay attention to total completion time and GPU utilization.
AGGREGATED — 4 GPUs each doing both prefill + decode:
DISAGGREGATED — 2 prefill GPUs + 2 decode GPUs:
Doesn't transferring the KV cache from the prefill GPU to the decode GPU add overhead? Yes, and this is the key tradeoff: disaggregation adds a KV transfer step that doesn't exist in aggregated serving. So when is it worth it? The table below shows approximate transfer times for typical cache sizes over common interconnects:
| KV cache size | NVLink (900 GB/s) | InfiniBand (50 GB/s) | RoCE (25 GB/s) |
|---|---|---|---|
| 500 MB (short prompt) | ~0.6 ms | ~10 ms | ~20 ms |
| 1.5 GB (medium prompt) | ~1.7 ms | ~30 ms | ~60 ms |
| 5 GB (long prompt) | ~5.6 ms | ~100 ms | ~200 ms |
The rule: Disaggregation is beneficial when the time saved by specialization (better utilization of both prefill and decode GPUs) exceeds the time added by KV transfer. With NVLink (same node), the transfer cost is negligible (~1ms). Cross-node via InfiniBand, it's still worthwhile for most workloads. This is exactly what NVIDIA's NIXL library optimizes.
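The table rows are just size divided by link bandwidth. A quick sketch that reproduces the 1.5 GB row and applies the rule above (the `prefill_time_saved_ms` figure is a made-up placeholder, not a measured number):

```python
def transfer_ms(kv_cache_gb, link_gb_per_s):
    """Time to move a KV cache over an interconnect, ignoring protocol overhead."""
    return kv_cache_gb / link_gb_per_s * 1000

links = {"NVLink": 900, "InfiniBand": 50, "RoCE": 25}       # GB/s, as in the table
for name, bw in links.items():
    print(f"1.5 GB over {name}: {transfer_ms(1.5, bw):.1f} ms")

# The rule: disaggregate only if the transfer costs less than specialization saves.
prefill_time_saved_ms = 40          # placeholder: gain from dedicated prefill GPUs
worth_it = transfer_ms(1.5, links["InfiniBand"]) < prefill_time_saved_ms
print("disaggregate?", worth_it)    # True: 30 ms transfer < 40 ms saved
```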
When NOT to disaggregate: Very short prompts (<100 tokens) with very short responses (<50 tokens) on slow networks. The transfer overhead may exceed the benefit. NVIDIA Dynamo's Planner can dynamically decide whether to disaggregate or keep aggregated based on current conditions.
Source: Transfer latency analysis from NIXL documentation [github.com] and "Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving" (Qin et al., 2024) [arxiv.org]
Dynamo isn't just disaggregation. It's a complete system with 5 components working together:
Instead of random load balancing, the router maintains a global radix tree of cached token prefixes across all GPUs. When a request arrives, it finds the GPU with the most overlapping cached tokens and routes there — saving prefill time by reusing existing KV cache.
Result: 3x faster TTFT, 2x lower average latency.
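Here is a toy version of the idea, with a plain longest-common-prefix scan standing in for the global radix tree, and made-up worker names and token IDs:

```python
def common_prefix_len(a, b):
    """Number of leading tokens two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(request_tokens, worker_caches):
    """Pick the worker whose cached sequences overlap the request the most."""
    def best_overlap(cached_seqs):
        return max((common_prefix_len(request_tokens, s) for s in cached_seqs), default=0)
    return max(worker_caches, key=lambda w: best_overlap(worker_caches[w]))

worker_caches = {                                   # token IDs already cached on each GPU
    "gpu-0": [[101, 7, 42, 9, 13]],                 # a different conversation
    "gpu-1": [[101, 7, 55, 3, 8, 21, 30]],          # shares a long prefix with the request
}
request = [101, 7, 55, 3, 8, 99]                    # new turn of the same conversation
print(route(request, worker_caches))                # gpu-1: only the last tokens need prefill
```

A production router weighs more than prefix overlap (load, cache capacity, and so on), but the core decision is the same: send the request where the most prefill work is already done.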
NIXL handles the handoff itself: it automatically selects the fastest available transport (NVLink > InfiniBand > RoCE > TCP) and performs the KV cache transfer with zero-copy operations where possible.
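Conceptually, the selection is just a priority list. A trivial sketch of that idea (not NIXL's actual API):

```python
def pick_transport(available):
    """Prefer the fastest path both endpoints support (priority order from the text above)."""
    for transport in ("nvlink", "infiniband", "roce", "tcp"):
        if transport in available:
            return transport
    raise RuntimeError("no usable transport between prefill and decode workers")

print(pick_transport({"roce", "tcp"}))           # roce
print(pick_transport({"nvlink", "infiniband"}))  # nvlink
```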
When GPU memory fills up, KVBM automatically moves cold KV cache to CPU RAM → SSD → remote storage, and fetches it back when needed. This means you can serve 10x more concurrent users than GPU memory alone would allow.
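A crude sketch of the tiering idea, using an LRU policy and a hypothetical `TieredKVCache` class (not KVBM's actual interface): cold blocks get demoted a level when a tier fills up, and a hit anywhere promotes the block back to GPU memory.

```python
from collections import OrderedDict

class TieredKVCache:
    """Crude sketch of memory tiering: demote least-recently-used blocks downward."""
    def __init__(self, gpu_blocks, cpu_blocks):
        self.tiers = {"gpu": OrderedDict(), "cpu": OrderedDict(), "ssd": OrderedDict()}
        self.capacity = {"gpu": gpu_blocks, "cpu": cpu_blocks, "ssd": float("inf")}

    def put(self, block_id, data, tier="gpu"):
        self.tiers[tier][block_id] = data
        self.tiers[tier].move_to_end(block_id)           # mark as most recently used
        if len(self.tiers[tier]) > self.capacity[tier]:  # over budget: demote the coldest block
            cold_id, cold_data = self.tiers[tier].popitem(last=False)
            self.put(cold_id, cold_data, {"gpu": "cpu", "cpu": "ssd"}[tier])

    def get(self, block_id):
        for tier in ("gpu", "cpu", "ssd"):               # fetch and promote back to GPU
            if block_id in self.tiers[tier]:
                data = self.tiers[tier].pop(block_id)
                self.put(block_id, data, "gpu")
                return data
        return None

cache = TieredKVCache(gpu_blocks=2, cpu_blocks=2)
for i in range(5):
    cache.put(f"req-{i}", data=b"kv-block")              # old requests spill to CPU, then SSD
print({tier: list(blocks) for tier, blocks in cache.tiers.items()})
```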
The Planner monitors TTFT (time to first token) and ITL (inter-token latency) against SLA targets. If TTFT is too slow, it allocates more GPUs to prefill. If ITL is too slow, it adds decode GPUs. If load drops, it scales down.
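The decision logic can be sketched in a few lines. The SLO numbers below are made-up placeholders and the real Planner presumably weighs richer signals, but the shape is the same:

```python
def plan(ttft_ms, itl_ms, ttft_slo_ms=200, itl_slo_ms=50, headroom=0.5):
    """Sketch of SLA-driven scaling: grow the pool missing its target,
    shrink when both metrics have comfortable headroom."""
    if ttft_ms > ttft_slo_ms:
        return "add prefill GPU"           # prompts are queueing: first token too slow
    if itl_ms > itl_slo_ms:
        return "add decode GPU"            # the token stream is stuttering
    if ttft_ms < ttft_slo_ms * headroom and itl_ms < itl_slo_ms * headroom:
        return "scale down"                # load dropped: release hardware
    return "hold"

print(plan(ttft_ms=350, itl_ms=30))        # add prefill GPU
print(plan(ttft_ms=120, itl_ms=80))        # add decode GPU
print(plan(ttft_ms=60,  itl_ms=15))        # scale down
```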
All of this works with SGLang, TensorRT-LLM, or vLLM. You pick the engine; Dynamo provides the orchestration.
Source: NVIDIA Dynamo architecture documentation [github.com] and NVIDIA Dynamo product page [nvidia.com]
| | Prefill | Decode |
|---|---|---|
| What it does | Processes entire prompt at once | Generates tokens one at a time |
| Bottleneck | Compute (FLOPS) | Memory bandwidth (GB/s) |
| GPU utilization | 60-90% compute | 5-15% compute, 80%+ bandwidth |
| Parallelism | All tokens in parallel | Strictly sequential |
| Duration | One-time (milliseconds) | Ongoing (seconds) |
| Output | KV cache + first token | All subsequent tokens |
| Optimization | More FLOPS, higher TP | Higher batch size, more bandwidth |
| Metric | TTFT (time to first token) | ITL (inter-token latency) |
Bottom line: Prefill and decode are fundamentally different workloads that happen to be part of the same inference call. Running them on the same GPU is a compromise. Disaggregation removes that compromise, giving each phase dedicated hardware optimized for its specific bottleneck. The cost is KV cache transfer, which modern interconnects (NVLink, InfiniBand) make negligible. This is why disaggregated serving — and NVIDIA Dynamo — represent the future of LLM inference at scale.
Every claim on this page can be verified from the sources cited inline above.
Built as an educational resource. All data sourced from published research and official documentation.