KV Cache Calculator

Calculate KV cache memory for LLM inference.


Total KV Cache Memory: 2.00 GB (512.00 KB per token)

Head Dimension: 128

Memory Saved (GQA): 0.0%

Cache Details

Cache per Token: 512.00 KB

Growth per Token: 512.00 KB

Query/KV Head Ratio: 1:1

Est. Max Batch (24 GB GPU): 4

Memory by Sequence Length

512 tokens: 256 MB

1,024 tokens: 512 MB

2,048 tokens: 1.00 GB

4,096 tokens: 2.00 GB

8,192 tokens: 4.00 GB

16,384 tokens: 8.00 GB

32,768 tokens: 16.00 GB

Performance Metrics

Bandwidth Required: 2000.0 GB/s

MHA Equivalent Cache: 2.00 GB

Tip: Using fewer KV heads (GQA/MQA) can significantly reduce KV cache memory while largely maintaining quality. LLaMA-2-70B uses 8 KV heads with 64 query heads.

What the KV Cache Calculator Does

The KV Cache Calculator estimates how much GPU memory the key-value cache consumes during autoregressive transformer inference. When a large language model generates text one token at a time, it must keep the key and value projections of every previous token so that the attention mechanism does not recompute them on each step. Those cached tensors are the KV cache, and on long-context workloads they can take up more memory than the model weights themselves.

This calculator takes the architectural parameters that actually drive cache size: the number of transformer layers, the hidden size, the number of query heads, the number of key-value heads (which captures Grouped-Query Attention and Multi-Query Attention), the sequence length, the batch size, and the numeric precision. From these it computes the cache footprint per token, the total cache for the full sequence, the growth rate as generation continues, the memory saved by using fewer KV heads, the bandwidth required to stream the cache, and an estimated maximum batch size for a 24 GB GPU.

Understanding KV cache memory is essential for anyone deploying LLMs in production. The cache grows linearly with both context length and batch size, so a configuration that fits comfortably at 2,048 tokens can overflow a GPU at 32,768 tokens. Engineers serving chat assistants, retrieval-augmented generation pipelines, and high-throughput inference endpoints use these numbers to choose the right GPU, set safe concurrency limits, and decide whether techniques like GQA, MQA, or KV-cache quantization are worth adopting. The calculator turns abstract attention math into concrete gigabytes so you can size hardware before you rent it.

Because the cache is read on every single decoding step, its size also affects latency. Each new token requires the model to read back the entire accumulated cache from high-bandwidth memory, so a larger cache means more bytes moved per token and a higher memory-bandwidth bill. That is why the calculator also reports a bandwidth figure: KV cache is not only a capacity problem but a throughput problem, and both matter when you are optimizing tokens-per-second on real serving hardware.

KV Cache Memory Formula

The core formula multiplies a per-layer, per-token cost across every layer and every token in the sequence and batch. The factor of two accounts for storing both the key tensor and the value tensor. The head dimension is derived from the hidden size divided by the number of query heads, and the number of KV heads (not query heads) sets the cache width, which is exactly how Grouped-Query Attention saves memory.

Per token, the cache for one layer is 2 × KV heads × head dimension × bytes. Multiply by the layer count to get the per-token cache for the whole model, then by sequence length and batch size for the full footprint. With FP16 each element is 2 bytes; INT8 and FP8 cut that to 1 byte, halving the cache.

Total KV Cache Memory

Total = 2 × L × kvH × (d / h) × bytes × seqLen × batch

Where:

L = Number of transformer layers
kvH = Number of key-value heads (sets cache width; lower with GQA/MQA)
d = Hidden size (model dimension)
h = Number of query heads (head dimension = d / h)
bytes = Bytes per element: 4 (FP32), 2 (FP16/BF16), 1 (INT8/FP8)
seqLen = Sequence length in tokens
batch = Batch size (concurrent sequences)
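The formula above can be sketched in Python as follows. The function name `kv_cache_bytes` and its parameter names are illustrative and not taken from the calculator's source, and the sketch assumes the hidden size divides evenly by the query head count.

```python
def kv_cache_bytes(layers, kv_heads, hidden_size, query_heads,
                   bytes_per_element, seq_len, batch=1):
    """Total KV cache size in bytes: 2 x L x kvH x (d / h) x bytes x seqLen x batch."""
    # Head dimension is derived from the query head count, not the KV head count.
    head_dim = hidden_size // query_heads
    # The factor of 2 covers the key tensor and the value tensor.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_element
    return per_token * seq_len * batch


# 7B-style MHA configuration in FP16: 32 layers, hidden size 4096, 32/32 heads.
total = kv_cache_bytes(32, 32, 4096, 32, 2, seq_len=4096, batch=1)
print(total, "bytes")  # 2147483648 bytes
```

Keeping the per-token term separate from the sequence and batch multipliers mirrors how the calculator reports both a per-token figure and a total.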

GQA, MQA, and Memory Savings

Multi-Head Attention (MHA) gives every query head its own key and value heads, so the KV cache scales with the full head count. Grouped-Query Attention (GQA) shares each KV head across a group of query heads, and Multi-Query Attention (MQA) takes this to the extreme with a single KV head for all query heads. Because the cache width depends on KV heads rather than query heads, shrinking the KV head count shrinks the cache proportionally without changing the number of query heads doing the attention work.

The calculator reports the memory saved relative to an MHA baseline that uses all query heads as KV heads. The savings percentage follows a simple ratio: with h query heads and kvH KV heads, the cache is reduced to kvH/h of the MHA size, so the saving is 1 − (kvH / h). LLaMA-2-70B, for example, uses 64 query heads with only 8 KV heads, an 8:1 ratio that cuts the KV cache by 87.5% compared to full MHA.

Attention Type	Query : KV Ratio	Cache vs MHA	Memory Saved
MHA	1:1	100%	0%
GQA (4:1)	4:1	25%	75%
GQA (8:1)	8:1	12.5%	87.5%
MQA	h:1	1/h	1 − 1/h (about 96.9% for h = 32)

Lower KV head counts can trade a small amount of model quality for large memory and bandwidth savings, which is why many recent open-weight models ship with GQA. The calculator lets you set query heads and KV heads independently so you can see the exact saving for your configuration before committing to it.
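As a rough illustration of the savings ratio, the snippet below computes 1 − (kvH / h) for a few head configurations. The helper name `gqa_memory_saved` is hypothetical, and the divisibility check is an added assumption (query heads are normally grouped evenly across KV heads).

```python
def gqa_memory_saved(query_heads, kv_heads):
    """Fraction of KV cache saved versus an MHA baseline where kv_heads == query_heads."""
    if kv_heads <= 0 or query_heads % kv_heads != 0:
        raise ValueError("query_heads must be a positive multiple of kv_heads")
    return 1 - kv_heads / query_heads


# (label, query heads, KV heads) chosen to mirror the ratios discussed in the text.
configs = [("MHA", 32, 32), ("GQA 4:1", 32, 8), ("GQA 8:1", 64, 8), ("MQA", 32, 1)]
for label, h, kv in configs:
    print(f"{label}: cache is {kv / h:.4f} of MHA, saving {gqa_memory_saved(h, kv):.4f}")
```

Note that the MQA saving depends on the query head count, so it is not a fixed percentage.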

Precision, Bandwidth, and Throughput

Numeric precision is the second big lever on KV cache size. FP32 stores 4 bytes per element, FP16 and BF16 store 2 bytes, and INT8 or FP8 store just 1 byte. Switching from FP16 to an 8-bit format halves the cache outright, which is why KV-cache quantization has become a popular way to extend context length on memory-constrained GPUs. The calculator applies the bytes-per-element factor directly, so you can compare formats instantly.
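A minimal sketch of the bytes-per-element factor, using the 70B-style GQA shape (80 layers, 8 KV heads, head dimension 128). The dictionary and function names are illustrative, and the sketch ignores any scale or zero-point metadata that a real quantized cache may store alongside the 8-bit values.

```python
# Bytes per cached element for each storage format listed in the text.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "fp8": 1}


def per_token_cache_bytes(layers, kv_heads, head_dim, precision):
    """Per-token KV cache in bytes for the whole model at the given precision."""
    return 2 * layers * kv_heads * head_dim * BYTES_PER_ELEMENT[precision]


for precision in ("fp32", "fp16", "int8"):
    kib = per_token_cache_bytes(80, 8, 128, precision) // 1024
    print(f"{precision}: {kib} KiB per token")
# fp32: 640 KiB per token
# fp16: 320 KiB per token
# int8: 160 KiB per token
```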

Capacity is only half the story. During generation the model reads back the entire cache from GPU memory on every decoding step, so the cache size sets a hard floor on memory traffic. The calculator estimates the bandwidth required to stream the full cache within a 1-millisecond-per-token target, computed as total cache bytes ÷ 0.001 seconds, then converted to gigabytes per second. If the required bandwidth exceeds your GPU's peak HBM bandwidth, decoding becomes memory-bound and your tokens-per-second will plateau no matter how fast the compute cores are.
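The bandwidth estimate described above can be sketched as follows. The function name is hypothetical, the conversion assumes binary gigabytes (2^30 bytes), which matches the 2000.0 GB/s the calculator shows for a 2.00 GB cache, and the peak bandwidth value is a placeholder rather than a specification for any real GPU.

```python
def required_bandwidth_gb_per_s(cache_bytes, seconds_per_token=0.001):
    """Bandwidth needed to stream the whole cache once per token, in binary GB/s."""
    return cache_bytes / seconds_per_token / 2**30


cache_bytes = 2 * 2**30  # the 2.00 GB cache from the 7B MHA configuration
needed = required_bandwidth_gb_per_s(cache_bytes)

peak_gb_per_s = 1000.0  # placeholder: substitute your GPU's measured or published figure
memory_bound = needed > peak_gb_per_s
print(f"needed ~{needed:.1f} GB/s, memory-bound at 1 ms/token: {memory_bound}")
```

Relaxing the per-token latency target lowers the required bandwidth proportionally, which is one way to read the figure when 1 ms per token is not your actual goal.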

This bandwidth-bound behavior is the core reason long-context inference feels slow even on powerful accelerators. Reducing KV heads with GQA and shrinking elements with quantization attack both problems at once: a smaller cache occupies less memory and moves fewer bytes per token, improving capacity and latency together. When you plan a deployment, read the bandwidth figure alongside the total memory figure rather than in isolation.

How to Use the KV Cache Calculator

Enter the architecture of the model you plan to serve. Start with the number of layers and hidden size from the model card, then add the query heads and KV heads; if the model uses standard MHA, set these equal, and if it uses GQA or MQA, set KV heads to the smaller published value. Next set your target sequence length (the full context you intend to support) and the batch size for concurrent requests. Finally pick the precision you will store the cache in.

The result card shows the total KV cache memory in gigabytes or megabytes, along with the cache per token in kilobytes. Below it you will find the head dimension, the GQA memory-saved percentage, the per-token growth rate, the query-to-KV head ratio, and an estimated maximum batch size on a 24 GB GPU after reserving room for FP16 model weights. The memory-by-sequence-length table shows how the footprint scales from 512 up to 32,768 tokens so you can spot exactly where a configuration stops fitting.

1. Look up your model's layer count, hidden size, and head counts.
2. Enter your real serving context length and concurrency in the batch field.
3. Choose FP16 for a typical baseline, or INT8/FP8 to model quantized caching.
4. Read the total memory and bandwidth figures to size your GPU.
5. Lower the KV head count to see the savings from GQA or MQA.

Use the per-token growth number to reason about streaming workloads: every token you generate adds that fixed amount to the cache, so you can predict when a long conversation will exhaust available memory and plan eviction or paged-attention strategies accordingly.
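One way to apply the per-token growth number is to estimate how many tokens fit in the memory left for the cache. This is a sketch under simple assumptions: `tokens_until_full` is a hypothetical helper, the free-memory figure is an example value, and fragmentation and activation memory are ignored.

```python
def tokens_until_full(free_bytes, per_token_bytes, batch=1):
    """Number of tokens per sequence that fit before the free memory is exhausted."""
    return free_bytes // (per_token_bytes * batch)


free_bytes = 9 * 2**30         # example: 9 GB left for the cache after weights
per_token_bytes = 512 * 1024   # 512 KB per token (7B MHA configuration in FP16)

print(tokens_until_full(free_bytes, per_token_bytes))           # 18432
print(tokens_until_full(free_bytes, per_token_bytes, batch=4))  # 4608
```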

Interpreting Your Results

A healthy deployment leaves comfortable headroom between the total KV cache plus model weights and your GPU's physical memory. If the estimated maximum batch size drops to 1, you are already constrained by memory capacity and should either reduce the context length, adopt GQA or KV-cache quantization, or move to a larger GPU. The memory saved percentage tells you how much your chosen KV head count is already helping relative to full MHA.

Watch the per-token and growth figures when serving chat or agentic workloads, where context accumulates over many turns. Because the cache grows linearly with sequence length, doubling the context doubles the cache, and the memory-by-sequence-length table makes that scaling obvious. If the bandwidth-required figure is large relative to your GPU, expect decoding to be limited by memory traffic, and treat quantization and GQA as latency tools, not just capacity tools. Together these readouts let you right-size hardware and concurrency for any transformer inference setup.
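The linear scaling can be reproduced with a short sweep over sequence lengths. The function name is illustrative, and the per-token value of 512 KB assumes the 7B MHA configuration in FP16 with sizes reported in binary units.

```python
def cache_by_sequence_length(per_token_bytes, lengths, batch=1):
    """Map each sequence length to its total KV cache size in bytes."""
    return {n: per_token_bytes * n * batch for n in lengths}


lengths = [512 * 2**i for i in range(7)]  # 512, 1024, ..., 32768
table = cache_by_sequence_length(512 * 1024, lengths)

for n, size in table.items():
    print(f"{n:>6,} tokens: {size / 2**30:.2f} GB")
```

Each row is exactly double the previous one, which is the doubling behavior the paragraph above describes.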

Worked Examples

7B model with full Multi-Head Attention

Problem:

A 7B model has 32 layers, hidden size 4096, 32 query heads, and 32 KV heads (standard MHA). Estimate the KV cache for a 4,096-token sequence at batch size 1 in FP16.

Solution Steps:

1. Head dimension = d / h = 4096 / 32 = 128.
2. Cache per layer per token = 2 × KV heads × head dim × bytes = 2 × 32 × 128 × 2 = 16,384 bytes.
3. Cache per token = layers × per-layer = 32 × 16,384 = 524,288 bytes = 512 KB.
4. Total = per token × seqLen × batch = 524,288 × 4,096 × 1 = 2,147,483,648 bytes.

Result:

The KV cache is exactly 2,147,483,648 bytes for this 4,096-token sequence, which is 2.00 GB in binary units (1 GB = 2^30 bytes), or 512 KB per token.

70B model with 8:1 Grouped-Query Attention

Problem:

A 70B model has 80 layers, hidden size 8192, 64 query heads, but only 8 KV heads (GQA). Compute the KV cache for 4,096 tokens, batch 1, FP16, and the memory saved versus MHA.

Solution Steps:

1. Head dimension = 8192 / 64 = 128.
2. Cache per token = 2 × 80 × 8 × 128 × 2 = 327,680 bytes = 320 KB.
3. Total = 327,680 × 4,096 = 1,342,177,280 bytes ≈ 1.25 GB.
4. MHA baseline uses 64 KV heads, so memory saved = 1 − (8 / 64) = 87.5%.

Result:

GQA yields a 1.25 GB cache versus a 10 GB MHA equivalent, saving 87.5% of KV memory.

INT8 quantized cache to extend context

Problem:

Take the same 70B GQA model (80 layers, 8 KV heads, head dim 128) but store the cache in INT8 (1 byte) at 4,096 tokens, batch 1. How much memory does quantization save?

Solution Steps:

1. Cache per token in INT8 = 2 × 80 × 8 × 128 × 1 = 163,840 bytes = 160 KB.
2. Total = 163,840 × 4,096 = 671,088,640 bytes ≈ 0.625 GB.
3. Compare to the FP16 total of 1.25 GB from the previous example.

Result:

INT8 caching halves the footprint to about 0.625 GB, so the same cache memory budget holds twice as many tokens.

Tips & Best Practices

✓ Set query heads and KV heads equal to model standard MHA, and lower the KV heads to model GQA or MQA savings.
✓ Try INT8 or FP8 precision to see how KV-cache quantization roughly halves your memory footprint.
✓ Always size the cache at your maximum supported context length, not the average, to avoid out-of-memory surprises.
✓ Add headroom beyond the model weights and KV cache for activations, CUDA buffers, and fragmentation.
✓ Read the bandwidth-required figure alongside total memory; long-context decoding is often memory-bandwidth bound.
✓ Use the per-token growth number to predict when a long, multi-turn conversation will exhaust GPU memory.
✓ Larger batch sizes multiply the cache linearly, so cap concurrency based on the estimated maximum batch size.
✓ Match the head dimension (hidden size divided by query heads) to the model card to keep estimates accurate.

Frequently Asked Questions

The KV cache stores the key and value tensors computed for every previous token during autoregressive generation. Caching them lets the model attend to past tokens without recomputing their projections on each step, which dramatically speeds up decoding. The trade-off is memory: the cache grows with every token and every concurrent request.

Cache size scales linearly with sequence length, so doubling the context doubles the cache. It also scales with the number of layers, KV heads, head dimension, batch size, and bytes per element. For long-context or high-batch serving, the KV cache often exceeds the size of the model weights themselves.

Grouped-Query Attention shares each key-value head across several query heads, and Multi-Query Attention uses a single KV head for all query heads. Because the cache width depends on KV heads rather than query heads, fewer KV heads means a proportionally smaller cache. The saving equals 1 minus the KV-to-query head ratio, so an 8:1 GQA ratio cuts cache by 87.5%.

Yes, precision affects cache size directly. FP16 and BF16 use 2 bytes per element while INT8 and FP8 use 1 byte, so switching from a 16-bit to an 8-bit cache halves the memory exactly. KV-cache quantization is a common way to extend usable context length on memory-limited GPUs, usually with minimal quality loss.

During generation the model reads the entire KV cache from GPU memory on every decoding step, so cache size directly sets memory traffic. The calculator estimates the bandwidth needed to stream the full cache within a 1-millisecond-per-token target. If that exceeds your GPU's HBM bandwidth, decoding becomes memory-bound and throughput stops improving.

The calculator reserves memory for FP16 model weights (model size in billions times 2 GB), then allows 90% of whatever remains of the 24 GB for the KV cache. It divides that available memory by the per-token cache times the sequence length to find how many concurrent sequences fit. For a 7B model with a 2 GB cache per sequence, that is 24 − 14 = 10 GB remaining, 9 GB usable, and 4 sequences. It is a planning estimate; real serving needs additional headroom for activations and fragmentation.
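The batch estimate described above can be sketched as follows. The function and parameter names are hypothetical, and two details are assumptions rather than confirmed behavior of the calculator: that gigabytes are treated as 2^30 bytes and that the result is rounded down.

```python
def estimate_max_batch(model_billions, per_token_bytes, seq_len,
                       gpu_gb=24, usable_fraction=0.9):
    """Estimated number of concurrent sequences whose KV cache fits on the GPU."""
    weights_gb = model_billions * 2  # FP16 weights: about 2 GB per billion parameters
    remaining_gb = max(gpu_gb - weights_gb, 0)
    available_bytes = remaining_gb * usable_fraction * 2**30
    per_sequence_bytes = per_token_bytes * seq_len
    return int(available_bytes // per_sequence_bytes)  # assumption: round down


# 7B model, 512 KB per token, 4,096-token context on a 24 GB GPU.
print(estimate_max_batch(7, 512 * 1024, 4096))  # 4
```

A model whose FP16 weights already exceed the GPU memory yields 0, which signals that the configuration does not fit on a single device of that size.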

Last updated: 2026-06-05

Editorial Note

MyCalcBuddy Editorial Team

This page is maintained as an educational calculator reference.

Source

Formula Source: Standard Mathematical References

by Various

Updated: Last reviewed May 2026

Checked: Formula checks are based on standard references and internal QA review.
