Speculative Decoding Calculator

Calculate speculative decoding speedup.

Model Configuration

Target Model Size (Billions)

Draft Model Size (Billions)

Speculation Length (K)

Acceptance Rate (0-1)

Target Model Latency (ms/token)

Draft Model Latency (ms/token)

Output Length (tokens)

Inference Speedup: 1.82x (7020 ms vs 12800 ms)

🎯 Tokens/Round: 3.29

⚡ Throughput: 36.6 tokens/s

Speculation Analysis

Expected Accepted Tokens: 2.53

Rounds Needed: 78

Time per Round: 90.0 ms

Draft Utilization: 63.3%

Memory & Efficiency

Target Model Memory: 130.4 GB

Draft Model Memory: 13.0 GB

Memory Overhead: 10.0%

Target Call Reduction: 69.5%

Speedup by Acceptance Rate

50% acceptance: 1.55x speedup

60% acceptance: 1.67x speedup

70% acceptance: 1.82x speedup

80% acceptance: 1.95x speedup

90% acceptance: 2.09x speedup

Optimal K: Based on your latency ratio, try K=3 for potentially better speedup.

What the Speculative Decoding Calculator Does

The Speculative Decoding Calculator estimates how much faster a large language model can generate text when a small, cheap draft model proposes several tokens at once and a large, accurate target model verifies them in a single forward pass. Speculative decoding is one of the most effective lossless inference acceleration techniques for modern LLMs: it produces exactly the same output distribution as standard autoregressive decoding, but trades a few inexpensive draft passes for fewer expensive target passes. This tool turns that trade-off into concrete numbers so you can decide whether the technique is worth deploying for your workload.

To use the calculator you provide the target model size in billions of parameters, the draft model size, the speculation length K (how many tokens the draft proposes per round), the acceptance rate alpha (the probability that a proposed token is accepted by the target), the target latency and draft latency in milliseconds per token, and the expected output length in tokens. From these seven inputs it computes the expected accepted tokens per round, the number of verification rounds needed, the total speculative decoding time versus standard decoding time, the resulting inference speedup, throughput in tokens per second, the draft model utilization, the memory overhead of hosting the draft alongside the target, and a sweep showing how speedup changes as the acceptance rate climbs from 50% to 90%.

Engineers serving chat assistants, code completion endpoints, and high-throughput batch pipelines use a speculative decoding calculator to answer practical questions: Is my draft model fast enough relative to the target? Will a longer speculation length actually help, or will rejections waste the extra draft passes? How much extra GPU memory does the draft model cost? By making the speedup math explicit, the calculator helps you choose a draft-target pairing, tune the speculation length K, and set realistic latency expectations before you commit hardware to the deployment.

Because every accepted token skips a full target forward pass, the headline metric is the inference speedup: the ratio of standard autoregressive time to speculative time. A speedup of 1.8x means the same response arrives in roughly 55% of the original wall-clock time. The calculator also reports the target call reduction, which shows how many of the original per-token target passes you avoid, the single biggest driver of cost savings on memory-bandwidth-bound inference hardware.

How Speculative Decoding Works

Standard autoregressive generation produces one token per target forward pass, so a 256-token response requires 256 sequential passes through the large model. Speculative decoding breaks this serial bottleneck. In each round, the small draft model autoregressively generates K candidate tokens. The target model then scores all K candidates in one parallel forward pass. Using a rejection-sampling rule, the target accepts the longest prefix of candidates that passes verification and replaces the first rejected candidate with its own token, which guarantees the output follows the same distribution the target alone would have produced.
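The round described above can be sketched for the simplest case, greedy decoding, where "verification" reduces to checking that the draft's token equals the target's. This is a minimal illustration, not any framework's implementation: `NextToken`, `speculative_round`, and the toy models are hypothetical names, and the loop over target calls only emulates what a real system does in one batched forward pass. Like the calculator's model, the sketch emits a target token only when a candidate is rejected; production implementations usually also emit one extra target token when all K candidates are accepted.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
NextToken = Callable[[Sequence[int]], int]


def speculative_round(prefix: List[int], draft: NextToken,
                      target: NextToken, k: int) -> List[int]:
    """Run one greedy speculation round and return the tokens to append."""
    # 1. The draft proposes k candidate tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each position. A real system scores all k
    #    positions in a single forward pass; this loop only emulates that.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # correct the first mismatch and stop
            return accepted
        accepted.append(tok)
    return accepted


# Toy models (assumptions for illustration): the target counts modulo 7,
# and the draft agrees except when the prefix has length 2.
toy_target: NextToken = lambda p: len(p) % 7
toy_draft: NextToken = lambda p: 99 if len(p) == 2 else len(p) % 7

print(speculative_round([], toy_draft, toy_target, k=4))
```

Because every appended token is either confirmed or supplied by the target, the result is the same sequence the target would have produced greedily on its own.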

The efficiency hinges on the acceptance rate alpha. If the draft agrees with the target on most tokens, long prefixes get accepted and many target passes are skipped. The expected number of accepted tokens follows a geometric series: each successive candidate is accepted only if all earlier ones were, so each additional position contributes one more power of alpha. The calculator sums the K terms alpha^0 through alpha^(K − 1), which gives the closed form (1 − alpha^K) / (1 − alpha). It then adds the correction token the target emits when a round rejects a candidate (probability 1 − alpha^K), charges K draft passes plus one target pass per round, and divides the output length by the tokens produced per round to find how many rounds the full generation needs.

This is why the draft model must be both fast and well-aligned. A draft that is too slow erodes the savings from skipped target passes, while a draft that disagrees too often wastes its proposals. The sweet spot is a draft 5-15x smaller than the target that shares the same tokenizer and training distribution, typically yielding acceptance rates of 0.6-0.85 in practice.
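The geometric-series expectation described in this section can be checked numerically. The sketch below assumes the series runs over exponents 0 through K − 1, which is what the calculator's closed form (1 − α^K) / (1 − α) corresponds to; the function names are illustrative.

```python
def expected_accepted_series(alpha: float, k: int) -> float:
    """Sum the truncated geometric series alpha^0 + ... + alpha^(k-1)."""
    return sum(alpha ** i for i in range(k))


def expected_accepted_closed(alpha: float, k: int) -> float:
    """Closed form (1 - alpha^k) / (1 - alpha) used by the calculator."""
    if alpha == 1.0:
        return float(k)  # limit of the closed form as alpha approaches 1
    return (1 - alpha ** k) / (1 - alpha)


for alpha in (0.5, 0.7, 0.9):
    series = expected_accepted_series(alpha, 4)
    closed = expected_accepted_closed(alpha, 4)
    print(f"alpha={alpha}: series={series:.3f} closed={closed:.3f}")
```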

Speculative Decoding Speedup Formula

The calculator builds the speedup from a chain of five formulas. First it computes the expected accepted tokens per round as a truncated geometric series, then the time per round (K draft passes plus one target pass), then the tokens produced per round, the rounds needed, and finally the speedup as the ratio of standard time to speculative time.

Standard decoding time is simply the output length times the target latency. Speculative time is the number of rounds times the per-round cost. Each round produces the expected accepted tokens plus a correction token the target supplies whenever the round did not accept all K candidates, an event with probability 1 − alpha^K.

Quantity	Expression
Expected accepted tokens	(1 − α^K) / (1 − α)
Time per round	K × draftMs + targetMs
Tokens per round	E[accepted] + (1 − α^K)
Rounds needed	⌈ outLen / tokensPerRound ⌉
Speedup	(outLen × targetMs) / (rounds × timePerRound)

Inference Speedup from Speculative Decoding

Speedup = (outLen × targetMs) / (ceil(outLen / tokensPerRound) × (K × draftMs + targetMs))

Where:

outLen = Output length in tokens to generate
targetMs = Target (large) model latency in ms per token
draftMs = Draft (small) model latency in ms per token
K = Speculation length: tokens proposed per round
α = Acceptance rate (0-1): probability a proposed token is accepted
tokensPerRound = E[accepted] + (1 − α^K), where E[accepted] = (1 − α^K) / (1 − α)
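The formula chain above translates directly into a few lines of code. This is a sketch that follows the page's stated expressions; the `SpecEstimate` container and `estimate` function are illustrative names, not the calculator's actual source.

```python
import math
from dataclasses import dataclass


@dataclass
class SpecEstimate:
    expected_accepted: float
    tokens_per_round: float
    time_per_round_ms: float
    rounds: int
    standard_ms: float
    speculative_ms: float
    speedup: float


def estimate(out_len: int, target_ms: float, draft_ms: float,
             k: int, alpha: float) -> SpecEstimate:
    a_k = alpha ** k
    # Truncated geometric series; alpha == 1 is handled separately to avoid 0/0.
    expected = float(k) if alpha == 1.0 else (1 - a_k) / (1 - alpha)
    tokens_per_round = expected + (1 - a_k)   # accepted tokens plus correction term
    time_per_round = k * draft_ms + target_ms  # K draft passes plus one target pass
    rounds = math.ceil(out_len / tokens_per_round)
    standard = out_len * target_ms
    speculative = rounds * time_per_round
    return SpecEstimate(expected, tokens_per_round, time_per_round,
                        rounds, standard, speculative, standard / speculative)


result = estimate(out_len=256, target_ms=50, draft_ms=10, k=4, alpha=0.7)
print(f"{result.rounds} rounds, {result.speculative_ms:.0f} ms, {result.speedup:.2f}x")
```

With the page's default inputs this reproduces the 78 rounds, 7,020 ms, and 1.82x shown in the calculator output.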

Why Acceptance Rate and Speculation Length Matter

The two levers you control most directly are the acceptance rate alpha and the speculation length K. Acceptance rate measures how often the draft model's guess survives the target's verification. A higher alpha means longer accepted prefixes, fewer rounds, and a larger speedup. Acceptance is driven by how closely the draft mimics the target: same tokenizer, same domain, and a draft large enough to capture the easy, predictable tokens that make up much of natural text.

Speculation length K sets how aggressively you gamble. Increasing K can accept more tokens per round when alpha is high, but it also adds draft latency to every round and yields diminishing returns because the probability of accepting all K candidates, alpha^K, decays geometrically. With alpha = 0.7, the chance of accepting four straight tokens is only 0.7⁴ = 0.24, so pushing K to 8 or 10 rarely pays off. The calculator includes an optimal K hint that balances the geometric acceptance decay against your draft-to-target latency ratio.
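The diminishing returns from larger K can be seen by sweeping K with everything else fixed. The sketch below is a brute-force search under the page's speedup formula; it is not the calculator's optimal-K hint, whose exact rule the page does not state, so the two may disagree. Function names are illustrative, and `speedup` assumes alpha is strictly below 1.

```python
import math


def speedup(out_len: int, target_ms: float, draft_ms: float,
            k: int, alpha: float) -> float:
    a_k = alpha ** k
    tokens_per_round = (1 - a_k) / (1 - alpha) + (1 - a_k)  # assumes alpha < 1
    rounds = math.ceil(out_len / tokens_per_round)
    return (out_len * target_ms) / (rounds * (k * draft_ms + target_ms))


def sweep_k(out_len: int, target_ms: float, draft_ms: float,
            alpha: float, k_values=range(1, 11)) -> dict:
    """Return {K: speedup} for each candidate speculation length."""
    return {k: speedup(out_len, target_ms, draft_ms, k, alpha) for k in k_values}


results = sweep_k(out_len=256, target_ms=50, draft_ms=10, alpha=0.7)
best_k = max(results, key=results.get)
for k, s in results.items():
    print(f"K={k:2d}  alpha^K={0.7 ** k:.3f}  speedup={s:.2f}x")
print("best K in this sweep:", best_k)
```

Each extra candidate adds a fixed draft cost to every round, while its chance of being accepted shrinks by another factor of alpha, so the curve rises, peaks, and then falls.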

The built-in acceptance-rate sweep makes the relationship visible. Holding K, latencies, and output length fixed, raising alpha from 0.5 to 0.9 dramatically increases tokens accepted per round and therefore speedup. This is why teams invest in aligning the draft model through distillation or shared-vocabulary training: a few points of acceptance rate translate into meaningful latency reductions across millions of requests.
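The acceptance-rate sweep can be reproduced with the same formula. This sketch assumes the page's default inputs (K = 4, 50 ms target, 10 ms draft, 256 output tokens); the function names are illustrative.

```python
import math


def speedup(out_len: int, target_ms: float, draft_ms: float,
            k: int, alpha: float) -> float:
    a_k = alpha ** k
    tokens_per_round = (1 - a_k) / (1 - alpha) + (1 - a_k)  # assumes alpha < 1
    rounds = math.ceil(out_len / tokens_per_round)
    return (out_len * target_ms) / (rounds * (k * draft_ms + target_ms))


def acceptance_sweep(out_len: int, target_ms: float, draft_ms: float, k: int,
                     alphas=(0.5, 0.6, 0.7, 0.8, 0.9)) -> list:
    """Return (alpha, speedup rounded to 2 decimals) pairs."""
    return [(a, round(speedup(out_len, target_ms, draft_ms, k, a), 2))
            for a in alphas]


for alpha, s in acceptance_sweep(256, 50, 10, 4):
    print(f"{alpha:.0%} acceptance: {s:.2f}x speedup")
```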

Memory Overhead and Draft Utilization

Speculative decoding is not free in memory. You must host the draft model in GPU memory alongside the target. The calculator estimates each model's footprint assuming FP16 weights at 2 bytes per parameter, so a 70-billion-parameter target needs about 140 GB (130.4 GiB, which is the figure the calculator displays) and a 7-billion-parameter draft about 14 GB (13.0 GiB). The memory overhead is reported as the draft footprint expressed as a percentage of the target footprint, which equals the draft-to-target parameter ratio. A 7B draft paired with a 70B target adds roughly 10% memory overhead.
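The footprint estimate is simple enough to sketch. The code below assumes weights only (no KV cache or activations) and divides by 1024^3, which matches the 130.4 and 13.0 figures the calculator displays; the function names and the `bytes_per_param` parameter are illustrative.

```python
GIB = 1024 ** 3  # bytes per binary gigabyte


def weight_memory_gib(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Weights-only footprint; 2 bytes per parameter corresponds to FP16."""
    return params_billions * 1e9 * bytes_per_param / GIB


def memory_overhead_pct(target_billions: float, draft_billions: float,
                        bytes_per_param: float = 2.0) -> float:
    """Draft footprint as a percentage of the target footprint."""
    draft = weight_memory_gib(draft_billions, bytes_per_param)
    target = weight_memory_gib(target_billions, bytes_per_param)
    return 100.0 * draft / target


print(f"target: {weight_memory_gib(70):.1f}")
print(f"draft:  {weight_memory_gib(7):.1f}")
print(f"overhead: {memory_overhead_pct(70, 7):.1f}%")
```

Because both models use the same bytes per parameter, that factor cancels and the overhead reduces to the parameter ratio.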

The calculator also reports draft utilization, the expected accepted tokens divided by K. This tells you what fraction of the draft's proposed tokens actually survive verification. Low utilization means the draft is generating tokens that get thrown away, wasting compute. High utilization means the draft is well-matched and most of its work contributes to the final output. Combined with the target call reduction metric, which reports the percentage of original per-token target passes you avoid, these efficiency numbers help you judge whether the memory cost of the draft is justified by the latency win.
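Both efficiency metrics follow from quantities already computed for the speedup. In this sketch, draft utilization is the expected accepted tokens divided by K as stated above; the target call reduction is assumed to be 1 − rounds / outLen, an inference from the description that happens to reproduce the 69.5% shown in the calculator output. The function name is illustrative.

```python
import math


def efficiency(out_len: int, k: int, alpha: float) -> tuple:
    """Return (draft utilization, target call reduction) as fractions."""
    a_k = alpha ** k
    expected = (1 - a_k) / (1 - alpha)          # assumes alpha < 1
    tokens_per_round = expected + (1 - a_k)
    rounds = math.ceil(out_len / tokens_per_round)
    utilization = expected / k                   # share of proposals that survive
    target_call_reduction = 1 - rounds / out_len  # one target pass per round
    return utilization, target_call_reduction


util, reduction = efficiency(out_len=256, k=4, alpha=0.7)
print(f"draft utilization: {util:.1%}")
print(f"target call reduction: {reduction:.1%}")
```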

For memory-constrained deployments, the draft size is the key knob. A smaller draft adds less memory and lower per-round latency, but may align less well and lower the acceptance rate. The calculator lets you sweep both dimensions to find a draft that fits your GPU budget while still delivering a worthwhile speedup.

When to Use Speculative Decoding

Speculative decoding shines in latency-sensitive, single-stream serving where a user waits for a response token by token, such as interactive chat, coding copilots, and agentic tool-use loops. In these settings the target model is memory-bandwidth bound, meaning the GPU spends most of each step reading weights rather than doing math, so verifying K candidates in one pass costs barely more than verifying one. That free parallelism is exactly what speculative decoding exploits.

The technique is less compelling when the target is already compute-bound, for example at very large batch sizes where the GPU is saturated and the extra verification work competes for the same arithmetic units. It also underperforms on highly creative or out-of-distribution generation where the draft's acceptance rate collapses. Before deploying, use this speculative decoding calculator to confirm that your draft latency, acceptance rate, and output length combine into a speedup above 1.0; if the numbers say 1.1x, the added memory and operational complexity may not be worth it, but a clean 1.8x-2.5x on a chat workload usually is.
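One quick way to check whether a pairing can exceed 1.0x is to solve the speedup formula for the break-even draft latency. Ignoring the ceiling on the round count, the speedup is tokensPerRound × targetMs / (K × draftMs + targetMs), which exceeds 1 exactly when draftMs < targetMs × (tokensPerRound − 1) / K. The sketch below applies that rearrangement; the function name is illustrative, and the result is approximate for short outputs where rounding the round count matters.

```python
def break_even_draft_ms(target_ms: float, k: int, alpha: float) -> float:
    """Largest draft latency (ms/token) that still yields a speedup above 1.0,
    ignoring the ceiling applied to the number of rounds."""
    a_k = alpha ** k
    tokens_per_round = (1 - a_k) / (1 - alpha) + (1 - a_k)  # assumes alpha < 1
    return target_ms * (tokens_per_round - 1) / k


limit = break_even_draft_ms(target_ms=50, k=4, alpha=0.7)
print(f"draft must be faster than about {limit:.1f} ms/token")
```

For the page's default inputs the 10 ms draft sits well under this limit, which is consistent with the 1.82x result.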

Modern frameworks such as vLLM, TensorRT-LLM, and Hugging Face Transformers ship speculative decoding (sometimes called assisted generation) out of the box, and variants like Medusa and EAGLE replace the separate draft model with lightweight prediction heads trained on top of the target. Whichever path you choose, the speedup math in this calculator gives you a quick, vendor-neutral estimate of the payoff.

Worked Examples

70B Target with 7B Draft on a Chat Response

Problem:

A 70B target model (50 ms/token) is paired with a 7B draft (10 ms/token). Speculation length K = 4, acceptance rate alpha = 0.7, generating 256 output tokens. What is the speedup?

Solution Steps:

1. Standard time = outLen × targetMs = 256 × 50 = 12,800 ms.
2. Expected accepted = (1 − 0.7^4)/(1 − 0.7) = (1 − 0.2401)/0.3 = 0.7599/0.3 = 2.533 tokens.
3. Time per round = K × draftMs + targetMs = 4 × 10 + 50 = 90 ms.
4. Tokens per round = 2.533 + (1 − 0.2401) = 2.533 + 0.7599 = 3.293; rounds = ceil(256 / 3.293) = ceil(77.74) = 78.
5. Speculative time = 78 × 90 = 7,020 ms; speedup = 12,800 / 7,020 = 1.82x.

Result:

Speculative decoding delivers a 1.82x speedup, finishing the 256-token response in about 7,020 ms instead of 12,800 ms.

High Acceptance Rate Boosts the Speedup

Problem:

Same models as before (targetMs = 50, draftMs = 10, K = 4, outLen = 256), but a better-aligned draft pushes the acceptance rate to alpha = 0.9. How much does the speedup improve?

Solution Steps:

1. Expected accepted = (1 − 0.9^4)/(1 − 0.9) = (1 − 0.6561)/0.1 = 0.3439/0.1 = 3.439 tokens.
2. Tokens per round = 3.439 + (1 − 0.6561) = 3.439 + 0.3439 = 3.783.
3. Time per round is unchanged at 4 × 10 + 50 = 90 ms; rounds = ceil(256 / 3.783) = ceil(67.67) = 68.
4. Speculative time = 68 × 90 = 6,120 ms; standard time stays 256 × 50 = 12,800 ms.
5. Speedup = 12,800 / 6,120 = 2.09x.

Result:

Raising acceptance from 0.7 to 0.9 lifts the speedup from 1.82x to 2.09x, showing how much draft alignment matters.

Memory Overhead of the Draft Model

Problem:

Using FP16 weights, how much GPU memory do a 70B target and a 7B draft require, and what is the memory overhead of adding the draft?

Solution Steps:

1. Target memory = 70 × 10^9 params × 2 bytes / (1024^3) ≈ 130.4 GB.
2. Draft memory = 7 × 10^9 params × 2 bytes / (1024^3) ≈ 13.0 GB.
3. Memory overhead = draftMemory / targetMemory × 100 = 13.0 / 130.4 × 100 ≈ 10%.
4. This 10% equals the 7B-to-70B parameter ratio, since both use the same bytes-per-parameter.

Result:

The draft adds about 13 GB on top of the target's 130 GB, roughly a 10% memory overhead, in exchange for the 1.8x-2.1x latency win.

Tips & Best Practices

✓ Pair a draft model that is 5-15x smaller than the target and shares the same tokenizer to maximize the acceptance rate.
✓ Start with K = 4 and adjust; the calculator's optimal-K hint accounts for your draft-to-target latency ratio.
✓ Aim for a draft that runs at least 5x faster than the target so the K extra passes per round stay cheap.
✓ Watch the draft utilization metric: low utilization means many proposed tokens are wasted, signaling a poorly aligned draft.
✓ Confirm the computed speedup exceeds 1.0 before deploying; a marginal 1.1x rarely justifies the added complexity and memory.
✓ Distill or fine-tune the draft on the target's outputs to raise acceptance rate by several points with no quality loss.
✓ Use speculative decoding for interactive chat and code completion, where single-stream latency matters most.
✓ Consider self-speculative methods like Medusa or EAGLE to avoid hosting a separate draft model entirely.

Frequently Asked Questions

Does speculative decoding change the model's output quality?

No. Speculative decoding is mathematically lossless: the rejection-sampling verification step guarantees the final output follows exactly the same probability distribution as standard autoregressive decoding from the target model. You get faster generation with identical quality, not an approximation.

What acceptance rate should I expect?

Well-matched draft and target models that share a tokenizer and training distribution typically achieve acceptance rates between 0.6 and 0.85 on natural text and code. Out-of-distribution or highly creative prompts push the rate lower, while predictable, templated text can push it higher. The calculator lets you test how sensitive your speedup is to this value.

How should I choose the speculation length K?

Higher K accepts more tokens per round when acceptance is high, but adds draft latency every round and yields diminishing returns because the chance of accepting all K candidates is alpha raised to the K power, which decays fast. Most deployments find K between 3 and 6 optimal; the calculator's optimal-K hint balances acceptance decay against your draft-to-target latency ratio.

Why does the draft model's speed matter?

Each speculation round costs K draft forward passes plus one target pass. If the draft is slow relative to the target, those K extra passes erode the time saved by skipping target passes. A draft that runs 5-15x faster than the target keeps the per-round overhead small enough to deliver a real speedup.

Does speculative decoding help at large batch sizes?

It helps most for single-stream, latency-sensitive serving where the target is memory-bandwidth bound and verifying many candidates in one pass is nearly free. At very large batch sizes the GPU becomes compute-bound, the extra verification work competes for arithmetic units, and the speedup shrinks. Use the calculator to confirm the speedup exceeds 1.0 for your scenario.

How much extra memory does speculative decoding need?

You must host the draft model alongside the target. At FP16, memory scales at about 2 bytes per parameter, so a 7B draft adds roughly 14 GB (about 13.0 GiB, which is the figure the calculator displays). The calculator reports this as a memory overhead percentage equal to the draft-to-target parameter ratio, around 10% for a 7B draft with a 70B target.

Sources & References

Last updated: 2026-06-05


Editorial Note

MyCalcBuddy Editorial Team

This page is maintained as an educational calculator reference.

Source

Formula Source: Standard Mathematical References

by Various

Updated: Last reviewed May 2026

Checked: Formula checks are based on standard references and internal QA review.
