Inference Latency Calculator

Calculate AI model inference latency and throughput.

Total Response Latency: 3.00 s (100.0 tokens/sec)
Time to First Token: 1.00 s
Inter-token Latency: 10.0 ms

Latency Breakdown

Prefill Time (Input): 1.00 s
Generation Time (Output): 2.00 s
Total Tokens: 700
Effective Throughput: 233.3 tok/s

What the Inference Latency Calculator Measures

The inference latency calculator estimates how long a large language model (LLM) takes to respond to a single request, broken into the two phases that every transformer goes through: prefill (reading your prompt) and decode (generating the answer one token at a time). Latency is the single most important user-facing metric for chatbots, copilots, RAG pipelines, and any real-time AI feature, because it determines whether a response feels instant or sluggish.

This LLM latency calculator takes six inputs that map directly to a real deployment: the model size in billions of parameters, the number of input tokens (your prompt), the number of output tokens (the response you expect), the GPU type running the model, the quantization precision, and the batch size of concurrent requests. From these it derives a sustained generation speed in tokens per second, then converts that into wall-clock milliseconds.

Four numbers come out the other side. Time to first token (TTFT) tells you how long the user waits before any text appears. Inter-token latency (ITL) is the gap between successive streamed tokens and governs how smoothly text flows. Total response latency is the full end-to-end time, and effective throughput is the average number of tokens processed per second across the whole request. Together these tell you whether a configuration will feel snappy or frustrating before you ever provision a GPU.

Because the calculator separates prefill from decode, it also explains why latency behaves the way it does. Long prompts inflate TTFT but barely touch inter-token latency, while long responses dominate total time because every output token is generated sequentially. Understanding that split is the key to optimizing real inference workloads, and it is exactly what this tool is built to surface.

How the Inference Latency Is Calculated

The model starts from a published baseline of tokens per second for a 7B model at FP16 on each supported GPU, then scales that baseline by three multipliers. A quantization multiplier rewards lower precision (INT4 runs faster than FP16), a size multiplier penalizes larger models using a square-root relationship, and a batch-efficiency factor models the throughput gain from processing multiple requests together, also as a square root.

Once the sustained generation speed is known, the calculator splits the request into prefill and decode. Prefill processes the entire prompt in parallel and is assumed to run roughly 5x faster than autoregressive generation, so the prompt length is divided by five times the generation speed to get time to first token. Decode then divides the output tokens by the plain generation speed, because each output token depends on the tokens before it and cannot be generated in parallel.

Inter-token latency is simply the reciprocal of the generation speed expressed in milliseconds, and effective throughput averages all tokens (input plus output) over the total wall-clock time. Every result you see on the page comes from these closed-form expressions, so the math is fully transparent and reproducible.
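Substituting the prefill and decode times into the throughput definition collapses it into a single expression. The shorthand below is introduced here for brevity and is not the calculator's own notation: s stands for the generation speed in tokens per second, I for input tokens, and O for output tokens.

```latex
\begin{aligned}
T_{\text{total}} &= \frac{I}{5s} + \frac{O}{s} = \frac{I + 5O}{5s} \quad \text{(seconds)} \\
\text{throughput} &= \frac{I + O}{T_{\text{total}}} = \frac{5s\,(I + O)}{I + 5O}
\end{aligned}
```

For s = 100, I = 500, and O = 200 this gives 5 x 100 x 700 / 1500 = 233.3 tok/s.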

GPU	Base tok/s (7B, FP16)
NVIDIA H100	150
NVIDIA A100 80GB	100
NVIDIA A100 40GB	80
RTX 4090	70
NVIDIA V100	50
RTX 3090	40
NVIDIA T4	25

The quantization multipliers are FP32 = 0.5, FP16 = 1.0, BF16 = 1.0, INT8 = 1.5, and INT4 = 2.5, reflecting that lower-precision arithmetic moves less data through memory and runs faster on tensor cores.
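The scaling described so far can be sketched in a few lines of Python. This is an illustrative reading of the tables above, not the calculator's actual source; the names `BASE_TOK_PER_SEC`, `QUANT_MULTIPLIER`, and `tokens_per_second` are hypothetical.

```python
import math

# Baseline tokens/sec for a 7B FP16 model, copied from the GPU table above.
BASE_TOK_PER_SEC = {
    "NVIDIA H100": 150,
    "NVIDIA A100 80GB": 100,
    "NVIDIA A100 40GB": 80,
    "RTX 4090": 70,
    "NVIDIA V100": 50,
    "RTX 3090": 40,
    "NVIDIA T4": 25,
}

# Quantization multipliers as listed in the text.
QUANT_MULTIPLIER = {"FP32": 0.5, "FP16": 1.0, "BF16": 1.0, "INT8": 1.5, "INT4": 2.5}


def tokens_per_second(gpu: str, quant: str, model_size_b: float, batch: int = 1) -> float:
    """Sustained generation speed under the page's square-root scaling model."""
    if model_size_b <= 0:
        raise ValueError("model_size_b must be positive")
    if batch < 1:
        raise ValueError("batch must be at least 1")
    size_mult = math.sqrt(7.0 / model_size_b)  # penalty relative to the 7B baseline
    batch_mult = math.sqrt(batch)              # square-root batch efficiency
    return BASE_TOK_PER_SEC[gpu] * QUANT_MULTIPLIER[quant] * size_mult * batch_mult


print(round(tokens_per_second("NVIDIA H100", "INT8", 13), 1))
```

Keeping the tables as plain dictionaries makes it easy to swap in baselines measured on your own hardware.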

Inference latency formulas

tokensPerSecond = baseSpeed x quantMult x sqrt(7e9 / params) x sqrt(batch)
TTFT = inputTokens / (5 x tokensPerSecond) x 1000
generationTime = outputTokens / tokensPerSecond x 1000
totalLatency = TTFT + generationTime

Where:

baseSpeed = Baseline tokens/sec for a 7B FP16 model on the chosen GPU
quantMult = Quantization multiplier (FP32 0.5, FP16/BF16 1.0, INT8 1.5, INT4 2.5)
params = Model parameters = modelSize (billions) x 1e9
batch = Batch size, the number of concurrent requests
inputTokens = Prompt length in tokens, processed during prefill
outputTokens = Response length in tokens, generated during decode
TTFT = Time to first token in milliseconds (prefill time)
totalLatency = End-to-end response latency in milliseconds
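A minimal Python sketch of the timing formulas follows, assuming the generation speed has already been computed. `LatencyEstimate`, `estimate_latency`, and `PREFILL_SPEEDUP` are illustrative names rather than identifiers from the calculator itself.

```python
from dataclasses import dataclass

PREFILL_SPEEDUP = 5.0  # prefill is modeled as 5x faster than decode


@dataclass(frozen=True)
class LatencyEstimate:
    ttft_ms: float                # time to first token (prefill)
    generation_ms: float          # decode time for all output tokens
    inter_token_ms: float         # gap between streamed tokens
    total_ms: float               # ttft_ms + generation_ms
    effective_tok_per_sec: float  # (input + output tokens) / total seconds


def estimate_latency(tokens_per_second: float, input_tokens: int, output_tokens: int) -> LatencyEstimate:
    """Apply the closed-form expressions above to one request."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    ttft_ms = input_tokens / (PREFILL_SPEEDUP * tokens_per_second) * 1000
    generation_ms = output_tokens / tokens_per_second * 1000
    total_ms = ttft_ms + generation_ms
    total_tokens = input_tokens + output_tokens
    # Guard against an empty request, which would otherwise divide by zero.
    effective = total_tokens / (total_ms / 1000) if total_ms > 0 else 0.0
    return LatencyEstimate(ttft_ms, generation_ms, 1000 / tokens_per_second, total_ms, effective)


print(estimate_latency(100.0, input_tokens=500, output_tokens=200))
```

Returning a frozen dataclass keeps the derived numbers together and read-only, which suits a pure calculation like this one.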

Prefill vs Decode: The Two Phases of LLM Latency

Every LLM request has two distinct phases, and the inference latency calculator models both separately because they scale in completely different ways. The prefill phase ingests your full prompt at once. Because the transformer can attend to all prompt tokens in parallel, prefill is compute-bound and finishes quickly, which is why the calculator treats it as roughly five times faster than generation. Prefill time is exactly what the user experiences as time to first token.

The decode phase is fundamentally serial. Each new output token is conditioned on every token generated before it, so the GPU cannot generate token 200 until it has generated token 199. This autoregressive bottleneck makes decode memory-bandwidth-bound and is the reason long responses dominate total latency. The gap between two streamed tokens is the inter-token latency, and at 100 tokens per second that gap is a steady 10 milliseconds.

This split has practical consequences. If your prompt is huge but the answer is short (for example a long RAG context summarized into one sentence), TTFT will be large but total latency stays modest. If your prompt is tiny but the answer is long (such as generating a full article), TTFT is near-instant yet total latency balloons because hundreds of tokens are decoded one after another. Knowing which phase dominates tells you whether to optimize context length, switch to a faster GPU, or apply quantization to speed up decoding.

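Batching and speculative decoding both target this serial bottleneck. The batch-size input in the calculator raises throughput via a square-root efficiency curve, modeling how a GPU amortizes weight loads across several requests, even though each individual request still pays its own decode cost. Because the formula applies the batch factor directly to tokensPerSecond, the TTFT and inter-token latency that the calculator reports at batch sizes above 1 are best read as aggregate figures rather than what a single request experiences. Understanding prefill versus decode is the foundation for every latency optimization you will make in production.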
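To make the aggregate-versus-per-request distinction concrete, here is a small sketch. It rests on an assumption the calculator does not make explicit: that the aggregate throughput is shared evenly among the requests in a batch. The function name `batch_view` is hypothetical.

```python
import math


def batch_view(single_request_tok_per_sec: float, batch: int) -> dict:
    """Aggregate vs per-request decode speed under a square-root batch gain."""
    if batch < 1:
        raise ValueError("batch must be at least 1")
    aggregate = single_request_tok_per_sec * math.sqrt(batch)
    # ASSUMPTION (not part of the calculator): requests split the aggregate evenly.
    per_request = aggregate / batch
    return {"aggregate_tok_per_sec": aggregate, "per_request_tok_per_sec": per_request}


for b in (1, 4, 16):
    print(b, batch_view(100.0, b))
```

Under that assumption the GPU serves more tokens per second overall as the batch grows, while each individual stream slows down, which is why batching is a capacity tool rather than a latency tool.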

Factors That Affect AI Inference Latency

Six levers control inference latency, and this calculator lets you adjust all of them. Model size is a major one: generation speed falls with the square root of the parameter ratio, so doubling the parameter count cuts speed to about 71%, and a 28B model (four times the parameters) runs at half the speed of a 7B model on the same hardware. Larger models read more weights from GPU memory for every single token, and that memory traffic is the dominant cost during autoregressive decoding.

GPU type sets the baseline. An H100 runs a 7B FP16 model at about 150 tokens per second while a T4 manages only 25, a six-fold difference that flows straight through to TTFT, inter-token latency, and total response time. Quantization trades numerical precision for speed: moving from FP16 to INT8 multiplies throughput by 1.5x, and INT4 multiplies it by 2.5x, because smaller weights move less data through the memory subsystem.

Prompt length drives prefill cost and therefore TTFT, while output length drives decode cost and therefore the bulk of total latency. Finally, batch size improves aggregate throughput, modeled here as a square-root gain, because the GPU can load model weights once and reuse them across several concurrent requests. The table below summarizes how each input pushes latency up or down.

Input	Effect on latency
Larger model size	Slower (1 / sqrt of parameter ratio)
Faster GPU	Faster (higher base tok/s)
Lower precision (INT8/INT4)	Faster (1.5x to 2.5x)
Longer prompt	Higher TTFT (total latency rises by the same amount)
Longer response	Higher total latency
Larger batch	Higher throughput (sqrt gain)
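The size penalty in the first row is easy to tabulate. The snippet below is a sketch of that single multiplier; `size_multiplier` and `REFERENCE_SIZE_B` are names chosen here for illustration.

```python
import math

REFERENCE_SIZE_B = 7.0  # the baselines are quoted for a 7B model


def size_multiplier(model_size_b: float) -> float:
    """Speed relative to a 7B model: sqrt(7 / model size in billions)."""
    if model_size_b <= 0:
        raise ValueError("model_size_b must be positive")
    return math.sqrt(REFERENCE_SIZE_B / model_size_b)


for size_b in (7, 13, 28, 70):
    print(f"{size_b:>3}B -> {size_multiplier(size_b):.3f}x the 7B speed")
```

The 28B row comes out at 0.500, matching the halving described above.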

How to Reduce LLM Inference Latency

Once the inference latency calculator shows where your time is going, you can attack the largest component first. If total latency is dominated by generation, the highest-leverage move is usually quantization: switching from FP16 to INT8 or INT4 multiplies generation speed by 1.5x or 2.5x in this model, cutting both inter-token latency and total response time with minimal quality loss for many tasks.

If TTFT is your problem, shorten the prompt. Trimming retrieved context in a RAG system, compressing system instructions, or caching repeated prefixes all reduce the number of input tokens the model must read before producing the first word. Because TTFT in this model is directly proportional to prompt length, halving the prompt halves the wait for the first token.

Upgrading the GPU is the bluntest instrument and the most reliable: moving a workload from a T4 to an A100 80GB roughly quadruples base throughput, and an H100 adds another 50% on top. Choosing a smaller model that still meets your quality bar is equally powerful, since the square-root size penalty means a 7B model can be more than twice as fast as a 30B model on identical hardware.

For high-traffic services, increasing batch size raises aggregate throughput so you serve more users per GPU, even though individual request latency is not reduced by batching alone. The right strategy is almost always a combination: pick the smallest acceptable model, quantize it, run it on the fastest GPU your budget allows, keep prompts lean, and batch concurrent traffic. Re-run the calculator after each change to confirm the win before committing hardware.
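The effect of stacking these changes can be sketched by applying them one at a time. The starting configuration (a 30B FP16 model on a T4) is a hypothetical example chosen for illustration, and `generation_speed` is an illustrative helper. Batch size is left out because it raises aggregate throughput rather than single-request speed.

```python
import math


def generation_speed(base_speed: float, quant_mult: float, model_size_b: float) -> float:
    """Single-request tokens/sec under the page's scaling model (batch = 1)."""
    return base_speed * quant_mult * math.sqrt(7.0 / model_size_b)


# (label, base tok/s from the GPU table, quantization multiplier, size in billions)
steps = [
    ("30B FP16 on T4 (starting point)", 25, 1.0, 30),
    ("switch to a 7B model", 25, 1.0, 7),
    ("quantize to INT4", 25, 2.5, 7),
    ("move to an H100", 150, 2.5, 7),
]

speeds = [(label, generation_speed(base, quant, size)) for label, base, quant, size in steps]
baseline = speeds[0][1]
for label, speed in speeds:
    print(f"{label:<32} {speed:7.1f} tok/s  ({speed / baseline:5.1f}x)")
```

Because the multipliers compound, each step multiplies the gain from the previous one rather than adding to it.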

Interpreting Your Latency Results

The calculator returns four headline numbers, and each answers a different operational question. Total response latency is what a user waits for a complete, non-streamed answer; for interactive chat you generally want this under a few seconds. Time to first token matters most for streamed UIs, where the perception of speed depends on how quickly text begins to appear rather than when it finishes.

Inter-token latency determines whether streamed text reads smoothly or stutters; below about 50 milliseconds per token, output appears fluid to most readers, and at 10 milliseconds per token (100 tok/s) it streams faster than anyone can read. Effective throughput blends input and output over the whole request and is the right metric for capacity planning, because it reflects total tokens moved per second rather than just decode speed.

Read the latency breakdown to find your bottleneck. A large prefill time relative to generation means your prompt is the constraint, so context trimming or prefix caching will help most. A large generation time means the response length and decode speed dominate, pointing you toward quantization, a faster GPU, or a smaller model. These estimates are planning figures based on representative tokens-per-second baselines, so treat them as a strong starting point and validate against your own benchmarks before finalizing a production deployment.
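The reading rules above can be written down as a small helper. This is only a sketch: the 50 ms cut-off is the rough fluency threshold quoted earlier, and `interpret` and `SMOOTH_STREAM_MS` are hypothetical names.

```python
SMOOTH_STREAM_MS = 50.0  # rough fluency threshold quoted in the text


def interpret(ttft_ms: float, generation_ms: float, inter_token_ms: float) -> dict:
    """Label the dominant phase and suggest where to look first."""
    if ttft_ms > generation_ms:
        bottleneck = "prefill"
        advice = "trim context or cache repeated prefixes"
    else:
        bottleneck = "generation"
        advice = "quantize, use a faster GPU, or pick a smaller model"
    total_ms = ttft_ms + generation_ms
    return {
        "bottleneck": bottleneck,
        "advice": advice,
        # Fraction of total time spent reading the prompt.
        "prefill_share": ttft_ms / total_ms if total_ms > 0 else 0.0,
        "streams_smoothly": inter_token_ms < SMOOTH_STREAM_MS,
    }


print(interpret(ttft_ms=1000, generation_ms=2000, inter_token_ms=10))
```

For the default request shown at the top of the page, this reports generation as the bottleneck with prefill taking one third of the total time.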

Worked Examples

7B model on an A100, default chat request

Problem:

Estimate latency for a 7B model at FP16 on an A100 80GB, batch size 1, with a 500-token prompt and a 200-token response.

Solution Steps:

1. tokensPerSecond = 100 (A100 base) x 1.0 (FP16) x sqrt(7e9/7e9) x sqrt(1) = 100 x 1.0 x 1 x 1 = 100 tok/s.
2. TTFT = 500 / (5 x 100) x 1000 = 500 / 500 x 1000 = 1000 ms = 1.00 s.
3. Generation time = 200 / 100 x 1000 = 2000 ms = 2.00 s, and inter-token latency = 1000 / 100 = 10 ms.
4. Total latency = 1000 + 2000 = 3000 ms; effective throughput = 700 tokens / 3.00 s = 233.3 tok/s.

Result:

Total latency 3.00 s, TTFT 1.00 s, inter-token latency 10 ms, throughput 233.3 tok/s.

13B model on an H100 with INT8 quantization

Problem:

Estimate latency for a 13B model on an H100 using INT8, batch size 1, with a 1000-token prompt and a 500-token response.

Solution Steps:

1. Size multiplier = sqrt(7e9/13e9) = sqrt(0.5385) = 0.7338; tokensPerSecond = 150 x 1.5 (INT8) x 0.7338 x 1 = 165.1 tok/s.
2. TTFT = 1000 / (5 x 165.1) x 1000 = 1000 / 825.5 x 1000 = 1211 ms = 1.21 s.
3. Generation time = 500 / 165.1 x 1000 = 3028 ms = 3.03 s; inter-token latency = 1000 / 165.1 = 6.06 ms.
4. Total latency = 1211 + 3028 ≈ 4240 ms = 4.24 s (summing the unrounded terms); throughput = 1500 / 4.24 s = 353.8 tok/s.

Result:

Total latency 4.24 s, TTFT 1.21 s, inter-token latency 6.06 ms, throughput 353.8 tok/s.

7B model on an RTX 4090 with INT4 and batching

Problem:

Estimate latency for a 7B model on an RTX 4090 using INT4, batch size 4, with a 200-token prompt and a 100-token response.

Solution Steps:

1. Batch efficiency = sqrt(4) = 2; tokensPerSecond = 70 x 2.5 (INT4) x 1 (7B) x 2 = 350 tok/s.
2. TTFT = 200 / (5 x 350) x 1000 = 200 / 1750 x 1000 = 114 ms.
3. Generation time = 100 / 350 x 1000 = 286 ms; inter-token latency = 1000 / 350 = 2.86 ms.
4. Total latency = 114 + 286 = 400 ms = 0.40 s; throughput = 300 / 0.40 s = 750 tok/s.

Result:

Total latency 0.40 s, TTFT 114 ms, inter-token latency 2.86 ms, throughput 750 tok/s.
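The three worked examples can be re-derived end to end with a short script. This is a sketch that chains the formulas from this page; the function name `latency` and the dictionary layout are illustrative choices, and the base speeds and multipliers are passed in as plain numbers taken from the tables.

```python
import math


def latency(base_speed, quant_mult, model_size_b, batch, input_tokens, output_tokens):
    """Chain the page's formulas for one configuration; times are in milliseconds."""
    tps = base_speed * quant_mult * math.sqrt(7 / model_size_b) * math.sqrt(batch)
    ttft_ms = input_tokens / (5 * tps) * 1000
    generation_ms = output_tokens / tps * 1000
    total_ms = ttft_ms + generation_ms
    return {
        "tok_per_sec": tps,
        "ttft_ms": ttft_ms,
        "generation_ms": generation_ms,
        "inter_token_ms": 1000 / tps,
        "total_ms": total_ms,
        "throughput": (input_tokens + output_tokens) / (total_ms / 1000),
    }


# (base tok/s, quant multiplier, size in billions, batch, input tokens, output tokens)
examples = {
    "7B FP16 on A100 80GB": (100, 1.0, 7, 1, 500, 200),
    "13B INT8 on H100": (150, 1.5, 13, 1, 1000, 500),
    "7B INT4 on RTX 4090, batch 4": (70, 2.5, 7, 4, 200, 100),
}

results = {name: latency(*args) for name, args in examples.items()}
for name, r in results.items():
    print(
        f"{name}: total {r['total_ms'] / 1000:.2f} s, TTFT {r['ttft_ms']:.0f} ms, "
        f"ITL {r['inter_token_ms']:.2f} ms, throughput {r['throughput']:.1f} tok/s"
    )
```

The printed figures agree with the Result lines of the three examples to the rounding shown there.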

Tips & Best Practices

✓ Optimize TTFT by trimming prompt length and caching repeated system prefixes.
✓ Apply INT8 or INT4 quantization to cut generation time with minimal quality loss.
✓ Pick the smallest model that meets your quality bar; the size penalty scales with the square root of parameters.
✓ Match the GPU to the workload: an H100 generates about six times faster than a T4.
✓ Use batching to raise throughput and serve more users per GPU, not to speed up single requests.
✓ Check the latency breakdown to see whether prefill or generation is your real bottleneck.
✓ For streamed UIs, prioritize low TTFT and inter-token latency over raw total latency.
✓ Always benchmark your chosen configuration under production-like load before committing hardware budgets.

Frequently Asked Questions

TTFT is the delay between submitting a request and seeing the first generated token, and in this calculator it equals the prefill time needed to read your prompt. It matters because in streamed chat interfaces users judge responsiveness by how fast text starts appearing, not by when it finishes. Lowering TTFT usually means shortening the prompt or upgrading to a faster GPU.

Prompts are processed during prefill, which runs in parallel and is modeled as roughly five times faster than generation, so they mainly affect TTFT. Responses are generated autoregressively, one token at a time, so each output token adds the full inter-token latency to the total. That serial dependency is why output length dominates total response latency.

In this model, FP16 and BF16 are the 1.0x baseline, INT8 multiplies generation speed by 1.5x, and INT4 multiplies it by 2.5x, while FP32 is half the FP16 speed at 0.5x. Lower precision moves fewer bytes through GPU memory, which is the bottleneck during decoding. The trade-off is a small, task-dependent loss in output quality that is often negligible for INT8.

Batching does not reduce the latency of a single request. It raises aggregate throughput through a square-root efficiency gain because the GPU loads model weights once and reuses them across concurrent requests. Each individual request still pays its own decode cost, so per-request latency is not reduced by batching alone. Batching is for serving more users per GPU, not for speeding up a single response.

The calculator uses representative tokens-per-second baselines for each GPU at 7B FP16 and scales them with closed-form multipliers, so the results are solid planning estimates rather than exact benchmarks. Real latency varies with the serving framework, KV-cache behavior, network overhead, and concurrent load. Use the numbers to compare configurations, then validate the winner against your own measurements.

Inter-token latency is the gap between consecutive streamed tokens, computed as 1000 divided by the generation speed in tokens per second. Below roughly 50 milliseconds per token the stream reads smoothly, and at 10 milliseconds per token (100 tok/s) it arrives faster than people can read. Reducing it requires a faster GPU, a smaller model, or lower-precision quantization.

Last updated: 2026-06-05

Editorial Note

MyCalcBuddy Editorial Team

This page is maintained as an educational calculator reference.

Source

Formula Source: Standard Mathematical References

by Various

Updated: Last reviewed May 2026

Checked: Formula checks are based on standard references and internal QA review.
