Tokenomics for Engineers: Measuring Throughput per Dollar Instead of Tokens per Second
Tokens per second is a fun number. It looks good in screenshots. It makes benchmark charts feel like race cars. But if you are operating production inference, tokens per second is only the opening line.
The real question is:
How many useful tokens can I deliver per dollar while meeting latency and quality targets?

That sentence is less glamorous, but it buys the GPUs.
Welcome to tokenomics for engineers.
Start with useful tokens
Not all tokens are equal. A streamed answer token that reaches a user inside the latency SLO is useful. A token generated after the client disconnected is confetti. A token from a failed retry may be necessary, but it is not free. Draft tokens in speculative decoding may help, but you should count them differently from accepted output tokens.
The first rule: define the unit.
For production systems, I like:
useful output tokens
+ required input tokens
- wasted retry tokens
- canceled tokens
- low-quality fallback tokens

You do not need a perfect formula on day one. You need a formula honest enough to stop rewarding the wrong behavior.
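If you want that accounting in code, a minimal sketch might look like this, assuming your request logs already record per-request token counts plus flags for retries, cancellations, and fallbacks. Every field and function name here is illustrative, not a real logging schema.

```python
from dataclasses import dataclass


@dataclass
class RequestTokens:
    # Illustrative fields: map them to whatever your serving logs actually record.
    output_tokens: int                  # tokens generated for this request
    input_tokens: int                   # prompt and context tokens the request had to read
    was_retry: bool = False             # this request retried an earlier failure
    retry_succeeded: bool = False       # the retry produced an accepted answer
    canceled_tokens: int = 0            # tokens generated after the client disconnected
    low_quality_fallback: bool = False  # served by a fallback that missed the quality bar


def useful_tokens(requests: list[RequestTokens]) -> int:
    """First-pass 'useful token' count: honest enough to stop rewarding waste."""
    total = 0
    for r in requests:
        if r.low_quality_fallback:
            continue                    # low-quality fallback tokens do not count
        if r.was_retry and not r.retry_succeeded:
            continue                    # wasted retry tokens do not count
        total += r.output_tokens + r.input_tokens
        total -= r.canceled_tokens      # tokens after cancellation are confetti
    return total
```

The exact fields matter less than the refusal to reward retries, cancellations, and low-quality fallbacks.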
Measure the latency envelope
MLCommons uses tokens per second for LLM throughput because requests vary widely in input and output length. It also calls out latency metrics like time to first token (TTFT) and time per output token (TPOT), because raw throughput without interactivity is not enough.
That distinction matters. A batch system may maximize total tokens per second. A chat system must protect time to first token and time between tokens. A coding agent may tolerate a slower first token if the overall task completes reliably. A voice assistant has almost no patience at all.
Your cost metric should include the SLO:
tokens per dollar at TTFT <= target and TPOT <= target

Otherwise you will accidentally optimize for a system that is cheap because users leave.
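As a sketch, that SLO-gated metric is only a few lines, assuming you already measure per-request TTFT and TPOT and know your all-in hourly cost. The targets and the tuple shape below are placeholders, not recommendations.

```python
def slo_gated_tokens_per_dollar(requests, hourly_cost_usd, ttft_target_s=0.5, tpot_target_s=0.05):
    """Tokens per dollar, counting only requests that met both latency targets.

    `requests` is an iterable of (output_tokens, ttft_seconds, tpot_seconds)
    observed over one hour; the default targets are placeholders.
    """
    tokens_in_slo = sum(
        out_toks
        for out_toks, ttft, tpot in requests
        if ttft <= ttft_target_s and tpot <= tpot_target_s
    )
    return tokens_in_slo / hourly_cost_usd
```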
Utilization is not the same as value
GPU utilization is useful, but it is not the business metric. A GPU can be busy generating low-priority retries while interactive users wait. It can be busy recomputing prefixes that should have been cached. It can be busy serving a model variant that misses quality targets and triggers downstream rework.
The better ladder is:
- Is the GPU busy?
- Is it busy on the right work?
- Is the work meeting latency SLOs?
- Is the output accepted by the product?
- Is the cost per accepted outcome improving?
Step one is infrastructure. Step five is economics.
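If it helps to make the ladder concrete, here is one illustrative way to phrase each rung as a dashboard check. The metric names and thresholds are placeholders you would replace with your own.

```python
# Placeholder thresholds: each rung of the ladder becomes a dashboard ratio.
LADDER = [
    ("gpu_busy",            lambda m: m["gpu_utilization"] > 0.70),
    ("busy_on_right_work",  lambda m: m["interactive_token_share"] > 0.80),
    ("meeting_latency_slo", lambda m: m["slo_success_rate"] > 0.95),
    ("output_accepted",     lambda m: m["acceptance_rate"] > 0.90),
    ("unit_cost_improving", lambda m: m["cost_per_accepted_outcome_trend"] < 0),
]


def first_failing_rung(metrics: dict) -> str:
    """Return the lowest rung that fails, or 'economics' if all five hold."""
    for name, check in LADDER:
        if not check(metrics):
            return name
    return "economics"
```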
Where NVIDIA’s story is strong
NVIDIA talks about cost per token a lot, and for once I think the framing is right. The reason is that inference economics are not only a chip benchmark. They are a full-stack optimization problem.
On the hardware side, H200 increased memory capacity and bandwidth over H100, which matters for large models and cache-heavy serving. Blackwell adds new precision options such as NVFP4 and a much larger scale-up fabric in GB200/GB300 NVL72-style systems. On the software side, TensorRT-LLM, vLLM, SGLang, NIM, and Dynamo all affect the realized cost per token. MLPerf results have shown continued gains from both hardware and software improvements, and NVIDIA’s own submissions are strongest when you look at the platform rather than a single component.
The important caveat: never copy a vendor benchmark into your financial model without translating it into your workload. Sequence length, batch shape, output length, model architecture, quality target, and interactivity change everything.
This is not cynicism. This is engineering hygiene.
The actual formula
For a first-pass model:
cost per useful token =
all-in hourly cost of serving stack
/ useful tokens delivered per hour

A short sketch that puts the pieces together follows the two lists below. The numerator should include:
- GPU or instance cost
- CPU, memory, storage, and networking
- orchestration overhead
- idle capacity kept for SLO headroom
- engineering and operational cost if you are doing TCO
- energy if you run your own data center
The denominator should include:
- tokens delivered inside latency SLO
- accepted output tokens
- successful retries, counted only once
- no tokens generated after cancellation
- only tokens that pass quality and safety filters
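Putting the two lists together, a first-pass sketch might look like the following. The headroom treatment, dividing by one minus the idle fraction, is just one modeling choice, and every number here is illustrative.

```python
def cost_per_useful_million_tokens(
    gpu_hourly_usd: float,
    other_infra_hourly_usd: float,  # CPU, memory, storage, networking, orchestration
    slo_headroom_fraction: float,   # idle capacity held back to protect latency
    useful_tokens_per_hour: float,  # output of the accounting sketch above
) -> float:
    """First-pass cost per useful million tokens; extend the numerator for full TCO."""
    # One way to charge for SLO headroom: amortize it over the capacity you actually sell.
    hourly_cost = (gpu_hourly_usd + other_infra_hourly_usd) / (1.0 - slo_headroom_fraction)
    return hourly_cost / useful_tokens_per_hour * 1_000_000


# Illustrative numbers: a $40/hour node, 20% headroom, 25M useful tokens per hour.
print(cost_per_useful_million_tokens(35.0, 5.0, 0.20, 25_000_000))  # -> 2.0 USD per useful M tokens
```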
A small worked example
Imagine two deployments:
| Deployment | Raw output tokens/sec | SLO success rate | Acceptance rate | Useful tokens/sec |
|---|---|---|---|---|
| A | 10,000 | 70% | 95% | 6,650 |
| B | 8,000 | 96% | 97% | 7,450 |
Deployment A wins the benchmark screenshot. Deployment B wins the business. It delivers fewer raw tokens, but more useful tokens because fewer responses miss latency or quality gates.
That is why cost per useful token is a better steering metric. It punishes systems that look fast while creating waste somewhere else.
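The table is nothing more than three multiplications, so you can sanity-check it directly:

```python
# Useful tokens/sec = raw tokens/sec * SLO success rate * acceptance rate
deployments = {"A": (10_000, 0.70, 0.95), "B": (8_000, 0.96, 0.97)}
for name, (raw_tps, slo_rate, accept_rate) in deployments.items():
    print(name, round(raw_tps * slo_rate * accept_rate))  # A -> 6650, B -> 7450
```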
Common traps
Benchmarking only offline throughput. Useful for capacity planning, incomplete for interactive products.
Ignoring input tokens. Retrieval and agent systems often spend most of their money reading context before generating anything.
Counting canceled tokens. If the client disconnects and the backend keeps generating, your accounting should shame the system a little.
Treating all models as interchangeable. A cheaper model that causes retries, escalations, or wrong answers may be expensive.
Forgetting cache hit rate. Prefix caching can turn a painful repeated prompt into a cheap incremental request.
Assuming utilization equals efficiency. A full queue can mean efficiency. It can also mean users are waiting in a hallway.
A dashboard I would trust
At minimum (a metric-naming sketch follows the list):
- input tokens/sec and output tokens/sec
- useful tokens/sec
- TTFT p50/p95/p99
- TPOT p50/p95/p99
- cache hit rate
- tokens lost to retries
- tokens lost after cancellation
- GPU memory pressure
- per-model cost per useful million tokens
- per-tenant budget burn
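If your stack speaks Prometheus, one possible shape for those columns is sketched below. The metric names and label sets are assumptions, not a standard.

```python
from prometheus_client import Counter, Gauge, Histogram

# Metric and label names are illustrative; align them with your own telemetry conventions.
INPUT_TOKENS       = Counter("llm_input_tokens_total", "Prompt and context tokens read", ["model", "tenant"])
OUTPUT_TOKENS      = Counter("llm_output_tokens_total", "Tokens generated", ["model", "tenant"])
USEFUL_TOKENS      = Counter("llm_useful_tokens_total", "Tokens delivered inside SLO and accepted", ["model", "tenant"])
RETRY_WASTE        = Counter("llm_retry_wasted_tokens_total", "Tokens spent on failed retries", ["model"])
CANCEL_WASTE       = Counter("llm_canceled_tokens_total", "Tokens generated after cancellation", ["model"])
TTFT_SECONDS       = Histogram("llm_ttft_seconds", "Time to first token", ["model"])
TPOT_SECONDS       = Histogram("llm_tpot_seconds", "Time per output token", ["model"])
CACHE_HIT_RATE     = Gauge("llm_prefix_cache_hit_rate", "Prefix cache hit rate", ["model"])
GPU_MEM_PRESSURE   = Gauge("llm_gpu_memory_pressure", "Fraction of GPU memory in use", ["gpu"])
COST_PER_USEFUL_M  = Gauge("llm_cost_per_useful_million_tokens_usd", "Cost per useful million tokens", ["model"])
TENANT_BUDGET_BURN = Gauge("llm_tenant_budget_burn_fraction", "Share of monthly budget consumed", ["tenant"])
```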
Then add a weekly review: which optimization actually moved cost per useful token? Quantization, batching, routing, cache retention, model choice, hardware, or prompt shape?
That review is where engineering stops being benchmark tourism and becomes infrastructure strategy.
Closing
Tokens per second still matters. Please do not throw it away. Just stop worshipping it alone.
The grown-up metric is cost per useful token under latency and quality constraints. NVIDIA’s platform story is compelling because it attacks that metric from many sides: hardware, memory bandwidth, precision, kernels, deployment packaging, and distributed serving. But the metric only becomes real when you measure it on your workload.
The spreadsheet is allowed back in the meeting now. It just has to bring better columns.
Sources and receipts
- MLCommons on LLM throughput and latency metrics: Llama 2 70B benchmark note.
- NVIDIA H200 and TensorRT-LLM MLPerf results: NVIDIA blog.
- NVIDIA Blackwell MLPerf Inference v5.0: technical blog.
- NVIDIA inference economics framing: How the Economics of Inference Can Maximize AI Value and Rethinking AI TCO.
- NIM deployment and engine support: NVIDIA NIM overview.