Tokenomics for Engineers: Measuring Throughput per Dollar Instead of Tokens per Second
Tokens per second is a fun number. It looks good in screenshots. It makes benchmark charts feel like race cars. But if you are operating production inference, tokens per second is only the opening line.
The real question is:
How many useful tokens can I deliver per dollar while meeting latency and quality targets?

That sentence is less glamorous, but it buys the GPUs.
Welcome to tokenomics for engineers.
Start with useful tokens
Not all tokens are equal. A streamed answer token that reaches a user inside the latency SLO is useful. A token generated after the client disconnected is confetti. A token from a failed retry may be necessary, but it is not free. Draft tokens in speculative decoding may help, but you should count them differently from accepted output tokens.
The first rule: define the unit.
For production systems, I like:
useful output tokens
+ required input tokens
- wasted retry tokens
- canceled tokens
- low-quality fallback tokens

You do not need a perfect formula on day one. You need a formula honest enough to stop rewarding the wrong behavior.
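If you want that accounting in code, a minimal sketch might look like this, assuming your request logs already record per-request token counts plus flags for retries, cancellations, and fallbacks. Every field and function name here is illustrative, not a real logging schema.

```python
from dataclasses import dataclass


@dataclass
class RequestTokens:
    # Illustrative fields: map them to whatever your serving logs actually record.
    output_tokens: int                  # tokens generated for this request
    input_tokens: int                   # prompt and context tokens the request had to read
    was_retry: bool = False             # this request retried an earlier failure
    retry_succeeded: bool = False       # the retry produced an accepted answer
    canceled_tokens: int = 0            # tokens generated after the client disconnected
    low_quality_fallback: bool = False  # served by a fallback that missed the quality bar


def useful_tokens(requests: list[RequestTokens]) -> int:
    """First-pass 'useful token' count: honest enough to stop rewarding waste."""
    total = 0
    for r in requests:
        if r.low_quality_fallback:
            continue                    # low-quality fallback tokens do not count
        if r.was_retry and not r.retry_succeeded:
            continue                    # wasted retry tokens do not count
        total += r.output_tokens + r.input_tokens
        total -= r.canceled_tokens      # tokens after cancellation are confetti
    return total
```

The exact fields matter less than the refusal to reward retries, cancellations, and low-quality fallbacks.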
Measure the latency envelope
MLCommons uses tokens per second for LLM throughput because requests vary widely in input and output length. It also calls out latency metrics like time to first token (TTFT) and time per output token (TPOT), because raw throughput without interactivity is not enough.
That distinction matters. A batch system may maximize total tokens per second. A chat system must protect time to first token and time between tokens. A coding agent may tolerate a slower first token if the overall task completes reliably. A voice assistant has almost no patience at all.
Your cost metric should include the SLO:
tokens per dollar at TTFT <= target and TPOT <= target

Otherwise you will accidentally optimize for a system that is cheap because users leave.
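As a sketch, that SLO-gated metric is only a few lines, assuming you already measure per-request TTFT and TPOT and know your all-in hourly cost. The targets and the tuple shape below are placeholders, not recommendations.

```python
def slo_gated_tokens_per_dollar(requests, hourly_cost_usd, ttft_target_s=0.5, tpot_target_s=0.05):
    """Tokens per dollar, counting only requests that met both latency targets.

    `requests` is an iterable of (output_tokens, ttft_seconds, tpot_seconds)
    observed over one hour; the default targets are placeholders.
    """
    tokens_in_slo = sum(
        out_toks
        for out_toks, ttft, tpot in requests
        if ttft <= ttft_target_s and tpot <= tpot_target_s
    )
    return tokens_in_slo / hourly_cost_usd
```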
Utilization is not the same as value
GPU utilization is useful, but it is not the business metric. A GPU can be busy generating low-priority retries while interactive users wait. It can be busy recomputing prefixes that should have been cached. It can be busy serving a model variant that misses quality targets and triggers downstream rework.
The better ladder is:
- Is the GPU busy?
- Is it busy on the right work?
- Is the work meeting latency SLOs?
- Is the output accepted by the product?
- Is the cost per accepted outcome improving?
Step one is infrastructure. Step five is economics.
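If it helps to make the ladder concrete, here is one illustrative way to phrase each rung as a dashboard check. The metric names and thresholds are placeholders you would replace with your own.

```python
# Placeholder thresholds: each rung of the ladder becomes a dashboard ratio.
LADDER = [
    ("gpu_busy",            lambda m: m["gpu_utilization"] > 0.70),
    ("busy_on_right_work",  lambda m: m["interactive_token_share"] > 0.80),
    ("meeting_latency_slo", lambda m: m["slo_success_rate"] > 0.95),
    ("output_accepted",     lambda m: m["acceptance_rate"] > 0.90),
    ("unit_cost_improving", lambda m: m["cost_per_accepted_outcome_trend"] < 0),
]


def first_failing_rung(metrics: dict) -> str:
    """Return the lowest rung that fails, or 'economics' if all five hold."""
    for name, check in LADDER:
        if not check(metrics):
            return name
    return "economics"
```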
Where NVIDIA’s story is strong
NVIDIA talks about cost per token a lot, and for once I think the framing is right. The reason is that inference economics are not only a chip benchmark. They are a full-stack optimization problem.
On the hardware side, H200 increased memory capacity and bandwidth over H100, which matters for large models and cache-heavy serving. Blackwell adds new precision options such as NVFP4 and a much larger scale-up fabric in GB200/GB300 NVL72-style systems. On the software side, TensorRT-LLM, vLLM, SGLang, NIM, and Dynamo all affect the realized cost per token. MLPerf results have shown continued gains from both hardware and software improvements, and NVIDIA’s own submissions are strongest when you look at the platform rather than a single component.
The important caveat: never copy a vendor benchmark into your financial model without translating it into your workload. Sequence length, batch shape, output length, model architecture, quality target, and interactivity change everything.
This is not cynicism. This is engineering hygiene.
The actual formula
For a first-pass model:
cost per useful token =
all-in hourly cost of serving stack
/ useful tokens delivered per hour

A short sketch that puts the pieces together follows the two lists below. The numerator should include:
- GPU or instance cost
- CPU, memory, storage, and networking
- orchestration overhead
- idle capacity kept for SLO headroom
- engineering and operational cost if you are doing TCO
- energy if you run your own data center
The denominator should include:
- tokens delivered inside latency SLO
- accepted output tokens
- successful retries, counted only once
- no tokens generated after cancellation
- only tokens that pass quality and safety filters
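Putting the two lists together, a first-pass sketch might look like the following. The headroom treatment, dividing by one minus the idle fraction, is just one modeling choice, and every number here is illustrative.

```python
def cost_per_useful_million_tokens(
    gpu_hourly_usd: float,
    other_infra_hourly_usd: float,  # CPU, memory, storage, networking, orchestration
    slo_headroom_fraction: float,   # idle capacity held back to protect latency
    useful_tokens_per_hour: float,  # output of the accounting sketch above
) -> float:
    """First-pass cost per useful million tokens; extend the numerator for full TCO."""
    # One way to charge for SLO headroom: amortize it over the capacity you actually sell.
    hourly_cost = (gpu_hourly_usd + other_infra_hourly_usd) / (1.0 - slo_headroom_fraction)
    return hourly_cost / useful_tokens_per_hour * 1_000_000


# Illustrative numbers: a $40/hour node, 20% headroom, 25M useful tokens per hour.
print(cost_per_useful_million_tokens(35.0, 5.0, 0.20, 25_000_000))  # -> 2.0 USD per useful M tokens
```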
A small worked example
Imagine two deployments:
| Deployment | Raw output tokens/sec | SLO success rate | Acceptance rate | Useful tokens/sec |
|---|---|---|---|---|
| A | 10,000 | 70% | 95% | 6,650 |
| B | 8,000 | 96% | 97% | 7,450 |
Deployment A wins the benchmark screenshot. Deployment B wins the business. It delivers fewer raw tokens, but more useful tokens because fewer responses miss latency or quality gates.
That is why cost per useful token is a better steering metric. It punishes systems that look fast while creating waste somewhere else.
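The table is nothing more than three multiplications, so you can sanity-check it directly:

```python
# Useful tokens/sec = raw tokens/sec * SLO success rate * acceptance rate
deployments = {"A": (10_000, 0.70, 0.95), "B": (8_000, 0.96, 0.97)}
for name, (raw_tps, slo_rate, accept_rate) in deployments.items():
    print(name, round(raw_tps * slo_rate * accept_rate))  # A -> 6650, B -> 7450
```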
Common traps
Benchmarking only offline throughput. Useful for capacity planning, incomplete for interactive products.
Ignoring input tokens. Retrieval and agent systems often spend most of their money reading context before generating anything.
Counting canceled tokens. If the client disconnects and the backend keeps generating, your accounting should shame the system a little.
Treating all models as interchangeable. A cheaper model that causes retries, escalations, or wrong answers may be expensive.
Forgetting cache hit rate. Prefix caching can turn a painful repeated prompt into a cheap incremental request.
Assuming utilization equals efficiency. A full queue can mean efficiency. It can also mean users are waiting in a hallway.
A dashboard I would trust
At minimum (a metric-naming sketch follows the list):
- input tokens/sec and output tokens/sec
- useful tokens/sec
- TTFT p50/p95/p99
- TPOT p50/p95/p99
- cache hit rate
- tokens lost to retries
- tokens lost after cancellation
- GPU memory pressure
- per-model cost per useful million tokens
- per-tenant budget burn
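If your stack speaks Prometheus, one possible shape for those columns is sketched below. The metric names and label sets are assumptions, not a standard.

```python
from prometheus_client import Counter, Gauge, Histogram

# Metric and label names are illustrative; align them with your own telemetry conventions.
INPUT_TOKENS       = Counter("llm_input_tokens_total", "Prompt and context tokens read", ["model", "tenant"])
OUTPUT_TOKENS      = Counter("llm_output_tokens_total", "Tokens generated", ["model", "tenant"])
USEFUL_TOKENS      = Counter("llm_useful_tokens_total", "Tokens delivered inside SLO and accepted", ["model", "tenant"])
RETRY_WASTE        = Counter("llm_retry_wasted_tokens_total", "Tokens spent on failed retries", ["model"])
CANCEL_WASTE       = Counter("llm_canceled_tokens_total", "Tokens generated after cancellation", ["model"])
TTFT_SECONDS       = Histogram("llm_ttft_seconds", "Time to first token", ["model"])
TPOT_SECONDS       = Histogram("llm_tpot_seconds", "Time per output token", ["model"])
CACHE_HIT_RATE     = Gauge("llm_prefix_cache_hit_rate", "Prefix cache hit rate", ["model"])
GPU_MEM_PRESSURE   = Gauge("llm_gpu_memory_pressure", "Fraction of GPU memory in use", ["gpu"])
COST_PER_USEFUL_M  = Gauge("llm_cost_per_useful_million_tokens_usd", "Cost per useful million tokens", ["model"])
TENANT_BUDGET_BURN = Gauge("llm_tenant_budget_burn_fraction", "Share of monthly budget consumed", ["tenant"])
```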
Then add a weekly review: which optimization actually moved cost per useful token? Quantization, batching, routing, cache retention, model choice, hardware, or prompt shape?
That review is where engineering stops being benchmark tourism and becomes infrastructure strategy.
Closing
Tokens per second still matters. Please do not throw it away. Just stop worshipping it alone.
The grown-up metric is cost per useful token under latency and quality constraints. NVIDIA’s platform story is compelling because it attacks that metric from many sides: hardware, memory bandwidth, precision, kernels, deployment packaging, and distributed serving. But the metric only becomes real when you measure it on your workload.
The spreadsheet is allowed back in the meeting now. It just has to bring better columns.
Sources and receipts
- MLCommons on LLM throughput and latency metrics: Llama 2 70B benchmark note.
- NVIDIA H200 and TensorRT-LLM MLPerf results: NVIDIA blog.
- NVIDIA Blackwell MLPerf Inference v5.0: technical blog.
- NVIDIA inference economics framing: How the Economics of Inference Can Maximize AI Value and Rethinking AI TCO.
- NIM deployment and engine support: NVIDIA NIM overview.