Your Token Bill Has a Leak: Cost Monitoring for Hidden LLM Waste
LLM cost leaks rarely announce themselves.
They hide in places that feel harmless:
- giant tool schemas copied into every request
- RAG chunks nobody reads
- retries that quietly double what you pay to serve a single request
- background agents looping politely forever
- streamed tokens generated after the client disconnected
- hidden reasoning tokens
- verbose system prompts that grew one incident at a time
- “temporary” debug context that became permanent
The bill arrives later. It is always on time.
Start with token accounting, not vibes
Every request should produce a usage record:
```json
{
"tenant": "acme",
"feature": "support-chat",
"model": "gpt-4.1",
"request_id": "req_123",
"input_tokens": 8400,
"cached_input_tokens": 6200,
"output_tokens": 620,
"reasoning_tokens": 180,
"tool_calls": 4,
"retries": 1,
"cache_status": "semantic_miss_prefix_hit",
"estimated_cost_usd": 0.48,
"user_visible": true
}
```

OpenAI’s APIs expose token usage details such as cached input tokens and reasoning tokens for supported models and endpoints. OpenTelemetry’s GenAI semantic conventions define common attributes for model, operation, token usage, and request metadata. Use those ideas even if your serving stack is custom.
The goal is simple: every generated token should have an owner.
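A minimal sketch of how a serving layer might build that record, assuming OpenAI-style `usage` fields; the field names and the prices here are assumptions, not a spec or real rates:

```python
import time

# Placeholder prices per 1M tokens; real prices vary by model and change over time.
PRICES = {"gpt-4.1": {"input": 2.00, "cached_input": 0.50, "output": 8.00}}

def build_usage_record(tenant, feature, model, request_id, usage,
                       tool_calls, retries, cache_status, user_visible):
    """Turn a provider usage payload into one accountable record.

    `usage` is assumed to follow OpenAI-style fields (prompt_tokens,
    completion_tokens, prompt_tokens_details.cached_tokens,
    completion_tokens_details.reasoning_tokens); names vary by endpoint
    and provider, so adapt the lookups to your stack.
    """
    input_tokens = usage.get("prompt_tokens", 0)
    cached = (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0)
    output_tokens = usage.get("completion_tokens", 0)
    reasoning = (usage.get("completion_tokens_details") or {}).get("reasoning_tokens", 0)

    price = PRICES.get(model, {"input": 0.0, "cached_input": 0.0, "output": 0.0})
    cost = ((input_tokens - cached) * price["input"]
            + cached * price["cached_input"]
            + output_tokens * price["output"]) / 1_000_000

    return {
        "tenant": tenant, "feature": feature, "model": model,
        "request_id": request_id, "input_tokens": input_tokens,
        "cached_input_tokens": cached, "output_tokens": output_tokens,
        "reasoning_tokens": reasoning, "tool_calls": tool_calls,
        "retries": retries, "cache_status": cache_status,
        "estimated_cost_usd": round(cost, 4), "user_visible": user_visible,
        "ts": time.time(),
    }
```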
Hidden leak 1: schema bloat
Tool schemas are useful. They are also repetitive. A large agent with dozens of tools can send thousands of schema tokens per request.
Mitigations:
- send only tools relevant to the current state
- group tools by workflow
- shorten descriptions after evals prove safety
- use prompt caching for stable tool schemas
- measure schema tokens separately
If the tool list grows every sprint, your token bill is doing product management without permission.
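One cheap fix is per-state tool selection plus a separate schema-token counter. A sketch; the state map and tool names are invented, and tiktoken is only one way to approximate the count:

```python
import json
import tiktoken  # used here only to approximate schema size

ENC = tiktoken.get_encoding("cl100k_base")

# Hypothetical mapping: which tools are legal in which workflow state.
TOOLS_BY_STATE = {
    "triage": ["lookup_customer", "search_kb"],
    "refund": ["lookup_order", "issue_refund"],
}

def select_tools(state: str, all_tools: dict[str, dict]) -> list[dict]:
    """Send only the tool schemas relevant to the current agent state."""
    return [all_tools[name] for name in TOOLS_BY_STATE.get(state, []) if name in all_tools]

def schema_tokens(tools: list[dict]) -> int:
    """Approximate tokens the tool schemas add to every request; log this separately."""
    return len(ENC.encode(json.dumps(tools)))
```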
Hidden leak 2: RAG over-retrieval
RAG systems love stuffing context. The model may use three paragraphs while the prompt carries twenty.
Track:
- retrieved tokens
- cited tokens
- answer-supported chunks
- context precision
- no-answer cases
- per-source token cost
If a chunk never supports an answer, it should not keep getting invited to dinner.
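A sketch of a per-request retrieval-efficiency record. It assumes you can already tell which chunks the answer cited (citation markers, an eval judge, whatever you trust); `count_tokens` stands in for your tokenizer:

```python
def retrieval_waste(chunks: list[dict], cited_ids: set[str], count_tokens) -> dict:
    """Compare what was retrieved against what the answer actually used.

    `chunks` items are assumed to look like {"id": ..., "source": ..., "text": ...}.
    """
    retrieved = sum(count_tokens(c["text"]) for c in chunks)
    cited = sum(count_tokens(c["text"]) for c in chunks if c["id"] in cited_ids)

    wasted_by_source: dict[str, int] = {}
    for c in chunks:
        if c["id"] not in cited_ids:
            wasted_by_source[c["source"]] = (
                wasted_by_source.get(c["source"], 0) + count_tokens(c["text"])
            )

    return {
        "retrieved_tokens": retrieved,
        "cited_tokens": cited,
        "context_precision": cited / retrieved if retrieved else 0.0,
        "wasted_tokens_by_source": wasted_by_source,  # candidates to stop retrieving
    }
```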
Hidden leak 3: retries and fallbacks
Retries are necessary. Blind retries are expensive.
Record:
- original request cost
- retry count
- retry reason
- fallback model
- tokens wasted before failure
- user-visible success
A retry after a malformed JSON output is different from a retry after a timeout. One may need constrained decoding. The other may need queue control.
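A sketch of a retry wrapper that accounts for waste instead of silently re-spending. The `partial_output_tokens` attribute and the `record` sink are assumptions:

```python
import time

def call_with_retry(call, max_attempts: int = 3, record=print):
    """Call the model, retry on failure, and account for every wasted attempt.

    `call()` is assumed to return (result, usage_record). Exception types stand
    in for your real failure modes: malformed output, timeout, overload.
    """
    wasted_tokens = 0
    for attempt in range(1, max_attempts + 1):
        started = time.time()
        try:
            result, usage = call()
            record({"retries": attempt - 1, "wasted_tokens": wasted_tokens,
                    "usage": usage, "user_visible_success": True})
            return result
        except Exception as exc:
            # Hypothetical: a failed attempt may still carry partially generated tokens.
            wasted_tokens += getattr(exc, "partial_output_tokens", 0)
            record({"attempt": attempt, "retry_reason": type(exc).__name__,
                    "latency_s": round(time.time() - started, 2)})
    record({"retries": max_attempts, "wasted_tokens": wasted_tokens,
            "user_visible_success": False})
    raise RuntimeError("all attempts failed")
```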
Hidden leak 4: agents that do not stop
Agents can burn tokens in tool loops:
```
search -> summarize -> search again -> summarize again -> rethink -> search
```

The loop may not be technically infinite. It can simply be expensive enough to feel infinite to finance.
Set budgets:
- max steps
- max tool calls
- max tokens per turn
- max wall-clock time
- max repeated tool/action pairs
- no-progress detector
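A sketch of a per-run budget with a crude no-progress detector; every limit here is a placeholder:

```python
import time

class AgentBudget:
    """Stop an agent loop when it exceeds hard limits or stops making progress."""

    def __init__(self, max_steps=12, max_tool_calls=20, max_tokens=50_000,
                 max_seconds=120, max_repeats=2):
        self.max_steps, self.max_tool_calls = max_steps, max_tool_calls
        self.max_tokens, self.max_seconds = max_tokens, max_seconds
        self.max_repeats = max_repeats
        self.started = time.time()
        self.steps = self.tool_calls = self.tokens = 0
        self.seen: dict[tuple, int] = {}  # (tool, args) -> repeat count

    def allow(self, tool: str, args_key: str, step_tokens: int) -> bool:
        self.steps += 1
        self.tool_calls += 1
        self.tokens += step_tokens
        self.seen[(tool, args_key)] = self.seen.get((tool, args_key), 0) + 1
        return not (
            self.steps > self.max_steps
            or self.tool_calls > self.max_tool_calls
            or self.tokens > self.max_tokens
            or time.time() - self.started > self.max_seconds
            or self.seen[(tool, args_key)] > self.max_repeats  # same call again: no progress
        )
```

The agent checks `budget.allow(...)` before each tool call and returns a partial answer once it says no.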
Hidden leak 5: cancellation that does not propagate
If the user closes the tab, the backend should stop generating. If the stream disconnects and the GPU keeps decoding for 30 seconds, those tokens are pure waste.
Every streaming system needs:
- client disconnect detection
- cancellation propagation to model backend
- token accounting after cancel
- aborted stream metrics
- cleanup for held KV cache and queue slots
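A sketch of disconnect-aware streaming with asyncio. It assumes the backend exposes an async token generator and that `send_to_client` raises on disconnect; both are assumptions about your stack:

```python
import asyncio

async def stream_to_client(token_stream, send_to_client, record=print):
    """Forward tokens to the client; stop decoding the moment the client is gone."""
    delivered = 0
    try:
        async for token in token_stream:
            try:
                await send_to_client(token)  # assumed to raise ConnectionError on disconnect
            except ConnectionError:
                # Reuse CancelledError so cancellation propagates like a normal task cancel.
                raise asyncio.CancelledError from None
            delivered += 1
        record({"aborted_stream": False, "delivered_tokens": delivered})
    except asyncio.CancelledError:
        await token_stream.aclose()  # release the generator and, upstream, KV cache / queue slot
        record({"aborted_stream": True, "delivered_tokens": delivered})
        raise
```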
Cost dashboard that actually helps
At minimum:
| Panel | Why it matters |
|---|---|
| Cost by tenant / feature / model | Finds owners |
| Input, cached input, output, reasoning tokens | Separates cost classes |
| RAG context tokens vs cited tokens | Finds retrieval waste |
| Tool schema tokens | Finds agent bloat |
| Retry cost | Finds reliability tax |
| Canceled-token waste | Finds stream cleanup bugs |
| Cost per successful task | Avoids optimizing raw tokens |
| Budget burn rate | Catches runaway jobs early |
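Most of these panels are plain aggregations over the usage records. A sketch with pandas, assuming one row per request; using `user_visible` as a success proxy is an assumption you should replace with real task outcomes:

```python
import pandas as pd

def dashboard_rollups(df: pd.DataFrame) -> dict:
    """df has one row per request with the usage-record fields shown earlier."""
    cost_by_owner = (df.groupby(["tenant", "feature", "model"])["estimated_cost_usd"]
                       .sum().sort_values(ascending=False))
    token_mix = df[["input_tokens", "cached_input_tokens",
                    "output_tokens", "reasoning_tokens"]].sum()
    retry_tax = df.loc[df["retries"] > 0, "estimated_cost_usd"].sum()
    cost_per_success = (df["estimated_cost_usd"].sum()
                        / max(df["user_visible"].sum(), 1))  # proxy, not a true success flag
    return {"cost_by_owner": cost_by_owner, "token_mix": token_mix,
            "retry_tax": retry_tax, "cost_per_success": cost_per_success}
```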
Budget controls
Dashboards are not enough. Add controls:
- per-tenant daily and monthly budgets
- per-feature budgets
- model allowlists by use case
- max input tokens by endpoint
- max output tokens by task
- tool-call budgets
- agent step budgets
- cost-aware fallback
- alert on spend velocity, not just absolute spend (a sketch follows this list)
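A sketch of a spend-velocity check over the same usage records; the 3x threshold and the hourly baseline are placeholders:

```python
def spend_velocity_alerts(hourly_cost: dict[str, float],
                          baseline: dict[str, float],
                          threshold: float = 3.0) -> list[str]:
    """Alert when a feature's spend this hour runs far ahead of its usual rate.

    `hourly_cost` and `baseline` map feature -> USD for the current hour and the
    trailing per-hour average; both are assumed to come from the usage records.
    """
    alerts = []
    for feature, cost in hourly_cost.items():
        usual = baseline.get(feature, 0.0)
        if usual > 0 and cost / usual >= threshold:
            alerts.append(f"{feature} spend is {cost / usual:.1f}x normal for this hour")
    return alerts
```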
The useful alert is:
```
support-chat spend is 3.2x normal for this hour
because retry tokens increased after deploy abc123
```

Not:
```
bill high
```

The metric to optimize
Do not optimize for lowest token count. Optimize for:
```
cost per successful task inside SLO
```

Sometimes a longer prompt prevents a retry. Sometimes a bigger model avoids escalation. Sometimes a semantic cache hit is safe and saves everything. Sometimes cutting context makes the answer cheap and wrong.
Cost monitoring should help engineering make those tradeoffs consciously.
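A sketch of the rollup, assuming each usage record can be joined to hypothetical `success` and `within_slo` flags from task-level evaluation:

```python
def cost_per_successful_task(records: list[dict]) -> float:
    """Total spend divided by tasks that both succeeded and met the SLO.

    Each record is assumed to carry `estimated_cost_usd`, plus hypothetical
    `success` and `within_slo` booleans joined in from task-level evaluation.
    """
    total_cost = sum(r["estimated_cost_usd"] for r in records)
    good = sum(1 for r in records if r.get("success") and r.get("within_slo"))
    return total_cost / good if good else float("inf")
```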
Sources worth reading
- OpenAI usage and token details for request usage fields such as cached and reasoning tokens.
- OpenAI Usage API cookbook for organization-level usage analysis.
- OpenTelemetry GenAI semantic conventions for tracing generative AI requests.
- LiteLLM spend tracking for practical multi-provider cost controls.
