Your Token Bill Has a Leak: Cost Monitoring for Hidden LLM Waste
LLM cost leaks rarely announce themselves.
They hide in places that feel harmless:
- giant tool schemas copied into every request
- RAG chunks nobody reads
- retries that quietly double what you pay to serve a single request
- background agents looping politely forever
- streamed tokens generated after the client disconnected
- hidden reasoning tokens
- verbose system prompts that grew one incident at a time
- “temporary” debug context that became permanent
The bill arrives later. It is always on time.
Start with token accounting, not vibes
Every request should produce a usage record:
```json
{
"tenant": "acme",
"feature": "support-chat",
"model": "gpt-4.1",
"request_id": "req_123",
"input_tokens": 8400,
"cached_input_tokens": 6200,
"output_tokens": 620,
"reasoning_tokens": 180,
"tool_calls": 4,
"retries": 1,
"cache_status": "semantic_miss_prefix_hit",
"estimated_cost_usd": 0.48,
"user_visible": true
}
```

OpenAI’s APIs expose token usage details such as cached input tokens and reasoning tokens for supported models and endpoints. OpenTelemetry’s GenAI semantic conventions define common attributes for model, operation, token usage, and request metadata. Use those ideas even if your serving stack is custom.
The goal is simple: every generated token should have an owner.
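A minimal sketch of how a serving layer might build that record, assuming OpenAI-style `usage` fields; the field names and the prices here are assumptions, not a spec or real rates:

```python
import time

# Placeholder prices per 1M tokens; real prices vary by model and change over time.
PRICES = {"gpt-4.1": {"input": 2.00, "cached_input": 0.50, "output": 8.00}}

def build_usage_record(tenant, feature, model, request_id, usage,
                       tool_calls, retries, cache_status, user_visible):
    """Turn a provider usage payload into one accountable record.

    `usage` is assumed to follow OpenAI-style fields (prompt_tokens,
    completion_tokens, prompt_tokens_details.cached_tokens,
    completion_tokens_details.reasoning_tokens); names vary by endpoint
    and provider, so adapt the lookups to your stack.
    """
    input_tokens = usage.get("prompt_tokens", 0)
    cached = (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0)
    output_tokens = usage.get("completion_tokens", 0)
    reasoning = (usage.get("completion_tokens_details") or {}).get("reasoning_tokens", 0)

    price = PRICES.get(model, {"input": 0.0, "cached_input": 0.0, "output": 0.0})
    cost = ((input_tokens - cached) * price["input"]
            + cached * price["cached_input"]
            + output_tokens * price["output"]) / 1_000_000

    return {
        "tenant": tenant, "feature": feature, "model": model,
        "request_id": request_id, "input_tokens": input_tokens,
        "cached_input_tokens": cached, "output_tokens": output_tokens,
        "reasoning_tokens": reasoning, "tool_calls": tool_calls,
        "retries": retries, "cache_status": cache_status,
        "estimated_cost_usd": round(cost, 4), "user_visible": user_visible,
        "ts": time.time(),
    }
```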
Hidden leak 1: schema bloat
Tool schemas are useful. They are also repetitive. A large agent with dozens of tools can send thousands of schema tokens per request.
Mitigations:
- send only tools relevant to the current state
- group tools by workflow
- shorten descriptions after evals prove safety
- use prompt caching for stable tool schemas
- measure schema tokens separately
If the tool list grows every sprint, your token bill is doing product management without permission.
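One cheap fix is per-state tool selection plus a separate schema-token counter. A sketch; the state map and tool names are invented, and tiktoken is only one way to approximate the count:

```python
import json
import tiktoken  # used here only to approximate schema size

ENC = tiktoken.get_encoding("cl100k_base")

# Hypothetical mapping: which tools are legal in which workflow state.
TOOLS_BY_STATE = {
    "triage": ["lookup_customer", "search_kb"],
    "refund": ["lookup_order", "issue_refund"],
}

def select_tools(state: str, all_tools: dict[str, dict]) -> list[dict]:
    """Send only the tool schemas relevant to the current agent state."""
    return [all_tools[name] for name in TOOLS_BY_STATE.get(state, []) if name in all_tools]

def schema_tokens(tools: list[dict]) -> int:
    """Approximate tokens the tool schemas add to every request; log this separately."""
    return len(ENC.encode(json.dumps(tools)))
```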
Hidden leak 2: RAG over-retrieval
RAG systems love stuffing context. The model may use three paragraphs while the prompt carries twenty.
Track:
- retrieved tokens
- cited tokens
- answer-supported chunks
- context precision
- no-answer cases
- per-source token cost
If a chunk never supports an answer, it should not keep getting invited to dinner.
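A sketch of a per-request retrieval-efficiency record. It assumes you can already tell which chunks the answer cited (citation markers, an eval judge, whatever you trust); `count_tokens` stands in for your tokenizer:

```python
def retrieval_waste(chunks: list[dict], cited_ids: set[str], count_tokens) -> dict:
    """Compare what was retrieved against what the answer actually used.

    `chunks` items are assumed to look like {"id": ..., "source": ..., "text": ...}.
    """
    retrieved = sum(count_tokens(c["text"]) for c in chunks)
    cited = sum(count_tokens(c["text"]) for c in chunks if c["id"] in cited_ids)

    wasted_by_source: dict[str, int] = {}
    for c in chunks:
        if c["id"] not in cited_ids:
            wasted_by_source[c["source"]] = (
                wasted_by_source.get(c["source"], 0) + count_tokens(c["text"])
            )

    return {
        "retrieved_tokens": retrieved,
        "cited_tokens": cited,
        "context_precision": cited / retrieved if retrieved else 0.0,
        "wasted_tokens_by_source": wasted_by_source,  # candidates to stop retrieving
    }
```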
Hidden leak 3: retries and fallbacks
Retries are necessary. Blind retries are expensive.
Record:
- original request cost
- retry count
- retry reason
- fallback model
- tokens wasted before failure
- user-visible success
A retry after a malformed JSON output is different from a retry after a timeout. One may need constrained decoding. The other may need queue control.
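A sketch of a retry wrapper that accounts for waste instead of silently re-spending. The `partial_output_tokens` attribute and the `record` sink are assumptions:

```python
import time

def call_with_retry(call, max_attempts: int = 3, record=print):
    """Call the model, retry on failure, and account for every wasted attempt.

    `call()` is assumed to return (result, usage_record). Exception types stand
    in for your real failure modes: malformed output, timeout, overload.
    """
    wasted_tokens = 0
    for attempt in range(1, max_attempts + 1):
        started = time.time()
        try:
            result, usage = call()
            record({"retries": attempt - 1, "wasted_tokens": wasted_tokens,
                    "usage": usage, "user_visible_success": True})
            return result
        except Exception as exc:
            # Hypothetical: a failed attempt may still carry partially generated tokens.
            wasted_tokens += getattr(exc, "partial_output_tokens", 0)
            record({"attempt": attempt, "retry_reason": type(exc).__name__,
                    "latency_s": round(time.time() - started, 2)})
    record({"retries": max_attempts, "wasted_tokens": wasted_tokens,
            "user_visible_success": False})
    raise RuntimeError("all attempts failed")
```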
Hidden leak 4: agents that do not stop
Agents can burn tokens in tool loops:
```
search -> summarize -> search again -> summarize again -> rethink -> search
```

The loop may not be technically infinite. It can simply be expensive enough to feel infinite to finance.
Set budgets:
- max steps
- max tool calls
- max tokens per turn
- max wall-clock time
- max repeated tool/action pairs
- no-progress detector
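A sketch of a per-run budget with a crude no-progress detector; every limit here is a placeholder:

```python
import time

class AgentBudget:
    """Stop an agent loop when it exceeds hard limits or stops making progress."""

    def __init__(self, max_steps=12, max_tool_calls=20, max_tokens=50_000,
                 max_seconds=120, max_repeats=2):
        self.max_steps, self.max_tool_calls = max_steps, max_tool_calls
        self.max_tokens, self.max_seconds = max_tokens, max_seconds
        self.max_repeats = max_repeats
        self.started = time.time()
        self.steps = self.tool_calls = self.tokens = 0
        self.seen: dict[tuple, int] = {}  # (tool, args) -> repeat count

    def allow(self, tool: str, args_key: str, step_tokens: int) -> bool:
        self.steps += 1
        self.tool_calls += 1
        self.tokens += step_tokens
        self.seen[(tool, args_key)] = self.seen.get((tool, args_key), 0) + 1
        return not (
            self.steps > self.max_steps
            or self.tool_calls > self.max_tool_calls
            or self.tokens > self.max_tokens
            or time.time() - self.started > self.max_seconds
            or self.seen[(tool, args_key)] > self.max_repeats  # same call again: no progress
        )
```

The agent checks `budget.allow(...)` before each tool call and returns a partial answer once it says no.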
Hidden leak 5: cancellation that does not propagate
If the user closes the tab, the backend should stop generating. If the stream disconnects and the GPU keeps decoding for 30 seconds, those tokens are pure waste.
Every streaming system needs:
- client disconnect detection
- cancellation propagation to model backend
- token accounting after cancel
- aborted stream metrics
- cleanup for held KV cache and queue slots
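A sketch of disconnect-aware streaming with asyncio. It assumes the backend exposes an async token generator and that `send_to_client` raises on disconnect; both are assumptions about your stack:

```python
import asyncio

async def stream_to_client(token_stream, send_to_client, record=print):
    """Forward tokens to the client; stop decoding the moment the client is gone."""
    delivered = 0
    try:
        async for token in token_stream:
            try:
                await send_to_client(token)  # assumed to raise ConnectionError on disconnect
            except ConnectionError:
                # Reuse CancelledError so cancellation propagates like a normal task cancel.
                raise asyncio.CancelledError from None
            delivered += 1
        record({"aborted_stream": False, "delivered_tokens": delivered})
    except asyncio.CancelledError:
        await token_stream.aclose()  # release the generator and, upstream, KV cache / queue slot
        record({"aborted_stream": True, "delivered_tokens": delivered})
        raise
```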
Cost dashboard that actually helps
At minimum:
| Panel | Why it matters |
|---|---|
| Cost by tenant / feature / model | Finds owners |
| Input, cached input, output, reasoning tokens | Separates cost classes |
| RAG context tokens vs cited tokens | Finds retrieval waste |
| Tool schema tokens | Finds agent bloat |
| Retry cost | Finds reliability tax |
| Canceled-token waste | Finds stream cleanup bugs |
| Cost per successful task | Avoids optimizing raw tokens |
| Budget burn rate | Catches runaway jobs early |
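Most of these panels are plain aggregations over the usage records. A sketch with pandas, assuming one row per request; using `user_visible` as a success proxy is an assumption you should replace with real task outcomes:

```python
import pandas as pd

def dashboard_rollups(df: pd.DataFrame) -> dict:
    """df has one row per request with the usage-record fields shown earlier."""
    cost_by_owner = (df.groupby(["tenant", "feature", "model"])["estimated_cost_usd"]
                       .sum().sort_values(ascending=False))
    token_mix = df[["input_tokens", "cached_input_tokens",
                    "output_tokens", "reasoning_tokens"]].sum()
    retry_tax = df.loc[df["retries"] > 0, "estimated_cost_usd"].sum()
    cost_per_success = (df["estimated_cost_usd"].sum()
                        / max(df["user_visible"].sum(), 1))  # proxy, not a true success flag
    return {"cost_by_owner": cost_by_owner, "token_mix": token_mix,
            "retry_tax": retry_tax, "cost_per_success": cost_per_success}
```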
Budget controls
Dashboards are not enough. Add controls:
- per-tenant daily and monthly budgets
- per-feature budgets
- model allowlists by use case
- max input tokens by endpoint
- max output tokens by task
- tool-call budgets
- agent step budgets
- cost-aware fallback
- alert on spend velocity, not just absolute spend (a sketch follows this list)
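A sketch of a spend-velocity check over the same usage records; the 3x threshold and the hourly baseline are placeholders:

```python
def spend_velocity_alerts(hourly_cost: dict[str, float],
                          baseline: dict[str, float],
                          threshold: float = 3.0) -> list[str]:
    """Alert when a feature's spend this hour runs far ahead of its usual rate.

    `hourly_cost` and `baseline` map feature -> USD for the current hour and the
    trailing per-hour average; both are assumed to come from the usage records.
    """
    alerts = []
    for feature, cost in hourly_cost.items():
        usual = baseline.get(feature, 0.0)
        if usual > 0 and cost / usual >= threshold:
            alerts.append(f"{feature} spend is {cost / usual:.1f}x normal for this hour")
    return alerts
```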
The useful alert is:
```
support-chat spend is 3.2x normal for this hour
because retry tokens increased after deploy abc123
```

Not:
```
bill high
```

The metric to optimize
Do not optimize for lowest token count. Optimize for:
```
cost per successful task inside SLO
```

Sometimes a longer prompt prevents a retry. Sometimes a bigger model avoids escalation. Sometimes a semantic cache hit is safe and saves everything. Sometimes cutting context makes the answer cheap and wrong.
Cost monitoring should help engineering make those tradeoffs consciously.
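A sketch of the rollup, assuming each usage record can be joined to hypothetical `success` and `within_slo` flags from task-level evaluation:

```python
def cost_per_successful_task(records: list[dict]) -> float:
    """Total spend divided by tasks that both succeeded and met the SLO.

    Each record is assumed to carry `estimated_cost_usd`, plus hypothetical
    `success` and `within_slo` booleans joined in from task-level evaluation.
    """
    total_cost = sum(r["estimated_cost_usd"] for r in records)
    good = sum(1 for r in records if r.get("success") and r.get("within_slo"))
    return total_cost / good if good else float("inf")
```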
Sources worth reading
- OpenAI usage and token details for request usage fields such as cached and reasoning tokens.
- OpenAI Usage API cookbook for organization-level usage analysis.
- OpenTelemetry GenAI semantic conventions for tracing generative AI requests.
- LiteLLM spend tracking for practical multi-provider cost controls.
