Your Token Bill Has a Leak: Cost Monitoring for Hidden LLM Waste

LLM cost leaks rarely announce themselves.

They hide in places that feel harmless:

  • giant tool schemas copied into every request
  • RAG chunks nobody reads
  • retries that quietly double-charge the same request
  • background agents looping politely forever
  • streamed tokens generated after the client disconnected
  • hidden reasoning tokens
  • verbose system prompts that grew one incident at a time
  • “temporary” debug context that became permanent

The bill arrives later. It is always on time.

Start with token accounting, not vibes

Every request should produce a usage record:

{
  "tenant": "acme",
  "feature": "support-chat",
  "model": "gpt-4.1",
  "request_id": "req_123",
  "input_tokens": 8400,
  "cached_input_tokens": 6200,
  "output_tokens": 620,
  "reasoning_tokens": 180,
  "tool_calls": 4,
  "retries": 1,
  "cache_status": "semantic_miss_prefix_hit",
  "estimated_cost_usd": 0.48,
  "user_visible": true
}

OpenAI’s APIs expose token usage details such as cached input tokens and reasoning tokens for supported models and endpoints. OpenTelemetry’s GenAI semantic conventions define common attributes for model, operation, token usage, and request metadata. Use those ideas even if your serving stack is custom.

The goal is simple: every generated token should have an owner.
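
A minimal sketch of producing that record at the serving layer. The price table, the per-token rates, and the shape of the `usage` object are placeholders; swap in your provider's real rates and usage schema.

# Sketch: turn one model response into a usage record.
# PRICES is a placeholder table in USD per 1M tokens, not real pricing.
PRICES = {
    "gpt-4.1": {"input": 2.00, "cached_input": 0.50, "output": 8.00},
}

def usage_record(tenant, feature, model, request_id, usage,
                 retries=0, user_visible=True):
    p = PRICES[model]
    cached = usage.get("cached_input_tokens", 0)
    uncached = usage["input_tokens"] - cached
    # Assumes reasoning tokens are reported separately and billed at the
    # output rate; adjust if your provider folds them into output_tokens.
    billed_output = usage["output_tokens"] + usage.get("reasoning_tokens", 0)
    cost = (uncached * p["input"]
            + cached * p["cached_input"]
            + billed_output * p["output"]) / 1_000_000
    return {
        "tenant": tenant,
        "feature": feature,
        "model": model,
        "request_id": request_id,
        **usage,
        "retries": retries,
        "estimated_cost_usd": round(cost, 4),
        "user_visible": user_visible,
    }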

[Figure: token cost ledger. Total request cost breaks into answer (visible tokens), reasoning (hidden tokens), RAG (context tokens), retries (duplicate work), tool calls, and cached (discounted) input: cost = visible tokens + hidden work + failed attempts. If the ledger only shows final answer tokens, most of the cost story is missing.]

A cost dashboard that ignores hidden work will make the wrong team look innocent.

Hidden leak 1: schema bloat

Tool schemas are useful. They are also repetitive. A large agent with dozens of tools can send thousands of schema tokens per request.

Mitigations:

  • send only tools relevant to the current state (see the sketch after this list)
  • group tools by workflow
  • shorten descriptions after evals prove safety
  • use prompt caching for stable tool schemas
  • measure schema tokens separately

If the tool list grows every sprint, your token bill is doing product management without permission.
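
A sketch of the first mitigation: a routing table from workflow state to the tools that state can actually use, plus a separate schema-token counter. The state names, tool names, and `count_tokens` hook are illustrative.

import json

# Sketch: send only the tools reachable from the current workflow state.
# TOOLS_BY_STATE and the names in it are illustrative, not a real API.
TOOLS_BY_STATE = {
    "triage": ["search_tickets", "classify_issue"],
    "resolve": ["search_kb", "draft_reply", "escalate"],
    "billing": ["lookup_invoice", "issue_refund"],
}

def tools_for_request(state, all_tools):
    # all_tools: {tool_name: full JSON schema}
    names = TOOLS_BY_STATE.get(state, [])
    return [all_tools[name] for name in names if name in all_tools]

def schema_tokens(tools, count_tokens):
    # count_tokens: your tokenizer. Log this per request as its own metric.
    return sum(count_tokens(json.dumps(tool)) for tool in tools)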

Hidden leak 2: RAG over-retrieval

RAG systems love stuffing context. The model may use three paragraphs while the prompt carries twenty.

Track:

  • retrieved tokens
  • cited tokens
  • answer-supported chunks
  • context precision (cited tokens / retrieved tokens; sketch below)
  • no-answer cases
  • per-source token cost

If a chunk never supports an answer, it should not keep getting invited to dinner.
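
A sketch of the accounting, assuming each retrieved chunk carries a token count and a citation flag set by your answer-attribution step. Both fields are assumptions about your pipeline.

# Sketch: per-request retrieval accounting.
# chunk["tokens"] and chunk["cited"] are assumed fields, filled in by
# your tokenizer and your citation/attribution pipeline.
def retrieval_stats(chunks):
    retrieved = sum(c["tokens"] for c in chunks)
    cited = sum(c["tokens"] for c in chunks if c["cited"])
    return {
        "retrieved_tokens": retrieved,
        "cited_tokens": cited,
        # Share of retrieved context the answer actually leaned on.
        "context_precision": cited / retrieved if retrieved else 0.0,
    }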

Hidden leak 3: retries and fallbacks

Retries are necessary. Blind retries are expensive.

Record:

  • original request cost
  • retry count
  • retry reason
  • fallback model
  • tokens wasted before failure
  • user-visible success

A retry after a malformed JSON output is different from a retry after a timeout. One may need constrained decoding. The other may need queue control.
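
A sketch of a retry wrapper that records the reason per attempt, so the two cases land in different buckets. The response shape and the exception types are assumptions about your client.

import json
import time

# Sketch: classify retries by cause, so the dashboard can separate
# "needs constrained decoding" from "needs queue control".
def call_with_retries(call, max_retries=2):
    attempts = []
    for attempt in range(max_retries + 1):
        start = time.monotonic()
        try:
            response = call()                 # assumed: returns an object with .text
            json.loads(response.text)         # assumed: task expects JSON output
            return response, attempts
        except json.JSONDecodeError:
            reason = "malformed_json"         # candidate for constrained decoding
        except TimeoutError:
            reason = "timeout"                # candidate for queue control
        attempts.append({
            "attempt": attempt,
            "retry_reason": reason,
            "seconds_wasted": time.monotonic() - start,
        })
    raise RuntimeError(f"exhausted retries: {attempts}")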

Hidden leak 4: agents that do not stop

Agents can burn tokens in tool loops:

search -> summarize -> search again -> summarize again -> rethink -> search

The loop may not be technically infinite. It can simply be expensive enough to feel infinite to finance.

Set budgets (enforcement sketch after the list):

  • max steps
  • max tool calls
  • max tokens per turn
  • max wall-clock time
  • max repeated tool/action pairs
  • no-progress detector
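
A sketch of those budgets as a guard object checked on every step. The limits and the repeated-action rule are illustrative defaults, not recommendations.

import time

# Sketch: hard budgets enforced once per agent step.
class StepBudget:
    def __init__(self, max_steps=20, max_tool_calls=30,
                 max_seconds=120, max_repeats=3):
        self.max_steps = max_steps
        self.max_tool_calls = max_tool_calls
        self.deadline = time.monotonic() + max_seconds
        self.max_repeats = max_repeats
        self.steps = 0
        self.tool_calls = 0
        self.seen = {}                        # (tool, args) -> count

    def check(self, tool, args):
        self.steps += 1
        self.tool_calls += 1
        key = (tool, args)
        self.seen[key] = self.seen.get(key, 0) + 1
        if self.steps > self.max_steps:
            raise RuntimeError("budget: max steps exceeded")
        if self.tool_calls > self.max_tool_calls:
            raise RuntimeError("budget: max tool calls exceeded")
        if time.monotonic() > self.deadline:
            raise RuntimeError("budget: wall-clock limit exceeded")
        if self.seen[key] > self.max_repeats:
            # Crude no-progress detector: same tool, same arguments, again.
            raise RuntimeError("budget: repeated tool/action pair")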

Hidden leak 5: cancellation that does not propagate

If the user closes the tab, the backend should stop generating. If the stream disconnects and the GPU keeps decoding for 30 seconds, those tokens are pure waste.

Every streaming system needs (see the sketch after this list):

  • client disconnect detection
  • cancellation propagation to model backend
  • token accounting after cancel
  • aborted stream metrics
  • cleanup for held KV cache and queue slots
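
A sketch of disconnect-aware streaming with FastAPI, where `request.is_disconnected()` handles the detection. The model stream and the cleanup hook are stand-ins for your backend.

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()

async def generate_tokens(request):
    # Stand-in for your model backend's token stream.
    for token in ("streamed", " ", "tokens"):
        yield token

async def release_backend(request, tokens_sent):
    # Stand-in: free the KV cache and queue slot, and record how many
    # tokens were generated before the stream ended or was aborted.
    pass

@app.post("/chat")
async def chat(request: Request):
    async def stream():
        tokens_sent = 0
        try:
            async for token in generate_tokens(request):
                if await request.is_disconnected():
                    break                     # stop decoding; the client is gone
                tokens_sent += 1
                yield token
        finally:
            await release_backend(request, tokens_sent)
    return StreamingResponse(stream(), media_type="text/plain")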

Cost dashboard that actually helps

At minimum:

  • Cost by tenant / feature / model: finds owners
  • Input, cached input, output, and reasoning tokens: separates cost classes
  • RAG context tokens vs cited tokens: finds retrieval waste
  • Tool schema tokens: finds agent bloat
  • Retry cost: finds the reliability tax
  • Canceled-token waste: finds stream cleanup bugs
  • Cost per successful task: avoids optimizing raw token counts
  • Budget burn rate: catches runaway jobs early

[Figure: hidden token leak dashboard. Sample panels: retry tax 12.4% of spend, RAG overfill 8.7M unused tokens, tool schema 2.1K tokens per call, cache savings 43% of input cached, agent loops 391 budget stops, canceled waste 0.8% of output tokens. Make waste visible before it becomes normal; the best cost controls are boring dashboards with owners and budgets.]

Use real numbers from your stack. The categories are the point; the sample values are placeholders.

Budget controls

Dashboards are not enough. Add controls:

  • per-tenant daily and monthly budgets
  • per-feature budgets
  • model allowlists by use case
  • max input tokens by endpoint
  • max output tokens by task
  • tool-call budgets
  • agent step budgets
  • cost-aware fallback
  • alert on spend velocity, not just absolute spend

The useful alert is:

support-chat spend is 3.2x normal for this hour
because retry tokens increased after deploy abc123

Not:

bill high
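
A sketch of the velocity check behind the useful alert, comparing the current hour to a trailing median baseline. The 3x threshold and the input shape are illustrative; hourly spend per feature would come from the usage-record ledger above.

from statistics import median

# Sketch: alert on spend velocity, not absolute spend.
def spend_alert(feature, hourly_spend, threshold=3.0):
    # hourly_spend: trailing hourly cost for this feature, oldest first.
    *history, current = hourly_spend
    baseline = median(history)                # robust to one-off spikes
    if baseline > 0 and current / baseline >= threshold:
        return f"{feature} spend is {current / baseline:.1f}x normal for this hour"
    return None

Join the alert with your deploy log so the message can name the suspect change, as in the example above.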

The metric to optimize

Do not optimize for lowest token count. Optimize for:

cost per successful task inside SLO
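
A sketch of computing it, assuming each usage record carries a success flag and an SLO-compliance flag from your eval and latency pipelines. Both flags are assumptions; the cost field matches the record above.

# Sketch: cost per successful task inside SLO.
# record["success"] and record["within_slo"] are assumed fields fed by
# task-level evals and latency SLO checks.
def cost_per_successful_task(records):
    total_cost = sum(r["estimated_cost_usd"] for r in records)
    successes = sum(1 for r in records if r["success"] and r["within_slo"])
    return total_cost / successes if successes else float("inf")

Failed attempts still count in the numerator. That is the point: waste shows up as more expensive successes, not as a separate line nobody owns.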

Sometimes a longer prompt prevents a retry. Sometimes a bigger model avoids escalation. Sometimes a semantic cache hit is safe and saves everything. Sometimes cutting context makes the answer cheap and wrong.

Cost monitoring should help engineering make those tradeoffs consciously.

Sources worth reading