16/20 - Streaming Generation: The First Token Is a Product Decision

#streaming #sse #grpc #llm-inference #backpressure

Streaming does not make the model finish sooner. It changes when useful output becomes visible. That turns time to first token, buffering, event contracts, cancellation, and proxy behavior into product architecture rather than transport trivia.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

A restaurant can serve every course at the end or bring each course when ready. The kitchen time may be identical, but the experience is not. The waiter also needs a clear signal when the meal is finished or interrupted.

Follow the state and work from left to right.

Description: The flow starts at the Model decoder, which emits token IDs. The middle stage, the Stream adapter, decodes text safely. The final stage, the Client UI, shows the observable result by rendering deltas. The arrows describe dependency order, not necessarily separate services.

What actually happens

The engine produces token IDs. A streamer incrementally decodes them while respecting byte-pair (BPE) or SentencePiece token boundaries, applies stop conditions, and emits deltas. The server frames those deltas as SSE, chunked HTTP, WebSocket, or streaming gRPC messages.

Production streams need typed lifecycle events: started, text delta, tool call, usage, completed, cancelled, and error. A stream that only emits text cannot reliably distinguish a clean finish from a broken connection.

Backpressure propagates from a slow client through proxy and server buffers. The system must bound per-stream queues, decide whether generation can pause, and cancel upstream work when the client disconnects. Otherwise the GPU keeps spending tokens for nobody.

A worked example

A model starts generating after 600 ms and finishes at 4 seconds. A non-streaming UI feels idle for 4 seconds. A stream shows useful text at 600 ms. If a reverse proxy buffers 16 KB before flushing, however, the user may still wait several seconds despite correct server code.
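
The arithmetic behind this example can be sketched in a few lines. This is a rough model, not a measurement: it assumes bytes are produced at a uniform rate between first and last token, and that the proxy flushes only when its buffer fills or the response ends. The function name and the byte counts in the usage note are hypothetical.

```python
def first_visible_ms(ttft_ms, total_ms, total_bytes, proxy_buffer_bytes):
    """Estimate when the first byte reaches the client behind a buffering proxy.

    Assumptions (illustrative only): output bytes arrive uniformly between
    ttft_ms and total_ms, and the proxy flushes only on a full buffer or at
    end of response.
    """
    if proxy_buffer_bytes <= 0:
        return ttft_ms  # no buffering: the first token is visible immediately
    if total_bytes <= proxy_buffer_bytes:
        return total_ms  # buffer never fills, so the flush happens at the end
    fill_fraction = proxy_buffer_bytes / total_bytes
    return ttft_ms + fill_fraction * (total_ms - ttft_ms)
```

With the numbers above and an assumed 2,000-byte answer, `first_visible_ms(600, 4000, 2000, 16 * 1024)` returns 4000: the response never fills the 16 KB buffer, so the user waits for the whole generation even though the server streamed correctly.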

The performance model

Perceived latency is dominated by TTFT and flush cadence; total latency is unchanged unless streaming alters scheduling. Tiny writes increase framing and syscall overhead, while large buffers damage interactivity. Coalesce small token fragments for a few milliseconds, not seconds.
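
One way to coalesce fragments is a small buffer that flushes on either a time or a size threshold. The sketch below is a minimal illustration; the class name, the 5 ms default, and the 1 KB limit are assumptions, and the clock is injectable so the behavior can be tested without sleeping.

```python
import time


class DeltaCoalescer:
    """Merge small text fragments and flush on a time or size threshold."""

    def __init__(self, max_delay_s=0.005, max_bytes=1024, clock=time.monotonic):
        self.max_delay_s = max_delay_s
        self.max_bytes = max_bytes
        self.clock = clock
        self._parts = []
        self._size = 0
        self._first_at = None  # arrival time of the oldest buffered fragment

    def push(self, text):
        """Add a fragment; return a merged chunk if a threshold is hit, else None."""
        if not text:
            return None
        if self._first_at is None:
            self._first_at = self.clock()
        self._parts.append(text)
        self._size += len(text.encode("utf-8"))
        waited = self.clock() - self._first_at
        if self._size >= self.max_bytes or waited >= self.max_delay_s:
            return self.flush()
        return None

    def flush(self):
        """Return everything buffered so far, or None if the buffer is empty."""
        chunk = "".join(self._parts)
        self._parts, self._size, self._first_at = [], 0, None
        return chunk or None
```

This version only checks thresholds when a new fragment arrives. A production variant would also flush from a timer and at end of stream, so that the first or last fragment never waits on a token that is slow to arrive.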

Prefill and decode run the same model but expose different bottlenecks and SLOs.

Description: The left panel asks how streaming changes prompt processing and TTFT; the right asks how it changes iterative generation and inter-token latency. The bottom row names the metric that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.

Expert lens

Unicode and tokenizer boundaries matter. One token ID may not decode to valid standalone text, and naive concatenation can emit replacement characters or duplicate bytes. Use incremental tokenizer decoding and test multilingual output.
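
The byte-level half of this problem can be shown with Python's standard incremental decoder. The sketch assumes the tokenizer exposes raw bytes per token, as byte-level BPE vocabularies do; real tokenizer libraries ship their own streaming decoders, and `stream_text` is a hypothetical helper name.

```python
import codecs


def stream_text(byte_chunks):
    """Yield only complete characters from byte chunks that may split a code point."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    for chunk in byte_chunks:
        text = decoder.decode(chunk)  # holds back incomplete trailing bytes
        if text:
            yield text
    # final=True raises UnicodeDecodeError if the stream ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# "é" and "日" are deliberately split across chunk boundaries.
chunks = [b"caf", b"\xc3", b"\xa9 ", b"\xe6\x97", b"\xa5"]
deltas = list(stream_text(chunks))  # ["caf", "é ", "日"]

# Decoding each chunk on its own would emit a replacement character instead.
naive = b"\xc3".decode("utf-8", errors="replace")  # "\ufffd"
```

The incremental decoder emits nothing for a chunk that ends mid-character, which is why a delta can legitimately be empty and should not be framed as an event.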

The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

Description: The left panel is the baseline, Buffered response, characterized by one final payload and simple accounting. The right panel applies Streaming response, changing the cost profile to early visible deltas and typed lifecycle events. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.

Where it wins

Chat, coding, and long-form generation
Tool workflows that expose progress events
Applications where perceived latency matters

Where it disappoints

Assuming streaming reduces total compute
Sending one HTTP write per token without buffering policy
Omitting final usage and terminal events
Failing to cancel generation on disconnect

Production checklist

Define a versioned event schema
Disable unintended proxy buffering
Bound queues and propagate cancellation
Handle UTF-8 and tokenizer boundaries
Test partial errors and reconnect behavior

What to measure

Time to first token and first visible byte
Inter-chunk gap distribution
Buffered bytes per layer
Cancelled tokens avoided
Streams ending without a terminal event

From one GPU to a production service

A local server writes directly to a browser. Production adds ingress, service mesh, gateway, API server, and sometimes CDN. Every hop can buffer, time out, compress, or transform the stream. Validate the complete path with timestamps at each boundary.

Backpressure policy should distinguish temporary slowness from abandonment. A bounded queue can pause reads from the engine only if the engine supports per-request flow control; otherwise cancel the request before memory grows without bound.
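
A minimal sketch of that policy with `asyncio`: a bounded queue absorbs temporary slowness, and a stall timeout on the enqueue turns prolonged slowness into cancellation. The helper name `offer`, the timeout value, and the `cancel_upstream` callback are assumptions for illustration.

```python
import asyncio


async def offer(queue, item, stall_timeout_s, cancel_upstream):
    """Enqueue item; if the consumer stalls past the timeout, cancel generation.

    Returns True if the item was enqueued, False if the stream was treated
    as abandoned and cancel_upstream() was called.
    """
    try:
        # put() blocks while the bounded queue is full (temporary slowness).
        await asyncio.wait_for(queue.put(item), timeout=stall_timeout_s)
        return True
    except asyncio.TimeoutError:
        # The queue stayed full for too long: stop spending GPU time.
        cancel_upstream()
        return False


async def demo():
    queue = asyncio.Queue(maxsize=1)  # per-stream bound
    cancelled = []
    first = await offer(queue, "delta-1", 0.01, lambda: cancelled.append(True))
    # Nobody is draining the queue, so the second offer times out.
    second = await offer(queue, "delta-2", 0.01, lambda: cancelled.append(True))
    return first, second, cancelled, queue.qsize()
```

Here `asyncio.run(demo())` enqueues the first delta and cancels on the second. Blocking the producer this way is only safe if the engine tolerates a paused reader for that request; otherwise the timeout should be short.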

Schema evolution matters. Additive event fields are easy, but changing terminal semantics breaks clients. Version event types and maintain conformance tests for web, CLI, SDK, and agent consumers.

Design-review questions

Which hop first flushes visible bytes to the user?
How are slow clients bounded or cancelled?
Is usage durable if the terminal event is lost?
Can every client parse every terminal state?
How quickly does disconnect reach the GPU scheduler?

How it connects to the rest of the series

Continuous batching schedules the active streams. Chunked prefill protects inter-token latency. Backpressure and queue policy determine whether slow clients consume scarce decode capacity.

From equation to implementation

SSE frames use a text event stream over HTTP. A practical event includes an id, event type, and JSON data line, terminated by a blank line. HTTP/2 and HTTP/3 alter connection behavior, but proxies may still buffer application writes unless configured to flush.
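
As a concrete illustration, one frame can be serialized as below. The field layout follows the SSE wire format; the function name and the payload shape are assumptions.

```python
import json


def format_sse(event_id, event_type, payload):
    """Serialize one event as an SSE frame: id, event, data, then a blank line."""
    # json.dumps escapes newlines inside strings, so the JSON fits on a single
    # data: line and cannot terminate the frame early.
    data = json.dumps(payload, ensure_ascii=False, separators=(",", ":"))
    return f"id: {event_id}\nevent: {event_type}\ndata: {data}\n\n"


frame = format_sse(7, "output_text.delta", {"text": "Hi\nthere"})
# frame is:
# id: 7
# event: output_text.delta
# data: {"text":"Hi\nthere"}   (the \n stays escaped inside the JSON string)
# followed by an empty line that ends the event
```

Sending raw multi-line text instead of JSON would require one `data:` line per line of text, which is one reason a JSON payload is the simpler contract.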

Usage accounting should be terminal but durable. If the client disconnects before receiving the final event, server-side accounting must still finalize from generation state. Client-visible completion and billing completion are related but separate transactions.

Implementation sketch

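# Sketch only: emit, engine, client, incremental_decoder, bounded_queue,
# event, finalize_usage, and final_usage are assumed to exist.
emit(event='response.started', id=request_id)
for token_ids in engine.stream(request):
    if client.cancelled():
        # Stop GPU work first, then record what was generated before cancellation.
        engine.cancel(request_id)
        finalize_usage(status='cancelled')
        return
    # May return an empty string while a multi-byte character is incomplete.
    text_delta = incremental_decoder.push(token_ids)
    if text_delta:
        bounded_queue.put(event('output_text.delta', text_delta))
# Usage precedes the terminal event; nothing may follow response.completed.
emit(event='usage.completed', usage=final_usage())
emit(event='response.completed')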

Capacity planning

Budget memory per active stream: queued deltas, tool payloads, trace state, and proxy buffers. Thousands of slow clients can consume more host memory than the model server. Bound queues and enforce idle-write timeouts.
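
A back-of-the-envelope budget can make this concrete. The function and every number in the example are hypothetical; substitute measured values from your own stack.

```python
def host_memory_bytes(active_streams, queue_depth, avg_event_bytes,
                      per_stream_overhead_bytes, proxy_buffer_bytes):
    """Worst-case host memory if every stream fills its queue and proxy buffer."""
    per_stream = (queue_depth * avg_event_bytes
                  + per_stream_overhead_bytes
                  + proxy_buffer_bytes)
    return active_streams * per_stream


# Illustrative inputs: 5,000 slow clients, 64 queued events of 512 bytes each,
# 8 KiB of trace and tool state, and a 16 KiB proxy buffer per stream.
total = host_memory_bytes(5000, 64, 512, 8 * 1024, 16 * 1024)
total_mib = total / 2**20  # about 273 MiB
```

The queue depth multiplies everything else, which is why an unbounded queue is the first thing to remove.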

Benchmarking without fooling yourself

Measure server token time and client visible-byte time separately.
Test nginx, Envoy, CDN, and browser paths end to end.
Throttle clients to validate backpressure and queue limits.
Cancel during prefill, decode, tool calls, and finalization.

A production failure to design for

The API server detects disconnects, but cancellation is not propagated through the gateway to the engine. Dashboards show healthy response latency while GPUs spend 18 percent of tokens on abandoned streams. Correlate disconnect, cancellation acknowledgment, and last generated token.
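
The correlation can start as a simple offline calculation over logged timestamps. This sketch assumes each stream record carries the generation time of every token and an optional disconnect time on the same clock; the function names and record shape are hypothetical.

```python
def abandoned_tokens(token_times, disconnect_at):
    """Count tokens generated after the client disconnected."""
    return sum(1 for t in token_times if t > disconnect_at)


def abandoned_fraction(streams):
    """streams: iterable of (token_times, disconnect_at or None) pairs."""
    total = wasted = 0
    for token_times, disconnect_at in streams:
        total += len(token_times)
        if disconnect_at is not None:
            wasted += abandoned_tokens(token_times, disconnect_at)
    return wasted / total if total else 0.0


streams = [
    ([0.1, 0.2, 0.3, 0.4], None),            # client stayed connected
    ([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], 0.35),  # three tokens after disconnect
]
fraction = abandoned_fraction(streams)  # 3 of 10 tokens
```

A healthy system shows the last generated token landing shortly after the disconnect; a long tail after it points to cancellation that never reached the engine.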

Treat optimization as a measured loop, not a one-time flag.

Description: The operating cycle moves from Contract to Transport, then Cancel and Observe. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.

Deeper engineering guide

Streaming is a transport protocol around an incremental decoder. The server emits structured events for role, text delta, tool arguments, usage, finish reason, and errors. Those events must preserve UTF-8 boundaries, JSON framing, ordering, and one terminal outcome across proxies and client libraries.

Streaming converts model progress into a reliable ordered protocol.

Description: Follow the state from Decode through Assemble and Transport to Consume. Each box is an ownership or computation boundary. In particular, the stream contract includes event order, retry semantics, and final accounting. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.

Time to first token and inter-token latency shape perceived performance, but excessive flushing wastes CPU and bandwidth. Coalesce tiny byte fragments without delaying semantic tokens. Proxies may buffer unless compression, headers, and idle timeouts are configured correctly. Heartbeats protect long tool or reasoning gaps without pretending they are model output.
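
With SSE, a heartbeat can be a comment line (a line starting with a colon), which clients ignore and which therefore cannot be mistaken for model output. The sketch below assumes frames arrive on an `asyncio.Queue` and that a `None` item marks end of stream; both conventions, and the names, are illustrative.

```python
import asyncio

HEARTBEAT_FRAME = ": keep-alive\n\n"  # SSE comment; carries no event data


async def frames_with_heartbeat(queue, idle_s):
    """Yield frames from queue, emitting a comment frame after idle_s of silence.

    A None item marks end of stream (a convention of this sketch).
    """
    while True:
        try:
            frame = await asyncio.wait_for(queue.get(), timeout=idle_s)
        except asyncio.TimeoutError:
            yield HEARTBEAT_FRAME  # keeps proxies and idle timers satisfied
            continue
        if frame is None:
            return
        yield frame


async def demo():
    queue = asyncio.Queue()

    async def producer():
        await queue.put("data: a\n\n")
        await asyncio.sleep(0.2)  # simulated tool call or reasoning gap
        await queue.put("data: b\n\n")
        await queue.put(None)

    task = asyncio.create_task(producer())
    frames = [f async for f in frames_with_heartbeat(queue, 0.05)]
    await task
    return frames
```

The idle interval should sit comfortably below the shortest idle timeout on the proxy path.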

Total compute may be equal while time to visible value changes dramatically.

Description: The bars compare Buffered response with Token stream on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain (“Users see progress and can cancel unhelpful output”) remains larger than the risk (“Too many tiny flushes increase CPU and network overhead”) under production traffic.

Disconnect is a cancellation signal, not proof the GPU stopped. Propagate cancellation through gateway, scheduler, and engine; release KV and reserved output capacity idempotently. If billing reflects generated tokens, account for work completed before cancellation even when delivery failed.
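
Idempotent release matters because a disconnect, a timeout, and normal completion can all trigger cleanup for the same request. A minimal single-threaded sketch follows; the class, the block IDs, and the list-based free pool are hypothetical stand-ins for a real KV cache allocator, and a threaded server would need a lock around `release`.

```python
class StreamResources:
    """Holds KV block IDs for one request; release() is safe to call repeatedly."""

    def __init__(self, request_id, kv_blocks, free_pool):
        self.request_id = request_id
        self._kv_blocks = list(kv_blocks)
        self._free_pool = free_pool  # shared list of free block IDs (hypothetical)
        self._released = False

    def release(self):
        """Return blocks to the pool once; later calls are no-ops.

        Returns the number of blocks freed by this call.
        """
        if self._released:
            return 0
        self._released = True
        freed = len(self._kv_blocks)
        self._free_pool.extend(self._kv_blocks)
        self._kv_blocks = []
        return freed


pool = []
resources = StreamResources("req-1", [3, 4, 5], pool)
first = resources.release()   # frees three blocks
second = resources.release()  # no-op: nothing is returned to the pool twice
```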

A strict lifecycle prevents duplicate completion and ambiguous client state.

Description: State advances from Open to Emitting, Finishing, and finally Closed. The labels below each state identify what becomes true at that boundary. The governing invariant is: After a terminal event, no additional content or usage event is legal. Retries and cancellation must preserve the same transition rules.
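
The invariant can be enforced with a small validator, which also doubles as a conformance test for clients. The event names `response.started`, `usage.completed`, and `response.completed` come from the implementation sketch in this article; `response.cancelled` and `response.error` are assumed names for the other terminal outcomes.

```python
TERMINAL_EVENTS = {"response.completed", "response.cancelled", "response.error"}


class LifecycleError(Exception):
    """Raised when an event sequence violates the stream lifecycle."""


def validate_stream(event_types):
    """Return the terminal event type, or raise if the sequence is illegal."""
    state = "open"
    terminal = None
    for ev in event_types:
        if state == "closed":
            raise LifecycleError(f"{ev!r} arrived after terminal event {terminal!r}")
        if state == "open":
            if ev != "response.started":
                raise LifecycleError(f"expected response.started, got {ev!r}")
            state = "emitting"
        elif ev == "response.started":
            raise LifecycleError("duplicate response.started")
        elif ev in TERMINAL_EVENTS:
            state, terminal = "closed", ev
        elif ev == "usage.completed":
            if state == "finishing":
                raise LifecycleError("duplicate usage event")
            state = "finishing"
        elif state == "finishing":
            raise LifecycleError(f"{ev!r} arrived after usage was finalized")
        # Any other event while emitting is ordinary content.
    if state != "closed":
        raise LifecycleError("stream ended without a terminal event")
    return terminal
```

A stream that stops after a text delta raises here, which is exactly the case a text-only stream cannot distinguish from a clean finish.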

A stream that works on localhost can still fail behind production buffering.

Description: The four panels are independent review axes: Framing, Text, Liveness, and Accounting. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: Test every contract through the real load balancer and proxy chain.

End-to-end streaming health cannot be inferred from server-side TTFT alone.

Description: This is a causal chain, not four unrelated symptoms. The “Tokens emit” step leads to “Proxy buffers”, which in turn leads to “Timeout fires”. The green Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition.

The takeaway

Streaming is a contract from GPU to human. A good implementation makes progress visible, completion unambiguous, and abandoned work stop quickly.
