Streaming Generation: The First Token Is a Product Decision

#streaming #sse #grpc #llm-inference #backpressure

Streaming does not make the model finish sooner. It changes when useful output becomes visible. That turns time to first token, buffering, event contracts, cancellation, and proxy behavior into product architecture rather than transport trivia.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

A restaurant can serve every course at the end or bring each course when ready. The kitchen time may be identical, but the experience is not. The waiter also needs a clear signal when the meal is finished or interrupted.


What actually happens

The engine produces token IDs. A streamer decodes them incrementally while respecting byte-pair encoding (BPE) or SentencePiece token boundaries, applies stop conditions, and emits text deltas. The server frames those deltas as SSE, chunked HTTP, WebSocket, or streaming gRPC messages.

Production streams need typed lifecycle events: started, text delta, tool call, usage, completed, cancelled, and error. A stream that only emits text cannot reliably distinguish a clean finish from a broken connection.
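
One way to make the lifecycle explicit is a small typed schema with a fixed set of terminal states. The sketch below is illustrative: the names response.started, output_text.delta, usage.completed, and response.completed match the implementation sketch later in this article, while tool_call, response.cancelled, response.error, and the stream_outcome helper are assumed names.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class EventType(str, Enum):
    STARTED = "response.started"
    TEXT_DELTA = "output_text.delta"
    TOOL_CALL = "tool_call"               # assumed name
    USAGE = "usage.completed"
    COMPLETED = "response.completed"
    CANCELLED = "response.cancelled"      # assumed name
    ERROR = "response.error"              # assumed name


# Exactly one of these should close every stream.
TERMINAL = {EventType.COMPLETED, EventType.CANCELLED, EventType.ERROR}


@dataclass
class StreamEvent:
    type: EventType
    data: dict = field(default_factory=dict)


def stream_outcome(events: list) -> Optional[EventType]:
    """Return the terminal event type, or None if the stream was truncated."""
    if events and events[-1].type in TERMINAL:
        return events[-1].type
    return None
```

A client that receives None here knows the connection broke before the server declared an outcome, which a text-only stream cannot express.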

Backpressure propagates from a slow client through proxy and server buffers. The system must bound per-stream queues, decide whether generation can pause, and cancel upstream work when the client disconnects. Otherwise the GPU keeps spending tokens for nobody.

A worked example

A model emits its first token at 600 ms and its last at 4 seconds. A non-streaming UI looks idle for the full 4 seconds, while a streaming UI shows useful text at 600 ms. If a reverse proxy buffers 16 KB before flushing, however, a short response may never fill that buffer, so the user can still wait several seconds despite correct server code.
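
The arithmetic can be sketched with a toy model. It assumes bytes arrive at a steady rate between the first and last token, and that the proxy flushes only when its buffer fills or the upstream response ends. The response sizes are made-up inputs, not measurements.

```python
def first_visible_ms(ttft_ms, end_ms, total_bytes, proxy_buffer_bytes):
    """Estimate when the client first sees bytes behind a buffering proxy.

    Toy model: steady byte rate from ttft_ms to end_ms; the proxy flushes
    when its buffer fills or when the upstream response ends.
    """
    if proxy_buffer_bytes <= 0:
        return ttft_ms  # no buffering: first token is visible immediately
    if total_bytes < proxy_buffer_bytes:
        return end_ms  # buffer never fills; everything arrives at the end
    fraction = proxy_buffer_bytes / total_bytes
    return ttft_ms + fraction * (end_ms - ttft_ms)


# Hypothetical 2 KB answer behind a 16 KB proxy buffer.
unbuffered = first_visible_ms(600, 4000, 2_000, 0)
buffered = first_visible_ms(600, 4000, 2_000, 16 * 1024)
```

Under these assumptions the unbuffered path shows text at 600 ms, while the buffered path shows nothing until the response ends at 4 seconds.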

The performance model

Perceived latency is dominated by TTFT and flush cadence; total latency is unchanged unless streaming alters scheduling. Tiny writes increase framing and syscall overhead, while large buffers damage interactivity. Coalesce small token fragments for a few milliseconds, not seconds.
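
A minimal coalescing sketch follows. The Coalescer class, its 20 ms delay, and its 1 KB size cap are illustrative assumptions, not recommended values. The clock is injectable so the policy can be tested without sleeping.

```python
import time


class Coalescer:
    """Merge small text fragments and release them after a short delay."""

    def __init__(self, max_delay_s=0.02, max_bytes=1024, clock=time.monotonic):
        self.max_delay_s = max_delay_s
        self.max_bytes = max_bytes
        self.clock = clock
        self._parts = []
        self._size = 0
        self._first_at = None  # arrival time of the oldest pending fragment

    def push(self, fragment):
        """Add a fragment; return a merged chunk if a flush is due, else None."""
        now = self.clock()
        if fragment:
            if self._first_at is None:
                self._first_at = now
            self._parts.append(fragment)
            self._size += len(fragment.encode("utf-8"))
        due = self._first_at is not None and (
            now - self._first_at >= self.max_delay_s
            or self._size >= self.max_bytes
        )
        return self.flush() if due else None

    def flush(self):
        """Return everything pending (or None) and reset the buffer."""
        chunk = "".join(self._parts)
        self._parts, self._size, self._first_at = [], 0, None
        return chunk or None
```

This version only checks the deadline when a new fragment arrives. A real server would also need a timer so the last pending fragment is flushed when generation pauses, plus a final flush() before the terminal event.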

Expert lens

Unicode and tokenizer boundaries matter. One token ID may not decode to valid standalone text, and naive concatenation can emit replacement characters or duplicate bytes. Use incremental tokenizer decoding and test multilingual output.
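
The byte-level half of this problem can be shown with the standard library alone. The sketch below uses raw byte chunks as a stand-in for per-token output; real tokenizers have their own streaming decoders, which this does not replace.

```python
import codecs


def decode_naive(chunks):
    """Decode each chunk on its own; split characters become U+FFFD."""
    return "".join(c.decode("utf-8", errors="replace") for c in chunks)


def decode_incremental(chunks):
    """Carry incomplete byte sequences across chunk boundaries."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    out = [decoder.decode(c) for c in chunks]
    out.append(decoder.decode(b"", final=True))  # flush any trailing bytes
    return "".join(out)


# "é" is two bytes in UTF-8 (0xC3 0xA9); here it is split across two chunks.
chunks = [b"caf\xc3", b"\xa9 ok"]
naive = decode_naive(chunks)
incremental = decode_incremental(chunks)
```

The naive path yields two replacement characters where the incremental path yields "café ok".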

Streaming does not remove work. It changes where the system spends compute, memory, bandwidth, and waiting time.

Where it wins

Chat, coding, and long-form generation
Tool workflows that expose progress events
Applications where perceived latency matters

Where it disappoints

Assuming streaming reduces total compute
Sending one HTTP write per token without buffering policy
Omitting final usage and terminal events
Failing to cancel generation on disconnect

Production checklist

Define a versioned event schema
Disable unintended proxy buffering
Bound queues and propagate cancellation
Handle UTF-8 and tokenizer boundaries
Test partial errors and reconnect behavior

What to measure

Time to first token and first visible byte
Inter-chunk gap distribution
Buffered bytes per layer
Cancelled tokens avoided
Streams ending without a terminal event
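
Two of these metrics can be derived from client-side chunk arrival timestamps. The sketch below assumes timestamps in milliseconds on a single clock; the function names and the sample values are illustrative.

```python
import statistics


def inter_chunk_gaps(timestamps_ms):
    """Gaps between consecutive chunk arrivals."""
    return [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]


def summarize(request_sent_ms, timestamps_ms):
    """First visible byte plus a small summary of the gap distribution."""
    gaps = inter_chunk_gaps(timestamps_ms)
    return {
        "first_visible_ms": timestamps_ms[0] - request_sent_ms,
        "gap_p50_ms": statistics.median(gaps) if gaps else None,
        "gap_max_ms": max(gaps) if gaps else None,
    }


# Hypothetical trace: steady chunks, then an 800 ms stall.
summary = summarize(0, [600, 650, 700, 1500])
```

The maximum gap matters as much as the median here: a single long stall is what users perceive as a frozen stream.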

From one GPU to a production service

A local server writes directly to a browser. Production adds ingress, service mesh, gateway, API server, and sometimes CDN. Every hop can buffer, time out, compress, or transform the stream. Validate the complete path with timestamps at each boundary.

Backpressure policy should distinguish temporary slowness from abandonment. A bounded queue can pause reads from the engine only if the engine supports per-request flow control; otherwise cancel the request before memory grows without bound.
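
One possible policy, sketched with asyncio: waiting on a full bounded queue tolerates temporary slowness, and a grace period turns sustained slowness into cancellation. SlowClient, enqueue_delta, and the 50 ms grace period are assumed names and values. Pausing the producer this way only helps if the engine can actually stall per request.

```python
import asyncio


class SlowClient(Exception):
    """Raised when the per-stream queue stays full past the grace period."""


async def enqueue_delta(queue, delta, grace_s):
    """Wait up to grace_s for queue space; otherwise signal abandonment."""
    try:
        await asyncio.wait_for(queue.put(delta), timeout=grace_s)
    except asyncio.TimeoutError:
        raise SlowClient("queue full past grace period") from None


async def demo():
    queue = asyncio.Queue(maxsize=2)  # bounded per-stream queue
    await enqueue_delta(queue, "a", 0.05)
    await enqueue_delta(queue, "b", 0.05)
    try:
        # Nothing drains the queue, so this put cannot complete in time.
        await enqueue_delta(queue, "c", 0.05)
    except SlowClient:
        return "cancelled"  # the caller would cancel upstream generation here
    return "ok"
```

Running asyncio.run(demo()) takes the cancellation branch because the queue never drains.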

Schema evolution matters. Additive event fields are easy, but changing terminal semantics breaks clients. Version event types and maintain conformance tests for web, CLI, SDK, and agent consumers.

Design-review questions

Which hop first flushes visible bytes to the user?
How are slow clients bounded or cancelled?
Is usage durable if the terminal event is lost?
Can every client parse every terminal state?
How quickly does disconnect reach the GPU scheduler?

How it connects to the rest of the series

Continuous batching schedules the active streams. Chunked prefill protects inter-token latency. Backpressure and queue policy determine whether slow clients consume scarce decode capacity.

From concept to implementation

SSE frames use a text event stream over HTTP. A practical event includes an id, event type, and JSON data line, terminated by a blank line. HTTP/2 and HTTP/3 alter connection behavior, but proxies may still buffer application writes unless configured to flush.
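
A minimal framing helper might look like the following. The sse_frame function and the payload shape are illustrative. The X-Accel-Buffering header is an nginx-specific hint; other proxies need their own configuration.

```python
import json

# Response headers commonly sent with an SSE stream.
SSE_HEADERS = {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    "X-Accel-Buffering": "no",  # asks nginx not to buffer this response
}


def sse_frame(event_id, event, data):
    """Serialize one event as id, event, and data lines plus a blank line."""
    # json.dumps escapes newlines inside strings, so the payload stays on
    # a single data line.
    payload = json.dumps(data, separators=(",", ":"))
    return f"id: {event_id}\nevent: {event}\ndata: {payload}\n\n".encode("utf-8")


frame = sse_frame("7", "output_text.delta", {"text": "hi\nthere"})
```

Keeping the JSON on one data line avoids the multi-line data handling that the event stream format otherwise requires.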

Usage accounting should be terminal but durable. If the client disconnects before receiving the final event, server-side accounting must still finalize from generation state. Client-visible completion and billing completion are related but separate transactions.
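
A sketch of that separation: usage is finalized in a finally block from server-side generation state, whether or not the client ever saw a terminal event. UsageLedger is an in-memory stand-in for a durable store, and run_stream, send, and the status strings are assumed names.

```python
class UsageLedger:
    """In-memory stand-in for a durable usage store (hypothetical)."""

    def __init__(self):
        self.records = {}

    def finalize(self, request_id, tokens, status):
        # Idempotent: the first finalization wins, retries are no-ops.
        return self.records.setdefault(
            request_id, {"tokens": tokens, "status": status}
        )


def run_stream(request_id, token_source, send, ledger):
    """Stream tokens to the client and always record usage."""
    generated = 0
    status = "error"
    try:
        for token in token_source:
            generated += 1  # the token was generated even if delivery fails
            send(token)     # may raise if the client has gone away
        status = "completed"
    except ConnectionError:
        status = "cancelled"
    finally:
        ledger.finalize(request_id, generated, status)
    return status
```

Counting a token before attempting delivery reflects the split described above: billing follows what the engine produced, not what the client received.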

Implementation sketch

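# Sketch only: engine, client, incremental_decoder, bounded_queue, emit, event,
# finalize_usage, and final_usage are placeholders for your own components.
def stream_response(request, request_id):
    emit(event='response.started', id=request_id)
    for token_ids in engine.stream(request):
        if client.cancelled():
            # Stop GPU work first, then record usage for the abandoned stream.
            engine.cancel(request_id)
            finalize_usage(status='cancelled')
            return
        # The decoder may hold back bytes until they form complete text.
        text_delta = incremental_decoder.push(token_ids)
        if text_delta:
            bounded_queue.put(event('output_text.delta', text_delta))
    emit(event='usage.completed', usage=final_usage())
    emit(event='response.completed')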

Capacity planning

Budget memory per active stream: queued deltas, tool payloads, trace state, and proxy buffers. Thousands of slow clients can consume more host memory than the model server. Bound queues and enforce idle-write timeouts.
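
A back-of-the-envelope budget can make this concrete. Every number below is a hypothetical input chosen for illustration, not a measurement; substitute values from your own traces.

```python
def host_memory_bytes(
    streams,
    queue_events,
    avg_event_bytes,
    proxy_buffer_bytes,
    trace_bytes,
    tool_payload_bytes=0,
):
    """Worst-case host memory if every stream fills its bounded queue."""
    per_stream = (
        queue_events * avg_event_bytes
        + proxy_buffer_bytes
        + trace_bytes
        + tool_payload_bytes
    )
    return streams * per_stream


# Hypothetical: 5,000 slow streams, 256 queued events of 512 B each,
# a 16 KB proxy buffer, and 4 KB of trace state per stream.
total = host_memory_bytes(5_000, 256, 512, 16 * 1024, 4 * 1024)
total_gib = total / 2**30
```

With these inputs the total is 757,760,000 bytes, roughly 0.7 GiB. The queue bound is the dominant term, which is why it is the first thing to cap.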

Benchmarking without fooling yourself

Measure server token time and client visible-byte time separately.
Test nginx, Envoy, CDN, and browser paths end to end.
Throttle clients to validate backpressure and queue limits.
Cancel during prefill, decode, tool calls, and finalization.

A production failure to design for

The API server detects disconnects, but cancellation is not propagated through the gateway to the engine. Dashboards show healthy response latency while GPUs spend 18 percent of tokens on abandoned streams. Correlate disconnect, cancellation acknowledgment, and last generated token.
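
That correlation can be sketched as a per-stream trace with three kinds of timestamps. StreamTrace and its field names are assumptions, as is the premise that all timestamps share one clock.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StreamTrace:
    disconnect_ms: Optional[float]   # when the API server saw the disconnect
    cancel_ack_ms: Optional[float]   # when the engine acknowledged the cancel
    token_times_ms: list = field(default_factory=list)


def wasted_tokens(trace):
    """Tokens generated after the client had already disconnected."""
    if trace.disconnect_ms is None:
        return 0
    return sum(1 for t in trace.token_times_ms if t > trace.disconnect_ms)


def cancel_propagation_ms(trace):
    """Delay between disconnect and engine acknowledgment, if both exist."""
    if trace.disconnect_ms is None or trace.cancel_ack_ms is None:
        return None
    return trace.cancel_ack_ms - trace.disconnect_ms


def wasted_fraction(traces):
    """Share of all generated tokens that nobody was connected to receive."""
    total = sum(len(t.token_times_ms) for t in traces)
    wasted = sum(wasted_tokens(t) for t in traces)
    return wasted / total if total else 0.0
```

A trace with a disconnect but no cancellation acknowledgment is the signature of the failure described above: the API server noticed, and the engine never heard.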

Treat optimization as a measured loop, not a one-time flag.

The takeaway

Streaming is a contract from GPU to human. A good implementation makes progress visible, completion unambiguous, and abandoned work stop quickly.
