15/21 - The Router Is Part of the Model: Routing, Hedging, and Fallback
Routing changes which model, provider, region, cache, and policy serves a request; it therefore changes quality, latency, cost, compliance, and failure behavior.
A useful design review for this system begins with a request, follows its state, and refuses to hide waiting time inside a single latency number. We build the intuition first, then move on to capacity math, placement, failure behavior, and the measurements worth putting on an operator dashboard.
The mental model
The useful unit of design is not the library name but the contract between the capability filter, the policy gate, the scorer, and the circuit/fallback state. Each boundary needs stable identity, bounded resource use, explicit error semantics, and telemetry. Hidden coupling at one boundary usually appears later as tail latency, unreproducible state, or unsafe recovery.
Description: The diagram separates the user-visible contract from state placement, execution, and control. Read it top to bottom. A tuning change in a lower layer is safe only when the upper-layer contract remains true.
What actually happens
The critical path is classify -> filter -> score -> attempt -> account. Some stages may overlap, but correctness dependencies cannot simply be parallelized away. Separate control metadata from the high-volume data plane, preserve deadlines across calls, and make every retry aware of idempotency and remaining budget.
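The five stages can be sketched as one function. This is a minimal illustration, not a reference implementation: the `Route`, `Request`, and `Decision` types, the 8000-token class boundary, and the `quality - cost` score are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Route:
    name: str
    max_context: int      # largest prompt the route accepts, in tokens
    regions: frozenset    # regions where the route may serve data
    quality: float        # assumed offline quality score
    cost: float           # assumed cost per request, arbitrary units


@dataclass(frozen=True)
class Request:
    tokens: int
    region: str


@dataclass
class Decision:
    route: Optional[str] = None
    reasons: List[str] = field(default_factory=list)


def route_request(req: Request, routes: List[Route],
                  attempt: Callable[[Route, Request], bool]) -> Decision:
    """Run classify -> filter -> score -> attempt -> account for one request."""
    decision = Decision()

    # classify: label the workload (the 8000-token boundary is hypothetical)
    workload = "long_context" if req.tokens > 8000 else "standard"
    decision.reasons.append(f"class={workload}")

    # filter: drop routes that cannot legally or technically serve the request
    eligible = [r for r in routes
                if r.max_context >= req.tokens and req.region in r.regions]
    decision.reasons.append(f"eligible={[r.name for r in eligible]}")

    # score: rank the survivors (a deliberately simple placeholder score)
    ranked = sorted(eligible, key=lambda r: r.quality - r.cost, reverse=True)

    # attempt: try routes in rank order until one succeeds
    for r in ranked:
        if attempt(r, req):
            decision.route = r.name
            decision.reasons.append(f"served_by={r.name}")
            break
        decision.reasons.append(f"failed={r.name}")

    # account: the caller receives the route and every reason along the way
    return decision
```

Filtering runs before scoring on purpose: a route that violates a context limit or residency constraint is never ranked, so a high score cannot override a hard constraint.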
A field note
A fallback is not automatically equivalent to the primary. Models differ in tool schemas, context limits, safety behavior, data residency, and output quality. Retries also consume the original deadline. The router must carry the remaining budget, cap attempts, distinguish overload from bad requests, and record why a route changed, not just which endpoint won.
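A minimal sketch of those four obligations follows. The exception classes, the `send` callback, and the log wording are hypothetical names chosen for illustration; a real router would map its provider's error codes onto similar categories.

```python
import time


class Overloaded(Exception):
    """Hypothetical error: the route is saturated; another route may succeed."""


class BadRequest(Exception):
    """Hypothetical error: the request is invalid; no fallback can fix it."""


def call_with_fallback(routes, send, deadline_s, max_attempts=3,
                       clock=time.monotonic):
    """Try routes in order within one shared deadline.

    routes: ordered list of route names (primary first).
    send(route, remaining_s): performs the call; may raise the errors above.
    Returns (result_or_None, log), where log records why each route was left.
    """
    start = clock()
    log = []
    for route in routes[:max_attempts]:          # cap attempts
        remaining = deadline_s - (clock() - start)
        if remaining <= 0:                       # retries spend the same budget
            log.append((route, "skipped: deadline exhausted"))
            break
        try:
            result = send(route, remaining)      # pass the remaining budget down
            log.append((route, "served"))
            return result, log
        except Overloaded:
            log.append((route, "fallback: overloaded"))
        except BadRequest:
            log.append((route, "rejected: bad request, not retried"))
            break
    return None, log
```

The log is the part that matters for later analysis: it separates a request that moved because the primary was saturated from one that was never going to succeed anywhere.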
Description: Follow one unit of work from left to right. The lower panel is the accounting model. It is intentionally explicit because unmeasured queueing and data movement are the most common reasons that component benchmarks fail to predict production behavior.
The capacity equation
expected utility = quality value - latency penalty - cost - failure risk
Treat this as a model to validate, not a constant to copy. Measure each term on the exact hardware, model revision, input distribution, and concurrency regime. Capacity planning should reserve failure headroom; running permanently at the cliff makes recovery impossible when a replica, link, or dependency disappears.
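As a sketch, the equation can be evaluated per candidate route. The linear latency penalty and the `failure_cost * failure_prob` risk term are assumptions made here to give the terms a concrete form; the equation itself does not prescribe them, and all numbers below are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteEstimate:
    name: str
    quality_value: float   # measured value of a good answer, arbitrary units
    latency_s: float       # measured latency for this workload slice
    cost: float            # measured cost per request, same units as value
    failure_prob: float    # measured probability that the attempt fails


def expected_utility(e, latency_weight=0.1, failure_cost=1.0):
    """expected utility = quality value - latency penalty - cost - failure risk."""
    latency_penalty = latency_weight * e.latency_s   # assumed linear penalty
    failure_risk = failure_cost * e.failure_prob     # assumed expected loss
    return e.quality_value - latency_penalty - e.cost - failure_risk


def best_route(estimates, latency_weight=0.1, failure_cost=1.0):
    """Pick the candidate with the highest expected utility."""
    return max(estimates,
               key=lambda e: expected_utility(e, latency_weight, failure_cost))


large = RouteEstimate("large", quality_value=1.0, latency_s=2.0,
                      cost=0.30, failure_prob=0.02)
small = RouteEstimate("small", quality_value=0.8, latency_s=0.5,
                      cost=0.05, failure_prob=0.01)
choice = best_route([large, small])
```

With these illustrative inputs the smaller route wins despite its lower quality value, because the larger route's latency and cost outweigh its quality advantage. Changing the weights changes the answer, which is why they need to be measured rather than copied.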
A worked production example
Start with one representative workload and record an end-to-end baseline. Apply the equation expected utility = quality value - latency penalty - cost - failure risk using measured, not advertised, rates. Increase concurrency until the first queue grows, then identify whether the capability filter, policy gate, scorer, or circuit/fallback state owns that queue. The saturation point and recovery curve are more useful than an isolated peak number.
Run the experiment in at least three regimes: one request for floor latency, a realistic concurrency distribution for normal operation, and controlled overload for backpressure and recovery. A system is not healthy merely because it eventually completes every request. Queue age, deadline misses, quality, and resource recovery all belong in the acceptance criteria.
Description: The timeline identifies where work waits and where it executes. Instrument both sides of every transition so queue time cannot be mistaken for compute time. Compare steady state with the warm-up and recovery periods rather than deleting them from the report.
Placement, topology, and scale
Logical architecture hides physical asymmetry. Two workers can have the same configuration while differing in accelerator generation, NUMA path, network hops, cache warmth, storage locality, or noisy-neighbor pressure. Placement must therefore be expressed as constraints and verified through telemetry.
Description: The two domains are intentionally independent. Local queues contain transient pressure; durable identity lets work move; the fabric is treated as a finite resource. A cross-domain design should say what happens when the fabric is slow, partitioned, or only partially available.
Failure analysis
The triggering event is rarely the entire incident. Cascades occur when a local failure creates retries, retries create more load, and overloaded dependencies become less responsive. Bound attempts, preserve the original deadline, add jitter, and open circuits by route or failure domain rather than disabling an entire platform.
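A minimal sketch of two of those controls: a circuit scoped to a single route and a jittered backoff. The thresholds, the class name, and the choice of full jitter (a uniform delay between zero and a capped exponential) are illustrative assumptions.

```python
import random


class RouteCircuit:
    """Circuit for one route: opens after consecutive failures and
    allows probe traffic again once a cooldown has passed."""

    def __init__(self, failure_threshold=3, cooldown_s=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def allow(self, now):
        if self.opened_at is None:
            return True
        # half-open: let a probe through once the cooldown has elapsed
        return now - self.opened_at >= self.cooldown_s

    def record(self, ok, now):
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now   # open, or re-open after a failed probe


def backoff_with_jitter(attempt, base_s=0.1, cap_s=2.0, rng=random):
    """Full jitter: uniform delay in [0, min(cap_s, base_s * 2**attempt)]."""
    return rng.uniform(0.0, min(cap_s, base_s * 2 ** attempt))


# One circuit per (route, failure domain), so one bad region does not
# disable the whole platform. The keys here are hypothetical.
circuits = {
    ("model-a", "region-1"): RouteCircuit(),
    ("model-a", "region-2"): RouteCircuit(),
}
```

The jitter spreads retries from many clients over time so they do not arrive as a synchronized wave. The per-key circuits keep a failure in one domain from removing capacity in another.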
Description: Trace the trigger downward into three distinct consequences. Correctness, performance, and operability require different detection and recovery controls; one generic health check cannot represent all three.
The control loop
Production optimization is a feedback system. Signals must be fresh and correctly scoped; decisions need hysteresis or cooldown; actions need bounds; verification must compare the intended metric without hiding regressions elsewhere. If a controller can add load faster than the system can observe the result, it will oscillate.
Description: A safe controller closes the loop. It does not stop after changing a batch size, replica count, route weight, or precision. It checks quality and SLOs, attributes the outcome, and rolls back when the invariant is violated.
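One way to sketch hysteresis, cooldown, and bounded actions is a route-weight controller. The thresholds, step size, and the use of error rate as the only signal are assumptions for illustration. Verification and rollback are left to the caller.

```python
def next_weight(current, error_rate, now, last_change,
                high=0.05, low=0.01, step=0.1, cooldown_s=60.0):
    """Return (new_weight, new_last_change) for a route weight in [0, 1].

    Hysteresis: reduce weight only above `high`, raise it only below `low`;
    between the two thresholds nothing changes.
    Cooldown: no change until `cooldown_s` has passed since the last change,
    so the previous action has time to show up in the signal.
    Bounds: each action moves the weight by at most `step`.
    """
    if now - last_change < cooldown_s:
        return current, last_change
    if error_rate > high:
        new = max(0.0, current - step)
    elif error_rate < low:
        new = min(1.0, current + step)
    else:
        return current, last_change      # inside the dead band
    if new == current:
        return current, last_change      # already at a bound
    return new, now
```

The gap between `low` and `high` is what prevents flapping: a signal hovering near a single threshold would otherwise trigger an action on every evaluation.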
What to measure
- latency and queue time around capability filter
- capacity and pressure for policy gate
- throughput, failures, and retries in scorer
- decision reasons emitted by circuit/fallback state
- quality, cost, and SLO goodput by workload slice
Always segment these measurements by model revision, workload class, hardware type, and outcome. A fleet-wide average can look healthy while one tenant, long-context bucket, adapter, or accelerator generation is failing.
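A small sketch shows how a fleet-wide average can hide a failing slice. The record fields (`workload`, `hardware`, `ok`) and the counts are invented for illustration.

```python
from collections import defaultdict


def success_rate_by_slice(records, keys):
    """Group records by the given keys and return the success rate per slice."""
    totals = defaultdict(lambda: [0, 0])          # slice -> [ok_count, total]
    for r in records:
        slice_key = tuple(r[k] for k in keys)
        totals[slice_key][0] += 1 if r["ok"] else 0
        totals[slice_key][1] += 1
    return {k: ok / total for k, (ok, total) in totals.items()}


# Synthetic data: 95 healthy short requests, 5 long requests that mostly fail.
records = (
    [{"workload": "short", "hardware": "gen_a", "ok": True}] * 95
    + [{"workload": "long", "hardware": "gen_b", "ok": False}] * 4
    + [{"workload": "long", "hardware": "gen_b", "ok": True}] * 1
)

fleet_rate = sum(r["ok"] for r in records) / len(records)
by_slice = success_rate_by_slice(records, ["workload", "hardware"])
```

Here the fleet-wide success rate is 0.96 while the `("long", "gen_b")` slice succeeds only 20% of the time, which is exactly the pattern an unsegmented dashboard would miss.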
From laboratory result to production capability
A laboratory result proves that one configuration worked once. A production capability proves that the same contract survives concurrency, skew, partial failure, deployment, and rollback. Record the complete experiment envelope: hardware SKU and topology, driver and runtime versions, model and tokenizer digests, request distribution, warm-up policy, concurrency, precision, and every non-default control. Without that envelope, a performance number is not reproducible evidence.
Separate floor latency, sustainable throughput, and recovery capacity. Floor latency is measured with no queue. Sustainable throughput is the highest rate that keeps queue age and SLO violations bounded over a long run. Recovery capacity is the spare capacity that lets the system keep absorbing its load after a replica, link, node, or dependency is lost. These are different numbers. Peak throughput is usually above the sustainable point and says little about safe production capacity.
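The relationship between sustainable throughput and recovery capacity can be sketched with simple arithmetic. This assumes homogeneous replicas and even rebalancing after a loss, which real fleets often violate, so treat the result as an upper bound to validate rather than a plan. The function names and figures are hypothetical.

```python
def safe_capacity(replicas, sustainable_rps_per_replica, tolerated_losses=1):
    """Highest planned load (requests/s) that the surviving replicas can
    still sustain after `tolerated_losses` replicas disappear."""
    surviving = replicas - tolerated_losses
    if surviving <= 0:
        raise ValueError("no replicas would survive the tolerated losses")
    return surviving * sustainable_rps_per_replica


def max_safe_utilization(replicas, tolerated_losses=1):
    """Fraction of total sustainable throughput that may be planned for use."""
    if replicas <= tolerated_losses:
        raise ValueError("no replicas would survive the tolerated losses")
    return (replicas - tolerated_losses) / replicas


# Example: 8 replicas, each measured at 50 requests/s sustainable.
planned_rps = safe_capacity(8, 50.0)          # 350.0
utilization = max_safe_utilization(8)         # 0.875
```

The per-replica rate passed in should be the measured sustainable rate, not the peak. Using peak throughput here would overstate safe capacity twice: once per replica and once for the fleet.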
Roll out in stages. First shadow inputs where policy permits, then canary a narrow workload slice, then increase traffic while comparing quality and operational distributions with the baseline. Make the rollback trigger machine-readable before rollout begins. A rollback that requires an operator to rediscover the previous model, state schema, or runtime image is not a rollback plan.
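A machine-readable rollback trigger can be as small as a list of metrics with allowed worsening, evaluated against the baseline. The metric names, tolerances, and absolute-difference comparison below are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trigger:
    metric: str
    max_worsening: float     # allowed absolute worsening versus baseline
    higher_is_better: bool


# Hypothetical triggers, fixed before the rollout starts.
TRIGGERS = [
    Trigger("p99_latency_s", max_worsening=0.20, higher_is_better=False),
    Trigger("quality_score", max_worsening=0.02, higher_is_better=True),
    Trigger("error_rate", max_worsening=0.005, higher_is_better=False),
]


def violated(triggers, baseline, canary):
    """Return the metrics on which the canary is worse than allowed."""
    out = []
    for t in triggers:
        delta = canary[t.metric] - baseline[t.metric]
        worsening = -delta if t.higher_is_better else delta
        if worsening > t.max_worsening:
            out.append(t.metric)
    return out


baseline = {"p99_latency_s": 1.00, "quality_score": 0.90, "error_rate": 0.010}
canary = {"p99_latency_s": 1.05, "quality_score": 0.85, "error_rate": 0.010}
should_roll_back = bool(violated(TRIGGERS, baseline, canary))
```

In this example the canary is within the latency and error tolerances but loses too much quality, so the rollback fires. Quality regressions of this kind are easy to miss when only operational metrics are watched.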
Debugging order
Debug from the outside inward. Confirm the request identity and deadline, then measure admission and queueing, then state lookup or transfer, then execution, then serialization and downstream delivery. Correlate all five with one trace identity. This order prevents a common mistake: optimizing the most visible kernel while the actual delay is a queue, a copy, a collective, a storage read, or a retry outside the profiler window.
Change one independent variable at a time and retain the raw samples. If a change improves the median but damages the p99, quality, or recovery time, it is not an unconditional improvement. Explain which workload segment benefits and encode that scope in routing or policy instead of applying the change globally.
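Keeping the raw samples makes this comparison cheap. The sketch below uses a nearest-rank percentile, one of several common definitions, and synthetic latency samples in milliseconds.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def compare(before, after):
    """Report the change in median and p99; negative means faster."""
    return {
        "median_delta": statistics.median(after) - statistics.median(before),
        "p99_delta": percentile(after, 99) - percentile(before, 99),
    }


# Synthetic samples: the change speeds up most requests but adds a slow tail.
before = [100] * 100
after = [80] * 98 + [400] * 2
report = compare(before, after)
```

Here the median improves by 20 ms while the p99 worsens by 300 ms. A report that showed only the median would call this change a win.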
Design-review checklist
- Is every artifact and state transition bound to a stable version or digest?
- Where does work wait, what bounds that queue, and what happens at the bound?
- Which failures are retryable, and how are deadline and idempotency preserved?
- Which resource saturates first under representative load?
- Can operators distinguish correctness failure from overload and dependency failure?
- Does rollback restore both code and state compatibility?
- Are sensitive inputs, outputs, credentials, and telemetry scoped and redacted?
- Has the recovery path been tested under partial failure rather than described only on paper?
The takeaway
Routing changes which model, provider, region, cache, and policy serves a request; it therefore changes quality, latency, cost, compliance, and failure behavior. The engineering discipline is to make that claim measurable: define the contract, map state and work to real resources, test the failure boundary, and operate a feedback loop that protects correctness before chasing peak throughput.
