Control Planes for Distributed Systems: A Practitioner's Guide

Every non-trivial distributed system has a control plane. The only question is whether you designed one, or one emerged by accident through scripts, cron jobs, and Slack incidents.

A good control plane is the difference between running a platform and operating one. It is what lets a team go from “babysitting deployments” to “shipping infrastructure as a product.” This is the architecture I have lived with — at Omnistrate, on Kubernetes-native platforms, and now on AI inference systems serving production traffic — distilled into the patterns that actually hold up.

What a control plane actually does

Control plane vs data plane is a useful split. The data plane is where the work happens — packets are routed, requests are served, queries are executed. The control plane is what decides what the data plane should be doing — placement, configuration, lifecycle, policy, observability.

Concretely, a real control plane is responsible for four things:

  1. Declare — accept the user’s intent (“I want 5 replicas of this model in us-west-2 with autoscaling between 2 and 20”); a sketch of such a record follows below.
  2. Reconcile — drive observed state toward desired state, continuously, in the face of failures and concurrent changes.
  3. Observe — emit truth about what is actually happening, not just what was intended.
  4. Govern — enforce policy, quotas, RBAC, audit, and cost controls before any of the above happens.

Miss any one of these and you do not have a control plane — you have a deployment script with extra steps.
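To make the declare step concrete, the intent above might be captured as a stored record like the following. This is a minimal sketch; the field names are illustrative, not taken from any particular API.

spec = {
    "kind": "ModelDeployment",          # hypothetical resource type
    "name": "my-model",
    "region": "us-west-2",
    "replicas": 5,
    "autoscaling": {"min": 2, "max": 20},
}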

The reconciliation loop is the heart of it

If you read one Kubernetes design doc in your life, read the one on the controller pattern. The reconciliation loop is the most important idea in modern infrastructure software:

import time

def reconcile_forever(jitter=2.0):
    # read_spec, read_status, diff, apply, and emit_events are the
    # controller's plumbing; the loop itself is the whole pattern.
    while True:
        desired = read_spec()        # what the user asked for
        observed = read_status()     # what is actually running
        if desired == observed:
            time.sleep(jitter)       # randomized idle avoids thundering herds
            continue
        for action in diff(desired, observed):
            apply(action)            # must be idempotent
        emit_events()

What makes this powerful, and what makes it hard:

  • Idempotent. Running it twice should be safe. Apply must be a no-op if already applied (see the sketch after this list).
  • Convergent. It runs forever; eventually observed approaches desired regardless of the path taken.
  • Crash-safe. If the controller dies mid-reconciliation, the next iteration must pick up cleanly. There is no “transaction in progress.”
  • Composable. Multiple controllers can reconcile different aspects of the same resource without stepping on each other, if their concerns are well-separated.
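Here is what an idempotent apply can look like in practice. A minimal sketch, assuming a hypothetical client with get/create/scale calls; the point is the shape: observe first, then act only on the difference.

def apply_replicas(client, name, desired_replicas):
    # Hypothetical client. Observe first, then act only on the difference.
    current = client.get_deployment(name)
    if current is None:
        client.create_deployment(name, replicas=desired_replicas)
    elif current.replicas != desired_replicas:
        client.scale_deployment(name, desired_replicas)
    # Already converged: do nothing. Running this twice is safe.

Crash-safety falls out of the same shape: if the process dies between observe and act, the next pass simply observes again and finishes the job. There is nothing to unwind.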

The wrong shape — and I have seen this in too many internal platforms — is an imperative pipeline. “User submits a request → API queues a job → worker reads the job → worker mutates state → worker reports done.” That looks fine until the worker dies between two steps and you are now reasoning about partial state at 2 AM.

The four primitives every control plane needs

If you are designing one from scratch, these are the components you will end up building or borrowing:

1. A declarative API

The user expresses what they want, not how to get there. CRDs in Kubernetes, Pulumi/Terraform resources, and Crossplane Compositions all do this. The declarative shape forces you to think in terms of desired state, which forces you toward reconciliation, which is the right mental model.

Anti-pattern: an imperative API that takes commands (“create this”, “scale that”). Imperative APIs trap you in stateful workflows where the only correct answer to “what is going on?” is “wait.”
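The difference is easiest to see in code. A minimal sketch, assuming a hypothetical state store and cloud client: the declarative handler records intent and returns; the imperative one does the work inline and leaves partial state behind if it dies mid-call.

def put_resource(store, name, spec):
    # Declarative: persist intent, return immediately. Reconcilers do the rest.
    store.write_desired(name, spec)
    return {"name": name, "phase": "Pending"}

def create_resource_imperative(cloud, name, spec):
    # Imperative: the handler is the workflow. A crash between these calls
    # strands an orphaned network that nothing will ever clean up.
    cloud.create_network(name)
    cloud.create_volume(name)
    cloud.create_instances(name, spec["replicas"])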

2. A reconciliation engine

Whatever runs your loops. In the Kubernetes world, controller-runtime is the canonical implementation. Outside Kubernetes, you can build it on top of Temporal, AWS Step Functions, or your own state machine — but if you do, you will reinvent leader election, retry with exponential backoff and jitter, and event de-duplication. Take the existing thing if you can.
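As a taste of what reinventing looks like, here is just the retry piece, exponential backoff with full jitter, as a minimal sketch with illustrative constants. Leader election and event de-duplication are each another page of subtle code.

import random
import time

def retry_with_backoff(fn, max_attempts=8, base=0.5, cap=30.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential cap.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))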

3. A persistent state store

Etcd in Kubernetes. Postgres for many SaaS control planes. The store needs to be:

  • Authoritative — the source of truth for both desired and observed state.
  • Watchable — controllers need to react to changes, not poll. Postgres LISTEN/NOTIFY, etcd watch, or change-data-capture all work.
  • Versioned — optimistic concurrency control on resource versions prevents lost updates from concurrent reconcilers (sketched below).
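A minimal sketch of the versioned write, assuming a Postgres-backed store with a psycopg-style driver and a resources table carrying a version column. The update only lands if nobody wrote since our read; on conflict the reconciler re-reads and retries.

class ConflictError(Exception):
    pass

def save_status(conn, name, new_status, read_version):
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE resources SET status = %s, version = version + 1 "
            "WHERE name = %s AND version = %s",
            (new_status, name, read_version),
        )
        if cur.rowcount == 0:
            raise ConflictError("stale read: re-fetch and retry")
    conn.commit()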

4. An observability surface

Status fields on resources, structured events, metrics labeled with the resource identity, and traces that follow a request from API submission through reconciliation to actuation. Without this, you have a control plane you cannot debug — which is the same as not having a control plane.
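Concretely, one reconciliation pass might leave all three signals behind, each carrying the resource identity so they can be joined later. A sketch with hypothetical metrics and resource objects:

import logging
import time

logger = logging.getLogger("controller")

def record_reconcile(resource, outcome, duration_s, metrics):
    # Structured event: the narrative, one record per state transition.
    logger.info("reconcile finished", extra={
        "resource": resource.name, "outcome": outcome, "duration_s": duration_s,
    })
    # Metric: the current gauge, labeled with the resource identity.
    metrics.gauge("desired_replicas", resource.spec["replicas"],
                  labels={"name": resource.name})
    # Status: written back onto the resource itself, for users to read.
    resource.status["lastReconcile"] = {"outcome": outcome, "at": time.time()}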

The architecture in one diagram

       ┌───────────────────────────────────────────────────────┐
       │                  USER / CLIENT / CI                   │
       └───────────────────┬───────────────────────────────────┘
                           │  declarative request
                           ▼
       ┌───────────────────────────────────────────────────────┐
       │                       API LAYER                       │
       │  authn / authz / validation / quota / audit / RBAC    │
       └───────────────────┬───────────────────────────────────┘
                           │  validated spec
                           ▼
       ┌───────────────────────────────────────────────────────┐
       │                     STATE STORE                       │
        │       desired-state │  observed-state   │  events     │
        └─────┬───────────────┴────▲──────────────┴─────────────┘
              │ watch              │ status updates
              ▼                    │
       ┌───────────────────────────────────────────────────────┐
       │                  RECONCILERS (N)                      │
       │     each reconciles one slice of the resource         │
       └─────┬─────────────────────────────────────────────────┘
             │  actuate
             ▼
       ┌───────────────────────────────────────────────────────┐
       │                     DATA PLANE                        │
       │  the actual workloads, models, queues, services       │
       └───────────────────────────────────────────────────────┘

The reconcilers never talk to the user. The user never talks to the data plane. The state store is the only crossing point. Keep that boundary sacred.

Anti-patterns I see often

Tight coupling of API and reconciliation. The API call does the work synchronously. Now your API is slow, can’t be retried safely, and can’t recover from a process restart mid-call.

Reconcilers that hold long-lived state. They cache things from previous iterations and act on them. The first time the cache is wrong, you discover that the only person who understood the cache invalidation logic left two quarters ago.

A “global” reconciler doing everything. It tries to manage networking, storage, and compute from a single loop. Composability dies; one bug stalls all of it.

No quotas before reconciliation. A user submits a request for 10,000 replicas. Reconcilers happily start trying. By the time you notice, the data plane is saturated and the on-call is paged.

Status fields written by reconcilers, read by users. Looks fine until two reconcilers write conflicting status. Use server-side apply (Kubernetes) or per-controller status sub-resources to give each reconciler its own slice.
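Outside Kubernetes, the same separation can be had by giving each controller its own slice of status and never letting it write anywhere else. A minimal sketch, reusing the optimistic-concurrency write from earlier; names are illustrative:

def write_status_slice(store, name, controller, slice_status):
    # Each controller owns exactly status[controller]; writers never collide.
    resource, version = store.get_with_version(name)
    resource["status"][controller] = slice_status
    store.put(name, resource, expected_version=version)  # OCC guards the write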

The patterns that hold up

Single-resource, single-reconciler. Each reconciler owns one resource type. It may read others to gather context, but it only writes its own.

Status sub-resources. Spec is what the user wants. Status is what the controller has done. They live on the same object but are written by different parties.

Events for narrative, metrics for state. Events tell the story (“scale-up started”, “scale-up completed”, “scale-up failed”). Metrics tell the gauge (“desired_replicas{name=X} = 10”). Both matter, for different audiences.

Finalizers for cleanup. When a resource is deleted, controllers that have side effects elsewhere (a cloud LB, an external account) need a chance to clean up before the resource is gone. Finalizers in Kubernetes formalize this; outside Kubernetes you build the same thing with a deletion-intent flag plus a finalizer list that each controller removes once its cleanup is done.
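A minimal sketch of those semantics, with hypothetical store and cloud clients: delete marks intent, each controller removes its finalizer once its external cleanup has finished, and only then does the record disappear.

def request_delete(store, name):
    # Deletion is recorded as intent; the record stays until cleanup is done.
    store.mark_deleting(name)

def reconcile_deletion(store, cloud, name):
    resource = store.get(name)
    if not resource.deleting:
        return
    if "lb-cleanup" in resource.finalizers:
        cloud.delete_load_balancer(name)             # external side effect
        store.remove_finalizer(name, "lb-cleanup")   # acknowledge cleanup
    if not store.get(name).finalizers:
        store.delete(name)                           # now safe to forget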

Drift detection on a slow loop. Even after reconciliation declares “done”, run a slow background scan that re-checks desired vs observed. It will find the drift — manual edits, partial failures, external mutations — that fast loops miss.
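The slow loop does not need to be clever; it just re-checks and requeues. A sketch, with observe_live_state and requeue standing in for whatever your platform provides:

import time

def drift_scan_forever(store, requeue, interval_s=3600):
    while True:
        for resource in store.list_all():
            # Any divergence is handed back to the normal fast loop to fix.
            if observe_live_state(resource) != resource.spec:
                requeue(resource.name)
        time.sleep(interval_s)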

Build vs adopt

Honest take: do not build a control plane from scratch unless your domain genuinely cannot be expressed in an existing one.

  • If you are managing Kubernetes-native resources, build a CRD + controller using controller-runtime. The boilerplate is solved.
  • If you are managing cloud resources, Crossplane or Pulumi Operator give you the declarative + reconciliation shape on top of Kubernetes.
  • If you are managing SaaS instances of a complex distributed system, Omnistrate-like platforms exist precisely for this and save 6-12 months of platform work.
  • If your domain is genuinely outside these — say, scheduling AI inference across heterogeneous GPU pools with custom placement constraints — then yes, build it. But borrow the patterns; do not reinvent the loop.

What “good” looks like operationally

A mature control plane has these properties on a Tuesday at 3 PM:

  • A user can describe everything in their environment as a YAML or JSON spec, and the diff between two snapshots tells them what is changing.
  • An on-call engineer can find the reconciliation history of any resource without grepping logs across 14 services.
  • Adding a new feature means adding a new reconciler, not modifying an existing one.
  • Failure of one reconciler does not affect any other.
  • Observability is so good that “what is going on with X” is a one-second answer.

If your platform doesn’t have these, you have an opportunity. Pick one, fix it, and move to the next.

The lesson I keep relearning

Control planes are about separation of concerns enforced by architecture, not by discipline. The shape of the system has to make the wrong thing hard. If a junior engineer can accidentally take down half the platform by writing a perfectly reasonable function, the architecture is wrong, not the engineer.

Good control planes survive their authors. They survive reorgs, rewrites, and on-call rotations. The patterns are not new — Kubernetes did not invent them, it productized them — but they are still under-applied. The teams that learn them ship infrastructure as a product. The teams that don’t keep shipping deployment scripts.