Cloud Agnostic Engineering: The Real Cost of Multi-Cloud Portability

#cloud #cloud-agnostic #multi-cloud #architecture #aws #azure #gcp #kubernetes #terraform

There is a pitch you have heard a hundred times: go cloud-agnostic so you are never locked in. Auditors love it. Procurement loves it. Boards love it. Then a team actually has to ship something on top of it, and the bill — measured in latency, in engineering hours, in features unbuilt — comes due.

This post is the honest version of that pitch. What portability actually costs, where it pays for itself, and how to decide which parts of your stack should be cloud-agnostic and which absolutely should not.

The four layers where portability lives or dies

Cloud portability is not one decision. It is four, and you can answer each one differently:

1. Infrastructure provisioning: how machines, networks, and disks come into existence.
2. Workload runtime: how your code actually runs once those machines exist.
3. Stateful services: databases, queues, caches, object storage.
4. Higher-order managed services: anything with “AI”, “ML”, “analytics”, or “serverless” in the name.

Layer 1 is genuinely portable today. Layer 4 is genuinely not. Most arguments about cloud-agnosticism collapse the moment you realize the speaker is talking about a different layer than you are.

Layer 1: Infrastructure provisioning

This is the layer where cloud-agnostic engineering is real and worth the effort. Terraform, Pulumi, and Crossplane all let you describe networking, compute, IAM, and storage in a uniform-ish way. The provider blocks differ, but the shape of your infrastructure code is the same across clouds.

The win: when AWS has a regional outage and you have a parallel Terraform module for GCP, your DR plan is hours of work, not weeks. The cost: every cloud-specific feature (Nitro Enclaves, Confidential VMs, Spot fleet allocation strategies) you skip in the name of portability is real money or capability left on the table.

Practical rule. Use Terraform/Pulumi modules per cloud, with a shared interface layer. Do not pretend that one Terraform module can target three clouds — it cannot, and the abstractions you build to fake it will leak constantly.
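
To make the "modules per cloud, shared interface" rule concrete, here is a minimal sketch in Python rather than HCL. Everything in it is illustrative: `NetworkSpec`, `aws_network`, `gcp_network`, and `render` are hypothetical names, and the resource and attribute names mirror the Terraform AWS and Google providers as I understand them, so check them against the provider docs before relying on them.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class NetworkSpec:
    """Cloud-neutral inputs: the shared interface (hypothetical)."""
    name: str
    cidr: str
    region: str


def aws_network(spec: NetworkSpec) -> dict:
    # AWS module: the CIDR range belongs to the VPC itself.
    return {
        "aws_vpc": {
            spec.name: {"cidr_block": spec.cidr, "tags": {"Name": spec.name}}
        }
    }


def gcp_network(spec: NetworkSpec) -> dict:
    # GCP module: the network carries no CIDR; the range lives on a
    # regional subnetwork, so the same inputs produce a different shape.
    return {
        "google_compute_network": {
            spec.name: {"name": spec.name, "auto_create_subnetworks": False}
        },
        "google_compute_subnetwork": {
            f"{spec.name}-subnet": {
                "name": f"{spec.name}-subnet",
                "ip_cidr_range": spec.cidr,
                "region": spec.region,
                "network": spec.name,
            }
        },
    }


# One module per cloud behind one interface, not one module for all clouds.
MODULES: Dict[str, Callable[[NetworkSpec], dict]] = {
    "aws": aws_network,
    "gcp": gcp_network,
}


def render(cloud: str, spec: NetworkSpec) -> dict:
    return MODULES[cloud](spec)
```

The inputs are shared; the outputs are not. That difference in shape is the leak you would be papering over if you tried to force both clouds through a single module.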

Layer 2: Workload runtime

Kubernetes is the agnostic-runtime story most teams settle on. It works. EKS, AKS, and GKE are similar enough that a workload manifest mostly moves cleanly. Helm charts, Argo CD, and Flux are all cloud-agnostic.

The catch: the Kubernetes API is portable, but the integrations are not. Workload identity (IRSA on AWS, Workload Identity on GKE, Microsoft Entra Workload ID on Azure, which replaced the deprecated AAD Pod Identity), load balancer controllers, CSI drivers, and ingress controllers diverge. Your pod manifests are agnostic; your ServiceAccount/IAM glue is not.

Practical rule. Treat your cluster as a layered cake: the application layer (Deployments, Services, ConfigMaps) stays portable; the platform layer (cluster setup, identity, networking) is per-cloud. Do not waste effort trying to make the platform layer portable.
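
A small sketch of where the cake splits, assuming manifests are built as plain Python dicts. The function names are hypothetical. The annotation keys are the ones I know from each provider's workload identity documentation, and real setups need more than this one annotation, so read it as an illustration of the boundary rather than a working identity configuration.

```python
# Platform layer: the annotation that binds a ServiceAccount to a cloud
# identity differs per provider.
IDENTITY_ANNOTATIONS = {
    "aws": "eks.amazonaws.com/role-arn",
    "gcp": "iam.gke.io/gcp-service-account",
    "azure": "azure.workload.identity/client-id",
}


def service_account(name: str, namespace: str, cloud: str, identity: str) -> dict:
    """Per-cloud glue: same Kubernetes kind, cloud-specific annotation."""
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {IDENTITY_ANNOTATIONS[cloud]: identity},
        },
    }


def deployment(name: str, image: str, service_account_name: str) -> dict:
    """Application layer: nothing here knows which cloud it runs on."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 2,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "serviceAccountName": service_account_name,
                    "containers": [{"name": name, "image": image}],
                },
            },
        },
    }
```

The Deployment only refers to the ServiceAccount by name, so it moves between clusters unchanged. The ServiceAccount is where the per-cloud decision lives, and that is the part you keep in the platform layer.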

Layer 3: Stateful services

This is where the romance dies. There is no honest way to claim that “I run PostgreSQL on RDS” and “I run PostgreSQL on Cloud SQL” gives you the same operational profile. Failover behavior, IAM integration, replication lag characteristics, point-in-time recovery semantics, backup retention — they all differ. Even the wire protocol details matter for connection pooling.

You have three actual choices:

1. Self-managed everything on raw VMs or Kubernetes (CockroachDB, ScyllaDB, Redpanda, ClickHouse, MinIO). Genuinely portable, but you now run a database, and that is its own engineering org if you want it to stay up.
2. Managed per cloud, with abstraction: pick a “PostgreSQL is PostgreSQL” view, accept that you will hit operational divergence, and build runbooks per cloud.
3. Pick one cloud’s managed service and commit: fastest velocity, highest lock-in.

Most teams should pick option 3 unless they have an explicit reason — regulatory, geographic, contractual — to demand portability.

Layer 4: Higher-order managed services

Bedrock, SageMaker, Vertex AI, BigQuery, Athena, Redshift, Snowflake-on-cloud-X, Lambda, Cloud Run, Step Functions, Durable Functions. None of these are portable. The interfaces are not standardized, the data formats diverge, the SDKs are bespoke.

You have exactly one honest choice here: pick the service you need, write a thin facade in your own code so that the call sites in your application don’t depend on the SDK, and accept that swapping providers will be a project, not a config change.

A facade buys you the option to migrate, not the cheapness of migration. That is still worth a lot.
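
Here is roughly what a thin facade looks like, as a sketch. `TextSummarizer`, `TruncatingSummarizer`, and `build_ticket_digest` are hypothetical names. The adapter shown is a local stand-in with no provider SDK behind it, because the point is the boundary, not any particular vendor call.

```python
from typing import Protocol


class TextSummarizer(Protocol):
    """The facade: the only surface application code is allowed to see."""

    def summarize(self, text: str, max_words: int) -> str:
        ...


class TruncatingSummarizer:
    """Stand-in adapter. A real one would wrap one provider's SDK and
    translate its request and response types into this interface."""

    def summarize(self, text: str, max_words: int) -> str:
        return " ".join(text.split()[:max_words])


def build_ticket_digest(summarizer: TextSummarizer, ticket_body: str) -> str:
    # Call site: depends on the facade, never on a provider SDK import.
    return "Digest: " + summarizer.summarize(ticket_body, max_words=5)
```

Swapping providers still means writing and testing a new adapter, including every behavioral difference the interface hides. What the facade saves you is touching every call site while you do it.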

What portability actually buys you

Strip the marketing away and you get three real benefits:

Disaster recovery options. When a region or cloud has a multi-day outage — and they do — you can fail over.
Negotiating leverage. When your cloud bill goes up 30% at renewal, the credible threat to leave matters. It only works if you actually could.
Regulatory adaptability. If sovereignty rules force you into a specific region or provider you weren’t using, you have a path.

Notice what is not on this list: cost reduction. Multi-cloud almost always costs more than single-cloud at the same scale. Egress charges between clouds are predatory. Operational complexity multiplies. Engineering time gets spent on portability instead of features.

If your justification for cloud-agnostic engineering is “save money,” you are doing it wrong. The justification is risk management, and risk management has a price.

What portability actually costs

Concretely, in the order you will notice them:

Slower velocity. Every cloud-specific feature you cannot use is one you have to build and maintain yourself, even though a managed service would have handled it for you.
More glue code. Identity, secrets, observability, networking — every layer needs a per-cloud adapter.
More cognitive load. Engineers have to know two or three clouds’ worth of quirks instead of one. Onboarding gets longer.
Higher infra cost. Egress, redundant managed services, idle capacity in standby regions.
Operational divergence. You will discover that your Azure deployment behaves differently than your AWS one in some specific way at exactly the worst time.

A reasonable estimate, from teams I’ve worked with: a fully cloud-agnostic stack costs 20-40% more in engineering hours and 10-25% more in infrastructure spend than the equivalent single-cloud stack. Whether that is worth it depends entirely on what you get for the money.
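
To see what those percentages mean in absolute terms, here is the arithmetic as a small sketch. The function name and the baseline figures in the usage note are made up for illustration; only the percentage ranges come from the estimate above.

```python
from typing import Dict, Tuple


def portability_premium(
    baseline_eng_hours: float, baseline_infra_spend: float
) -> Dict[str, Tuple[float, float]]:
    """Extra cost of a fully cloud-agnostic stack, as (low, high) ranges.

    Applies 20-40% to engineering hours and 10-25% to infrastructure
    spend, relative to a single-cloud baseline.
    """
    return {
        "engineering_hours": (
            baseline_eng_hours * 20 / 100,
            baseline_eng_hours * 40 / 100,
        ),
        "infra_spend": (
            baseline_infra_spend * 10 / 100,
            baseline_infra_spend * 25 / 100,
        ),
    }
```

For a hypothetical team spending 10,000 engineering hours and 1,000,000 per year on infrastructure, that works out to 2,000 to 4,000 extra hours and 100,000 to 250,000 in extra spend. That is the number to put next to the risk you are trying to cover.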

A practical decision framework

When someone asks “should this be cloud-agnostic?”, run it through this:

Is there a regulatory or contractual requirement? If yes, portability is mandatory; argue about which layer, not whether.
What is the blast radius of being trapped on one cloud? If your business goes under when AWS raises prices 20%, portability is a hedge. If you can absorb a 20% bill increase, it is theater.
What is the failure mode you are worried about? Cloud-wide outages happen but are rare. Region-wide outages are more common and don’t require multi-cloud — multi-region within one cloud often suffices.
Where in the stack is the dependency? Layer 1 portability is cheap. Layer 4 portability is sometimes impossible. Do the easy layers; do not pretend the hard ones are easy.
What is the time horizon? Portability is an option you exercise when something goes wrong. Options have a premium. Pay for the options you might actually exercise.
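
If it helps to run the questions mechanically, here is one way to encode the first four as a checklist. The field names, the `review` function, and the wording of the notes are my own shorthand, not a standard. The time-horizon question is left out because it needs judgment, not a boolean.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Assessment:
    regulatory_requirement: bool
    survives_20pct_price_rise: bool
    feared_failure: str  # "region" or "cloud"
    layer: int           # 1 to 4, per the layers described earlier


def review(a: Assessment) -> List[str]:
    """Collect one note per question instead of forcing a single verdict."""
    notes: List[str] = []
    if a.regulatory_requirement:
        notes.append("mandatory: argue about which layer, not whether")
    if a.survives_20pct_price_rise:
        notes.append("price risk is absorbable: portability here is theater")
    else:
        notes.append("price risk is existential: portability is a hedge")
    if a.feared_failure == "region":
        notes.append("multi-region within one cloud often suffices")
    if a.layer == 1:
        notes.append("layer 1: cheap, do it")
    elif a.layer == 4:
        notes.append("layer 4: facade only, do not pretend it is easy")
    return notes
```

The output is a list of notes, not a score, on purpose: the questions pull in different directions, and the useful part is seeing which ones apply to the component in front of you.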

The pattern that actually works

The teams I have seen do this well share a structure:

A single cloud as the primary, carrying nearly all production traffic.
A second cloud that is provisioned, has the application running, and takes a small percentage of traffic — enough that it is exercised, not so much that it doubles cost.
Stateful services on the primary cloud only; the secondary has a way to recover state (snapshots, async replication) but is not active-active.
Layer 1 (infra) genuinely portable via Terraform.
Layer 2 (Kubernetes) genuinely portable.
Layer 3 (state) primary-only, with a documented migration path.
Layer 4 (higher-order) primary-only, behind facades.
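
The "small percentage of traffic" piece is usually handled by weighted DNS or a global load balancer, not application code. As a sketch of the idea only, here is a deterministic split keyed on a request or tenant identifier. `pick_cloud` and its parameters are hypothetical.

```python
import hashlib


def pick_cloud(request_key: str, secondary_percent: float) -> str:
    """Route a stable slice of keys to the secondary cloud.

    The same key always lands on the same side, so a given tenant or
    session does not bounce between clouds from one request to the next.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    # Map the key onto 10,000 buckets so the weight has 0.01% resolution.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "secondary" if bucket < secondary_percent * 100 else "primary"
```

A call like `pick_cloud("tenant-42", 5.0)` sends about 5% of keys to the secondary. That is enough to keep the second deployment honest without paying for a second full production footprint.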

This is sometimes called “cloud-portable, not cloud-agnostic.” You can move if you have to. You haven’t paid the full cost of pretending you are equally on three clouds at once.

The honest answer

Cloud-agnostic engineering is real and worth doing for the layers where it is cheap, and an expensive fantasy for the layers where it is hard. The teams that get burned are the ones who treat it as a yes/no decision rather than a per-layer trade.

Pick the layers, pay the price for the ones that matter to your risk profile, and stop apologizing for using the managed services that make you fast where they don’t.
