Apache Kafka in Production: Beyond the Quickstart
Getting Kafka running is a one-page tutorial. Running Kafka — at production scale, where a partition rebalance at 3 AM is not allowed to take you down — is a multi-year discipline.
This is the cheat sheet I wish someone had handed me before my first real Kafka deployment. None of it is exotic; all of it is the difference between a system that survives and one that doesn’t.
Topic design is the foundation, and it is permanent
Once a topic has data, you do not get to redesign it. Plan accordingly.
Partition count
The single most under-thought decision. Partitions determine:
- Maximum consumer parallelism. A consumer group can have at most as many active consumers as partitions. Want 64 parallel workers? You need ≥64 partitions.
- Per-partition ordering. Order is guaranteed within a partition, never across them. If you need ordering for a key, all events for that key must hash to the same partition.
- Per-broker load. Each partition lives on a leader broker. Too many partitions per broker hurt throughput and lengthen recovery time.
Sensible starting points:
- 6 to 12 partitions for low-volume topics.
- 30 to 60 for medium-volume.
- 100+ for high-volume.
- Never start with num.partitions=1. You can increase the count later, but doing so changes the key-to-partition mapping, breaking ordering for existing keys. A creation sketch follows this list.
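A minimal sketch of locking these decisions in at creation time, using the Java AdminClient; the topic name, partition count, and broker address are all illustrative:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateOrdersTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // hypothetical address
        try (Admin admin = Admin.create(props)) {
            // 48 partitions: enough parallelism headroom that the
            // key-to-partition mapping never has to change.
            NewTopic orders = new NewTopic("orders", 48, (short) 3) // name is illustrative
                    .configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(List.of(orders)).all().get();
        }
    }
}
```

Events for the same key hash to the same partition under the default partitioner, so the count you pick here is also the ordering contract you are locking in.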
Replication and ISR
replication.factor=3 is non-negotiable for production. Three replicas means you can lose one broker to failure while a second is down for maintenance and still have a live copy of every partition.
min.insync.replicas=2 plus acks=all on producers gives you durability against any single broker failure with bounded latency. The combination is what you actually want; either alone is partial.
If min.insync.replicas=replication.factor, a single slow broker stalls the entire topic. If min.insync.replicas=1, you have effectively no durability guarantee.
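As a sketch, here is the producer half of that durability contract (the bootstrap address is illustrative; min.insync.replicas=2 is assumed to be set on the topic):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableProducer {
    public static KafkaProducer<String, String> build() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // hypothetical
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // acks=all: the leader acknowledges only once every in-sync replica
        // has the write. Combined with min.insync.replicas=2 on the topic,
        // an acknowledged write survives any single broker failure.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Idempotence deduplicates broker-side retries, avoiding the
        // classic duplicate-on-retry failure mode.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        return new KafkaProducer<>(props);
    }
}
```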
Retention
Two axes: time (retention.ms) and size (retention.bytes, which applies per partition, not per topic). Whichever limit is hit first triggers deletion. Pick both deliberately.
For event-sourcing topics, retention is forever (retention.ms=-1); use a compacted topic if you also need point-in-time reads of latest-by-key. For log/metric ingestion, days-to-weeks is standard. For commit-log-as-source-of-truth, longer.
Compaction (cleanup.policy=compact) is its own world. It is what makes Kafka useful as a key-value log — Kafka Streams’ state stores depend on it. Mix compact,delete carefully; the interaction is subtle.
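A sketch of setting both retention axes on an existing topic via the AdminClient's incrementalAlterConfigs; the topic name and the limits are illustrative:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // hypothetical
        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "metrics-raw"); // illustrative
            List<AlterConfigOp> ops = List.of(
                // Seven days by time...
                new AlterConfigOp(new ConfigEntry("retention.ms", "604800000"),
                                  AlterConfigOp.OpType.SET),
                // ...or 50 GiB per partition, whichever is hit first.
                new AlterConfigOp(new ConfigEntry("retention.bytes", "53687091200"),
                                  AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}
```

The same call shape sets any topic config, including cleanup.policy=compact.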
Consumer groups: the rebalance problem
Every introduction to Kafka explains consumer groups. Almost none explain that the rebalance protocol is the single biggest source of production pain.
What happens during a rebalance
When a consumer joins or leaves a group, all consumers in the group stop processing, sync state with the coordinator, get reassigned partitions, and resume. This is “stop-the-world” by default.
In a high-throughput pipeline, a rebalance during a deployment can mean 30+ seconds of zero throughput, plus a backlog spike, plus duplicate processing of in-flight records.
The mitigations
Static membership (group.instance.id) — a consumer with a stable ID does not trigger rebalances on graceful restart. Use this for any long-lived consumer.
Cooperative incremental rebalancing (partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor) — only the partitions that need to move actually move; the rest keep processing. Available since Kafka 2.4, and what you should be using by default.
Tune session.timeout.ms and heartbeat.interval.ms deliberately. Defaults are fine for most workloads. If your consumers do GC pauses or long polls, raise the session timeout to avoid spurious rebalances.
Use max.poll.interval.ms for long-processing consumers. If the gap between successive polls exceeds it, the consumer is considered dead and its partitions are reassigned. Long-running message processing needs a high value (5-10 minutes) or a thread pool that returns control to the poll loop quickly.
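A sketch of a consumer with all four mitigations applied; the group ID, instance ID, and address are illustrative:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RebalanceTunedConsumer {
    // instanceId must be stable across restarts, e.g. the pod or host name.
    public static KafkaConsumer<String, String> build(String instanceId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // hypothetical
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-enricher");         // hypothetical
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Static membership: graceful restarts keep the same member
        // identity, so no rebalance fires.
        props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, instanceId);
        // Cooperative rebalancing: only partitions that move stop processing.
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                  CooperativeStickyAssignor.class.getName());
        // Ride out GC pauses without triggering spurious rebalances.
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "45000");
        // Allow up to 5 minutes between polls for slow per-batch processing.
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000");
        return new KafkaConsumer<>(props);
    }
}
```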
Exactly-once semantics: read the fine print
Kafka’s “exactly-once semantics” (EOS) is real but narrower than the marketing.
EOS guarantees: a message produced by transactional producer P, consumed by transactional consumer C in the same transaction, will be processed exactly once if all systems involved cooperate. Specifically:
- Producer must use enable.idempotence=true and a transactional.id.
- Consumer must commit offsets within the same transaction as its produced records (the read-process-write pattern; see the sketch after this list).
- Reads must use isolation.level=read_committed.
- Both ends must be Kafka. Sinks to external systems (Postgres, S3, Elasticsearch) are not exactly-once unless those sinks support idempotent or transactional writes.
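A minimal read-process-write sketch under those rules; the topic names, IDs, and the enrich step are illustrative, and the error handling is deliberately simplified:

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ReadProcessWrite {
    public static void main(String[] args) {
        Properties pp = new Properties();
        pp.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // hypothetical
        pp.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        pp.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Stable per-instance ID; setting it also implies idempotence.
        pp.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "enricher-tx-0"); // hypothetical
        KafkaProducer<String, String> producer = new KafkaProducer<>(pp);
        producer.initTransactions();

        Properties cp = new Properties();
        cp.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");
        cp.put(ConsumerConfig.GROUP_ID_CONFIG, "enricher"); // hypothetical
        cp.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cp.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cp.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
        cp.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // offsets commit via the transaction
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp);
        consumer.subscribe(List.of("orders")); // hypothetical topic

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            if (records.isEmpty()) continue;
            producer.beginTransaction();
            try {
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                for (ConsumerRecord<String, String> r : records) {
                    producer.send(new ProducerRecord<>("orders-enriched", r.key(), enrich(r.value())));
                    offsets.put(new TopicPartition(r.topic(), r.partition()),
                                new OffsetAndMetadata(r.offset() + 1));
                }
                // Offsets and output records commit atomically, or not at all.
                producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                producer.commitTransaction();
            } catch (Exception e) {
                // Production code must treat ProducerFencedException as fatal
                // and rewind the consumer; this sketch just aborts.
                producer.abortTransaction();
            }
        }
    }

    private static String enrich(String value) { return value; } // placeholder processing
}
```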
In practice, “effectively exactly once” — at-least-once delivery + idempotent processing — is what most production systems use. It is simpler and works across any sink. EOS is for cases where the entire pipeline is Kafka-native.
KRaft: ZooKeeper is finally optional
As of Kafka 3.3 (2022), KRaft (Kafka Raft) is production-ready. As of 3.5+, it is the recommended deployment mode. New clusters should not use ZooKeeper.
What you get:
- Fewer moving parts. One distributed system instead of two.
- Faster controller failover (seconds vs. tens of seconds).
- Simpler cluster operations.
- Higher metadata scalability — millions of partitions become tractable.
If you have a ZooKeeper-based cluster, KIP-866 documents the migration path. It is a real project, but the destination is worth it.
Tiered storage (KIP-405): the operational unlock
Kafka 3.6 shipped tiered storage (early access; production-ready since 3.9): partitions can keep hot data on local disks and cold data in object storage (S3, GCS, Azure Blob). This is the most important Kafka feature in years.
Why it matters operationally:
- Retention is no longer a disk-cost problem. You can retain months of data at S3 prices.
- Broker rebalances move less data. Cold segments stay in object storage; only hot data moves with brokers.
- Storage and compute scale independently. A small cluster with deep retention is suddenly viable.
Set local.retention.ms to keep recent data (hours-to-days) on local disk for low-latency consumption, and retention.ms to your real retention requirement. The broker handles the tiering.
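A sketch of enabling tiering on a topic, assuming the brokers already have a remote-storage plugin configured; the topic name and durations are illustrative:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class EnableTiering {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // hypothetical
        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "clickstream"); // illustrative
            List<AlterConfigOp> ops = List.of(
                // Requires remote storage configured cluster-side.
                new AlterConfigOp(new ConfigEntry("remote.storage.enable", "true"),
                                  AlterConfigOp.OpType.SET),
                // Hot: one day on local disk for low-latency consumption.
                new AlterConfigOp(new ConfigEntry("local.retention.ms", "86400000"),
                                  AlterConfigOp.OpType.SET),
                // Total: 90 days, with the cold tail in object storage.
                new AlterConfigOp(new ConfigEntry("retention.ms", "7776000000"),
                                  AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}
```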
Watch for: object-storage egress costs if consumers frequently fetch cold data. Tiered storage assumes hot reads dominate.
The metrics that actually matter
If you are setting up Kafka monitoring, these are the dashboards that pay rent:
Broker side:
- UnderReplicatedPartitions — should be zero. Non-zero for more than a few minutes is a problem.
- ActiveControllerCount — should be exactly 1 across the cluster.
- OfflinePartitionsCount — should be zero. Anything else is unavailability.
- RequestHandlerAvgIdlePercent — below 30% means brokers are CPU-saturated.
- LogFlushTimeMs — disk-write pressure.
Producer side:
- record-error-rate and record-retry-rate — non-zero is fine, growing is not.
- request-latency-avg — production latency, not just consumer lag.
- buffer-available-bytes — if it hits zero, your producer is back-pressured.
Consumer side:
- records-lag-max — the partition with the worst lag. The single most important consumer metric.
- commit-latency-avg — slow commits indicate broker-side or coordinator stress.
- Rebalance count per minute — should be near zero in steady state.
A useful rule: if you can’t answer “is consumer group X keeping up?” in five seconds, your monitoring is not good enough.
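As a sketch, that five-second answer can be scripted with the AdminClient; the group ID and address are illustrative:

```java
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class GroupLag {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // hypothetical
        try (Admin admin = Admin.create(props)) {
            // Committed offsets for the group (partitions it has committed on).
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets("order-enricher") // hypothetical group
                     .partitionsToOffsetAndMetadata().get();
            // Latest offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                admin.listOffsets(latestSpec).all().get();
            // Lag = log end offset minus committed offset, per partition.
            committed.forEach((tp, meta) ->
                System.out.printf("%s lag=%d%n", tp, latest.get(tp).offset() - meta.offset()));
        }
    }
}
```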
Operational hygiene
Run a separate cluster per failure domain you care about. Sharing a cluster between business-critical pipelines and dev experimentation is asking for outages.
Don’t share consumer groups across application versions. When deploying, give the new version a new group ID and let the old one drain. Mixing versions in one group means rebalances during deploys.
Set quotas. producer_byte_rate, consumer_byte_rate, and request rate quotas per client ID. Without them, one misbehaving client takes down everyone.
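A sketch of setting byte-rate quotas for one client ID via the AdminClient quota API; the client ID and limits are illustrative:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.quota.ClientQuotaAlteration;
import org.apache.kafka.common.quota.ClientQuotaEntity;

public class SetQuota {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // hypothetical
        try (Admin admin = Admin.create(props)) {
            // Quota keyed by client.id: 10 MiB/s produce, 20 MiB/s fetch.
            ClientQuotaEntity entity = new ClientQuotaEntity(
                Map.of(ClientQuotaEntity.CLIENT_ID, "ingest-service")); // hypothetical client
            ClientQuotaAlteration alteration = new ClientQuotaAlteration(entity, List.of(
                new ClientQuotaAlteration.Op("producer_byte_rate", 10.0 * 1024 * 1024),
                new ClientQuotaAlteration.Op("consumer_byte_rate", 20.0 * 1024 * 1024)));
            admin.alterClientQuotas(List.of(alteration)).all().get();
        }
    }
}
```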
Backup the config, not just the data. Topic configs, ACLs, consumer group offsets — these are the things you need to recreate the cluster.
Test broker failure regularly. Either chaos engineer it or do rolling broker restarts every few weeks. Brokers that don’t get restarted accumulate weird state.
When NOT to use Kafka
Kafka is fantastic at what it does well. It is the wrong tool for some things people use it for anyway:
- As a database. Compacted topics give you key-value semantics, but no secondary indexes, no transactions across keys, no joins. If you find yourself building those on top of Kafka, use a real database.
- As a generic message queue with low throughput, low partition count, complex routing. Pulsar, RabbitMQ, NATS, or even AWS SQS are often a better fit.
- For sub-millisecond pub/sub. Kafka’s per-record latency is great by streaming-system standards (~5-10 ms typical). For HFT-grade latency, look elsewhere.
- For ad-hoc point-to-point messaging. Kafka’s strength is durable, replayable, ordered logs. If you are using it as a temporary courier between two services, you have over-engineered.
The closing thought
Kafka rewards investment. The teams that go past “it works” and into “it works in the worst case we can construct” build platforms that other teams can rely on without thinking about Kafka. That is the goal: making the substrate so reliable that the people building on top of it never have to learn its quirks.
Get there with deliberate topic design, modern protocols (KRaft, cooperative rebalancing, tiered storage), boring observability, and the discipline to say no to the use cases Kafka is wrong for.