Architecture Blueprint · Proven

Event-Driven Architecture with Apache Kafka

A production-grade blueprint for enterprise event-driven systems — covering cluster design, topic strategy, producer/consumer patterns, schema management, error handling, and observability.

Budhisamvad Research · Jan 2026 · 15 min read · Includes architecture diagram
  • 10k+ events/sec: where Kafka becomes the right choice (practitioner threshold)
  • 3: minimum brokers for production high availability (Kafka reference architecture)
  • RF 3: replication factor for all critical topics (Kafka reference architecture)
  • 100%: share of production topics that should have a dead-letter queue (Budhisamvad standard)

Event-driven architecture with Kafka is appropriate when you have multiple services that need to react to the same events asynchronously, when you need to decouple producers from consumers, or when you need high-throughput, fault-tolerant message processing with replay capability. It is also one of the most over-adopted patterns in enterprise architecture — used where a simpler request-response model would serve better.

If you cannot articulate why you need event replay, multiple independent consumers, or throughput above ten thousand events per second, you probably do not need Kafka. You need a message queue, and the operational overhead of Kafka will cost you more than it returns.

The Kafka adoption test
Use this when
  • Multiple consumers need to react to the same event independently
  • High throughput — above 10,000 events per second
  • Audit trail and event replay capability are required
  • Loose coupling between producing and consuming services
Avoid when
  • Simple request-response communication would suffice
  • Low message volumes (under 1,000 events/sec)
  • Strong transactional consistency is the primary requirement
  • The team lacks Kafka operational expertise and has no time to build it
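
The checklist above can be expressed as a small decision helper. This is a minimal sketch, not a definitive rule: the `Workload` fields and the decision rule in `recommend` (any "avoid" signal, or no "use" signal, points to a simpler queue) are assumptions made for illustration. Only the thresholds come from the checklist.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a candidate workload (field names are illustrative)."""
    events_per_sec: int
    independent_consumers: int
    needs_replay: bool
    needs_strong_transactions: bool
    has_kafka_ops_capability: bool


def kafka_adoption_signals(w: Workload) -> dict:
    """Collect the 'use' and 'avoid' signals from the adoption test."""
    use, avoid = [], []
    if w.independent_consumers > 1:
        use.append("multiple independent consumers")
    if w.events_per_sec > 10_000:
        use.append("throughput above 10,000 events/sec")
    if w.needs_replay:
        use.append("audit trail and replay required")
    if w.events_per_sec < 1_000:
        avoid.append("low volume (under 1,000 events/sec)")
    if w.needs_strong_transactions:
        avoid.append("strong transactional consistency is primary")
    if not w.has_kafka_ops_capability:
        avoid.append("no Kafka operational expertise")
    return {"use": use, "avoid": avoid}


def recommend(w: Workload) -> str:
    """Assumed rule: any 'avoid' signal, or no 'use' signal, means a simpler queue."""
    signals = kafka_adoption_signals(w)
    if signals["avoid"] or not signals["use"]:
        return "message queue"
    return "kafka"
```

Returning the individual signals alongside the recommendation keeps the reasoning visible, which matters more in an architecture review than the final label.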
Architecture diagram — Enterprise Event-Driven Platform with Apache Kafka
[Diagram components]
  • Schema Registry: Avro / Protobuf schema versioning
  • Producers: Order Service, Payment Gateway, Inventory API, IoT Sensors, User Events
  • Apache Kafka Cluster: Broker 1, Broker 2, Broker 3, in KRaft mode (Kafka 3.x) with no ZooKeeper dependency
  • Topics (partitioned, replicated): orders.*, payments.*, inventory.*, notifications.*, audit.*, dead-letter-queue.* (failed messages)
  • Consumers: PostgreSQL / DB, Elasticsearch, Analytics (Spark), Notification Svc, Audit Ledger
  • Observability & Monitoring: Prometheus, Grafana, Kafka UI, lag monitoring, consumer group health, topic partition metrics
Practitioner insight
From the field: The most common Kafka failure in enterprise environments is operational, not technical. Teams stand up a cluster, get it working in a proof of concept, then discover that running Kafka in production requires dedicated expertise: partition rebalancing, consumer lag monitoring, broker capacity planning, and schema evolution governance. Budget for the operational capability, not just the cluster. For teams without a dedicated streaming platform team, a managed service (Confluent Cloud, Amazon MSK, or Azure Event Hubs through its Kafka-compatible endpoint) is frequently the right call.

Topic Design Patterns

Topic design is where most event-driven architectures succeed or fail. These patterns are the difference between a system that scales cleanly and one that becomes an unmaintainable tangle of poorly named topics with inconsistent ordering guarantees.

| Criterion | Pattern | Why it matters |
|---|---|---|
| Domain prefix naming | orders.created, orders.shipped | Enables team ownership and fine-grained ACLs per namespace |
| Past-tense events | user.registered, payment.processed | Events describe facts that occurred, not commands |
| Partition by entity key | Partition by customer_id | Guarantees all events for an entity are processed in order |
| Tiered retention | Transactional: 7 days. Audit: 7 years | Match retention to compliance and replay needs |
| Dead-letter queues | Every production topic needs a DLQ | Failed messages must never be silently dropped |
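
The naming and retention rows of the table can be turned into a small topic-spec helper. This is a sketch under stated assumptions: the regular expression only checks the `<domain>.<event>` shape (it cannot verify that the event is past tense), a year is approximated as 365 days, and the `dead-letter-queue.<topic>` name simply follows the `dead-letter-queue.*` pattern from the diagram. `retention.ms` is Kafka's topic-level retention setting; `topic_config` and the tier names are hypothetical.

```python
import re

# Assumed convention: lowercase "<domain>.<event>", e.g. "orders.created".
TOPIC_NAME = re.compile(r"^[a-z][a-z0-9_-]*\.[a-z][a-z0-9_-]*$")

MS_PER_DAY = 24 * 60 * 60 * 1000

# Retention tiers from the table; one year is approximated as 365 days.
RETENTION_MS = {
    "transactional": 7 * MS_PER_DAY,
    "audit": 7 * 365 * MS_PER_DAY,
}


def topic_config(name: str, tier: str) -> dict:
    """Build a topic spec with its DLQ name and retention.ms value."""
    if not TOPIC_NAME.match(name):
        raise ValueError(f"topic name {name!r} does not match <domain>.<event>")
    if tier not in RETENTION_MS:
        raise ValueError(f"unknown retention tier {tier!r}")
    return {
        "name": name,
        "dlq": f"dead-letter-queue.{name}",
        "config": {"retention.ms": str(RETENTION_MS[tier])},
    }
```

For example, `topic_config("orders.created", "transactional")` yields a `retention.ms` of `"604800000"`, which is 7 days in milliseconds.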
Watch out
Partitioning by a random key (or round-robin) destroys ordering guarantees. If event order matters for an entity — and it almost always does for financial or state-change events — you must partition by a stable entity key such as customer_id or order_id. This is one of the hardest mistakes to fix after the fact, because changing the partition key requires reprocessing the entire topic.
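
To see why a stable entity key preserves order, the routing can be simulated in a few lines. This is an illustrative sketch only: `partition_for` and `route` are hypothetical helpers, and CRC32 is a stand-in hash. Kafka's Java client hashes the serialized key with murmur2, so the partition numbers here will not match a real cluster.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition. CRC32 is a stand-in for Kafka's murmur2 hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events, num_partitions: int) -> dict:
    """Append each (key, payload) event to the partition chosen by its key."""
    partitions = defaultdict(list)
    for key, payload in events:
        partitions[partition_for(key, num_partitions)].append((key, payload))
    return partitions


if __name__ == "__main__":
    events = [
        ("customer-42", "created"),
        ("customer-7", "created"),
        ("customer-42", "paid"),
        ("customer-7", "cancelled"),
        ("customer-42", "shipped"),
    ]
    for partition, batch in sorted(route(events, 6).items()):
        print(partition, batch)
```

Because the same key always hashes to the same partition, every event for `customer-42` lands in one partition in its original order. The mapping depends on the partition count, which is part of why changing the key or the number of partitions later is so disruptive.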
Framework: The Dead-Letter Discipline
Every production topic gets a corresponding dead-letter queue. When message processing fails, the message goes to the DLQ: never silently dropped, never infinitely retried. DLQ growth rate is a first-class alerting metric. Any sustained DLQ growth is a service incident, because it means messages are failing to process and the business events they carry are not being acted on. Most teams discover they need this discipline only after losing data in production.
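
The discipline above (bounded retries, then the DLQ, never a silent drop) can be sketched without a broker. In this minimal example, plain Python lists stand in for Kafka topics, and `process_with_dlq`, the retry budget of 3, and the DLQ record shape are all assumptions made for illustration.

```python
MAX_ATTEMPTS = 3  # Assumed retry budget; the only requirement is that retries are bounded.


def process_with_dlq(messages, handler, dlq, max_attempts=MAX_ATTEMPTS):
    """Run handler on each message; after max_attempts failures, send it to the DLQ.

    Returns the number of messages processed successfully.
    """
    processed = 0
    for msg in messages:
        last_error = None
        for _attempt in range(max_attempts):
            try:
                handler(msg)
                processed += 1
                break
            except Exception as exc:  # Keep the error so the DLQ record explains the failure.
                last_error = exc
        else:
            # Every attempt failed: record the message with context instead of dropping it.
            dlq.append({
                "message": msg,
                "error": repr(last_error),
                "attempts": max_attempts,
            })
    return processed
```

Storing the error and attempt count with the failed message is a design choice that makes later triage and replay from the DLQ much easier than storing the payload alone.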

Production Implementation Sequence

  1. Provision a 3-broker cluster with rack awareness (Week 1)

    Minimum 3 brokers for production HA. Replication factor 3 for all critical topics. Enable rack awareness for multi-AZ deployment. Use KRaft mode (Kafka 3.3+) to eliminate the ZooKeeper dependency.

  2. Establish schema governance (Week 2)

    Deploy Confluent Schema Registry or AWS Glue Schema Registry. Enforce Avro or Protobuf — never plain JSON in production. Define a backward-compatibility policy owned by the producing team.

  3. Implement security baseline (Week 2–3)

    TLS encryption in transit. SASL authentication (SCRAM-SHA-256 or OAuth). ACLs per topic namespace per team. Audit logging for all admin operations.

  4. Build the observability stack before going live (Week 3–4)

    Consumer group lag monitoring, alerting when lag reaches 5 minutes. Broker metrics: under-replicated partitions, ISR size. Producer metrics: record-error-rate, request-latency-avg. DLQ growth-rate alerting.
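
The lag and DLQ alerts from the observability step can be sketched as pure functions over metric snapshots. This is a hedged illustration: `PartitionState`, its fields, and the definition of "sustained growth" as a strictly increasing series of samples are assumptions. In practice these values would come from your monitoring stack rather than being built by hand.

```python
from dataclasses import dataclass
from typing import Optional

LAG_ALERT_SECONDS = 5 * 60  # The 5-minute threshold from the observability step.


@dataclass
class PartitionState:
    """Hypothetical snapshot of one partition for one consumer group."""
    log_end_offset: int
    committed_offset: int
    oldest_unconsumed_ts: Optional[float]  # Epoch seconds; None when caught up.


def offset_lag(p: PartitionState) -> int:
    """Number of messages the group has not yet consumed."""
    return max(p.log_end_offset - p.committed_offset, 0)


def time_lag_seconds(p: PartitionState, now: float) -> float:
    """Age of the oldest unconsumed message, or 0 when the group is caught up."""
    if offset_lag(p) == 0 or p.oldest_unconsumed_ts is None:
        return 0.0
    return max(now - p.oldest_unconsumed_ts, 0.0)


def lag_alerts(partitions: dict, now: float) -> list:
    """Names of partitions whose time lag has reached the alert threshold."""
    return sorted(
        name for name, p in partitions.items()
        if time_lag_seconds(p, now) >= LAG_ALERT_SECONDS
    )


def dlq_growing(samples: list) -> bool:
    """Assumed definition of sustained growth: every sample exceeds the previous one."""
    return len(samples) >= 2 and all(b > a for a, b in zip(samples, samples[1:]))
```

Alerting on time lag rather than raw offset lag keeps the threshold meaningful across topics with very different throughput, since 800 unconsumed messages can be seconds of backlog on one topic and hours on another.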
