If you are getting into data engineering, you have almost certainly come across the name Apache Kafka.
It shows up in job descriptions, system architecture diagrams, and data engineering roadmaps constantly. But for many beginners, it can feel abstract, intimidating, and full of unfamiliar terms like producers, consumers, topics, partitions, offsets, and brokers.
The good news is that once you understand the core idea behind Kafka, everything else clicks into place.
In this guide, we will break down Apache Kafka from the ground up: what it is, why it exists, how it works, and how it is used in real-world data engineering systems, all in a clear, practical, beginner-friendly way.
What Is Apache Kafka?
Apache Kafka is an open-source distributed event streaming platform designed to handle real-time data streams at massive scale.
In simpler terms, Kafka is a system that lets applications send, store, and receive streams of data reliably and at very high speed.
It was originally built by engineers at LinkedIn in 2010 to handle their internal data pipeline challenges, where it processed billions of events per day across hundreds of services. It was open-sourced through the Apache Software Foundation in 2011 and has since become one of the most widely used data infrastructure tools in the world.
Simple Analogy
Think of Kafka like a post office for data.
- Applications that generate data are like people dropping letters into a mailbox (producers)
- Kafka is the post office that receives, stores, and organizes those letters (brokers)
- Applications that need that data are like people picking up their mail (consumers)
- Different types of mail go into different mailboxes (topics)
The post office does not care who sent the letter or when it will be picked up. It stores it safely until the recipient is ready to collect it.
Why Does Kafka Exist — The Problem It Solves
To understand why Kafka matters, you need to understand the problem it was built to solve.
The Traditional Data Pipeline Problem
In a traditional system, applications communicate directly with each other. If you have 10 applications that all need to send data to 5 other applications, you end up with 50 point-to-point connections, each one custom-built and maintained separately.
This creates:
- Tight coupling — If one system goes down, others are affected
- Scalability problems — Adding new consumers means modifying producers
- Data loss risk — If a consumer is slow or offline, data the producer sends in the meantime can be lost
- Complexity — 50 connections to monitor, debug, and maintain
How Kafka Solves This
Kafka sits in the middle of your data infrastructure as a central hub. Producers send data to Kafka. Consumers read data from Kafka. Producers and consumers never talk to each other directly.
This means:
- Loose coupling — Producers and consumers are completely independent
- Scalability — Add new consumers without changing producers at all
- Durability — Data is stored in Kafka for a configurable period. Consumers can read it whenever they are ready
- Simplicity — Instead of 50 connections, every system connects to one place
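The arithmetic behind that last point is worth making concrete. A quick sketch using the hypothetical system counts from above (10 producers, 5 consumers):

```python
# Point-to-point: every producer integrates directly with every consumer
producers, consumers = 10, 5
point_to_point = producers * consumers  # 50 custom connections to maintain

# Hub model: each system maintains a single connection to Kafka
hub = producers + consumers  # 15 connections

print(point_to_point, hub)  # 50 15
```

The cost of the point-to-point approach grows multiplicatively as systems are added; the hub approach grows only additively.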
Core Concepts of Apache Kafka
Before going into architecture and setup, you need to understand the fundamental building blocks of Kafka.
1. Event (Message)
An event is the fundamental unit of data in Kafka. Every piece of data that flows through Kafka is an event.
An event has three main parts:
- Key — An optional identifier for the event (used for partitioning)
- Value — The actual data payload (the content of the message)
- Timestamp — When the event was created or received
Example event:
```json
{
  "key": "user_123",
  "value": {
    "user_id": "user_123",
    "action": "purchase",
    "product_id": "prod_456",
    "amount": 89.99,
    "timestamp": "2024-01-15T14:30:00Z"
  },
  "timestamp": 1705327800000
}
```
Events are also called messages or records. These terms are used interchangeably in Kafka documentation.
2. Topic
A topic is a named category or feed to which events are written and from which events are read.
Think of a topic as a folder or channel for a specific type of data. You create different topics for different types of events.
Example topics in an e-commerce system:
- user-events — User clicks, page views, sign-ups
- order-events — Order placed, order shipped, order delivered
- payment-events — Payment initiated, payment completed, payment failed
- inventory-events — Stock updated, item added, item removed
Producers write to specific topics. Consumers read from specific topics. A topic can have many producers writing to it and many consumers reading from it simultaneously.
3. Partition
A partition is how Kafka achieves scalability and parallelism within a topic.
Each topic is divided into one or more partitions. Each partition is an ordered, immutable sequence of events. Events within a partition are assigned a sequential number called an offset.
Think of a topic as a book and partitions as chapters. Multiple readers can read different chapters simultaneously.
Why partitions matter:
- Scalability — A topic with 10 partitions can be processed by up to 10 consumers simultaneously
- Ordering — Events within a partition are strictly ordered. Events across partitions are not
- Throughput — More partitions = higher throughput because work is distributed across more machines
How events are assigned to partitions:
- If the event has a key, Kafka uses a hash of the key to consistently assign all events with the same key to the same partition
- If there is no key, Kafka distributes events across partitions in a round-robin fashion
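The two assignment rules can be sketched with a toy partitioner. This is an illustration only: Kafka's actual default partitioner hashes the key bytes with murmur2, and since Kafka 2.4 keyless events use a "sticky" strategy rather than strict round-robin, so real partition numbers will differ.

```python
import hashlib

def assign_partition(key, num_partitions, counter):
    """Toy partitioner: keyed events hash to a stable partition,
    keyless events rotate round-robin via a mutable counter.
    (Kafka's real default hashes key bytes with murmur2.)"""
    if key is not None:
        # Stable hash: the same key always lands on the same partition
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % num_partitions
    partition = counter[0] % num_partitions
    counter[0] += 1
    return partition

counter = [0]
p1 = assign_partition("user_123", 3, counter)
p2 = assign_partition("user_123", 3, counter)
assert p1 == p2  # same key -> same partition, preserving per-key ordering
print([assign_partition(None, 3, counter) for _ in range(4)])  # [0, 1, 2, 0]
```

The stable-hash property is what makes per-key ordering possible: every event for `user_123` lands in one partition, and within that partition order is guaranteed.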
4. Offset
An offset is a unique sequential number assigned to each event within a partition. Offsets start at 0 and increment by 1 for each new event.
Offsets serve two critical purposes:
- Ordering — Events within a partition are read in offset order, guaranteeing sequence
- Consumer tracking — Consumers track which offset they have read up to, so they know where to resume if they stop and restart
Think of the offset like a bookmark in a book. It tells you exactly where you left off.
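The bookmark idea can be modeled in a few lines of plain Python. This is a toy in-memory stand-in, not how Kafka actually stores data on disk:

```python
class ToyPartition:
    """A toy model of one partition: an append-only list where
    each event's list index is its offset."""
    def __init__(self):
        self.log = []

    def append(self, event):
        self.log.append(event)
        return len(self.log) - 1  # the new event's offset

    def read_from(self, offset):
        # Consumers can re-read from any retained offset (replay)
        return self.log[offset:]

p = ToyPartition()
for action in ["page_view", "add_to_cart", "purchase"]:
    p.append(action)

bookmark = 1  # a consumer that committed offset 1 resumes here
print(p.read_from(bookmark))  # ['add_to_cart', 'purchase']
```

Because the log is append-only and offsets never change, a consumer that restarts with a saved offset picks up exactly where it left off.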
5. Producer
A producer is any application or service that writes events to a Kafka topic.
Producers push data into Kafka without caring who reads it or when. They just publish events and move on.
Examples of producers:
- A web application writing user click events
- A payment service writing transaction events
- An IoT sensor writing temperature readings
- A database change data capture (CDC) system writing database changes
6. Consumer
A consumer is any application or service that reads events from a Kafka topic.
Consumers pull data from Kafka at their own pace. They maintain their position (offset) so they can resume exactly where they left off if they restart.
Examples of consumers:
- A fraud detection service reading payment events
- An analytics dashboard reading user behavior events
- A data warehouse loader reading transaction events
- An email service reading order completion events
7. Consumer Group
A consumer group is a collection of consumers that work together to consume a topic.
Each partition in a topic is assigned to exactly one consumer in a consumer group at any given time. This enables parallel processing: multiple consumers handle different partitions simultaneously.
Key rules of consumer groups:
- If you have 4 partitions and 4 consumers in a group → each consumer handles 1 partition
- If you have 4 partitions and 2 consumers in a group → each consumer handles 2 partitions
- If you have 4 partitions and 6 consumers in a group → 4 consumers are active, 2 are idle (you cannot have more active consumers than partitions)
Different consumer groups can read the same topic completely independently and each group maintains its own offset. This means the same event can be consumed by multiple groups for different purposes.
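The assignment rules above can be sketched with a simple round-robin function. This is a simplification: Kafka's actual assignors (such as range and cooperative-sticky) are configurable and more involved.

```python
def assign_partitions(num_partitions, consumers):
    """Round-robin assignment of partitions to the consumers in one group:
    each partition gets exactly one owner; surplus consumers stay idle."""
    assignment = {c: [] for c in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment

# 4 partitions, 2 consumers: each consumer handles 2 partitions
print(assign_partitions(4, ["c1", "c2"]))
# {'c1': [0, 2], 'c2': [1, 3]}

# 4 partitions, 6 consumers: two consumers sit idle
print(assign_partitions(4, ["c1", "c2", "c3", "c4", "c5", "c6"]))
# {'c1': [0], 'c2': [1], 'c3': [2], 'c4': [3], 'c5': [], 'c6': []}
```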
8. Broker
A broker is a Kafka server — a machine that stores and serves events.
A Kafka cluster is made up of one or more brokers working together. Each broker stores a subset of the partitions across all topics.
Brokers handle:
- Receiving events from producers
- Storing events on disk
- Serving events to consumers
- Replicating data to other brokers for fault tolerance
9. Cluster
A Kafka cluster is a group of brokers working together as a single system.
Clusters provide:
- Fault tolerance — If one broker fails, others take over
- Scalability — Add more brokers to handle more data
- High availability — Data is replicated across multiple brokers
10. ZooKeeper and KRaft
Historically, Kafka used Apache ZooKeeper to manage cluster metadata: tracking which brokers are alive, which broker is the leader for each partition, and configuration information.
Kafka 2.8 introduced KRaft mode (Kafka Raft), a built-in consensus mechanism that eliminates the dependency on ZooKeeper entirely. KRaft is the default in Kafka 3.3+ and is the direction all new Kafka deployments should follow.
Kafka Architecture — How It All Fits Together
Now that you understand the individual components, let us see how they work together.
```
PRODUCERS                  KAFKA CLUSTER                    CONSUMERS

[Web App] ──────────►  [Topic: user-events]    ◄──────────── [Analytics Service]
                         Partition 0
[Mobile App] ────────►   Partition 1           ◄──────────── [Recommendation Engine]
                         Partition 2
[API Service] ───────►                         ◄──────────── [Data Warehouse Loader]

[Payment Service] ──►  [Topic: payment-events] ◄──────────── [Fraud Detection Service]
                         Partition 0
                         Partition 1           ◄──────────── [Notification Service]
```
The flow:
- Producers publish events to specific topics
- Kafka brokers receive the events and write them to the appropriate partition
- Events are stored durably on disk for a configurable retention period (days, weeks, or indefinitely)
- Consumers connect to their topics and pull events at their own pace
- Each consumer group tracks its own offset — knowing exactly where it left off
- Multiple consumer groups can read the same topic independently
Kafka vs Traditional Message Queues
Many beginners confuse Kafka with traditional message queues like RabbitMQ or ActiveMQ. While they solve related problems, they work very differently.
| Feature | Traditional Message Queue | Apache Kafka |
|---|---|---|
| Message Retention | Deleted after consumption | Retained for configurable period |
| Consumer Model | Push — broker pushes to consumer | Pull — consumer pulls from broker |
| Ordering | Within a queue | Within a partition |
| Replay Messages | No — gone once consumed | Yes — re-read from any offset |
| Scalability | Moderate | Extremely high |
| Multiple Consumers | Competing consumers | Multiple independent consumer groups |
| Throughput | Moderate | Millions of events per second |
| Use Case | Task queues, job processing | Event streaming, data pipelines |
| Storage | In-memory or short term | Disk-based, long term |
The biggest difference is retention and replayability. In a traditional queue, once a consumer reads a message, it is gone. In Kafka, events are retained and can be replayed from any point, which makes it fundamentally different in the data architectures it enables.
Kafka Retention — How Long Is Data Kept?
Kafka stores events on disk and keeps them for a configurable retention period. This is one of Kafka’s most powerful features.
Time-Based Retention
```properties
# Keep events for 7 days (default)
log.retention.hours=168

# Keep events for 30 days
log.retention.hours=720
```
Size-Based Retention
```properties
# Keep up to 1GB per partition
log.retention.bytes=1073741824
```
Compact Topics
For some use cases, you want to keep only the latest event for each key rather than all historical events. This is called log compaction.
```properties
# Enable log compaction
cleanup.policy=compact
```
Log compaction is useful for maintaining the current state of an entity like the latest user profile or current inventory level.
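The compaction rule, keep only the latest value per key, can be sketched in plain Python. This is a simplification: real compaction runs in the background per log segment, and tombstones are retained for a configurable period (delete.retention.ms) before disappearing.

```python
def compact(log):
    """Keep only the latest value per key. A (key, None) entry is a
    'tombstone' that deletes the key from the compacted view."""
    latest = {}
    for key, value in log:
        latest[key] = value  # later entries overwrite earlier ones
    return [(k, v) for k, v in latest.items() if v is not None]

log = [
    ("user_123", {"tier": "free"}),
    ("user_456", {"tier": "pro"}),
    ("user_123", {"tier": "pro"}),  # newer state for user_123
    ("user_456", None),             # tombstone: remove user_456
]
print(compact(log))  # [('user_123', {'tier': 'pro'})]
```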
Getting Started With Kafka — Local Setup
Let us set up a basic Kafka environment locally so you can follow along with practical examples.
Prerequisites
- Java 11 or higher installed
- Terminal / command line access
Step 1: Download Kafka
```bash
# Download Kafka (check kafka.apache.org for latest version)
wget https://downloads.apache.org/kafka/3.6.0/kafka_2.13-3.6.0.tgz

# Extract
tar -xzf kafka_2.13-3.6.0.tgz
cd kafka_2.13-3.6.0
```
Step 2: Start Kafka in KRaft Mode
```bash
# Generate a cluster UUID
KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"

# Format the storage directory
bin/kafka-storage.sh format -t $KAFKA_CLUSTER_ID -c config/kraft/server.properties

# Start Kafka broker
bin/kafka-server-start.sh config/kraft/server.properties
```
Step 3: Create a Topic
```bash
# Create a topic called 'user-events' with 3 partitions and replication factor 1
bin/kafka-topics.sh --create \
  --topic user-events \
  --partitions 3 \
  --replication-factor 1 \
  --bootstrap-server localhost:9092
```
Step 4: List Topics
```bash
bin/kafka-topics.sh --list --bootstrap-server localhost:9092
```
Step 5: Describe a Topic
```bash
bin/kafka-topics.sh --describe \
  --topic user-events \
  --bootstrap-server localhost:9092
```
Output:
```
Topic: user-events   Partitions: 3   ReplicationFactor: 1
    Partition: 0   Leader: 1   Replicas: 1   Isr: 1
    Partition: 1   Leader: 1   Replicas: 1   Isr: 1
    Partition: 2   Leader: 1   Replicas: 1   Isr: 1
```
Producing and Consuming Messages From the Command Line
Producing Messages
```bash
# Start a console producer
bin/kafka-console-producer.sh \
  --topic user-events \
  --bootstrap-server localhost:9092

# Type messages and press Enter to send each one
> {"user_id": "123", "action": "page_view", "page": "/home"}
> {"user_id": "456", "action": "purchase", "product": "laptop"}
> {"user_id": "123", "action": "logout"}
```
Consuming Messages
Open a new terminal window:
```bash
# Start a console consumer — reads from the beginning
bin/kafka-console-consumer.sh \
  --topic user-events \
  --from-beginning \
  --bootstrap-server localhost:9092
```
You will immediately see all the messages you produced. Any new messages produced will appear in real time.
Consuming with a Consumer Group
```bash
bin/kafka-console-consumer.sh \
  --topic user-events \
  --group analytics-service \
  --bootstrap-server localhost:9092
```
Using Kafka With Python
In real data engineering projects, you interact with Kafka programmatically. The most popular Python library for Kafka is kafka-python.
Installation
```bash
pip install kafka-python
```
Python Producer Example
```python
from kafka import KafkaProducer
import json
import time

# Create a producer
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    key_serializer=lambda k: k.encode('utf-8') if k else None
)

# Produce events
events = [
    {"user_id": "user_123", "action": "page_view", "page": "/home"},
    {"user_id": "user_456", "action": "purchase", "product_id": "prod_789", "amount": 149.99},
    {"user_id": "user_123", "action": "add_to_cart", "product_id": "prod_101"},
    {"user_id": "user_789", "action": "sign_up", "email": "newuser@example.com"},
]

for event in events:
    # Use user_id as the key so all events from the same user go to the same partition
    key = event.get("user_id")
    future = producer.send(
        topic='user-events',
        key=key,
        value=event
    )
    # Optional: wait for confirmation
    record_metadata = future.get(timeout=10)
    print(f"Sent to topic={record_metadata.topic} "
          f"partition={record_metadata.partition} "
          f"offset={record_metadata.offset}")
    time.sleep(0.5)

# Flush and close
producer.flush()
producer.close()
print("All events produced successfully")
```
Output:
```
Sent to topic=user-events partition=2 offset=0
Sent to topic=user-events partition=0 offset=0
Sent to topic=user-events partition=2 offset=1
Sent to topic=user-events partition=1 offset=0
All events produced successfully
```
Notice that user_123 events go to the same partition (partition 2) because they share the same key, guaranteeing ordered processing for that user.
Python Consumer Example
```python
from kafka import KafkaConsumer
import json

# Create a consumer
consumer = KafkaConsumer(
    'user-events',
    bootstrap_servers=['localhost:9092'],
    group_id='analytics-consumer-group',
    auto_offset_reset='earliest',     # Start from beginning if no offset saved
    enable_auto_commit=True,          # Automatically commit offsets
    auto_commit_interval_ms=1000,     # Commit every second
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
    key_deserializer=lambda k: k.decode('utf-8') if k else None
)

print("Consumer started. Waiting for messages...")

# Poll for messages
try:
    for message in consumer:
        print("\n--- New Event Received ---")
        print(f"Topic: {message.topic}")
        print(f"Partition: {message.partition}")
        print(f"Offset: {message.offset}")
        print(f"Key: {message.key}")
        print(f"Value: {message.value}")

        # Process the event
        event = message.value
        if event.get('action') == 'purchase':
            print(f">>> Purchase detected! Amount: ${event.get('amount')}")
except KeyboardInterrupt:
    print("\nConsumer stopped by user")
finally:
    consumer.close()
    print("Consumer closed")
```
Python Consumer with Manual Offset Commit
For more control over when offsets are committed, you can disable auto-commit. This is useful when you need to ensure processing is complete before acknowledging:
```python
from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    'user-events',
    bootstrap_servers=['localhost:9092'],
    group_id='payment-processor-group',
    auto_offset_reset='earliest',
    enable_auto_commit=False,  # Manual offset management
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)

try:
    for message in consumer:
        try:
            # Process the message
            event = message.value
            print(f"Processing event: {event}")

            # Simulate processing logic
            if event.get('action') == 'purchase':
                # process_payment(event)
                print(f"Payment processed for user {event.get('user_id')}")

            # Only commit after successful processing
            consumer.commit()
            print(f"Offset committed: partition={message.partition} offset={message.offset}")
        except Exception as e:
            print(f"Error processing message: {e}")
            # Do NOT commit: the message will be reprocessed on restart
except KeyboardInterrupt:
    print("Stopping consumer")
finally:
    consumer.close()
```
Kafka Replication — How Fault Tolerance Works
One of Kafka’s most important features is replication. This means copying partition data across multiple brokers so that if one broker fails, no data is lost.
Replication Factor
The replication factor determines how many copies of each partition exist across the cluster.
```bash
# Create a topic with replication factor 3
# Each partition will have 3 copies on 3 different brokers
bin/kafka-topics.sh --create \
  --topic payment-events \
  --partitions 6 \
  --replication-factor 3 \
  --bootstrap-server localhost:9092
```
Leader and Followers
For each partition, one broker is the leader and the others are followers.
- Producers always write to the leader
- Consumers always read from the leader (by default)
- Followers replicate data from the leader and stand ready to take over
- If the leader fails, one of the followers is automatically elected as the new leader
This happens automatically and transparently; producers and consumers do not need to know which broker is the leader.
ISR — In-Sync Replicas
The ISR (In-Sync Replicas) is the set of replicas that are fully caught up with the leader. Kafka only acknowledges a write as successful when the required number of ISR replicas have confirmed they have received the data.
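The acknowledgment rule, together with the related min.insync.replicas setting, can be sketched as follows. This is an illustrative simplification of the broker's actual logic, not a faithful implementation of the replication protocol.

```python
def ack_write(confirmations, isr_size, min_insync_replicas):
    """Illustrative acknowledgment rule for a fully durable write:
    reject if the ISR has shrunk below min.insync.replicas, otherwise
    acknowledge once every in-sync replica has confirmed the write."""
    if isr_size < min_insync_replicas:
        return "error: not enough replicas"
    if confirmations >= isr_size:
        return "ack"
    return "pending"

print(ack_write(confirmations=3, isr_size=3, min_insync_replicas=2))  # ack
print(ack_write(confirmations=2, isr_size=3, min_insync_replicas=2))  # pending
print(ack_write(confirmations=1, isr_size=1, min_insync_replicas=2))  # error: not enough replicas
```

The third case shows why min.insync.replicas matters: with too few healthy replicas, Kafka refuses the write rather than silently accepting data that could be lost.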
Kafka Delivery Guarantees
Kafka supports three levels of delivery guarantee, controlled by producer configuration.
At-Most-Once (Fastest, Possible Data Loss)
```python
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    acks=0  # Don't wait for any acknowledgment
)
```
The producer fires and forgets. If the broker crashes before storing the message, it is lost. Use only when occasional data loss is acceptable like logging or metrics.
At-Least-Once (Default, Possible Duplicates)
```python
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    acks='all',  # Wait for all ISR replicas to acknowledge
    retries=3    # Retry on failure
)
```
The producer retries on failure, so every message is delivered but network issues can cause duplicate deliveries. The consumer must be idempotent — able to handle duplicates gracefully.
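One common way to make a consumer idempotent is to track processed event IDs and skip redeliveries. A minimal in-memory sketch, assuming each event carries a unique event_id field (an assumption about the message format; in production the seen-set would live in a durable store such as a database):

```python
def make_idempotent_handler(process):
    """Wrap a processing function so that redelivered events are applied
    only once. Tracks seen IDs in memory for illustration; production
    code would use a durable store instead."""
    seen = set()
    def handle(event):
        event_id = event["event_id"]  # assumes a unique id per event
        if event_id in seen:
            return "skipped duplicate"
        seen.add(event_id)
        process(event)
        return "processed"
    return handle

charged = []
handler = make_idempotent_handler(lambda e: charged.append(e["amount"]))
event = {"event_id": "evt-1", "user_id": "user_456", "amount": 149.99}
print(handler(event))  # processed
print(handler(event))  # skipped duplicate (the redelivery is a no-op)
print(charged)         # [149.99]
```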
Exactly-Once (Strongest, Most Complex)
```python
from kafka import KafkaProducer

# Note: the transactional APIs below require a recent kafka-python release
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    acks='all',
    enable_idempotence=True,  # Prevents duplicate writes from retries
    transactional_id='my-producer-1'
)

producer.init_transactions()
try:
    producer.begin_transaction()
    producer.send('output-topic', key=b'key', value=b'value')
    producer.commit_transaction()
except Exception:
    producer.abort_transaction()
```
Exactly-once guarantees that each message is delivered and processed exactly one time — no duplicates, no losses. This requires both producer idempotence and transactional APIs.
Real-World Kafka Use Cases in Data Engineering
1. Real-Time Data Pipelines
Kafka is the backbone of most modern real-time data pipelines. Data flows from source systems through Kafka into data warehouses, data lakes, or analytics systems.
```
[Source Systems] → [Kafka] → [Spark Streaming / Flink] → [Data Warehouse]
                                                       → [Elasticsearch]
                                                       → [Real-time Dashboard]
```
2. Change Data Capture (CDC)
Kafka Connect with Debezium captures every change (insert, update, delete) from a database and streams it into Kafka topics — enabling real-time synchronization between systems.
```
[PostgreSQL Database] → [Debezium CDC] → [Kafka] → [Analytics DB]
                                                 → [Search Index]
                                                 → [Cache Invalidation]
```
3. Event-Driven Microservices
Instead of microservices calling each other directly, they communicate through Kafka topics. Completely decoupled and independently scalable.
```
[Order Service] → [order-placed topic] → [Payment Service]
                                       → [Inventory Service]
                                       → [Notification Service]
                                       → [Analytics Service]
```
4. Real-Time Fraud Detection
Financial transactions flow through Kafka. A fraud detection consumer reads each transaction in milliseconds, applies ML models, and triggers alerts for suspicious activity in real time.
```
[Payment Gateway] → [Kafka: payment-events] → [Fraud Detection ML Model]
                                            → [Transaction Logger]
                                            → [Compliance Reporting]
```
5. IoT Data Processing
Thousands of IoT sensors send readings to Kafka every second. Kafka handles the massive ingestion volume while downstream consumers process, aggregate, and store the data.
```
[10,000 IoT Sensors] → [Kafka: sensor-readings] → [Anomaly Detection]
                                                → [Time Series DB]
                                                → [Operations Dashboard]
```
6. Log Aggregation
Multiple application servers send their logs to Kafka. A centralized consumer collects all logs into a searchable log management system like Elasticsearch.
```
[App Server 1] → [Kafka: application-logs] → [Elasticsearch / Kibana]
[App Server 2] →                           → [Alerting System]
[App Server 3] →                           → [Long-term Storage]
```
Kafka Ecosystem — Tools You Should Know
Kafka does not work in isolation. It has a rich ecosystem of complementary tools.
Kafka Connect
A framework for building connectors that move data between Kafka and external systems without writing custom producer/consumer code.
- Source connectors — Pull data FROM external systems INTO Kafka (databases, file systems, APIs)
- Sink connectors — Push data FROM Kafka INTO external systems (databases, data warehouses, search engines)
Popular connectors: Debezium (CDC), JDBC, S3, Elasticsearch, BigQuery, Snowflake
Kafka Streams
A Java library for building real-time stream processing applications that read from and write to Kafka topics.
```java
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> stream = builder.stream("user-events");

stream
    .filter((key, value) -> value.contains("purchase"))
    .mapValues(value -> processEvent(value))
    .to("purchase-events");
```
ksqlDB
This is an SQL interface for stream processing on top of Kafka. It allows you to write SQL queries against real-time data streams without writing Java or Python code.
```sql
-- Create a stream from a Kafka topic
CREATE STREAM user_events (
    user_id VARCHAR,
    action VARCHAR,
    amount DOUBLE
) WITH (KAFKA_TOPIC='user-events', VALUE_FORMAT='JSON');

-- Query: Count purchases per user in real time
SELECT user_id, COUNT(*) AS purchase_count
FROM user_events
WHERE action = 'purchase'
GROUP BY user_id
EMIT CHANGES;
```
Schema Registry
This is a service that manages and enforces schemas for Kafka messages. It ensures that producers and consumers agree on data structure, using Avro, Protobuf, or JSON Schema formats.
Key Kafka Concepts
| Concept | What It Is | Analogy |
|---|---|---|
| Event/Message | Unit of data in Kafka | A letter |
| Topic | Named category of events | A mailbox |
| Partition | Subdivision of a topic | A section in the mailbox |
| Offset | Position of event in partition | Page number in a book |
| Producer | App that writes events | Person sending mail |
| Consumer | App that reads events | Person receiving mail |
| Consumer Group | Team of consumers sharing work | Team of mail sorters |
| Broker | Kafka server | Post office building |
| Cluster | Group of brokers | Post office network |
| Replication | Copies of partitions across brokers | Backup copies of mail |
Advantages and Disadvantages of Apache Kafka
Advantages
- Handles millions of events per second with extremely high throughput
- Events are stored durably on disk and can be replayed from any point
- Multiple consumer groups can independently read the same topic
- Horizontally scalable — add more brokers and partitions as needed
- Fault tolerant through replication across multiple brokers
- Decouples producers and consumers completely
- Rich ecosystem — Connect, Streams, ksqlDB, Schema Registry
Disadvantages
- Steep learning curve for beginners — many new concepts to understand
- Operational complexity — managing clusters, monitoring, tuning requires expertise
- Not designed for simple request-reply patterns — there are better tools for that
- Per-message overhead — Kafka is optimized for high-throughput batches, not for tiny individual messages sent one at a time
- Requires careful partition planning — too few partitions limits scalability, too many has overhead
- Exactly-once semantics add significant complexity
Common Mistakes to Avoid
- Creating too few partitions — You can increase the partition count later, but doing so changes key-based partition assignment and therefore ordering. Plan your partition count based on expected throughput from the start
- Not setting a retention policy — By default, Kafka retains data for 7 days. Make sure this matches your business requirements and storage budget
- Ignoring consumer group offsets — Always monitor consumer lag — the difference between the latest offset and the consumer’s current offset. High lag means your consumers are falling behind
- Using too many small topics — Each topic and partition has overhead. Group related event types together and use event type fields inside the message to differentiate
- Not handling consumer rebalancing — When consumers join or leave a group, Kafka triggers a rebalance that briefly pauses consumption. Make sure your processing logic handles this gracefully
- Skipping schema management — Without Schema Registry, producers and consumers can break each other by changing message formats. Always use a schema management approach in production
- Setting replication factor to 1 in production — A replication factor of 1 means a single broker failure causes data loss. Always use at least 3 in production
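Consumer lag from the third point is simple to compute once you have the offsets; in kafka-python they can be fetched with consumer.end_offsets() and consumer.committed(). A sketch of the calculation itself, using hypothetical offset numbers:

```python
def consumer_lag(end_offsets, committed_offsets):
    """Lag per partition = latest (end) offset minus the group's last
    committed offset; a partition with no committed offset counts from 0."""
    return {
        partition: end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in end_offsets
    }

# Hypothetical offset snapshot for a 3-partition topic
end = {0: 1500, 1: 1498, 2: 1510}
committed = {0: 1500, 1: 1200, 2: 1505}
print(consumer_lag(end, committed))  # {0: 0, 1: 298, 2: 5}
```

A steadily growing lag on a partition (like partition 1 above) is the signal that consumers are falling behind and the group needs more processing capacity.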
Apache Kafka is one of the most important technologies in modern data engineering. Once you understand its core concepts (events, topics, partitions, producers, consumers, and consumer groups), the rest of the ecosystem falls into place naturally.
Here is a quick recap of everything we covered:
- Kafka is a distributed event streaming platform for handling real-time data at scale
- Events are organized into topics, which are divided into partitions for scalability
- Producers write events to topics. Consumers read events from topics at their own pace
- Consumer groups enable parallel processing — each partition is handled by one consumer in a group
- Kafka retains events durably on disk, enabling replay and multiple independent consumers
- Replication across brokers provides fault tolerance and high availability
- The Kafka ecosystem includes Connect, Streams, ksqlDB, and Schema Registry for building complete data pipelines
- Kafka is the foundation of real-time data pipelines, event-driven architectures, CDC, fraud detection, IoT processing, and much more
Start with the core concepts, practice locally with the command-line tools, then build your first Python producer and consumer. From there, the path to production Kafka deployments and advanced stream processing becomes clear.
FAQs
What is Apache Kafka used for?
Kafka is used for real-time data streaming and event-driven architectures including data pipelines, microservice communication, change data capture, fraud detection, IoT data processing, and log aggregation.
Is Kafka a message queue?
Kafka is often compared to message queues but is fundamentally different. Unlike traditional queues, Kafka retains messages after consumption, supports multiple independent consumer groups, and is designed for much higher throughput and long-term storage.
What is a Kafka topic?
A topic is a named category in Kafka where producers write events and consumers read events. Topics are divided into partitions for scalability and parallelism.
What is the difference between a partition and a topic in Kafka?
A topic is a logical category for events. A partition is a physical subdivision of a topic that enables parallel processing. Each partition is an ordered, append-only log of events stored on a specific broker.
What programming languages can I use with Kafka?
Kafka has official clients for Java and Scala, and community clients for Python, Go, JavaScript, .NET, Ruby, and many others. The kafka-python library is the most popular Python client.
What is a consumer group in Kafka?
A consumer group is a set of consumers that work together to consume a topic in parallel. Each partition is assigned to exactly one consumer in the group, enabling high-throughput parallel processing.