Brief Look at Kafka

What is Kafka

Before going into Kafka, lets do a brief overview of what it is, and what it’s most commonly used for.

Kafka is a messaging system, where it takes it large amount of data from various sources (Producers), and allows various services to read the data off it (Consumers). Central to the architecture is something called a Kafka Broker, which handles the messages coming in and out (just like a broker does).

Kafka was created by former LinkedIn data engineers, and the whole reason for that was to facilitate quick and scalable passing of messages.

Under The Hoods of Kafka

As mentioned above, the central piece is something called a Kafka broker, and within the broker are 2 more important components:

  1. Topics
  2. Partitions

How these gel together are: Kafka has a set of Topics, which are distinct logical grouping of messages. One topic could contain messages pertaining to the weather, while the other topics could have messags about the traffic.

Each Topic is then partitioned into… Partitions. This allows parallel access to the various messages in a Topic. To track where the reading of the Partitions have happened until, the Partitions are further broken down into Segments. Each Segment stores a certain amount of messages, before a new Segment is formed.

A single Kafka cluster can have multiple brokers for redundancy. Each Parition is replicated across the various brokers. There is then a concept of a Partition leader and followers, where one of the replicas handle the read/write requests (thereby changing the index and data appended), while the rest of the replicas copy the changes made to the leader.

Partition Reading and Zero Copy

Tracking of the read data is handled by the consumer. The consumer is responsible for tracking the offset position of the log, which is where it was last read until. After reading the messages, the consumer has a choice of how to move the offset, be it linearly downwards, or resetting to the top to reread all the messages.

These Partitions are read only, and append only. And because we have an index keeping track of where the last message was read, message retrival is exactly O(1). Another reason for Kafka’s speed of data transfer is due its adoption of Zero-Copy, which is a kernel level transferof data, instead of going through the User Land, and Kernel Space. Zero-Copy

Partition Writing

Within each topic, writing to a partition is done in a round robin manner. And as mentioned, writing to a partition is append only. This means that a producer wanting to write also operates at O(1)

The producer can also specify which partition to write to by attaching a key to the message. All records with the same key will go to the same partition.

Before a producer write to Kafka, it has to request for metadata, which tells it who is the leader to write to. One common error is setting the key to null or the same, and all the messages end up in the same partition, which defeats scalability.

Consumers and Consumer Groups

Low-Level Consumer is a single consumer, while a High-Level Consumer is a group of consumers, called a Consumer Group.

The broker dictates which consumer should read from which partition. 1 partition can only be read by 1 consumer, while 1 consumer can read from many partitions (Many-to-One). Because of this, when your number of consumers = number of partitions, optimization makes it One-to-One relationship. Adding more consumers are useless, as each partition is already occupied by a single consumer.

Messages are never pushed down to the consumers, but are always pulled.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s