Pages

December 01, 2019

Kafka Part 1: Basics

What is Apache Kafka?
Apache Kafka is a publish-subscribe messaging system developed by Apache written in Scala. It is a distributed, partitioned and replicated log service. It is horizontally scalable, fault tolerant and fast messaging system.

Why Kafka?
Let's say we have a source system and a target system, where in the source consumes data from target system. In a simplest case, we have one source and one target system,so it would be easy for a source system to connect with the target. But now lets say there are x number of sources and y number of targets, and each source need to connect with all the targets. In this case it will become really difficult to maintain the whole system.



Kafka is used by:
  • Netflix use Kafka to apply recommendations in real-time while we are watching shows/movies. When we leave, what we were watching we get a new recommendation right away because of Kafka.
  • LinkedIn uses Kafka to prevent spam in their platform, collect user interactions and make better connection recommendations. 
  • Uber also uses Kafka to gather user, taxi and trip data in real-time to compute and forecast demand and compute the almighty surge pricing in real-time.
Why is Kafka technology significant to use?
Kafka, being a distributed publish–subscribe system, has the following advantages:
  • Fast: Kafka comprises a broker, and a single broker can serve thousands of clients by handling megabytes of reads and writes per second.
  • Scalable: Data is partitioned and streamlined over a cluster of machines to enable large information.
  • Durable: Messages are persistent and is replicated in the cluster to prevent record loss.
  • Distributed by design: It provides fault-tolerance and robustness.
List the various components in Kafka.
  • Topic – a stream of messages belonging to the same type
  • Producer – that can publish messages to a topic
  • Brokers – a set of servers where the publishes messages are stored
  • Consumer – that subscribes to various topics and pulls data from the brokers.
Explain Topics, Partitions and Offsets
  • The topic in Kafka is a particular stream of data.
  • Topic is very similar to a table in a NoSQL database. Like tables in a NoSQL database, the topic is split into partitions that enable topic to be distributed across various nodes. Like primary keys in tables, topics have offsets per partitions. We can uniquely identify a message using its topic, partition and offset.
  • A topic is identified by its name.  
  • Each topic is split in partitions. Partitions enable topics to be distributed across the cluster. 
  • Partitions are a unit of parallelism for horizontal scalability. One topic can have more than one partition scaling across nodes.
  • Each partition have numbers and these numbers are zero and go all the way to whatever.
  • Each partition is going to be ordered and within each message with each partition, we'll get an incremental ID called offset. Offsets are infinite and unbounded. Order is guaranteed within a partition.
  • Whenever we create a topic, we need to specify the number of partitions we want in that topic (default number is mentioned in a properties file).
  • Let's say our topic has three partitions 0, 1 and 2. The first message to partition zero is going to have the offset zero, second message will have offset one, third message will have offset as two and so on.
  • Offsets of one partition is not dependent on the offset of other topics.
  • The data in Kafka, it is kept only for a limited amount of time. And by default it's one week, so after one week the data is gone. Whenever data is deleted from Kafka, the offsets are not reseted, they keep on incrementing, they never go back to zero.
  • Also, once the data is written through partition it can't be changed, because the messages written to partitions are immutable.
  • So once I write offset seven in partition nine, I can never update it or just swap it or whatever, it can't be changed. 
Explain the role of the offset.
The messages in partitions will be given a unique sequential ID known as an offset. The role of the offset is to uniquely identify every message within the partition.With the aid of ZooKeeper, Kafka stores the offsets of messages used for a specific topic and partition by a consumer group.

Mention what is the meaning of broker in Kafka?
  • In Kafka cluster, broker term is used to refer Server.
  • The broker in Kafka holds the topics and partitions.
  • A Kafka cluster comprised of multiple brokers, and each broker is basically a server. 
  • Each broker is identified by an ID and the ID is going to be a number.
  • Kafka is distributed. Each broker will contain only certain topic partitions, basically each broker has some kind of data, but not all the data.
  • Also, when we connect to one broker, we are connected to the entire cluster( that means to all the brokers). Each broker in Kafka is a bootstrap broker.
What is the role of the ZooKeeper?
  • Zookeeper is an open source, high-performance co-ordination service used for distributed applications adapted by Kafka
  • Zookeeper stores metadata and current state of Kafka cluster.
  • Earlier Kafka uses Zookeeper to store offsets of messages consumed for a specific topic and partition by a specific Consumer Group. But now Kafka has moved the offset storage from zookeeper to Kafka brokers. 
  • It keeps track of the status of the Kafka cluster nodes, as well as of Kafka topics, partitions, etc.
  • Since the data is divided across collections of nodes within ZooKeeper, it exhibits high availability and consistency. When a node fails, ZooKeeper performs an instant failover migration.
  • ZooKeeper is used in Kafka for managing service discovery for Kafka brokers, which form the cluster. 
  • ZooKeeper does  leader detection and communicates with Kafka when a new broker joins, when a broker dies, when a topic gets removed, or when a topic is added so that each node in the cluster knows about these changes. Thus, it provides an in-sync view of the Kafka cluster configuration.
How does Kafka store offsets for each topic?
Earlier Kafka uses Zookeeper to store offsets of messages consumed for a specific topic and partition by a specific Consumer Group. Since kafka 0.9, it’s not zookeeper anymore that store the information about what were the offset consumed by each groupid on a topic by partition. Kafka now store this information on a topic called __consumer_offsets.

Is it possible to use Kafka without ZooKeeper?
No, it is not possible to bypass Zookeeper and connect directly to the Kafka server. If, for some reason, ZooKeeper is down, you cannot service any client request.

What is the process for starting a Kafka server?
Since Kafka uses ZooKeeper, it is essential to initialize the ZooKeeper server, and then fire up the Kafka server.

How to start Kafka and Zookeeper in Windows
  • Download the KAFKA from https://kafka.apache.org/downloads. I have downloaded kafka_2.12-2.3.0 and kept in inside 'Softwares' folder of D drive.
  • Inside 'kafka_2.12-2.3.0' we need to create a new folder 'data', inside this 'data' folder we need to create two folders 'kafka' and 'zookeeper'. These folders will hold kafka and zookeeper data.
Start Zookeeper
  • Now open 'zookeeper.properties', which is inside 'kafka_2.12-2.3.0\config' and add 'dataDir=D:/Softwares/kafka_2.12-2.3.0/data/zookeeper' in it. 
  • Open command prompt, since we are running zookeeper in windows execute 'cd D:\Softwares\kafka_2.12-2.3.0\bin\windows'. And then 'zookeeper-server-start.bat D:\Softwares\kafka_2.12-2.3.0\config\zookeeper.properties'.
  • If you see ' binding to port 0.0.0.0/0.0.0.0:2181 (org.apache.zookeeper.server.NIOServerCnxnFactory)' it means Zookeeper is started successfully.
  • After this, if you go inside 'D:\Softwares\kafka_2.12-2.3.0\data\zookeeper' you will see 'version-2' folder is created inside it.
Open Kafka
  • Open 'server.properties', which is inside 'kafka_2.12-2.3.0\config' and update 'log.dirs=D:/Softwares/kafka_2.12-2.3.0/data/kafka'.
  • Open command prompt, since we are running kafka in windows execute 'cd D:\Softwares\kafka_2.12-2.3.0\bin\windows'.  And then run 'kafka-server-start.bat D:\Softwares\kafka_2.12-2.3.0\config\server.properties'.
  • After this if you go inside 'D:\Softwares\kafka_2.12-2.3.0\data\kafka', you will see bunch of files created.
Explain the concept of Leader and Follower.
  • Every partition in Kafka has one server which plays the role of a leader, and none or more servers that act as Followers.
  • The Leader performs the task of all read and write requests for the partition, while the role of the Followers is to passively replicate the leader.
  • When the leader is dead, one of the Followers will take on the role of the Leader. This ensures load balancing of the server.
What s the role of Kafka Broker?
  • A broker is a single Kafka node that is managed by Zookeeper. 
  • Set of brokers form a Kafka cluster. 
  • Topics that are created in Kafka are distributed across brokers based on the partition, replication, and other factors. When a broker node fails based on the state stored in zookeeper it automatically rebalances the cluster and also in case if a leader partition is lost then one of the follower partition is elected as leader.
Replication
Replication is making a copy of a partition available in another broker. It enables Kafka to be fault tolerant even when a broker is down.

There can be one leader broker and n follower brokers for each partition. Both producer and consumers are severed only by the leader. In case of a broker failure the partition from another broker is elected as leader and it starts serving the producers and consumer group.

We can set how many such brokers will exist by setting the property 'replication.factor'. It's the total amount of times the data inside a single partition is replicated across the cluster. The default and typical recommendation is three.

Producer clients only write to the leader broker. The followers asynchronously replicate the data. In real world, we need a way to tell whether these followers are managing to keep up with the leader? Do they have the latest data written to the leader?

In-sync replicas
Replica partitions that are in sync with the leader are flagged as ISR (In Sync Replica). We can say, an in-sync replica (ISR) is a broker that has the latest data for a given partition. A leader is always an in-sync replica. A follower is an in-sync replica only if its not behind on the latest records for a given partition.

If a follower broker falls behind the latest data for a partition, we no longer count it as an in-sync replica.

What roles do Replicas and the ISR play?
  • Replicas are essentially a list of nodes that replicate the log for a particular partition irrespective of whether they play the role of the Leader. 
  • ISR stands for In-Sync Replicas, it is essentially a set of message replicas that are synced to the leaders.
Explain how you can reduce churn in ISR? When does broker leave the ISR?
  • ISR is a set of message replicas that are completely synced up with the leaders, in other word ISR has all messages that are committed. 
  • ISR should always include all replicas until there is a real failure. A replica will be dropped out of ISR if it deviates from the leader.
Why are Replications critical in Kafka?
Replication ensures that published messages are not lost and can be consumed in the event of any machine error, program error or frequent software upgrades.

If a Replica stays out of the ISR for a long time, what does it signify?
It means that the Follower is unable to fetch data as fast as data accumulated by the Leader.

Mention what happens if the preferred replica is not in the ISR?
If the preferred replica is not in the ISR, the controller will fail to move leadership to the preferred replica.

Elaborate the architecture of Kafka.
In Kafka, a cluster contains multiple brokers since it is a distributed system. Topic in the system will get divided into multiple partitions, and each broker stores one or more of those partitions so that multiple producers and consumers can publish and retrieve messages at the same time.

What is the maximum size of the message does Kafka server can receive?
The maximum size of the message that Kafka server can receive is 1000000 bytes.

Mention what is the traditional method of message transfer?
The traditional method of message transfer includes two methods
•Queuing: In a queuing, a pool of consumers may read message from the server and each message goes to one of them.
•Publish-Subscribe: In this model, messages are broadcasted to all consumers.

Kafka caters single consumer abstraction that generalized both of the above- the consumer group.

What are challenges with traditional messaging system?
Single Point of Failure: The traditional messaging systems are designed for a topology of the Hub and Spoke. All messages are stored in this design on a central server, or broker. So, at any given point in time each client application connects to one server or broker. This design can turn out to be a single point of failure, as all the topics are queued in the central hub. Even if you add a standby to the primary broker, the client application will only be able to connect to one node.

Difficulties in horizontal scaling: While Hub and spoken architectures have developed into multi-node networks, horizontal scaling is not allowed by the architecture. A single client application connects at a given time to a single node. When you increase the number of nodes, the amount of internode traffic that writes and reads processes also increases.

Monolithic architecture: Traditional messaging systems are designed to address a monolithic architecture's data challenges. Nevertheless, most companies now operate a clustered, distributed computing system. Large Commodity Hardware clusters can not be scaled horizontally in this design. Moreover, messages will wait in queues.

Apache Kafka is specially designed to address data management challenges in a distributed and clustered computing environment. The distributed messaging is taken to the next level through Apache Kafka.

-K Himaanshu Shuklaa..

1 comment: