December 01, 2019

Kafka Part 1: Basics

Why Kafka?
Let's say we have a source system and a target system, where the target consumes data from the source. In the simplest case we have one source and one target, so it is easy for the source to connect to the target. But now let's say there are x sources and y targets, and each source needs to connect to every target. That is x × y separate integrations, and the whole system becomes really difficult to maintain. Kafka sits in between, so each source and each target only needs to integrate with Kafka.
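To get a feel for how quickly this grows, here is a small back-of-the-envelope sketch in Python. The function names are made up for illustration, and the second function assumes a central hub (Kafka, in our case) that every system connects to exactly once.

```python
def point_to_point_integrations(sources: int, targets: int) -> int:
    """Every source connects directly to every target."""
    return sources * targets


def hub_integrations(sources: int, targets: int) -> int:
    """Every system connects once to a central hub (assumed here to be Kafka)."""
    return sources + targets


if __name__ == "__main__":
    # Compare both setups for a few system counts.
    for x, y in [(1, 1), (4, 6), (10, 20)]:
        print(x, y, point_to_point_integrations(x, y), hub_integrations(x, y))
```

With 4 sources and 6 targets, the direct approach needs 24 integrations, while the hub approach needs 10.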



Kafka is used by:
  • Netflix uses Kafka to apply recommendations in real time while we are watching shows or movies. When we stop watching something, we get a new recommendation right away because of Kafka.
  • LinkedIn uses Kafka to prevent spam on its platform, collect user interactions, and make better connection recommendations.
  • Uber uses Kafka to gather user, taxi, and trip data in real time, to compute and forecast demand, and to compute the almighty surge pricing in real time.
Topics, Partitions and Offsets
  • A topic in Kafka is a particular stream of data.
  • A Kafka topic is similar to a table in a database.
  • A topic is identified by its name.
  • Each topic is split into partitions.
  • Partitions are numbered, starting at zero and going up to one less than the number of partitions.
  • Each partition is ordered, and each message within a partition gets an incremental ID called an offset. Offsets only ever increase and are effectively unbounded. Order is guaranteed only within a partition, not across partitions.
  • Whenever we create a topic, we need to specify the number of partitions we want in it (a default is set by num.partitions in the broker's server.properties file).
  • Let's say our topic has three partitions: 0, 1, and 2. The first message written to partition 0 gets offset 0, the second message gets offset 1, the third gets offset 2, and so on.
  • Offsets in one partition are independent of the offsets in other partitions. Offset 3 in partition 0 and offset 3 in partition 1 are different messages.
  • Data in Kafka is kept only for a limited amount of time. By default it is one week, so after one week the data is gone. When data is deleted from Kafka, the offsets are not reset; they keep incrementing and never go back to zero.
  • Also, once data is written to a partition, it can't be changed.
  • So once I write the message at offset 7 in partition 9, I can never update it or swap it; it can't be changed.
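The offset rules above can be sketched with a toy in-memory model. This is only an illustration in Python, not how Kafka stores data: the class names Partition and Topic and the expire_oldest helper are hypothetical, and retention is simulated by dropping a fixed number of messages instead of using time.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Toy append-only log; not how Kafka stores data on disk."""
    next_offset: int = 0                          # offset the next message will get
    records: dict = field(default_factory=dict)   # offset -> message

    def append(self, message: str) -> int:
        """Store a message and return the offset assigned to it."""
        offset = self.next_offset
        self.records[offset] = message
        self.next_offset += 1
        return offset

    def expire_oldest(self, count: int) -> None:
        """Simulate retention by dropping the oldest `count` messages."""
        for offset in sorted(self.records)[:count]:
            del self.records[offset]


@dataclass
class Topic:
    name: str
    num_partitions: int
    partitions: list = field(init=False)

    def __post_init__(self):
        # Partitions are numbered 0 .. num_partitions - 1.
        self.partitions = [Partition() for _ in range(self.num_partitions)]

    def send(self, partition: int, message: str) -> int:
        return self.partitions[partition].append(message)


if __name__ == "__main__":
    topic = Topic("trips", num_partitions=3)
    print(topic.send(0, "a"), topic.send(0, "b"), topic.send(1, "c"))
    topic.partitions[0].expire_oldest(2)   # old data is gone...
    print(topic.send(0, "d"))              # ...but offsets keep incrementing
```

Each partition keeps its own counter, so partition 1 starts at offset 0 no matter how many messages partition 0 already holds. There is deliberately no update method, since written messages can't be changed.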
Brokers
  • A broker in Kafka holds topics and partitions.
  • A Kafka cluster is composed of multiple brokers, and each broker is basically a server.
  • Each broker is identified by an ID, which is a number.
  • Kafka is distributed. Each broker contains only certain topic partitions, so each broker has some of the data, but not all of it.
  • Also, when we connect to one broker, we are connected to the entire cluster (that means to all the brokers). Every broker in Kafka is a bootstrap broker.
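Here is a rough Python sketch of the idea that each broker holds only some partitions, while any broker can tell a client where everything lives. The round-robin placement and both function names are assumptions made for illustration; Kafka's real placement logic is more involved.

```python
def assign_partitions(topics: dict, broker_ids: list) -> dict:
    """Spread each topic's partitions over the brokers round-robin (simplified)."""
    assignment = {broker_id: [] for broker_id in broker_ids}
    slot = 0
    for topic, num_partitions in topics.items():
        for partition in range(num_partitions):
            broker_id = broker_ids[slot % len(broker_ids)]
            assignment[broker_id].append((topic, partition))
            slot += 1
    return assignment


def cluster_metadata(assignment: dict) -> dict:
    """What any bootstrap broker could tell a client: which broker holds each partition."""
    return {
        topic_partition: broker_id
        for broker_id, topic_partitions in assignment.items()
        for topic_partition in topic_partitions
    }


if __name__ == "__main__":
    # Hypothetical topics (name -> partition count) and broker IDs.
    assignment = assign_partitions({"trips": 3, "users": 2}, [101, 102, 103])
    for broker_id, held in assignment.items():
        print(broker_id, held)
    print(cluster_metadata(assignment)[("trips", 2)])
```

No broker in this sketch holds all five partitions, yet the metadata map covers all of them, which is why connecting to a single broker is enough to reach the whole cluster.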
How to start Kafka and Zookeeper on Windows
  • Download Kafka from https://kafka.apache.org/downloads. I downloaded kafka_2.12-2.3.0 and kept it inside the 'Softwares' folder on the D drive.
  • Inside 'kafka_2.12-2.3.0', create a new folder 'data', and inside this 'data' folder create two folders, 'kafka' and 'zookeeper'. These folders will hold the Kafka and Zookeeper data.
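If you would rather script the folder creation than click through Explorer, a minimal Python sketch could look like this. The function name create_data_dirs is made up for illustration, and the install path is the one used in this post, so adjust it to wherever you extracted Kafka.

```python
from pathlib import Path


def create_data_dirs(kafka_home: Path) -> list:
    """Create data/kafka and data/zookeeper under the Kafka install folder."""
    created = []
    for name in ("kafka", "zookeeper"):
        folder = kafka_home / "data" / name
        # parents=True also creates 'data'; exist_ok=True makes reruns harmless.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


if __name__ == "__main__":
    # Path taken from this post; change it to match your own setup.
    for folder in create_data_dirs(Path("D:/Softwares/kafka_2.12-2.3.0")):
        print("ready:", folder)
```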
Start Zookeeper
  • Now open 'zookeeper.properties', which is inside 'kafka_2.12-2.3.0\config', and set 'dataDir=D:/Softwares/kafka_2.12-2.3.0/data/zookeeper' in it.
  • Open a command prompt. Since we are running Zookeeper on Windows, execute 'cd D:\Softwares\kafka_2.12-2.3.0\bin\windows' and then 'zookeeper-server-start.bat D:\Softwares\kafka_2.12-2.3.0\config\zookeeper.properties'.
  • If you see 'binding to port 0.0.0.0/0.0.0.0:2181 (org.apache.zookeeper.server.NIOServerCnxnFactory)', it means Zookeeper has started successfully.
  • After this, if you go inside 'D:\Softwares\kafka_2.12-2.3.0\data\zookeeper', you will see that a 'version-2' folder has been created in it.
Start Kafka
  • Open 'server.properties', which is inside 'kafka_2.12-2.3.0\config', and update 'log.dirs=D:/Softwares/kafka_2.12-2.3.0/data/kafka'.
  • Open a new command prompt. Since we are running Kafka on Windows, execute 'cd D:\Softwares\kafka_2.12-2.3.0\bin\windows' and then run 'kafka-server-start.bat D:\Softwares\kafka_2.12-2.3.0\config\server.properties'.
  • After this, if you go inside 'D:\Softwares\kafka_2.12-2.3.0\data\kafka', you will see a bunch of files created.
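As an optional sanity check, we can test from Python whether both servers are accepting connections. This is only a sketch: is_port_open is a hypothetical helper, 2181 is the Zookeeper port from the startup log line quoted earlier, and 9092 is assumed to be the Kafka port, which is the default unless server.properties changes it.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections and timeouts alike.
        return False


if __name__ == "__main__":
    # 9092 is an assumption (Kafka's default listener port); adjust if needed.
    for name, port in [("Zookeeper", 2181), ("Kafka", 9092)]:
        state = "up" if is_port_open("localhost", port) else "down"
        print(f"{name} on port {port}: {state}")
```

An open port only shows that a process is listening there, not that the broker is healthy, so treat it as a quick first check.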
-K Himaanshu Shuklaa..
