
December 01, 2019

Kafka Part 10: Implement Exactly Once Processing in Kafka

Let's say we are designing a system using Apache Kafka that sends messages from one system to another. While designing it, we need to consider the questions below:
  • How do we guarantee all messages are processed?
  • How do we avoid/handle duplicate messages?
A timeout could occur while publishing messages to Kafka. Our consumer process could run out of memory or crash while writing to a downstream database. Or our broker could run out of disk space, or a network partition may form between ZooKeeper instances.
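One common building block for exactly-once behavior is an idempotent consumer: each message carries a unique ID, and the consumer skips IDs it has already processed, so redeliveries after a timeout do no harm. Below is a minimal in-memory sketch; the `message_id` field and the `processed_ids` set are assumptions for illustration (in production the set of processed IDs would typically live in the downstream database itself, updated in the same transaction as the results).

```python
import json

class IdempotentConsumer:
    """Processes each message at most once by tracking seen message IDs."""

    def __init__(self):
        self.processed_ids = set()   # in production: a table in the target DB
        self.results = []

    def handle(self, raw_message: str) -> bool:
        msg = json.loads(raw_message)
        msg_id = msg["message_id"]
        if msg_id in self.processed_ids:
            return False             # duplicate delivery: skip, stay idempotent
        self.results.append(msg["payload"])  # the actual "processing"
        self.processed_ids.add(msg_id)
        return True

consumer = IdempotentConsumer()
m = json.dumps({"message_id": "m-1", "payload": "order created"})
consumer.handle(m)
consumer.handle(m)   # redelivered after a timeout: ignored
```

With at-least-once delivery from Kafka plus this dedup step, the end-to-end effect is exactly-once processing.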

Kafka Part 9: Compression

Compression In Kafka
Data is usually sent from the producer to Kafka in text format, commonly JSON. JSON has a demerit: data is stored in string form, and the same field names are repeated in every record stored in the Kafka topic, which occupies much more disk space. That's why we need compression.
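We can see why compression helps by gzipping a batch of similar JSON records with nothing but the standard library; the repeated field names compress very well (the sample records are made up for illustration):

```python
import gzip
import json

# A batch of JSON records: the field names repeat in every single record.
records = [json.dumps({"user_id": i, "event": "click", "page": "/home"})
           for i in range(1000)]
raw = "\n".join(records).encode("utf-8")

compressed = gzip.compress(raw)
print(len(raw), len(compressed))  # compressed size is far smaller
```

The real producer does the equivalent via the compression.type setting (e.g. gzip, snappy, lz4), compressing whole batches before they leave the client.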

Kafka Part 8: Batch Size and linger.ms



What is a Producer Batch and Kafka’s batch size?
  • A producer writes messages to Kafka one by one. Rather than sending each message immediately, it collects the messages being produced into a batch until the batch becomes full, then sends the batch to Kafka. Such a batch is known as a producer batch.
  • In other words, Kafka producers buffer unsent records for each partition. The size of these buffers is specified by the batch.size configuration. Once a buffer is full, its messages are sent.
  • The default batch size is 16 KB, and it can be increased as needed. The larger the batch size, the better the compression, throughput, and efficiency of producer requests. Small batch sizes disproportionately delay larger messages.
  • A message larger than the batch size will not be batched. Also, a batch is allocated per partition, so do not set batch.size to a very high number.
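The interplay of batch.size and linger.ms can be sketched with a toy buffer: a batch is sent when it reaches the size limit or when the linger time has elapsed, whichever comes first. This is only a simulation of the idea, not the real producer's internals:

```python
class ProducerBuffer:
    """Toy model of per-partition batching driven by batch.size and linger.ms."""

    def __init__(self, batch_size: int, linger_ms: float):
        self.batch_size = batch_size
        self.linger_ms = linger_ms
        self.batch = []
        self.batch_bytes = 0
        self.batch_started = None
        self.sent_batches = []

    def _flush(self):
        if self.batch:
            self.sent_batches.append(self.batch)
            self.batch, self.batch_bytes, self.batch_started = [], 0, None

    def send(self, record: bytes, now_ms: float):
        if self.batch_started is None:
            self.batch_started = now_ms
        self.batch.append(record)
        self.batch_bytes += len(record)
        # Flush when the batch is full OR linger.ms has elapsed since it opened.
        if (self.batch_bytes >= self.batch_size
                or now_ms - self.batch_started >= self.linger_ms):
            self._flush()

buf = ProducerBuffer(batch_size=32, linger_ms=5)
for i in range(10):
    buf.send(b"0123456789", now_ms=i)  # 10-byte records arriving 1 ms apart
```

With 10-byte records and a 32-byte batch size, every fourth record fills a batch, so two full batches go out and the last two records wait in the buffer for the next flush.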

Kafka Part 7: Why is ZooKeeper always configured with an odd number of nodes?

Let's understand a few basics:

ZooKeeper is a highly available, highly reliable, and fault-tolerant coordination and consensus service for distributed applications like Apache Storm or Kafka. High availability and reliability are achieved through replication.
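The odd-number recommendation falls out of majority-quorum arithmetic: an ensemble of n nodes needs a strict majority (floor(n/2) + 1) to make progress, so it tolerates n minus that majority failures. A few lines make the pattern visible:

```python
def tolerated_failures(n: int) -> int:
    """A quorum needs a strict majority: floor(n/2) + 1 nodes."""
    majority = n // 2 + 1
    return n - majority

for n in range(1, 8):
    print(n, tolerated_failures(n))
```

Running this shows that 3 and 4 nodes both tolerate only 1 failure, and 5 and 6 both tolerate 2: the even-numbered extra node adds cost and coordination overhead without adding any fault tolerance, which is why odd ensemble sizes are the norm.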

Kafka Part 6: Assign and Seek

Assign

When we work with consumer groups, the partitions are assigned automatically to consumers and are rebalanced automatically when consumers are added or removed from the group.

ALSO READ: Kafka Consumer Group, Partition Rebalance, Heartbeat

Sometimes we need a single consumer that always reads data from all the partitions in a topic, or from a specific partition in a topic. In this case there is no reason for groups or rebalancing; we simply assign the consumer the specific topic and/or partitions, consume messages, and commit offsets on occasion.
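The shape of the API is: instead of subscribe(), the standalone consumer calls assign() with explicit partitions, and can seek() to any offset before polling. The classes below are a tiny in-memory stand-in written to illustrate those semantics; they are not a real Kafka client:

```python
class StandaloneConsumer:
    """In-memory stand-in for a consumer using assign()/seek():
    no consumer group, no rebalancing."""

    def __init__(self, log):
        self.log = log          # partition number -> list of messages
        self.assigned = []
        self.positions = {}

    def assign(self, partitions):
        # Explicit assignment: the consumer, not a group coordinator,
        # decides which partitions it reads.
        self.assigned = list(partitions)
        self.positions = {p: 0 for p in self.assigned}

    def seek(self, partition, offset):
        # Jump to an arbitrary offset instead of the committed position.
        self.positions[partition] = offset

    def poll(self):
        records = []
        for p in self.assigned:
            records.extend(self.log[p][self.positions[p]:])
            self.positions[p] = len(self.log[p])
        return records

log = {0: ["a", "b", "c"], 1: ["x", "y"]}
c = StandaloneConsumer(log)
c.assign([0])          # read only partition 0
c.seek(0, 1)           # skip past offset 0
records = c.poll()     # yields the messages from offset 1 onward
```

The trade-off is the one described above: an assigned consumer never participates in rebalancing, so if partitions are added to the topic, it must notice and re-assign itself.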

Kafka Part 5: Consumer Group, Partition Rebalance, Heartbeat

What is a Consumer Group?
A consumer group is a core Kafka concept. Every Kafka consumer group consists of one or more consumers that jointly consume a set of subscribed topics.

Let's say we have an application which reads messages from a Kafka topic, performs some validations and calculations, and writes the results to another data store.

In this case our application will create a consumer object, subscribe to the appropriate topic, and start receiving messages, validating them, and writing the results.

This may work well for a while, but what happens when the rate at which producers write messages to the topic exceeds the rate at which your application can validate them?
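Kafka's answer is to add more consumers to the same group and let the topic's partitions be divided among them. A round-robin sketch shows how P partitions spread over C consumers (this is a simplification of the real assignor strategies such as range or sticky assignment):

```python
def assign_partitions(num_partitions, consumers):
    """Spread partitions round-robin over the consumers in a group (simplified)."""
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

# 6 partitions over 3 consumers: each consumer handles 2 partitions.
a = assign_partitions(6, ["c1", "c2", "c3"])

# 7 consumers over 6 partitions: one consumer receives nothing, because
# a partition is never shared between consumers of the same group.
b = assign_partitions(6, [f"c{i}" for i in range(1, 8)])
```

This is also why the partition count caps the useful size of a consumer group: consumers beyond the number of partitions sit idle.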

Kafka Part 4: Consumers

We learned how to create a Kafka producer in the previous part of this Kafka series. Now we will create a Kafka consumer.

Reading data from Kafka is a bit different from reading data from other messaging systems. Applications that need to read data from Kafka use a KafkaConsumer to subscribe to Kafka topics and receive messages from those topics.

In this blog post, we will discuss interview questions related to Kafka consumers, and we will also create our own consumer.

Kafka Part 3: Kafka Producer, Callbacks and Keys

What is the role of Kafka producer?
The primary role of a Kafka producer is to take producer properties and a record as input and write the record to the appropriate Kafka broker. Producers serialize, partition, compress, and load-balance data across brokers based on partitions.

The workflow of a producer involves five important steps:
  1. Serialize
  2. Partition
  3. Compress
  4. Accumulate records
  5. Group by broker and send
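The five steps above can be sketched end to end in a few lines. CRC32 hashing and gzip are stand-ins for the real client's configurable serializers, partitioner, and compression codec, and the broker layout is an assumption made up for the example:

```python
import gzip
import json
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3
# Assumed cluster layout: which broker leads each partition.
PARTITION_LEADERS = {0: "broker-1", 1: "broker-2", 2: "broker-1"}

def produce(records):
    batches = defaultdict(list)
    for key, value in records:
        payload = json.dumps(value).encode("utf-8")            # 1. serialize
        partition = zlib.crc32(key.encode()) % NUM_PARTITIONS  # 2. partition by key
        batches[partition].append(payload)                     # 4. accumulate records
    requests = defaultdict(dict)
    for partition, batch in batches.items():
        compressed = gzip.compress(b"\n".join(batch))          # 3. compress the batch
        requests[PARTITION_LEADERS[partition]][partition] = compressed  # 5. group by broker
    return requests

reqs = produce([("user-1", {"event": "login"}), ("user-2", {"event": "click"})])
```

Grouping the per-partition batches by leader broker (step 5) is what lets the producer send one network request per broker instead of one per partition.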

Kafka Part 2: Kafka Command Line Interface (CLI)

Now we will create topics from CLI.
  • Open another command prompt, execute 'cd D:\Softwares\kafka_2.12-2.3.0\bin\windows'.
  • Make sure ZooKeeper and the Kafka broker are running.
  • Now execute 'kafka-topics --zookeeper 127.0.0.1:2181 --topic first_topic --create --partitions 3 --replication-factor 1'. This will create a Kafka topic named 'first_topic' with 3 partitions and a replication factor of 1 (in our case we cannot set a replication factor greater than 1 because we have started only one broker).
  • After executing the above command you will get a message 'Created topic first_topic'.
  • How can we check if our topic is actually created? In the same command prompt, execute 'kafka-topics --zookeeper 127.0.0.1:2181 --list'. This will list all the topics which are present.
  • To get more details about the topic we created, execute 'kafka-topics --zookeeper 127.0.0.1:2181 --topic first_topic --describe'.

Kafka Part 1: Basics

What is Apache Kafka?
Apache Kafka is a publish-subscribe messaging system developed by the Apache Software Foundation and written in Scala. It is a distributed, partitioned, and replicated log service: a horizontally scalable, fault-tolerant, and fast messaging system.

Why Kafka?
Let's say we have a source system and a target system, where the target consumes data from the source. In the simplest case we have one source and one target, so it is easy to connect them. But now let's say there are x sources and y targets, and each source needs to connect with every target. In this case it becomes really difficult to maintain the whole system.
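The arithmetic behind that pain: direct integration needs x × y connections, while putting a broker like Kafka in the middle needs only x + y, because each system connects to Kafka exactly once:

```python
def direct_connections(sources: int, targets: int) -> int:
    return sources * targets   # every source wired to every target

def with_kafka(sources: int, targets: int) -> int:
    return sources + targets   # each system connects to Kafka once

# 4 sources and 5 targets: 20 direct links, versus 9 through Kafka.
```

The gap widens quickly: at 10 sources and 10 targets it is 100 integrations versus 20.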