December 01, 2019

Kafka Part 9: Compression

Compression In Kafka
Data is sent from the producer to Kafka as text, most commonly in JSON format. The drawback of JSON is that the data is stored as strings, and the same field names are repeated in every record written to the Kafka topic, which occupies a lot of disk space. That's why we need compression.
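For illustration, consider two hypothetical order events (the field names and values below are made up). Every record repeats the same field names as plain strings, and it is exactly this kind of redundancy that a compression codec removes:

    {"orderId": 1001, "customer": "alice", "status": "SHIPPED"}
    {"orderId": 1002, "customer": "bob", "status": "SHIPPED"}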

Advantages of Message Compression
  • Reduced disk footprint. Compression not only saves a significant amount of disk space, but latency also decreases due to the smaller size of the messages.
  • It reduces the size of the data sent to Kafka, and with it the latency of each request.
  • It reduces the network bandwidth used per message, which lets producers push more messages to the broker over the same link.
  • The reduced disk load leads to faster read and write operations.
Disadvantages of Message Compression
Producers spend some CPU cycles compressing the data, and consumers spend some CPU cycles decompressing it. This leads to increased CPU usage on both sides.

In Kafka, messages are sent uncompressed by default. The producer parameter compression.type can be set to snappy, gzip, or lz4 to compress data before sending it to the brokers. A sample producer configuration is shown after the list below.
  • Snappy compression was invented by Google and aims for very high speed with reasonable compression. It does not aim for maximum compression, so the decrease in size might not be that significant. If you are looking for a fast compression algorithm, Snappy might work for you.
  • Gzip compression will typically use more CPU and time, but it results in better compression ratios, so it is recommended in cases where network bandwidth is more restricted.
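As a minimal sketch with the Java client (the broker address localhost:9092 and the topic name my-topic below are just placeholders), setting the codec on the producer looks roughly like this:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class CompressedProducerDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            // Compress each batch on the producer before it is sent to the broker.
            props.put("compression.type", "snappy"); // or "gzip", "lz4"

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            producer.send(new ProducerRecord<>("my-topic", "user-42", "{\"userId\":42,\"action\":\"click\"}"));
            producer.close();
        }
    }

The consumer needs no extra configuration for this: it decompresses the fetched batches transparently and hands the original messages to the application.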

Compression In Kafka 0.7
Compression took place at the producer end, where the producer compressed a batch of messages. The compressed batch was appended, as is, to the Kafka broker's log file. The consumer fetched the data, decompressed it, and handed the original messages to the user. The broker paid no penalty as far as compression overhead was concerned.

Compression In Kafka 0.8
In Kafka 0.7, messages were addressable by physical byte offsets into the partition's write-ahead log. In Kafka 0.8, each message is addressable by a monotonically increasing logical offset that is unique per partition: the 1st message has an offset of 0, the 100th message has an offset of 99, and so on.

Though this feature provides simplified offset management, it has an impact on broker performance when the incoming data is compressed.

In Kafka 0.8, messages for a partition are served by the leader broker. It is the leader's job to assign a unique logical offset to every message it appends to its log. If the data is compressed, the leader has to decompress it in order to assign offsets to the messages inside the compressed batch. So the leader decompresses the data, assigns offsets, compresses it again, and then appends the re-compressed data to disk. The leader has to do this for every compressed batch it receives.

-K Himaanshu Shuklaa..
