Kafka Interview Questions

Questions –

  1. How do you create a topic in Kafka using the Confluent CLI?
    • Command
  2. Explain the role of the Schema Registry in Kafka.
  3. How do you register a new schema in the Schema Registry?
  4. What is the importance of key-value messages in Kafka?
  5. Describe a scenario where using a random key for messages is beneficial.
  6. Provide an example where using a constant key for messages is necessary.
  7. Write a simple Kafka producer code that sends JSON messages to a topic.
  8. How do you serialize a custom object before sending it to a Kafka topic?
  9. Describe how you can handle serialization errors in Kafka producers.
  10. Write a Kafka consumer code that reads messages from a topic and deserializes them from JSON.
  11. How do you handle deserialization errors in Kafka consumers?
  12. Explain the process of deserializing messages into custom objects.
  13. What is a consumer group in Kafka, and why is it important?
  14. Describe a scenario where multiple consumer groups are used for a single topic.
  15. How does Kafka ensure load balancing among consumers in a group?
  16. How do you send JSON data to a Kafka topic and ensure it is properly serialized?
  17. Describe the process of consuming JSON data from a Kafka topic and converting it to a usable format.
  18. Explain how you can work with CSV data in Kafka, including serialization and deserialization.
  19. Write a Kafka producer code snippet that sends CSV data to a topic.
  20. Write a Kafka consumer code snippet that reads and processes CSV data from a topic.
  21. What are the different ways to receive and acknowledge messages in Kafka?
  22. What makes Kafka fast?



2. Explain the role of the Schema Registry in Kafka.

The Schema Registry in Kafka plays a crucial role in managing schemas for data that is sent to and from Kafka topics.

Schema Management:

  • Centralized Schema Repository: The Schema Registry acts as a centralized repository for schemas used in Kafka messages. It stores and manages schemas independently from the Kafka brokers.
  • Schema Evolution: It facilitates schema evolution by allowing compatibility checks between different versions of schemas. This ensures that producers and consumers can evolve their schemas without causing disruptions.

Example:

  • Suppose a producer wants to publish messages to a Kafka topic using Avro serialization. Before sending data, it registers the Avro schema with the Schema Registry, which assigns it an ID. When the producer sends a message, it includes the schema ID alongside the serialized data. Consumers retrieve the schema ID from the message, fetch the corresponding schema from the Schema Registry, and deserialize the data accordingly.
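
A minimal producer sketch of this flow, assuming the Confluent Avro serializer and a Schema Registry at http://localhost:8081; the broker address, topic name, and inline schema are illustrative assumptions, not part of the original notes:

```java
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducerSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // The Confluent Avro serializer registers the schema (if allowed) and embeds its ID in each record.
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");   // assumed registry address

        // Illustrative schema; in practice this usually comes from an .avsc file or a generated class.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":[{\"name\":\"name\",\"type\":\"string\"}]}");
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("users", "user-1", user));   // "users" is a hypothetical topic
        }
    }
}
```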

22. What makes Kafka fast?

Zero-copy transfers make Kafka fast, but how exactly?

Kafka is a message broker: it accepts messages from the network and writes them to disk, and reads them back from disk to serve them over the network. The traditional way of moving data between disk and network involves `read` and `write` system calls, which require data to be copied back and forth between kernel space and user space.

Kafka leverages the `sendfile` system call, which copies data from one file descriptor to another entirely within the kernel. Kafka uses this to transfer data directly from the log file on disk (via the page cache) to the network socket when serving consumers, bypassing the unnecessary copies through user space.

If you are interested, read the man page of the `sendfile` system call. In most cases, whenever you see something extracting extreme performance, a major chunk of it comes from leveraging the right system call.
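
On the JVM, the sendfile path is exposed through FileChannel.transferTo, which is what Kafka relies on internally. A minimal sketch of the idea; the file name, host, and port are illustrative:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySketch {

    public static void main(String[] args) throws IOException {
        // Hypothetical log segment and destination; replace with real values.
        try (FileChannel file = FileChannel.open(Path.of("segment.log"), StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {

            long position = 0;
            long remaining = file.size();
            while (remaining > 0) {
                // transferTo delegates to sendfile where the OS supports it, so bytes move
                // from the page cache to the socket without entering user space.
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```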

PS: I used this zero-copy technique while building a Remote Shuffle Service for Apache Spark. It proved pivotal in getting great performance while moving multi-TB data across machines.


Uber Use Case –

https://www.linkedin.com/pulse/case-study-kafka-async-queuing-consumer-proxy-vivek-bansal-lt1pc/?trackingId=sXBYzdx7T42SFdmitvQVwQ%3D%3D

Messaging

Index

  • Differences
  • Versions of Apache Kafka

Key Differences:

ActiveMQ vs IBM MQ / WebSphere MQ vs Kafka

Kafka Consumption Optimisation

  • Kafka parameters & Performance Optimization

The following Kafka parameters can be traded off against one another for performance:

  1. Partition: a partition is a logical unit of storage for messages. Each topic in Kafka can be divided into one or more partitions. Messages are stored in order within each partition, and each message is assigned a unique identifier called an offset.
  2. Number of brokers:
  3. Number of consumer instances, or the number of pods on which these instances run
  4. Concurrency:
  5. Consumer group:
    • Use a consumer group to scale out consumption. This lets you distribute the load of consuming messages across multiple consumers, which can improve throughput (see the sketch after this list).
  6. Fetch size of batched data:
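
A minimal sketch of scaling out with a consumer group: every instance started with the same group.id gets a share of the topic's partitions. The broker address, group name, and topic name are illustrative:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumerSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "order-processors");          // hypothetical group: run N copies of this process
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));           // hypothetical topic
            while (true) {
                // Kafka assigns a subset of the topic's partitions to this instance;
                // starting more instances with the same group.id rebalances the load across them.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```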

Optimal Partition Configuration-

Increase the number of partitions. This allows more consumers to read messages in parallel, which improves throughput. So should partitions and consumers be in a 1:1 ratio for better performance?
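
A hedged sketch of increasing a topic's partition count with the AdminClient; the topic name and target count are illustrative assumptions:

```java
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

public class IncreasePartitionsSketch {

    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Raise the partition count of a hypothetical "orders" topic to 40.
            // Note: this only adds partitions; existing keys may map to different partitions afterwards,
            // so key-based ordering guarantees need to be reconsidered.
            admin.createPartitions(Map.of("orders", NewPartitions.increaseTo(40))).all().get();
        }
    }
}
```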

Note: Kafka-related bottlenecks usually do not occur while pushing data, because the ingest rate depends on how fast the external source generates it. Bottlenecks occur when there is a huge amount of data on a topic and limited consumer capacity (instances, resources, consumption configuration, etc.).

Use cases:

Case 1: The Kafka consumer is struggling to keep up with the incoming data (suppose a lag of 170 million events). To decrease the lag and improve the performance of your Kafka setup, you can consider the following steps:

  1. Consumer Configuration:
    • Increase the number of consumer instances to match the partition count. Since you have 40 partitions, consider having up to 40 consumer instances; each partition is then consumed by a separate consumer, maximizing parallelism and throughput (instances beyond the partition count would sit idle).
    • Tune the consumer configuration parameters to optimize performance. Specifically, consider adjusting the fetch.min.bytes, fetch.max.wait.ms, max.poll.records, and max.partition.fetch.bytes settings to balance the trade-off between latency and throughput. Experiment with different values to find the optimal configuration for your use case (see the configuration sketch after this list).
  2. Partition Configuration:
    • Assess the data distribution pattern to ensure an even distribution across partitions. If the data is skewed towards certain partitions, consider implementing a custom partitioner or using a key-based partitioning strategy to distribute the load more evenly.
    • If you anticipate further data growth or increased load, you might consider increasing the number of partitions. However, adding partitions to an existing Kafka topic requires careful planning, as it can have implications for ordering guarantees and consumer offsets.
  3. Cluster Capacity:
    • Evaluate the overall capacity and performance of your Kafka cluster. Ensure that your brokers have sufficient CPU, memory, and disk I/O resources to handle the volume of data and consumer concurrency.
    • Monitor the broker metrics to identify any potential bottlenecks. Consider scaling up your cluster by adding more brokers if necessary.
  4. Monitoring and Alerting:
    • Implement robust monitoring and alerting systems to track lag, throughput, and other relevant Kafka metrics. This enables you to proactively identify issues and take appropriate actions.
  5. Consumer Application Optimization:
    • Review your consumer application code for any potential performance bottlenecks. Ensure that your code is optimized, handles messages efficiently, and avoids any unnecessary delays or blocking operations.
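
A hedged sketch of the consumer settings mentioned in step 1; the values are illustrative starting points to experiment with, not recommendations:

```java
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TunedConsumerConfigSketch {

    public static KafkaConsumer<String, String> build() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "lag-recovery-group");        // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        // Trade latency for throughput: let the broker accumulate bigger batches before responding.
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1_048_576);            // 1 MiB per fetch, illustrative
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);                // max wait before a smaller fetch
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1000);                // records returned per poll()
        props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 2_097_152);  // 2 MiB per partition, illustrative

        return new KafkaConsumer<>(props);
    }
}
```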

Spring Kafka

Index

  1. Resources
    • v3.1 features
  2. Producer
  3. Consumer
    • consumer variations -8
    • consumer factory
  4. Todo
  5. Findings/Answers

API Docs:

  1. https://docs.spring.io/spring-kafka/docs/current/api/

For new features added in a specific version of spring-kafka, refer to:

  1. https://docs.spring.io/spring-kafka/docs/ [if the version is not known, pick it from the link below: select version > References > HTML]
  2. https://spring.io/projects/spring-kafka#learn

Notes to implement for performance:

https://spring.io/projects/spring-kafka#learn

LinkedIn:

13 ways to learn Kafka:

  1. Tutorial: Official Apache Kafka Quickstart – https://lnkd.in/eVrMwgCw
  2. Documentation: Official Apache Kafka Documentation – https://lnkd.in/eEU2sZvq
  3. Tutorial: Kafka Learning with RedHat – https://lnkd.in/em-wsvDt
  4. Read: Kafka – The Definitive Guide: Real-Time Data and Stream Processing at Scale – https://lnkd.in/ez3aCVsH
  5. Course: Apache Kafka Essential Training: Getting Started – https://lnkd.in/ettejx2w
  6. Read: Kafka in Action – https://lnkd.in/ed7ViYQZ
  7. Course: Apache Kafka Deep Dive – https://lnkd.in/ekaB9mv6
  8. Read: Apache Kafka Quick Start Guide – https://lnkd.in/e-3pSXnu
  9. Course: Learn Apache Kafka for Beginners – https://lnkd.in/ewh6uUyT
  10. Course: Apache Kafka Crash Course for Java and Python Developers – https://lnkd.in/e72AHUY4
  11. Read: Mastering Kafka Streams and ksqlDB: Building real-time data systems by example – https://lnkd.in/eqr_DaY2
  12. Course: Deploying and Running Apache Kafka on Kubernetes – https://lnkd.in/ezQ58usN
  13. Course: Stream Processing Design Patterns with Kafka Streams – https://lnkd.in/egrks3rn

Spring Kafka 3.1 features –

  1. Micrometer observations –
  2. Same broker for multiple test cases
  3. Retryable topic changes are permanent.
  4. KafkaTemplate supporting CompletableFuture instead of ListenableFuture (see the sketch after this list).
  5. Testing Changes
    • Since 3.0.1, the embedded test broker's address is applied to spring.kafka.bootstrap-servers by default, so tests connect to the embedded broker.
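
For item 4 above, a minimal sketch of the CompletableFuture-based send API in Spring for Apache Kafka 3.x; the topic name and the template wiring are assumed to exist elsewhere in the application:

```java
import java.util.concurrent.CompletableFuture;

import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.support.SendResult;

public class SendSketch {

    private final KafkaTemplate<String, String> kafkaTemplate;

    public SendSketch(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void send(String payload) {
        // Since Spring for Apache Kafka 3.0, send() returns a CompletableFuture instead of a ListenableFuture.
        CompletableFuture<SendResult<String, String>> future =
                kafkaTemplate.send("demo-topic", payload);   // "demo-topic" is a hypothetical topic

        future.whenComplete((result, ex) -> {
            if (ex == null) {
                System.out.println("Sent to offset " + result.getRecordMetadata().offset());
            } else {
                System.err.println("Send failed: " + ex.getMessage());
            }
        });
    }
}
```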

References: https://docs.spring.io/spring-kafka/docs/current/reference/html/

Points :

  1. Starting with version 2.5, the broker can be changed at runtime – see the section "Connecting to Kafka".
    • Support for ABSwitchCluster – only one cluster is active at a time (see the sketch below).
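
A minimal sketch of wiring an ABSwitchCluster into a producer factory, loosely following the pattern described in the "Connecting to Kafka" section of the reference docs; the bootstrap server addresses are illustrative assumptions:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.ABSwitchCluster;
import org.springframework.kafka.core.DefaultKafkaProducerFactory;
import org.springframework.kafka.core.ProducerFactory;

@Configuration
public class ClusterSwitchConfig {

    @Bean
    public ABSwitchCluster switchCluster() {
        // Primary and secondary bootstrap servers; only one cluster is active at a time.
        return new ABSwitchCluster("primary1:9092,primary2:9092", "secondary1:9092,secondary2:9092");
    }

    @Bean
    public ProducerFactory<String, String> producerFactory(ABSwitchCluster switchCluster) {
        Map<String, Object> configs = new HashMap<>();
        configs.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        configs.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        DefaultKafkaProducerFactory<String, String> pf = new DefaultKafkaProducerFactory<>(configs);
        // The factory resolves bootstrap servers from the supplier; calling switchCluster.secondary()
        // fails over to the other cluster, and primary() switches back.
        pf.setBootstrapServersSupplier(switchCluster);
        return pf;
    }
}
```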