In this post, we have put together the best Kafka interview questions for beginner, intermediate, and experienced candidates. Use them for a quick review before an interview, or as a detailed guide to the Kafka topics interviewers look for.
Explain how you can improve the throughput of a remote consumer?
If the consumer is located in a different data center from the broker, you may need to tune the socket buffer size to amortize the long network latency.
What is the retention policy for Kafka records in a Kafka cluster?
A Kafka cluster retains all published records for a configurable retention period, whether or not they have been consumed. For example, if the retention period is set to one week, each record is stored for one week after its creation and then becomes eligible for deletion. Consumers can therefore access a record for one week after it was created.
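The retention rule described above can be sketched in a few lines of Python. This is a simplified model, not Kafka's implementation: the `Record` class and `apply_retention` function are hypothetical names, and a real broker removes whole log segments rather than individual records.

```python
from dataclasses import dataclass

ONE_DAY = 24 * 60 * 60
ONE_WEEK = 7 * ONE_DAY


@dataclass
class Record:
    offset: int
    created_at: float  # seconds since some reference time
    value: str


def apply_retention(log, now, retention_seconds):
    """Keep only records younger than the retention period.

    Whether a record has already been consumed plays no part in the decision.
    """
    return [r for r in log if now - r.created_at < retention_seconds]


log = [
    Record(0, created_at=0 * ONE_DAY, value="a"),
    Record(1, created_at=3 * ONE_DAY, value="b"),
]

# Eight days after the first record was written, it is older than one week
# and is dropped; the second record is only five days old and is kept.
remaining = apply_retention(log, now=8 * ONE_DAY, retention_seconds=ONE_WEEK)
print([r.offset for r in remaining])
```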
What are the core APIs provided in the Kafka platform?
Kafka provides the following core APIs:
- Producer API - An application uses the Kafka producer API to publish a stream of records to one or more Kafka topics.
- Consumer API - An application uses the Kafka consumer API to subscribe to one or more Kafka topics and consume streams of records.
- Streams API - An application uses the Kafka Streams API to consume input streams from one or more Kafka topics, process and transform the input data, and produce output streams to one or more Kafka topics.
- Connect API - An application uses the Kafka Connect API to build reusable producers and consumers (connectors) that link Kafka topics to existing applications or data systems.
Compare: RabbitMQ vs Apache Kafka
RabbitMQ is one of the alternatives to Apache Kafka. Let’s compare the two:
- Features: Apache Kafka is distributed, durable, and highly available; data is partitioned across brokers and replicated. RabbitMQ also supports clustering and replicated queues, but it is built around broker-managed queues rather than a partitioned log that consumers can re-read.
- Performance rate: Apache Kafka is commonly quoted at around 100,000 messages/second, and RabbitMQ at around 20,000 messages/second. These are rough figures that depend heavily on hardware, message size, and configuration.
Explain the role of the offset in Kafka.
Messages are stored in partitions, and each one is assigned a unique ID for quick and easy access. That unique number is called the offset, and it identifies each message within the partition.
What is the difference between Apache Kafka and Apache Storm?
- Apache Kafka: It is a distributed and robust messaging system that can handle huge amounts of data and passes messages from one endpoint to another.
- Apache Storm: It is a real-time message processing system in which you can edit or manipulate data in real time. Apache Storm pulls the data from Kafka and applies the required manipulation.
What do you know about a partition key?
A partition key determines the destination partition of a message in the Kafka producer. By default, a hash-based partitioner maps the key to a partition ID; users can also plug in custom partitioners.
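A hash-based partitioner of the kind described above can be sketched as follows. This is an illustration only: `choose_partition` is a hypothetical helper, and CRC32 is used as a stand-in hash because it ships with Python. Kafka's own default partitioner uses a different hash function (murmur2), so the partition numbers here will not match a real cluster.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition ID in the range [0, num_partitions)."""
    # CRC32 is a stand-in hash for illustration, not the one Kafka uses.
    return zlib.crc32(key) % num_partitions


# The same key always maps to the same partition, which is what keeps
# all messages for one key in order within a single partition.
first = choose_partition(b"customer-42", 6)
second = choose_partition(b"customer-42", 6)
print(first == second)
```

A custom partitioner would replace the body of `choose_partition` with whatever routing rule the application needs, as long as it returns a valid partition ID.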
Explain the role of Streams API?
The Streams API permits an application to act as a stream processor: it consumes an input stream from one or more topics, transforms it, and produces an output stream to one or more output topics.
How do you balance load in Kafka when one server fails?
Every partition in Kafka has one main server that plays the role of leader and zero or more other servers that act as followers. The leader handles the requests for the partition and the followers replicate it. If the leader fails, one of the followers takes over as the new leader.
Within the producer, when does a “queue full” situation occur?
A queue-full condition occurs when the producer sends messages faster than the brokers can handle them. Adding brokers spreads the load and relieves the condition.
Explain the term “Log Anatomy”.
We view the log as the partitions. A data source writes messages to the log, and at any time one or more consumers can read from the log at the point they select.
What is multi-tenancy?
Kafka can be deployed easily as a multi-tenant solution. Multi-tenancy is enabled by configuring which topics can produce or consume data. Kafka also provides operational support for quotas.
What do you mean by Stream Processing in Kafka?
Processing data continuously, in real time, concurrently, and record by record is what we call stream processing in Kafka.
If the replica stays out of the ISR for a very long time, then what does it tell us?
If a replica stays out of the ISR for a very long time, it means that the follower cannot fetch data as fast as the leader is accumulating it. In other words, the follower is not able to keep up with the leader.
Do you know how to improve the throughput of the remote consumer?
If the consumer is located far from the broker, for example in a different data center, tune the socket buffer size to amortize the long network latency and improve the overall throughput of the remote consumer.
When do you call the cleanup method?
In Apache Storm, the cleanup method is called when a Bolt is being shut down, and it should clean up any resources that were opened. There’s no guarantee that this method will be called on the cluster: for instance, if the machine the task is running on blows up, there’s no way to invoke the method. The cleanup method is intended for when you run topologies in local mode (where a Storm cluster is simulated in process) and you want to be able to run and kill many topologies without suffering any resource leaks.
Why is replication considered critical in Kafka?
Replication ensures that published messages are not lost and can still be consumed in the event of a machine error, a program fault, or frequent software upgrades.
State Disadvantages of Apache Kafka.
Limitations of Kafka are:
- No Complete Set of Monitoring Tools
- Issues with Message Tweaking
- No Support for Wildcard Topic Selection
- Lack of Pace
How to balance loads in Kafka when one server fails?
Every partition in Kafka has one main server that plays the role of leader and zero or more other servers called followers. The leader handles the requests for the partition, and the rest of the servers simply follow it. If the leader fails, one of the followers takes over the responsibility of the main server.
How to start a Kafka server?
Because Kafka uses ZooKeeper, you have to start a ZooKeeper server first. You can use the convenience script packaged with Kafka to get a crude but effective single-node ZooKeeper instance:
> bin/zookeeper-server-start.sh config/zookeeper.properties
Now the Kafka server can be started:
> bin/kafka-server-start.sh config/server.properties
What ensures load balancing of the server in Kafka?
The Leader handles all read and write requests for the partition, whereas Followers passively replicate the leader. If the Leader fails, one of the Followers takes over its role. This failover process is what keeps the load balanced across the servers.
What roles do Replicas and the ISR play?
Replicas are the list of nodes that replicate the log for a particular partition, irrespective of whether they play the role of the Leader. ISR stands for In-Sync Replicas: the set of replicas that are synced with the leader.
What is the way to send large messages with Kafka?
To send large messages using Kafka, you must adjust a few properties so that every component accepts the larger size. The properties that need changes are:
- At the consumer end – fetch.message.max.bytes
- At the broker end, for replication – replica.fetch.max.bytes
- At the broker end, for accepting a message – message.max.bytes
- At the broker end, per topic – max.message.bytes
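One way to reason about the properties listed above is as a consistency check: the fetch sizes used by replicas and consumers should be at least as large as the biggest message the broker accepts, otherwise a large message can be written but not fetched. The sketch below encodes that rule of thumb in Python. The `check_size_settings` helper and the byte values are hypothetical, and the check is a simplification that ignores the per-topic override.

```python
def check_size_settings(settings):
    """Return warnings for fetch sizes smaller than the broker's message limit."""
    broker_limit = settings["message.max.bytes"]
    warnings = []
    for name in ("replica.fetch.max.bytes", "fetch.message.max.bytes"):
        if settings[name] < broker_limit:
            warnings.append(
                f"{name} ({settings[name]}) is below "
                f"message.max.bytes ({broker_limit})"
            )
    return warnings


# Illustrative values only: the broker accepts 5 MB messages, but the
# consumer fetch size was left at 1 MB.
settings = {
    "message.max.bytes": 5_000_000,
    "replica.fetch.max.bytes": 5_000_000,
    "fetch.message.max.bytes": 1_000_000,
}

for warning in check_size_settings(settings):
    print(warning)
```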
How is Kafka used as a stream processing?
Kafka can be used to consume continuous streams of live data from input Kafka topics, perform processing on this live data, and then output the continuous stream of processed data to output Kafka topics. For performing complex transformations on the live data, Kafka provides a fully integrated Streams API.
What benefits does Kafka offer over other messaging services such as JMS and RabbitMQ?
Kafka is a key messaging framework nowadays, not only because of its features but also because of its reliable transmission of messages from sender to receiver. The key points to consider are:
- Reliability − Kafka provides reliable delivery from publisher to subscriber and, when configured with replication, is designed to avoid message loss.
- Scalability − Kafka achieves this by clustering brokers, coordinated by a ZooKeeper server.
- Durability − Because Kafka uses a distributed log, messages are persisted on disk.
- Performance − Kafka provides high throughput and low latency for publish and subscribe applications.
Considering these features, Kafka is one of the best options in big data technologies for handling large volumes of messages with smooth delivery.
Where is the meta information about topics stored in a Kafka cluster?
In ZooKeeper-based clusters, ZooKeeper stores the information about topics: the number of partitions in a topic, which node is the leader of each partition, which nodes hold the replicas of each partition, and so on.
Describe scalability in the context of Apache Kafka.
Apache Kafka can be scaled out by adding nodes, without incurring any downtime.
What is the main difference between Kafka and Flume?
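Both are used for real-time data movement, but Kafka is a general-purpose, scalable system that ensures message durability, while Flume is a special-purpose tool aimed mainly at ingesting data into Hadoop.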
Would it be possible to use Kafka without the zookeeper?
In older versions, no: Kafka depends on ZooKeeper for cluster coordination, and if ZooKeeper is down the cluster cannot serve client requests normally. Newer Kafka releases can run without ZooKeeper by using the built-in KRaft mode, which became production-ready in Kafka 3.3; ZooKeeper support was removed entirely in Kafka 4.0.
Is message duplication necessary or unnecessary in Apache Kafka?
Replicating messages in Apache Kafka is a good practice. It ensures that messages are not lost even if the main (leader) server suffers a failure.
What are Kafka Topics?
Kafka Topics are categories or feeds to which data streams or data records are published to. Kafka producers publish data records to the Kafka topics and Kafka consumers consume the data records from the Kafka topics.
Describe high-throughput in the context of Apache Kafka.
Apache Kafka does not need substantially large hardware. It is capable of handling very high-velocity, high-volume data and can sustain a throughput of thousands of messages per second. In summary, Apache Kafka is very fast and efficient.
Explain the functionality of the Connector API in Kafka?
The Connector API allows applications to stay connected to external systems and keep track of the changes that happen within them. It does this through reusable producers and consumers that stay connected to the Kafka topics.
What are the real-world use cases of Kafka that set it apart from other messaging frameworks?
Kafka fits many real-world applications. The most frequent use cases are listed below:
- Metrics: Kafka is used for monitoring operational data, gathering data from distributed systems for analysis or statistical operations.
- Log aggregation: Kafka can be used across an organization to collect logs from multiple services, which consumer services then read to perform analysis.
- Stream processing: Kafka’s strong durability is also very useful in the context of stream processing.
- Asynchronous communication: In microservices, keeping a large system synchronous is undesirable, because it can render the entire application unresponsive and defeat the purpose of dividing it into microservices in the first place. Kafka makes the data flow easier because it is distributed, highly fault-tolerant, and constantly monitors broker nodes through services like ZooKeeper.
- Chat bots: Chat bots are a popular use case that requires reliable messaging for smooth delivery.
- Multi-tenant solution: Multi-tenancy is enabled by configuring which topics can produce or consume data. There is also operational support for quotas.
These are the predominant use cases for Kafka; there are others, depending on requirements and design.
What are the main features of Kafka that make it suitable for data integration and real-time processing?
Some of the standout features that make Kafka popular worldwide include data partitioning, scalability, low latency, and high throughput. These features are why Kafka has become a preferred choice for data integration and real-time data processing.
Explain what geo-replication is within Apache Kafka.
Apache Kafka MirrorMaker provides geo-replication for Kafka clusters. With MirrorMaker, messages are replicated across multiple data centers or cloud regions. Geo-replication can be used in active/passive scenarios for backup and recovery, or in active/active scenarios to place data closer to users and support data locality requirements.
Explain the term “Topic Replication Factor”.
The replication factor of a topic is the number of copies of each partition that are kept on different brokers. It is very important to factor in topic replication while designing a Kafka system: if a broker goes down, the replicas of its partitions on other brokers can take over.
What are the three main system tools within Apache Kafka?
The three main system tools in Apache Kafka are the Kafka Migration Tool, the Consumer Offset Checker, and MirrorMaker. The Kafka Migration Tool is used to migrate a broker from one version to another. The Consumer Offset Checker shows the topic, partitions, and owner for a specified set of topics and consumer group. MirrorMaker is used to mirror one Apache Kafka cluster to another.
What is the maximum message size that can be handled and received by Apache Kafka?
By default, the maximum message size that Apache Kafka will accept is approximately one million bytes, or one megabyte. This limit is configurable.
What does it indicate if replica stays out of ISR for a long time?
If a replica remains out of the ISR for an extended time, it indicates that the follower is unable to fetch data as fast as data accumulates at the leader.
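The idea of a follower falling out of the ISR can be illustrated with a small model. The sketch below assumes a time-based rule: a replica counts as in sync if it was last fully caught up with the leader within a maximum lag, which is the role the broker setting replica.lag.time.max.ms plays. The function name, broker IDs, and timestamps are hypothetical.

```python
def in_sync_replicas(last_caught_up_ms, now_ms, max_lag_ms):
    """Return the broker IDs that caught up with the leader recently enough."""
    return sorted(
        broker
        for broker, caught_up_at in last_caught_up_ms.items()
        if now_ms - caught_up_at <= max_lag_ms
    )


# Broker 3 last caught up 45 seconds ago, so with a 30 second limit it has
# fallen behind and is no longer part of the ISR.
last_caught_up_ms = {1: 100_000, 2: 99_500, 3: 55_000}
isr = in_sync_replicas(last_caught_up_ms, now_ms=100_000, max_lag_ms=30_000)
print(isr)
```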
What is multi-tenancy?
Apache Kafka can be deployed as a multi-tenant solution. Multi-tenancy is enabled by configuring which topics can produce or consume data, and Kafka also provides operational support for quotas.
Within the producer, when does a QueueFullException occur?
A QueueFullException occurs when the producer sends messages to the broker faster than the broker can handle them. The producer does not block, so it keeps sending and its queue overflows. To overcome this, add more brokers so that the message flow can be handled collectively.
What are the key components of Kafka?
Kafka consists of the following key components:
- Kafka Cluster - Kafka cluster contains one or more Kafka brokers (servers) and balances the load across these brokers.
- Kafka Broker - A Kafka broker hosts partitions of one or more Kafka topics. Brokers are often described as stateless because they rely on ZooKeeper to maintain cluster state, and a broker can handle terabytes of messages and thousands of reads and writes without a performance impact.
- Kafka Topics - Kafka topics are categories or feeds to which streams of messages are published to. Every topic has an associated log on disk where the message streams are stored.
- Kafka Partitions - A Kafka topic can be split into multiple partitions. Kafka partitions enable the scaling of topics to multiple servers. Kafka partitions also enable parallel consumption of messages from a topic
- Kafka Offsets - Messages in Kafka partitions are assigned a sequential ID number called the offset. The offset identifies each record’s location within the partition. Messages can be retrieved from a partition based on their offset.
- Kafka Producers - Kafka producers are client applications or programs that post messages to a Kafka topic.
- Kafka Consumers - Kafka consumers are client applications or programs that read messages from a Kafka topic.
When does the queue full exception occur in the producer?
A queue full exception typically happens when the producer tries to send messages at a rate the broker cannot handle. Because the producer does not block, users need to add enough brokers to collectively handle the increased load.
In the Producer, when does QueueFullException occur?
QueueFullException typically occurs whenever the Kafka producer attempts to send messages at a pace that the broker cannot handle at that time. Since the producer doesn’t block, users will need to add enough brokers to collaboratively handle the increased load.
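The mechanism behind this exception, a non-blocking sender filling a bounded buffer, can be shown with Python's standard `queue` module. This is an analogy rather than Kafka client code: `queue.Full` stands in for QueueFullException, and the buffer size of 3 is an arbitrary assumption.

```python
import queue

# A bounded buffer that nothing is draining, standing in for a producer
# queue whose broker cannot keep up.
buffer = queue.Queue(maxsize=3)

accepted = 0
rejected = 0
for i in range(5):
    try:
        # put_nowait never blocks; it raises queue.Full when the buffer is full.
        buffer.put_nowait(f"message-{i}")
        accepted += 1
    except queue.Full:
        rejected += 1

print(accepted, rejected)  # prints "3 2"
```

Draining the buffer faster (the equivalent of adding brokers) or slowing the sender are the two ways to stop the rejections.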
When not to use Apache Kafka?
- Kafka doesn't number the messages. It has a notion of “offset” inside the log, which identifies the messages.
- Consumers consume data from topics, but the broker does not track delivery of individual messages to individual consumers. Each consumer or consumer group has to keep track of its own position in the topic.
- There are no random reads from Kafka. The consumer has to specify an offset for the topic, and Kafka serves the messages in order starting from that offset.
- Kafka does not offer the ability to delete individual messages. A message stays in the log until it expires (that is, until the configured retention time has passed).
What is the role of the ZooKeeper in Kafka?
Apache Kafka is a distributed system that was built to use ZooKeeper. ZooKeeper’s main role is to coordinate the different nodes in a cluster. In older versions, consumers also committed their offsets to ZooKeeper periodically, so that if a node failed, consumption could resume from the previously committed offset.
Explain the role of the offset.
Each message in a partition is given a sequential ID number called the offset. These offsets uniquely identify each message within the partition.
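A minimal in-memory model makes the offset idea concrete. The `PartitionLog` class below is a hypothetical illustration, not a Kafka API: appending returns the next sequential offset, and reading starts at a given offset and proceeds in order.

```python
class PartitionLog:
    """Append-only list of messages, addressed by sequential offset."""

    def __init__(self):
        self._records = []

    def append(self, value):
        # The offset of a new message is its position in the log.
        offset = len(self._records)
        self._records.append(value)
        return offset

    def read_from(self, offset, max_records=10):
        # Return (offset, value) pairs in order, starting at the given offset.
        chunk = self._records[offset:offset + max_records]
        return list(enumerate(chunk, start=offset))


log = PartitionLog()
offsets = [log.append(v) for v in ("a", "b", "c")]
print(offsets)           # prints [0, 1, 2]
print(log.read_from(1))  # prints [(1, 'b'), (2, 'c')]
```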
Describe durability in the context of Apache Kafka.
Messages are durable because Apache Kafka persists them to disk and replicates them across brokers.
Describe low latency in the context of Apache Kafka.
Apache Kafka can handle messages with very low latency, usually in the range of milliseconds.
Explain the role of the Kafka Producer API.
In the legacy Scala client, the role of Kafka’s Producer API is to wrap the two producers, kafka.producer.SyncProducer and kafka.producer.async.AsyncProducer. The goal is to expose all the producer functionality to the client through a single API.
Is Apache Kafka a distributed streaming platform? If yes, what can you do with it?
Yes, Apache Kafka is a distributed streaming platform. A streaming platform has three key capabilities:
- It lets you publish (push) records easily.
- It lets you store large volumes of records without storage problems.
- It lets you process records as they come in.
Is replication critical or simply a waste of time in Kafka?
Replicating messages is a good practice in Kafka; it ensures that messages are not lost even if the main server fails.
Which components are used for stream flow of data?
- Bolt:- Bolts represent the processing logic unit in Storm. One can utilize bolts to do any kind of processing such as filtering, aggregating, joining, interacting with data stores, talking to external systems etc. Bolts can also emit tuples (data messages) for the subsequent bolts to process. Additionally, bolts are responsible to acknowledge the processing of tuples after they are done processing.
- Spout:- Spouts represent the source of data in Storm. You can write spouts to read data from data sources such as database, distributed file systems, messaging frameworks etc. Spouts can broadly be classified into following –
- Reliable:- These spouts have the capability to replay the tuples (a unit of data in data stream). This helps applications achieve ‘at least once message processing’ semantic as in case of failures, tuples can be replayed and processed again. Spouts for fetching the data from messaging frameworks are generally reliable as these frameworks provide the mechanism to replay the messages.
- Unreliable:- These spouts don’t have the capability to replay the tuples. Once a tuple is emitted, it cannot be replayed irrespective of whether it was processed successfully or not. This type of spout follows the ‘at most once message processing’ semantic.
- Tuple:- The tuple is the main data structure in Storm. A tuple is a named list of values, where each value can be any type. Tuples are dynamically typed — the types of the fields do not need to be declared. Tuples have helper methods like getInteger and getString to get field values without having to cast the result. Storm needs to know how to serialize all the values in a tuple. By default, Storm knows how to serialize the primitive types, strings, and byte arrays. If you want to use another type, you’ll need to implement and register a serializer for that type.
How are Kafka Topic partitions distributed in a Kafka cluster?
Partitions of the Kafka Topic logs are distributed over multiple servers in the Kafka cluster. Each partition is replicated across a configurable number of servers for fault tolerance.
Every partition has one server that acts as the 'leader' and zero or more servers that act as 'followers'. The leader handles the reads and writes to a partition, and the followers passively replicate the data from the leader.
If the leader fails, one of the followers automatically takes over the role of 'leader'.
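The failover step described above can be sketched as a small selection function. The rule used here, picking the first replica in assignment order that is both alive and in sync, is a simplifying assumption for illustration, and `elect_leader` and the broker IDs are hypothetical.

```python
def elect_leader(replicas, in_sync, failed):
    """Pick the first replica that is in sync and has not failed.

    Returns None when no eligible replica is left.
    """
    for broker in replicas:
        if broker in in_sync and broker not in failed:
            return broker
    return None


replicas = [1, 2, 3]  # assignment order; broker 1 is the current leader

# Broker 1 fails and all followers were in sync: broker 2 takes over.
print(elect_leader(replicas, in_sync={1, 2, 3}, failed={1}))

# Broker 2 had fallen out of sync, so broker 3 is chosen instead.
print(elect_leader(replicas, in_sync={1, 3}, failed={1}))
```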
Describe fault-tolerance in the context of Apache Kafka.
Probably one of the biggest benefits of Apache Kafka, and one that makes the platform so attractive to tech companies, is its ability to keep data safe in the event of a system failure, major update, or component malfunction. This is known as fault tolerance. Apache Kafka is fault-tolerant because it replicates every message within the system so that a copy survives a malfunction.
Elaborate the architecture of Kafka.
Because Kafka is a distributed system, a cluster contains multiple brokers. Each topic is divided into multiple partitions, and each broker stores one or more of those partitions, so that multiple producers and consumers can publish and retrieve messages at the same time.
What are the key benefits of using storm for real time processing?
- Easy to operate: Operating Storm is quite easy.
- Really fast: Storm’s documentation cites a benchmark of over a million tuples processed per second per node.
- Fault tolerant: It detects faults automatically and restarts the affected workers.
- Reliable: It guarantees that each unit of data will be processed at least once or exactly once.
- Scalable: It runs across a cluster of machines.
How is Kafka used as a storage system?
Kafka has the following data storage capabilities, which make it a good distributed data storage system:
- Replication - Data written to Kafka topics is by design partitioned and replicated across servers for fault tolerance.
- Guaranteed - Kafka lets producers wait for an acknowledgment that is sent only after the data is fully replicated, hence guaranteeing that the data is persisted to the servers.
- Scalability - The way Kafka uses disk structures enables them to scale well. Kafka performs the same irrespective of the size of the persistent data on the server.
- Flexible reads - Kafka enables different consumers to read from different positions on the Kafka topics, hence making Kafka a high-performance, low-latency distributed file system.
What is a broker, and how does Kafka use brokers for communication?
- A broker is a server that is responsible for maintaining the published data.
- Each broker may hold one or more partitions.
- A Kafka cluster contains multiple brokers to balance the load.
- Kafka brokers are often described as stateless, in the sense that they rely on ZooKeeper to maintain cluster state.
- Example: if a topic has N partitions and the cluster has N brokers, then each broker holds one partition.
What Is ZeroMQ?
ZeroMQ is “a library which extends the standard socket interfaces with features traditionally provided by specialized messaging middleware products”. Storm relies on ZeroMQ primarily for task-to-task communication in running Storm topologies.
How do you send messages to a Kafka topic using Kafka command line client?
Kafka comes with a command line client and a producer script, kafka-console-producer.sh, that can be used to take messages from standard input on the console and post them as messages to a Kafka topic.
How are the messages consumed by a consumer in Kafka?
Kafka transfers messages to consumers using the sendfile API. This moves the bytes from disk to the socket within kernel space, saving the extra copies and the calls back and forth between kernel space and user space.
Explain how you can reduce churn in ISR? When does broker leave the ISR?
The ISR is the set of replicas that are completely synced with the leader; in other words, the ISR has all messages that are committed. The ISR should always include all replicas until there is a real failure. A replica is dropped from the ISR if it falls too far behind the leader.
What happens if the preferred replica is not in the ISR?
If the preferred replica is not in the ISR, the controller will fail to move leadership to the preferred replica.
How can you explain the Kafka architecture?
Kafka relies on a distributed design in which one cluster has multiple brokers (servers) associated with it. A topic is divided into multiple partitions that store the messages, and a consumer group fetches the messages from the brokers.
What is a consumer group in Kafka?
A consumer group is made up of one or more consumers that together subscribe to the various topics and fetch data from the brokers.
What is the replica? What does it do?
A replica is a copy of a partition’s log. The replicas of a partition are the list of nodes that replicate its log, regardless of whether they actually play the role of the leader.
Explain the concept of Leader and Follower.
Every partition in Kafka has one server that plays the role of Leader and zero or more servers that act as Followers. The Leader handles all read and write requests for the partition, while the Followers passively replicate the leader. If the Leader fails, one of the Followers takes on the role of Leader. This ensures load balancing across the servers.
How can you get exactly-once messaging from Kafka during data production?
To get exactly-once messaging from Kafka, you have to do two things: avoid duplicates during data consumption and avoid duplicates during data production. Here are two ways to get exactly-once semantics during data production:
- Use a single writer per partition, and every time you get a network error, check the last message in that partition to see whether your last write succeeded.
- Include a primary key (a UUID or something similar) in the message and de-duplicate on the consumer.
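The second approach, de-duplicating on the consumer by a key carried in each message, can be sketched as follows. This is a minimal in-memory illustration: the `deduplicate` function and the `(message_id, payload)` tuple layout are assumptions, and a real consumer would need to bound or persist the set of seen IDs.

```python
import uuid


def deduplicate(messages):
    """Yield each payload once, skipping messages whose ID was already seen."""
    seen = set()
    for message_id, payload in messages:
        if message_id in seen:
            continue  # a retry of a message that was already processed
        seen.add(message_id)
        yield payload


first_id = str(uuid.uuid4())
second_id = str(uuid.uuid4())

# The producer retried the first message after a network error,
# so it appears twice in the stream.
stream = [
    (first_id, "order created"),
    (first_id, "order created"),
    (second_id, "order shipped"),
]

print(list(deduplicate(stream)))  # prints ['order created', 'order shipped']
```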
Why is Kafka preferred over traditional message transfer techniques?
Kafka is more scalable, faster, more robust, and distributed by design.
You have tested that a Kafka cluster with five nodes is able to handle ten million messages per minute. Your input is likely to increase to twenty-five million messages per minute. How many more nodes should be added to the cluster?
- A: 15
- B: 13
- C: 8
- D: 5
Answer: C
Explanation: Since Kafka is horizontally scalable, handling 25 million messages per minute will need 13 machines, or 8 more machines.
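The arithmetic behind this answer can be checked with a short calculation. It assumes throughput scales linearly with the number of nodes, which is the premise of the question; `additional_nodes` is a hypothetical helper name.

```python
import math


def additional_nodes(current_nodes, current_rate, target_rate):
    """Number of nodes to add, assuming throughput scales linearly with nodes."""
    # Nodes needed in total, rounded up because a fraction of a node
    # cannot be deployed.
    needed = math.ceil(target_rate * current_nodes / current_rate)
    return max(0, needed - current_nodes)


# 5 nodes handle 10 million messages per minute, so each handles 2 million.
# 25 million / 2 million = 12.5, which rounds up to 13 nodes in total.
print(additional_nodes(5, 10_000_000, 25_000_000))  # prints 8
```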
Which of the following is guaranteed by Kafka?
- A: A consumer instance gets the messages in the same order as they are produced.
- B: A consumer instance is guaranteed to get all the messages produced.
- C: No two consumer instances will get the same message
- D: All consumer instances will get all the messages
When messages pass from producer to broker to consumer, data modification is minimized by using:
- A: Message compression
- B: Message sets
- C: Binary message format
- D: Partitions
Answer: C
Explanation: A binary message format ensures that a consistent format is used by all three processes.
Which is the configuration file for setting up ZooKeeper properties in Kafka?
- A: zookeeper.xml
- B: zookeeper.properties
- C: zk.yaml
- D: kafka.zk.properties
Which of the following best describes the relationship between ZooKeeper and partial failures?
- A: ZooKeeper eliminates partial failures
- B: ZooKeeper causes partial failures
- C: ZooKeeper detects partial failures
- D: ZooKeeper provides a mechanism for handling partial failures
Answer: D
Explanation: ZooKeeper only provides a mechanism to handle partial failures.
The znodes that continue to exist even after the creator of the znode dies are called:
- A: ephemeral nodes
- B: persistent nodes
- C: sequential nodes
- D: pure nodes
Answer: B
Explanation: Unlike ephemeral nodes, persistent znodes continue to exist unless explicitly deleted.
Why is replication necessary in Kafka? Because it ensures that...
- A: A published message will not be lost
- B: A published message will not be saved
- C: A published message will not be deleted
- D: A published message will not be sent
A Kafka topic is setup with a replication factor of 5. Out of these, 2 nodes in the cluster have failed. Business users are concerned that they may lose messages. What do you tell them?
- A: They need to stop sending messages till you bring up the 2 servers
- B: They need to stop sending messages till you bring up at least one server
- C: They can continue to send messages as there is fault tolerance of 4 server failures.
- D: They can continue to send messages as you are keeping a tape back up of all the messages
Answer: C
Explanation: Fault tolerance is n - 1, so they don't have to worry about losing messages.
How many brokers will be marked as leaders for a partition?
- A: Zero
- B: One
- C: Five
- D: All running brokers
Which server should be started before starting Kafka server?
- A: ZooKeeper server
- B: Kafka Producer
- C: Kafka Consumer
- D: Kafka Topic
Kafka maintains feeds of messages in categories called
- A: Topics
- B: Chunks
- C: Domains
- D: Messages