Kafka Interview Questions
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. The project is maintained by the Apache Software Foundation and is written in Java and Scala.
 
Kafka provides three main functions to its users :
 
* Publish and subscribe to streams of records.
* Effectively store streams of records in the order in which records were generated.
* Process streams of records in real time.
 
Kafka is primarily used to build real-time streaming data pipelines and applications that adapt to the data streams. It combines messaging, storage, and stream processing to allow storage and analysis of both historical and real-time data.
Following are the key features of Kafka:
 
* Kafka is an Apache Software Foundation project written in Java and Scala.

* Kafka is a publish-subscribe messaging system built for high throughput and fault tolerance.

* Kafka has a built-in partition system known as a Topic.

* Kafka includes a replication feature as well.

* Kafka provides a queue that can handle large amounts of data and move messages from one sender to another.

* Kafka can also save the messages to storage and replicate them across the cluster.

* Kafka collaborates with ZooKeeper to coordinate and synchronize with other services.

* Kafka integrates well with Apache Spark.
The most important elements of Kafka are as follows:
 
Topic : A topic is a named feed or category to which messages of a similar kind are published.

Producer : Producers publish (issue) messages to a topic.

Consumer : Consumers subscribe to one or more topics and read data from the brokers.

Broker : Brokers are the servers on which the published messages are stored.
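To make these elements concrete, here is a minimal sketch of a Java producer publishing a record to a topic via a broker. The broker address (localhost:9092) and the topic name (user-activity) are illustrative assumptions, not values taken from this article.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Broker address is a placeholder for this sketch.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // The producer issues a record to the "user-activity" topic; the broker stores it.
            producer.send(new ProducerRecord<>("user-activity", "user-42", "clicked-home-page"));
        }
    }
}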
Kafka is used to build real-time streaming data pipelines and real-time streaming applications. A data pipeline reliably processes and moves data from one system to another, and a streaming application is an application that consumes streams of data.

For example, if you want to create a data pipeline that takes in user activity data to track how people use your website in real-time, Kafka would be used to ingest and store streaming data while serving reads for the applications powering the data pipeline. Kafka is also often used as a message broker solution, which is a platform that processes and mediates communication between two applications.
Kafka combines two messaging models, queuing and publish-subscribe, to provide the key benefits of each to consumers. Queuing allows for data processing to be distributed across many consumer instances, making it highly scalable. However, traditional queues aren’t multi-subscriber.

The publish-subscribe approach is multi-subscriber, but because every message goes to every subscriber it cannot be used to distribute work across multiple worker processes. Kafka uses a partitioned log model to stitch together these two solutions.

A log is an ordered sequence of records, and these logs are broken up into partitions that correspond to different subscribers. This means that there can be multiple subscribers to the same topic, and each is assigned a partition to allow for higher scalability.

Finally, Kafka’s model provides replayability, which allows multiple independent applications reading from data streams to work independently at their own rate.
 
[Diagram: the queuing model vs. the publish-subscribe model in Apache Kafka]
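In code, the two models map onto consumer groups: consumers that share a group.id divide a topic's partitions among themselves (queuing), while consumers in different groups each receive the full stream (publish-subscribe). Below is a minimal, illustrative sketch; the broker address, group id, and topic name are assumptions.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        // Consumers sharing this group.id split the topic's partitions among themselves
        // (queuing); a second group with a different id would also receive every record
        // (publish-subscribe).
        props.put("group.id", "analytics-service");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("user-activity")); // assumed topic
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
        }
    }
}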
Event streaming is the digital equivalent of the human body's central nervous system. It is the technological foundation for the 'always-on' world where businesses are increasingly software-defined and automated, and where the user of software is more software.
 
Technically speaking, event streaming is the practice of capturing data in real-time from event sources like databases, sensors, mobile devices, cloud services, and software applications in the form of streams of events; storing these event streams durably for later retrieval; manipulating, processing, and reacting to the event streams in real-time as well as retrospectively; and routing the event streams to different destination technologies as needed. Event streaming thus ensures a continuous flow and interpretation of data so that the right information is at the right place, at the right time.
Event streaming is applied to a wide variety of use cases across a plethora of industries and organizations. Its many examples include :
 
* To process payments and financial transactions in real-time, such as in stock exchanges, banks, and insurance companies.

* To track and monitor cars, trucks, fleets, and shipments in real-time, such as in logistics and the automotive industry.

* To continuously capture and analyze sensor data from IoT devices or other equipment, such as in factories and wind parks.

* To collect and immediately react to customer interactions and orders, such as in retail, the hotel and travel industry, and mobile applications.

* To monitor patients in hospital care and predict changes in condition to ensure timely treatment in emergencies.

* To connect, store, and make available data produced by different divisions of a company.

* To serve as the foundation for data platforms, event-driven architectures, and microservices.
Kafka combines three key capabilities so you can implement your use cases for event streaming end-to-end with a single battle-tested solution:
 
* To publish (write) and subscribe to (read) streams of events, including continuous import/export of your data from other systems.

* To store streams of events durably and reliably for as long as you want.

* To process streams of events as they occur or retrospectively.

And all this functionality is provided in a distributed, highly scalable, elastic, fault-tolerant, and secure manner. Kafka can be deployed on bare-metal hardware, virtual machines, and containers, and on-premises as well as in the cloud. You can choose between self-managing your Kafka environments and using fully managed services offered by a variety of vendors.
Kafka is a distributed system consisting of servers and clients that communicate via a high-performance TCP network protocol. It can be deployed on bare-metal hardware, virtual machines, and containers in on-premise as well as cloud environments.
 
Servers : Kafka is run as a cluster of one or more servers that can span multiple datacenters or cloud regions. Some of these servers form the storage layer, called the brokers. Other servers run Kafka Connect to continuously import and export data as event streams to integrate Kafka with your existing systems such as relational databases as well as other Kafka clusters. To let you implement mission-critical use cases, a Kafka cluster is highly scalable and fault-tolerant: if any of its servers fails, the other servers will take over their work to ensure continuous operations without any data loss.
 
Clients : They allow you to write distributed applications and microservices that read, write, and process streams of events in parallel, at scale, and in a fault-tolerant manner even in the case of network problems or machine failures. Kafka ships with some such clients included, which are augmented by dozens of clients provided by the Kafka community: clients are available for Java and Scala including the higher-level Kafka Streams library, for Go, Python, C/C++, and many other programming languages as well as REST APIs.
Following is a list of key benefits of Apache Kafka over other traditional messaging techniques:
 
Kafka is Fast : Kafka decouples data streams so there is very low latency, making it extremely fast.
 
Kafka is Scalable : Kafka’s partitioned log model allows data to be distributed across multiple servers, making it scalable beyond what would fit on a single server. 

Kafka is Durable : Partitions are distributed and replicated across many servers, and the data is all written to disk. This helps protect against server failure, making the data very fault-tolerant and durable. 
 
Kafka is Distributed by Design : Kafka provides fault tolerance features, and its distributed design also guarantees durability.
Kafka remedies the two different models by publishing records to different topics. Each topic has a partitioned log, which is a structured commit log that keeps track of all records in order and appends new ones in real time. These partitions are distributed and replicated across multiple servers, allowing for high scalability, fault-tolerance, and parallelism.

Each consumer is assigned a partition in the topic, which allows for multi-subscribers while maintaining the order of the data. By combining these messaging models, Kafka offers the benefits of both. Kafka also acts as a very scalable and fault-tolerant storage system by writing and replicating all data to disk. By default, Kafka keeps data stored on disk until it runs out of space, but the user can also set a retention limit. Kafka has four APIs :
 
Producer API : used to publish a stream of records to a Kafka topic.

Consumer API : used to subscribe to topics and process their streams of records.

Streams API : enables applications to behave as stream processors, which take in an input stream from topic(s) and transform it into an output stream that goes into different output topic(s).

Connector API : allows users to seamlessly automate the addition of another application or data system to their current Kafka topics.
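As an illustration of the Streams API, here is a minimal sketch of a stream processor that reads from one topic, transforms each value, and writes to another topic. The application id, broker address, and the topic names raw-events and upper-events are assumptions made for this example.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");   // assumed app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read from an input topic, transform each value, and write to an output topic.
        KStream<String, String> input = builder.stream("raw-events");
        input.mapValues(value -> value.toUpperCase()).to("upper-events");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}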
RabbitMQ is an open source message broker that uses a messaging queue approach. Queues are spread across a cluster of nodes and optionally replicated, with each message only being delivered to a single consumer.

The main differences between Apache Kafka and RabbitMQ are:
 
* Architecture : Kafka uses a partitioned log model, which combines messaging queue and publish-subscribe approaches. RabbitMQ uses a messaging queue.

* Scalability : Kafka provides scalability by allowing partitions to be distributed across different servers. In RabbitMQ, you scale out processing by increasing the number of competing consumers on the queue.

* Message retention : Kafka retention is policy based; for example, messages may be stored for one day, and the user can configure this retention window. RabbitMQ retention is acknowledgement based, meaning messages are deleted as they are consumed.

* Multiple consumers : In Kafka, multiple consumers can subscribe to the same topic, because Kafka allows the same message to be replayed for a given window of time. In RabbitMQ, multiple consumers cannot all receive the same message, because messages are removed as they are consumed.

* Replication : Kafka topics are automatically replicated, but the user can manually configure topics not to be replicated. RabbitMQ messages are not automatically replicated, but the user can manually configure them to be replicated.

* Message ordering : In Kafka, each consumer receives information in order thanks to the partitioned log architecture. In RabbitMQ, messages are delivered to consumers in the order of their arrival to the queue; if there are competing consumers, each consumer processes only a subset of the messages.

* Protocols : Kafka uses a binary protocol over TCP. RabbitMQ uses the Advanced Message Queuing Protocol (AMQP), with support for MQTT and STOMP via plugins.
In addition to command line tooling for management and administration tasks, Kafka has five core APIs for Java and Scala :
 
* The Admin API to manage and inspect topics, brokers, and other Kafka objects.

* The Producer API to publish (write) a stream of events to one or more Kafka topics.

* The Consumer API to subscribe to (read) one or more topics and to process the stream of events produced to them.

* The Kafka Streams API to implement stream processing applications and microservices. It provides higher-level functions to process event streams, including transformations, stateful operations like aggregations and joins, windowing, processing based on event-time, and more. Input is read from one or more topics in order to generate output to one or more topics, effectively transforming the input streams to output streams.

* The Kafka Connect API to build and run reusable data import/export connectors that consume (read) or produce (write) streams of events from and to external systems and applications so they can integrate with Kafka. For example, a connector to a relational database like PostgreSQL might capture every change to a set of tables. However, in practice, you typically don't need to implement your own connectors because the Kafka community already provides hundreds of ready-to-use connectors.
The terms leader and follower are used in the Apache Kafka environment to maintain the overall system and to ensure load balancing across the servers. Following is a list of some important features of leaders and followers in Kafka:
 
* For every partition in the Kafka environment, one server plays the role of leader, and the remaining servers act as followers.

* The leader is responsible for executing all data read and write commands for the partition, while the followers replicate its data.

* If at any time a fault occurs and the leader is unable to function properly, one of the followers takes over the role and responsibility of the leader, keeping the system stable and helping with the servers' load balancing.
Every Kafka broker hosts some partitions, each of which is either the leader or a replica (follower) of a topic partition.
 
* Every Kafka topic is divided into partitions, and each partition contains records in a fixed order.

* Each record in a partition is assigned a unique offset. A single topic can have multiple partition logs, which allows several consumers to read from the same topic at the same time.

* Partitions let a topic be parallelized by splitting its data across numerous brokers.

* In Kafka, replication is done at the partition level, and a replica is a redundant copy of a topic partition.

* Each partition can have one or more replicas, which means its messages can be duplicated across several Kafka brokers in the cluster.

* One server acts as the leader of each partition, while the others act as followers.

* If the leader goes down for any reason, one of the followers takes over as the leader.
Apache ZooKeeper is a naming registry for distributed applications as well as a distributed, open-source configuration and synchronization service. It keeps track of the Kafka cluster nodes' status, as well as Kafka topics, partitions, and so on.
 
ZooKeeper is used by Kafka brokers to maintain and coordinate the Kafka cluster. When the topology of the Kafka cluster changes, such as when brokers and topics are added or removed, ZooKeeper notifies all nodes. When a new broker enters the cluster, for example, ZooKeeper notifies the cluster, as well as when a broker fails.

ZooKeeper also allows brokers and topic partition pairs to elect leaders, allowing them to select which broker will be the leader for a given partition (and serve read and write operations from producers and consumers), as well as which brokers contain replicas of the same data. When the cluster of brokers receives a notification from ZooKeeper, they immediately begin to coordinate with one another and elect any new partition leaders that are required. This safeguards against the unexpected absence of a broker.
Kafka can now be used without ZooKeeper as of version 2.8. The release of Kafka 2.8.0 in April 2021 gave us all the opportunity to try it out without ZooKeeper. However, this version is not yet ready for production and lacks some key features.

In the previous versions, bypassing Zookeeper and connecting directly to the Kafka broker was not possible. This is because when the Zookeeper is down, it is unable to fulfill client requests.

Topic replication is very important in Kafka. It is used to construct Kafka deployments to ensure durability and high availability. When one broker fails, topic replicas on other brokers remain available to ensure that data is not lost and Kafka deployment is not disrupted in any case. The replication ensures that the messages published are not lost.
 
The replication factor specifies the number of copies of a topic kept across the Kafka cluster. Replication takes place at the partition level and is configured at the topic level. For example, a replication factor of two keeps two copies of each partition of the topic. The replication factor cannot be more than the cluster's total number of brokers.
 
ISR stands for In-Sync Replica, and it is a replica that is up to date with the partition's leader.
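As an illustration, here is a hedged sketch that uses the Admin API to create a topic with a given partition count and replication factor. The broker address and the topic name payments are assumptions; the replication factor of 2 requires at least two brokers in the cluster.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Three partitions, each kept as two copies across the cluster.
            // The replication factor (2) must not exceed the number of brokers.
            NewTopic topic = new NewTopic("payments", 3, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}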
Is it possible to use Kafka without ZooKeeper?
For versions of Kafka earlier than 2.8, it is not possible to bypass ZooKeeper and connect directly to the Kafka server: if ZooKeeper is down for some reason, no client request can be serviced.
What roles do Replicas and the ISR play?
Replicas are essentially a list of nodes that replicate the log for a particular partition irrespective of whether they play the role of the Leader. On the other hand, ISR stands for In-Sync Replicas. It is essentially a set of message replicas that are synced to the leaders.
Why are Replications critical in Kafka?
Replication ensures that published messages are not lost and can be consumed in the event of any machine error, program error or frequent software upgrades.
If a Replica stays out of the ISR for a long time, what does it signify?
It means that the follower is unable to fetch data as fast as the leader accumulates it.
Since Kafka uses ZooKeeper, it is essential to initialize the ZooKeeper server, and then fire up the Kafka server.

* First, download the most recent version of Kafka and extract it.
* Ensure that the local environment has Java 8+ installed to run Kafka.

Use the following commands to start the Kafka server and ensure that all services are started in the correct order:
 
* Start the ZooKeeper service by doing the following:
$ bin/zookeeper-server-start.sh config/zookeeper.properties
* To start the Kafka broker service, open a new terminal and type the following command :
$ bin/kafka-server-start.sh config/server.properties  
The main differences between Apache Kafka and Apache Flume are:

* Apache Kafka is a distributed data store and event streaming system, while Apache Flume is a distributed, available, and reliable system for collecting log data.

* Apache Kafka is optimized for ingesting and processing streaming data in real time, whereas Apache Flume efficiently collects, aggregates, and moves large amounts of log data from many different sources to a centralized data store.

* Apache Kafka is easy to scale; Apache Flume is not as scalable as Kafka.

* Kafka works as a pull model, while Flume works as a push model.

* Kafka is a highly available, fault-tolerant, efficient, and scalable messaging system that also supports automatic recovery. Flume is specially designed for Hadoop, and in case of a flume-agent failure it is possible to lose events in the channel.

* Apache Kafka runs as a cluster and easily handles high-volume incoming data streams in real time, while Apache Flume is a tool for collecting log data from distributed web servers.

* Apache Kafka treats each topic partition as an ordered set of messages, whereas Apache Flume takes in streaming data from multiple sources for storage and analysis in Hadoop.
What is the maximum size of a message that Kafka can receive?
By default, the maximum size of a Kafka message is 1 MB (megabyte). This limit can be changed in the broker settings. Kafka is, however, optimized for handling small messages of around 1 KB.
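The settings involved are message.max.bytes on the broker (or max.message.bytes per topic) and max.request.size on the producer. The sketch below shows an illustrative producer configuration; the broker address and the 5 MB value are assumptions for the example.

import java.util.Properties;

public class LargeMessageProducerConfig {
    public static void main(String[] args) {
        // Illustrative producer settings; these properties would be passed to a KafkaProducer.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Raise the largest request the producer will send (the default is roughly 1 MB).
        // The broker must allow it too, via message.max.bytes (or max.message.bytes on the
        // topic); otherwise the record is rejected.
        props.put("max.request.size", String.valueOf(5 * 1024 * 1024)); // 5 MB, illustrative
    }
}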
Geo-Replication is a Kafka feature that allows messages in one cluster to be copied across many data centers or cloud regions. Geo-replication entails replicating all of the files and storing them throughout the globe if necessary. Geo-replication can be accomplished with Kafka's MirrorMaker Tool. Geo-replication is a technique for ensuring data backup.
Following are the disadvantages of Kafka :
 
* Kafka's performance degrades if messages need to be modified ("tweaked") after they are produced. Kafka works well when messages do not need to be updated.

* Wildcard topic selection is not supported by Kafka. It is necessary to match the exact topic name.

* Brokers and consumers reduce Kafka's performance when dealing with huge messages by compressing and decompressing the messages. This has an impact on Kafka's throughput and performance.

* Certain message paradigms, including point-to-point queues and request/reply, are not supported by Kafka.

* Kafka does not have a complete set of monitoring tools.
Following are some of the real-world usages of Apache Kafka :
 
As a Message Broker : Due to its high throughput value, Kafka is capable of managing a huge amount of comparable types of messages or data. Kafka can be used as a publish-subscribe messaging system that allows data to be read and published in a convenient manner.
 
To Monitor operational data : Kafka can be used to keep track of metrics related to certain technologies, such as security logs.
 
Website activity tracking : Kafka can be used to check that data is transferred and received successfully by websites. Kafka can handle the massive amounts of data created by websites for each page and for the activities of users.
 
Data logging : Kafka's data replication between nodes functionality can be used to restore data on nodes that have failed. Kafka may also be used to collect data from a variety of logs and make it available to consumers.
 
Stream Processing with Kafka : Kafka may be used to handle streaming data, which is data that is read from one topic, processed, and then written to another. Users and applications will have access to a new topic containing the processed data.
To get exactly-once messaging from Kafka, two things have to be taken care of: avoiding duplicates during data production and avoiding duplicates during data consumption.
 
Following are two ways to get exactly-once semantics during data production:
 
* Use a single writer per partition. Whenever you get a network error, check the last message in that partition to see if your last write succeeded.

* Include a primary key (a UUID or something similar) in the message and de-duplicate on the consumer, as sketched below.
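A minimal sketch of the second approach, with illustrative names: the producer attaches a UUID as the record key, and the consumer remembers the ids it has already processed. In a real system the set of seen ids would have to be persisted rather than held in memory.

import java.util.HashSet;
import java.util.Set;
import java.util.UUID;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DedupeExample {
    // Producer side: attach a unique id as the record key so retried writes can be detected.
    static void sendOnce(Producer<String, String> producer, String topic, String payload) {
        String messageId = UUID.randomUUID().toString();
        producer.send(new ProducerRecord<>(topic, messageId, payload));
    }

    // Consumer side: remember the ids already processed and skip duplicates.
    private final Set<String> seenIds = new HashSet<>();

    void process(ConsumerRecord<String, String> record) {
        if (!seenIds.add(record.key())) {
            return; // duplicate delivery, already handled
        }
        System.out.println("processing " + record.value());
    }
}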
Log anatomy is a way of viewing a partition. We view the log as the partition: a data source writes messages to the log, and one or more consumers can read that data from the log at any time they want. The data source can keep appending to the log while consumers read from it at different offsets simultaneously.
Following are the use cases of Apache Kafka monitoring :
 
* Apache Kafka monitoring can keep track of system resources consumption such as memory, CPU, and disk utilization over time.

* Apache Kafka monitoring is used to monitor threads and JVM usage. Kafka relies on the Java garbage collector to free up memory, and monitoring helps ensure that it runs frequently enough to keep the Kafka cluster active.

* It can be used to determine which applications are causing excessive demand, and identifying performance bottlenecks might help rapidly solve performance issues.

* It always checks the broker, controller, and replication statistics to modify the partitions and replicas status if required.
Flume's major use case is to ingest data into Hadoop. Flume is integrated with Hadoop's monitoring system, file formats, file system, and utilities such as Morphlines. With its design of sinks, sources, and channels, Flume can help move data to other systems flexibly. However, Flume's main feature is its Hadoop integration. Flume is the best option to use when we have non-relational data sources or a long file to stream into Hadoop.
 
On the other hand, Kafka’s major use case is a distributed publish–subscribe messaging system. It is not developed specifically for Hadoop, and using Kafka to read and write data to Hadoop is considerably trickier than it is with Flume. Kafka can be used when we particularly need a highly reliable and scalable enterprise messaging system to connect multiple systems like Hadoop.
The offset is a unique identifier of a record within a partition. It denotes the position of the consumer in the partition: consumers can read messages starting from a specific offset and can read from any offset point they choose.
 
* Each record within a partition has a unique sequential id called its offset.
* Offsets are maintained per partition.
 
A topic can also have multiple partition logs, as the click-topic example shows, which allows multiple consumers to read from a topic in parallel.
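The sketch below shows how a consumer can read from a chosen offset by assigning itself a partition and seeking to that position. The broker address, the topic name click-topic, partition 0, and offset 42 are all illustrative assumptions.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SeekToOffset {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Attach directly to partition 0 of the (assumed) "click-topic" and start
            // reading from offset 42 instead of the latest committed position.
            TopicPartition partition = new TopicPartition("click-topic", 0);
            consumer.assign(Collections.singletonList(partition));
            consumer.seek(partition, 42L);

            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}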
Partitions allow a single topic to be partitioned across numerous servers from the perspective of the Kafka broker. This allows you to store more data in a single topic than a single server can. If you have three brokers and need to store 10TB of data in a topic, one option is to construct a topic with only one partition and store all 10TB on one broker. Another alternative is to build a three-partitioned topic and distribute 10 TB of data among all brokers. A partition is a unit of parallelism from the consumer's perspective.
The main differences between Redis and Kafka are:

* Redis supports push-based message delivery: messages published to Redis are distributed to consumers automatically. Kafka supports pull-based message delivery: messages published to the Kafka broker are not automatically sent to the consumers; instead, consumers pull the messages when they are ready.

* Redis does not support message retention; messages are destroyed once they have been delivered to the recipients. Kafka retains messages in its log.

* Redis does not support parallel processing. In Kafka, multiple consumers in a consumer group can consume partitions of a topic concurrently thanks to Kafka's partitioning feature.

* Redis cannot manage vast amounts of data, because it is an in-memory database. Kafka can handle massive amounts of data, since it uses disk space as its primary storage.

* Because Redis is an in-memory store, it is much faster than Kafka. Because Kafka stores data on disk, it is slower than Redis.
* Encryption : All communications sent between the Kafka broker and its many clients are encrypted. This prevents data from being intercepted by other clients. All messages are shared in an encrypted format between the components.

* Authentication : Before being able to connect to Kafka, apps that use the Kafka broker must be authenticated. Only approved applications will be able to send or receive messages. To identify themselves, authorized applications will have unique ids and passwords.

* Authorization : After authentication, authorization is carried out. Once a client has been validated, it is allowed to publish or consume messages. Authorization also ensures that write access can be restricted to specific applications to prevent data contamination.
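For illustration, the sketch below shows client-side properties that enable encrypted and authenticated connections. The hostnames, file paths, and credentials are placeholders, and the exact mechanism (SSL, SASL/PLAIN, SASL/SCRAM, and so on) depends on how the cluster is configured.

import java.util.Properties;

public class SecureClientConfig {
    public static void main(String[] args) {
        // Illustrative client-side security settings; values are placeholders.
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker.example.com:9093");
        // Encrypt traffic and authenticate the client over SASL.
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "PLAIN");
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.plain.PlainLoginModule required "
                        + "username=\"app-user\" password=\"app-secret\";");
        // Trust store used to verify the broker's TLS certificate.
        props.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks");
        props.put("ssl.truststore.password", "changeit");
    }
}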
As we know, messages are retained for a considerable amount of time in Kafka, and consumers have the flexibility to read them at their convenience.
 
However, if Kafka is configured to keep messages for only 24 hours and a consumer happens to be down for longer than 24 hours, the consumer will lose those messages.
 
We can still read the missed messages from the last known offset, but only on the condition that the consumer's downtime is shorter than the retention window. Moreover, Kafka does not keep state about what consumers are reading from a topic.
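Related to this, the consumer's auto.offset.reset setting decides where it resumes when its committed offset no longer exists (for example, because retention deleted the messages while it was down). A small illustrative sketch, with an assumed broker address and group id:

import java.util.Properties;

public class OffsetResetConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "reporting-service");       // assumed group id
        // If the last committed offset no longer exists (e.g. the messages were deleted
        // by the retention policy while the consumer was down), start from the oldest
        // data still available ("earliest") or only from new data ("latest").
        props.put("auto.offset.reset", "earliest");
    }
}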
Some of the real-time applications that use Kafka are:
 
* Netflix

* Mozilla

* Oracle
* Brokers are the systems responsible for maintaining the published data.

* Each broker may have one or more partitions.

* Kafka contains multiple brokers to maintain load balancing.

* Kafka brokers are stateless.

* Example: if there are N partitions in a topic and N brokers, then each broker has 1 partition.
Dumb broker/smart consumer means that the broker does not attempt to track which messages have been read by each consumer so that it can retain only the unread ones; rather, the broker retains all messages for a set amount of time, and consumers are responsible for tracking which messages they have read.
 
Apache Kafka employs this model: the broker simply stores messages for a configurable period (7 days by default), while consumers are responsible for keeping track of which messages they have read using offsets.
 
The opposite of this is the Smart Broker/Dumb Consumer model wherein the broker is focused on the consistent delivery of messages to consumers. In such a case, consumers are dumb and consume at a roughly similar pace as the broker keeps track of consumer state. This model is followed by RabbitMQ.
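One way the consumer tracks its own position is by committing offsets explicitly instead of relying on automatic commits. A minimal sketch, with an assumed broker address, group id, and topic name:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "billing-service");         // assumed group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // The broker does not track what was read; the consumer does, here by
        // committing offsets explicitly instead of automatically.
        props.put("enable.auto.commit", "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("invoices")); // assumed topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println("processed offset " + record.offset());
                }
                // Record our position only after the batch has been processed.
                consumer.commitSync();
            }
        }
    }
}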
* import org.apache.kafka.clients.consumer.ConsumerRecord

* import org.apache.kafka.common.serialization.StringDeserializer

* import org.apache.spark.streaming.kafka010._

* import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

* import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
Kafka supports data replication within the cluster to ensure high availability. But enterprises often need data availability guarantees to span the entire cluster and even withstand site failures.
 
The solution to this is Mirror Maker – a utility that helps replicate data between two Kafka clusters within the same or different data centers.
 
MirrorMaker is essentially a Kafka consumer and producer hooked together. The origin and destination clusters are completely different entities and can have different numbers of partitions and offsets; however, the topic names should be the same between the source and destination clusters. The MirrorMaker process also retains and uses the partition key so that ordering is maintained within the partition.
Typical microservice deployments have many microservices collaborating, and that becomes a huge problem if not handled properly. It is not practical for each service to have a direct connection to every service it needs to talk to, for two reasons: first, the number of such connections would grow quickly; second, the services being called may be down or may have moved to another server.
 
If you have 2 services, there are up to 2 direct connections. With 3 services, there are 6. With 4 services, there are 12, and so on. Such connections can be seen as the coupling between objects in an object-oriented program: you have to interact with other objects, but the less coupling between their classes, the more manageable your program is.
 
Message brokers are a way of decoupling the sending and receiving services through the idea of publish and subscribe. The sending service (producer) posts its message/payload on the message queue, and the receiving service (consumer), which is listening for messages, picks it up. Message brokering is one of the key use cases for Kafka.
 
Another thing message brokers do is queue or hold the message until the consumer picks it up. If the consumer service is down or busy when the sender sends the message, it can always pick it up later. As a result, the producer service does not need to worry about checking whether the message went through, retrying on failure, and so on.
 
Kafka is valuable here because it provides both pub-sub and queuing capabilities (traditionally, a broker supported only one or the other). It also ensures that the ordering of messages is maintained and not subject to network latency or other factors. Kafka additionally lets us "broadcast" messages to multiple consumers if necessary. Kafka's importance lies in enabling reliable, scalable microservices solutions with minimal configuration.
As we know, a consumer subscribes to topics in Kafka, but it is the poll loop that informs the consumer whether any new data has arrived. The poll loop is responsible for handling coordination, partition rebalances, heartbeats, and data fetching. It is the core function in the consumer API, repeatedly polling the server for new data. Let's try to understand the poll loop in Kafka:
// Assumes a configured KafkaConsumer<String, String> consumer, a Map<String, Integer>
// custCountryMap, and a logger named log have already been created elsewhere.
try {
    while (true) {
        // Poll the broker, waiting up to 100 ms for new records to arrive.
        ConsumerRecords<String, String> records = consumer.poll(100);
        for (ConsumerRecord<String, String> record : records) {
            log.debug("topic = %s, partition = %d, offset = %d, customer = %s, country = %s\n",
                record.topic(), record.partition(), record.offset(),
                record.key(), record.value());
            // Keep a running count of customers from each country.
            int updatedCount = 1;
            if (custCountryMap.containsKey(record.value())) {
                updatedCount = custCountryMap.get(record.value()) + 1;
            }
            custCountryMap.put(record.value(), updatedCount);
            JSONObject json = new JSONObject(custCountryMap);
            System.out.println(json.toString(4));
        }
    }
} finally {
    // Closing the consumer releases its network connections and triggers a rebalance.
    consumer.close();
}
* This section is an infinite loop: consumers keep polling Kafka for new data.

* consumer.poll(100) : this call is critical for the consumer, as the parameter determines the time interval (in milliseconds) the consumer waits for data to arrive from the Kafka broker. If a consumer stops polling, its assigned partitions are usually reassigned to another consumer, because it is considered no longer alive. If we pass 0 as the parameter, the call returns immediately.

* poll() returns the result set. Each individual record carries the topic and partition it belongs to, along with the offset of the record and its key and value. We then iterate through the result set and do our custom processing.

* Once processing is complete, the result is written to a data store; here, a running count of customers from each country is maintained by updating a hash table.

* Ideally the consumer calls close() before exiting. This closes the active network connections and sockets, and it also triggers a rebalance immediately rather than waiting for the consumer group coordinator to discover the missing consumer and assign its partitions to other consumers.
The main differences between the Java Messaging Service (JMS) and Kafka are:

* Delivery model : JMS uses a push model to deliver messages, so consumers receive messages regularly. Kafka uses a pull mechanism: consumers pull the messages when they are ready to receive them.

* Retention : When the JMS queue receives confirmation from the consumer that a message has been received, the message is permanently destroyed. In Kafka, messages are retained for a configured length of time even after the consumer has read them.

* Suitability : JMS is better suited to multi-node clusters in very complicated systems. Kafka is better suited to handling big amounts of data.

* Ordering : JMS is a FIFO queue that does not support any other type of ordering. Kafka guarantees that messages within a partition are delivered to consumers in the order in which they were written.
The Kafka cluster will automatically detect any broker shutdown or failure. In that case, new leaders will be elected for the partitions previously handled by that broker. This can happen as a result of a server failure, or even when the server is deliberately brought down for maintenance or configuration changes. When a server is taken down on purpose, Kafka provides a graceful method for terminating the server rather than killing it.
 
When a server is switched off :
 
* To prevent having to undertake any log recovery when Kafka is restarted, it ensures that all of its logs are synced onto a disk. Because log recovery takes time, purposeful restarts can be sped up.

* Prior to shutting down, all partitions for which the server is the leader will be moved to the replicas. The leadership transfer will be faster as a result, and the period each partition is inaccessible will be decreased to a few milliseconds.
The main differences between Kafka Streams and Spark Streaming are:

* Kafka Streams can handle only real-time streams, whereas Spark Streaming can handle real-time streams as well as batch processes.

* Kafka achieves fault tolerance through the use of partitions and their replicas, while Spark allows recovery of partitions using the cache and RDDs (resilient distributed datasets).

* Kafka does not provide any interactive mode; the broker simply consumes the data from the producer and waits for the client to read it. Spark Streaming has interactive modes.

* In Kafka, messages remain persistent in the Kafka log; in Spark, a dataframe or some other data structure has to be used to keep the data persistent.
Yes, Apache Kafka is considered to be a distributed streaming platform. A streaming platform can be called such if it has the following three capabilities:
 
* Be able to publish and subscribe to streams of data.
 
* Provide services similar to that of a message queue or scalable enterprise messaging system.
 
* Store streams of records in a durable and fault-tolerant manner.
 
Since Kafka meets all three of these requirements, it can be considered to be a streaming platform.
 
Furthermore, since a Kafka cluster consists of multiple servers that function as brokers, it is said to be distributed. Kafka topics are divided into multiple partitions to ensure load balancing. Brokers process these partitions parallelly and allow multiple producers and consumers to publish and retrieve messages in parallel.
 
Distributed streaming platforms handle large amounts of data in real-time by pushing them to multiple servers for real-time processing.
OutOfMemoryException can occur if consumers receive very large messages, or if there is a spike in the number of messages such that they arrive faster than the rate of downstream processing. The message queue then fills up, taking up memory.
In Kafka, message transfer among the producer, broker, and consumers is done by making use of a standardized binary message format. The process of converting the data into a stream of bytes for the purpose of the transmission is known as serialization.

Deserialization is the process of converting the bytes of arrays into the desired data format. Custom serializers are used at the producer end to let the producer know how to convert the message into byte arrays. Deserializers are used at the consumer end to convert the byte arrays back into the message.
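As an illustration, here is a hedged sketch of a custom serializer and deserializer pair for a hypothetical Customer type, implementing the Serializer and Deserializer interfaces provided by recent Kafka client versions. In practice these classes would be registered via the key.serializer/value.serializer and key.deserializer/value.deserializer configuration properties.

import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serializer;

// Hypothetical payload type used only for this sketch.
class Customer {
    final String name;
    Customer(String name) { this.name = name; }
}

// Producer side: turn a Customer into a byte array before it is sent to the broker.
class CustomerSerializer implements Serializer<Customer> {
    @Override
    public byte[] serialize(String topic, Customer data) {
        return data == null ? null : data.name.getBytes(StandardCharsets.UTF_8);
    }
}

// Consumer side: rebuild the Customer from the byte array read off the topic.
class CustomerDeserializer implements Deserializer<Customer> {
    @Override
    public Customer deserialize(String topic, byte[] bytes) {
        return bytes == null ? null : new Customer(new String(bytes, StandardCharsets.UTF_8));
    }
}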