What is Apache Kafka?
Apache Kafka is an open source, real-time publish-subscribe messaging system. It is distributed, partitioned, replicated, and based on a commit log. Its main characteristics are as follows:
• Distributed. A cluster-centric design distributes messages across the cluster members while preserving their semantics, so you can grow the cluster horizontally without downtime.
• Multiclient. Easy integration with different clients from different platforms: Java, .NET, PHP, Ruby, Python, etc.
• Persistent. You cannot afford any data loss. Kafka is designed for efficiency, so its data structures provide constant-time performance no matter the data size.
• Real time. Messages are visible to consumer threads as soon as they are produced; this is the basis of the systems known as complex event processing (CEP).
• Very high throughput. As we mentioned, all the technologies in the stack are designed to work on commodity hardware. Kafka can handle hundreds of megabytes of reads and writes per second from a large number of clients.
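The partitioned, commit-log idea behind these characteristics can be sketched in a few lines of plain Python. This is only an in-memory toy, not Kafka's implementation: the class name ToyPartitionedLog, its methods, and the use of CRC32 to pick a partition are all assumptions made for illustration. It shows that records with the same key land in the same partition, that each partition is append-only, and that a record is addressed by its offset.

```python
import zlib


class ToyPartitionedLog:
    """In-memory stand-in for a partitioned, append-only commit log (not Kafka itself)."""

    def __init__(self, num_partitions=3):
        self.num_partitions = num_partitions
        # One append-only list per partition; a list index plays the role of an offset.
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Deterministic hash (an assumption for this sketch) so that the same key
        # always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions

    def append(self, key, value):
        """Append a record and return its (partition, offset) pair."""
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1

    def read(self, partition, offset):
        """Return every record in a partition starting at the given offset."""
        return self.partitions[partition][offset:]


log = ToyPartitionedLog(num_partitions=3)
first = log.append("user-1", "login")
second = log.append("user-1", "click")
print(first, second)  # same partition, consecutive offsets
```

Appending to the end of a list and reading from a known index are both cheap regardless of how much data the partition already holds, which is the intuition behind the constant-time claim above.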
A typical Apache Kafka scenario looks like this:
On the producers’ side, you can find several types of actors, for example:
• Adapters. Generate transformation information; for example, a database listener or a file system listener.
• Logs. The log files of application servers and other systems, for example.
• Proxies. Generate web analytics information.
• Web pages. Front-end applications that generate information.
• Web services. The service layer, which generates invocation traces.
You could group the clients on the consumer side into three types:
• Offline. The information is stored for later analysis; for example, Hadoop and data warehouses.
• Near real time. The information is stored, but it is not requested at the moment it arrives; for example, Apache Cassandra and other NoSQL databases.
• Real time. The information is analyzed as it is generated; for example, an engine like Apache Spark or Apache Storm (used to run analysis over HDFS).
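The difference between these consumer types can be illustrated with a small sketch in which every consumer keeps its own read position over a shared log. The names ToyTopic and ToyConsumer are hypothetical and this is not the Kafka client API; the point is only that a real-time consumer and an offline consumer can read the same messages at different moments without interfering with each other.

```python
class ToyTopic:
    """Append-only list of messages shared by all consumers (illustrative only)."""

    def __init__(self):
        self.messages = []

    def publish(self, message):
        self.messages.append(message)


class ToyConsumer:
    """Tracks its own read position, so consumers do not affect each other."""

    def __init__(self, topic):
        self.topic = topic
        self.offset = 0

    def poll(self):
        # Return every message published since the last poll and advance the offset.
        batch = self.topic.messages[self.offset:]
        self.offset += len(batch)
        return batch


topic = ToyTopic()
realtime = ToyConsumer(topic)  # polls after every event
offline = ToyConsumer(topic)   # polls once, later, in a single batch

seen_realtime = []
for event in ["e1", "e2", "e3"]:
    topic.publish(event)
    seen_realtime.extend(realtime.poll())

seen_offline = offline.poll()
print(seen_realtime, seen_offline)
```

Because the messages stay in the log after being read, the offline consumer still receives everything even though the real-time consumer processed each event first.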
Apache Kafka Use Cases
There are several examples of Apache Kafka use cases:
• Commit logs. What happens when your system does not have a log system? In these cases, you can use Kafka. Many systems have no logs simply because, so far, it has not been possible to handle such a large data volume. Stories of application servers failing because they could not write their logs with the verbosity the business needs are more common than they seem. Kafka can also help to start and restart failed log servers.
• Log aggregation. Contrary to what people believe, much of the onsite support team's work is log analysis. Kafka not only provides a system for log management, but it can also handle heterogeneous aggregation of several logs. Kafka can physically collect the logs and abstract away cumbersome details such as file location or format. In addition, it provides low latency and supports multiple data sources with distributed consumption.
• Messaging. Systems are often heterogeneous, and instead of rewriting them, you have to translate between them. Often the manufacturer’s adapters are unaffordable for a company; in such cases, Kafka is a good option because it is open source and can handle more volume than many traditional commercial brokers.
• Stream processing. We could write an entire book on this topic. In some business cases, the process of collecting information consists of several stages. A clear example is when a broker is used not only to gather information but also to transform it. This is the real meaning and success of Enterprise Service Bus (ESB) architectures. With Kafka, the information can be collected and further enriched; this (very well paid) enrichment process is known as stream processing.
• Record user activity. Many marketing and advertising companies are interested in recording all the customer activity on a web page. This seems a luxury, but until recently, it was very difficult to keep track of the clicks that a user makes on a site. For those tasks where the data volume is huge, you can use Kafka for real-time processing and monitoring.
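The collect-then-enrich pattern described under stream processing, applied to the user-activity case, might look like the following sketch. It uses plain Python generators rather than Kafka or any streaming library, and everything in it is an assumption for illustration: the "user,page" input format, the USER_SEGMENTS lookup table, and the function names collect and enrich.

```python
from collections import Counter

# Hypothetical lookup table used to enrich raw click events.
USER_SEGMENTS = {"u1": "premium", "u2": "free"}


def collect(raw_events):
    """Stage 1: parse raw 'user,page' strings into dictionaries."""
    for line in raw_events:
        user, page = line.split(",")
        yield {"user": user, "page": page}


def enrich(events, segments):
    """Stage 2: attach the user's segment to every event as it flows through."""
    for event in events:
        yield {**event, "segment": segments.get(event["user"], "unknown")}


raw = ["u1,/home", "u2,/home", "u1,/cart", "u3,/home"]
enriched = list(enrich(collect(raw), USER_SEGMENTS))

# A simple monitoring view: number of clicks per user segment.
clicks_per_segment = Counter(e["segment"] for e in enriched)
print(clicks_per_segment)
```

Each stage consumes the output of the previous one event by event, so a further stage (filtering, aggregation, alerting) could be chained in the same way without changing the earlier ones.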