Each day, large amounts of data and metrics are collected from real time activity streams, performance metrics, application logs, web activity tracking and much more. Modern scalable applications need a messaging bus that can collect this massive continuous stream of data without sacrificing good performance and scalability. Apache Kafka is built ground up to solve the problem of a distributed, scalable, reliable message bus.
Apache Kafka is a distributed messaging system originally built at LinkedIn and now part of the Apache Software Foundation. In the words of the authors: “Apache Kafka is publish-subscribe messaging rethought as a distributed commit log”.
While there are many mainstream messaging systems like RabbitMQ and ActiveMQ available, there are certain things that make Apache Kafka stand out when it comes to large scale message processing applications.
A single Kafka broker (server) can handle tens of millions of reads and writes per second from thousands of clients all day long on modest hardware. It is the preferred choice when it comes to ordered durable message delivery. Kafka beat RabbitMQ in terms of performance on large set of benchmarks. According to a research paper published at Microsoft Research, Apache Kafka published 500,000 messages per second and consumed 22,000 messages per second on a 2-node cluster with 6-disk RAID 10.
In Kafka, messages belonging to a topic are distributed among partitions. A topic has multiple partitions based on predefined parameters, for e.g all messages related to a user would go to a particular partition. The ability of a Kafka topic to be divided into partitions allows the topic to scale beyond a size that will fit on a single server. The concept of dividing a topic into multiple partitions allows Kafka to provide both ordering guarantees and load balancing over a pool of consumer processes.
Another aspect that helps Kafka to scale better is the concept of consumer groups (a collection of message subscribers/consumers). The partitions in the topic are assigned to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. This mechanism aids in parallelism of consuming the messages within a topic. The number of partitions dictates the maximum parallelism of the message consumers.
Kafka is designed very differently than other messaging systems in the sense that it is fully distributed from the ground up. Kafka is run as a cluster comprised of one or more servers each of which is called a broker. A Kafka topic is divided into partitions using the built-in partitioning. It has a modern cluster-centric design that offers strong durability and fault tolerance guarantees. The Kafka cluster comprised of multiple servers/brokers can be spread over multiple data centers or availability zones or even regions. This way the application would be up and running even in a disaster scenario of losing a data center.
Kafka partitions are replicated across a configurable number of servers thus allowing automatic failover to these replicas when a server in the cluster fails. Apache Kafka treats replication as the default behavior. This ability of the partitions to be replicated makes the Kafka messages resilient. If any server with certain partitions goes down, it can easily be replaced by other servers within the cluster with the help of the partition replicas.
Messages are persisted on disk for a specific amount of time or for a specific total size of messages in a partition. The parameters are configurable on a per topic basis. One can choose to persist the messages forever. The time to be persisted is configurable per topic.
Kafka guarantees a stronger ordering of message delivery than a traditional messaging system. Traditional messaging systems hand out messages in order, but these messages are delivered asynchronously to the consumers. This could result in the message getting delivered out of order to different consumers. Kafka guarantees ordered delivery of messages within a partition. If a system requires total order over messages then this can be achieved by having a topic with no partitions. But this comes at the cost of sacrificing parallelism of message consumers.
Apache Kafka has become a popular messaging system in a short period of time with a number of organizations like LinkedIn, Tumblr, PayPal, Cisco, Box, Airbnb, Netflix, Square, Spotify, Pinterest, Uber, Goldman Sachs, Yahoo and Twitter among others using it in production systems.
Apache Kafka is an extremely fast and scalable message bus that supports publish-subscribe semantics. Due to its distributed and scalable nature, it is on its way to becoming a heavyweight in the scalable cloud applications arena. Although originally envisioned for log processing, IoT (Internet of Things) applications will find Kafka a very important part of architecture due to the performance and distribution guarantees that it provides.
Want to learn more about how to build scalable applications? Please take a look at our Building Scalable Applications blog series: