A large amount of data is generated by companies having any form of web-based presence and activity. Data is one of the newer ingredients in these Internet-based systems and typically includes user-activity events corresponding to logins, page visits, clicks, social networking activities such as likes, sharing, and comments, and operational and system metrics. This data is typically handled by logging and traditional log aggregation solutions due to high throughput (millions of messages per second). These traditional solutions are the viable solutions for providing logging
data to an offline analysis system such as Hadoop. However, the solutions are very limiting for building real-time processing systems.
According to the new trends in Internet applications, activity data has become a part of production data and is used to run analytics at real time. These analytics can be:
- Search based on relevance
- Recommendations based on popularity, co-occurrence, or sentimental analysis
- Delivering advertisements to the masses
- Internet application security from spam or unauthorized data scraping
Real-time usage of these multiple sets of data collected from production systems has become a challenge because of the volume of data collected and processed. Apache Kafka aims to unify offline and online processing by providing a mechanism for parallel load in Hadoop systems as well as the ability to partition real-time consumption over a cluster of machines. Kafka can be compared with Scribe or Flume as it is useful for processing activity stream data; but from the architecture perspective, it is closer to traditional messaging systems such as ActiveMQ or RabitMQ.