Using Kafka for real-time data processing (why I like it)

The beginning

I remember that in one of my university lectures, our Software Engineering teacher was explaining to us how important it is to integrate data. It turns out that 90% of the work we do in software companies is based on integrating data, either from internal sources (e.g. migrations) or from third parties (integrations). At that point we were learning about ETL processes, but that idea stayed in my mind for a while.

Some time ago, at my company, we had to integrate and process data with a fairly irregular flow. Sometimes we could have a lot of data coming through; at other times nothing would need to be processed for several minutes. In my experience, working with data flows is normally based on a messaging system where you have one or more producers on one side and one or more consumers on the other end.

Data flow app architecture

The following picture illustrates how a data-flow-driven app is normally designed (at least in what I have seen and experienced):

[Image: Kafka messaging queue. Example of a message queue architecture]

When working on this kind of solution, you normally keep several things in mind:

  • The producers have to write into a messaging queue. In order to write data, there has to be some kind of discipline: you either synchronize the writes in the producer code, or the queue itself provides synchronized access for writing. Since several processes might be writing and this logic can get more and more complex, it is always good to put the synchronized access in the data structure (or the system in charge of it), as it is the most stable part of the equation. This normally comes as standard with any popular messaging system. Some popular examples are HornetQ, RabbitMQ, ZeroMQ…
  • The consumers have to read the information from the queue and consume it, removing elements so that new ones can come in. Typical messaging queues provide this for you.
  • Memory management, state, statistics, monitoring… of the queue. It is a good idea to have information we can always look at in order to monitor the queue and send alerts in case of any issues.

Based on that, it seems logical to rely on a messaging system queue that lets us:

  • Store elements (ideally in memory for faster writes and retrieval, although we might want persistence)
  • Handle basic queue cases, like the queue being full or empty.
  • Offer blocking and non-blocking operations depending on our intentions: do we want to control the flow of pushing and consuming elements? (See the sketch after this list.)
  • Provide metrics and stats about the queue, so they can easily be sent to the monitoring system
  • Manage big data flows, with the possibility of scaling by just adding more resources
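
To make the blocking vs. non-blocking point more concrete, here is a minimal Java sketch using a plain in-memory ArrayBlockingQueue (not Kafka yet); the capacity and the messages are made-up example values:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueDemo {
    public static void main(String[] args) throws InterruptedException {
        // Bounded in-memory queue: a capacity of 10 is an arbitrary example value.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10);

        // Blocking operations: put() waits while the queue is full,
        // take() waits while it is empty, which controls the flow for us.
        queue.put("message-1");
        String consumed = queue.take();
        System.out.println("Consumed: " + consumed);

        // Non-blocking operations: offer() returns false when the queue is full,
        // poll() returns null when it is empty, so the caller decides what to do.
        boolean accepted = queue.offer("message-2");
        String maybe = queue.poll();
        System.out.println("Accepted: " + accepted + ", polled: " + maybe);
    }
}
```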

Also, some ‘nice to haves’ for a message queue are:

  • Replication of the queue’s state
  • The ability to scale by just adding more machines
  • Memory management and persistence options

Normally, queues like HornetQ or RabbitMQ are pretty good at dealing with these, but I have struggled in the past to scale them up. I believe it is mostly down to how they deal with memory, and also how much data we want to keep around for future analysis. Then one day I was introduced to Kafka, and I was blown away.

Kafka, the beginning of a journey

From the beginning, working with Kafka was an amazing experience. I went to their website, downloaded the project, and in less than 10 minutes I had a Kafka server with Zookeeper running on my machine, together with a consumer and a producer. Not only that, the tools provided were quite good, and in a matter of minutes I was able to see data coming through the Kafka topic.
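
To give an idea of how little code is involved, here is a minimal sketch of a producer written against the official Java kafka-clients library; the broker address and the topic name (‘test’) are assumptions matching the quickstart defaults:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class QuickstartProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumes a local broker on the quickstart's default port.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // "test" is the example topic name from the quickstart.
            producer.send(new ProducerRecord<>("test", "key-1", "hello kafka"));
        }
    }
}
```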

Documentation

The documentation is simple to understand and useful. However, I was craving more information. I think it is an amazing project and I feel there should be a good book out there. I wanted to understand the concepts and, why not, start looking into the code.

Monitoring

Yammer metrics come with Kafka. For more information, have a look here.
I am really happy with how the metrics can be monitored, although I was missing a flashy dashboard. I guess the architects were thinking more about exposing the metrics and letting developers choose how to represent the data. In my opinion it will always be more or less the same: the number of messages pushed and pulled per second. Those metrics can later be sent to an ELK instance to get a nice dashboard representing the state of the queue. I think there are some cool projects that could be done on this part, so I added to my todo list writing a GitHub project to see queue stats in Grafana or elements in ELK.
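
As a rough sketch of what pulling those numbers out programmatically can look like, the Java client exposes its metrics as a map on the producer (and consumer) instance; here they are just printed, and shipping them to ELK or Grafana is left out:

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class MetricsDump {
    // Prints the client-side metrics of a producer; in a real setup you would
    // push these values to your monitoring system (ELK, Grafana, ...) instead.
    static void dumpMetrics(Producer<String, String> producer) {
        Map<MetricName, ? extends Metric> metrics = producer.metrics();
        for (Map.Entry<MetricName, ? extends Metric> entry : metrics.entrySet()) {
            MetricName name = entry.getKey();
            // e.g. "record-send-rate" in group "producer-metrics":
            // the number of records sent per second.
            System.out.printf("%s (%s) = %s%n",
                    name.name(), name.group(), entry.getValue().metricValue());
        }
    }
}
```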

Persistence

On their website they mention it is ‘durable’:

Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.

By specifying the retention policy of a topic, we can ‘replay’ the information, which is just great.

How many times have you wondered how to reproduce an issue? Having the data there helps a lot, and it can also help to perform regression tests on new functionality.
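
As an illustration, here is a minimal sketch of what a ‘replay’ can look like with the Java consumer: subscribe to the topic, seek back to the beginning of the assigned partitions, and read everything still kept by the retention policy. The topic name, group id and broker address are made up for the example:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "replay-demo");             // made-up group id
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("test")); // example topic
            consumer.poll(Duration.ofMillis(0)); // trigger partition assignment
            // Rewind to the start of every assigned partition to replay all
            // the messages still kept by the topic's retention policy.
            consumer.seekToBeginning(consumer.assignment());

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}
```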

In summary

  • If you have not yet, give Kafka a go; you will be surprised by what it can do for you.
  • Having the metrics exposed makes it really easy to push them to an analytics dashboard. I like to use the ELK stack.
  • Being able to ‘replay’ the information on a specific topic makes new applications (like consumer logic) really easy to develop and test, as you don’t need to generate data. It is also a good way to simulate a real-time data stream.
  • You might not have extreme requirements, but I would still recommend giving it a go. You may not go back to the messaging systems you have used before.

In the following days, I will try to put together some examples with Kafka and publish some results.

Stay tuned!

 
