Confluent Cloud: Kafka made easy
Apache Kafka is a streaming platform that allows one to process massive amounts of data in real-time. Kafka is distributed, fault-tolerant, and open-source. On the downside, Kafka is not easy to configure and manage. The basic production setup requires services like ZooKeeper, Kafka Broker, and Schema Registry to be set up on several servers in a fault-tolerant manner. Besides, the update process is also manual and must be done carefully to ensure brokers are compatible with clients.
Enter Confluent Cloud. Confluent Cloud allows one to get a working production Kafka cluster in a matter of minutes instead of hours. Confluent Cloud also supports Kafka Streams and KSQL and can use any one of the three major cloud service providers to span the Kafka Cluster: Google Cloud Platform, Amazon Web Services, and Azure.
In this workshop, let us check how it feels to work with Confluent Cloud. We suggest getting our hands dirty by solving an important problem — estimating the community approval rating of Brexit. By this we mean a score that would allow a quantitative estimate of how most people refer to Brexit.
We decided to analyze tweets because Twitter is an open platform where anybody can express their opinion. With the help of Twitter we attempted to collect an agnostic sample. Kafka will help us to store the Tweets and intermediate processing results reliably, while decoupling the pipeline steps from each other.
The main preparation step is to get Kafka cluster running. For this we just login into the Confluent Cloud account and ask it to create a cluster for us. The information we have to provide is the cluster name, cloud provider and the region to use. For those of us who used to configure Kafka manually this sounds too good to be true. Yes, you just say you want the cluster and you get it: fully managed service, with all the configurations and updates included. An important point to stress is that Confluent Cloud allows to choose the cloud provider for the cluster. We leverage this to co-locate our processing steps with Kafka brokers to achieve lower latency and reduce traffic costs. As a cloud provider we chose GCP, but you can opt for AWS or Azure and still work with the cluster in the same way.
The pipeline itself consists of the following steps:
- Retrieve Tweets.
- Analyze sentiments.
- Calculate Community Approval Score.
- Serve the score from an HTTP endpoint.
These steps are described in detail below.
In the pipeline steps, we use the confluent_kafka open-source package that allows writing Kafka producers and consumers in Python. The confluent_kafka package is developed by Confluent and is a part of the Confluent Platform Open Source package. We use the confluent_kafka package in all steps of the pipeline—both for simple producers/consumers, and for manually managing offsets in a topic. For effective transfer and storage of messages, we use AVRO for serialization in all communications with Kafka.
All of the pipeline steps are hosted on Google Cloud Platform.
Step 1: Retrieve Tweets
In this step, we get Tweets from Twitter API and save them to a Kafka topic called tweets. Twitter also offers Streaming API that sends the Tweets as they appear, but we opted not to use this feature because:
- We also needed history of Tweets. Ignoring the history and building the analysis only on the new Tweets effectively excludes a large chunk of data from the analysis.
- Streaming API works in the user context only. It requires the user to authorize the application to work on their behalf. Our service does not interact with Twitter users.
We have developed a service in Python called TweetsLoader that uses Twitter API to first load all the history available, and then continuously pull new Tweets. This service ensures that we use older Tweets without the rest of the system knowing the difference between the old and the new Tweets. Unlike Streaming API, Twitter API is restricted in the number of calls it can make, so special care has been taken to minimize the latency in the retrieval process and to avoid being blocked by Twitter at the same time.
Step 2: Sentiment Analysis
This step of the pipeline can get arbitrarily complex. We needed to get the app running as soon as possible, so we decided to go with an off-the-shelf solution. The good thing about pipelines built with Kafka is that the steps are decoupled, so we can easily replace this step with a more sophisticated algorithm when we need to.
We decided to use TextBlob Python package for sentiment analysis, as it requires minimal configuration and provides decent results out of the box. TextBlob package uses Python’s NLTK package—a de-facto standard for NLP tasks in Python.
This step pulls Tweets on the tweets Kafka topic and applies the analysis. As a result of the analysis, TextBlob produces a real-valued number for the text of each tweet, or score. The score is in a range [-1.0, 1.0], where -1.0 means strongly negative text, 1.0 - strongly positive, and 0.0 means neutral text. We send the score to a Kafka topic called sentiments.
Step 3: CAR Score Calculation
The CAR score is simply an average of the scores of individual tweets. This step reads Tweets’ scores from sentiments Kafka topic, calculates the average of scores, the number of tweets all tweets—positive, negative, and neutral.
This step produces results in a Kafka topic called car_score. The result is produced when all the messages from sentiments topic have been read, up to 100 messages. We need to produce the result every 100 messages to get the UI updated with minimal latency. Otherwise, the UI would get updated only when all of the messages are read.
The car_score topic is intentionally configured to have only one partition, this is how we guarantee ordering in this topic. We need this because we are only interested in the latest message on this topic, the most up-to-date CAR score. Kafka guarantees the order of messages only within one partition, and we rely on that when the next step of the pipeline reads the last message in this topic from that single partition.
Step 4: Serve The Score
This step is effectively a Python Flask application that, upon request, reads the latest message from the car_score Kafka topic and serves that data back to the client.
Our team is comprised of Big Data engineers, not UI developers—we rely SoftServe’s our Design team for that—so we created a simple, front-end app that queries our Flask back-end and displays the results to the user. The results on UI are constantly updated to reflect the new tweets appeared. We have hosted the front-end on Google Cloud Storage as this option required the minimal setup.
You can access the UI and see the CAR score live at http://storage.googleapis.com/community-approval-rating/index.html. Please make sure to use HTTP requests, not HTTPS. The code of the service is available at https://github.com/olokshyn/CAR under MIT license.
In this workshop, we have used Confluent Cloud to quickly create a pipeline that consists of three Kafka topics, three processing steps written in Python using confluent_kafka package, and one Flask backend app. We used publicly available Tweets as a data source and stored them reliably in Kafka. We saw how Confluent Cloud simplifies the configuration part of the Kafka allowing us to focus on application development.
This case study shows that one can get meaningful results from big, unstructured data within a day given the right tools. With Apache Kafka, Confluent Cloud, and GCP the amount of code needed is minimal and the insights can be achieved expeditiously.