Our software engineer Oleksandra Klevets has developed an integration of Cassandra and CDH that can be deployed and managed through Cloudera Manager. In this post, she explains the benefits of deploying Cassandra alongside Hadoop and walks you through the process of setting it up.
Before diving into the article, please note that Cassandra integration is currently not supported by Cloudera, so the setup described below is a proof-of-concept solution and is not recommended for production use. Yet 😏
Why Combine Cassandra and Hadoop At All?
Cassandra is a NoSQL database that provides fault-tolerant storage for vast amounts of structured and semi-structured data, has no single point of failure, and scales linearly. It is ideal for high-speed, online transactional data processing. Hadoop, in turn, is a framework that provides distributed storage for any type of data and serves as a Big Data analytics system focused on data warehousing and data lake use cases.
So why use Hadoop and Cassandra together? DataStax states that, just as with legacy relational database applications, modern web, mobile, and IoT applications typically need both a database devoted to online operations (including analytics on hot data) and a batch-oriented data warehouse environment that processes colder data for analytic purposes.
Cassandra is well suited as the database for online web and mobile applications, while Hadoop targets the processing of colder data in data lakes, warehouses, and similar stores. Using both allows an IT organization to support the different analytic “tempos” needed to satisfy customer requirements and run the business.
Putting It Together: Cassandra on Cloudera
Running several Big Data projects and frameworks side by side can make the individual components messy to manage. So it’s always a good idea to have a single interface for monitoring and managing the Big Data infrastructure.
Cloudera offers a great tool for managing Hadoop clusters: Cloudera Manager. It automates the CDH cluster installation process, provides a cluster-wide, real-time view of running nodes and services, enables configuration changes from a single control console, and delivers reporting and diagnostic tools for troubleshooting and optimization. Cloudera Manager also has a set of utilities for extending it to manage other services.
Adding a new service to Cloudera Manager is not an easy task, but this can be achieved using parcels and Custom Service Descriptors.
Let’s see how that works.
To get Cassandra running in a Cloudera environment, a Cassandra parcel and a Custom Service Descriptor (CSD) must be created and installed into Cloudera Manager. With this approach, deploying Cassandra is a similar experience to deploying other Hadoop components such as YARN, Impala, or Hive.
First, clone the following GitHub repository to your machine:
mkdir -p /github; cd /github
git clone https://github.com/elisska/cloudera-cassandra
Now, let’s install the Cassandra parcel. Cloudera Manager can retrieve new parcels via HTTP, so we will use Python’s SimpleHTTPServer to serve the DATASTAX_CASSANDRA-2.2.6-el6.parcel:
echo Parcel repo available at `hostname`:8000; python -m SimpleHTTPServer 8000
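The command above relies on Python 2, where the module is called SimpleHTTPServer. On machines that only have Python 3, the same module is available as http.server. The following is a minimal sketch, not part of the repository, that prints whichever command fits the local interpreter; `serve_cmd` is a hypothetical helper name, and port 8000 is simply the port used above.

```shell
# Sketch: print the command that serves the current directory over HTTP.
# serve_cmd is a hypothetical helper, not something shipped with the repo.
serve_cmd() {
    port="${1:-8000}"   # default to the port used in this article
    if command -v python3 >/dev/null 2>&1; then
        # Python 3 renamed SimpleHTTPServer to http.server
        printf 'python3 -m http.server %s\n' "$port"
    else
        # Fall back to the Python 2 module used in this article
        printf 'python -m SimpleHTTPServer %s\n' "$port"
    fi
}
```

To use it, change into the directory that holds the parcel and run `eval "$(serve_cmd 8000)"`. The server keeps running in the foreground until you stop it.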
Log in to the Cloudera Manager web UI and add the new parcel repository.
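The repository URL to enter in Cloudera Manager's parcel settings (the Remote Parcel Repository URLs field) is the address of the HTTP server started in the previous step. Cloudera Manager discovers parcels by reading a manifest.json file at that URL. Whether the cloned repository already ships a manifest.json next to the parcel is an assumption here. The sketch below only builds the two URLs; `parcel_repo_url` and `manifest_url` are hypothetical helper names.

```shell
# Sketch: build the URLs Cloudera Manager needs for the local parcel repo.
# Both helpers are hypothetical; host defaults to this machine, port to 8000.
parcel_repo_url() {
    host="${1:-$(hostname)}"
    port="${2:-8000}"
    printf 'http://%s:%s/\n' "$host" "$port"
}

# URL of the manifest that Cloudera Manager fetches from the repository.
manifest_url() {
    printf '%smanifest.json\n' "$(parcel_repo_url "$@")"
}
```

Running `parcel_repo_url` with no arguments prints the URL to paste into Cloudera Manager. Fetching the output of `manifest_url` with a tool such as curl from the Cloudera Manager host is a quick way to confirm the repository is reachable before adding it.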
This article explains how to deploy Cassandra under CDH using Cloudera Manager and custom parcels, and provides an overview of adding a new Cassandra node to the cluster.
Since this is a PoC version of the Cassandra and Cloudera Manager integration, there is a lot of further work to be done to enable functions such as node decommissioning, removing a node from a cluster, rolling restarts of the Cassandra cluster, and cluster ring and health checks via Cloudera Manager.
The source code for the Cloudera Manager Cassandra extension (Apache v2 licensed) is available on GitHub.