Installing Hadoop Cluster with Cloudera Manager
Deploying, configuring, and running a Hadoop cluster manually is time-consuming and costly. Here's a helping hand for creating a fully distributed Hadoop cluster with Cloudera Manager.
This article shows how fast and easy it can be to install a Hadoop cluster with Cloudera Manager. There are three major steps to follow:
- Prepare hosts
- Install Cloudera Manager
- Install Cluster
The first two steps can be carried out either manually or with Vagrant. The manual approach shows in detail how to prepare the hosts and install Cloudera Manager. The Vagrant approach uses a script that automates host preparation and Cloudera Manager installation, so you can start experimenting with Cloudera Manager right away. This article covers both approaches.
Deploying, configuring, and running a Hadoop cluster manually is time-consuming and costly, and may even cost a company its market position if its business operations depend directly on the speed of its technical operations. That's why specialized tools are preferable.
Luckily, that's where software comes into play. To make things simpler, here's a helping hand for creating a fully distributed Hadoop cluster with Cloudera Manager, so you can get down to real-life practice. Note, though, that if you only need a dev environment, you can simply download a pre-built virtual machine with a pseudo-distributed cluster inside.
This step-by-step guide covers the main highlights of the process (the full Cloudera Manager documentation, including security, cluster optimization, and fine-tuning, is available on the Cloudera website). Let's get this Hadoop cluster installed.
Before we start, here's a short glossary that might come in handy:
- CDH – Cloudera Distribution including Hadoop. CDH includes Hadoop and other applications that are commonly used alongside it, e.g. Flume, HBase, Hive, Impala, Kafka, Pig, Spark, Sqoop, etc.
- Cloudera Manager – a tool for Apache Hadoop administration, covering operations such as installation, upgrading, host commission/decommission, and monitoring
- Vagrant – a tool for building complete development environments
Used software versions include:
CDH - 5
Cloudera Manager - 5.7
Vagrant - 1.8.1
OS - CentOS 6.7
VirtualBox - 4.3.28
To install a Hadoop cluster:
- Prepare servers
- Install Cloudera Manager
- Install Cloudera Manager Agents and CDH
- Install Hadoop cluster
If you are going to practice on your own workstation (not on real servers), it's recommended that the first two steps are automated with Vagrant. Note: make sure you have at least 16 GB of RAM.
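As a quick way to check the RAM requirement on a Linux workstation, here is a minimal sketch. The helper name mem_total_gib is hypothetical, and the optional file argument exists only so that the function can be tried on sample data; by default it reads /proc/meminfo.

```shell
#!/bin/sh
# Print total memory in whole GiB, read from a meminfo-style file.
# mem_total_gib is a hypothetical helper name, not part of any tool.
# With no argument it reads /proc/meminfo (Linux only).
mem_total_gib() {
    # MemTotal is reported in kB; divide twice by 1024 to get GiB.
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}
```

Calling mem_total_gib without arguments prints the total for the current machine. The kernel reports slightly less than the installed amount, so a machine with 16 GB of RAM will typically print 15.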
To prepare cluster servers and to install Cloudera Manager, do the following:
Vagrant may ask for the current user's password in order to modify your hosts file and make the virtual machines reachable by their hostnames. If Vagrant has provisioned and started the cluster successfully, skip steps 1 and 2 (preparing servers and installing Cloudera Manager).
1. Prepare servers
Skip this step if Vagrant was used and go to step 3.
For this minimal cluster, 4 servers are needed (minimum requirements for a non-production cluster):
- 1 x 8 GB RAM (Cloudera Manager + most important Hadoop Services)
- 3 x 1.2 GB RAM (Data Nodes)
The Cloudera Manager documentation lists the following preparation steps for each node of the soon-to-be cluster:
- Disable SELinux
- Set up NTP
- Disable the firewall
- Define host names
These steps are automated in the Vagrant approach described above. If you prepare your servers manually, use the instructions below. Note: all the instructions were tested on CentOS 6.7. For another OS, consult the appropriate user manual.
1.1. Disable SELinux
Run the following in the command line:
$> sudo sed -i 's/^\(SELINUX\s*=\s*\).*$/\1disabled/' /etc/selinux/config
This command edits the SELinux configuration file so that SELinux is disabled. The change takes effect after the next reboot.
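To see what the substitution does before touching the real configuration, the same sed expression can be tried on a sample file. This is a minimal sketch: the helper name disable_selinux_in and the sample file contents are assumptions made for illustration, and GNU sed is required (as on CentOS).

```shell
#!/bin/sh
# Apply the SELinux edit to an arbitrary file so it can be tried on a copy first.
# disable_selinux_in is a hypothetical helper name; needs GNU sed (-i and \s).
disable_selinux_in() {
    sed -i 's/^\(SELINUX\s*=\s*\).*$/\1disabled/' "$1"
}

# Try it on a sample config in a temporary file (contents are illustrative).
sample=$(mktemp)
printf '# sample config\nSELINUX=enforcing\nSELINUXTYPE=targeted\n' > "$sample"
disable_selinux_in "$sample"
cat "$sample"
rm -f "$sample"
```

Only the SELINUX= line is rewritten. The SELINUXTYPE= line is left alone because the pattern requires the equals sign (optionally preceded by whitespace) right after the word SELINUX.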
1.2. Set up NTP
The NTP service keeps the system clock on each server synchronized with global time and with the other servers. To set it up, run:
$> sudo yum -y install ntp
$> sudo chkconfig ntpd on
$> sudo service ntpd start
$> sudo hwclock --systohc
1.3. Disable Firewall
$> sudo chkconfig iptables off
$> sudo service iptables stop
1.4. Define host names
1.4.1. Edit /etc/hosts
The /etc/hosts file should have the following inside it:
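A minimal sketch of how entries of this kind could be generated for the four hosts. The addresses are placeholders: 192.0.2.0/24 is a range reserved for documentation, and the numbering scheme (.11 to .14) is an arbitrary assumption, so substitute the real addresses of your servers. The helper name hosts_entries is hypothetical.

```shell
#!/bin/sh
# Print /etc/hosts-style lines for cloudera-1 .. cloudera-N.
# hosts_entries is a hypothetical helper. $1 is the first three octets of the
# network, $2 the number of hosts; hosts get .11, .12, ... (arbitrary scheme).
hosts_entries() {
    prefix=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        printf '%s.%d\tcloudera-%d\n' "$prefix" "$((10 + i))" "$i"
        i=$((i + 1))
    done
}

# Placeholder network from the documentation range; replace with your own.
hosts_entries 192.0.2 4
```

On each server the output could then be appended with `hosts_entries 192.0.2 4 | sudo tee -a /etc/hosts`, once the placeholder network has been replaced.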
1.4.2. Define system hostname
Run the following command on each host with a corresponding name (cloudera-1, cloudera-2, cloudera-3, cloudera-4):
$> sudo hostname cloudera-1
1.4.3. Edit /etc/sysconfig/network
The /etc/sysconfig/network file should look as follows:
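On CentOS 6 this file holds simple KEY=value lines, and the hostname is stored in the HOSTNAME key. The following sketch sets that key in a given file, replacing the line if it exists and appending it otherwise. The helper name set_sysconfig_hostname and the sample contents are assumptions for illustration; GNU sed is required.

```shell
#!/bin/sh
# Set HOSTNAME=<name> in a sysconfig-style file.
# set_sysconfig_hostname is a hypothetical helper name; needs GNU sed for -i.
set_sysconfig_hostname() {
    file=$1
    name=$2
    if grep -q '^HOSTNAME=' "$file"; then
        # Key already present: rewrite its value in place.
        sed -i "s/^HOSTNAME=.*/HOSTNAME=$name/" "$file"
    else
        # Key missing: append it.
        printf 'HOSTNAME=%s\n' "$name" >> "$file"
    fi
}

# Try it on a sample file (contents are illustrative).
sample=$(mktemp)
printf 'NETWORKING=yes\nHOSTNAME=localhost.localdomain\n' > "$sample"
set_sysconfig_hostname "$sample" cloudera-1
cat "$sample"
rm -f "$sample"
```

On a real server the call would be `sudo sh -c` around the function or an equivalent manual edit of /etc/sysconfig/network, using the name that matches the host (cloudera-1 to cloudera-4).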
1.4.4. Restart the network service
Restart the network service on each server to apply the changes:
$> sudo service network restart
2. Install Cloudera Manager
Skip this step if Vagrant was used and go to step 3, since Vagrant automates this step as well. If you install Cloudera Manager manually, follow the instructions below.
The installation consists of the following steps:
- Install the MySQL database
- Install and run the Cloudera Manager server
MySQL and Cloudera Manager will be installed on the cloudera-1 server (8 GB of RAM).
2.1. Install MySQL database
Install the MySQL server:
$> yum -y install mysql-server mysql
$> chkconfig mysqld on
Edit the [mysqld] section of the MySQL configuration file /etc/my.cnf according to Cloudera's recommendations:
transaction-isolation = READ-COMMITTED
# Disabling symbolic-links is recommended to prevent assorted security risks;
# to do so, uncomment this line:
symbolic-links = 0
key_buffer = 16M
key_buffer_size = 32M
max_allowed_packet = 32M
thread_stack = 256K
thread_cache_size = 64
query_cache_limit = 8M
query_cache_size = 64M
query_cache_type = 1
max_connections = 550
#log_bin should be on a disk with enough free space. Replace '/var/lib/mysql/mysql_binary_log' with an appropriate path for your system
#and chown the specified folder to the mysql user.
# For MySQL version 5.1.8 or later. Comment out binlog_format for older versions.
binlog_format = mixed
read_buffer_size = 2M
read_rnd_buffer_size = 16M
sort_buffer_size = 8M
join_buffer_size = 8M
# InnoDB settings
innodb_file_per_table = 1
innodb_flush_log_at_trx_commit = 2
innodb_log_buffer_size = 64M
innodb_buffer_pool_size = 4G
innodb_thread_concurrency = 8
innodb_flush_method = O_DIRECT
innodb_log_file_size = 512M
Start the MySQL server:
$> service mysqld start
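Create a script that secures the MySQL installation by answering the mysql_secure_installation prompts automatically. The answers sent below are assumptions (keep the empty root password that the later commands rely on, and accept every hardening option); adjust them to your needs:
yum -y install expect
SECURE_MYSQL=$(expect -c "
set timeout 10
spawn mysql_secure_installation
expect \"Enter current password for root (enter for none):\"
send \"\r\"
expect \"Change the root password?\"
send \"n\r\"
expect \"Remove anonymous users?\"
send \"y\r\"
expect \"Disallow root login remotely?\"
send \"y\r\"
expect \"Remove test database and access to it?\"
send \"y\r\"
expect \"Reload privilege tables now?\"
send \"y\r\"
expect eof
")
Save this script as mysqlsec.sh, then make it executable and run it:
$> chmod a+x mysqlsec.sh
$> ./mysqlsec.sh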
Create the Cloudera Manager database and user:
$> mysql -u root -e "create database scm" mysql
$> mysql -u root -e "grant all on *.* to 'scm'@'%' identified by 'scm' with grant option;" mysql
2.2. Install Cloudera Manager
Set up the Cloudera repository, then install Java, the Cloudera Manager server, and the MySQL connector:
$> wget -O /etc/yum.repos.d/cloudera-manager.repo http://archive.cloudera.com/cm5/redhat/6/x86_64/cm/cloudera-manager.repo
$> yum -y update
$> yum -y install oracle-j2sdk1.7 cloudera-manager-server cloudera-manager-daemons
$> yum -y install mysql-connector-java
Prepare Cloudera Manager Database:
$> /usr/share/cmf/schema/scm_prepare_database.sh mysql -h localhost scm scm scm
Start Cloudera Manager server:
$> service cloudera-scm-server start
Wait a bit while Cloudera Manager is starting its services and web server.
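Instead of guessing how long to wait, the web port can be polled until it accepts connections. This is a minimal sketch under a few assumptions: the helper name wait_for_port is hypothetical, the probe relies on bash's /dev/tcp pseudo-device (so bash must be installed), and the default of 60 attempts with a 5-second pause is an arbitrary choice.

```shell
#!/bin/sh
# Poll a TCP port until it accepts connections or the attempts run out.
# wait_for_port is a hypothetical helper name.
# Usage: wait_for_port HOST PORT [ATTEMPTS] [DELAY_SECONDS]
wait_for_port() {
    wfp_host=$1
    wfp_port=$2
    wfp_attempts=${3:-60}   # how many times to try
    wfp_delay=${4:-5}       # seconds to sleep between tries
    wfp_i=1
    while [ "$wfp_i" -le "$wfp_attempts" ]; do
        # bash exits non-zero if the connection cannot be opened;
        # the descriptor is closed again when that bash process exits.
        if bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$wfp_host" "$wfp_port" 2>/dev/null; then
            return 0
        fi
        sleep "$wfp_delay"
        wfp_i=$((wfp_i + 1))
    done
    return 1
}
```

For example, `wait_for_port cloudera-1 7180 && echo "port 7180 is open"` returns once the web server accepts connections on the port used in the next step. An open port does not guarantee that every Cloudera Manager service has finished starting, so the login page may still need a moment.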
3. Install Cloudera Manager Agents and CDH
Go to http://cloudera-1:7180
This is the Cloudera Manager login page. Use admin/admin as the login/password.
Then read and accept the license agreement and choose "Cloudera Enterprise Data Hub Edition Trial" on the next page.
After that you'll be prompted to set up a new cluster. Use the following pattern to search for all the nodes of the new cluster:
That will find all 4 servers prepared earlier.
Select all of them and press the "Continue" button. Accept the default CDH repository settings and press "Continue". Accept the Java installation:
Do not choose the single user mode. Just press "Continue".
Provide Cloudera Manager with the root password for all the servers. If you used the Vagrant approach to prepare the servers, the password is "vagrant". If you prepared your servers manually, use the password you created:
Then just wait for Cloudera Manager to finish installing the agents and CDH.
Press "Continue" and wait for distribution and activation.
Press "Continue" and wait for Cluster Inspector to finish the inspection.
4. Install Hadoop cluster
On the next page choose the Core Hadoop installation.
Then you can choose how the cluster roles are distributed across the cluster. Accept the default options.
Then you have to define the SQL server for the services. If you used Vagrant to prepare the servers, use the following parameters:
Host Name: cloudera-1
Database Type: MySQL
Database Name: scm
If you prepared the SQL server manually, use your own parameters.
On the review page, accept the default settings and press "Continue".
Wait for Cloudera Manager to set up the cluster roles.
The cluster role deployment diagram looks as follows:
When the cluster is installed, you can see it in Cloudera Manager (http://cloudera-1:7180/cmf/home):
Now you can monitor the cluster state, add and remove services in this cluster, change configurations, identify problems in the cluster, and so on. The yellow signs shown near the services are warnings that can be ignored for now but should be analyzed and fixed if you are going to bring the cluster into production.
Cloudera Manager makes creating and maintaining Hadoop clusters significantly easier than managing them manually. With these instructions, it is possible to create a Hadoop cluster in less than an hour, whereas manual configuration and deployment could take a few hours or even days. It's also very convenient to have all the tools you need in one place. For more Cloudera Manager tricks, visit cloudera.com.