Howto: Setting Up a Multi-Node Hadoop Cluster Under Ubuntu
This tutorial explains how to set up an Apache Hadoop cluster running on several Ubuntu machines. The Hadoop framework consists of a distributed file system (HDFS) and an implementation of the MapReduce computing paradigm.
Figure 1 gives an overview of the cluster which will be set up during this tutorial. It consists of one Ubuntu machine which will be the master and n Ubuntu machines which will be the slaves. In the figure each of the machines is divided into the HDFS layer shown at the bottom and the MapReduce layer shown at the top.
The HDFS layer consists of several data nodes which store the data files. These nodes are managed by one name node; one of its management tasks is keeping track of where files are stored. In order to increase the fault tolerance of HDFS, replicas of all stored files exist on different data nodes, and a secondary name node periodically merges the name node's namespace image with its edit log (despite its name, it is a checkpointing helper, not a hot standby that automatically takes over if the name node fails). In the cluster shown in figure 1 the master runs the name node as well as the secondary name node. Data nodes are run on each slave and on the master, too.
The MapReduce layer consists of a job tracker which splits a MapReduce job into several map and reduce tasks which are executed on the different task trackers. In order to reduce network traffic, the task trackers are run on the same machines as the data nodes. The job tracker could be run on an arbitrary machine, but in the cluster shown in figure 1 it is run on the master, too.
This tutorial was tested with the following software versions:
- Ubuntu Linux 12.04.3
- Java version 1.8.0-ea
- Hadoop 1.2.1
The following required software has to be installed on the master and on all slaves.
In order to install Java, two additional packages have to be installed first:
sudo apt-get install software-properties-common
sudo apt-get install python-software-properties
Then, the repository containing a build of Oracle Java 8 has to be registered and the local list of available software packages has to be updated:
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
Finally, Java 8 can be installed:
sudo apt-get install oracle-java8-installer
First, check whether SSH is already running by executing
ssh address.of.machine
for each machine which should be involved in the cluster, where address.of.machine is the address of the master or a slave. If the login on one of the machines fails, you have to install an SSH server on that machine, e.g. with sudo apt-get install openssh-server.
When the Hadoop cluster is started up later on, the master will establish SSH connections to all slaves, and the passphrase would have to be entered several times. In order to avoid this, SSH public key authentication can be used. To set it up, a key pair consisting of a public and a private key has to be created at the master,
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
append the public key to the list of authorized keys,
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
and distribute the public key to all slaves:
ssh-copy-id -i ~/.ssh/id_rsa.pub address.of.slave
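What ssh-copy-id does under the hood is simply appending the public key to the remote ~/.ssh/authorized_keys. The mechanics can be tried out risk-free in a throwaway directory before touching ~/.ssh; the sketch below is only an illustration and does not affect your real keys:

```shell
# Generate a passphrase-less RSA key pair in a temporary directory
# instead of ~/.ssh, so nothing existing is overwritten.
tmp=$(mktemp -d)
ssh-keygen -t rsa -P '' -f "$tmp/id_rsa" -q

# This is effectively what ssh-copy-id performs on the remote machine:
cat "$tmp/id_rsa.pub" >> "$tmp/authorized_keys"

ls "$tmp"   # authorized_keys  id_rsa  id_rsa.pub
```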
Finally, the SSH setup can be tested by connecting from the master to the master itself and to all slaves:
ssh address.of.machine
If you connect to one of the machines for the first time, a prompt appears stating that the authenticity of the host can not be established and asking whether the connection should be continued. The question has to be answered with yes.
After fulfilling the prerequisites, Hadoop can be installed on all machines. First, the corresponding Hadoop release has to be downloaded from https://hadoop.apache.org/releases.html; in this tutorial the Debian package hadoop_1.2.1-1_x86_64.deb is used.
The installation is started by running:
sudo dpkg --install hadoop_1.2.1-1_x86_64.deb
The success of the installation can be checked by executing
hadoop
which should print the usage message of this command.
The configuration of Hadoop starts with setting up Hadoop's environment. To do so, some environment variables have to be set in the file
/etc/hadoop/hadoop-env.sh on all machines:
|Variable|Description|
|JAVA_HOME|Defines the path to the Java installation which should be used.|
|HADOOP_HEAPSIZE|Defines the maximum heap size of the daemons in MB.|
|HADOOP_LOG_DIR|Defines the directory where the daemons' log files are stored.|
|HADOOP_SECURE_DN_LOG_DIR|Defines the directory where the daemons' log files are stored on a secure Hadoop cluster.|
|HADOOP_PID_DIR|Defines the directory where PID files are stored.|
|HADOOP_SECURE_DN_PID_DIR|Defines the directory where PID files are stored on a secure Hadoop cluster.|
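Put together, the relevant part of /etc/hadoop/hadoop-env.sh might look as follows; the Java path and the log/PID directories are assumptions matching the Oracle Java 8 installation above and should be adapted to your machines:

```shell
# Path of the Java installation (the webupd8team installer
# places Oracle Java 8 here on Ubuntu).
export JAVA_HOME=/usr/lib/jvm/java-8-oracle

# Maximum heap size of each Hadoop daemon in MB.
export HADOOP_HEAPSIZE=1000

# Where the daemons write their log and PID files.
export HADOOP_LOG_DIR=/var/log/hadoop
export HADOOP_PID_DIR=/var/run/hadoop
```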
The address of the machine where the secondary name node should be started can be defined in the file
/etc/hadoop/masters on the master. By default it is the same machine where the name node runs. In this tutorial this value remains unchanged.
The machines of the data nodes and task trackers are defined in the file
/etc/hadoop/slaves on the master. The address of each machine is added in a separate line.
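Since data nodes also run on the master in the cluster of figure 1, the master's address appears in this file as well. A slaves file for a cluster with two slaves might look like this (the addresses are placeholders):

```
address.of.master
address.of.slave1
address.of.slave2
```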
Now you have to create three further configuration files in the folder /etc/hadoop:
- general Hadoop configurations are defined in core-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://address.of.master:8020</value>
    <description>The name of the default file system. Either the literal string "local" or a host:port for NDFS.</description>
    <final>true</final>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/data/hadoop/tmp/dfs/namesecondary</value>
    <description>Determines where on the local filesystem the DFS secondary name node should store the temporary images to merge. If this is a comma-delimited list of directories then the image is replicated in all of the directories for redundancy.</description>
  </property>
</configuration>
- HDFS specific settings are defined in hdfs-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/data/hadoop/namenode</value>
    <description>Determines where on the local filesystem the DFS name node should store the name table. If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.</description>
    <final>true</final>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/hadoop/datadir</value>
    <description>Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.</description>
    <final>true</final>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time. This number should not be larger than the number of data nodes.</description>
  </property>
</configuration>
- MapReduce specific settings are defined in mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.system.dir</name>
    <value>/mapred/mapredsystem</value>
    <final>true</final>
    <description>The directory where MapReduce stores control files.</description>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>address.of.master:9000</value>
    <final>true</final>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/data/hadoop/mapreddir</value>
    <final>true</final>
    <description>The local directory where MapReduce stores intermediate data files. May be a comma-separated list of directories on different devices in order to spread disk i/o. Directories that do not exist are ignored.</description>
  </property>
  <property>
    <name>mapred.temp.dir</name>
    <value>/data/hadoop/tmp/mapred/temp</value>
    <description>A shared directory for temporary files.</description>
  </property>
</configuration>
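A typo in one of these files prevents the daemons from starting with a rather cryptic parse error, so it can pay off to check that each file is well-formed XML before continuing. The sketch below demonstrates the idea on a minimal fragment written to /tmp; on the cluster you would point the check at the files in /etc/hadoop (python3 is used here as the parser, but any XML-aware tool such as xmllint works as well):

```shell
# Write a minimal configuration fragment to a temporary file ...
cat > /tmp/site-check.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://address.of.master:8020</value>
  </property>
</configuration>
EOF

# ... and let Python's XML parser confirm it is well-formed.
python3 -c 'import xml.dom.minidom; xml.dom.minidom.parse("/tmp/site-check.xml")' \
  && echo "well-formed"
```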
Running the Hadoop cluster
Before Hadoop can be run, the HDFS has to be formatted. To do so, the following command has to be run at the master:
hadoop namenode -format
Thereafter, the Hadoop cluster can be started by executing
start-all.sh at the master. This script first runs
start-dfs.sh which starts the HDFS in the following order:
- the name node is run on the machine where start-dfs.sh is executed
- all data nodes are run on the machines defined in /etc/hadoop/slaves
- the secondary name node is run on the machine defined in /etc/hadoop/masters
Then, start-mapred.sh is run which starts the MapReduce nodes in the following order:
- the job tracker is run on the machine where start-mapred.sh is executed
- all task trackers are run on the machines defined in /etc/hadoop/slaves
Hadoop provides three web interfaces:
- http://address.of.master:50070/ shows the information of the name node
- http://address.of.master:50030/ shows the information of the job tracker
- http://address.of.master:50060/ shows the information of all task trackers
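Whether the interfaces are reachable can be checked from any machine with curl; address.of.master is again a placeholder for your master's address, and a status of 200 indicates the respective daemon is serving:

```shell
# Print the HTTP status code returned by each of the three web interfaces.
# 200 means the daemon is up; 000 means the host/port is unreachable.
for port in 50070 50030 50060; do
  curl -s -o /dev/null -w "%{http_code}  port $port\n" \
    "http://address.of.master:$port/" || true
done
```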
The Hadoop cluster can be stopped by executing
stop-all.sh on the master, which first runs
stop-mapred.sh and then stop-dfs.sh.