Big Data/hadoop

From Wikiversity

Howto: Setting Up a Multi-Node Hadoop Cluster Under Ubuntu[edit]

This tutorial explains how to set up an Apache Hadoop cluster running on several Ubuntu machines. The Hadoop framework consists of a distributed file system (HDFS) and an implementation of the MapReduce computing paradigm.

Figure 1: Overview of the cluster

Figure 1 gives an overview of the cluster which will be set up during this tutorial. It consists of one Ubuntu machine which will be the master and n Ubuntu machines which will be the slaves. In the figure each of the machines is divided into the HDFS layer shown at the bottom and the MapReduce layer shown at the top.

The HDFS layer consists of several data nodes which store the data files. These nodes are managed by one name node; one of its management tasks is keeping track of where the files are stored. In order to increase the fault tolerance of HDFS, replicas of all stored files exist on different data nodes, and a secondary name node periodically merges the name node's edit log into a new checkpoint of the file system image (despite its name, it is not an automatic replacement for a failed name node). In the cluster shown in figure 1 the master runs the name node as well as the secondary name node. Data nodes are run on each slave and on the master, too.

The MapReduce layer consists of a job tracker which splits a MapReduce job into several map and reduce tasks which are executed on the different task trackers. In order to reduce network traffic, the task trackers run on the same machines as the data nodes. The job tracker could run on an arbitrary machine, but in the cluster shown in figure 1 it also runs on the master.

This tutorial was tested with the following software versions:

  • Ubuntu Linux 12.04.3
  • Java version 1.8.0-ea
  • Hadoop 1.2.1

Prerequisites[edit]

The following required software has to be installed on the master and on all slaves.

Java 8[edit]

In order to install Java, two further packages have to be installed first. Therefore, execute:

sudo apt-get install software-properties-common
sudo apt-get install python-software-properties

Then, the repository containing a build of Oracle Java 8 has to be registered and the local list of available software packages has to be updated:

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update

Finally, Java 8 can be installed:

sudo apt-get install oracle-java8-installer
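Whether the installation succeeded can be verified from the first line of the output of java -version. The following helper sketches that check; the function name is an assumption and the version matching is kept deliberately loose:

```shell
# check_java8: succeeds if the given `java -version` output line reports Java 8.
check_java8() {
  case "$1" in
    *'"1.8.'*) return 0 ;;
    *) return 1 ;;
  esac
}

# After the installation, run on each machine:
# check_java8 "$(java -version 2>&1 | head -n 1)" && echo "Java 8 detected"
```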

SSH Server[edit]

First check if SSH is already running by executing

ssh address.of.machine

for each machine which should be part of the cluster. Here, address.of.machine is the address of the master or of a slave. If the login on one of the machines fails, an SSH server has to be installed on that machine by following http://www.cyberciti.biz/faq/ubuntu-linux-openssh-server-installation-and-configuration/

When the Hadoop cluster is started later on, the master will establish SSH connections to all slaves, and a password would have to be entered for each of them. In order to avoid this, SSH public key authentication can be used. Therefore, a key pair consisting of a public and a private key has to be created on the master,

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

append the public key to the list of authorized keys,

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

and distribute the public key to all slaves:

ssh-copy-id -i ~/.ssh/id_rsa.pub address.of.slave

Finally, the SSH setup can be tested by connecting from the master to itself and to all slaves:

ssh address.of.machine

When connecting to one of the machines for the first time, a message appears stating that the authenticity of the host cannot be established and asking whether the connection should be continued. This question has to be answered with yes.
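With many slaves, the ssh-copy-id step can be repeated in a small loop. The sketch below reads the slave addresses from a plain text file, one address per line, and prints the commands to run; the function name and the file name slaves.txt are assumptions:

```shell
# distribute_key: print one ssh-copy-id command for every non-empty line
# of the given file (a dry run; pipe the output to sh to execute it).
distribute_key() {
  while read -r slave; do
    [ -n "$slave" ] || continue
    echo "ssh-copy-id -i ~/.ssh/id_rsa.pub $slave"
  done < "$1"
}

# Review the commands, then execute them:
# distribute_key slaves.txt
# distribute_key slaves.txt | sh
```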

Install Hadoop[edit]

After fulfilling the prerequisites, Hadoop can be installed on all machines. First, the corresponding Hadoop release has to be downloaded from https://hadoop.apache.org/releases.html. In this tutorial, this is done by executing

wget http://mirror.synyx.de/apache/hadoop/common/hadoop-1.2.1/hadoop_1.2.1-1_x86_64.deb

The installation is started by running:

sudo dpkg --install hadoop_1.2.1-1_x86_64.deb

The success of the installation can be checked by executing

hadoop

which should print the help message of this command.

Configure Hadoop[edit]

The configuration of Hadoop starts with setting up its environment. Therefore, some environment variables have to be set in the file /etc/hadoop/hadoop-env.sh on all machines:

  • JAVA_HOME = /usr/lib/jvm/java-8-oracle
    Defines the path to the Java installation which should be used.
  • HADOOP_HEAPSIZE = 2000
    Defines the maximum heap size of the daemons in MB.
  • HADOOP_LOG_DIR = /data/hadoop/log/$USER
    Defines the directory where the daemons' log files are stored.
  • HADOOP_SECURE_DN_LOG_DIR = /data/hadoop/log
    Defines the directory where the daemons' log files are stored on a secure Hadoop cluster.
  • HADOOP_PID_DIR = /data/hadoop/run
    Defines the directory where the PID files are stored.
  • HADOOP_SECURE_DN_PID_DIR = /data/hadoop/run
    Defines the directory where the PID files are stored on a secure Hadoop cluster.
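In /etc/hadoop/hadoop-env.sh these variables are set as plain shell exports. With the values listed above, the relevant lines look as follows:

```shell
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_HEAPSIZE=2000
export HADOOP_LOG_DIR=/data/hadoop/log/$USER
export HADOOP_SECURE_DN_LOG_DIR=/data/hadoop/log
export HADOOP_PID_DIR=/data/hadoop/run
export HADOOP_SECURE_DN_PID_DIR=/data/hadoop/run
```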

The address of the machine where the secondary name node should be started can be defined in the file /etc/hadoop/masters on the master. By default it is the same machine as the name node. In this tutorial this value remains unchanged.

The machines of the data nodes and task trackers are defined in the file /etc/hadoop/slaves on the master. The address of each machine is added in a separate line.
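Since in this cluster the master runs a data node too (cf. figure 1), its address is listed in /etc/hadoop/slaves together with the slaves. For a cluster with two slaves the file could look like this, where the addresses are placeholders:

```
address.of.master
address.of.slave1
address.of.slave2
```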

Now you have to create three further configuration files in the folder /etc/hadoop/:

  • general Hadoop configurations are defined in core-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://address.of.master:8020</value>
    <description>The name of the default file system. Either the
      literal string "local" or a host:port for NDFS.
    </description>
    <final>true</final>
  </property>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/data/hadoop/tmp/dfs/namesecondary</value>
    <description>Determines where on the local filesystem the DFS secondary
      name node should store the temporary images to merge.
      If this is a comma-delimited list of directories then the image is
      replicated in all of the directories for redundancy.
    </description>
  </property>

</configuration>
  • HDFS specific settings are defined in hdfs-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>
    <name>dfs.name.dir</name>
    <value>/data/hadoop/namenode</value>
    <description>Determines where on the local filesystem the DFS name node
      should store the name table.  If this is a comma-delimited list
      of directories then the name table is replicated in all of the
      directories, for redundancy. </description>
    <final>true</final>
  </property>

  <property>
    <name>dfs.data.dir</name>
    <value>/data/hadoop/datadir</value>
    <description>Determines where on the local filesystem an DFS data node
       should store its blocks.  If this is a comma-delimited
       list of directories, then data will be stored in all named
       directories, typically on different devices.
       Directories that do not exist are ignored.
    </description>
    <final>true</final>
  </property>
  
  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <description>Default block replication.
      The actual number of replications can be specified when the file is created.
      The default is used if replication is not specified in create time.
	  This number should be smaller than the number of data nodes.
    </description>
  </property>

</configuration>
  • MapReduce specific settings are defined in mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>
    <name>mapred.system.dir</name>
    <value>/mapred/mapredsystem</value>
    <final>true</final>
    <description>The directory where MapReduce stores control files.
    </description>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>address.of.master:9000</value>
    <final>true</final>
    <description>The host and port that the MapReduce job tracker runs
      at. If "local", then jobs are run in-process as a single map
      and reduce task.
  </description>
  </property>

  <property>
    <name>mapred.local.dir</name>
    <value>/data/hadoop/mapreddir</value>
    <final>true</final>
    <description>The local directory where MapReduce stores intermediate
      data files. May be a comma-separated list of
      directories on different devices in order to spread disk i/o.
      Directories that do not exist are ignored.
    </description>
  </property>

  <property>
    <name>mapred.temp.dir</name>
    <value>/data/hadoop/tmp/mapred/temp</value>
    <description>A shared directory for temporary files.
    </description>
  </property>

</configuration>
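The configuration above references several local directories under /data/hadoop which have to exist on every machine before the cluster is started. A minimal sketch follows; the helper name is an assumption, and on the real machines the commands typically need root privileges plus a chown to the user running the Hadoop daemons:

```shell
# make_hadoop_dirs: create the local directories referenced in the
# configuration files under the given base directory.
make_hadoop_dirs() {
  base="$1"
  mkdir -p "$base/tmp" "$base/namenode" "$base/datadir" \
           "$base/mapreddir" "$base/log" "$base/run"
}

# On each machine, e.g.:
# sudo sh -c 'make_hadoop_dirs /data/hadoop'
# sudo chown -R hadoopuser:hadoopgroup /data/hadoop
```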

Running the Hadoop cluster[edit]

Before Hadoop can be run, the HDFS has to be formatted. Therefore, the following command has to be run on the master:

hadoop namenode -format

Thereafter, the Hadoop cluster can be started by executing start-all.sh on the master. This script first runs start-dfs.sh which starts the HDFS components in the following order:

  1. the name node is run on the machine where start-dfs.sh is executed
  2. the data nodes are run on the machines defined in /etc/hadoop/slaves
  3. the secondary name node is run on the machine defined in /etc/hadoop/masters

Then start-mapred.sh is run which starts the MapReduce nodes in the following order:

  1. the job tracker is run on the machine where start-mapred.sh is executed
  2. the task trackers are run on the machines defined in /etc/hadoop/slaves

Hadoop provides three web interfaces:

  1. http://address.of.master:50070/ shows the information of the name node
  2. http://address.of.master:50030/ shows the information of the job tracker
  3. http://address.of.master:50060/ shows the information of all task trackers

The Hadoop cluster can be stopped by executing stop-all.sh on the master, which first runs stop-mapred.sh and then stop-dfs.sh.

Additional Sources[edit]

https://github.com/renepickhardt/metalcon/wiki/setupZookeeperHadoopHbaseTomcatSolrNutch

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/