Apache Hadoop Setup

Hadoop 2.x is based on the YARN architecture, which introduces a ResourceManager and a per-application ApplicationMaster. The ResourceManager manages resources across the cluster, while an ApplicationMaster manages the life cycle of each job.
Installing Hadoop is quite simple: all we need to do is untar the Hadoop tarball on the cluster nodes.
Master nodes take on the NameNode and ResourceManager roles, whereas slave nodes take on the DataNode and NodeManager roles. The NameNode and ResourceManager can run on different nodes.
The steps below explain how to set up Hadoop 2.x on a single-node cluster.
Prerequisites:
• Java 6 or later installed
• Dedicated user for Hadoop
• SSH configured
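A quick way to confirm the first and third prerequisites is to check that the java and ssh binaries are on the PATH. This is a minimal sketch; adapt the tool list as needed:

```shell
# Print a status line for each tool the setup relies on
for tool in java ssh; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: NOT FOUND"
  fi
done
```

Once key-based SSH is configured, `ssh localhost` should also log in without a password prompt.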

Platform:
We are using macOS, but the same steps also work on Linux. To install on Windows, we have to install Cygwin to provide a shell environment.
Download
• Download the tarball from http://hadoop.apache.org/releases.html
• Extract it into /Application/hadoop-2.3.0
Setup Environment
1. $ export HADOOP_HOME=/Application/hadoop-2.3.0
2. $ export PATH=$HADOOP_HOME/bin:$PATH
3. $ export PATH=$HADOOP_HOME/sbin:$PATH
Note: We can also add the above commands to the bash profile to avoid repeating these steps in every new shell.
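For example, the note above can be applied by appending the exports to ~/.bash_profile (this assumes bash; adjust the file name for other shells):

```shell
# Append the Hadoop environment variables to the bash profile
cat >> ~/.bash_profile <<'EOF'
export HADOOP_HOME=/Application/hadoop-2.3.0
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
EOF
```

New terminal sessions will then pick up the variables automatically.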
Create directories
Create the namenode and datanode directories as below:
$ mkdir -p $HADOOP_HOME/data/hdfs/namenode
$ mkdir -p $HADOOP_HOME/data/hdfs/datanode
Change in yarn-site.xml
Change in /Application/hadoop-2.3.0/etc/hadoop/yarn-site.xml as below

<configuration>
<!-- Site specific YARN configuration properties -->
<property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
</property>
<property>
   <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>

Change in core-site.xml
Change in /Application/hadoop-2.3.0/etc/hadoop/core-site.xml

<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000/</value>
 </property>
</configuration>
Change in hdfs-site.xml
Change in /Application/hadoop-2.3.0/etc/hadoop/hdfs-site.xml:

<configuration>
<property>
   <name>dfs.replication</name>
   <value>1</value>
 </property>
 <property>
   <name>dfs.namenode.name.dir</name>
   <value>file:/Application/hadoop-2.3.0/data/hdfs/namenode</value>
 </property>
 <property>
   <name>dfs.datanode.data.dir</name>
   <value>file:/Application/hadoop-2.3.0/data/hdfs/datanode</value>
 </property>
</configuration>

Change in mapred-site.xml
Change in /Application/hadoop-2.3.0/etc/hadoop/mapred-site.xml. If it is not available, create one.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
   <name>mapreduce.framework.name</name>
   <value>yarn</value>
</property>
</configuration>
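With all four files edited, a quick well-formedness check of the XML can save a failed daemon start later. This sketch assumes xmllint is available (it ships with libxml2 on macOS and most Linux distributions) and falls back to a message when a file is absent:

```shell
# Validate that each Hadoop config file is well-formed XML
CONF_DIR="${HADOOP_CONF_DIR:-/Application/hadoop-2.3.0/etc/hadoop}"
for f in core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml; do
  if [ -f "$CONF_DIR/$f" ]; then
    xmllint --noout "$CONF_DIR/$f" && echo "$f: OK"
  else
    echo "$f: not found in $CONF_DIR"
  fi
done
```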

Logging
Update the conf/log4j.properties file to customize the Hadoop logging configuration.
Hadoop uses Apache log4j via the Apache Commons Logging framework.
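For instance, the default log level and destination are controlled by the hadoop.root.logger property. The lines below are illustrative values, not required settings:

```properties
# conf/log4j.properties (excerpt)
# Default logging level and destination for all Hadoop daemons
hadoop.root.logger=INFO,console
# Example override: log the HDFS package at DEBUG
log4j.logger.org.apache.hadoop.hdfs=DEBUG
```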
Format namenode

$ hadoop namenode -format
You will get a message ending like the one below:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at username.local/xx.yyy.zz.aa
************************************************************/
Start HDFS server
Run the jps command to see which Java processes are running:

$ jps
912 Jps

Only Jps itself is listed, which means the HDFS NameNode has not been started yet.
Start namenode

$ sh hadoop-daemon.sh start namenode

Start datanode 

$ sh hadoop-daemon.sh start datanode
$ jps
1305 Jps
1238 DataNode
1201 NameNode

Start ResourceManager

$ sh yarn-daemon.sh start resourcemanager

Start NodeManager

$ sh yarn-daemon.sh start nodemanager

Start JobHistoryServer

$ sh mr-jobhistory-daemon.sh start historyserver
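Once all five daemons are up, jps should list NameNode, DataNode, ResourceManager, NodeManager, and JobHistoryServer. A small guarded check (illustrative; jps ships with the JDK):

```shell
# List running Java processes, or explain why we cannot
if command -v jps >/dev/null 2>&1; then
  jps
else
  echo "jps not found on PATH"
fi
```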

Web interface
Browse HDFS and check cluster health at http://localhost:50070 in the browser. The YARN ResourceManager web UI is available at http://localhost:8088.

Reference
Hadoop Essence: The Beginner's Guide to Hadoop & Hive
