How to begin with Hadoop

Hadoop is an open source platform, written in Java and distributed under the Apache License, that allows us to store, manage and process very large amounts of data in a highly parallel manner on clusters of commodity machines. It is most suitable for batch processing. People and organizations use Hadoop to build their ETL tools, to perform their BI operations, to do analytics and so on.

You can find countless posts on Hadoop on the internet, and most of them are really good. But quite often, newcomers face issues even after doing everything as specified. I was no exception. In fact, friends who are just starting their Hadoop journey often call me up and tell me that they are facing issues even after doing everything in order. So I thought of writing down the things that worked for me. I am not going to cover Hadoop in detail, as there are many better posts that outline everything pretty well. Here I'll just show you how to get started with Hadoop on a single Linux box in pseudo-distributed mode.

Prerequisites :

1- Sun (Oracle) Java must be installed on the machine.
2- ssh must be installed and a key pair must already be generated.

NOTE : Ubuntu comes with its own Java implementation (i.e. OpenJDK), but Sun (Oracle) Java is the preferable choice for Hadoop. You can visit this link if you need some help on how to install it.

NOTE : You can visit this link if you want to see how to set up and configure ssh on your Ubuntu box.
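
Before going further, it may help to confirm that both prerequisites are really in place. The following is only a rough sketch: it assumes the key pair lives in the default ~/.ssh location and uses a small helper, have_cmd, that is my own naming and not part of any tool.

```shell
# Report whether a command is available on PATH (hypothetical helper).
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

for tool in java ssh ssh-keygen; do
    if have_cmd "$tool"; then
        echo "$tool: found"
    else
        echo "$tool: missing"
    fi
done

# Look for an existing public key in the default location (assumed: ~/.ssh).
if ls "$HOME"/.ssh/id_*.pub >/dev/null 2>&1; then
    echo "ssh key pair: found"
else
    echo "ssh key pair: missing (ssh-keygen -t rsa would create one)"
fi
```

If anything is reported as missing, sort that out first; the later steps depend on it.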

Software versions used :

1- Linux (Ubuntu 12.04)
2- Java (Oracle java-7)
3- Hadoop (Apache hadoop-1.0.3)
4- OpenSSH_5.9p1 Debian-5ubuntu1, OpenSSL 1.0.1 14

If you have everything in place, start following the steps shown below to configure Hadoop on your machine :

1- Download the stable release of Hadoop from the Apache Hadoop website and copy it to some convenient location, say your home directory.

2- Now, right-click the compressed file you have just downloaded and choose "Extract Here". This will create the hadoop-* folder inside your home directory. We'll call this location HADOOP_HOME hereafter. So, your HADOOP_HOME=/home/your_username/hadoop-*
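
If you prefer the terminal to the file manager, the extraction in step 2 can be sketched as below. The tarball name and location are assumptions based on the version listed above (hadoop-1.0.3); adjust TARBALL to whatever you actually downloaded.

```shell
# Assumed download location and name; change these to match your download.
TARBALL="${TARBALL:-$HOME/hadoop-1.0.3.tar.gz}"
HADOOP_HOME="${HADOOP_HOME:-$HOME/hadoop-1.0.3}"

if [ -f "$TARBALL" ]; then
    # Extract next to where HADOOP_HOME is expected to appear.
    tar -xzf "$TARBALL" -C "$(dirname "$HADOOP_HOME")"
    echo "extracted, HADOOP_HOME should now be $HADOOP_HOME"
else
    echo "tarball not found: $TARBALL"
fi
```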

3- Edit the /HADOOP_HOME/conf/ file to set the JAVA_HOME variable to point to the appropriate JVM.

export JAVA_HOME=/usr/lib/jvm/java-7-oracle
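
The file name is cut off in the step above. In a stock Hadoop 1.x download the environment script is conf/hadoop-env.sh, so the sketch below assumes that is the file meant. It appends the export line only if one is not already there; the JVM path is the one shown above and may differ on your machine.

```shell
HADOOP_HOME="${HADOOP_HOME:-$HOME/hadoop-1.0.3}"
ENV_FILE="$HADOOP_HOME/conf/hadoop-env.sh"   # assumed file name
JAVA_DIR="/usr/lib/jvm/java-7-oracle"        # path from the step above

# No-op on a real installation, where conf/ already exists.
mkdir -p "$HADOOP_HOME/conf"

# Append the export only if no active JAVA_HOME line is present yet.
if ! grep -q "^export JAVA_HOME=" "$ENV_FILE" 2>/dev/null; then
    echo "export JAVA_HOME=$JAVA_DIR" >> "$ENV_FILE"
fi

# Show the line that is now in effect.
grep "^export JAVA_HOME=" "$ENV_FILE"
```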

NOTE : Before moving further, create a directory (hdfs, for instance) with the subdirectories name, data and tmp. We'll use these directories as the values of properties in the configuration files.

NOTE : Change the permissions of the directories created in the previous step to 755. Permissions that are too open or too restrictive may result in abnormal behavior. Use the following command to do that :

cluster@ubuntu:~$ sudo chmod -R 755 /home/cluster/hdfs/
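
The two notes above can be combined into one short sketch. It assumes the hdfs directory sits in your home directory (which matches /home/cluster/hdfs for a user named cluster), so sudo is not needed; HDFS_BASE is just a variable name I picked for illustration.

```shell
# Base directory for HDFS data; assumed to live under the home directory.
HDFS_BASE="${HDFS_BASE:-$HOME/hdfs}"

# Create the three subdirectories used later in the configuration files.
mkdir -p "$HDFS_BASE/name" "$HDFS_BASE/data" "$HDFS_BASE/tmp"

# rwxr-xr-x for the whole tree, as recommended above.
chmod -R 755 "$HDFS_BASE"

ls -ld "$HDFS_BASE"/*
```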

4- Now we'll start the actual configuration process. Hadoop is configured using a set of configuration files present inside the HADOOP_HOME/conf directory. These are XML files holding a set of properties in the form of key-value pairs. We'll modify the following 3 files for our setup :

I- HADOOP_HOME/conf/core-site.xml : Add the following lines between the <configuration></configuration> tags -

The first property is the URI (protocol specifier, hostname, and port) that describes the NameNode for the cluster. Each node in the system on which Hadoop is expected to operate needs to know the address of the NameNode.

hadoop.tmp.dir : A base for temporary directories. The value of this property defaults to a location under the /tmp directory, which is typically cleared on reboot. So it is always better to set this property to some other location to avoid losing data.
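
The XML lines themselves did not survive in the text above, so here is a sketch of what core-site.xml could look like. In Hadoop 1.x the NameNode URI property is called fs.default.name; the value hdfs://localhost:9000 is an assumption on my part (a commonly used port, not something stated in this post), and the tmp path assumes the hdfs directory created earlier. Note that this writes the whole file rather than editing it in place.

```shell
HADOOP_HOME="${HADOOP_HOME:-$HOME/hadoop-1.0.3}"
HDFS_BASE="${HDFS_BASE:-$HOME/hdfs}"
mkdir -p "$HADOOP_HOME/conf"   # no-op on a real installation

# Overwrites core-site.xml; host and port below are assumed values.
cat > "$HADOOP_HOME/conf/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>$HDFS_BASE/tmp</value>
  </property>
</configuration>
EOF
```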

II- HADOOP_HOME/conf/hdfs-site.xml : Add the following lines between the <configuration></configuration> tags -

The first property is the path on the local file system of the NameNode instance where the NameNode metadata is stored. The second is the path on the local file system in which the DataNode instance should store its data. Both default to a location under the /tmp directory if not specified explicitly.
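
As with the previous file, the XML is missing from the text, so this is a sketch rather than the original snippet. In Hadoop 1.x the two properties described above are named dfs.name.dir and dfs.data.dir; the paths assume the name and data directories created earlier under the hdfs directory.

```shell
HADOOP_HOME="${HADOOP_HOME:-$HOME/hadoop-1.0.3}"
HDFS_BASE="${HDFS_BASE:-$HOME/hdfs}"
mkdir -p "$HADOOP_HOME/conf"   # no-op on a real installation

# Overwrites hdfs-site.xml with the two storage locations.
cat > "$HADOOP_HOME/conf/hdfs-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>$HDFS_BASE/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>$HDFS_BASE/data</value>
  </property>
</configuration>
EOF
```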

III- HADOOP_HOME/conf/mapred-site.xml : Add the following lines between the <configuration></configuration> tags -

mapred.job.tracker : The host and port at which the JobTracker will run.
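
A sketch of the matching mapred-site.xml follows. The property name comes from the text above; the value localhost:9001 is an assumption (any free port on the local machine should do), since the original snippet is not present here.

```shell
HADOOP_HOME="${HADOOP_HOME:-$HOME/hadoop-1.0.3}"
mkdir -p "$HADOOP_HOME/conf"   # no-op on a real installation

# Overwrites mapred-site.xml; host and port are assumed values.
cat > "$HADOOP_HOME/conf/mapred-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
EOF
```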

NOTE : Many other properties exist and play an important role in a large, fully distributed cluster, but the properties shown above are sufficient to set up a pseudo-distributed Hadoop cluster on a single machine.

5- The configuration part is over now. Before proceeding further, we have to format our HDFS (like any other file system). Use the following command to do that :

cluster@ubuntu:~/hadoop-1.0.3$ bin/hadoop namenode -format

If everything went well, you'll see something like this on your terminal :

12/07/23 05:43:22 INFO namenode.NameNode: STARTUP_MSG:
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ubuntu/
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.0.3
STARTUP_MSG: build = -r 1335192; compiled by 'hortonfo' on Tue May 8 20:31:25 UTC 2012
12/07/23 05:43:22 INFO util.GSet: VM type = 64-bit
12/07/23 05:43:22 INFO util.GSet: 2% max memory = 17.77875 MB
12/07/23 05:43:22 INFO util.GSet: capacity = 2^21 = 2097152 entries
12/07/23 05:43:22 INFO util.GSet: recommended=2097152, actual=2097152
12/07/23 05:43:22 INFO namenode.FSNamesystem: fsOwner=cluster
12/07/23 05:43:22 INFO namenode.FSNamesystem: supergroup=supergroup
12/07/23 05:43:22 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/07/23 05:43:22 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
12/07/23 05:43:22 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/07/23 05:43:22 INFO namenode.NameNode: Caching file names occuring more than 10 times
12/07/23 05:43:22 INFO common.Storage: Image file of size 113 saved in 0 seconds.
12/07/23 05:43:23 INFO common.Storage: Storage directory /home/cluster/hdfs/name has been successfully formatted.
12/07/23 05:43:23 INFO namenode.NameNode: SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/
cluster@ubuntu:~/hadoop-1.0.3$
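
Besides reading the log, you can check the result on disk. As far as I know, a formatted Hadoop 1.x NameNode directory contains a current/VERSION file, so the sketch below simply tests for it. The helper name is_formatted and the default NAME_DIR are my own choices for illustration.

```shell
# Succeeds if the given NameNode directory looks formatted
# (assumption: formatting creates current/VERSION inside it).
is_formatted() {
    [ -f "$1/current/VERSION" ]
}

NAME_DIR="${NAME_DIR:-$HOME/hdfs/name}"
if is_formatted "$NAME_DIR"; then
    echo "$NAME_DIR: formatted"
else
    echo "$NAME_DIR: not formatted yet"
fi
```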

6- Once the formatting is done, start the NameNode, Secondary NameNode and DataNode daemons using the command shown below :

cluster@ubuntu:~/hadoop-1.0.3$ bin/

This will emit the following lines on the terminal :

starting namenode, logging to /home/cluster/hdfs/logs/hadoop-cluster-namenode-ubuntu.out
localhost: starting datanode, logging to /home/cluster/hdfs/logs/hadoop-cluster-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /home/cluster/hdfs/logs/hadoop-cluster-secondarynamenode-ubuntu.out
cluster@ubuntu:~/hadoop-1.0.3$

7- To start the JobTracker and Tasktracker daemons use :

cluster@ubuntu:~/hadoop-1.0.3$ bin/

This will emit the following lines on the terminal :

starting jobtracker, logging to /home/cluster/hdfs/logs/hadoop-cluster-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to
cluster@ubuntu:~/hadoop-1.0.3$
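
The script names in steps 6 and 7 are cut off in the commands above. A stock Hadoop 1.x download ships bin/start-dfs.sh (NameNode, Secondary NameNode, DataNode) and bin/start-mapred.sh (JobTracker, TaskTracker), so the sketch below assumes those are the scripts meant. The wrapper run_hadoop_script is a hypothetical helper that only exists to give a clear message when a script is missing.

```shell
HADOOP_HOME="${HADOOP_HOME:-$HOME/hadoop-1.0.3}"

# Run one script from $HADOOP_HOME/bin, or report that it is missing.
run_hadoop_script() {
    script="$HADOOP_HOME/bin/$1"
    if [ -x "$script" ]; then
        "$script"
    else
        echo "not found or not executable: $script" >&2
        return 1
    fi
}

# Start HDFS first, then MapReduce (assumed script names, see above).
run_hadoop_script start-dfs.sh \
    && run_hadoop_script start-mapred.sh \
    || echo "daemons were not started"
```

The same distribution also ships stop-mapred.sh and stop-dfs.sh, which can be run the same way to shut everything down again.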

NOTE : To check whether everything is working, we'll use the jps command (it ships with the JDK, so a full JDK rather than just a JRE must be installed for this) :

cluster@ubuntu:~/hadoop-1.0.3$ jps
12537 Jps
12042 SecondaryNameNode
12173 JobTracker
11783 DataNode
11487 NameNode
12421 TaskTracker
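
Rather than scanning the jps listing by eye, you could check it for the five expected daemons. This is a small sketch; missing_daemons is a helper name I made up, and it only looks at process names as they appear in the listing above.

```shell
# Print every expected daemon that does not appear in the given jps output.
missing_daemons() {
    for d in NameNode SecondaryNameNode DataNode JobTracker TaskTracker; do
        # -w matches whole words, so SecondaryNameNode does not count as NameNode.
        printf '%s\n' "$1" | grep -qw "$d" || echo "$d"
    done
}

if command -v jps >/dev/null 2>&1; then
    missing="$(missing_daemons "$(jps)")"
    if [ -z "$missing" ]; then
        echo "all daemons are running"
    else
        echo "not running:"
        echo "$missing"
    fi
else
    echo "jps not found on PATH"
fi
```

If a daemon is reported as not running, its log file under the logs directory is usually the first place to look.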

NOTE : Hadoop also provides a web interface through which we can monitor our cluster. Point your web browser to http://localhost:50070 to see the NameNode status and to http://localhost:50030 to see the MapReduce status.

The NameNode page shows all the information about your HDFS. You can even browse the file system and download files from there.

The MapReduce page shows the status of the JobTracker and includes information about all the MapReduce jobs that have run on the cluster.
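
If the box has no browser, the two pages can also be probed from the terminal. The sketch below assumes curl is installed and uses the two URLs given above; check_ui is a hypothetical helper, and it only tells you whether something answers on the port, not whether the cluster is healthy.

```shell
# Report whether anything answers at the given URL within 5 seconds.
check_ui() {
    if curl -s -o /dev/null --max-time 5 "$1"; then
        echo "$1: reachable"
    else
        echo "$1: not reachable"
    fi
}

check_ui http://localhost:50070/   # NameNode status page
check_ui http://localhost:50030/   # JobTracker (MapReduce) status page
```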

At this point you have a properly working single-node Hadoop cluster running on your Ubuntu box. As the next step, you can try configuring a fully distributed cluster. Please feel free to contact me if you need any help.

Originally posted on Data Science Central
