Installing Hadoop in Pseudo-Distributed Mode (Single Node Cluster) on an Ubuntu 16.04 system.

In this post we shall see how to install Hadoop on a single node cluster (pseudo-distributed mode).

The Hadoop software library is a framework that allows distributed processing of large data sets across clusters of computers (commodity computers as well) using simple programming models. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage. “Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.” For more information, see the following link.

The Hadoop framework contains the following modules (top to bottom as shown in the figure below):

  1. Hadoop MapReduce.
  2. Hadoop YARN.
  3. Hadoop Distributed File System (HDFS).
Architecture of Hadoop

We shall now see how to install a Hadoop single node cluster, also known as pseudo-distributed mode, where two nodes (a master and a slave process) are simulated as separate JVMs. This mode gives the view of a fully fledged test cluster environment. In this mode HDFS (Hadoop Distributed File System) is used for storage (it uses some portion of our hard disk) and YARN (Yet Another Resource Negotiator) is used for managing the resources.

The following is the step by step procedure for installing a Hadoop single node cluster.

Requirements:
Ubuntu system.
Hadoop tar (stable version: 2.9.1), about 345 MB, can be downloaded from the following link.

Installation Procedure:
Before starting the installation of Hadoop we check whether Java is already installed.

java -version (If Java is installed, then we get the version of Java installed in our system.)

If Java is not installed, use sudo apt-get install default-jre


or install Java from source.

Or use

sudo apt-get install openjdk-8-jre

Preferably install Java 8, because Java 9 gave some errors in the multi-node cluster setup.
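The Java installation path will be needed later when we set JAVA_HOME. On Ubuntu it can usually be found with the following command (the exact path printed depends on which JDK package was installed):

readlink -f $(which java)

The JAVA_HOME directory is the part of the printed path up to and including the JDK folder, for example /usr/lib/jvm/java-8-openjdk-amd64.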

HADOOP INSTALLATION:

Step 1: Add a new user group. (We add new user/users to this group.)

sudo addgroup hadoop # a new group named hadoop is added

Step 2: Add a new user to this group.

sudo adduser --ingroup hadoop hduser

We get a prompt for a password; please give a password and confirm it as shown:

Adding a new user in hadoop group.

Step 3: Log in as hduser using su hduser and generate ssh keys for passwordless login to localhost, so that Hadoop, which is going to be installed in pseudo-distributed mode, can log in to localhost without a password and perform the tasks assigned to it.

To login as hduser in hadoop group.

Using

ssh-keygen -t rsa -P ""

we get a prompt asking for a path for the generated key; press Enter to accept the default. We then need to append the generated public key to authorized_keys:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
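If ssh still asks for a password after this, the permissions on ~/.ssh may be too open; as an optional precaution they can be tightened with:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys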

Step 4: We finally test whether the local machine can be connected to as hduser or not.
ssh localhost # connect to localhost from hduser without a password

Using ssh localhost for passwordless login.

We get something like the above; type yes and we are logged in to localhost.
Type exit to come back to hduser.

So far we have prepared the system for the installation of Hadoop. We shall now see how to put Hadoop on top of this machine.
We install Hadoop in /usr/local/ so that it is in a common place for all users (we can install it in any other location as well).
First extract the downloaded .tar file and rename the extracted folder to hadoop, as shown below.
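A minimal sketch, assuming the downloaded archive is named hadoop-2.9.1.tar.gz and is in the current directory:

tar -xzf hadoop-2.9.1.tar.gz
mv hadoop-2.9.1 hadoop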

Step 5: Move the extracted folder to /usr/local/ and restrict access permissions to hduser in the hadoop group.
sudo mv hadoop /usr/local
Next change the ownership of the hadoop folder to hduser in the hadoop group.

sudo chown -R hduser:hadoop /usr/local/hadoop

Step 6: We now make some changes in the .bashrc file (~/.bashrc). We add the following lines at the end of the .bashrc file.

# $HOME/.bashrc
# Set Hadoop-related environment variables
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_HDFS_HOME=/usr/local/hadoop
export YARN_HOME=/usr/local/hadoop
# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
lzohead () {
  hadoop fs -cat $1 | lzop -dc | head -1000 | less
}
# Add Hadoop bin/ and sbin/ directories to PATH
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
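After saving the file, reload it so the new variables and aliases take effect in the current shell:

source ~/.bashrc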

Step 7:
We now configure our Hadoop files for the system to work as a single node cluster.
Open the hadoop-env.sh file, which is present in /usr/local/hadoop/etc/hadoop/, uncomment the line export JAVA_HOME and set it to the path of the Java directory, as shown below.

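For example, with the OpenJDK 8 package installed earlier, the line in hadoop-env.sh becomes (adjust the path if your JDK lives elsewhere):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64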

Then create a new tmp folder, which Hadoop uses to store its files while running tasks, and change the ownership of this folder to hduser under the user group hadoop, as shown below.
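A minimal sketch, assuming /app/hadoop/tmp as the temporary directory (this matches the hadoop.tmp.dir value set in core-site.xml below):

sudo mkdir -p /app/hadoop/tmp
sudo chown hduser:hadoop /app/hadoop/tmp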

We next change the core-site.xml:
$ vi /usr/local/hadoop/etc/hadoop/core-site.xml and put the following
code in the space between <configuration> . . . </configuration>

The following should be put:

<!-- conf/core-site.xml -->
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>

Similarly, we change mapred-site.xml, present in the same directory,
in between <configuration> . . . </configuration> as shown:

<!-- conf/mapred-site.xml -->
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>

Similarly, in hdfs-site.xml we put the following code in the space between
<configuration> . . . </configuration>, as shown below:

<!-- conf/hdfs-site.xml -->
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>

Once this is done we are ready to start our Hadoop node and work on it, but before
that the NameNode needs to be formatted, as shown below.
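A minimal sketch of the format command, run as hduser (the path assumes Hadoop was installed in /usr/local/hadoop as above):

/usr/local/hadoop/bin/hdfs namenode -format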

We get a message that the NameNode has been formatted if everything described
above is done properly.
To start the Hadoop system we have shell scripts which take care of starting all the
nodes, managers and other daemons.

To start the Hadoop file system we have the following command:
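The command below runs the start-all.sh script from Hadoop's sbin directory, as hduser (the path assumes the installation location used above):

/usr/local/hadoop/sbin/start-all.sh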

The start-all.sh script prints a message for each daemon as it starts.

To check whether all the daemons are running we can use jps. The output should look
similar to the example shown below.
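An illustrative jps listing for a healthy single node cluster (the process IDs on the left are examples and will differ on your machine):

$ jps
6224 NameNode
6425 DataNode
6688 SecondaryNameNode
6873 ResourceManager
7180 NodeManager
7456 Jps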

To stop, we also have a stop-all shell script which stops all the processes that were started: sbin/stop-all.sh

Stopping all the started processes.

We may get a warning that the scripts start-all.sh and stop-all.sh are deprecated, so we can use start-dfs.sh and start-yarn.sh, and stop-dfs.sh and stop-yarn.sh, to start and stop our single node cluster.
HURRAY! Hadoop is successfully installed on your system.

The installation of a multi-node Hadoop cluster is explained in this article.

Django | Image Processing | Python| SSSIHL | https://www.linkedin.com/in/gauthamdasu/