First off, this is just a description of my experience setting up a two-node hadoop cluster, primarily for my own future reference, and secondarily for the edification of the curious blog-reader. I do not claim to be an expert on Hadoop, and there are probably errors in my setup steps below that someone who knows hadoop well could correct.
References:
- Hadoop single node setup: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html
- Hadoop cluster setup: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html
- Video tutorial of an old version of hadoop: https://www.youtube.com/watch?v=PFKs592W_Rc
To start off, I have two VMware virtual machines, hadoop-master and hadoop-slave1. Each runs Ubuntu 15.10 x64 desktop. Their IP addresses are, respectively:
hadoop-master: 192.168.131.155
hadoop-slave1: 192.168.131.156
Both VMs have a single user named "hadoop".
Step 0: Packages Required
- sudo apt-get install ssh rsync default-jdk
- Download hadoop from http://hadoop.apache.org/releases.html and extract it somewhere on each VM
Step 1: Make the hosts know each other
Here we update /etc/hosts so that:
(1) the hostname of this host refers to its external IP address and not to a loopback address like 127.0.0.1, and
(2) the hostname of the other VM is also mapped to that VM's IP address.
So, /etc/hosts for hadoop-master looks like:
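(a sketch, based on the IP addresses above - the important part is that both hostnames resolve to the external IPs rather than to a loopback address)
    127.0.0.1        localhost
    192.168.131.155  hadoop-master
    192.168.131.156  hadoop-slave1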
and /etc/hosts for hadoop-slave1 looks like:
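(essentially the same entries again - both hosts need both mappings)
    127.0.0.1        localhost
    192.168.131.155  hadoop-master
    192.168.131.156  hadoop-slave1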
Step 2: Set up passwordless ssh
Run these three commands on each VM:
- ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
- ssh-copy-id -i ~/.ssh/id_dsa.pub hadoop@hadoop-master
- ssh-copy-id -i ~/.ssh/id_dsa.pub hadoop@hadoop-slave1
The first command creates a key pair for passwordless ssh. The public key is in id_dsa.pub (since we specified id_dsa as the target file). The other two commands install the public key on this machine as well as on the other machine. Once this is done, from each machine you should be able to ssh to localhost, hadoop-master, and hadoop-slave1 without providing a password.
Step 3: JAVA_HOME environment variable
The command "java -version" will tell you the java version on your VMs. You don't need to know this, in our case, but just FYI and also if you need to look up java version vs. hadoop version compatibility.
On my Ubuntu instance, JAVA_HOME should point to a version-specific subfolder under /usr/lib/jvm/. Conveniently, there is a symbolic link 'default-java' that automatically points to the correct version folder, so my JAVA_HOME needs to be set to "/usr/lib/jvm/default-java/". Now, where do I need to set this? There are two places (you need to do this on both VMs):
- In your .bashrc file, at the very end of the file, export JAVA_HOME (see the example after this list)
- In hadoop's etc/hadoop/hadoop-env.sh
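For example, a line like the following (assuming the default-java symlink mentioned above exists on your system) should work in both files:
    export JAVA_HOME=/usr/lib/jvm/default-java/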
Step 4: Basic hadoop / HDFS
From this step on, we are going to follow the Single Node Setup tutorial, adapting it into a cluster setup as we go. The Cluster Setup tutorial itself is a rather complicated list of all the options, most of which we do not care about.
Configure etc/hadoop/core-site.xml and etc/hadoop/hdfs-site.xml; they are identical on both VMs.
core-site.xml:
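A minimal sketch along these lines should do - fs.defaultFS points at the NameNode address described below:
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop-master:9000</value>
      </property>
    </configuration>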
hdfs-site.xml:
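Again a minimal sketch, setting the replication count to 2 since we have two nodes:
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>2</value>
      </property>
    </configuration>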
core-site.xml specifies the NameNode, as described in the Cluster Setup tutorial; we make hadoop-master:9000 the NameNode. hdfs-site.xml defines the replication count, which we set to 2.
Next we update etc/hadoop/slaves to list all the machines in the pool. I listed both machines, and I made this file identical on both machines - not sure if that is the right thing to do.
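In my case, etc/hadoop/slaves simply lists the two hostnames, one per line:
    hadoop-master
    hadoop-slave1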
At this point, we have a basic hadoop setup. Next you need to format the HDFS filesystem, since this is a brand new HDFS filesystem. This only needs to be done on the master, and it is a one-time operation - you don't need to do it every time you bring up HDFS.
- bin/hdfs namenode -format
- sbin/start-dfs.sh
- bin/hdfs dfs -mkdir /user
- bin/hdfs dfs -mkdir /user/hadoop
Let's add some files to the HDFS filesystem from the master. Specifically, we copy the folder etc/hadoop from our local filesystem to a folder named "input" on the HDFS filesystem, so we can run the map-reduce example from the tutorial. (If you did the single node setup, you will notice that this put operation takes longer. That is because we specified a replication level of 2, so each file also needs to be copied over the network to the slave.)
- bin/hdfs dfs -put etc/hadoop input
- bin/hdfs dfs -ls input
- bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'had[a-z.]+'
- bin/hdfs dfs -ls output
- bin/hdfs dfs -cat output/*
- bin/hdfs dfs -get output output
- cat output/*
Also, we can examine the status of HDFS by going to the URL
- http://hadoop-master:50070
Step 5: Actual distribution / YARN
The step above really only set up the HDFS filesystem. The map-reduce job we ran still ran locally and was not executed as a distributed map-reduce job (sorry!). To actually distribute the processing, we need to set up the ResourceManager node specified in the Cluster Setup tutorial. For that we need to set up the map-reduce framework, which in our case is YARN.
First, configure mapred-site.xml and yarn-site.xml as follows (same on both VMs):
mapred-site.xml:
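A minimal sketch - the single property here selects YARN as the map-reduce framework, as described below:
    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
    </configuration>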
yarn-site.xml:
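A minimal sketch along these lines should work - yarn.resourcemanager.hostname names the resource manager, and the aux-services entry enables mapreduce_shuffle:
    <configuration>
      <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop-master</value>
      </property>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
    </configuration>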
The former tells hadoop to use YARN as the map-reduce framework. The latter tells hadoop that "hadoop-master" is the ResourceManager; you could also set up a multi-node cluster where the resource manager is not on the same host as the node managers. (I have no idea what mapreduce_shuffle does.)
At this point, we are ready to start YARN. Remember that YARN is practically useless unless you run it on top of HDFS, so we first start HDFS and then start YARN (from the master):
- sbin/start-dfs.sh
- sbin/start-yarn.sh
- http://hadoop-master:8088 (the ResourceManager web UI)
- bin/hdfs dfs -rm -r output
- bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'had[a-z.]+'
Finally, don't forget to shutdown correctly (from master):
- sbin/stop-yarn.sh
- sbin/stop-dfs.sh
Step 6: Reformatting HDFS (if needed)
[Reference: http://stackoverflow.com/questions/26545524/there-are-0-datanodes-running-and-no-nodes-are-excluded-in-this-operation]
This is not really a step in itself, and you don't normally need to do it. However, if you ever need to reformat HDFS, first stop both YARN and DFS, clear out hadoop's files under /tmp, and then reformat:
- sbin/stop-yarn.sh
- sbin/stop-dfs.sh
- rm -rf /tmp/*
- bin/hdfs namenode -format
Once you do that, you can bring HDFS back up and create a home folder for the hadoop user:
- sbin/start-dfs.sh
- bin/hdfs dfs -mkdir -p /user/hadoop
- sbin/start-yarn.sh
Step 7: Move HDFS storage out of /tmp
As we found out in step 6, hadoop stores its data in the /tmp folder, which is susceptible to being deleted. However, we do not want the data we store on the HDFS filesystem to be deleted. To correct this, first make two folders named "hadoop-name" and "hadoop-data" somewhere that the hadoop user can write to. I have made them in /home/hadoop, as below. The hadoop-name folder will store the namenode information and logs, and the hadoop-data folder will store the actual data on the filesystem.
- /home/hadoop/hadoop-name
- /home/hadoop/hadoop-data
Finally, edit etc/hadoop/hdfs-site.xml (on all machines) and add these two properties:
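Something like the following, assuming the /home/hadoop paths above (plain local paths instead of file:// URIs should also work):
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:///home/hadoop/hadoop-name</value>
    </property>
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:///home/hadoop/hadoop-data</value>
    </property>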
dfs.namenode.name.dir should point to the hadoop-name folder we created, and dfs.datanode.data.dir should point to the hadoop-data folder we created. At this point, you can delete any hadoop files in /tmp, reformat the namenode, and start HDFS and YARN:
- rm -rf /tmp/*
- bin/hdfs namenode -format
- sbin/start-dfs.sh
- bin/hdfs dfs -mkdir -p /user/hadoop
- sbin/start-yarn.sh