Thursday, June 23, 2016

Setting up Spark on top of Hadoop

In a previous post, I described setting up a Hadoop 2.7.2 two-node cluster:
http://varghese85-cs.blogspot.com/2016/03/hadoop-cluster-setup-2-nodes-master.html

In this post, I describe how to set up and run Spark 1.6.1 on top of that Hadoop cluster.

First, download the spark-1.6.1-bin-hadoop2.6.tgz package from the Spark download page at http://spark.apache.org/downloads.html and extract it somewhere on your filesystem. Now, in the post where we did the Hadoop setup, we extracted and configured Hadoop identically on all nodes in the cluster. For Spark, however, I only extracted and set it up on one of the nodes - the master. I'm not sure if it needs to be the master, or if it can be done on any single slave node - likely it can be.

Next, in your .bashrc file, export SPARK_HOME to point at your extracted folder and add the Spark bin directory to your PATH.
export SPARK_HOME=<path-to-your-folder>
export PATH=$PATH:$SPARK_HOME/bin
Also make sure that HADOOP_CONF_DIR is exported, and export it if it isn't already. You can run
echo $HADOOP_CONF_DIR
to check whether it is set, and if it is not, you can add it to the .bashrc file as
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
Remember that you need to restart bash / close & reopen the terminal for the new .bashrc to take effect.

Now, start Hadoop HDFS and YARN if they are not already running
start-dfs.sh
start-yarn.sh
Next, start the interactive spark-shell on top of YARN
spark-shell --master yarn
Note: the Spark documentation (http://spark.apache.org/docs/latest/running-on-yarn.html) says to use the command spark-shell --master yarn --deploy-mode client. However, the interactive shell can only run in client mode, so I think the latter parameter is moot.

At this point, after a short wait, you'll get to the Spark Scala prompt.
You can type
sc
at the prompt to see that this variable holds the SparkContext that is available to you.
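For example, here are a few quick things you can evaluate at the prompt to inspect the context (just a small sketch; the exact values depend on your setup):
// Inspect the SparkContext that spark-shell created for you
sc.version    // the Spark version string, e.g. "1.6.1"
sc.master     // the cluster manager the shell connected to, "yarn" in this setup
sc.appName    // the application name, typically "Spark shell"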

So, let's do a quick Spark data load. Assume that we have a large text file called bigdata.txt on our Hadoop cluster under the hadoop user. We'll load this file with the command:
val lines=sc.textFile("hdfs:/user/hadoop/bigdata.txt")
Note that we don't use a double slash after the hdfs: scheme here. If you use a double slash, you have to specify the server and port as well, such as hadoop-master:9000. See: http://stackoverflow.com/questions/27478096/cannot-read-a-file-from-hdfs-using-spark
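To make that concrete, both of the following forms should load the same file; the second is a sketch that assumes your fs.defaultFS points at hdfs://hadoop-master:9000:
// Short form: no authority, resolved against the default filesystem
val lines1 = sc.textFile("hdfs:/user/hadoop/bigdata.txt")
// Full form: explicit namenode host and port
val lines2 = sc.textFile("hdfs://hadoop-master:9000/user/hadoop/bigdata.txt")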

Next, let us count the number of lines in this file. The "lines" variable now holds an RDD, because that is what the textFile() API returns. We can use the RDD's count() method to count the number of lines.
lines.count()
Assuming your data file is large enough to span a few HDFS blocks, this will cause a Spark DAG to run on YARN to count the number of lines.
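While you're at the prompt, a couple of other simple RDD operations are worth trying; a quick sketch (the search term "hadoop" is just an arbitrary example):
// Count only the lines that contain a given word
lines.filter(line => line.contains("hadoop")).count()
// Peek at the first few lines without pulling the whole file to the driver
lines.take(5).foreach(println)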

Finally, press CTRL+D to quit the spark-shell.

Another good example I tried was this Spark Word Count example from Roberto Marchetto at http://www.robertomarchetto.com/spark_java_maven_example

I found a large text file that spans a few HDFS blocks and ran that example on the Hadoop cluster using the command:
spark-submit --class org.sparkexamples.WordCount --master yarn sparkexample-1.0-SNAPSHOT.jar /user/hadoop/bigdata.txt wc-output
Here, the jar file is the result of mvn package, bigdata.txt is the input file on HDFS, and wc-output is the HDFS folder into which Spark will write its output.
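For reference, the heart of a word count job like that one is just a handful of RDD operations. The linked example is written in Java; the following is a rough Scala sketch of the same idea (the object name and argument handling are my own illustration, not the code from the linked project):

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // args(0) = input file on HDFS, args(1) = output folder on HDFS
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    val counts = sc.textFile(args(0))
      .flatMap(line => line.split("\\s+"))  // split each line into words
      .map(word => (word, 1))               // pair each word with a count of 1
      .reduceByKey(_ + _)                   // add up the counts per word

    counts.saveAsTextFile(args(1))          // write the results to the output folder
    sc.stop()
  }
}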

Another interesting set of Spark examples/tutorials to try out is at http://spark.apache.org/examples.html.