Thursday, May 12, 2016

Fighting with Apache Avro on Hadoop

Search title: Installing Apache Avro on Hadoop on Ubuntu
Search title: Installing Apache Avro on Hadoop 2.7.2 on Ubuntu 16.04

For the swear jar: everybody these days uses a Cloudera or Hortonworks VM with a single-node Hadoop cluster, so it's hard to find useful information on the internet amidst the noise made by these locally-hosted single-noders!

Anyhow, I wanted to do it all from scratch. In a previous post, I described how to set up a Hadoop master-slave cluster. (Link: http://varghese85-cs.blogspot.com/2016/03/hadoop-cluster-setup-2-nodes-master.html ) So now, based on that config, I have a four-node Hadoop cluster running, with one master node (namenode and resource manager) and three slave nodes (datanodes).

As the next step, I wanted to install "the rest of the animals in the zoo" on said cluster. I started with Oozie, but gave up on it, because they don't publish binaries and their sources don't build (thanks to the Codehaus repository closing its doors?).

The next animal I wanted to install (although technically not an animal) was Apache Avro (https://avro.apache.org/). I am following Tom White's book (Hadoop: The Definitive Guide, 4th edition; Link: http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1491901632 ). Chapter 12 gives an example Avro MapReduce program in Java on pages 360-361, which I wanted to try out.

Avro really isn't so much a project as a set of libraries, so I thought this would be easy. Boy, was I mistaken!

At first, I added Maven dependencies for org.apache.avro:avro and org.apache.avro:avro-tools. That compiled fine, but did not work at all.

Then I tried using the -libjars option to ship the avro, avro-tools, and avro-mapred jars along with my Hadoop program. Even that kept failing, saying the AvroJob class could not be found.
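For reference, that attempt looked something like the sketch below (the jar paths and the driver class name are placeholders, not my actual ones; note that -libjars is only honored when the driver goes through ToolRunner / GenericOptionsParser, which mine does):

  hadoop jar my-avro-job.jar com.example.AvroDriver \
      -libjars avro-1.7.7.jar,avro-mapred-1.7.7.jar,avro-tools-1.7.7.jar \
      input output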

All the while, I was trying to use Avro 1.7.7, the latest stable release. After banging my head against the wall for a while without being able to make it work, I decided to figure out where Hadoop keeps its classpath jars - that is, where Hadoop stores the third-party jars it loads.

I found that the location was share/hadoop/common/lib (PS: I'm using Hadoop 2.7.2). And on top of that, to my wonderment, avro-1.7.4.jar was already there, although none of the other Avro jars were.

So first off, I switched my pom.xml to use Avro 1.7.4 instead of 1.7.7. Then, from the Avro archives site (Link: http://archive.apache.org/dist/avro/ ), I downloaded avro-ipc-1.7.4.jar and avro-mapred-1.7.4.jar and placed them in the same location (on all the nodes in the cluster). (How did I know to get these files? I watched which jars Maven downloaded when compiling!)

At this point, the map phase started working. I was excited. But the job kept consistently failing in the reduce phase. The exact error I found was:
Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

This did not make any sense to me, but after some googling, I found that it happens because the default avro-mapred-1.7.4.jar is compiled against the old Hadoop 1 API, where TaskAttemptContext is a class, whereas on Hadoop 2 it is an interface. That meant downloading avro-mapred-1.7.4-hadoop2.jar instead, and specifying the classifier clause in the Maven pom.xml.

And with that, I was finally able to make Avro work on my Hadoop 2.7.2 cluster :)
The struggle with the Apache ecosystem is real - hopefully, someday BigTop will deliver!

Here are the specifics of what I did, in case you're trying to replicate this.

The extra jars go into share/hadoop/common/lib (PS: hadoop-2.7.2 is the folder where I extracted the hadoop-2.7.2.tar.gz file).
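After adding the downloaded jars, the Avro-related entries in that directory look roughly like this (only the Avro jars are shown; the directory also contains dozens of other Hadoop dependency jars):

  avro-1.7.4.jar                   (already ships with Hadoop 2.7.2)
  avro-ipc-1.7.4.jar               (downloaded from archive.apache.org/dist/avro/)
  avro-mapred-1.7.4-hadoop2.jar    (downloaded from archive.apache.org/dist/avro/)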
This is the relevant section of my pom.xml (note the use of the classifier):
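In essence it boils down to the two dependencies below (only the Avro bits of the pom.xml are shown here):

  <!-- 1.7.4 matches the avro jar that ships with Hadoop 2.7.2; the hadoop2
       classifier selects the avro-mapred build compiled against the Hadoop 2 APIs. -->
  <dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>1.7.4</version>
  </dependency>
  <dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-mapred</artifactId>
    <version>1.7.4</version>
    <classifier>hadoop2</classifier>
  </dependency>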
Also, unlike the book example, where the schema is hard-coded into the program as a static final string, I made an .avsc file on HDFS. That meant I had to pass the name of the file as a parameter to the MapReduce program (there are no shared variables - main, map, and reduce run in different JVMs!). In the driver (main / ToolRunner.run) I read the file off HDFS into a string and set a configuration value with that string, so that within the mappers and reducers I can get the schema back from the configuration via the context (this can definitely be optimized further!).
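Roughly, that plumbing looks like the sketch below (a minimal sketch; the class name, the helper method names, and the "schema.literal" configuration key are just placeholders of mine):

  import java.io.ByteArrayOutputStream;
  import java.io.IOException;

  import org.apache.avro.Schema;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IOUtils;

  public class SchemaPlumbing {

      // Driver side (inside ToolRunner.run): read the .avsc file off HDFS
      // and stash its contents in the job configuration as a plain string.
      public static void putSchemaInConf(Configuration conf, String avscPath) throws IOException {
          FileSystem fs = FileSystem.get(conf);
          ByteArrayOutputStream out = new ByteArrayOutputStream();
          try (FSDataInputStream in = fs.open(new Path(avscPath))) {
              IOUtils.copyBytes(in, out, conf, false);
          }
          conf.set("schema.literal", out.toString("UTF-8"));
      }

      // Mapper/reducer side (call from setup() with context.getConfiguration()):
      // rebuild the Schema object from the string stored in the configuration.
      public static Schema getSchemaFromConf(Configuration conf) {
          return new Schema.Parser().parse(conf.get("schema.literal"));
      }
  }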

Hope that was useful. If you have questions, please feel free to comment below :)
