Hadoop and AWS and Python, Oh My!

For an upcoming project at work, I needed to get a better idea of how the AWS services work together, and wanted to also see how the EC2 instances could be used for parallel processing.  Sadly, I do not love Java, and although I would use it if pressed, I wanted to see if I could find a pythony way to process some data using a hadoop setup.

So, based on this page, I created a mapper and reducer in python. The mapper looks through a file and spits out lines for each match it finds.  The reducer takes the stdout from that process (using hadoop streaming) and does the thinking, then spits out the result. The examples on that page are a fine place to start for this piece.  And you can time the process on your system here to get an idea of the speedup using the hadoop setup.
Next, I needed to get the files over to S3 so I can access them from my EC2 instance.  S3 instances are persistent, and transfers between S3 and EC2 are free, so I can run my processes an infinite number of times without incurring new costs for grabbing the files.  First, I created a bucket using the Python S3 tools, and then copied the files over using:
hadoop fs -put <file> s3://ID:SECRET@BUCKET/name_of_dir

There are, of course, other ways to move things to S3 buckets.  Pick one you like.

Now that all of my files are there for accessing, it’s time to set up the hadoop instance.  
This part isn’t included in toto anywhere, so I’ll cover it here in detail.  This assumes you’ve done all of:
  • Set up yourself with an AWS account with EC2 and S3 access (including setting up a properly permissioned id_rsa-gsg-keypair as described here)
  • Created a bucket in S3 and populated it with files
  • Created a mapper.py and reducer.py and tested them with your files
  • Installed the hadoop tools on your local system and configured them as described here
Next, even though every piece of documentation says to do this:
bin/hadoop-ec2 run

That’s a lie.  Try this instead:

bin/hadoop-ec2 launch-cluster <group_name> <number_of_slaves>

This will create a master hadoop node, and your slaves. For number_of_slaves you want to pick something <= 19 so that your total doesn’t exceed 20 (unless you have special privileges).

Now we have to move our snazzy mapper and reducer to the master:

scp $SSH_OPTS /path/to/mapper.py root@$MASTER_HOST:/home
scp $SSH_OPTS /path/to/reducer.py root@$MASTER_HOST:/home

‘run’ apparently used to then log you into your master, but since we’re using launch-cluster, you’ll need to do it yourself:

ssh $SSH_OPTS root@<your_new_master>

And there you are! On your new master. Awesome. Now let’s move the data to our cluster (ID and SECRET are your AWS credentials, BUCKET is the bucket you created):

cd /usr/local/hadoop-<version>
bin/hadoop fs -mkdir files
bin/hadoop distcp s3://<ID>:<SECRET>@<BUCKET>/path/to/files files

Ok, great. Almost there. Now we need to run the thing:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.18.0-streaming.jar -mapper mapper.py -file /home/mapper.py -reducer reducer.py -file /home/reducer.py -input files/* -output map-reduce.output

While it’s running, you can check out the neat web report hadoop creates at http://:50030.  Go ahead, check it out.  It’s totally cool.

aws hadoop python

Dialogue & Discussion