Posts Tagged ‘aws’

Accessing Amazon’s Product Advertising API with Python

Posted in Uncategorized on February 10th, 2010 by admin – 2 Comments

I’m working on a little hacky toy for our upcoming hack day at work, and one of the pieces requires that I get some info from the Amazon “REST” API. It’s not REST in the least, but that’s not what I’m here to talk about. I’m here to talk about creating a signed request which works with the “Amazon Signature 2″ format needed for the API. There’s a Perl library to do everything with the API, which is great if that’s what you’re looking for, but it wasn’t. I just wanted to sign a simple static request to get information about books for which I had the ISBN. No Python library exists, but fortunately if you’re using Python 2.5+ you’ve got everything you need with very little code. I’m including it here to save you the effort of pulling together the pieces yourself.

# Create a signed request to use Amazon's API
import base64
import hmac
import urllib, urlparse
import time

from hashlib import sha256 as sha256

AWS_ACCESS_KEY_ID = 'XXX'
AWS_SECRET_ACCESS_KEY = 'YYY'
hmac = hmac.new(AWS_SECRET_ACCESS_KEY, digestmod=sha256)

def getSignedUrl(params):
    action = 'GET'
    server = "webservices.amazon.com"
    path = "/onca/xml"

    params['Version'] = '2009-11-02'
    params['AWSAccessKeyId'] = AWS_ACCESS_KEY_ID
    params['Service'] = 'AWSECommerceService'
    params['Timestamp'] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())

    # Now sort by keys and make the param string
    key_values = [(urllib.quote(k), urllib.quote(v)) for k,v in params.items()]
    key_values.sort()

    # Combine key value pairs into a string.
    paramstring = '&'.join(['%s=%s' % (k, v) for k, v in key_values])
    urlstring = "http://" + server + path + "?" + \
        ('&'.join(['%s=%s' % (k, v) for k, v in key_values]))

    # Add the method and path (always the same, how RESTy!) and get it ready to sign
    hmac.update(action + "\n" + server + "\n" + path + "\n" + paramstring)

    # Sign it up and make the url string
    urlstring = urlstring + "&Signature="+\
        urllib.quote(base64.encodestring(hmac.digest()).strip())

    return urlstring

if __name__ == "__main__":
    params = {'ResponseGroup':'Small,BrowseNodes,Reviews,EditorialReview,AlternateVersions',
                     'AssociateTag':'xxx-20',
                     'Operation':'ItemLookup',
                     'SearchIndex':'Books', 
                     'IdType':'ISBN',
                     'ItemId':'9780385086950'}
    url = getSignedUrl(params)
    print url

Hadoop and AWS and Python, Oh My!

Posted in Uncategorized on October 30th, 2008 by synedra – 1 Comment

For an upcoming project at work, I needed to get a better idea of how the AWS services work together, and wanted to also see how the EC2 instances could be used for parallel processing.  Sadly, I do not love Java, and although I would use it if pressed, I wanted to see if I could find a pythony way to process some data using a hadoop setup.

So, based on this page, I created a mapper and reducer in python. The mapper looks through a file and spits out lines for each match it finds.  The reducer takes the stdout from that process (using hadoop streaming) and does the thinking, then spits out the result. The examples on that page are a fine place to start for this piece.  And you can time the process on your system here to get an idea of the speedup using the hadoop setup.
Next, I needed to get the files over to S3 so I can access them from my EC2 instance.  S3 instances are persistent, and transfers between S3 and EC2 are free, so I can run my processes an infinite number of times without incurring new costs for grabbing the files.  First, I created a bucket using the Python S3 tools, and then copied the files over using:
hadoop fs -put <file> s3://ID:SECRET@BUCKET/name_of_dir

There are, of course, other ways to move things to S3 buckets.  Pick one you like.

Now that all of my files are there for accessing, it’s time to set up the hadoop instance.  
This part isn’t included in toto anywhere, so I’ll cover it here in detail.  This assumes you’ve done all of:
  • Set up yourself with an AWS account with EC2 and S3 access (including setting up a properly permissioned id_rsa-gsg-keypair as described here)
  • Created a bucket in S3 and populated it with files
  • Created a mapper.py and reducer.py and tested them with your files
  • Installed the hadoop tools on your local system and configured them as described here
Next, even though every piece of documentation says to do this:
bin/hadoop-ec2 run

That’s a lie.  Try this instead:

bin/hadoop-ec2 launch-cluster <group_name> <number_of_slaves>

This will create a master hadoop node, and your slaves. For number_of_slaves you want to pick something <= 19 so that your total doesn’t exceed 20 (unless you have special privileges).

Now we have to move our snazzy mapper and reducer to the master:

bin/hadoop-ec2-env.sh
scp $SSH_OPTS /path/to/mapper.py root@$MASTER_HOST:/home
scp $SSH_OPTS /path/to/reducer.py root@$MASTER_HOST:/home

‘run’ apparently used to then log you into your master, but since we’re using launch-cluster, you’ll need to do it yourself:

ssh $SSH_OPTS root@<your_new_master>

And there you are! On your new master. Awesome. Now let’s move the data to our cluster (ID and SECRET are your AWS credentials, BUCKET is the bucket you created):

cd /usr/local/hadoop-<version>
bin/hadoop fs -mkdir files
bin/hadoop distcp s3://<ID>:<SECRET>@<BUCKET>/path/to/files files

Ok, great. Almost there. Now we need to run the thing:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.18.0-streaming.jar -mapper mapper.py -file /home/mapper.py -reducer reducer.py -file /home/reducer.py -input files/* -output map-reduce.output

While it’s running, you can check out the neat web report hadoop creates at http://<server_name>:50030.  Go ahead, check it out.  It’s totally cool.