
Hadoop and AWS and Python, Oh My!

Posted by synedra on Oct 30, 2008 in Uncategorized

For an upcoming project at work, I needed to get a better idea of how the AWS services work together, and wanted to also see how the EC2 instances could be used for parallel processing.  Sadly, I do not love Java, and although I would use it if pressed, I wanted to see if I could find a pythony way to process some data using a hadoop setup.

So, based on this page, I created a mapper and reducer in python. The mapper looks through a file and spits out lines for each match it finds.  The reducer takes the stdout from that process (using hadoop streaming) and does the thinking, then spits out the result. The examples on that page are a fine place to start for this piece.  And you can time the process on your system here to get an idea of the speedup using the hadoop setup.
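
For a rough idea of what such a pair can look like, here is a minimal sketch. It is not the code from my project: the search pattern, the function names, and the map/reduce command-line switch are all made up for illustration, and it assumes Python 3. The mapper prints one tab-separated line per match, and the reducer adds up the lines for each key, relying on the input being sorted by key (hadoop streaming sorts between the two steps).

```python
#!/usr/bin/env python3
"""Illustrative streaming-style mapper and reducer (a sketch, not the original scripts)."""
import re
import sys
from itertools import groupby

# Hypothetical pattern: stands in for whatever you are actually searching for.
PATTERN = re.compile(r"\bERROR\b|\bWARN\b")


def mapper(lines, pattern=PATTERN):
    """Emit one 'key<TAB>1' line for every match found in the input lines."""
    for line in lines:
        for match in pattern.finditer(line):
            yield "%s\t1" % match.group(0)


def reducer(lines):
    """Sum the counts for each key; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "%s\t%d" % (key, sum(int(count) for _, count in group))


def run_locally(lines):
    """Mimic the streaming pipeline on one machine: map, sort by key, reduce."""
    return list(reducer(sorted(mapper(lines))))


# Only touch stdin when explicitly asked to act as the mapper or the reducer.
if len(sys.argv) > 1 and sys.argv[1] in ("map", "reduce"):
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

Saved as a single file (say, sketch.py), this can be tried without hadoop as `cat somefile | python3 sketch.py map | sort | python3 sketch.py reduce`. For the real job you would split the two halves into mapper.py and reducer.py.
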
Next, I needed to get the files over to S3 so I could access them from my EC2 instances.  S3 storage is persistent, and transfers between S3 and EC2 are free, so I can run my processes as many times as I like without incurring new costs for grabbing the files.  First, I created a bucket using the Python S3 tools, and then copied the files over using:
hadoop fs -put <file> s3://ID:SECRET@BUCKET/name_of_dir

There are, of course, other ways to move things to S3 buckets.  Pick one you like.

Now that all of my files are sitting in S3, it’s time to set up the hadoop cluster.
This part isn’t covered in full anywhere, so I’ll go through it here in detail.  This assumes you’ve done all of the following:
  • Set up yourself with an AWS account with EC2 and S3 access (including setting up a properly permissioned id_rsa-gsg-keypair as described here)
  • Created a bucket in S3 and populated it with files
  • Created a mapper.py and reducer.py and tested them with your files
  • Installed the hadoop tools on your local system and configured them as described here
Next, even though every piece of documentation says to do this:
bin/hadoop-ec2 run

That’s a lie.  Try this instead:

bin/hadoop-ec2 launch-cluster <group_name> <number_of_slaves>

This will create a master hadoop node and your slaves. For number_of_slaves you want to pick something <= 19, so that the total (slaves plus the master) doesn’t exceed EC2’s default limit of 20 instances (unless you’ve had your limit raised).

Now we have to move our snazzy mapper and reducer to the master:

# source the env file (note the leading dot) so SSH_OPTS and MASTER_HOST are set in this shell
. bin/hadoop-ec2-env.sh
scp $SSH_OPTS /path/to/mapper.py root@$MASTER_HOST:/home
scp $SSH_OPTS /path/to/reducer.py root@$MASTER_HOST:/home

‘run’ apparently used to then log you into your master, but since we’re using launch-cluster, you’ll need to do it yourself:

ssh $SSH_OPTS root@<your_new_master>

And there you are! On your new master. Awesome. Now let’s move the data to our cluster (ID and SECRET are your AWS credentials, BUCKET is the bucket you created):

cd /usr/local/hadoop-<version>
bin/hadoop fs -mkdir files
bin/hadoop distcp s3://<ID>:<SECRET>@<BUCKET>/path/to/files files

Ok, great. Almost there. Now we need to run the thing:

bin/hadoop jar contrib/streaming/hadoop-0.18.0-streaming.jar -mapper mapper.py -file /home/mapper.py -reducer reducer.py -file /home/reducer.py -input files/* -output map-reduce.output
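
If you end up running the job more than once, it can be handy to assemble that command from Python. This is a minimal sketch, assuming the same jar path and options as the command above; `streaming_command` and `run_job` are hypothetical helper names of my own, not part of hadoop.

```python
import os
import subprocess


def streaming_command(streaming_jar, mapper, reducer, input_path, output_path):
    """Build the argv for a streaming job shaped like the command above."""
    return [
        "bin/hadoop", "jar", streaming_jar,
        # -mapper/-reducer refer to each script by its bare file name...
        "-mapper", os.path.basename(mapper),
        # ...while -file gives the local path of the script to ship with the job.
        "-file", mapper,
        "-reducer", os.path.basename(reducer),
        "-file", reducer,
        "-input", input_path,
        "-output", output_path,
    ]


def run_job(argv, hadoop_home):
    """Run the command from the hadoop install directory; raises if it fails."""
    subprocess.check_call(argv, cwd=hadoop_home)


if __name__ == "__main__":
    # Just print the command; call run_job(...) yourself on the master.
    print(" ".join(streaming_command(
        "contrib/streaming/hadoop-0.18.0-streaming.jar",
        "/home/mapper.py", "/home/reducer.py",
        "files/*", "map-reduce.output")))
```

Because the arguments are passed as a list rather than through a shell, the `files/*` pattern reaches hadoop untouched instead of being expanded locally.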

While it’s running, you can check out the neat web report hadoop creates at http://<server_name>:50030.  Go ahead, check it out.  It’s totally cool.

 

Hacking on Freebase

Posted by synedra on Oct 23, 2008 in Uncategorized

Today, fighting a cold, I’m hacking in a totally unattractive way.  I’d love to redeem my hacking status by heading up to the Freebase Hack Day in a couple of weeks, but unfortunately that’s the week I’ll be visiting my corporate masters in Southern California.  Don’t worry, I already sent in my absentee vote.  Democracy is still safe in our great nation.

For anyone else who has any ability to go spend a day, or even a few hours, at the Metaweb offices in San Francisco, I highly recommend going.  Freebase has always been a cool platform, but it’s suffered somewhat from being overwhelming and difficult to approach.  The team is about to roll out Acre, a new application development platform, which makes it easy to create new applications and new ways to play with the content.  I hope to have time to play with Acre soon myself, because my brief introduction to it was really intriguing.  One of the things I love best about it is that it allows you to discover new things about the content you already know.  Connections between things, similarities (and lack thereof!).  Check out the video to get a better idea of how it works… 
The data in Freebase is growing continually, and is already far too big to really understand as a whole.  Creating windows into this data, or different ways to understand it, is one of those cases where it’s almost more fun to create the solution than it is to come up with the problem.  So take this great opportunity to get yourself some interesting company, a t-shirt, free food, and time to play with some excellent toys.

 

Shameful Kindle Love

Posted by synedra on Oct 15, 2008 in Uncategorized

Hi. My name is Kirsten, and I’m a gadgetaholic. My drug of choice generally comes from that shady company just over the hill from me in Cupertino. I have had 5 (or is it 6?) iPods, not counting two iPhones. I’ve had umpteen Mac laptops, a mini, a time machine… pretty much, they make it, I buy it. It’s fair to say, then, that my design sense has a bent toward a certain aesthetic. I appreciate gadgets which try to magically know what I want. I’m willing to pay extra to avoid Microsoftian clumsiness in my day. So how, then, could I possibly like the Kindle?

The Kindle is a gadget unlike other ‘e-readers’. When I asked to get one for my birthday last year, I wanted it because (erudite person that I am) I thought I would enjoy being able to read periodicals on this device, that it would increase the percentage of “grown-up” reading I did as a result – and that I wouldn’t enjoy reading actual books on it at all. A second to redraw the page? I couldn’t imagine relaxing with a book that had that kind of lag.

Then I got the Kindle, and I spent some time getting to know it. Yes, it’s ugly and clumsy, in a warty orc kind of way. Tossing out the silly black tote-around cover helped a lot, but it stubbornly retained its “I’m a prototype” flung-together aesthetic. Other than the forward/back buttons, the interface is really pretty primitive. Everyone who picks it up accidentally changes the page forward or back… in fact, you can accidentally move forward in your book 20 pages just by tucking it under your arm to fish for something in your purse. The keyboard is clunky and not very responsive, and nobody ever gets the scroll wheel until you point it out to them.

But I confess I love the little guy. Once you figure out what that little scrolly wheel can do, it’s pretty keen, and the shiny indicator is really cute. The ability to get books wherever you are, without tethering the Kindle (and you) to your computer, is amazingly useful.  While I love reading pretty much everywhere, books tend to present you with annoying ergonomic challenges. Holding the book open to the right page with one hand is something we’ve all mastered, and we’ve all found little tricks to help us with that part (edge of the dinner plate, anyone?). Turning the page, however, always requires some amount of negotiation with the book. Whatever else you’re trying to do while you read, every minute or so you need to recruit your other hand to help with the turning of the page, and that’s really a lot more annoying than you realize. I only know this now because the Kindle removes that annoyance. I settle into a comfortable position for reading and turning the page is a no-op (other than the delay, which turns out to be not that big a deal).

As far as reading periodicals and being a well-informed intellectual, I’m afraid that I haven’t actually become a more erudite person. While the magic updates are great, the periodicals themselves are more difficult to scan in Kindle form than they are in paper-printed form. I can’t glance at the paper and know all of the stuff from the front page. I don’t want to plod through three different pages of headlines to know what’s going on in the world, particularly not these days. 10 seconds is pretty much all I can stand to devote to absorbing our current situation before I’m ready to dive deeper or head to greener subjects.

Still, I’m not completely blinded by my love for this little guy. I can’t borrow books on the Kindle and then return them. I’d like to be able to do that – with very few exceptions, I don’t actually *need* to keep access to books forever and ever once I’ve read them. A subscription service would be swell, sure, but I’d really just prefer that Amazon adopt the iTunes rental model and rent me a book for 2 weeks for 20% of the price of the book. Or let me trade my Kindle books with other people. But even with these drawbacks, the Kindle is one of my favorite gadgets.

 

Projecting and Reacting

Posted by synedra on Oct 11, 2008 in Uncategorized
The human condition is such that it’s very difficult for us to avoid projecting our emotions on other people, using our stress to paint their words into something that amplifies what we’re feeling.  It’s an unfair tendency, and extremely difficult to avoid.  The stronger the emotion, the more powerful the projection, and the more likely we are to put ourselves (and those other people) into difficult positions.

For the last several months, I’ve been working incredibly hard on a project.  The project has had its ups and downs, and recently had a fairly serious crisis.  I was indirectly responsible for this crisis, and my boss spent a lot of time talking to me about how it had affected our credibility, and what we should do going forward.  While he was trying to encourage me to take a saner stance going forward (make a reasonable schedule and keep it), I was feeling guilty about my culpability in the problem, and so I translated his words as “You’re in the hole and you have to dig yourself out!”  So I made an impossible schedule (and hit my targets!) but was totally stressed, which distressed him greatly.  We finally talked about it and I realized that I had been projecting my fears/stress/guilt onto him – a habit I try to avoid, but one which is so easy to slide into.  In this case, our discussion helped a great deal.  The schedule is well in hand, the customers are thrilled, but more importantly I am once again certain that my position is secure and I can take the time to do my job well.

It’s not just engineery girls who find themselves in this position, though.  McCain’s supporters, right now, are angry and afraid – and the McCain campaign has allowed many of those supporters to rev themselves into a frenzy.  I’ll note that McCain has never called Obama a terrorist, but these frightened supporters are projecting their fear/anger onto his words and blowing them way out of proportion.  I’m pleased to hear that McCain has started talking those people down.  We’re already in a situation where people are terrified about the economy and our country – situations less dire than this have caused violent outbreaks in the past.  Adding terrorist fear to this climate, and attaching that to your opponent is not just bad manners – it’s dangerous and stupid. Unfortunately, when the projection is happening on such a massive scale, it’s much harder to sit down and clarify your position.  I think McCain has started to see this tactic spinning out of control, and I hope that he’s strong enough to rein in his supporters.   

Copyright © 2010 Princess Polymath. All rights reserved.