© 1998 - 2014 Keith Wiley


How to Deploy Hadoop 2 (Yarn) on EC2

In early 2013 I needed to deploy Hadoop 2.0 (aka YARN) on AWS. As I searched the web for walkthroughs of the procedure, I was disappointed to discover that every such article was woefully out of date, referring to much older versions of Hadoop, or depending on Whirr and other auto-deployment tools which themselves relied on older versions of Hadoop. Thus, I was forced to work the process out from scratch for myself. I hope this article is helpful to anyone who wishes to replicate the process.



Procedure for Deploying Hadoop 2.0 Yarn on EC2

Note: This document describes a non-EMR, non-Whirr, non-Hue, non-Cloudera-Manager method (in effect, a "direct" deployment).

  1. Log into AWS EC2 and create three instances. One will be a master and two will be slaves.
  2. Wait for all three instances to fully load, then note their respective IP addresses. For this example, I will refer to the master as X.X.X.87 and the slaves as X.X.X.245 and X.X.X.63 (obviously, you will have to substitute your own full IP addresses throughout your reading of this document).
  3. On the left side of the EC2 console, click "Elastic IPs". Find an available Elastic IP and assign it to the master node's instance. This is necessary in order to get internet access to download various packages, most importantly Hadoop itself (and perhaps Hive as well). In addition, I also like to install emacs, because, well, duh.
  4. ssh to an EC2 jumpbox (our EC2 access strategy consists of a single exposed "jumpbox" from which we then ssh to all other instances, thereby consolidating access control and simplifying security; adapt to your circumstances as needed).
  5. Copy the keypair .pem file to all three machines (from the jumpbox):
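A sketch of this step from the jumpbox — the key filename mykey.pem and the ec2-user login are assumptions; substitute your own key file and your AMI's default user:

```shell
# Copy the keypair to each node so they can reach one another later.
for ip in X.X.X.87 X.X.X.245 X.X.X.63; do
    scp -i ~/mykey.pem ~/mykey.pem ec2-user@$ip:~/.ssh/
done
```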
  6. Log out of the jumpbox and back in to trigger the new aliases (or "source .bashrc"). Then login to the master and two slaves in three separate shells, so you can set things up easily.
  7. Put the following in ~/.bashrc on all three nodes (obviously, the aliases and the prompt (PS1) are just my personal taste; ignore as you see fit):
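Here is a minimal sketch of the .bashrc additions. The Hadoop version and install path are assumptions — match them to wherever you untar Hadoop in the download step below; the alias and PS1 lines are just placeholders for whatever you prefer:

```shell
# Hadoop environment (version/path assumed; adjust to your install)
export HADOOP_HOME=~/hadoop-2.2.0
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

# Personal taste, not requirements
alias ll='ls -lh'
export PS1='\u@\h:\w$ '
```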
  8. Set up all three nodes so they can smoothly ssh to each other (with no password verification):
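One way to do this (a sketch, assuming the default id_rsa key location): generate a passphrase-less key on the master, authorize it locally, then append the public key to each slave's authorized_keys, and repeat from each node as needed.

```shell
# On the master: create a key with no passphrase (skip if one exists).
mkdir -p ~/.ssh && chmod 700 ~/.ssh
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa -q

# Authorize the key locally...
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

# ...and on each slave (example IPs from this article):
# cat ~/.ssh/id_rsa.pub | ssh X.X.X.245 'cat >> ~/.ssh/authorized_keys'
# cat ~/.ssh/id_rsa.pub | ssh X.X.X.63  'cat >> ~/.ssh/authorized_keys'
```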
  9. I like emacs, so "sudo yum install emacs" on the master node. Since there is only one elastic ip, it isn't worth the trouble to try to install emacs on the slave nodes. Sigh.
  10. Download Hadoop:
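On the master (the node with the Elastic IP, and hence internet access). The 2.2.0 release in the URL is an assumption — pick whatever stable 2.x release is current:

```shell
cd ~
wget http://archive.apache.org/dist/hadoop/common/hadoop-2.2.0/hadoop-2.2.0.tar.gz
tar xzf hadoop-2.2.0.tar.gz

# The slaves have no internet access, so push the untarred tree to them:
# scp -r ~/hadoop-2.2.0 X.X.X.245:~/
# scp -r ~/hadoop-2.2.0 X.X.X.63:~/
```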
  11. Verify that $HADOOP_HOME is correct by searching for hadoop with the which shell command ($HADOOP_HOME was defined in .bashrc, shown above):
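Something like the following, assuming the .bashrc shown earlier:

```shell
which hadoop
# Expect $HADOOP_HOME/bin/hadoop; this checks the two agree:
test "$(which hadoop)" = "$HADOOP_HOME/bin/hadoop" && echo OK
hadoop version
```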
  12. Create Hadoop's temporary directory on all three machines:
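Run on each of the three nodes. The path is an assumption — whatever you choose, it must match the hadoop.tmp.dir property in core-site.xml below:

```shell
mkdir -p /tmp/hadoop-tmp
```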
  13. Configure Hadoop. On the master node:
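A minimal configuration sketch, written into $HADOOP_CONF_DIR on the master. The property names match Hadoop 2.2; the port, the tmp dir, and the example IPs are assumptions to adapt:

```shell
cd $HADOOP_CONF_DIR

cat > core-site.xml <<'EOF'
<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://X.X.X.87:9000</value></property>
  <property><name>hadoop.tmp.dir</name><value>/tmp/hadoop-tmp</value></property>
</configuration>
EOF

cat > hdfs-site.xml <<'EOF'
<configuration>
  <property><name>dfs.replication</name><value>2</value></property>
</configuration>
EOF

cat > mapred-site.xml <<'EOF'
<configuration>
  <property><name>mapreduce.framework.name</name><value>yarn</value></property>
</configuration>
EOF

cat > yarn-site.xml <<'EOF'
<configuration>
  <property><name>yarn.resourcemanager.hostname</name><value>X.X.X.87</value></property>
  <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
</configuration>
EOF

# The slaves file tells start-dfs.sh/start-yarn.sh where the workers live:
cat > slaves <<'EOF'
X.X.X.245
X.X.X.63
EOF

# Push the whole conf dir to both slaves, e.g.:
# scp $HADOOP_CONF_DIR/* X.X.X.245:$HADOOP_CONF_DIR/
```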
  14. Initialize HDFS (format the namenode). On the master node:
  15. Start the various Hadoop daemons. On the master node:
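Steps 14 and 15 together look roughly like this. Note that formatting is one-time only — re-formatting later would wipe the HDFS metadata:

```shell
# Format HDFS (once, ever):
hdfs namenode -format

# start-dfs.sh starts the NameNode here and DataNodes on the hosts listed
# in the slaves file; start-yarn.sh starts the ResourceManager and the
# NodeManagers likewise.
start-dfs.sh
start-yarn.sh

# jps should show NameNode/ResourceManager here, DataNode/NodeManager on slaves.
jps
```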
  16. Look at the new empty HDFS. On the master node:
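For example:

```shell
hdfs dfs -ls /
hdfs dfsadmin -report   # confirms both DataNodes have registered
```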
  17. Run a simple MapReduce example. On the master node:
  18. Run another simple MapReduce example:
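Steps 17 and 18 can be sketched as follows. The examples jar ships with Hadoop; the version in its filename is an assumption — match it to your download, as is the ec2-user HDFS home directory:

```shell
# Example 1: estimate pi with 10 map tasks of 100 samples each.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 10 100

# Example 2: wordcount over a small input (the config XMLs make handy test data).
hdfs dfs -mkdir -p /user/ec2-user/input
hdfs dfs -put $HADOOP_CONF_DIR/*.xml /user/ec2-user/input
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /user/ec2-user/input /user/ec2-user/output
hdfs dfs -cat /user/ec2-user/output/part-r-00000 | head
```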
  19. Stop the daemons in the reverse order they were started. On the master node:
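That is:

```shell
# Reverse of the startup order:
stop-yarn.sh
stop-dfs.sh
```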
  20. At this point, the EC2 instances for the cluster can either be "stopped" or "terminated". If they are merely stopped, they can be restarted at a later time, as described below.
  21. You're done. Congratulations. Get a drink.
  22. There are benefits (financial, to say the least) in shutting the cluster down when it isn't needed. Stopping and restarting the cluster is actually quite straightforward:
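A sketch of the restart, under the configuration shown above. Keep in mind that stopped instances usually come back with new IP addresses, so the config must be re-pointed first — and do not re-format the namenode, or you will lose HDFS:

```shell
# 1. After restarting the instances, update core-site.xml, yarn-site.xml,
#    and the slaves file on the master with the new IPs, and push the conf
#    dir to both slaves as before.
# 2. Then simply restart the daemons (no format step this time!):
start-dfs.sh
start-yarn.sh
hdfs dfsadmin -report   # verify the DataNodes re-registered
```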