How to Deploy Hadoop 2 (Yarn) on EC2

In early 2013 I needed to deploy Hadoop 2.0 (aka Yarn) on AWS. As I searched the web for walkthroughs of the procedure, I was disappointed to discover that every such article was woefully out-of-date, referring to much older versions of Hadoop...or depending on Whirr and other auto-deployment tools which themselves relied on older versions of Hadoop. Thus, I was forced to work out the process from scratch for myself. I hope this article is helpful to anyone who wishes to replicate the process.

Cheers!

20130417

Procedure for Deploying Hadoop 2.0 Yarn on EC2

Note: This document describes a non-EMR, non-Whirr, non-Hue, non-Cloudera-Manager method (in effect, a "direct" deployment)

  1. Log into AWS EC2 and create three instances. One will be a master and two will be slaves:
    • Name them "Hadoop Master (small)", "Hadoop Slave01 (small)", and "Hadoop Slave02 (small)".
    • Specify a keypair. I will assume it is called HadoopKeyPair.
    • Alter the default instance configuration:
      • Instance Details: Type: "small" seems to work for simple MapReduce jobs; I haven't tested "micro".
      • Instance Details: Launch into a VPC: Pick a subnet of your choosing.
      • Security Settings: Select Existing Security Groups: I use a pair of security groups named Test-Hadoop-master and Test-Hadoop-slave. However, these two security groups are actually identical so there is no need for two separate groups. You could just use the same one for all of the instances. Here is how they are configured, although I believe they can be optimized, as explained below:
        • 22 tcp 0.0.0.0/0
        • 8000-8100 tcp 0.0.0.0/0
        • 50000-50100 tcp 0.0.0.0/0
        • -1 icmp [ARBITRARY]
        • 0-65535 tcp [ARBITRARY]
        • 0-65535 udp [ARBITRARY]
        • -1 icmp [ARBITRARY]
        • 0-65535 tcp [ARBITRARY]
        • 0-65535 udp [ARBITRARY]
      • As stated above, since I've configured the two existing security groups identically, there is no reason to have two separate groups in the first place.
      • Note that the rules encompassing port ranges 8000-8100 and 50000-50100 are overly general. Hadoop uses a few specific ports in those ranges. With some tedium (using the ns alias provided below) these rules could be specified more precisely.
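      • For reference, the same sort of rules can also be created from the command line with the AWS CLI rather than by clicking through the console. This is only a sketch (not what I actually did), and sg-xxxxxxxx is a placeholder for your security group id:
        • aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx --protocol tcp --port 22 --cidr 0.0.0.0/0
        • aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx --protocol tcp --port 8000-8100 --cidr 0.0.0.0/0
        • aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx --protocol tcp --port 50000-50100 --cidr 0.0.0.0/0
        • ...and similarly for the remaining rules.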
  2. Wait for all three instances to fully load. Then note their respective ip addresses. For this example, I will refer to the master as X.X.X.87 and the slaves as X.X.X.245 and X.X.X.63 (obviously, you will have to substitute the full ip address throughout your reading of this document).
  3. On the left side of the EC2 console, click "Elastic IPs". Find an available elastic ip and assign it to the master node's instance. This is necessary in order to get internet access to download various packages, most importantly Hadoop itself (and perhaps Hive as well). In addition, I also like to install emacs, because, well, duh.
  4. ssh to an EC2 jumpbox (our EC2 access strategy consists of a single exposed "jumpbox" from which we then ssh to all other instances, thereby consolidating access control and simplifying security; adapt to your circumstances as needed).
    • Add an entry to /etc/hosts for the master node: e.g. "X.X.X.87 T-HADOOP-MASTER-SMALL".
    • Add an entry to .ssh/config:
      • Host T-HADOOP-MASTER-SMALL
      • StrictHostKeyChecking no
      • User ec2-user
      • IdentityFile ~/.ssh/HadoopKeyPair.pem
      • GSSAPIAuthentication no
    • If, by chance, you didn't specify the HadoopKeyPair keypair when creating the instances, you will have to make corresponding changes to the entry above.
    • You will also need a way to log into the slave nodes. You can either go to the trouble of putting entries in /etc/hosts and .ssh/config as shown above for the master node, or you can simply create direct ssh aliases in .bashrc, such as:
      • alias hd_master_small='ssh T-HADOOP-MASTER-SMALL'
      • alias hd_slave01_small='ssh -i .ssh/HadoopKeyPair.pem ec2-user@X.X.X.245'
      • alias hd_slave02_small='ssh -i .ssh/HadoopKeyPair.pem ec2-user@X.X.X.63'
  5. Copy the keypair .pem file to all three machines (from the jumpbox):
    • scp -i .ssh/HadoopKeyPair.pem .ssh/HadoopKeyPair.pem T-HADOOP-MASTER-SMALL:.ssh
    • scp -i .ssh/HadoopKeyPair.pem .ssh/HadoopKeyPair.pem ec2-user@X.X.X.245:.ssh
    • scp -i .ssh/HadoopKeyPair.pem .ssh/HadoopKeyPair.pem ec2-user@X.X.X.63:.ssh
  6. Log out of the jumpbox and back in to trigger the new aliases (or "source .bashrc"). Then log in to the master and the two slaves in three separate shells so you can set things up easily.
  7. Put the following in ~/.bashrc on all three nodes (obviously, the aliases and the prompt (PS1) are just my personal taste; ignore as you see fit):
    • export HADOOP_HOME='/home/ec2-user/hadoop-2.0.0-cdh4.1.3'
    • alias ls='ls -sFal --color'
    • alias hd='hadoop'
    • alias jvps='ps aux | grep java'
    • alias ns='netstat -a -t --numeric-ports -p'
    • export PS1="\[\e[35;1m\]\u@\h \[\e[33;1m\]\w/ \[\e[32;1m\]$\[\e[0m\] "
    • export PATH=$HADOOP_HOME/bin:${PATH}
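    • One way to avoid editing ~/.bashrc on all three nodes by hand is to edit it on the master and push it to the slaves. The following is just a sketch; passwordless ssh is not set up until the next step, hence the explicit -i, and you must substitute your actual slave ip addresses:
      • scp -i ~/.ssh/HadoopKeyPair.pem ~/.bashrc ec2-user@X.X.X.245:
      • scp -i ~/.ssh/HadoopKeyPair.pem ~/.bashrc ec2-user@X.X.X.63: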
  8. Set up all three nodes so they can smoothly ssh to each other (with no password verification):
    • On all three nodes, add the following to /etc/hosts (there is no need to modify the default single-line entry, just add the following at the end of the file):
      • X.X.X.87 ip-X-X-X-87 master
      • X.X.X.245 ip-X-X-X-245 slave01
      • X.X.X.63 ip-X-X-X-63 slave02
    • On the master node:
      • cd .ssh
      • Create a passwordless rsa keypair: ssh-keygen -t rsa -P ''
        • (that's two single quotes at the end, specifying an empty password string)
      • Accept the default file location (in .ssh) and filename (id_rsa).
      • Copy the contents of id_rsa.pub and paste it at the end of authorized_keys three times. For the second and third pastes, alter the ip address in the trailing comment to match the two slaves. authorized_keys will then look something like this:
        • ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC64oVfh9bbX+CBmQfUV8tzbHbzmi4+CTHcpcicsURKl6joT9ZmZXk7SjzB9nVTMGzI8vvJqtIiwGXIXQLCeSL1q0JBwZBunZnq0IG99BX0RlD1KWl8yWJLufSFj7IdfTApqa6Y9VZfBwAIpo6sW21p8mnQZFb97liBFRd+Si67KoQNs6OXDqsaJk9tK2tMVayJbUhvbUfeu21+YUBU/qK1t21Y3kfuMFi5Jr6Ozq41VQ8S2VGkVc+VLr00jP0cz8YxMjn2hGrr1tPquA4FGLZbd6y/QsXQTe6ENhS/q1qZRxDrKYeSMABasbRTCfMIJ1muVRuUBeXpvJ+fbc12fa5B HadoopKeyPair
        • ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAt5grkm1DDcbdk1X0Pz1TWVOsvqbAIosUfjQXwpRkAO/YZL3GG8+sWJ9v4OH3N8UWmUz77Q+T/Ql2z+IcHTFJq68RHzbD7z/4dJUD2STE4RzfSqu/QD+zyZXrQ8Orty7r7APlKB2rqeh6zaVX1maqefERUbvkKnybHUUcZgZGsSgL2I89gYmh7SrNMLSbjw7rfhbn1vSk3bpDHxyqjIF4ow2sifWB/BD5ScWgs0JnJguSzKKyy65AvwaVE9jv9bwVsI3otPq/b2fCpR7J+RVImqUj9BuVd+nmzPxmFKHc7Zbt/Mnx4yP9t0Ke+V5PHxUfm1SiSgQC4/8z4ShoxzcNnw== ec2-user@ip-X-X-X-87
        • ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAt5grkm1DDcbdk1X0Pz1TWVOsvqbAIosUfjQXwpRkAO/YZL3GG8+sWJ9v4OH3N8UWmUz77Q+T/Ql2z+IcHTFJq68RHzbD7z/4dJUD2STE4RzfSqu/QD+zyZXrQ8Orty7r7APlKB2rqeh6zaVX1maqefERUbvkKnybHUUcZgZGsSgL2I89gYmh7SrNMLSbjw7rfhbn1vSk3bpDHxyqjIF4ow2sifWB/BD5ScWgs0JnJguSzKKyy65AvwaVE9jv9bwVsI3otPq/b2fCpR7J+RVImqUj9BuVd+nmzPxmFKHc7Zbt/Mnx4yP9t0Ke+V5PHxUfm1SiSgQC4/8z4ShoxzcNnw== ec2-user@ip-X-X-X-245
        • ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAt5grkm1DDcbdk1X0Pz1TWVOsvqbAIosUfjQXwpRkAO/YZL3GG8+sWJ9v4OH3N8UWmUz77Q+T/Ql2z+IcHTFJq68RHzbD7z/4dJUD2STE4RzfSqu/QD+zyZXrQ8Orty7r7APlKB2rqeh6zaVX1maqefERUbvkKnybHUUcZgZGsSgL2I89gYmh7SrNMLSbjw7rfhbn1vSk3bpDHxyqjIF4ow2sifWB/BD5ScWgs0JnJguSzKKyy65AvwaVE9jv9bwVsI3otPq/b2fCpR7J+RVImqUj9BuVd+nmzPxmFKHc7Zbt/Mnx4yP9t0Ke+V5PHxUfm1SiSgQC4/8z4ShoxzcNnw== ec2-user@ip-X-X-X-63
      • Copy authorized_keys and the id_rsa keypair to the slaves:
        • scp -i HadoopKeyPair.pem authorized_keys ec2-user@ip-X-X-X-245:.ssh
        • scp -i HadoopKeyPair.pem authorized_keys ec2-user@ip-X-X-X-63:.ssh
        • scp id_rsa* ec2-user@ip-X-X-X-245:.ssh
        • scp id_rsa* ec2-user@ip-X-X-X-63:.ssh
      • Manually log into each machine so as to generate an entry in known_hosts (respond to the yes/no query for each login):
        • ssh ip-X-X-X-87
        • ssh ip-X-X-X-245
        • ssh ip-X-X-X-63
      • Copy known_hosts to the slaves:
        • scp known_hosts ec2-user@ip-X-X-X-245:.ssh
        • scp known_hosts ec2-user@ip-X-X-X-63:.ssh
    • At this point, the .ssh/ directory of all three machines should be identical and should contain the following:
      • authorized_keys
      • known_hosts
      • id_rsa
      • id_rsa.pub
      • HadoopKeyPair.pem (or some equivalent, as indicated when you created the instances in EC2)
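    • A quick way to confirm that the passwordless setup actually works is to loop over the hostnames from the master (a sketch, using the same names that were added to known_hosts above):
      • for h in ip-X-X-X-87 ip-X-X-X-245 ip-X-X-X-63; do ssh $h hostname; done
      • Each iteration should print the node's hostname without prompting for a password or a yes/no confirmation.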
  9. I like emacs, so "sudo yum install emacs" on the master node. Since there is only one elastic ip, it isn't worth the trouble to try to install emacs on the slave nodes. Sigh.
  10. Download Hadoop:
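    • Only the master has an elastic ip (and thus internet access), so download on the master and then copy to the slaves. Something like the following should work; the exact download URL is an assumption on my part (the hadoop-2.0.0-cdh4.1.3 tarball from Cloudera's CDH4 archive) and may need adjusting:
      • cd ~
      • wget http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.1.3.tar.gz
      • tar xzf hadoop-2.0.0-cdh4.1.3.tar.gz
      • scp hadoop-2.0.0-cdh4.1.3.tar.gz ec2-user@ip-X-X-X-245:
      • scp hadoop-2.0.0-cdh4.1.3.tar.gz ec2-user@ip-X-X-X-63:
      • On each slave, unpack the tarball in the home directory (tar xzf hadoop-2.0.0-cdh4.1.3.tar.gz) so that the resulting directory matches $HADOOP_HOME on every node.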
  11. Verify that $HADOOP_HOME is correct by searching for hadoop with the which shell command ($HADOOP_HOME was defined in .bashrc, shown above):
    • which hadoop
    • In particular, if you use a slightly different version of Hadoop, then you will have to make a corresponding change to the definition of $HADOOP_HOME in .bashrc.
  12. Create Hadoop's temporary directory on all three machines:
    • sudo mkdir /mnt/hadoop
    • sudo chown ec2-user /mnt/hadoop (or whatever user you are logged in as or otherwise intend to use for Hadoop processing)
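    • If you prefer to do this from the master in one shot, a loop over ssh works (a sketch; it assumes the passwordless ssh setup from step 8, and the -t flag is there because sudo may insist on a tty):
      • for h in ip-X-X-X-87 ip-X-X-X-245 ip-X-X-X-63; do ssh -t $h 'sudo mkdir -p /mnt/hadoop && sudo chown ec2-user /mnt/hadoop'; done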
  13. Configure Hadoop. On the master node:
    • cd $HADOOP_HOME/etc/hadoop/
    • The default Mapper/Reducer JVM Heap allocation is too small, so MapReduce jobs will fail. To fix this, edit hadoop-env.sh:
      • Change export HADOOP_CLIENT_OPTS="-Xmx128m $HADOOP_CLIENT_OPTS" to export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"
      • You can obviously set the Java Heap even larger if you like. Take into account the amount of RAM the EC2 instance type provides.
    • Edit the masters file (in etc/hadoop/; you may have to create it) and add ip-X-X-X-87 on a single line.
    • Edit the slaves file (also in etc/hadoop/) and add ip-X-X-X-245 and ip-X-X-X-63 on two separate lines.
    • Edit core-site.xml. In between the <configuration> tags, add the following (note that the string "master" must match the alias in /etc/hosts, described above):
      • <property>
      • <name>fs.default.name</name>
      • <value>hdfs://master:9000</value>
      • <description>
      • The name of the default file system. Either the literal string "local" or a host:port for NDFS.
      • </description>
      • </property>
      • <property>
      • <name>hadoop.tmp.dir</name>
      • <value>/mnt/hadoop</value> <!-- Where to put submitted Job files -->
      • <description>A base for other temporary directories.</description>
      • </property>
    • Edit hdfs-site.xml. In between the <configuration> tags, add:
      • <property>
      • <name>dfs.name.dir</name>
      • <value>${hadoop.tmp.dir}/dfs_hook/dfs_name_dir</value>
      • <final>true</final>
      • </property>
      • <property>
      • <name>dfs.data.dir</name>
      • <value>${hadoop.tmp.dir}/dfs_hook/dfs_data_dir</value>
      • <final>true</final>
      • </property>
      • <property>
      • <name>dfs.replication</name>
      • <value>1</value>
      • </property>
    • Edit yarn-site.xml. In between the <configuration> tags, add:
      • <property>
      • <name>yarn.nodemanager.aux-services</name>
      • <value>mapreduce.shuffle</value>
      • </property>
      • <property>
      • <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
      • <value>org.apache.hadoop.mapred.ShuffleHandler</value>
      • </property>
      • <property>
      • <name>yarn.resourcemanager.resource-tracker.address</name>
      • <value>master:8025</value>
      • </property>
      • <property>
      • <name>yarn.resourcemanager.scheduler.address</name>
      • <value>master:8030</value>
      • </property>
      • <property>
      • <name>yarn.resourcemanager.address</name>
      • <value>master:8040</value>
      • </property>
    • Note that the value of hadoop.tmp.dir in core-site.xml corresponds to the location where we previously created the temp directories.
    • Note how Hadoop configurations can reference other configurations (in hdfs-site.xml, we resolve ${hadoop.tmp.dir} dynamically instead of duplicating its underlying value). Note that in hdfs-site.xml, the replication factor could probably be 2 since our cluster has two slaves. Furthermore, note that in large production deployments, Hadoop convention is to use a replication factor of at least 3.
    • Copy the .xml files to the slaves:
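      • For example (a sketch; it assumes the Hadoop tree was unpacked in the same location on the slaves as on the master):
        • cd $HADOOP_HOME/etc/hadoop/
        • scp *.xml ec2-user@ip-X-X-X-245:hadoop-2.0.0-cdh4.1.3/etc/hadoop/
        • scp *.xml ec2-user@ip-X-X-X-63:hadoop-2.0.0-cdh4.1.3/etc/hadoop/
      • If you also changed hadoop-env.sh, you may want to push that file to the slaves the same way.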
  14. Initialize HDFS (format the namenode). On the master node:
    • cd $HADOOP_HOME
    • ./bin/hdfs namenode -format
    • If you have done this before, it will ask if you want to overwrite the existing HDFS to format a new one. DOING SO WILL ERASE ALL DATA CURRENTLY STORED IN HDFS!
  15. Start the various Hadoop daemons. On the master node:
    • sbin/hadoop-daemon.sh start namenode
      • Use the jvps alias described above to verify that the NameNode process is running.
      • Use the ns alias described above to verify that the NameNode process is listening on ports 9000 and 50070 (9000 was configured in core-site.xml, see above).
      • Investigate the corresponding .log file in $HADOOP_HOME/logs/ to check for errors.
    • sbin/hadoop-daemons.sh start datanode
      • Note that the correct script here is daemons (plural). The singular version runs the command on the local (master) node, while the plural version runs it on the slaves (as listed in etc/hadoop/slaves).
      • On the slave nodes:
        • Use the jvps alias described above to verify that the DataNode process is running.
        • Use the ns alias described above to verify that the DataNode process is listening on ports 50010, 50020, and 50075.
        • ns will also show a direct connection to the master on port 9000 (in the foreign address column). Likewise, ns on the master will now show connections on port 9000 (in the local address column) to the two slaves. The ports at the other end of those connections are unpredictable, but they will match between the master and each slave (i.e., if the master has a connection from local port 9000 to the .63 slave at foreign port 33254, then that slave will have a connection from local port 33254 to the master (.87) at foreign port 9000).
        • Investigate the corresponding .log file in $HADOOP_HOME/logs/ on the slaves to check for errors.
    • sbin/yarn-daemon.sh start resourcemanager
      • Use the jvps alias described above to verify that the ResourceManager process is running.
      • Use the ns alias described above to verify that the ResourceManager is listening on ports 8025, 8030, 8033, 8040, and 8088.
      • Investigate the corresponding .log file in $HADOOP_HOME/logs/ to check for errors.
    • sbin/yarn-daemons.sh start nodemanager
      • Note that the correct script is daemons (plural), as described above for starting the datanode.
      • On the slave nodes:
        • Use the jvps alias described above to verify that the NodeManager process is running.
        • Use the ns alias described above to verify that the NodeManager is listening on ports 8040, 8042, 8080, and an unpredictable high port (e.g. 41365).
        • Investigate the corresponding .log file in $HADOOP_HOME/logs/ to check for errors.
    • sbin/mr-jobhistory-daemon.sh start historyserver
      • Use the jvps alias described above to verify that the JobHistoryServer process is running.
      • Use the ns alias described above to verify that the JobHistoryServer process is listening on port 39889.
      • Investigate the corresponding .log file in $HADOOP_HOME/logs/ to check for errors.
    • After you have started all the daemons, jvps on the master should show the NameNode, ResourceManager, and JobHistoryServer running, and jvps on the slaves should show the DataNode and NodeManager running. Furthermore, it is common to list the master in etc/hadoop/slaves (in addition to etc/hadoop/masters), in effect making it double as both the master and an additional slave. If you do this, then after starting the daemons the master should also show the DataNode and NodeManager running.
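    • Once you are comfortable with the individual commands, a tiny wrapper script saves some typing. The following is just a sketch (I am inventing the name start-cluster.sh); it simply repeats the start commands above in the same order and assumes $HADOOP_HOME is set as in .bashrc:
      • #!/bin/bash
      • # Start the Hadoop/Yarn daemons in the order described above.
      • cd "$HADOOP_HOME"
      • sbin/hadoop-daemon.sh start namenode
      • sbin/hadoop-daemons.sh start datanode
      • sbin/yarn-daemon.sh start resourcemanager
      • sbin/yarn-daemons.sh start nodemanager
      • sbin/mr-jobhistory-daemon.sh start historyserver
      • Save it on the master (e.g. as ~/start-cluster.sh), chmod +x it, and run it from the master; remember to still check jvps and the logs afterwards.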
  16. Look at the new empty HDFS. On the master node:
    • hadoop fs -ls /
    • Note that my aliases, listed above, let me run hadoop by just typing "hd".
    • The command just demonstrated shouldn't show much since a brand new HDFS is basically empty (it might contain a /tmp directory).
  17. Run a simple MapReduce example. On the master node:
    • cd $HADOOP_HOME
    • hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.0.0-cdh4.1.3.jar pi 2 10000
    • If successful, it should produce a result in the shell like:
      • Estimated value of Pi is 3.14280000000000000000
    • This example neither reads any input from, nor writes any output to, HDFS.
  18. Run another simple MapReduce example:
    • The Pi example just shown doesn't even access HDFS. Let's run an example that actually processes data stored on HDFS.
    • Create a new HDFS directory: hadoop fs -mkdir /input
    • Create a test file on the local machine (I'll assume it's called ~/test.txt). Put the following two lines of text in the file:
      • one two two three three
      • three four four four four
    • Put the file on HDFS: hadoop fs -put ~/test.txt /input
    • Verify the file transfer: "hadoop fs -ls /", "hadoop fs -ls /input", and "hadoop fs -cat /input/*"
    • Run the canonical Hadoop word count example:
      • hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.0.0-cdh4.1.3.jar wordcount /input /output
    • Note that if /output already exists on HDFS, the MapReduce job will fail (a cleanup command is sketched after the output listing below).
    • Check to see if /output was created: hadoop fs -ls /
    • Look at the contents of /output: hadoop fs -ls /output
      • /output should contain a file named part-r-00000.
    • Investigate the results of the word count job: hadoop fs -cat /output/p*
    • The output should be the following (note that the order of the rows of output may vary):
      • four 4
      • one 1
      • three 3
      • two 2
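    • To re-run the word count job (recall the note above that it fails if /output already exists), just remove the output directory first, using the same paths as above:
      • hadoop fs -rm -r /output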
  20. Stop the daemons in the reverse of the order in which they were started. On the master node:
    • sbin/mr-jobhistory-daemon.sh stop historyserver
    • sbin/yarn-daemons.sh stop nodemanager
    • sbin/yarn-daemon.sh stop resourcemanager
    • sbin/hadoop-daemons.sh stop datanode
    • sbin/hadoop-daemon.sh stop namenode
    • Use the jvps alias to verify that all processes have stopped on all machines.
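    • To check all three nodes from the master in one go, something like this works (a sketch; the [j]ava trick keeps grep from matching itself):
      • for h in ip-X-X-X-87 ip-X-X-X-245 ip-X-X-X-63; do echo "== $h =="; ssh $h 'ps aux | grep [j]ava'; done
      • No java processes should be reported on any node.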
  20. At this point, the EC2 instances for the cluster can either be "stopped" or "terminated". If they are merely stopped, they can be restarted at a later time, as described below.
  21. You're done. Congratulations. Get a drink.
  22. There are benefits (financial, to say the least) to shutting the cluster down when it isn't needed. Stopping and restarting the cluster is actually quite straightforward:
    • Stop all of the Hadoop daemons. Verify that jvps reports no processes running on any of the nodes.
    • Stop the instances in EC2.
    • At some future time, restart the instances in EC2. Wait for them to fully restart. In particular, it can take a few minutes for their private ip addresses to come back...but when they do they will get the same private ip addresses they had before, and those are the ip addresses that are configured into /etc/hosts, ${HADOOP_HOME}/etc/hadoop/masters, and ${HADOOP_HOME}/etc/hadoop/slaves. Thus, Hadoop will work without further assistance.
    • Log into the master node and start the daemons as described above ( DO NOT FORMAT THE NAMENODE AGAIN, DOING SO WILL ERASE ALL DATA CURRENTLY STORED IN HDFS! ).
      • Log into the slave nodes and use jvps and ns to carefully verify that the cluster is running properly.
    • Test HDFS access. Verify that data previously stored in HDFS is still present and accessible.
    • Test MapReduce functionality (a quick smoke test is sketched at the end of this list).
    • Everything should be in working order.
    • Always remember to stop the daemons before stopping the cluster. I am uncertain of the implications of stopping the cluster in an unclean order...although it should be pretty robust.
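    • For the "Test HDFS access" and "Test MapReduce functionality" items above, a minimal smoke test might look like the following (a sketch; it reuses the /input data and the examples jar from earlier, and assumes /output has been removed or a fresh output directory is used):
      • hadoop fs -ls /
      • hadoop fs -cat /input/test.txt
      • cd $HADOOP_HOME
      • hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.0.0-cdh4.1.3.jar pi 2 10000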