By now I’ve shown you how to install a single node Hadoop cluster. That configures the cluster with HDFS and YARN functionality, but you may have noticed that submitting a MapReduce job doesn’t show anything in the YARN resource manager. If you are trying to understand how MapReduce interacts with YARN, that doesn’t help you…and it breaks the principle we’ve been following: set up a cluster that works like a regular cluster, one that just happens to run on a single node.
This post will show you the steps you need to set up MapReduce support in YARN in your cluster.
MapReduce and YARN
The developers of YARN designed it to be API compatible with the original MapReduce. This means that, in most cases, all you need to do to run an existing MapReduce job on YARN is recompile it against the newer libraries. To get MapReduce and YARN working together, you only need to set up a couple of configuration files.
User and Directory configuration
This guide will create one additional user:
mapred – This user will be used for all MapReduce operations. It is a member of the hadoop group.
- Create the mapred user as a system user with hadoop as its primary group, and also make it a member of the users group.
- Create a directory /usr/share/mapred/logs, and make it owned by the mapred user.
> sudo mkdir /usr/share/mapred
> sudo mkdir /usr/share/mapred/logs
> sudo chown -R mapred /usr/share/mapred
> sudo chgrp -R users /usr/share/mapred
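The user-creation step above isn’t spelled out as a command. On a typical Linux distribution it can be sketched like this (assuming useradd is available; exact flags and the nologin shell path vary by distro):

```shell
# Create mapred as a system account with hadoop as its primary group
# and users as a supplementary group; no login shell is needed
sudo useradd --system -g hadoop -G users -s /usr/sbin/nologin mapred

# Verify the group memberships
id mapred
```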
Environment Variable Setup
Now it’s time to configure the environment variables for the mapred user. Add these lines to the .profile file for the mapred user:
export HADOOP_PREFIX=/usr/share/hadoop
export HADOOP_HOME=$HADOOP_PREFIX
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export HADOOP_MAPRED_HOME=$HADOOP_PREFIX
export HADOOP_YARN_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
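To sanity check that the variables are picked up in the mapred user’s login environment (sudo -i runs a login shell, which reads .profile), you can run something like:

```shell
# Should print the HADOOP_* variables defined in .profile
sudo -i -u mapred env | grep ^HADOOP
```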
mapred-site.xml Configuration
Like most config files, the mapred-site.xml file that comes in the Hadoop distribution is empty. A couple of settings need to be added just to get MapReduce to work in YARN, and a few others should be changed for the single node setup. Use these settings:
<configuration>
  <property>
    <name>mapreduce.job.tracker.address</name>
    <value>HOST_NAME</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.command-opts</name>
    <value>-Xmx768m</value>
  </property>
  <property>
    <name>mapreduce.map.cpu.vcores</name>
    <value>1</value>
    <description>The number of virtual cores required for each map task.</description>
  </property>
  <property>
    <name>mapreduce.reduce.cpu.vcores</name>
    <value>1</value>
    <description>The number of virtual cores required for each reduce task.</description>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>1024</value>
    <description>Larger resource limit for maps.</description>
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx768m</value>
    <description>Heap-size for child jvms of maps.</description>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>1024</value>
    <description>Larger resource limit for reduces.</description>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx768m</value>
    <description>Heap-size for child jvms of reduces.</description>
  </property>
</configuration>
Pay particular attention to the first two settings: mapreduce.job.tracker.address and mapreduce.framework.name. The first setting tells MapReduce where to find the job tracker (replace HOST_NAME with how you are referring to your server…either IP address or DNS name). The second setting is used to configure the MapReduce framework to use YARN.
There are two settings that need to be added to yarn-site.xml to enable MapReduce.
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
With this, you have made all of the changes you need to make to enable regular MapReduce jobs to run inside of YARN.
Start the Job History Server
The configuration files are set up…now it’s time to start up the single service related to MapReduce: the Job History Server. You can use this script to start the Job History Server:
# Start the MR JobHistory daemon
sudo -i -u mapred mr-jobhistory-daemon.sh start historyserver
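If the daemon started cleanly, it should show up in the JVM process list for the mapred user (jps ships with the JDK):

```shell
# JobHistoryServer should appear in the output
sudo -i -u mapred jps
```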
Accessing the Web UIs
As a validation check, you should now be able to see the web UI for the JobHistory server is up. You can do so by accessing this URL:
JobHistory Server UI: http://HOST_NAME:19888
If this is not accessible, make sure the ports are open in your firewall, or look in the logs for errors on startup.
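If you have curl installed, a quick way to check the UI from the command line is to ask for just the HTTP status code (replace HOST_NAME as before):

```shell
# Prints the HTTP status code; 200 means the UI is up
curl -s -o /dev/null -w "%{http_code}\n" http://HOST_NAME:19888/
```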
Now let’s verify your MapReduce settings. You can test MapReduce in the same way that you tested YARN…by running some of the example applications that come with the distribution.
To run a test MapReduce job, run the following command:
> $HADOOP_PREFIX/bin/hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar randomwriter out
After you start this job, you will be able to see its status in the YARN Resource Manager. If you can see this, you have successfully set up MapReduce and YARN!
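You can also watch the job from the command line; yarn application -list shows the applications the Resource Manager is tracking:

```shell
# While randomwriter is running, it should appear in this list
$HADOOP_PREFIX/bin/yarn application -list
```

By default, the YARN Resource Manager web UI itself is served on port 8088, i.e. http://HOST_NAME:8088.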