All of the posts on this site have had the unique spin of showing you how to set up a single node cluster. This does give you a good idea of how a cluster is set up, but in some ways, it’s no substitute for the real thing.
These instructions show you how to set up a basic cluster and make use of the tools included in Hadoop to start and stop services on the cluster. As with other posts, this one focuses on open source Apache, as all of the commercial providers include their own installation instructions.
These instructions have been tested on:
- CentOS Linux (version 6.6 used here)
- Apache Hadoop 2.6.0
This setup will include one name node, one resource manager node (with other services installed on it), and a single data node. Obviously, the steps for setting up the data node can be repeated to include more data nodes.
As mentioned, the OS used for these instructions was CentOS 6.6. The VMs were configured with 2 GB of memory each.
As with the single node setup, decide how you will refer to your server: DNS name or IP address? If you are going to use the DNS name, make sure that the name resolves from the local host by using the ping command.
Next, make sure that all servers can resolve their names or IP addresses. The easiest way to do this in a test setup is to add entries to /etc/hosts for all servers.
The following names will be used in these instructions:
- NN_HOST – The host for the name node
- RM_HOST – The host for the resource manager
- DN1_HOST – The host for the data node
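For example, the /etc/hosts entries on every host might look something like this (the addresses below are placeholders; substitute your own):

192.168.1.101   NN_HOST
192.168.1.102   RM_HOST
192.168.1.103   DN1_HOST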
This guide will create one group and three users for the Hadoop cluster:
- hadoop – This group will contain all Hadoop system users.
- hdfs – This user will be used for all core HDFS operations. It is a member of the hadoop group.
- yarn – This user will be used for all YARN operations. It is a member of the hadoop group.
- mapred – This user will be used for the MapReduce job history server. It is a member of the hadoop group.
- Create the hadoop group as a system group.
- Create the hdfs user as a system user with the primary group being hadoop and also make it a member of the users group.
- Create the yarn user as a system user with the primary group being hadoop and also make it a member of the users group.
- Create the mapred user as a system user with the primary group being hadoop and also make it a member of the users group.
Do this for all of the hosts.
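As a sketch of what this looks like with the standard CentOS groupadd and useradd tools (run on each host):

> sudo groupadd -r hadoop
> sudo useradd -r -m -g hadoop -G users hdfs
> sudo useradd -r -m -g hadoop -G users yarn
> sudo useradd -r -m -g hadoop -G users mapred

The -r flag creates system accounts, -g sets the primary group, -G adds the supplementary users group, and -m creates home directories (needed later for the .ssh directories).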
SSH Key Setup
The next thing to do is to generate SSH keys for all of the Hadoop users. Do this by running ssh-keygen:
> ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hdfs/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hdfs/.ssh/id_rsa.
Your public key has been saved in /home/hdfs/.ssh/id_rsa.pub.
The key fingerprint is:
92:3c:1f:6e:45:df:79:f4:f6:49:41:b9:e0:10:f2:bc hdfs@NN_HOST
The key's randomart image is:
+--[ RSA 2048]----+
|      . o o-. o. |
| . . . o =  .o.  |
|.  . = . =  .. . |
| ... o ..  . + ..|
|.  . . + E S o   |
|.  .  . .        |
|                 |
|                 |
|                 |
+-----------------+
Accept the default answers on every prompt (including the prompts for a passphrase).
Do these steps for all users that were created.
Note: If the users already existed before you started these steps and already have SSH keys generated for them, don't run ssh-keygen again. You can check by looking for a file named id_rsa in each user's .ssh directory.
The next step is to create an authorized_keys file containing the contents of all of the id_rsa.pub keys that were generated (one for hdfs, one for yarn, one for mapred), and place it in each user's .ssh directory on all hosts. This will ensure that passwordless SSH connections are set up between all hosts for the system users.
Note: Make sure that the permissions on the .ssh directory are 700 (owner access only). Opening this up to more than just owner could result in SSH denying all public key (i.e. passwordless) authentication.
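As a rough sketch of one way to assemble and install the file on a host (the /tmp file names are placeholders for the public keys gathered from wherever they were generated; repeat or scp the result on every host):

# Combine the public keys of the three Hadoop users into one file
cat /tmp/id_rsa.pub.hdfs /tmp/id_rsa.pub.yarn /tmp/id_rsa.pub.mapred > /tmp/authorized_keys

# Install the file for each Hadoop user with owner-only permissions
for u in hdfs yarn mapred; do
  sudo mkdir -p /home/$u/.ssh
  sudo cp /tmp/authorized_keys /home/$u/.ssh/authorized_keys
  sudo chown -R $u:hadoop /home/$u/.ssh
  sudo chmod 700 /home/$u/.ssh
  sudo chmod 600 /home/$u/.ssh/authorized_keys
done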
Make sure Java is installed on all hosts. As mentioned in the single node setup instructions, use a recent release of Java 6 or Java 7. See http://wiki.apache.org/hadoop/HadoopJavaVersions for compatible versions.
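A quick way to check the installed version, and to install OpenJDK 7 on CentOS if needed (the package name below is just one option), is:

> java -version
> sudo yum install java-1.7.0-openjdk-devel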
Download and Install Hadoop
Download Hadoop from the releases page: http://hadoop.apache.org/releases.html . For this post, I will use the 2.6.0 release (named hadoop-2.6.0.tar.gz).
As the hdfs user, extract the archive:
> tar -zxf hadoop-2.6.0.tar.gz
Now move the extracted directory to the path /usr/share, and create a symbolic link to it:
> sudo mv hadoop-2.6.0 /usr/share
> sudo ln -s /usr/share/hadoop-2.6.0 /usr/share/hadoop
Why a symbolic link? A symbolic link will allow you to easily have multiple versions of Hadoop side by side in your installation, and you would be able to switch between them simply by changing the location that the link points to.
Create a directory /usr/share/yarn/logs, and make it owned by the yarn user. Similarly, create the /usr/share/mapred/logs directory owned by the mapred user.
> sudo mkdir /usr/share/yarn
> sudo mkdir /usr/share/yarn/logs
> sudo chown -R yarn:users /usr/share/yarn
> sudo mkdir /usr/share/mapred
> sudo mkdir /usr/share/mapred/logs
> sudo chown -R mapred:users /usr/share/mapred
Note on groups: You can make your Hadoop directories owned by the hadoop group instead of users if you want to keep users from seeing files in those directories.
Perform these steps on all hosts.
Configure the Hadoop Layout
Now it's time to configure the directories for the major Hadoop components. Create a file named hadoop-layout.sh in the /usr/share/hadoop/libexec directory with these contents:
# Set all Hadoop paths
export HADOOP_PREFIX=/usr/share/hadoop
export HADOOP_HOME=$HADOOP_PREFIX
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export HADOOP_MAPRED_HOME=$HADOOP_PREFIX
export HADOOP_YARN_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export MAPRED_LOG_DIR=/usr/share/mapred/logs
export YARN_LOG_DIR=/usr/share/yarn/logs
This step differs from the single node instructions, but the same approach could be used there as well instead of setting these environment variables in each user's .profile.
This file can be the same on every one of the hosts.
Now it’s time to set up the initial configuration. Start with hdfs-site.xml in /usr/share/hadoop/etc/hadoop. Add these lines:
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/share/hadoop/hdfs/datanode</value>
    <description>Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks.</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/share/hadoop/hdfs/namenode</value>
    <description>Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.</description>
  </property>
  <property>
    <name>dfs.secondary.http.address</name>
    <value>RM_HOST:50090</value>
    <description>The secondary namenode http server address and port.</description>
  </property>
</configuration>
This is the same as the single node configuration with the addition of dfs.secondary.http.address. Use the host that will run the resource manager (which, in this setup, also hosts the secondary name node) for RM_HOST.
This file can be the same on all hosts.
The core-site.xml file contains a setting to tell services where the name node is located. This is the only setting that needs to be changed for a basic cluster setup. Use this file:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://NN_HOST/</value>
    <description>NameNode URI</description>
  </property>
</configuration>
For NN_HOST, use the DNS name or IP address for the name node.
This file can be the same on all hosts.
The yarn-site.xml file can be configured similarly to the single node setup, but with a few additions. Here is a good starting configuration:
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>RM_HOST</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>128</value>
    <description>Minimum limit of memory to allocate to each container request at the Resource Manager.</description>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>2048</value>
    <description>Maximum limit of memory to allocate to each container request at the Resource Manager.</description>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
    <description>The minimum allocation for every container request at the RM, in terms of virtual CPU cores. Requests lower than this won't take effect, and the specified value will get allocated the minimum.</description>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>2</value>
    <description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
    <description>Physical memory, in MB, to be made available to running containers</description>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>4</value>
    <description>Number of CPU cores that can be allocated for containers.</description>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
The key addition is the specification of yarn.resourcemanager.hostname to specify where the Resource Manager is.
This file can be the same on all hosts.
Here are starting values for mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.job.tracker.address</name>
    <value>RM_HOST</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <value>512</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.command-opts</name>
    <value>-Xmx358m</value>
  </property>
  <property>
    <name>mapreduce.map.cpu.vcores</name>
    <value>1</value>
    <description>The number of virtual cores required for each map task.</description>
  </property>
  <property>
    <name>mapreduce.reduce.cpu.vcores</name>
    <value>1</value>
    <description>The number of virtual cores required for each reduce task.</description>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>512</value>
    <description>Larger resource limit for maps.</description>
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx358m</value>
    <description>Heap-size for child jvms of maps.</description>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>512</value>
    <description>Larger resource limit for reduces.</description>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx358m</value>
    <description>Heap-size for child jvms of reduces.</description>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>RM_HOST:10020</value>
  </property>
</configuration>
This configuration scales back the memory settings for MapReduce application masters and tasks somewhat to help fit within machines with a modest memory limit (the 2 GB used for this test). These values can be increased for machines with a larger memory footprint.
If tasks have trouble starting (you will notice this when the app master starts, but no mappers begin running), the memory settings may need to be reduced further, or the node managers may need to be given more memory.
Sync up the configurations
All of the files mentioned can be the exact same values on all hosts. To make this easy, you can use a script similar to this one to copy the configurations to all servers:
#!/bin/bash
scp -p /usr/share/hadoop/etc/hadoop/core-site.xml RM_HOST:/usr/share/hadoop/etc/hadoop/core-site.xml
scp -p /usr/share/hadoop/etc/hadoop/hdfs-site.xml RM_HOST:/usr/share/hadoop/etc/hadoop/hdfs-site.xml
scp -p /usr/share/hadoop/etc/hadoop/yarn-site.xml RM_HOST:/usr/share/hadoop/etc/hadoop/yarn-site.xml
scp -p /usr/share/hadoop/etc/hadoop/mapred-site.xml RM_HOST:/usr/share/hadoop/etc/hadoop/mapred-site.xml
scp -p /usr/share/hadoop/libexec/hadoop-layout.sh RM_HOST:/usr/share/hadoop/libexec/hadoop-layout.sh
scp -p /usr/share/hadoop/etc/hadoop/core-site.xml DN1_HOST:/usr/share/hadoop/etc/hadoop/core-site.xml
scp -p /usr/share/hadoop/etc/hadoop/hdfs-site.xml DN1_HOST:/usr/share/hadoop/etc/hadoop/hdfs-site.xml
scp -p /usr/share/hadoop/etc/hadoop/yarn-site.xml DN1_HOST:/usr/share/hadoop/etc/hadoop/yarn-site.xml
scp -p /usr/share/hadoop/etc/hadoop/mapred-site.xml DN1_HOST:/usr/share/hadoop/etc/hadoop/mapred-site.xml
scp -p /usr/share/hadoop/libexec/hadoop-layout.sh DN1_HOST:/usr/share/hadoop/libexec/hadoop-layout.sh
A more advanced script can be written to simplify this process, but this helps illustrate what needs to be done.
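For example, a slightly more general version could loop over the hosts and files:

#!/bin/bash
# Push the shared Hadoop configuration files to the other hosts in the cluster
HOSTS="RM_HOST DN1_HOST"
FILES="etc/hadoop/core-site.xml etc/hadoop/hdfs-site.xml etc/hadoop/yarn-site.xml etc/hadoop/mapred-site.xml libexec/hadoop-layout.sh"

for host in $HOSTS; do
  for file in $FILES; do
    scp -p /usr/share/hadoop/$file $host:/usr/share/hadoop/$file
  done
done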
This script can run without any password prompts as long as the SSH keys have been set up as mentioned in the SSH Key Setup section.
Configure the Data Nodes
Now the data nodes need to be configured. Edit the file /usr/share/hadoop/etc/hadoop/slaves on both the name node and resource manager host to include the data nodes. In this setup, there is only one entry: DN1_HOST.
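With the single data node used in this setup, the slaves file contains just one line:

DN1_HOST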
Format the Name Node
Now it’s time to start running some commands to set up your cluster! The first step is to format your name node. This will set up some configuration files in the directories you have set up for your name node and data node. To do this, run the following:
> su - hdfs
hdfs> $HADOOP_PREFIX/bin/hdfs namenode -format
Just as with formatting a hard drive, only do this once, or else you will lose all data stored in HDFS.
You will know that your command worked when you see the following at the bottom of the output on your screen:
INFO common.Storage: Storage directory /usr/share/hadoop/hdfs/namenode has been successfully formatted.
Note: If you are on a 64 bit machine, and you see this message:
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /usr/share/hadoop-2.2.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/02/01 17:02:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Don’t worry…this is common. It just means you are on a 64 bit server and trying to run 32 bit native libraries. As the message says, Hadoop will fall back to the built-in Java classes. The particularly annoying thing is that you will see this message every time you run a command. If you want to get rid of the warning, you will need to build the libhadoop native library on your system; the Apache Hadoop documentation covers how to build the native libraries.
Start the Hadoop and YARN Processes
Now that the name node is formatted and all of the config files are copied out, it’s time to start up the cluster. First, start up the name nodes (primary and secondary) and data nodes with the single command:
USER@NN_HOST> sudo -i -u hdfs /usr/share/hadoop/sbin/start-dfs.sh
This will switch to the hdfs user, then run the built-in start-dfs.sh script to start up the name nodes and data nodes. Note that the first time this is run, there may be prompts to accept SSH host keys from the other servers. These will only need to be answered once.
After starting the HDFS services, start the YARN services by running:
USER@RM_HOST> sudo -i -u yarn /usr/share/hadoop/sbin/start-yarn.sh
This will switch to the yarn user, then start the Resource Manager and all Node Managers.
Start the Job History Server
Start the MR Job History Server by running:
USER@RM_HOST> sudo -i -u mapred /usr/share/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver
This switches to the mapred user to run the history server daemon.
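To verify that everything came up, one option is the jps tool that ships with the JDK (assuming it is on the PATH for these users); run it as each service user on the appropriate host:

USER@NN_HOST> sudo -i -u hdfs jps     # expect NameNode here, DataNode on DN1_HOST, SecondaryNameNode on RM_HOST
USER@RM_HOST> sudo -i -u yarn jps     # expect ResourceManager here and NodeManager on DN1_HOST
USER@RM_HOST> sudo -i -u mapred jps   # expect JobHistoryServer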
Combined Start/Stop Scripts
Running all of these commands can make the process of starting and stopping a cluster cumbersome. While commercial distributions provide their own interfaces to start and stop services, if you are interested in using straight open source Apache, there are no quick options. I use these scripts to manage cluster services.
#!/bin/bash
# Start HDFS
sudo -i -u hdfs /usr/share/hadoop/sbin/start-dfs.sh
# Start YARN
sudo -i -u yarn ssh RM_HOST "/usr/share/hadoop/sbin/start-yarn.sh"
# Start the MR Job History Server
sudo -i -u mapred ssh RM_HOST "/usr/share/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver"
#!/bin/bash
# Stop YARN
sudo -i -u yarn ssh RM_HOST "/usr/share/hadoop/sbin/stop-yarn.sh"
# Stop HDFS
sudo -i -u hdfs /usr/share/hadoop/sbin/stop-dfs.sh
# Stop the MR Job History Server
sudo -i -u mapred ssh RM_HOST "/usr/share/hadoop/sbin/mr-jobhistory-daemon.sh stop historyserver"
These scripts are designed to be run on the name node, but they can easily be adapted to run on another node.
Accessing the Web UIs
As a validation check, you should now be able to see the web UIs for both the Name Node and YARN. You can do so by accessing these URLs (the Hadoop 2.6 default ports):
- Name Node: http://NN_HOST:50070/
- Resource Manager: http://RM_HOST:8088/
If these are not accessible, make sure the ports are open in your firewall, or look in the logs for errors on startup.
Setting up a User
You’re almost done! Now you need to set up a home directory for the user you will use with Hadoop. As the hdfs user, set up the user and the /tmp directory:
hdfs> hdfs dfs -mkdir /user
hdfs> hdfs dfs -mkdir /user/USER_NAME
hdfs> hdfs dfs -chown USER_NAME /user/USER_NAME
hdfs> hdfs dfs -chgrp users /user/USER_NAME
hdfs> hdfs dfs -mkdir /tmp
hdfs> hdfs dfs -chgrp users /tmp
hdfs> hdfs dfs -chmod 777 /tmp
hdfs> hdfs dfs -chmod +t /tmp
Where USER_NAME is the name of the user you want to set up.
It’s important to do this as the hdfs user since this is the superuser. This user can create files anywhere in the system…analogous to root in Linux.
Having /tmp created and set to the correct permissions is necessary later when you start running jobs.
Accessing Hadoop as a User
Hadoop is now set up…it’s time to make sure you can access Hadoop as a user.
Copy the environment variables from the .profile for the hdfs and yarn users to the .profile for your user. Log out and log in again to make sure your .profile has been read.
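If those variables are not already set for your user, a minimal addition to its .profile matching the layout used in this post might look like this:

# Hadoop environment for a regular user
export HADOOP_PREFIX=/usr/share/hadoop
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export PATH=$PATH:$HADOOP_PREFIX/bin:$HADOOP_PREFIX/sbin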
Now run this command to see an HDFS directory listing:
USER_NAME> hdfs dfs -ls /user
Found 1 items
drwxr-xr-x   - USER_NAME users          0 2014-02-01 2:10 /user/USER_NAME
If you see this, HDFS is set up!
Testing the YARN Installation
To test your YARN installation, you can run one of the example applications included in the YARN distribution, DistributedShell, which runs a shell command in a specified number of containers on the cluster. To run it, use this command:
> /usr/share/hadoop/bin/hadoop jar /usr/share/hadoop/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.6.0.jar \
    org.apache.hadoop.yarn.applications.distributedshell.Client \
    --jar /usr/share/hadoop/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.6.0.jar \
    --shell_command date \
    --num_containers 2 \
    --master_memory 1024
This will kick off a YARN job and dump a bunch of output to the screen. If all goes well, toward the end you should see a log entry containing this message:
INFO distributedshell.Client: Application completed successfully
If you do, that’s a good sign…to check the output of your command, you can switch to the yarn user and run this command:
yarn> grep "" $YARN_LOG_DIR/userlogs/APPLICATION_ID/**/stdout
This will recursively show all stdout logs for the containers in your job. Each one should contain the current date, which is the normal output of the date command (which was used to run this test).
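If you did not note the application ID from the client output, you should be able to look it up with the yarn CLI (the -appStates flag filters the listing to completed applications):

yarn> /usr/share/hadoop/bin/yarn application -list -appStates FINISHED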
Assuming you see this, YARN is now set up!
Testing the MapReduce Installation
Now let’s verify your MapReduce settings. You can test MapReduce in the same way that you tested YARN…by running some of the example applications that come with the distribution.
To run a test MapReduce job, run the following command:
> /usr/share/hadoop/bin/hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar randomwriter out
After you start this job, you will be able to see its status in the YARN Resource Manager. If you can see this, you have successfully set up MapReduce and YARN!
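To see what the job wrote, you can list the output directory (out, relative to your HDFS home directory) as your user:

USER_NAME> hdfs dfs -ls out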
If you have gotten to this point, you have finished the setup of your first networked Hadoop cluster. Congratulations!
This is just a start…there’s definitely room for improvement from here to make things easier to use. Future posts will refer to this setup.