Here’s how you can get started with your first Hadoop cluster. These instructions will walk you through the setup using:
- A Linux server with OpenSuSE installed (12.3 was used here) in text mode
- Apache Hadoop 2.2.0
This will get you started using a single node cluster in pseudo-distributed mode. The benefit of this approach is that it closely mirrors how a fully distributed Hadoop cluster works; it just happens to run on only one server.
You can run Hadoop using a single user ID; however, this would create an environment that does not mirror how a real cluster would work. In a real cluster, users accessing the cluster are not the superuser. This guide is designed to give you the steps to create a single node environment that would be representative of how a normal, real world, Hadoop cluster would operate.
I installed my OS in a brand new VM, giving it 2 GB of memory and a 20 GB hard disk, with the disk formatted entirely as one partition. After doing this, install all of your OS updates.
Make sure you have a recent release of Java 6 or Java 7 installed (Java 7u51 as of this writing; see http://wiki.apache.org/hadoop/HadoopJavaVersions for other compatible versions), and make sure JAVA_HOME is set appropriately.
Decide how you will refer to your server: DNS name or IP address? If you are going to use the DNS name, make sure that the name resolves from the local host by using the ping command. Add it to /etc/hosts if necessary.
Note: If you want to be able to access Hadoop services (such as the name node UI) from an external system, be sure that your DNS name refers to your external IP address, and not localhost.
In all places in these instructions where you see HOST_NAME, use either the DNS name or IP address.
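The resolution check and /etc/hosts edit can be sketched as follows. This is a demonstration against a scratch file, since editing the real /etc/hosts needs root; hadoop-node1 and 192.168.1.50 are made-up placeholder values — use your own name and address, and point at /etc/hosts on a real system.

```shell
HOSTS=/tmp/demo-hosts            # stand-in for /etc/hosts
HOST_NAME=hadoop-node1           # placeholder: your DNS name
HOST_IP=192.168.1.50             # placeholder: your external IP, not 127.0.0.1
# Append an entry only if the name is not already present
grep -q "$HOST_NAME" "$HOSTS" 2>/dev/null || \
  echo "$HOST_IP  $HOST_NAME" >> "$HOSTS"
cat "$HOSTS"
```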
This guide will create two users and one group for the Hadoop cluster:
- hadoop – This group will contain all Hadoop system users.
- hdfs – This user will be used for all core HDFS operations. It is a member of the hadoop group.
- yarn – This user will be used for all YARN operations. It is a member of the hadoop group.
- Create the hadoop group as a system group.
- Create the hdfs user as a system user with the primary group being hadoop and also make it a member of the users group.
- Create the yarn user as a system user with the primary group being hadoop and also make it a member of the users group.
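On most Linux distributions, the three steps above map onto groupadd/useradd calls like these (run as root; exact flag spellings can differ slightly between distros):

```shell
# Create the hadoop system group, then the hdfs and yarn system users
# with hadoop as their primary group and users as a secondary group.
groupadd -r hadoop
getent group users >/dev/null || groupadd users   # usually exists already
useradd -r -m -g hadoop -G users hdfs
useradd -r -m -g hadoop -G users yarn
```

The -m flag gives the system users home directories, which they will need later for their .profile files and for extracting the Hadoop archive.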
Download and Install Hadoop
Download Hadoop from the releases page: http://hadoop.apache.org/releases.html . For this post, I will use the 2.2.0 release (named hadoop-2.2.0.tar.gz).
As the hdfs user, extract the archive:
> tar -zxf hadoop-2.2.0.tar.gz
Now move the extracted directory to the path /usr/share, and create a symbolic link to it:
> sudo mv hadoop-2.2.0 /usr/share
> sudo ln -s /usr/share/hadoop-2.2.0 /usr/share/hadoop
Why a symbolic link? A symbolic link will allow you to easily have multiple versions of Hadoop side by side in your installation, and you would be able to switch between them simply by changing the location that the link points to.
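The version switch can be demonstrated in a scratch directory (the 2.4.0 path below is a hypothetical future version):

```shell
# Switching versions is just repointing the symlink. The -n flag matters:
# without it, ln would create the new link *inside* the old target directory.
mkdir -p /tmp/demo/hadoop-2.2.0 /tmp/demo/hadoop-2.4.0
ln -sfn /tmp/demo/hadoop-2.2.0 /tmp/demo/hadoop
readlink /tmp/demo/hadoop                           # points at 2.2.0
ln -sfn /tmp/demo/hadoop-2.4.0 /tmp/demo/hadoop     # "upgrade"
readlink /tmp/demo/hadoop                           # now points at 2.4.0
```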
Create a directory /usr/share/yarn/logs, and make it owned by the yarn user:
> sudo mkdir /usr/share/yarn
> sudo mkdir /usr/share/yarn/logs
> sudo chown -R yarn /usr/share/yarn
> sudo chgrp -R users /usr/share/yarn
Note on groups: You can make your Hadoop directories owned by the hadoop group instead of users if you want to keep users from seeing files in those directories. Generally for a single node setup, this is probably excessive.
Environment Variable Setup
Now it’s time to configure the environment variables for the Hadoop users. Add these lines to the .profile file for the hdfs and yarn users, along with the user you plan to use to access Hadoop:
export HADOOP_PREFIX=/usr/share/hadoop
export HADOOP_HOME=$HADOOP_PREFIX
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export HADOOP_MAPRED_HOME=$HADOOP_PREFIX
export HADOOP_YARN_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
Now it’s time to set up the config files for Hadoop. The first one is hdfs-site.xml. This contains all settings specific to HDFS. In a new archive such as the one we are installing here, the file is empty. Use this for a good starter setup:
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/share/hadoop/hdfs/datanode</value>
    <description>Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks.</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/share/hadoop/hdfs/namenode</value>
    <description>Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.</description>
  </property>
</configuration>
This is a pretty basic setup. It simply specifies the directories to use for the name node and data node.
The core-site.xml file contains a setting to tell services where the name node is located. This is the only setting that needs to be changed for a basic cluster setup. Use this file:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://HOST_NAME/</value>
    <description>NameNode URI</description>
  </property>
</configuration>
For HOST_NAME, use the DNS name or IP address for your server.
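To sanity-check the substitution, here is the same edit done with sed on a scratch copy of the file (hadoop-node1 is a placeholder host name; the real file lives in $HADOOP_CONF_DIR/core-site.xml):

```shell
HOST_NAME=hadoop-node1                 # substitute your DNS name or IP
conf=/tmp/demo-core-site.xml           # scratch copy for demonstration
cat > "$conf" <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://HOST_NAME/</value>
  </property>
</configuration>
EOF
# Replace the placeholder with the actual host name
sed -i "s|HOST_NAME|${HOST_NAME}|" "$conf"
grep "<value>" "$conf"
```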
The yarn-site.xml file contains all settings for YARN. The default settings are designed for a full-scale setup with decent-sized servers and plenty of memory. These settings scale YARN back a little for a single node setup:
<configuration>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>128</value>
    <description>Minimum limit of memory to allocate to each container request at the Resource Manager.</description>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>2048</value>
    <description>Maximum limit of memory to allocate to each container request at the Resource Manager.</description>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
    <description>The minimum allocation for every container request at the RM, in terms of virtual CPU cores. Requests lower than this won't take effect, and the specified value will get allocated the minimum.</description>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>2</value>
    <description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
    <description>Physical memory, in MB, to be made available to running containers</description>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>4</value>
    <description>Number of CPU cores that can be allocated for containers.</description>
  </property>
</configuration>
You may need to adjust these settings for your machine; in particular, yarn.nodemanager.resource.memory-mb should not exceed the physical memory available on your node.
Format the Name Node
Now it’s time to start running some commands to set up your cluster! The first step is to format your name node. This will set up some configuration files in the directories you have set up for your name node and data node. To do this, run the following:
> su - hdfs
hdfs> $HADOOP_PREFIX/bin/hdfs namenode -format
Just as with formatting a hard drive, only do this once, or else you will lose all data stored in HDFS.
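If you script your setup, a simple guard keeps a re-run from reformatting an existing name node. This is a sketch: echo stands in for the real format command, and the demonstration uses a /tmp path in place of the dfs.namenode.name.dir directory from hdfs-site.xml.

```shell
NN_DIR=/tmp/demo-namenode    # real value: /usr/share/hadoop/hdfs/namenode
if [ -d "$NN_DIR/current" ]; then
  echo "already formatted -- skipping"
else
  echo "formatting"          # real command: $HADOOP_PREFIX/bin/hdfs namenode -format
  mkdir -p "$NN_DIR/current" # the real format step creates this directory itself
fi
```

The check works because formatting creates a current subdirectory under the name node directory to hold its metadata.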
You will know that your command worked when you see the following at the bottom of the output on your screen:
INFO common.Storage: Storage directory /usr/share/hadoop/hdfs/namenode has been successfully formatted.
Note: If you are on a 64-bit machine and you see this message:
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /usr/share/hadoop-2.2.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/02/01 17:02:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Don’t worry…this is common. It just means you are on a 64-bit server trying to run 32-bit native libraries. As the message says, Hadoop will fall back to its built-in Java classes. The particularly annoying thing is that you will see this warning every time you run a command. If you want to get rid of it, you will need to build the libhadoop library on your system. I’ll cover those details in another post. (Update: You can now find the instructions on how to build the native libraries here.)
Start the Hadoop and YARN daemons
The configuration files are set up, and the name node has been formatted…now it’s time to start up some daemons. You can use this script to start everything you need:
# For HDFS
# Start the namenode daemon
sudo -i -u hdfs hadoop-daemon.sh start namenode
# Start the datanode daemon
sudo -i -u hdfs hadoop-daemon.sh start datanode

# For YARN
# Start the resourcemanager daemon
sudo -i -u yarn yarn-daemon.sh start resourcemanager
# Start the nodemanager daemon
sudo -i -u yarn yarn-daemon.sh start nodemanager
This is pretty straightforward…it starts up the name node and data node for HDFS, and the resource manager and node manager for YARN. Unless you have passwordless sudo set up, you will be prompted for the password for the hdfs and yarn users.
After the script finishes, you can check to see that the processes are running by running these commands:
> sudo -i -u hdfs jps
nnnn NameNode
nnnn DataNode
nnnn Jps
> sudo -i -u yarn jps
nnnn NodeManager
nnnn ResourceManager
nnnn Jps
The order and process IDs will differ, but you should see entries for all of the daemons that you started.
Accessing the Web UIs
As a validation check, you should now be able to see the web UIs for both the Name Node and YARN. By default on Hadoop 2.x, they are at these URLs:
- Name Node UI: http://HOST_NAME:50070
- ResourceManager UI: http://HOST_NAME:8088
If these are not accessible, make sure the ports are open in your firewall, or look in the logs for errors on startup.
Setting up a user
You’re almost done! Now you need to set up a home directory for the user you will use with Hadoop. As the hdfs user, set up the user and the /tmp directory:
hdfs> hdfs dfs -mkdir /user
hdfs> hdfs dfs -mkdir /user/USER_NAME
hdfs> hdfs dfs -chown USER_NAME /user/USER_NAME
hdfs> hdfs dfs -chgrp users /user/USER_NAME
hdfs> hdfs dfs -mkdir /tmp
hdfs> hdfs dfs -chgrp users /tmp
hdfs> hdfs dfs -chmod 777 /tmp
hdfs> hdfs dfs -chmod +t /tmp
Where USER_NAME is the name of the user you want to set up.
It’s important to do this as the hdfs user since this is the superuser. This user can create files anywhere in the system…analogous to root in Linux.
Having /tmp created and set to the correct permissions is necessary later when you start running jobs.
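The chmod +t on /tmp sets the sticky bit: anyone can create files there, but only a file’s owner can delete it. The effect is the same as on a local Linux filesystem, which you can see in a scratch directory:

```shell
# Local-filesystem analogue of the HDFS /tmp permissions above.
mkdir -p /tmp/demo-shared
chmod 777 /tmp/demo-shared     # world-writable
chmod +t /tmp/demo-shared      # sticky bit: users can only delete their own files
stat -c '%A' /tmp/demo-shared  # drwxrwxrwt -- the trailing t is the sticky bit
```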
Accessing Hadoop as a user
Hadoop is now set up…it’s time to make sure you can access Hadoop as a user.
Copy the environment variables from the .profile for the hdfs and yarn users to the .profile for your user. Log out and log in again to make sure your .profile has been read.
Now run this command to see an HDFS directory listing:
USER_NAME> hdfs dfs -ls /user
Found 1 items
drwxr-xr-x   - USER_NAME users          0 2014-02-01 2:10 /user/USER_NAME
If you see this, HDFS is set up!
Testing the YARN Installation
To test your YARN installation, you can run one of the example applications included in the YARN distribution, called DistributedShell, which runs a shell command in a specified number of containers in the cluster. To run it, use this command:
> $HADOOP_PREFIX/bin/hadoop jar $HADOOP_PREFIX/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.2.0.jar \
    org.apache.hadoop.yarn.applications.distributedshell.Client \
    --jar $HADOOP_PREFIX/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.2.0.jar \
    --shell_command date \
    --num_containers 2 \
    --master_memory 1024
This will kick off a YARN job and dump a bunch of output to the screen. If all goes well, toward the end you should see a log entry containing this message:
INFO distributedshell.Client: Application completed successfully
If you do, that’s a good sign…to check the output of your command, you can switch to the yarn user and run this command:
yarn> grep "" $YARN_LOG_DIR/userlogs/APPLICATION_ID/**/stdout
This will recursively show all stdout logs for the containers in your job. Each one should contain the current date, which is the normal output of the date command used to run this test.
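One gotcha: the ** glob only recurses when bash’s globstar option is enabled. A scratch demonstration with made-up container log paths:

```shell
shopt -s globstar              # enable ** recursion (bash 4 and later)
# Fake the layout of $YARN_LOG_DIR/userlogs: one application, two containers
mkdir -p /tmp/demo-logs/application_1/container_1 \
         /tmp/demo-logs/application_1/container_2
date > /tmp/demo-logs/application_1/container_1/stdout
date > /tmp/demo-logs/application_1/container_2/stdout
# Prints the date line from each container's stdout, prefixed by its path
grep "" /tmp/demo-logs/application_1/**/stdout
```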
Assuming you see this, YARN is now set up!
At this point, you now have a functioning, single node Hadoop cluster running YARN. I’ll use this cluster as a reference in future posts.
Special thanks to Alex JF for his excellent YARN tutorial, which helped me as I was writing up these steps.