Hive is one of the most popular components of the Hadoop ecosystem…a Hadoop system seems almost bare without it. It provides a good jump start with Hadoop, especially for those with previous SQL experience; however, as you grow in your experience with Hadoop, you’ll come to realize that it isn’t always the optimal tool for your Hadoop jobs. But that’s a story for another post…it remains a great way to get started with any kind of job in Hadoop. On to the instructions!
Set up the Hive User
To continue our method of creating a different user for each Hadoop component, you need to create a user named hive. Like the other users, create this user as a system user with the primary group being hadoop and also make it a member of the users group.
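On most Linux distributions, something along these lines will do it (the exact useradd options vary slightly between distros, so adjust as needed):

> sudo useradd -r -g hadoop -G users -m hive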
Download and Install Hive
Download Hive from the releases page: http://hive.apache.org/downloads.html. For this post, I will use the 0.12.0 release (named hive-0.12.0-bin.tar.gz).
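If you prefer to grab it from the command line, older releases are usually available from the Apache archive (treat this URL as a starting point, since mirror layouts change over time):

> wget http://archive.apache.org/dist/hive/hive-0.12.0/hive-0.12.0-bin.tar.gz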
As the hive user, extract the archive:
hive> tar -zxf hive-0.12.0-bin.tar.gz
Now move the extracted directory to the path /usr/share, rename it to hive-0.12.0 (to match what we did for Hadoop), and create a symbolic link to it:
hive> sudo mv hive-0.12.0-bin /usr/share/hive-0.12.0
hive> sudo ln -s /usr/share/hive-0.12.0 /usr/share/hive
Why a symbolic link? It allows you to easily keep multiple versions of Hive side by side in your installation, and you can switch between them simply by changing the location that the link points to.
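For example, if you later installed a hypothetical 0.13.0 release alongside this one, switching would just be a matter of repointing the link:

hive> sudo rm /usr/share/hive
hive> sudo ln -s /usr/share/hive-0.13.0 /usr/share/hive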
Here’s one modification you need to make to the install directory if you have been following the steps in this blog to set up your cluster…installing Hadoop 2.2.0 and Hive 0.12 creates a JAR conflict that needs to be resolved: two different versions of slf4j end up installed. Move the conflicting copies in the Hive directory to a backup directory.
As the hive user:
hive> mkdir /usr/share/hive/lib/old
hive> mv /usr/share/hive/lib/slf4j* /usr/share/hive/lib/old
This will remove the conflict.
Environment Variable Setup
Now it’s time to configure the environment variables for Hive. Add these lines to the .profile file for the hdfs and yarn users, along with the user you plan to use to access Hadoop:
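export HIVE_HOME=/usr/share/hive
export PATH=$PATH:$HIVE_HOME/bin

(These two lines are a minimal sketch; they assume the /usr/share/hive symbolic link created above.)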
For the path configuration, you can add $HIVE_HOME/bin to the path line that already exists in your .profile.
For the sake of consistency, make these changes to all .profile files for the Hadoop system users (hdfs, yarn, and hive).
Before you get started with operations in Hive, you need to create the Hive warehouse directory, which will hold all of the Hive data.
Do these operations as the hdfs user:
hdfs> hdfs dfs -mkdir /user/hive
hdfs> hdfs dfs -mkdir /user/hive/warehouse
hdfs> hdfs dfs -chown -R hive /user/hive
hdfs> hdfs dfs -chgrp users /user/hive/warehouse
hdfs> hdfs dfs -chmod 770 /user/hive/warehouse
This sets up your Hive warehouse directory to be owned by the hive user and writable by the users group, so any user in that group will be able to create databases.
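A quick listing as the hdfs user is a good sanity check; the warehouse directory should show up owned by hive, group users, with permissions drwxrwx---:

hdfs> hdfs dfs -ls /user/hive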
User Config Setup
This is something I discovered as I was writing this up…you need to create a file in your user directory in Linux named .hiverc and add this line to it:
This needs to be set to account for a weird bug you will see when you try to run a Hive command for the first time.
Running Hive for the first time
Now it’s time to try Hive out. As your user (not hdfs or hive), run Hive by simply typing hive at the command prompt.
You will see a number of INFO logging messages related to deprecated settings, such as:
14/02/04 22:54:15 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
14/02/04 22:54:15 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
Unfortunately, there isn’t anything you can do about these messages. The developers have made changes to remove these messages, but they haven’t been published yet.
The key message you are looking for is the very last line:
Logging initialized using configuration in jar:file:/usr/share/hive-0.12.0/lib/hive-common-0.12.0.jar!/hive-log4j.properties
hive>
This tells you that Hive has been started and is running.
Creating a Database
To create a database, from the hive command prompt, type:
hive> create database sample;
OK
Time taken: 3.863 seconds
This means your database has been created.
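If you want to confirm it, listing the databases should show the new one alongside the built-in default database:

hive> show databases;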
Creating a Table
Now that you have created a database, it’s time to create a table. Use this command to create a sample table:
hive> create table sample.sampledata (id int, name string)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > STORED AS TEXTFILE;
OK
Time taken: 0.115 seconds
hive>
If you see this, you have created your first table. This has resulted in these directories being created:
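/user/hive/warehouse/sample.db
/user/hive/warehouse/sample.db/sampledata

(That follows Hive’s convention of a .db directory per database with one subdirectory per table; you can confirm the layout with hdfs dfs -ls -R /user/hive/warehouse.)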
Populating the Table
A table isn’t much use without data in it. Open up a text file in your favorite text editor and add these entries:
1	January
2	February
3	March
4	April
5	May
6	June
7	July
8	August
9	September
10	October
11	November
12	December
After each number, be sure to use a tab character instead of spaces.
Save this file as sampledata.txt.
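If your editor makes it hard to tell tabs from spaces, one way (of many) to generate the file from the shell is with printf, which reuses its format string for each id/name pair:

> printf '%s\t%s\n' 1 January 2 February 3 March 4 April 5 May 6 June 7 July 8 August 9 September 10 October 11 November 12 December > sampledata.txt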
Now, put this file in HDFS using the command:
> hdfs dfs -put sampledata.txt /user/hive/warehouse/sample.db/sampledata
The command produces no output on success, so if you get your prompt back without any error messages, the file has been put into HDFS.
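If you want to double-check what landed in HDFS, you can cat the file back out of the table directory (the file keeps its original name underneath the sampledata directory):

> hdfs dfs -cat /user/hive/warehouse/sample.db/sampledata/sampledata.txt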
Querying the Data
The data is in HDFS, and now it’s time to run a query on it. Run Hive again by typing hive at the command prompt.
You will see the expected deprecation warnings, followed by the Hive prompt. Now run a simple query:
hive> select * from sample.sampledata;
You should see your data come back quickly exactly as you had typed it in. If not, then make sure you have tab characters in your file, and put the file back into HDFS, overwriting the previous one.
The reason this data returns quickly is that you performed a select * on a table with no conditions. When Hive sees this, it doesn’t launch a MapReduce job at all; it simply reads the table’s files straight out of HDFS and returns them. This is *not* something you would want to do with a huge table in Hadoop.
Now it’s time to run a query that will kick off a job:
hive> select count(*) from sample.sampledata;
You will see output that starts off as follows:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1391570039342_0003, Tracking URL = http://dmontroy-hadoop-suse.site:8088/proxy/application_1391570039342_0003/
Kill Command = /usr/share/hadoop/bin/hadoop job -kill job_1391570039342_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2014-02-04 23:27:35,506 Stage-1 map = 0%, reduce = 0%
...
Eventually, the job will finish and report the result that you would expect to see: 12. The overhead to get to this simple result seems unbearable, but keep in mind that Hive was designed to handle very large data sets involving lots of coordination between nodes. For larger jobs, the overhead is a much smaller fraction of the total run time.
If everything has been configured correctly, you should be able to see the Hive jobs that you have submitted in the YARN Resource Manager UI. Go to http://HOST_NAME:8088 to see your query. If the query does not show up in the Resource Manager, make sure you have followed the steps to add MapReduce support to YARN in my previous post. If you skip them (particularly the additions to yarn-site.xml), Hive will run in local mode and will not submit the job to YARN.
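For reference, the yarn-site.xml additions in question are typically the shuffle-handler properties, along these lines (property names as of Hadoop 2.2; double-check against the previous post):

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>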
At this point, if you have been following along, you now have a cluster that has HDFS, YARN, and Hive all set up. Using the Hive interface, you can quite easily create non-trivial data sets and run queries against them.