How to Connect Hadoop to an Azure storage account

Posted on June 21, 2015 by dmontroy

With the 2.7.0 release, Hadoop includes the ability to connect to a Windows Azure storage account.  There are plenty of advantages to this, including the ability to tap into Azure's storage resources.  To be fair to other cloud providers, there is also support for Amazon S3 and OpenStack Swift, but this post focuses specifically on Azure.

Prerequisites

Before you get started, you need the following:

  • A Hadoop cluster running version 2.7.0
  • A Windows Azure subscription with a storage account created

For the first requirement (Hadoop 2.7.0), you can follow the existing instructions on this site to set up a Hadoop cluster.  For the second (Windows Azure and a storage account), see below.
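
If you are not sure which release your cluster is running, the hadoop version command will print it; the first line of its output should show 2.7.0 (or later):

hadoop version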

Quick Start for Windows Azure

To get started with Windows Azure, go to http://azure.microsoft.com and sign up for an account (if you don't already have one).  You will need a Windows Live ID, which you already have if you have a Microsoft account, Xbox Live account, etc.  To be upfront, once you are past the initial Azure trial, there is a monthly charge for storage.  If you are only thinking in hobbyist terms, the prices are reasonable:

  • The first TB (terabyte) of storage costs $0.024 per GB per month as of the time of this article.  That's $2.40 a month for 100 GB of storage, and obviously much cheaper if you are only messing around with a few GB.
  • Beyond the first TB, the price per GB goes down.

Once you have your Windows Azure subscription created, setting up a storage account takes only a few steps:

  • Log in to the management portal at https://manage.windowsazure.com.
  • Select New — Data Services — Storage — Quick Create
  • In the URL field, type in a unique name (an indicator on the screen will tell you if the name you picked is already taken)
  • Select a location in the Location/Affinity Group section.  You can use the default or pick a region that is close to you.
  • For Replication, select “Locally Redundant” for the lowest cost.  If you are concerned about the replication of your data, you can select one of the other options, but they come at a higher cost.
  • Then click the check box by the text Create Storage Account

It’s important to note that you are not charged if you do not keep data in your account.  Charges are pro-rated by month as well.

Once the text “Online” shows in the Status field for your storage account, your storage is available for use.

To get started, click on your storage account, then from the initial “Your storage account has been created!” screen, click Containers.  On the Containers tab, click Add to create a container, and then specify a name for your container.  From this part of the UI you can’t create your own data under the container, but we will be doing that from Hadoop soon.
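
As an aside, if you would rather script the account setup than click through the portal, the cross-platform Azure CLI can do roughly the same thing.  This is only a sketch and not part of the portal walkthrough above: the az tool, the mygroup resource group, and the eastus location are my own placeholder assumptions.

az group create --name mygroup --location eastus
az storage account create --name YOUR_ACCOUNT_NAME --resource-group mygroup --location eastus --sku Standard_LRS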

About Azure Storage Accounts

Before we get back to Hadoop, let’s talk about how you can provide access to your Azure storage account.  Click the Configure tab for your storage account, then click Manage Access Keys at the bottom of the screen.  A dialog will come up that shows you your storage account keys.  It will look like this:

[Screenshot: the Manage Access Keys dialog, showing the storage account name along with its primary and secondary access keys]

Your keys will obviously look different, and these aren’t the beadooper account keys anymore…they were regenerated after taking this screen capture.

Why are these keys so important?  They are like passwords: having the key and the account name gives someone access to all of the data in the storage account.  Azure does have more advanced controls to help secure access to the data, but for this discussion, we are showing how you can access the data in your storage account from Hadoop, and we will also discuss securing the key in your Hadoop config.  The key thing to remember for now is that the access key is what you use in the Hadoop configuration to access your account.  Clicking the paper icon next to the access key copies the key to the clipboard.
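
For reference, the same keys can also be pulled from the command line with the Azure CLI sketched earlier (again an assumption on my part, reusing the hypothetical mygroup resource group), and with a key in hand the container can be created without the portal:

az storage account keys list --account-name YOUR_ACCOUNT_NAME --resource-group mygroup
az storage container create --name YOUR_CONTAINER --account-name YOUR_ACCOUNT_NAME --account-key YOUR_KEY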

Hadoop Configuration for Azure

Now, back to Hadoop.  There are two steps to take in order to set up Hadoop to read from Azure:

  1. Add the Azure libraries to the classpath
  2. Put the key configuration in the core-site.xml file.

So first step…make sure Hadoop knows about the Azure libraries.  If you followed the instructions in this article to set up a cluster, you created a file named hadoop-layout.sh to specify the location of various parts of the cluster.  Add this line at the end of that file, or at the end of hadoop-env.sh if you have another cluster setup:

export HADOOP_CLASSPATH=$HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-azure-2.7.0.jar:$HADOOP_PREFIX/share/hadoop/tools/lib/azure-storage-2.0.0.jar:$HADOOP_CLASSPATH

Update this for your specific version and the location of these files.  All this does is add hadoop-azure-2.7.0.jar and azure-storage-2.0.0.jar to the classpath.
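
If you would rather not hard-code the version numbers, a wildcard-based variant works as well.  This is just a sketch and assumes one copy of each jar lives under tools/lib:

# add the Azure connector jars to the classpath without pinning versions
for jar in $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-azure-*.jar \
           $HADOOP_PREFIX/share/hadoop/tools/lib/azure-storage-*.jar; do
  export HADOOP_CLASSPATH=$jar:$HADOOP_CLASSPATH
done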

Now the second part…adding the key configuration to your core-site.xml file.  The simplest way is to put the key, in plain text, directly into the file.  Add this to your core-site.xml:

<property>
  <name>fs.azure.account.key.YOUR_ACCOUNT_NAME.blob.core.windows.net</name>
  <value>YOUR_KEY</value>
</property>

Replace YOUR_ACCOUNT_NAME with your account name and YOUR_KEY with your storage access key.

Now it’s time to try this out.  Run the following command as a Hadoop user:

hadoop fs -ls wasb://YOUR_CONTAINER@YOUR_ACCOUNT_NAME.blob.core.windows.net/

Note the wasb:// scheme and the YOUR_CONTAINER@YOUR_ACCOUNT_NAME form of the URI: replace YOUR_CONTAINER with the container name you created in your account and YOUR_ACCOUNT_NAME with the name of your account.  Your output should look like this:

15/06/21 19:17:31 INFO impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
15/06/21 19:17:31 INFO impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
15/06/21 19:17:31 INFO impl.MetricsSystemImpl: azure-file-system metrics system started
15/06/21 19:17:32 INFO impl.MetricsSystemImpl: Stopping azure-file-system metrics system...
15/06/21 19:17:32 INFO impl.MetricsSystemImpl: azure-file-system metrics system stopped.
15/06/21 19:17:32 INFO impl.MetricsSystemImpl: azure-file-system metrics system shutdown complete.

This looks weird, but it's accurate for an empty container.  The logging is related to the metrics2 system around the Azure libraries.  (With the proper log4j configuration you can suppress this; a sketch follows the next example.  You can also simply direct stderr to /dev/null.)  If there are files in your container, you will see them between azure-file-system metrics system started and Stopping azure-file-system metrics system….  Here's example output:

15/06/21 19:23:10 INFO impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
15/06/21 19:23:10 INFO impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
15/06/21 19:23:10 INFO impl.MetricsSystemImpl: azure-file-system metrics system started
Found 1 items
drwxr-xr-x   - youruser supergroup          0 2015-06-21 19:24 wasb://YOUR_CONTAINER@YOUR_ACCOUNT_NAME.blob.core.windows.net/test
15/06/21 19:23:11 INFO impl.MetricsSystemImpl: Stopping azure-file-system metrics system...
15/06/21 19:23:11 INFO impl.MetricsSystemImpl: azure-file-system metrics system stopped.
15/06/21 19:23:11 INFO impl.MetricsSystemImpl: azure-file-system metrics system shutdown complete.
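
As mentioned above, the metrics chatter can be quieted through log4j.  A minimal sketch, assuming you are using the stock log4j.properties under $HADOOP_PREFIX/etc/hadoop, is to raise the level for the metrics2 loggers:

# in log4j.properties: only log warnings and above from the metrics2 subsystem
log4j.logger.org.apache.hadoop.metrics2=WARN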

Note:  The trailing slash after .blob.core.windows.net is important…without it, you get a “no such file or directory” error.

Now you can use -mkdir, -put, or any of the other file system commands to work with Azure just as you would with HDFS.
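
For example, using the same placeholder container and account names (localfile.txt and /user/youruser/somedir are hypothetical local and HDFS paths):

hadoop fs -mkdir wasb://YOUR_CONTAINER@YOUR_ACCOUNT_NAME.blob.core.windows.net/test
hadoop fs -put localfile.txt wasb://YOUR_CONTAINER@YOUR_ACCOUNT_NAME.blob.core.windows.net/test/
hadoop fs -ls wasb://YOUR_CONTAINER@YOUR_ACCOUNT_NAME.blob.core.windows.net/test
# distcp works too, for copying larger directories between HDFS and Azure
hadoop distcp /user/youruser/somedir wasb://YOUR_CONTAINER@YOUR_ACCOUNT_NAME.blob.core.windows.net/backup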

Quick Note on Security

Before you roll this out on your production cluster at work, keep in mind that when accessing Azure in this fashion, NO user-level permission checks are applied…effectively, anyone who can reach the file system has the same rights to this data as the main hdfs user in Hadoop, but only for this file system.  The idea is that Azure assumes authorization is handled by controlling access to the key.  I don't particularly agree with this approach and think that Azure should support additional permissions checking, but that is the state of the security model at the time of this post.

Obscuring the Access Key

Security note above aside, it's still important to understand that by putting your storage access key in core-site.xml, you are increasing the sensitivity of that file.  This is undesirable, since core-site.xml is typically world readable on a normal cluster.  Fortunately, there is a way to obscure the key, described in the Hadoop documentation.  It uses these values in core-site.xml:

<property>
 <name>fs.azure.account.keyprovider.YOUR_ACCOUNT.blob.core.windows.net</name>
 <value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value>
</property>
<property>
 <name>fs.azure.account.key.YOUR_ACCOUNT.blob.core.windows.net</name>
 <value>YOUR ENCRYPTED ACCESS KEY</value>
</property>
<property>
 <name>fs.azure.shellkeyprovider.script</name>
 <value>PATH TO DECRYPTION PROGRAM</value>
</property>

By specifying ShellDecryptionKeyProvider in the fs.azure.account.keyprovider.YOUR_ACCOUNT… property, Hadoop passes the (encrypted) value of fs.azure.account.key… to the script specified by fs.azure.shellkeyprovider.script and uses whatever that script prints as the actual key.

Note:  Even though the documentation (as of this post) says to set the property fs.azure.account.keyprovider.YOUR_ACCOUNT, you actually need to specify the full DNS name of your storage account, including .blob.core.windows.net.

If you want to use your own script, here are the rules:

  • The key value is passed in as the first parameter to the script
  • The script executes as the user running the hadoop command
  • The actual key value should be echoed to stdout

The value in the config file doesn't even need to be an encrypted key.  One way of setting this up is with a script like the following:

#!/bin/bash
# The value of fs.azure.account.key... is passed in as the first parameter ($1).
# Look up the home directory of the user running the hadoop command.
HOMEDIR=`getent passwd $USER | awk -F ':' '{ print $6; }'`
# If a key file named after that value exists, echo the real key to stdout.
if [ -f "$HOMEDIR/.azureStorageKeys/$1" ]; then
  cat "$HOMEDIR/.azureStorageKeys/$1"
else
  echo "notfound"
fi

And in core-site.xml, use this for the key:

<property>
  <name>fs.azure.account.key.YOUR_ACCOUNT.blob.core.windows.net</name>
  <value>YOUR_ACCOUNT.blob.core.windows.net</value>
</property>

With this configuration, the script looks in a directory named .azureStorageKeys in the user's home directory for a file whose name matches the storage account's full DNS name, allowing the key to be secured with Linux file system permissions.  This may not be practical by itself for a multi-user system (keys should be distributed to common areas rather than recreated in every user's home directory), but it illustrates the mechanics of how the key provider script works.
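
To put this into practice for a single user, the setup might look like the following.  The script path /usr/local/bin/azure_key.sh is a hypothetical choice for where fs.azure.shellkeyprovider.script points:

mkdir -p ~/.azureStorageKeys
chmod 700 ~/.azureStorageKeys
# store the real access key in a file named after the account's full DNS name
echo "YOUR_KEY" > ~/.azureStorageKeys/YOUR_ACCOUNT.blob.core.windows.net
chmod 600 ~/.azureStorageKeys/YOUR_ACCOUNT.blob.core.windows.net
# sanity check: the script should print the key, not "notfound"
/usr/local/bin/azure_key.sh YOUR_ACCOUNT.blob.core.windows.net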

All in all, I would exercise caution when using Azure connectivity for a production cluster where security of data is important.  Future updates could improve the security of the connection, in addition to other security features in Azure.  Regardless, knowing how to move data between Hadoop and Azure is a useful tool to have in your toolbox in case your organization uses Azure.

