Hadoop supports Kerberos for secure authentication. A secure cluster is great for knowing who your users are, but it also creates a number of challenges for any tool that sits on top of Hadoop. Whether you are using these tools, building them, or investigating them, it’s helpful to have your own secure cluster to work with.
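To give a taste of what that involves, the central switch is a pair of properties in core-site.xml. This is only a sketch of the first step — a real secure cluster also needs a KDC, per-daemon principals, and keytabs:

```xml
<!-- core-site.xml: switch Hadoop from "simple" authentication to Kerberos.
     On its own this is not sufficient; every daemon also needs a Kerberos
     principal and keytab configured before it will start. -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```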
Cassandra is a linearly scalable, high-availability database with advanced features such as data center awareness built right in. Cassandra and Hadoop are open source technologies that serve different use cases. A primary contributor to the Cassandra project is DataStax, which offers a commercial distribution, but if you are more interested in going straight to the open source project, here are instructions you can use to get started with a single node Cassandra cluster.
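Once the node is installed, a first session might look like the sketch below. The keyspace and table names are just examples; note that `replication_factor: 1` matches a single node:

```shell
# Start the node in the background, then confirm it is up.
bin/cassandra &
bin/nodetool status

# Create a keyspace sized for one node and a first table via cqlsh.
bin/cqlsh <<'CQL'
CREATE KEYSPACE demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
CREATE TABLE demo.users (id uuid PRIMARY KEY, name text);
CQL
```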
Apache Flume is a general purpose data ingestion mechanism that makes it easy to collect, aggregate, and move large amounts of log data. For Hadoop clusters, this is a very useful mechanism to get large amounts of data into your cluster.
These steps show you how to set up the recently released Flume 1.5; they were written and tested against Hadoop 2.4.0.
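A Flume agent is defined entirely in a properties file: sources feed channels, channels feed sinks. A minimal sketch — a netcat source feeding an HDFS sink through a memory channel — might look like this (the agent name, port, and HDFS path are examples):

```properties
# flume.conf: one agent (a1) with a netcat source, memory channel, HDFS sink.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: anything written to localhost:44444 becomes a Flume event.
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Channel: in-memory buffer between source and sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: write events into HDFS as plain text.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://localhost:9000/flume/events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1
```

The agent is then launched with `bin/flume-ng agent --conf conf --conf-file flume.conf --name a1`.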
The Apache group has just released Apache Hadoop 2.4.0. It is available for download at your favorite mirror. It includes:
- Support for Access Control Lists in HDFS
- Native support for Rolling Upgrades in HDFS
- Usage of protocol-buffers for HDFS FSImage for smooth operational upgrades
- Complete HTTPS support in HDFS
- Support for Automatic Failover of the YARN ResourceManager
- Enhanced support for new applications on YARN with Application History Server and Application Timeline Server
- Support for strong SLAs in YARN CapacityScheduler via Preemption
…along with the expected bug fixes.
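The ACL support, for example, shows up as new `hdfs dfs` flags (the path and user name here are hypothetical, and `dfs.namenode.acls.enabled` must be set to `true` in hdfs-site.xml first):

```shell
# Grant a second user read/write on a directory without changing its owner,
# then inspect the resulting ACL.
hdfs dfs -setfacl -m user:alice:rw- /data/shared
hdfs dfs -getfacl /data/shared
```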
As with release 2.3.0, the instructions I have posted to set up your own cluster also work for 2.4.0.
Apache Spark is all the rage now in the Hadoop ecosystem. It has recently become an Apache top level project, and many people are looking at it as a successor to MapReduce.
You probably want to try this out yourself in your cluster. Spark has some nuances that are difficult to figure out at first. Here’s how you can get past them and start exploring what Spark is all about and what it could do for you in your cluster.
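As a first taste, the interactive shell is the quickest way in. A sketch, assuming the Spark distribution is unpacked under /usr/local/spark and the input file exists in HDFS:

```shell
# Launch the Spark interactive shell from the unpacked distribution.
cd /usr/local/spark
./bin/spark-shell

# At the scala> prompt, the classic word count over a file in HDFS:
#   sc.textFile("hdfs://localhost:9000/user/hduser/input.txt")
#     .flatMap(_.split(" "))
#     .map(word => (word, 1))
#     .reduceByKey(_ + _)
#     .take(10)
```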
The Apache group has given approval for the release of Apache Hadoop 2.3.0. It is available for download at your favorite mirror. It includes:
- Support for Heterogeneous Storage hierarchy in HDFS.
- In-memory cache for HDFS data with centralized administration and management.
- Simplified distribution of MapReduce binaries via HDFS in YARN Distributed Cache.
…along with the expected bug fixes. One particularly visible difference is the NameNode web UI, which has been improved to be more visually appealing.
If you have been following my instructions to set up your own cluster, not to worry…I verified that they also work for this version.
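The in-memory cache, for instance, is driven by a new `hdfs cacheadmin` command: you create a pool, then add directives that pin paths into memory. The pool and path names below are just examples:

```shell
# Create a cache pool, pin a directory's blocks into memory, and verify.
hdfs cacheadmin -addPool analytics
hdfs cacheadmin -addDirective -path /data/hot -pool analytics
hdfs cacheadmin -listDirectives
```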
Hive is one of the most popular components of the Hadoop ecosystem…a Hadoop system seems almost bare without it. It provides a good jump start with Hadoop, especially for those with previous SQL experience; however, as you grow in your experience with Hadoop, you’ll come to realize that it isn’t the optimal tool for every Hadoop job. But that’s a story for another post…it remains a great way to get started with any kind of job in Hadoop. On to the instructions!
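To make the SQL comparison concrete, a first session once Hive is installed could look like this sketch (the table, columns, and file path are examples):

```shell
# Create a table over tab-delimited text, load a local file, and query it.
hive -e "
  CREATE TABLE pages (url STRING, hits INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
  LOAD DATA LOCAL INPATH '/tmp/pages.tsv' INTO TABLE pages;
  SELECT url, hits FROM pages ORDER BY hits DESC LIMIT 10;"
```

Behind the scenes, that SELECT is compiled into a MapReduce job and submitted to the cluster — which is exactly why Hive is such an easy on-ramp.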
By now I’ve shown you how to install a single node Hadoop cluster. This configures the cluster with HDFS and YARN functionality, but you may have noticed that submitting a MapReduce job doesn’t show anything in the YARN resource manager. If you are trying to understand how MapReduce interacts with YARN, this doesn’t help you…and it breaks the principle we’ve been following of setting up a cluster that works like a regular cluster that just happens to be on one node.
This post will show you the steps you need to set up MapReduce support in YARN in your cluster.
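The heart of that setup is two small configuration changes: telling MapReduce to run on YARN, and giving the NodeManager the shuffle service that reducers fetch map output from. A sketch of the relevant properties:

```xml
<!-- mapred-site.xml: run MapReduce jobs on YARN instead of the
     local job runner. -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

<!-- yarn-site.xml: the NodeManager must host the shuffle handler
     so reduce tasks can pull map output. -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
```

With these in place, a submitted job shows up as an application in the ResourceManager web UI instead of running silently in-process.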
Here’s how you can get started with your first Hadoop cluster. These instructions will walk you through the process of getting started with Hadoop using:
- A Linux server with openSUSE installed (12.3 was used here) in text mode
- Apache Hadoop 2.2.0
This will get you started with a single node cluster in pseudo-distributed mode. The benefit of this approach is that it is quite similar to how a fully distributed Hadoop cluster works, except it just happens to be running on only one server.
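In outline, bringing the node up follows the standard sequence below (the install path is an assumption; the full post covers the configuration that has to happen first):

```shell
# Format HDFS once, start the HDFS and YARN daemons, and confirm with jps.
cd /usr/local/hadoop
bin/hdfs namenode -format
sbin/start-dfs.sh
sbin/start-yarn.sh
jps   # should list NameNode, DataNode, ResourceManager, NodeManager, ...
```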