Cassandra is a linearly scalable, high-availability database with advanced features such as data center awareness built right in. Cassandra and Hadoop are open source technologies that serve different use cases. A primary contributor to the Cassandra project is DataStax. DataStax offers a commercial distribution, but for those who would rather go straight to the open source project and start using it, here are instructions you can use to get started with a single-node Cassandra cluster.
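For a single node, most of the work comes down to a handful of entries in conf/cassandra.yaml. This sketch shows the relevant settings for a node that listens only on localhost; the cluster name and addresses are illustrative values, not requirements:

```yaml
# conf/cassandra.yaml — minimal single-node settings (illustrative values)
cluster_name: 'Test Cluster'

# The node's own addresses; for a single-node setup, localhost is fine.
listen_address: localhost
rpc_address: localhost

# A single-node cluster seeds itself.
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "127.0.0.1"
```

With these in place, the node starts up as a one-member cluster that seeds itself.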
Sooner or later, you will find the need to grow beyond individual jobs in Hadoop and create more complicated workflows. Apache Oozie is a workflow scheduler system that is designed to manage jobs in Hadoop.
These steps show you how to get a simple Oozie installation set up in your single node cluster. For these instructions, I used:
- Oozie 4.0.1
- Hadoop 2.4.1
- Hive 0.13.1
- CentOS 6.5
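To give a feel for what Oozie manages, here is a minimal workflow definition. The application name, action name, and input/output properties are placeholders for illustration, not part of the installation steps:

```xml
<!-- workflow.xml — a minimal Oozie workflow with one MapReduce action.
     The names and directory parameters are illustrative placeholders. -->
<workflow-app xmlns="uri:oozie:workflow:0.4" name="example-wf">
    <start to="mr-node"/>
    <action name="mr-node">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Job failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Each action declares where control flows on success and on error, which is what makes multi-job workflows manageable.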
For someone coming from the MapReduce v1 world of job trackers and task trackers, switching to YARN can be a little confusing. There is no longer a fixed number of “slots”; container allocation is dynamic, driven by the needs of each application.
YARN is quite flexible in how memory requirements for applications are handled. Along with that flexibility comes a number of configuration settings that can be confusing. Here’s an overview of the settings that are available. This post is specifically about MapReduce jobs, but it covers general YARN concepts as well.
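As a taste of what is involved, these are the main memory-related properties; the values shown are examples only, and would be tuned to your hardware:

```xml
<!-- yarn-site.xml: what each NodeManager offers (example values) -->
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value> <!-- total MB this node makes available to containers -->
</property>
<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value> <!-- containers are granted in multiples of this -->
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>
</property>

<!-- mapred-site.xml: what each MapReduce task requests (example values) -->
<property>
    <name>mapreduce.map.memory.mb</name>
    <value>1024</value> <!-- container size for a map task -->
</property>
<property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>2048</value> <!-- container size for a reduce task -->
</property>
<property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx768m</value> <!-- JVM heap, kept below the container size -->
</property>
```

The general pattern: the NodeManager advertises a pool of memory, the scheduler carves it into containers, and each task's JVM heap must fit inside the container it requested.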
Apache Flume is a general-purpose data ingestion tool that makes it easy to collect, aggregate, and move large amounts of log data. For Hadoop clusters, it is a very useful way to get large amounts of data into your cluster.
These steps show you how to set up the recently released Flume 1.5. These steps were put together on Hadoop 2.4.0.
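For context, a Flume agent is wired together in a properties file. This sketch (the agent name "a1" and all paths are arbitrary placeholders) sends lines received on a local port through a memory channel into HDFS:

```properties
# flume.conf — illustrative single-agent configuration
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for lines of text on a local port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: write events into HDFS (path is a placeholder)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://localhost:9000/flume/events
a1.sinks.k1.hdfs.fileType = DataStream

# Wire the pieces together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

The source/channel/sink wiring at the end is easy to forget and is a common reason an agent starts but moves no data.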
The Apache group has just released Apache Hadoop 2.4.0. It is available for download at your favorite mirror. It includes:
- Support for Access Control Lists in HDFS
- Native support for Rolling Upgrades in HDFS
- Usage of protocol-buffers for HDFS FSImage for smooth operational upgrades
- Complete HTTPS support in HDFS
- Support for Automatic Failover of the YARN ResourceManager
- Enhanced support for new applications on YARN with Application History Server and Application Timeline Server
- Support for strong SLAs in YARN CapacityScheduler via Preemption
…along with the expected bug fixes.
As with release 2.3.0, the instructions I have posted to set up your own cluster also work for 2.4.0.
Apache Spark is all the rage now in the Hadoop ecosystem. It has recently become an Apache top level project, and many people are looking at it as a successor to MapReduce.
You probably want to try this out yourself in your cluster. There are some nuances of Spark that are difficult to figure out. Here’s how you can get past them and start figuring out what Spark is all about and what it could do for you in your cluster.
The Apache group has given approval for the release of Apache Hadoop 2.3.0. It is available for download at your favorite mirror. It includes:
- Support for Heterogeneous Storage hierarchy in HDFS.
- In-memory cache for HDFS data with centralized administration and management.
- Simplified distribution of MapReduce binaries via HDFS in YARN Distributed Cache.
…along with the expected bug fixes. One particularly visible difference is in the NameNode web UI, which has been improved to be more visually appealing.
If you have been following my instructions to set up your own cluster, not to worry…I verified that they also work for this version.
Take a look at the settings reference, if you haven’t done so already. In addition to the settings files for Hadoop, I have now added settings pages for each of the most recent versions of Hive: 0.10, 0.11, and 0.12.
Hive is one of the most popular components of the Hadoop ecosystem…a Hadoop system seems almost bare without it. It provides a good jump start with Hadoop, especially for those with previous SQL experience. As you grow in your experience with Hadoop, you’ll come to realize that it isn’t always the optimal tool for your jobs…but that’s a story for another post. It remains a great way to get started with almost any kind of job in Hadoop. On to the instructions!
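To illustrate why Hive is such an easy on-ramp for SQL users, here is what a typical first session might look like. The table, columns, and file path are hypothetical:

```sql
-- A familiar-looking SQL workflow that Hive translates into
-- MapReduce jobs behind the scenes (names here are hypothetical).
CREATE TABLE weblogs (
    host    STRING,
    request STRING,
    status  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Load a tab-delimited file from the local filesystem into the table
LOAD DATA LOCAL INPATH '/tmp/weblogs.tsv' INTO TABLE weblogs;

-- An ordinary-looking aggregate, executed as a MapReduce job
SELECT status, COUNT(*) AS hits
FROM weblogs
GROUP BY status;
```

Nothing in that session requires knowing MapReduce, which is exactly the appeal for newcomers with an SQL background.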