Apache Flume is a general-purpose data ingestion mechanism that makes it easy to collect, aggregate, and move large amounts of log data. For Hadoop clusters, it is a very useful way to get large amounts of data into your cluster.
These steps show you how to set up the recently released 1.5 version of Flume. They were created on Hadoop 2.4.0.
Download the 1.5 version of Flume from the download page.
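For example, you can pull the binary tarball straight from the Apache archive (the URL here is an assumption based on the standard archive layout; use the link from the download page if it differs):
> wget https://archive.apache.org/dist/flume/1.5.0/apache-flume-1.5.0-bin.tar.gz   # URL assumed; verify on the download page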
User Setup and Installation
Create a system user named flume, similar to the yarn and hdfs users in your cluster. Make its primary group hadoop and also make the user a member of the users group.
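One possible way to create this user (the exact useradd options vary by distribution, so treat this as a sketch):
> sudo useradd -r -m -g hadoop -G users flume   # system account with a home dir, primary group hadoop, also in users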
Switch to the flume user:
> su flume
Extract the Flume archive by running the command:
flume> tar -zxf apache-flume-1.5.0-bin.tar.gz
Move the directory under /usr/share:
flume> sudo mv apache-flume-1.5.0-bin /usr/share/apache-flume-1.5.0
Create a symbolic link to point to the Flume directory:
flume> sudo ln -s /usr/share/apache-flume-1.5.0 /usr/share/flume
Finally, set up the .profile for the flume user by adding these lines:
export HADOOP_HOME=/usr/share/hadoop
export FLUME_HOME=/usr/share/flume
export PATH=$FLUME_HOME/bin:$HADOOP_HOME/bin:$PATH
After configuring the .profile, exit your flume session and log back in using su - flume to process the .profile:
> su - flume
Now switch to the hdfs user and create the home directory for the flume user:
> su - hdfs
hdfs> hadoop fs -mkdir /user/flume
hdfs> hadoop fs -chown flume /user/flume
hdfs> hadoop fs -chgrp users /user/flume
hdfs> hadoop fs -chmod 750 /user/flume
hdfs> exit
>
Now let’s update the log4j configuration file for Flume. As the flume user, edit log4j.properties
flume> vi /usr/share/flume/conf/log4j.properties
Change the "flume.log.dir" property so that Flume logs to /usr/share/flume/logs:
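The relevant line should end up looking something like this (the default value it replaces in the shipped file may differ slightly):
flume.log.dir=/usr/share/flume/logs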
This will ensure that Flume logs are put in /usr/share/flume/logs.
Agent Configuration File
Before running the Flume agent, the agent configuration file needs to be set up. This file defines the data flow. Flume defines the data flow using these components:
Sources – This is the starting point for the data in the flow.
Sinks – This is the ending point for the data in the flow.
Channels – This is the buffer that connects sources to sinks.
Here is the initial configuration file that we will use:
# FlumeAgentExample.conf: A single-node Flume configuration

# Name the components on this agent
agent1.sources = avro-source
agent1.sinks = hdfs-sink
agent1.channels = mem-channel

# Avro source
agent1.sources.avro-source.type = avro
agent1.sources.avro-source.bind = HOST
agent1.sources.avro-source.port = 44445

# HDFS sink
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = /user/flume/FlumeData
agent1.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink.hdfs.rollSize = 0
agent1.sinks.hdfs-sink.hdfs.rollCount = 0

# Use a channel which buffers events in memory
agent1.channels.mem-channel.type = memory
agent1.channels.mem-channel.capacity = 1000
agent1.channels.mem-channel.transactionCapacity = 100

# Bind the source and sink to the channel
agent1.sources.avro-source.channels = mem-channel
agent1.sinks.hdfs-sink.channel = mem-channel
This file sets up an Avro source (meaning the agent listens on a port for an Avro connection) and an HDFS sink (meaning that the data winds up in HDFS) that are connected with a memory channel (a buffer in memory).
Make sure to put your server host name in for the HOST value in the source configuration.
Put this file in /usr/share/flume named FlumeAgentExample.conf.
Agent Launch Script
Now that the configuration file is set up, there is one more step: creating the script that will launch the Flume agent. Here is a simple script that can be used to run the agent with the configuration file above.
flume-ng agent --name agent1 --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/FlumeAgentExample.conf -Dflume.root.logger=INFO,console
One particular thing to pay attention to in this script is the value passed to --name. This value must match the prefix used on the property names in the configuration file (agent1 in the example above), or else the configuration will not be found.
This script will log output to the console instead of the log file. To log to the log file, you can remove the final part of the launch command: -Dflume.root.logger=INFO,console
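For example, the same launch command logging to the log file instead of the console would be:
flume-ng agent --name agent1 --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/FlumeAgentExample.conf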
Place this file in /usr/share/flume/bin/flumeagent.sh and make it executable using chmod +x.
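For example:
flume> chmod +x /usr/share/flume/bin/flumeagent.sh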
Run the Agent
Before getting started, make sure the directory where you want to put the Flume data exists. Switch to the flume user and create the directory specified in the agent configuration file.
> su - flume
flume> hadoop fs -mkdir /user/flume/FlumeData
Now all that’s left to do is start the agent:
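Assuming you placed the launch script as described above, this is just a matter of running it as the flume user:
flume> /usr/share/flume/bin/flumeagent.sh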
You will see a long command line echoed to the screen, followed by the log output. The very last line you should see is something like:
01 Jun 2014 15:49:40,913 INFO [lifecycleSupervisor-1-3] (org.apache.flume.source.AvroSource.start:245) - Avro source avro-source started.
This tells you that your source has been started.
Trying It All Out
Now that the agent is running, it’s time to try out logging data to it. In a separate session on your Hadoop box, as your own user (not the flume user), create this script:
#!/bin/bash
if [ $# -eq 0 ]; then
    echo No file specified
else
    echo Processing $1...
    /usr/share/flume/bin/flume-ng avro-client --conf /usr/share/flume/conf --host HOST --port 44445 -Dflume.root.logger=INFO,console -F $1
fi
As before, replace HOST with your server’s host name.
Save this file as sendtoflume.sh. Make it executable using chmod +x.
Now try out the Avro client, sending data to Hadoop, by running:
> ./sendtoflume.sh <some file path>
This will run an Avro client that will read the file passed in and send the data to Flume. In the Flume log, you should see a line like:
01 Jun 2014 16:07:38,380 INFO [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.hdfs.BucketWriter.open:261) - Creating /user/flume/FlumeData/FlumeData.1401656858163.tmp
followed by a line like:
01 Jun 2014 16:08:09,666 INFO [hdfs-hdfs-sink-call-runner-0] (org.apache.flume.sink.hdfs.BucketWriter$8.call:669) - Renaming /user/flume/FlumeData/FlumeData.1401656858163.tmp to /user/flume/FlumeData/FlumeData.1401656858163
This indicates that Flume has written the data.
Note: The number after the “FlumeData.” in the file name is related to the time that the data was read. It will vary from run to run.
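Since the name changes each time, you can list the sink directory to find the file that was just written:
> hadoop fs -ls /user/flume/FlumeData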
Verifying the Results
To check the results, use the hadoop fs -cat command to display the file and pipe the result to cksum to get the checksum:
> hadoop fs -cat /user/flume/FlumeData/FlumeData.1401656858163 | cksum
1316844908 73739
Then cat the original file you used to see its checksum:
> cat <some file path> | cksum
1316844908 73739
Your numbers won't be the same as the ones shown here, but the output of the hadoop fs -cat command should match the output of the cat command you ran.
Assuming this all matches up, you have successfully set up Flume!