It is hard to ignore all of the hype around Hadoop and Big Data these days. Like most infrastructure engineers, I tend to focus on how to build highly available, highly scalable networks. Still, it is important to me to keep up with popular trends and implement projects around them, directly infrastructure related or not, especially when I can apply the project in some way to the infrastructure. With that, here is my first Hadoop project, which uses netflow, nfdump (nfcapd), Hadoop/HDFS and Hive.
The end result is being able to query historical netflow data from a Hadoop data store. When you think about it, Hadoop is a natural repository for the netflow data written by nfdump: it handles extremely large data repositories (large numbers of large files) and supports batch querying of that data for analysis.
I don't claim to be a Hadoop expert, so I'm interested in any feedback from people who have built on this example. I will build on it and post about it over the next couple of weeks; I'm most interested in setting up a multi-node cluster and comparing performance.
I'm focusing on the 'undocumented' aspect of this project, which is integrating Hadoop and nfdump. There are plenty of articles that explain how to install nfcapd, enable netflow on a switch/router and install Hadoop, so I'll only give a few pointers below. My general setup was a Juniper SRX210H and an Ubuntu 12.04 server.
First, some basic advice — do not attempt to install hadoop on your own from source. It isn’t worth the pain. Use your operating system’s pre-built packages. There are also pre-built ISOs you can use.
- Install Hadoop, Hive, Java, YARN, etc. I used Cloudera's Hadoop packages for a single-node install; I haven't tried a multi-node install yet. There are plenty of references online for this, and I have my own here.
- Do all the configuration of hadoop/hdfs – for example, initialize a file system, copy some files into hdfs, try the hadoop fs commands to manage files, change permissions, etc
- Do a basic hello world example for hadoop – here is an example I used
- Install nfcapd / nfdump. I chose to compile from source; it is straightforward.
- Configure netflow on your router
- Verify that you’re receiving flows (look in your flows folder) and display a sample netflow file (nfdump -r $filename)
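A couple of quick nfdump commands make that last verification step easy. The flows directory below is just where my nfcapd happens to write; adjust the paths for your own install.
# read the first ten flow records from everything under the flows directory
/opt/nfdump-1.6.10p1/bin/nfdump -R /opt/nfdump-1.6.10p1/flows/10.0.0.1 -c 10
# quick top-talkers check (top 5 IPs by flow count) to confirm the data looks sane
/opt/nfdump-1.6.10p1/bin/nfdump -R /opt/nfdump-1.6.10p1/flows/10.0.0.1 -s ip/flows -n 5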
If you've got netflow and the Hadoop 'hello world' example working, you're 75% of the way there. The last step is to integrate nfdump and Hadoop by following the steps below.
Basic architecture of Hadoop
It is important to have some foundational knowledge of Hadoop and associated components. I will attempt to highlight my current understanding. Most of this background information is extracted from Hadoop: The Definitive Guide. Major points from this guide are:
- The main components of Hadoop are HDFS and MapReduce
- HDFS is a file system optimized for high throughput and can replicate files across different nodes to improve fault tolerance
- MapReduce is a programming model and allows parallel processing across many nodes and over large amounts of data
- HDFS can really store any kind of data – binary, text, etc – but that doesn’t mean it SHOULD be used for any kind of data
- Hive is data warehouse software that projects structure onto large data-sets – it supports SQL-Like access to structured data (in this case the structured data are CSV files in HDFS)
- A Hive table is data and metadata stored in HDFS (it looks like files in a regular filesystem although you don’t use the usual operating system tools to manage the filesystem)
- You cannot UPDATE data within a Hive table as you could in a traditional SQL database. The concepts are different. You could OVERWRITE or APPEND data on an existing table when you load data.
So – what is the benefit to using Hadoop to store Netflow data? In summary — data fault tolerance, a file system optimized for large data sets, and support for parallel processing across many nodes.
Initial testing and shake out
It is always good to 'try it once and see if it works', and this is no exception. The setup is complicated enough without adding automation on top of it, so let's first walk through the process once by hand.
I should note that the basic structure of this table is to use an external repository (although still in HDFS). External in the Hive world just means that the data is maintained outside of the Hive warehouse. The primary benefit is that if you drop the table, your source data (the CSVs) is not also removed, only the table metadata. If you drop a table with a Hive-managed (internal) setup, you also lose the original data (the CSVs). There is no performance difference between an internal and an external table, so the choice comes down to preference. In reality, a 'load' operation into an internally managed Hive table is just a file copy within HDFS. A Hive-managed data store only requires some basic modification to the approach below; more discussion on this topic is here.
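For comparison, the Hive-managed version of the same idea might look roughly like the sketch below. The table name netflow_managed and the staging path are only placeholders for illustration; the point is that a managed table has no EXTERNAL keyword or LOCATION clause, data is pulled in with a load statement, and dropping the table removes the data along with the metadata.
# managed (internal) table: no EXTERNAL keyword and no LOCATION clause
hive -e "create table netflow_managed (date1 string, date2 string, sec string, srcip string, dstip string, srcport int, dstport int, protocol string) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile"
# 'load' relocates the HDFS file into the Hive warehouse; add OVERWRITE before INTO to replace existing data instead of appending
hive -e "load data inpath '/user/netflow/staging/samplefile' into table netflow_managed"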
Step #1 – Create a netflow folder in your hdfs
hadoop fs -mkdir /user/netflow /user/netflow/netflow /user/netflow/netflow/input
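Depending on your Hadoop version and how permissions are set up, you may need to run this as the hdfs superuser and use -p to create the parent directories in one shot. Something like the following, where the hdfs user name is whatever your distribution uses:
sudo -u hdfs hadoop fs -mkdir -p /user/netflow/netflow/input
sudo -u hdfs hadoop fs -chown -R hdfs /user/netflow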
Step #2 – Dump some csv data from nfdump. The CSV format is critical and you’ll see why in step #4.
nfdump -q -r <provide_file_name> -o csv > /var/tmp/samplefile
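Before moving on, it's worth eyeballing the first record of the CSV, since the column order is exactly what the table definition in step #4 has to match: the first handful of fields should be the start/end timestamps, duration, source/destination IPs, source/destination ports and protocol used below.
# print the first record with one field per line so the column order is easy to see
head -1 /var/tmp/samplefile | tr ',' '\n' | head -10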
Step #3 – Copy a simple sample file into hdfs
hadoop fs -put /var/tmp/samplefile /user/netflow/netflow/input
Step #4 – Create a table in Hive. Hive really is the beauty of this whole thing: it abstracts away the intricacies of Hadoop that the average network engineer doesn't really want to know about.
Run ‘hive’ from the command line and you’ll get a hive> prompt
Execute the following to create a table – note that this simple example doesn’t load all of the nfdump fields into the table. Extend with more columns as you desire.
create external table netflow (date1 string, date2 string, sec string, srcip string, dstip string, srcport int, dstport int, protocol string) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile location '/user/netflow/netflow/input';
It isn't necessary to do anything else for now. Because the table's location points at the files already in HDFS, simply creating the table also makes the data loaded and queryable.
You can do something now as simple as ‘select * from netflow‘ and you’ll see data returned!
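Once that returns rows, slightly more useful queries work the same way. Here is one example against the columns defined above (just an illustration; run it from the shell with hive -e, or paste the quoted SQL at the hive> prompt):
# top ten source/destination pairs by number of flows
hive -e "select srcip, dstip, count(*) as flows from netflow group by srcip, dstip order by flows desc limit 10"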
Make the process repeatable
You’ve successfully tried it once so let’s make it repeatable and integrate nfdump and hadoop for good. This just requires some simple shell scripts.
Step #1 – Create automation shell script
We have to create a simple shell script that nfcapd will execute every time a capture file is rotated and a new capture file is created. This process will take the previous capture file, export it to csv, then move it into hdfs.
The FULLPATH variable will be a command line argument provided by nfcapd; it will be the name of the file that was just rotated out. The script simply takes FULLPATH, determines just the file name (the cut command), dumps the capture file to a CSV in /var/tmp, imports that file into HDFS, then deletes the temp file. The script below will obviously require some basic configuration changes for your local setup. I called the file nfhadoop.sh.
#!/bin/sh
# called by nfcapd with the rotated capture file (relative path) as the only argument
FULLPATH=$1
# extract just the file name from the relative path (e.g. 2013/11/11/nfcapd.201311112025)
FILE=$(echo $FULLPATH | cut -d"/" -f4)
echo "processing file $FILE $FULLPATH"
/opt/nfdump-1.6.10p1/bin/nfdump -q -r /opt/nfdump-1.6.10p1/flows/10.0.0.1/$FULLPATH -o csv > /var/tmp/$FILE
echo "copying file to hdfs"
su hdfs -c "hadoop fs -put /var/tmp/$FILE /user/netflow/netflow/input"
# clean up the temporary csv
rm /var/tmp/$FILE
#you could also delete the file in the nfdump repository if you want to save disk space and trust hadoop
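It's worth making the script executable and running it once by hand, passing a relative path of the same form nfcapd will pass (the capture file name below is just an example; use one that actually exists in your flows directory), then confirming the file showed up in HDFS.
chmod +x /opt/nfdump-1.6.10p1/bin/nfhadoop.sh
/opt/nfdump-1.6.10p1/bin/nfhadoop.sh 2013/11/11/nfcapd.201311112025
hadoop fs -ls /user/netflow/netflow/input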
Step #2 – Change startup scripts to execute file rotation script
Getting nfcapd to execute the script is as simple as adding this to the command line in your startup script, or to the command string you use when starting nfcapd. Modify for your local environment.
-x "/opt/nfdump-1.6.10p1/bin/nfhadoop.sh %f"
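For context, a complete nfcapd invocation with the hook attached might look something like the line below. The port, flow directory and the -S sub-directory option are assumptions for illustration and vary by nfcapd version, so check the options your own build supports.
/opt/nfdump-1.6.10p1/bin/nfcapd -D -p 9995 -l /opt/nfdump-1.6.10p1/flows/10.0.0.1 -S 1 -x "/opt/nfdump-1.6.10p1/bin/nfhadoop.sh %f"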
Step #3 – Test!
You can monitor syslog and you'll see something like this indicating that the script is being run successfully. As mentioned above, I'd suggest running the script manually first with a file name as the argument so you can watch for errors or warnings.
Nov 11 20:30:03 may-nfc-home nfcapd: Launcher: ident: SRX run command: '/opt/nfdump-1.6.10p1/bin/nfhadoop.sh 2013/11/11/nfcapd.201311112025'
You can also view the files in hdfs
hadoop fs -ls /user/netflow/netflow/input
Found 315 items
-rw-r--r--   1 hdfs supergroup     370977 2013-11-10 19:03 /user/netflow/netflow/input/nfcapd.201311101740
-rw-r--r--   1 hdfs supergroup     364730 2013-11-10 19:03 /user/netflow/netflow/input/nfcapd.201311101745
-rw-r--r--   1 hdfs supergroup     312235 2013-11-10 19:25 /user/netflow/netflow/input/nfcapd.201311101920
-rw-r--r--   1 hdfs supergroup     291227 2013-11-10 19:30 /user/netflow/netflow/input/nfcapd.201311101925
-rw-r--r--   1 hdfs supergroup     215707 2013-11-10 19:35 /user/netflow/netflow/input/nfcapd.201311101930
And lastly – query the data in hive
hive> select * from netflow limit 20;
2013-11-10 17:54:04 2013-11-10 17:54:04 0.000 10.0.0.19 10.0.0.255 138 138 UDP
Time taken: 0.659 seconds
Some ideas on next projects
To build on this exercise, here are some interesting ideas for next projects:
1) Store your data in a Cloud service like Microsoft/Windows Azure
2) Do a multinode configuration
3) Look at ways to improve query speed through data partitioning by date, etc
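On idea 3 in particular, the rough shape I have in mind is to declare a date partition column and register each day's directory of CSVs as a partition, so queries filtered by date only scan that directory. This is untested and the table and path names are only placeholders, but it sketches the idea.
hive -e "create external table netflow_part (date1 string, date2 string, sec string, srcip string, dstip string, srcport int, dstport int, protocol string) partitioned by (flowdate string) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile location '/user/netflow/netflow/partitioned'"
# register one day's directory as a partition, then query with a filter on the partition column
hive -e "alter table netflow_part add partition (flowdate='2013-11-11') location '/user/netflow/netflow/partitioned/2013-11-11'"
hive -e "select count(*) from netflow_part where flowdate='2013-11-11'"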