Load provides a simple command-line interface for building high-load jobs on a cluster, based on Cascading.
The open source repo is available on GitHub at https://github.com/Cascading/cascading.load, where the README.md file has build and installation instructions and the COMMANDS.md file has a full list of command line options.
Load can be run in Hadoop as a JAR file:
hadoop jar load.jar param1 param2 .. paramN
Or, after installing it on your laptop or server, as a command suitable for use in bash scripts, cron jobs, and so on:
cascading.load param1 param2 .. paramN
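Since the installed command is suitable for scripting, a nightly cron job might wrap it in a small bash script. The sketch below only composes and prints the invocation; the jar path and arguments are assumptions carried over from the examples in this section, not required values:

```shell
# Minimal cron-style wrapper (a sketch; jar path and args are assumptions).
JAR=load.jar
ARGS="--generate -I output/nop -O output/gendata"

# In a real script this line would be run, not echoed.
echo "hadoop jar $JAR $ARGS"
```

A crontab entry pointing at such a script is all that is needed to collect baseline runs on a schedule.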
Why use Load?
There are a number of good reasons for having a library of functional load tests handy:
* Generate datasets for load testing your Hadoop cluster.
* Produce a consistent set of baseline metrics for your Hadoop cluster.
Baseline metrics become particularly useful when you need to modify your cluster. Whether you are tuning Hadoop configuration settings, upgrading hardware, or modifying the switch fabric, Load can produce baseline metrics to help give you an objective, quantitative basis for comparisons.
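A before/after comparison of two baseline runs reduces to simple arithmetic on elapsed times. Here is a sketch in shell using awk; the two timings are made-up illustrative numbers, not measurements:

```shell
# Hypothetical elapsed times for the same Load job, before and after
# a cluster change (illustrative numbers, not real measurements).
BEFORE_SECS=412
AFTER_SECS=367

awk -v b="$BEFORE_SECS" -v a="$AFTER_SECS" \
  'BEGIN { printf "speedup: %.2fx\n", b / a }'
```

Keeping this comparison as a one-liner next to the job logs makes the "objective, quantitative basis" concrete: every tuning change gets a number.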
For example, the generate data app runs as a single mapper, which makes negligible use of HDFS reads but substantial use of HDFS writes. Since there is no reducer, there is no shuffle phase. So the generate data app provides an excellent way to obtain baseline metrics for HDFS write throughput.
hadoop jar load.jar --generate -I output/nop -O output/gendata
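To turn a timed run of that command into a write-throughput figure, divide bytes written (available from the job's HDFS counters) by elapsed seconds. The numbers below are illustrative placeholders, not output from a real run:

```shell
# Illustrative figures: 100 GB written by --generate in 512 seconds.
# In practice, read bytes written from the job's HDFS_BYTES_WRITTEN counter
# and elapsed time from `time hadoop jar load.jar --generate ...`.
BYTES_WRITTEN=107374182400
ELAPSED_SECS=512

awk -v b="$BYTES_WRITTEN" -v s="$ELAPSED_SECS" \
  'BEGIN { printf "write throughput: %.1f MB/s\n", b / s / 1048576 }'
```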
The consume data app provides a complement: it is also a single mapper, one that reads the data produced by generate data and makes negligible use of HDFS writes. Thus the consume data app provides an excellent way to obtain baseline metrics for HDFS read throughput. For another example, the count sort app provides a great way to measure the cost of the shuffle phase on a given cluster configuration.
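By analogy with the --generate example above, the read-side baseline could be invoked as follows; verify the exact flag spelling and the output path against the project's COMMANDS.md, as both are assumptions here:

```shell
hadoop jar load.jar --consume -I output/gendata -O output/consume
```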