Or, after installing on your laptop or server, as a command suitable for use in bash scripts, cron jobs, etc.:
cascading.load param1 param2 .. paramN
Why use Load?
There are a number of good reasons for having a library of functional load tests handy:
Generate datasets for load testing your Hadoop cluster.
Produce a consistent set of baseline metrics for your Hadoop cluster.
Baseline metrics become particularly useful when you need to modify your cluster. Whether you are tuning Hadoop configuration settings, upgrading hardware, or modifying the switch fabric, Load can produce baseline metrics to help give you an objective, quantitative basis for comparisons.
For example, the generate data app is only one mapper, which makes negligible use of HDFS reads, but substantial use of HDFS writes. Since there is no reducer there is no shuffle phase. So the generate data app provides an excellent way to obtain baseline metrics for HDFS write throughput.
hadoop jar load.jar --generate -I output/nop -O output/gendata
The consume data app provides a complement, since it is only one mapper which reads the data from generate data and makes negligible use of HDFS writes. There the consume data app provides an excellent way to obtain baseline metrics for HDFS read throughput. For another example, the count sort app provides a great way to measure the cost of the shuffle phase on a given cluster configuration.