Load provides a simple command line interface for building high-load test jobs on a cluster, based on Cascading.

The open source repo is available on GitHub at https://github.com/Cascading/cascading.load where the “README.md” file provides build and installation instructions, and the “COMMANDS.md” file lists the full set of command line options.

Load can be run in Hadoop as a JAR file:

hadoop jar load.jar param1 param2 .. paramN

Or, after installing on your laptop or server, as a command suitable for use in bash scripts, cron jobs, etc.:

cascading.load param1 param2 .. paramN

Why use Load?

There are a number of good reasons for having a library of functional load tests handy:

  • Generate datasets for load testing your Hadoop cluster.
  • Produce a consistent set of baseline metrics for your Hadoop cluster.

Baseline metrics become particularly useful when you need to modify your cluster. Whether you are tuning Hadoop configuration settings, upgrading hardware, or modifying the switch fabric, Load can produce baseline metrics to help give you an objective, quantitative basis for comparisons.

For example, the generate data app runs only one mapper, which makes negligible use of HDFS reads but substantial use of HDFS writes. Since there is no reducer, there is no shuffle phase. The generate data app therefore provides an excellent way to obtain baseline metrics for HDFS write throughput:

hadoop jar load.jar --generate -I output/nop -O output/gendata

The consume data app provides a complement: it also runs only one mapper, which reads the data produced by generate data and makes negligible use of HDFS writes. Thus the consume data app provides an excellent way to obtain baseline metrics for HDFS read throughput. For another example, the count sort app provides a great way to measure the cost of the shuffle phase on a given cluster configuration.
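Following the pattern of the generate example above, the consume and count sort apps might be invoked as shown below. This is a sketch: the --consume and --countsort flags and the output paths are assumptions based on the naming pattern in the generate example; consult the “COMMANDS.md” file in the repo for the authoritative option names.

hadoop jar load.jar --consume -I output/gendata -O output/consume

hadoop jar load.jar --countsort -I output/gendata -O output/countsort

Comparing the job counters from these runs against the generate baseline isolates HDFS read throughput and shuffle cost, respectively.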