Stream Assertions

Stream assertions are simply a mechanism to 'assert' that one or more values in a data stream meet certain criteria. This is similar to the Java language 'assert' keyword, or a unit test.

Examples include 'assertNotNull' and 'assertMatches'. Cascading provides several other assertions as well.

Assertions are treated like any other function or aggregator in Cascading. The developer embeds them directly into the data processing workflow, or 'pipe assembly'. By default, processing stops if an assertion fails. Alternatively, a failed assertion can trigger a Failure Trap.
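The behavior described above can be sketched in plain Java. This is only an illustration, not Cascading's API: the class 'StreamAssertionSketch', the method 'assertEach', and the exception 'AssertionFailed' are hypothetical names, and the sketch assumes that a trap simply collects the failing value and lets processing continue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class StreamAssertionSketch {
    /** Thrown when a value fails an assertion and no trap is configured. */
    static class AssertionFailed extends RuntimeException {
        AssertionFailed(String message) {
            super(message);
        }
    }

    /**
     * Passes every value that satisfies the check through unchanged.
     * A failing value goes to the trap if one is given (assumption);
     * otherwise processing stops with an exception.
     */
    static <T> List<T> assertEach(List<T> stream, Predicate<T> check, List<T> trap) {
        List<T> passed = new ArrayList<>();
        for (T value : stream) {
            if (check.test(value)) {
                passed.add(value);
            } else if (trap != null) {
                trap.add(value);
            } else {
                throw new AssertionFailed("assertion failed for value: " + value);
            }
        }
        return passed;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", null, "b");
        List<String> trap = new ArrayList<>();

        // With a trap, the null value is diverted and the rest flows on.
        System.out.println(assertEach(input, v -> v != null, trap));
        System.out.println(trap);

        // Without a trap, the first failing value stops processing.
        try {
            assertEach(input, v -> v != null, null);
        } catch (AssertionFailed e) {
            System.out.println("stopped: " + e.getMessage());
        }
    }
}
```

The point of the sketch is that the assertion sits in the stream like any other operation: values that pass are forwarded untouched, so downstream steps do not need to know the check exists.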

As with any test, sometimes they are wanted, and sometimes they are unnecessary. Thus stream assertions are embedded as either 'strict' or 'validating'.

When running tests against regression data, it makes sense to use strict assertions. This data should be small and represent many of the edge cases the processing assembly must support robustly. When running tests in staging, or with data that may vary in quality because it comes from an unmanaged source, validating assertions make more sense. Then there are cases where assertions just get in the way and slow down processing, and it would be better to bypass them entirely.

At runtime, Cascading can be instructed to 'plan out' strict, validating, or all assertions before building the final MapReduce jobs via the MapReduce Job Planner. The assertions are truly planned out of the resulting jobs, not just switched off, which provides the best performance.
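To make the difference between 'planned out' and 'switched off' concrete, here is a minimal sketch in plain Java. It is not Cascading's planner or API: 'AssertionPlanningSketch', 'Level', 'Step', and 'plan' are hypothetical names, and the ordering of levels (NONE below VALIDATING below STRICT) is an assumption made for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AssertionPlanningSketch {
    /** Hypothetical levels, ordered from least to most demanding. */
    enum Level { NONE, VALIDATING, STRICT }

    /** One step in an assembly; assertionLevel is null for ordinary operations. */
    static class Step {
        final String name;
        final Level assertionLevel;

        Step(String name, Level assertionLevel) {
            this.name = name;
            this.assertionLevel = assertionLevel;
        }
    }

    /**
     * Builds the list of steps that will actually run. Assertions above the
     * requested level never enter the plan, so no per-value check remains
     * for them once data starts flowing.
     */
    static List<String> plan(List<Step> assembly, Level keepUpTo) {
        List<String> planned = new ArrayList<>();
        for (Step step : assembly) {
            boolean isAssertion = step.assertionLevel != null;
            if (!isAssertion || step.assertionLevel.compareTo(keepUpTo) <= 0) {
                planned.add(step.name);
            }
        }
        return planned;
    }

    public static void main(String[] args) {
        List<Step> assembly = Arrays.asList(
            new Step("parse", null),
            new Step("assertNotNull", Level.STRICT),
            new Step("assertMatches", Level.VALIDATING),
            new Step("count", null));

        System.out.println(plan(assembly, Level.STRICT));     // keeps every assertion
        System.out.println(plan(assembly, Level.VALIDATING)); // drops strict assertions
        System.out.println(plan(assembly, Level.NONE));       // drops all assertions
    }
}
```

Because the filtering happens once, while the plan is built, the cost of an excluded assertion is zero at run time. A flag checked for every value would still cost a branch per record.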

This is just one feature of lazily building MapReduce jobs via a planner, instead of hard coding them.