Cascading 0.5.0 Released

Version 0.5.0 of Cascading is now available for download. For details on new features and bug fixes, see the CHANGES.txt file. For a quick summary, read on.

By far the biggest change is support for sorting via the GroupBy operator. By default the ‘groupFields’ fields are sorted; to also sort fields that are not being grouped on, set the ‘sortFields’ argument. This allows the values for each grouping key to be sorted before they are handed to an Aggregator function.

Unfortunately, all the fields must be sorted in the same direction, either ascending or descending. You cannot have fields sort in differing orders. This is a limitation of Hadoop, and we will submit a patch when possible.
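To make the effect concrete, here is a minimal plain-Java sketch of what grouping with a secondary sort produces. It is not Cascading code: the Row class and the groupAndSort method are hypothetical names invented for this illustration, with 'key' standing in for the groupFields and 'value' for the sortFields. A single flag controls the direction of both, mirroring the limitation described above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; none of these names come from Cascading.
class SecondarySortSketch {

    // Hypothetical two-field record: 'key' is grouped on, 'value' is sorted within the group.
    static final class Row {
        final String key;
        final int value;

        Row(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    // Groups rows by key and sorts the values inside every group.
    // The same direction applies to the keys and to the values.
    static Map<String, List<Integer>> groupAndSort(List<Row> rows, boolean descending) {
        Comparator<String> keyOrder =
            descending ? Comparator.<String>reverseOrder() : Comparator.<String>naturalOrder();
        Comparator<Integer> valueOrder =
            descending ? Comparator.<Integer>reverseOrder() : Comparator.<Integer>naturalOrder();

        // TreeMap keeps the grouping keys in the chosen order.
        Map<String, List<Integer>> groups = new TreeMap<>(keyOrder);
        for (Row row : rows) {
            groups.computeIfAbsent(row.key, k -> new ArrayList<>()).add(row.value);
        }

        // Sort each group's values, as an aggregator would then see them.
        for (List<Integer> values : groups.values()) {
            values.sort(valueOrder);
        }
        return groups;
    }
}
```

An aggregator that walks each group then sees the values already in order, which is what makes operations such as "first" or "last" per key meaningful.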

Two new operations have been added, RegexSplitGenerator and ExpressionFilter. The first allows a regex pattern to split a Tuple value into multiple Tuple instances, rather than into new fields in a single Tuple instance as RegexSplit does. The second allows simple Java expressions to be used as a filter, in the same manner as ExpressionFunction.
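The difference between the two split operations is easiest to see side by side. The following is a rough plain-Java sketch, not the Cascading implementation: splitIntoFields and splitIntoRows are hypothetical helper names, and each inner list stands in for one Tuple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java illustration only; each inner List<String> stands in for one Tuple.
class SplitSketch {

    // RegexSplit-style: one input value becomes ONE row holding several fields.
    static List<List<String>> splitIntoFields(String value, String regex) {
        List<List<String>> rows = new ArrayList<>();
        rows.add(Arrays.asList(Pattern.compile(regex).split(value)));
        return rows;
    }

    // RegexSplitGenerator-style: one input value becomes SEVERAL rows of one field each.
    static List<List<String>> splitIntoRows(String value, String regex) {
        List<List<String>> rows = new ArrayList<>();
        for (String part : Pattern.compile(regex).split(value)) {
            rows.add(Collections.singletonList(part));
        }
        return rows;
    }
}
```

Given the value "a b c" and the pattern "\\s+", the first helper yields a single row with three fields, while the second yields three rows with one field each. The second shape is the one you want when every token should flow through the rest of the pipe as its own record, for example when counting words.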

Flows can now be skipped if the sink resource exists, whether or not it is stale. This is useful if the source ‘modified’ date is unreliable and it would be too costly to re-retrieve and re-process the source data to re-create the result data in the sink. This is especially true when fetching large data from a remote site over HTTP.
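The decision can be sketched roughly as follows. This is an illustration under assumptions, not the Cascading API: the Policy enum, its constant names, and the shouldSkip method are all hypothetical, and staleness is assumed here to mean that the source was modified after the sink.

```java
// Plain-Java illustration only; the names below are hypothetical, not Cascading's.
class SinkSkipSketch {

    enum Policy {
        SKIP_IF_SINK_FRESH,   // skip only when the sink exists and is not stale
        SKIP_IF_SINK_EXISTS   // skip whenever the sink exists, stale or not
    }

    // Timestamps are milliseconds since the epoch. Assumption: a sink is stale
    // when the source was modified after the sink was written.
    static boolean shouldSkip(Policy policy, boolean sinkExists,
                              long sourceModified, long sinkModified) {
        if (!sinkExists) {
            return false; // nothing to reuse, so the flow must run
        }
        if (policy == Policy.SKIP_IF_SINK_EXISTS) {
            return true;  // ignore the (possibly unreliable) modified dates
        }
        return sinkModified >= sourceModified;
    }
}
```

With the second policy the source timestamp is never consulted, which is the point when a remote server reports a modified date that cannot be trusted.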

Lastly, there were a few improvements to HttpFileSystem, most notably a fix for a bug where query string parameters were being munged. A source to a Flow can now be a dynamically generated page with a query string. The only limitation is that the Content-Length header must be returned with a valid value. Sadly, this isn’t true for many web services.

This release also cleans up a few corner cases in the API that were unearthed during the development of our Groovy-based scripting extension for Cascading and Hadoop. Our first release of that extension should be Real Soon Now.