Cascading 0.7.0 Released

Version 0.7.0 of Cascading is now available for download. For details on new features and bug fixes, see the CHANGES.txt file. This is a major release with many new features and some incompatible API changes, so please read on.

The changes fall into two groups: new features and incompatible API changes. Sorry, but this release could break your existing code. It’s all for the good, though.

First, the most important change in this release is API compatibility with Hadoop 0.17.x. If you have not upgraded Hadoop yet, your Cascading jobs will fail because of changes in the Hadoop Core API.

For new features…

We now support custom MapReduce jobs via c.f.MapReduceFlow. Just hand it your custom JobConf instance and call flow.start(). Or add the MapReduceFlow to a c.c.CascadeConnector to let it participate in a larger workflow.

c.p.GroupBy has new merge capabilities. Multiple input branches can now be grouped as if they were a single stream. This is not a join; it is a way to treat multiple inputs as one stream, provided they all share the same fields.
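To show how a merge differs from a join, here is a minimal plain-Java sketch that does not use the Cascading API. The class MergeGroupSketch and its mergeAndGroup method are hypothetical names used only for illustration, and each tuple is assumed to be a simple {key, value} pair of strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: not Cascading code. Shows merge-then-group semantics.
class MergeGroupSketch {

    // Each String[] is a hypothetical two-field tuple: {key, value}.
    // All branches share the same fields, so they can be read as one stream.
    @SafeVarargs
    static Map<String, List<String>> mergeAndGroup(List<String[]>... branches) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (List<String[]> branch : branches) {
            for (String[] tuple : branch) {
                // Tuples from every branch land in the same group for their key.
                groups.computeIfAbsent(tuple[0], k -> new ArrayList<>()).add(tuple[1]);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String[]> first = Arrays.asList(new String[]{"a", "1"}, new String[]{"b", "2"});
        List<String[]> second = Arrays.asList(new String[]{"a", "3"}, new String[]{"c", "4"});
        System.out.println(mergeAndGroup(first, second));
    }
}
```

Note that the values for key "a" end up side by side in one group rather than being paired into wider tuples, which is what a join would do.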

We have added four new c.p.c.CoGrouper classes: c.p.c.OuterJoin, c.p.c.MixedJoin, c.p.c.LeftJoin, and c.p.c.RightJoin. They complement the default c.p.c.InnerJoin CoGrouper class, giving you all the join operations you could wish for.
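For readers less familiar with the join types, the following plain-Java sketch shows what each one keeps for a single grouping key. It does not use the Cascading API: JoinSemanticsSketch, JoinType, and join are hypothetical names, and the sketch assumes each side contributes a possibly empty list of values for the key. MixedJoin is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustration only: not Cascading code. Joins the values of one grouping key.
class JoinSemanticsSketch {

    enum JoinType { INNER, OUTER, LEFT, RIGHT }

    static List<String> join(JoinType type, List<String> left, List<String> right) {
        List<String> result = new ArrayList<>();
        if (left.isEmpty() && right.isEmpty()) {
            return result; // nothing to join for this key
        }
        // LEFT and OUTER keep unmatched left values; RIGHT and OUTER keep unmatched right values.
        boolean keepLeft = type == JoinType.LEFT || type == JoinType.OUTER;
        boolean keepRight = type == JoinType.RIGHT || type == JoinType.OUTER;
        // An empty side is padded with a single null only when the other side must be kept.
        List<String> lhs = (left.isEmpty() && keepRight) ? Collections.<String>singletonList(null) : left;
        List<String> rhs = (right.isEmpty() && keepLeft) ? Collections.<String>singletonList(null) : right;
        for (String l : lhs) {
            for (String r : rhs) {
                result.add(l + "," + r);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> left = Arrays.asList("a");
        List<String> right = Collections.emptyList();
        for (JoinType type : JoinType.values()) {
            System.out.println(type + " " + join(type, left, right));
        }
    }
}
```

With a value on the left and nothing on the right, only the left and outer variants emit a row, padding the missing side with null.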

The MapReduce planner will now force an intermediate file between branches with Hadoop-incompatible source Taps on joins/merges. If the Taps are compatible (have the same Scheme), all branches will be processed in the same Mapper before the c.p.Group. The intermediate file can carry a performance penalty, but it avoids some non-intuitive errors. You should write your Flow out to a DOT file to see if it does what you expect.

Also, c.f.Flow.stop() will kill all running jobs on the cluster. Flow will also, by default, register itself with the JVM as a shutdown hook, so Ctrl-C on your Cascading job will clean up after itself. You should no longer need to pick out the Cascading steps running on Hadoop and kill them manually.
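For anyone unfamiliar with JVM shutdown hooks, here is a minimal sketch of the general mechanism using only the standard library. It is not Cascading's implementation: ShutdownHookSketch and StoppableFlow are hypothetical stand-ins, and the real Flow may register its hook differently.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustration only: not Cascading code. Shows how a stop() method can be
// wired to JVM shutdown with Runtime.addShutdownHook.
class ShutdownHookSketch {

    // Hypothetical stand-in for a flow with jobs running on a cluster.
    static class StoppableFlow {
        private final AtomicBoolean stopped = new AtomicBoolean(false);

        void stop() {
            // compareAndSet keeps the cleanup from running twice.
            if (stopped.compareAndSet(false, true)) {
                System.out.println("stopping flow and killing its running jobs");
            }
        }

        boolean isStopped() {
            return stopped.get();
        }

        // Registers a hook so the JVM calls stop() on exit, including on Ctrl-C.
        Thread registerShutdownHook() {
            Thread hook = new Thread(this::stop, "flow-shutdown-hook");
            Runtime.getRuntime().addShutdownHook(hook);
            return hook;
        }
    }

    public static void main(String[] args) {
        StoppableFlow flow = new StoppableFlow();
        flow.registerShutdownHook();
        System.out.println("flow running; the hook calls stop() when the JVM exits");
    }
}
```

Returning the hook thread lets a caller remove it again with Runtime.removeShutdownHook if the flow finishes normally.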

For incompatible changes…

If you have written your own Operations (Function, Filter, Aggregator, etc.), you will need to change those classes to subclass c.o.BaseOperation instead of c.o.Operation. This is a cleaner design for Operations that cannot extend another class (c.o.BaseOperation in this case).

c.f.FlowConnector no longer takes a Hadoop JobConf object in its constructor. You must now pass in a Properties object with your raw Hadoop properties, or a Map<Object,Object> with the JobConf as a value. See MultiMapReducePlanner for convenient property setters.

The benefit is that FlowConnector is now decoupled from Hadoop and from any planner implementations that may be provided in the future. By populating a Properties object, you can pass custom parameters down the stack through the planner to the underlying Hadoop framework.
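As a rough sketch of the Properties approach, the following standard-library Java builds the kind of object described above. FlowPropertiesSketch and hadoopProperties are hypothetical names, the job name is a made-up value, and the two Hadoop property keys are shown only as examples of raw properties you might set. The FlowConnector call is left as a comment because it needs Cascading on the classpath.

```java
import java.util.Map;
import java.util.Properties;

// Illustration only: builds raw Hadoop properties without a JobConf.
class FlowPropertiesSketch {

    // Collects settings that would previously have been set on a JobConf.
    static Properties hadoopProperties(int reduceTasks) {
        Properties properties = new Properties();
        properties.setProperty("mapred.reduce.tasks", Integer.toString(reduceTasks));
        properties.setProperty("mapred.job.name", "example-flow"); // hypothetical job name
        return properties;
    }

    public static void main(String[] args) {
        Properties properties = hadoopProperties(4);

        // Properties extends Hashtable<Object,Object>, so it is already a Map<Object,Object>.
        Map<Object, Object> asMap = properties;
        System.out.println(asMap.get("mapred.reduce.tasks"));

        // With Cascading available, this object would be handed to the
        // FlowConnector constructor, as described in the text above.
    }
}
```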

As you can see, there is lots of good stuff in this release. Enjoy.