Most of the changes in this release fall under the heading of performance improvements. I can’t offer scientifically valid metrics, but my projects are running noticeably faster. Four enhancements account for most of that.
One, we now skip the reducer if there is no Group in the assembly. This has been on the list for a couple of releases, but it is in now and works great.
Two, when writing out ‘key’ and ‘value’ tuples to the reducer from the mapper, the ‘key’ values are removed from the ‘value’ tuple, as they are redundant. This reduces the bandwidth used in the ‘copy’ phase and speeds up the ‘sort’. At the reducer, the ‘key’ and ‘value’ tuples are merged back together. This was also on the list for a while, but we needed to do some internal refactoring before it could be done reliably.
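To make the key/value split concrete, here is a minimal sketch in plain Java. It is not Cascading’s internal code: plain lists stand in for tuples, the class and method names are made up for illustration, and it assumes the merge restores the original field order and that key positions are distinct.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: these names are hypothetical, not part of Cascading.
class KeyValueSplit {
    private static boolean isKey(int index, int[] keyPositions) {
        for (int pos : keyPositions) {
            if (pos == index) return true;
        }
        return false;
    }

    // Mapper side: pull the grouping key fields out of the full tuple.
    static List<Object> extractKeys(List<Object> tuple, int[] keyPositions) {
        List<Object> key = new ArrayList<>();
        for (int pos : keyPositions) key.add(tuple.get(pos));
        return key;
    }

    // Mapper side: the value tuple carries only the non-key fields.
    static List<Object> stripKeys(List<Object> tuple, int[] keyPositions) {
        List<Object> value = new ArrayList<>();
        for (int i = 0; i < tuple.size(); i++) {
            if (!isKey(i, keyPositions)) value.add(tuple.get(i));
        }
        return value;
    }

    // Reducer side: put key fields back at their positions, fill the rest from the value.
    static List<Object> merge(List<Object> key, List<Object> value, int[] keyPositions) {
        Object[] out = new Object[key.size() + value.size()];
        boolean[] filled = new boolean[out.length];
        for (int k = 0; k < keyPositions.length; k++) {
            out[keyPositions[k]] = key.get(k);
            filled[keyPositions[k]] = true;
        }
        int next = 0;
        for (int i = 0; i < out.length; i++) {
            if (!filled[i]) out[i] = value.get(next++);
        }
        return new ArrayList<>(Arrays.asList(out));
    }
}
```

With a four-field tuple grouped on two fields, only the two non-key fields travel in the value, which is where the ‘copy’ phase savings come from.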
Three, the tuple stream pipeline has been optimized. Every ‘collect’ immediately passes through to the next operator. That is, every tuple in the stream is committed to the output stream before the next is handled. This has mostly been the case for some time, but we finally refactored out the final inefficiencies that were in place to support some edge cases.
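The pass-through behavior can be pictured as a chain of collectors, each of which hands its result to the next operator before returning. The sketch below is an illustration only; the Collector interface and the stage helper are hypothetical and are not Cascading classes.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for an operator's output; not a Cascading type.
interface Collector {
    void collect(String tuple);
}

class PushPipeline {
    // Builds a stage that records the tuple it saw, applies its operation,
    // and forwards the result downstream before returning to the caller.
    static Collector stage(String label, Function<String, String> op,
                           Collector next, List<String> trace) {
        return tuple -> {
            trace.add(label + ":" + tuple);
            next.collect(op.apply(tuple));
        };
    }
}
```

Because each collect call runs the whole downstream chain, the first tuple reaches the end of the pipeline before the second one is even looked at.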
Four, Cascades now execute Flows in parallel if there are no dependencies. Flows have run Hadoop jobs in parallel for some time, but now this behavior is shared at the Cascade abstraction.
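For readers curious what dependency-aware parallel execution looks like in general, here is a rough sketch using the standard java.util.concurrent classes. It is not how Cascade is implemented; the class name, the map-based dependency description, and the pool size are all assumptions made for the example, and it assumes the dependencies are acyclic and every named flow has an entry in both maps.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical scheduler: a unit starts as soon as everything it depends on has finished.
class ParallelFlowRunner {
    // deps maps a flow name to the names that must complete first.
    // Returns the names in the order they finished.
    static List<String> runAll(Map<String, List<String>> deps, Map<String, Runnable> work) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        Map<String, CompletableFuture<Void>> futures = new HashMap<>();
        List<String> completed = Collections.synchronizedList(new ArrayList<String>());
        try {
            for (String name : deps.keySet()) {
                schedule(name, deps, work, futures, pool, completed);
            }
            // Block until every flow has finished (or one has failed).
            CompletableFuture.allOf(futures.values().toArray(new CompletableFuture[0])).join();
        } finally {
            pool.shutdown();
        }
        return new ArrayList<>(completed);
    }

    private static CompletableFuture<Void> schedule(
            String name, Map<String, List<String>> deps, Map<String, Runnable> work,
            Map<String, CompletableFuture<Void>> futures, ExecutorService pool,
            List<String> completed) {
        CompletableFuture<Void> existing = futures.get(name);
        if (existing != null) return existing;
        List<CompletableFuture<Void>> upstream = new ArrayList<>();
        for (String dep : deps.getOrDefault(name, Collections.<String>emptyList())) {
            upstream.add(schedule(dep, deps, work, futures, pool, completed));
        }
        // Independent flows have no upstream futures, so they are submitted right away.
        CompletableFuture<Void> future = CompletableFuture
                .allOf(upstream.toArray(new CompletableFuture[0]))
                .thenRunAsync(() -> {
                    work.get(name).run();
                    completed.add(name);
                }, pool);
        futures.put(name, future);
        return future;
    }
}
```

Two flows with no dependency between them are free to run at the same time, while a flow that consumes both of their outputs waits for both.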
Also there are some new features.
The S3HttpFileSystem is now read-write. This filesystem is used to access ‘normal’ files on S3, unlike the Hadoop S3FileSystem.
Flows now support FlowListeners that can be notified of various events during a Flow life-cycle. We use it to post messages to SQS when a flow completes.
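The listener idea can be sketched as below. To keep the example self-contained it uses a made-up LifecycleListener interface and a toy ListenableFlow class rather than the real Cascading types, so the method names and signatures here are assumptions, and an in-memory list stands in for the SQS queue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical listener interface; the real FlowListener signatures may differ.
interface LifecycleListener {
    void onStarting(String flowName);
    void onCompleted(String flowName);
}

// Collects messages in memory where a real listener might post to a queue service.
class QueueNotifier implements LifecycleListener {
    final List<String> messages = new ArrayList<>();

    @Override
    public void onStarting(String flowName) {
        // Nothing to announce when the flow starts.
    }

    @Override
    public void onCompleted(String flowName) {
        messages.add("completed: " + flowName);
    }
}

// Toy stand-in for a flow that notifies its listeners around the work it does.
class ListenableFlow {
    private final String name;
    private final Runnable body;
    private final List<LifecycleListener> listeners = new ArrayList<>();

    ListenableFlow(String name, Runnable body) {
        this.name = name;
        this.body = body;
    }

    void addListener(LifecycleListener listener) {
        listeners.add(listener);
    }

    void complete() {
        for (LifecycleListener l : listeners) l.onStarting(name);
        body.run();
        for (LifecycleListener l : listeners) l.onCompleted(name);
    }
}
```

The flow itself knows nothing about where notifications go; swapping the in-memory list for a real message queue only touches the listener.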
Taps can now specify that they should be written to directly, bypassing the ‘default’ Hadoop collector (in the map or reduce phase). This is useful if you need to write data to a special file type or location and don’t want to write your own Hadoop FileSystem class. It is also a workaround for a bug in Hadoop that prevents custom FileSystems from being used if they are loaded from user-space libraries. A side effect is that user code can write out tuples directly via a Tap instance (great for tests or scripts).
Any Hfs Tap instance referencing a file path starting with file:// (Lfs does this by default) will force the current Hadoop job to run in ‘local’ mode. If this job ran on a cluster, the local file would not be visible to the remote job task, so it must run locally. Note that only that one Hadoop job runs locally, not the whole Flow or Cascade. This allows developers to write applications that load HDFS from local files and then spawn clustered jobs, and these local loading Flows can incorporate filters and operations that clean and format the data on load. If the file passed to the script is not a file:// path but an http:// or s3tp:// one, the job will run in clustered mode.
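The rule boils down to a check on the URI scheme of each path a job touches. The sketch below shows that decision with java.net.URI; the class and method names are hypothetical and the real check inside Cascading may look quite different.

```java
import java.net.URI;
import java.util.List;

// Hypothetical helper: decides whether a job must run in local mode.
class JobModeChooser {
    // True only for paths with an explicit file:// scheme.
    static boolean isLocalPath(String path) {
        String scheme = URI.create(path).getScheme();
        return "file".equalsIgnoreCase(scheme);
    }

    // A single local path is enough to force the job into local mode.
    static boolean runsLocally(List<String> tapPaths) {
        for (String path : tapPaths) {
            if (isLocalPath(path)) return true;
        }
        return false;
    }
}
```

A job reading file:///tmp/input.txt and writing to an hdfs:// path would run locally under this rule, while a job whose paths are all hdfs://, http://, or s3tp:// would stay on the cluster.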
Finally there are a few incompatible changes. The major change was that the Cascade class was moved to the cascading.cascade package. The other API changes are less likely to show up in user code.
We are very happy with this release, and we trust you will be too.