Cascading 0.3.0 Released
Cascading 0.3.0 has just been packaged and is available for download from our downloads page. It incorporates many great changes; read on for more.
The biggest addition is read-only support for HTTP and S3. This support was pushed down into Hadoop, so any Hfs Tap instance can include remote resources with http(s):// or s3tp:// URLs. The s3tp URL is similar to the s3:// URL, where the authority part of the URL includes an AWS account key and secret.
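To make the URL shape concrete, here is a minimal sketch that pulls the pieces out of such a URL using only java.net.URI. It is not Cascading code: the class name S3tpUrlParts, the key, the secret, the bucket and the path are all made-up placeholders, and the key:secret@bucket layout is an assumption carried over from the Hadoop s3:// convention mentioned above.

```java
import java.net.URI;

// Hypothetical helper, not part of Cascading or Hadoop.
final class S3tpUrlParts {
    final String accessKey;
    final String secret;
    final String bucket;
    final String path;

    private S3tpUrlParts(String accessKey, String secret, String bucket, String path) {
        this.accessKey = accessKey;
        this.secret = secret;
        this.bucket = bucket;
        this.path = path;
    }

    // Assumes the layout scheme://KEY:SECRET@BUCKET/PATH.
    static S3tpUrlParts parse(String url) {
        URI uri = URI.create(url);
        String userInfo = uri.getUserInfo(); // the "KEY:SECRET" part of the authority
        if (userInfo == null || userInfo.indexOf(':') < 0) {
            throw new IllegalArgumentException("expected KEY:SECRET@ in the authority");
        }
        int colon = userInfo.indexOf(':');
        return new S3tpUrlParts(
            userInfo.substring(0, colon),
            userInfo.substring(colon + 1),
            uri.getHost(),
            uri.getPath());
    }

    public static void main(String[] args) {
        // Placeholder credentials, for illustration only.
        S3tpUrlParts parts =
            parse("s3tp://EXAMPLEKEY:exampleSecret@my-bucket/logs/part-00000.txt");
        // Deliberately print only the non-sensitive parts.
        System.out.println(parts.bucket + " " + parts.path);
    }
}
```

One practical caveat: real AWS secrets can contain characters such as "/" that are not legal in the authority as-is, so they would likely need to be percent-encoded before being embedded in a URL like this.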
There is also now experimental support for zip files. They aren’t recommended, but if you must read zip files from a remote source and they are line-oriented, a Tap using a TextLine scheme will automatically try to unzip the stream. If the zip contains more than one file, it will iterate over each entry serially. We may push this down into a Hadoop codec in a future release.
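For a feel of what reading a zip "serially, line by line" means, here is a rough sketch using only the JDK's java.util.zip classes. It is an illustration of the idea, not how Cascading implements it; the class ZipLineReader and its helper zipOf are hypothetical names invented for this example.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of Cascading or Hadoop.
final class ZipLineReader {

    // Reads every line of every file entry, one entry after another.
    static List<String> readLines(InputStream in) {
        List<String> lines = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // A fresh reader per entry; the zip stream reports end-of-stream
                // at the end of each entry. The reader is not closed here,
                // because closing it would close the whole zip stream.
                BufferedReader reader = new BufferedReader(
                    new InputStreamReader(zip, StandardCharsets.UTF_8));
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    // Builds an in-memory zip from (name, content) pairs, for the demo only.
    static byte[] zipOf(String... namesAndContents) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (int i = 0; i + 1 < namesAndContents.length; i += 2) {
                zip.putNextEntry(new ZipEntry(namesAndContents[i]));
                zip.write(namesAndContents[i + 1].getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = zipOf("a.txt", "one\ntwo\n", "b.txt", "three\n");
        for (String line : readLines(new ByteArrayInputStream(data))) {
            System.out.println(line);
        }
    }
}
```

Because a zip stream has to be consumed entry by entry from the start, this kind of reading cannot be split across tasks, which is presumably one reason zip files aren’t recommended.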
There are a number of new handy operations like FieldFormatter, Last, Insert, Debug, and DateFormatter.
Finally, there are a few API changes, but most users won’t notice them.
See the CHANGES.TXT file for a comprehensive list of bugs fixed and features added.
Btw, we are successfully using Cascading, and by extension Hadoop, on largish clusters (10-50 nodes) in Amazon EC2. We currently have Cascades that execute a dozen or so Flows and, in turn, double-digit numbers of unique MapReduce jobs. It took only a few days of actual coding time to assemble everything with unit tests, without a single handwritten Mapper or Reducer.