Thanks to our partners at data Artisans, Cascading users now have an additional compute fabric on which to execute Cascading 3.0 applications: Apache Flink.
From the project site:
“Apache Flink is a platform for scalable stream and batch processing. Flink’s execution engine features low-latency pipelined and scalable batched data transfers and high-performance, in-memory operators for sorting and joining that gracefully go out-of-core in case of scarce memory resources.
Apache Flink uses in-memory storage to achieve massive performance gains over MapReduce. Its active memory management and custom serialization stack enable highly efficient operations on binary data and effectively prevent JVM OutOfMemoryErrors as well as frequent Garbage Collection pauses. Memory-safe execution means very little parameter tuning is necessary to reliably execute Cascading programs on Flink.”
According to data Artisans, Cascading 3.0 applications will run on Apache Flink with virtually no code changes. Their contribution furthers the portability promise of Cascading.
We are very excited to see another alternative for high-performance production deployments made available to our community.
Source code: http://cascading.org/cascading-flink/
Data Artisans blog: http://data-artisans.com/announcing-cascading-on-flink/
We have just published Cascading 2.7.1, a minor maintenance release.
This release resolves the following issues:
We have just published a new maintenance release 3.0.1 of Cascading.
This release resolves the following issue:
– Fixed an issue in c.f.t.p.Hadoop2TezFlowStepJob where the LocalResources were not passed to the AppMaster correctly, causing a ClassNotFoundException during split calculation for custom InputFormats.
It can be downloaded from these locations:
We are happy to announce the release of Cascading-Hive 2.0. This release adds compatibility with Cascading 3.0. It also contains a major contribution from hotels.com, a member of the Cascading community: it is now possible to read and write ACID ORC tables with Cascading-Hive. This feature relies on corc, an ORC integration for Cascading, also created by hotels.com. The demo directory contains a new application demonstrating the feature.
The jars are deployed on Conjars and the code is available on GitHub.
Cascading-Hive allows you to read and write Hive tables from within Cascading Flows and to run any HiveQL query as part of a Cascade.
We are happy to announce Cascading 3.0 is now publicly available for download.
The biggest change in this version, compared to previous releases, is that Cascading has added native support for Apache Tez alongside Apache Hadoop MapReduce and Cascading’s native local in-memory mode. It is now trivial (a matter of changing a few lines of code) to move your application to run on Tez instead of MapReduce. Others have run performance tests with Scalding and Tez and are reporting significant performance improvements.
This milestone release of Cascading with Apache Tez support means we’ve completed the work on the query planner that makes it faster for us and the community to integrate Cascading with other compute fabrics as they become available. We hope to announce additional platform support in the near future.
Along with the ease of adding new platforms, the new query planner should also show some improvements over Cascading 2.x execution times on MapReduce. Additionally, we’ve given developers direct control over how their MapReduce and Tez jobs are optimized, so you can tune performance to your specific needs.
Please note this is a major release: all deprecated methods have been removed, and there are some incompatible changes to the Cascading public API. You will need to edit and recompile your code in order to upgrade to 3.0.
As we continue to advance the code base, a number of other enhancements and bug fixes are included in the release. For the complete list of changes in Cascading 3.0, please see the change log.
We are happy to announce that Cascading 2.7 is now publicly available for download. This is the last planned minor release of Cascading in the 2.x line before we make Cascading 3.0 final.
This release contains new features and bug fixes. Two features of particular interest are PartitionTap support for small files and the ability of Traps to capture diagnostic information about a failure. Changes of note are:
- Added support for o.a.h.m.l.CombineFileInputFormat in the Hadoop specific c.t.h.PartitionTap implementation.
- Added c.t.Tap#prepareResourceForRead() and c.t.Tap#prepareResourceForWrite() methods to allow for client side tap resource initialization.
- Updated trap handling to capture diagnostic information within a trap when configured via a c.t.TrapProps instance.
- Updated c.t.u.TupleHasher to use MurmurHash3 32bit for hashCode calculation.
- Added ability to provide a custom cache to be used in c.p.a.AggregateBy and c.p.a.Unique.
- Updated c.f.h.MapReduceFlow to support both the org.apache.hadoop.mapred.* and org.apache.hadoop.mapreduce.* APIs.
- Updated the Cascading SDK.
For more details on new features and resolved issues see the change log.
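For readers unfamiliar with the hash function named in the TupleHasher change above, here is a minimal, self-contained sketch of the publicly documented MurmurHash3 32-bit (x86_32) algorithm over a byte array. It is not Cascading's own code: the class name `Murmur3Sketch` and the method `hash32` are hypothetical, and how c.t.u.TupleHasher actually feeds tuple elements and seeds into the hash is not described in this announcement.

```java
import java.nio.charset.StandardCharsets;

/**
 * Illustrative MurmurHash3 x86_32 over a byte array.
 * Hypothetical helper for explanation only; not Cascading's TupleHasher.
 */
final class Murmur3Sketch {
    private static final int C1 = 0xcc9e2d51;
    private static final int C2 = 0x1b873593;

    static int hash32(byte[] data, int seed) {
        int h = seed;
        int blocks = data.length / 4;

        // Body: mix each 4-byte little-endian block into the running state.
        for (int i = 0; i < blocks; i++) {
            int o = i * 4;
            int k = (data[o] & 0xff)
                  | (data[o + 1] & 0xff) << 8
                  | (data[o + 2] & 0xff) << 16
                  | (data[o + 3] & 0xff) << 24;
            k *= C1;
            k = Integer.rotateLeft(k, 15);
            k *= C2;
            h ^= k;
            h = Integer.rotateLeft(h, 13);
            h = h * 5 + 0xe6546b64;
        }

        // Tail: fold in the remaining 1 to 3 bytes, if any.
        int tail = blocks * 4;
        int k = 0;
        switch (data.length & 3) {
            case 3: k ^= (data[tail + 2] & 0xff) << 16; // fall through
            case 2: k ^= (data[tail + 1] & 0xff) << 8;  // fall through
            case 1:
                k ^= (data[tail] & 0xff);
                k *= C1;
                k = Integer.rotateLeft(k, 15);
                k *= C2;
                h ^= k;
        }

        // Finalization: mix in the length, then avalanche the bits.
        h ^= data.length;
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    public static void main(String[] args) {
        // Empty input with seed 1 exercises only the finalization step.
        System.out.println(Integer.toHexString(hash32(new byte[0], 1))); // prints 514e28b7

        byte[] key = "cascading".getBytes(StandardCharsets.UTF_8);
        System.out.println(Integer.toHexString(hash32(key, 0)));
    }
}
```

The practical consequence of a change like this is that the same tuple may hash to a different value than under 2.6, so anything that depends on stable hash values across versions is worth rechecking against the change log.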