News

Latest News & Updates

Cascading 4 for Streaming with Amazon S3 and Apache Kafka

Work on Cascading 4 continues with a new sub-project providing Cascading local mode Tap implementations for Amazon S3 and Apache Kafka.

Check out the project on GitHub: https://github.com/cwensel/cascading-local

Historically, Cascading's in-memory local (read: non-distributed) mode has primarily been promoted as a platform for testing complex Cascading Flows in unit and integration tests, without all the ceremony, time, and expense of spinning up an Apache Hadoop MapReduce or Apache Tez cluster.

But this is changing in Cascading 4. We are incrementally improving various aspects of the local mode planner and execution platform, and these changes are being driven by users working through new ways of leveraging Cascading, like the S3 and Kafka Tap implementations.

The Amazon S3 Tap implementation (S3Tap) can either retrieve a single key or, when it detects the key is a prefix, retrieve all the objects matching that prefix in the given S3 bucket.

More importantly, the S3Tap can record the last key seen (to disk or memory) so that on restart it continues fetching keys that sort lexicographically after it. With a well-designed key format (key prefixes built from date/time strings, for example), a Cascading application can periodically poll for new data in a bucket and stream it to disk, a queue, or back into a new S3 bucket.

The S3Tap can also be used to write data to S3 (at the time of this writing it is not compatible with the PartitionTap, an obvious future enhancement).
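To make the shape of this concrete, here is a minimal sketch of polling an S3 prefix and streaming the results to local disk. The S3Tap constructor shown (scheme, bucket/prefix URI, local checkpoint location) is an assumption for illustration only; the flow wiring uses standard Cascading local mode classes. Consult the cascading-local project for the actual signatures.

    import cascading.flow.Flow;
    import cascading.flow.local.LocalFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.scheme.local.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.local.FileTap;

    public class S3ToDisk
      {
      public static void main( String[] args )
        {
        // hypothetical constructor: reads every object under the prefix and, on
        // restart, resumes with keys that sort lexicographically after the
        // checkpointed last-key-seen
        Tap source = new S3Tap( new TextLine(), "s3://logs-bucket/2018-01-", "/tmp/last-key-seen" );

        Tap sink = new FileTap( new TextLine(), "output/logs.txt" );

        Pipe pipe = new Pipe( "copy" ); // a pass-through; insert operations as needed

        Flow flow = new LocalFlowConnector().connect( "s3-to-disk", source, sink, pipe );

        flow.complete();
        }
      }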

The Apache Kafka Tap implementation (KafkaTap) can attach to one or more Kafka topics and stream data from them, as well as write to one or more topics. Currently the KafkaTap relies on Kafka's own offset tracking, but an easy enhancement would be to mirror the S3Tap's approach to recording offsets.
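The equivalent sketch for Kafka follows the same pattern; again, the KafkaTap constructor (broker list plus topic name) is an assumption, not a confirmed signature.

    import cascading.flow.Flow;
    import cascading.flow.local.LocalFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.scheme.local.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.local.FileTap;

    public class KafkaToDisk
      {
      public static void main( String[] args )
        {
        // assumed constructor: attach to a topic, with offsets tracked by Kafka
        Tap source = new KafkaTap( new TextLine(), "localhost:9092", "events" );
        Tap sink = new FileTap( new TextLine(), "output/events.txt" );

        Flow flow = new LocalFlowConnector().connect( "kafka-to-disk", source, sink, new Pipe( "copy" ) );

        flow.complete();
        }
      }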

Why is this interesting?

Users have invested heavily in Cascading over the years, an investment that takes the form of familiarity and re-usable Cascading operations. Now a Cascading developer can pick up Cascading and her existing libraries built for other internal Cascading projects and use them to process the same data in real time, instead of the batch-style processing Hadoop MapReduce provided.

Also, with no heavyweight dependencies on Hadoop or other processing platforms, Cascading local mode can be used in constrained and lightweight environments like AWS Lambda.

Imagine distributed, remote applications dropping data into S3, S3 firing a “create” event, and a Cascading application wrapped in a Lambda function being invoked to pre-process the data and push it into another storage location, database, or queue.
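A rough sketch of that shape, reusing the hypothetical S3Tap constructors from above; the handler and event classes come from the standard aws-lambda-java libraries, while the bucket names and sink location are illustrative.

    import cascading.flow.Flow;
    import cascading.flow.local.LocalFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.scheme.local.TextLine;
    import cascading.tap.Tap;
    import com.amazonaws.services.lambda.runtime.Context;
    import com.amazonaws.services.lambda.runtime.RequestHandler;
    import com.amazonaws.services.lambda.runtime.events.S3Event;

    public class PreProcessHandler implements RequestHandler<S3Event, String>
      {
      @Override
      public String handleRequest( S3Event event, Context context )
        {
        // the S3 "create" event names the bucket and key that triggered us
        String bucket = event.getRecords().get( 0 ).getS3().getBucket().getName();
        String key = event.getRecords().get( 0 ).getS3().getObject().getKey();

        // hypothetical S3Tap usage: read the new object, push it downstream
        Tap source = new S3Tap( new TextLine(), "s3://" + bucket + "/" + key );
        Tap sink = new S3Tap( new TextLine(), "s3://processed-bucket/" + key );

        Flow flow = new LocalFlowConnector().connect( "pre-process", source, sink, new Pipe( "copy" ) );

        flow.complete();

        return "done";
        }
      }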

Driven acquired by Xplenty. What does this mean for Cascading?

Cascading 3.2.0 Released

Cascading 3.2.0 has been released.

This release covers the following issues and improvements:

– Much improved planner support for complex and tight c.p.Pipe assemblies that arise when merging files with themselves, or other splits and joins with no intermediate operations.

– c.t.h.DistCacheTap is now applied by default to any accumulated side of a HashJoin by rule.

– Added a c.f.h.MultiMapReduceFlow that can manage executing multiple JobConf instances as a single c.f.Flow (see the sketch below).
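As a rough illustration of the new flow type, here is a minimal sketch; the varargs constructor shown is an assumption made by analogy with the existing c.f.h.MapReduceFlow, so consult the javadoc for the actual signature.

    import cascading.flow.hadoop.MultiMapReduceFlow;
    import org.apache.hadoop.mapred.JobConf;

    public class TwoStepJob
      {
      public static void main( String[] args )
        {
        JobConf first = new JobConf();  // configure mapper, reducer, and paths here
        JobConf second = new JobConf(); // consumes the output path of the first job

        // assumed constructor, by analogy with the existing MapReduceFlow
        MultiMapReduceFlow flow = new MultiMapReduceFlow( "two-step-job", first, second );

        flow.complete(); // runs both jobs with Flow scheduling and lifecycle hooks
        }
      }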

For a detailed list of changes in this release, see:

https://github.com/Cascading/cascading/blob/3.2.0/CHANGES.txt

It can be downloaded here:

http://www.cascading.org/downloads/

http://conjars.org/cascading/cascading-core/versions/3.2.0

The source can be found here:

https://github.com/Cascading/cascading/tree/3.2.0

Cascading 3.1 Release

We are happy to announce Cascading 3.1 is now publicly available for download.

Version 3.1 improves the performance of Cascading over 3.0, resolves a number of issues encountered while planning complex workloads on MapReduce and Apache Tez, and further delivers on the promise of platform portability with the addition of Apache Flink as an execution platform.

As of version 3.1, Cascading can further leverage declared type information when serializing data across the ‘shuffle’ (the partitioning phase) between map and reduce tasks or Tez vertices. For many applications performing ETL or data-cleansing workloads, flexible type support within a column can dramatically improve reliability by letting each operation request values as the types it requires (a string or an integer, say). When integrating complex data sets, a given column or field may not be typed consistently across data sets, so requiring a common type, instead of lazily converting values on demand, can be difficult to rely on in practice. But for applications and systems that are consistent, declaring types not only enforces consistency (or prevents inconsistency), the type information is now used to improve serialization IO efficiency.
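For example, using the long-standing typed Fields API (the field names, types, and path here are illustrative):

    import cascading.scheme.local.TextDelimited;
    import cascading.tap.local.FileTap;
    import cascading.tuple.Fields;

    public class TypedFields
      {
      public static void main( String[] args )
        {
        // declare field names together with the Java types operations expect;
        // as of 3.1 this type information also tightens shuffle serialization
        Fields fields = new Fields( "userId", "visits", "lastSeen" )
          .applyTypes( String.class, Integer.class, Long.class );

        // the scheme coerces incoming text to the declared types on read, so
        // inconsistent data surfaces at the tap, not deep inside an operation
        FileTap source = new FileTap( new TextDelimited( fields, true, "\t" ), "input/users.tsv" );
        }
      }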

In addition to leveraging type information, applications consuming directory-partitioned data sets (data stored by date or other values) can pre-filter them using a new filtering mechanism introduced in 3.1. For example, if a data set spans 10 years and is partitioned by year and month, a filter can prevent a Cascading application from reading files outside the expected time range by excluding, at runtime, the partitions (directories) that do not match. This simple enhancement improves performance without affecting the maintainability of large, complex applications.
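A sketch of such a partitioned source follows; PartitionTap and DelimitedPartition are long-standing Cascading APIs, but the filter hook in the trailing comment is hypothetical, shown only to convey the intent of the new mechanism, so see the change log for the actual 3.1 API.

    import cascading.scheme.local.TextLine;
    import cascading.tap.local.FileTap;
    import cascading.tap.local.PartitionTap;
    import cascading.tap.partition.DelimitedPartition;
    import cascading.tuple.Fields;

    public class PartitionedSource
      {
      public static void main( String[] args )
        {
        // data laid out on disk as data/events/<year>/<month>/part files
        DelimitedPartition partition = new DelimitedPartition( new Fields( "year", "month" ), "/" );

        FileTap parent = new FileTap( new TextLine(), "data/events" );
        PartitionTap source = new PartitionTap( parent, partition );

        // hypothetical filter hook, named only to illustrate the intent: skip
        // directories whose partition values fall outside the expected range
        // before any file inside them is opened, e.g.
        //   source.addSourcePartitionFilter( new Fields( "year" ), new YearRangeFilter( 2014, 2015 ) );
        }
      }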

We are also very happy to remind users of the availability of Cascading on Apache Flink; with the release of 3.1.0, the Cascading connector for Flink can be run against a stable Cascading release. For more information, see the Apache Conference presentation Faster Workflows.

Please note this is a minor release retaining API compatibility with Cascading 3.0 public methods.

As we continue to advance the code base, a number of other enhancements and bug fixes are included in the release. For the complete list of changes in Cascading 3.1, please see the change log.

Announcing Cascading 3.0 on Apache Flink

Thanks to our partners, data Artisans, Cascading users now have an additional compute fabric on which to execute Cascading 3.0 applications: Apache Flink.

From the project site:

“Apache Flink is a platform for scalable stream and batch processing. Flink’s execution engine features low-latency pipelined and scalable batched data transfers and high-performance, in-memory operators for sorting and joining that gracefully go out-of-core in case of scarce memory resources.

Apache Flink uses in-memory storage to achieve massive performance gains over MapReduce. Its active memory management and custom serialization stack enable highly efficient operations on binary data and effectively prevent JVM OutOfMemoryErrors as well as frequent Garbage Collection pauses. Memory-safe execution means very little parameter tuning is necessary to reliably execute Cascading programs on Flink.”

According to data Artisans, with virtually no code changes, Cascading 3.0 applications will run on Apache Flink, furthering Cascading's portability promise through their contribution.

We are very excited to see another alternative for high performance production deployments made available to our community.

Source code: http://cascading.org/cascading-flink/
data Artisans blog: http://data-artisans.com/announcing-cascading-on-flink/

Enjoy!

Cascading Newsletter - September 2015

September 2015 Newsletter

Cascading 3.0 Maintenance Release

We have just published Cascading 3.0.2, a minor maintenance release.  Upgrading is recommended for all users.

This release resolves the following issues:

  • Updated Apache Tez to 0.6.2 to prevent deadlocks in complex DAGs. Note this release is incompatible with Tez 0.6.1.
  • Fixed issues concerning detailed stats retrieval robustness for both MapReduce and Tez platforms.
  • Updated the build to exclude jgrapht-ext, further isolating the jgrapht APIs to support reliable shading.
  • Fixed issue on Apache Tez where a split before and subsequent splicing back into a c.p.HashJoin could create an invalid plan.
  • Fixed issue where an unreachable YARN timeline server could cause the application to fail.
  • Fixed issue with NPE when retrieving Tez task status from timeline server.

For a detailed list of changes in this release, see:

https://github.com/Cascading/cascading/blob/3.0.2/CHANGES.txt

It can be downloaded here:

http://www.cascading.org/downloads/

The source can be found here:

https://github.com/Cascading/cascading/tree/3.0.2

Enjoy!

Cascading 2.7 Maintenance Release

We have just published Cascading 2.7.1, a minor maintenance release.

This release resolves the following issues:

  • Fixed issue where c.p.GroupBy or c.p.CoGroup would fail if attempting to group or join incoming Fields.UNKNOWN tuple streams using relative positions in the grouping field selectors.
  • Fixed issue where c.u.ShutdownUtil could log an NPE if a hook is removed during JVM shutdown.

For a detailed list of changes in this release, see:

https://github.com/Cascading/cascading/blob/2.7.1/CHANGES.txt

It can be downloaded here:

http://www.cascading.org/downloads/

The source can be found here:

https://github.com/Cascading/cascading/tree/2.7.1

Enjoy!

Cascading Newsletter - July 2015

July 2015 Newsletter

Cascading 3.0 Maintenance Release

We have just published Cascading 3.0.1, a minor maintenance release.

This release resolves the following issue:

– Fixed issue in c.f.t.p.Hadoop2TezFlowStepJob where the LocalResources were not passed to the AppMaster correctly, causing a ClassNotFoundException during split calculation for custom InputFormats.

It can be downloaded from these locations: