Cascading 4 for Streaming with Amazon S3 and Apache Kafka

Work on Cascading 4 continues with a new sub-project that provides new Cascading local mode Tap implementations for Amazon S3 and Apache Kafka.

Checkout the project on github:

Historically Cascading’s in-memory local (read non-distributed) mode has been primarily promoted as a platform for testing complex Cascading Flows in unit and integration tests without all the ceremony, time, and expense of spinning up an Apache Hadoop MapReduce or Apache Tez cluster.

But this is changing in Cascading 4. Incrementally we are improving various aspects of the local mode planner and execution platform, these changes are being driven by users working through new ways of leveraging Cascading, like with the S3 and Kafka Tap implementations.

The AWS S3 Tap implementation (S3Tap), can either retrieve a single key, or detect a key is a prefix and retrieve all the objects that match that prefix in the given S3 bucket.

More importantly, the S3Tap can record the last key seen (to disk or memory) so that on a restart the S3Tap will continue fetching keys that lexicographically come after. With a well designed key format (key prefixes using date/time strings, for example), a Cascading application can be written that periodically polls for new data in a bucket and streams it to disk, a queue, or back into a new S3 bucket.

The S3Tap can also be used to write data to S3 as well (at the time of this writing, it is not compatible with the PartitionTap, an obvious enhancement).

The Apache Kafka Tap implementation (KafkaTap) can attach to one or more Kafka topics and stream data as well as attach and write to one or more topics. Currently the KafkaTap relies on Kafka offset recording, but an easy enhancement would be to mirror the S3Tap for tracking the offset.

Why is this interesting?

There has been a lot of investment in Cascading over the years by users. This investment is in the form of familiarity and re-usable Cascading operations. Now a Cascading developer can pick up Cascading and her existing libraries built for other internal Cascading projects and use them to process the same data in real-time vs the batch style processing Hadoop MapReduce provided for.

Also, with no heavy weight dependencies on Hadoop or other processing platforms, Cascading local mode can be used in constrained and lightweight environments like AWS Lambda.

Imagine data being dropped by distributed and remote applications into S3, S3 firing a “create” event where a Cascading application wrapped in a Lambda is invoked to pre-process and push data into another storage location, database, or queue.