If you are a Scala fan, check out the Scalding announcement from Twitter. Or just grab the Scalding code from GitHub.
Of course, don't forget the other language bindings: Cascalog, PyCascading, and Cascading.JRuby.
Welcome
Cascading is a Data Processing API, Process Planner, and Process Scheduler used for defining and executing complex, scale-free, and fault-tolerant data processing workflows on an Apache Hadoop cluster, all without having to 'think' in MapReduce.
Cascading is a thin Java library and API that sits on top of Hadoop's MapReduce layer and is executed from the command line like any other Hadoop application.
Because Cascading is a library and API that can be driven from any JVM-based language (Jython, JRuby, Groovy, Clojure, etc.), developers can create applications and frameworks that are "operationalized". That is, a single deployable Jar can encapsulate a series of complex and dynamic processes, all driven from the command line or a shell, instead of using external schedulers to glue many individual applications together with XML against each individual command line interface.
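To make the idea of describing a workflow as a chain of steps in ordinary JVM code a little more concrete, here is a minimal plain-Java sketch. It is only an illustration of the style: it does not use the Cascading API or Hadoop, it runs entirely in memory, and the class name `WordCountSketch` and its methods are hypothetical names chosen for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Conceptual sketch only: plain Java, NOT the Cascading API.
// All names here are hypothetical and chosen for illustration.
public class WordCountSketch {

    // Step 1: split each input line into lower-case words.
    public static List<String> tokenize(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.toList());
    }

    // Step 2: group identical words and count the size of each group.
    public static Map<String, Long> count(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    // The workflow is simply the composition of the steps above.
    public static Map<String, Long> run(List<String> lines) {
        return count(tokenize(lines));
    }

    // Driven from the command line: each argument is treated as one input line.
    public static void main(String[] args) {
        List<String> lines = args.length > 0
                ? Arrays.asList(args)
                : Arrays.asList("to be or not to be");
        run(lines).forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

The point of the sketch is that the steps are named, composed, and launched from a single `main` method, so the whole workflow can be packaged and tested as one unit. Mapping such steps onto a Hadoop cluster is the part a real planner and scheduler would handle.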
The Cascading API approach dramatically simplifies development, regression and integration testing, and deployment of business critical applications on both Amazon Web Services (such as Elastic MapReduce) and dedicated hardware.
Cascading is not a new text-based query syntax (like Pig) or another complex system that must be installed on a cluster and maintained (like Hive). Rather, Cascading is both complementary to and a valid alternative to either application.
Cascading does support the development of such languages and DSLs, like Multitool, Cascalog, and Cascading.JRuby. Multitool allows you to "grep", "sed", or join large datasets on a Hadoop FileSystem or Amazon S3 from the command line.
Cascading is Open-Source software. Production and Developer Support can be obtained through Concurrent, Inc.
Cascading has a strong community of users and contributors; see our Cascading modules page for related projects and extensions.
Cascading, extensions, and related libraries are also hosted in the Conjars maven repository maintained by Concurrent, Inc. The repository is open to the public.
Read more about Cascading's features or thumb through the Cascading User Guide.
Recent Events
If you are interested in running Python on Apache Hadoop, check out PyCascading from Twitter.
Here is the official announcement on our mail-list.
If Clojure is more your thing, there is always Cascalog, another project from the Twitter data teams (formerly BackType).
Scale Unlimited will be offering their online course, Introduction to Cascading, this November 18th.
After months of work, we are very happy to announce availability of Cascading 2.0 WIP (Work in Progress).
2.0 is still under development, but it has become stable enough for us to make the work public so we can get early feedback on the APIs and other related changes, without causing unnecessary headaches to early adopters.
Currently, nearly all changes are internal, except for these:
- Decoupled the internal planner from Hadoop and provided a "local" mode planner for fast in-memory processing.
- Changed the Tap APIs to improve development of custom taps.
- Changed Cascading license from GPL v3 to Apache 2.0.
Note that a number of additional improvements commonly requested by users are also in the works. More on that soon.
To download WIP builds, please visit the Concurrent downloads page. Or grab the source from the public Git repository on GitHub.
For a comprehensive list of changes, see the CHANGES.txt file.
An Apache Solr integration Tap has just been added to the Cascading extensions page for download from GitHub.
The No Fluff, Just Stuff conference tour is running a series of presentations on Cascading and Cascalog. Check out the video below for a great introduction to Cascading.
After a bit of work, we have repackaged both Cascading.Load and Multitool, giving them helper bash wrappers for installing, running, and updating. The new packages are on the download page.
After unpacking Multitool, for example, just run ./bin/multitool install, or ./bin/multitool help for more information.
Multitool is a command line interface for running sed- and grep-like applications on Apache Hadoop. It even supports joins across multiple files. It's perfect for finding files or creating large test datasets from larger ones.
Cascading.Load is a command line tool for creating complex loads on an Apache Hadoop cluster for performance tuning.
Both tools are based on Cascading, of course.
Interested in getting started with Hadoop, Cascading, and Cascalog?
If so, sign up here for the Cascalog Workshop in sunny San Francisco on Saturday, February 19th, before space runs out.
Nathan Marz of BackType, the author of Cascalog, will be leading the workshop. Chris K Wensel, the author of Cascading, will be lurking about, lending a hand where possible.