Welcome

Cascading is a feature-rich API for defining and executing complex, scale-free, and fault-tolerant data processing workflows on a Hadoop cluster.

Cascading is a thin Java library that sits on top of Hadoop's MapReduce layer. It is not a new text-based query syntax (like Pig) or another complex system that must be installed on a cluster and maintained (like Hive), though Cascading is both complementary to and a valid alternative to either application.

Cascading is simply a query processing API that lets the developer quickly assemble complex distributed processes without having to "think" in MapReduce, and efficiently schedule them based on their dependencies. Simple data processing applications are supported as well, since complex applications tend to start simple.

Cascading is open source and dual-licensed under the GPL and OEM/Commercial Licenses. OEM/Commercial Licenses and Developer Support can be obtained through Concurrent, Inc.

Cascading has a strong community of users and contributors. See our Cascading modules page for related projects and extensions.

Read more about Cascading's features or thumb through the Cascading User Guide.

Recent Events

Bixo Hackathon

There will be a Bixo hackathon in Nevada City, CA this Sept 7th and 8th. Read more about it here.

Note that even if you’re not a hard-core Bixo user, fringe benefits from participating include learning a lot about the very useful underlying technologies (Cascading, Hadoop, HttpClient) as well as getting an excuse to visit beautiful Nevada City.

Hope to see you there.

O'Reilly Strata Conference

The new Strata Conference has just been announced with a Call for Proposals ending Sept 28.

This new conference is on the 'business of data' and is the sister conference to Velocity.

Hope to see lots of proposals coming in from Hadoop, Cascading, Bixo, and Cascalog users and developers.

Cascading 1.1.2

We are happy to announce that Cascading 1.1.2 is now publicly available for download.

This release features many bug fixes.

For a detailed list of changes see: CHANGES.txt

This release will run against Hadoop 0.18.3, 0.19.x, and 0.20.x, including Amazon Elastic MapReduce.

Note that the tests will not compile or run against Hadoop 0.18.3 due to package changes since that version.

BigDataCamp 2010

Quick note that Chris will be at the BigDataCamp on June 28, 2010, the night before the Hadoop Summit. Register now before all the seats are taken.

Cascading 1.1.0 Available

We are happy to announce that Cascading 1.1.0 is now publicly available for download.

This release features many performance and usability enhancements while remaining backwards compatible with 1.0.

Specifically:

  • Performance optimizations with all join types
  • Numerous job planner optimizations
  • Dynamic optimizations when running in Amazon Elastic MapReduce and S3
  • API usability improvements around large numbers of field names
  • Support for TSV, CSV, and custom delimited text files
  • Support for manipulating and serializing non-Comparable custom Java types
  • Debug levels supported by the job planner

For a detailed list of changes see: CHANGES.txt

Along with this release are a number of extensions created by the Cascading user community.

Among these extensions are:

  • Bixo - a data mining toolkit
  • DBMigrate - a tool for migrating data between RDBMSs and Hadoop
  • Apache HBase, Amazon SimpleDB, and JDBC integration
  • JRuby- and Clojure-based scripting languages for Cascading
  • Cascalog - a robust interactive extensible query language

This release will run against Hadoop 0.18.3, 0.19.x, and 0.20.x, including Amazon Elastic MapReduce.

Note that the tests will not compile or run against Hadoop 0.18.3 due to package changes since that version.

Interview on Parallel Programming

A very interesting interview with Billy Newport on InfoQ about "the need for higher level abstraction to do parallel programming with multi-core systems effectively."

"Billy Newport is a Distinguished Engineer working on WebSphere eXtreme Scale (ObjectGrid) and on WebSphere high availability."

Nathan Marz has just announced and released Cascalog.

Cascalog is an interactive query language for Hadoop with a focus on simplicity, expressiveness, and flexibility, intended to be used by Analysts and Developers alike.

Cascalog eschews the SQL syntax for a simpler and more expressive syntax based on Datalog.

With this added expressiveness, Cascalog can query existing data stores "out of the box" with no data "importing" or "under the hood" configuration required.

Because Cascalog sits on top of Clojure, a powerful JVM-based language and interactive shell, adding new operations to a query is as simple as defining a new function.

Cascalog also relies on Cascading, a robust data processing API and query planner.

Here is the canonical "word count" query in Cascalog:

(?<- (stdout) [?word ?count] (sentence ?s) (split ?s :> ?word) (c/count ?count))
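
For readers new to Datalog-style syntax, the effect of the query above can be sketched in plain Java: take each sentence, split it into words, and count how often each word occurs. This is only an illustration of the result, not Cascalog or Cascading code, and it does not run on Hadoop. The class name WordCountSketch, the method countWords, the sample sentences, and the choice to split on whitespace are all assumptions made for this sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {

    // Hypothetical helper: splits each sentence on whitespace and
    // tallies the number of occurrences of each word.
    public static Map<String, Long> countWords(List<String> sentences) {
        Map<String, Long> counts = new TreeMap<>();
        for (String sentence : sentences) {
            for (String word : sentence.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    // Add 1 to the running total for this word.
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample input stands in for the "sentence" source in the query.
        List<String> sentences = Arrays.asList("the quick brown fox", "the lazy dog");
        // Print one word and its count per line, like emitting [?word ?count].
        countWords(sentences).forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

The Cascalog query expresses the same split, group, and count steps declaratively and leaves planning and execution on the cluster to Cascading.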

You can check out an introductory blog post here: http://nathanmarz.com/blog/introducing-cascalog/

The project is hosted here: http://github.com/nathanmarz/cascalog

The recently released Karmasphere Studio 1.2 now includes support for Cascading 1.0 in the free community download.

Karmasphere Studio is an IDE and Debugger for Hadoop MapReduce application developers that also includes integration with the Amazon Web Services platform.

And with Cascading support directly in the Debugger and IDE, developers can develop and debug complex Hadoop jobs even more quickly.

Also worthy of note, Karmasphere recently received $5M Series A funding.

Cascading 1.1 RC3 Available

Cascading 1.1 RC3 is now available from the downloads page.

Note that we no longer serve downloads from Google Code; use the links on the downloads page instead.

Cascading-DBMigrate

Nathan at BackType has announced and released Cascading-DBMigrate.

In short, DBMigrate is a more flexible and reliable alternative to Sqoop for moving data to/from a relational data store.

Cascading.JDBC has been around for quite a while, but DBMigrate overcomes some of its limitations when dealing with MySQL servers (AsterData did not have the same limitations) and OFFSET/LIMIT queries.