External Data Interfaces

Cascading lets a Flow retrieve data from, and store data to, external sources other than those provided by the native cluster. Specifically, it supports Amazon S3 and HTTP-based services.

Via the S3 resource interface, a Cascading Flow can retrieve data from S3 and save data to it. This makes running a cluster in Amazon EC2 a snap. Flows that fetch data from and store data to S3 can be combined with more complex Flows that do much more interesting and proprietary things to the raw data.

Any data set can also be retrieved over standard HTTP, including URLs with a query string for dynamic data. The only limitation is that the HTTP response must include a 'Content-Length' header. Otherwise the cluster has no idea how much data to expect, and thus how many resources to allocate. To get around this, tools like wget and curl can be invoked through the Groovy scripting layer to prefetch the data.
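To tell in advance whether a given URL meets the 'Content-Length' requirement, one option is to issue a HEAD request and inspect the header before wiring the URL into a Flow. The sketch below uses only the standard java.net classes and is not part of the Cascading API. The class and method names (ContentLengthCheck, parseContentLength, declaredLength) are hypothetical, chosen for illustration. It also assumes the server answers HEAD with the same headers it would send for GET, which not every server does.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class ContentLengthCheck {

    // Returns the declared body size in bytes, or -1 if the header
    // is missing, malformed, or negative.
    public static long parseContentLength(String headerValue) {
        if (headerValue == null) {
            return -1L;
        }
        try {
            long length = Long.parseLong(headerValue.trim());
            return length >= 0 ? length : -1L;
        } catch (NumberFormatException e) {
            return -1L;
        }
    }

    // Sends a HEAD request and reports the declared length, or -1 if none.
    public static long declaredLength(String url) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("HEAD");
        try {
            return parseContentLength(conn.getHeaderField("Content-Length"));
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ContentLengthCheck <url>");
            return;
        }
        long length = declaredLength(args[0]);
        if (length >= 0) {
            System.out.println("Content-Length present: " + length + " bytes");
        } else {
            System.out.println("No usable Content-Length; prefetch the data first");
        }
    }
}
```

If the check reports no usable header, the data would need to be prefetched (for example with wget or curl, as described above) before a Flow reads it.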

This mechanism is fully extensible, allowing other types of data sources and sinks to participate as first-class members of a given Flow. JDBC and HBase adapters, for populating those data stores with Flow result sets, are on the to-do list.