Cascading was designed to be easily configured and enhanced by
developers. Besides allowing for custom Operations, developers can
provide custom Tap and Scheme types so applications can connect to
systems external to Hadoop.
A Tap represents something "physical", like a file or a database table. Consequently, Tap implementations are responsible for life cycle issues around the resource they represent, like testing for its existence or deleting it.
A Scheme represents a format or representation, like a text format
for a file, or columns in a table. Schemes are responsible for
converting the Tap-managed resource's proprietary format to and from a
cascading.tuple.Tuple instance.
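The split of responsibilities described above can be sketched in plain Java. This is only an illustration: LineScheme, TabDelimitedScheme and LocalFileTap are hypothetical stand-ins written for this example, not Cascading classes, and a List of Strings stands in for cascading.tuple.Tuple.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a Scheme: it knows the format, not the resource.
interface LineScheme {
    List<String> toTuple(String rawLine);   // resource format -> tuple
    String fromTuple(List<String> tuple);   // tuple -> resource format
}

// A tab-delimited text format.
class TabDelimitedScheme implements LineScheme {
    public List<String> toTuple(String rawLine) {
        // A limit of -1 keeps trailing empty fields.
        return new ArrayList<>(Arrays.asList(rawLine.split("\t", -1)));
    }

    public String fromTuple(List<String> tuple) {
        return String.join("\t", tuple);
    }
}

// Hypothetical stand-in for a Tap: it owns the resource and its life cycle,
// and delegates every format decision to the scheme.
class LocalFileTap {
    private final Path path;
    private final LineScheme scheme;

    LocalFileTap(Path path, LineScheme scheme) {
        this.path = path;
        this.scheme = scheme;
    }

    boolean exists() {
        return Files.exists(path);
    }

    boolean delete() {
        try {
            return Files.deleteIfExists(path);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void write(List<List<String>> tuples) {
        List<String> lines = new ArrayList<>();
        for (List<String> tuple : tuples) {
            lines.add(scheme.fromTuple(tuple));
        }
        try {
            Files.write(path, lines);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    List<List<String>> read() {
        List<List<String>> tuples = new ArrayList<>();
        try {
            for (String line : Files.readAllLines(path)) {
                tuples.add(scheme.toTuple(line));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tuples;
    }
}
```

Swapping TabDelimitedScheme for another LineScheme changes the format without touching the existence and deletion logic, which is the separation the Tap and Scheme types aim for.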
Unfortunately, creating custom Taps and Schemes can be an involved
process that requires some knowledge of Hadoop and the Hadoop FileSystem
API. Most commonly, the cascading.tap.Hfs class
can be subclassed if a new file system is to be supported, assuming
that passing a fully qualified URL to the Hfs
constructor isn't sufficient (the Hfs tap will
look up a file system based on the URL scheme via the Hadoop FileSystem
API).
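To make the idea of a lookup keyed on the URL scheme concrete, here is a minimal sketch using only java.net.URI. It is a simplified model, not Hadoop's implementation: FileSystemRegistry, its methods, and the labels "distributed" and "local" are all hypothetical names chosen for this example.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical registry mapping a URL scheme to a file system label.
class FileSystemRegistry {
    private final Map<String, String> implByScheme = new HashMap<>();
    private final String defaultScheme;

    FileSystemRegistry(String defaultScheme) {
        this.defaultScheme = defaultScheme;
    }

    void register(String scheme, String impl) {
        implByScheme.put(scheme.toLowerCase(Locale.ROOT), impl);
    }

    // Resolves a location to a registered file system label.
    // A location without a scheme falls back to the default scheme.
    String resolve(String location) {
        String scheme = URI.create(location).getScheme();
        if (scheme == null) {
            scheme = defaultScheme;
        }
        String impl = implByScheme.get(scheme.toLowerCase(Locale.ROOT));
        if (impl == null) {
            throw new IllegalArgumentException(
                "No file system registered for scheme: " + scheme);
        }
        return impl;
    }
}

class FileSystemRegistryDemo {
    public static void main(String[] args) {
        FileSystemRegistry registry = new FileSystemRegistry("file");
        registry.register("hdfs", "distributed");
        registry.register("file", "local");

        System.out.println(registry.resolve("hdfs://namenode:8020/data/in"));
        System.out.println(registry.resolve("/tmp/data/in"));
    }
}
```

In this model, a fully qualified URL is enough whenever its scheme is already registered; an unregistered scheme is the case where a new file system has to be supplied.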
Delegating to the Hadoop FileSystem API is not a strict
requirement, but a developer who bypasses it will need to implement a Hadoop
org.apache.hadoop.mapred.InputFormat and/or
org.apache.hadoop.mapred.OutputFormat so that
Hadoop knows how to split and handle the incoming/outgoing data. The
custom Scheme is responsible for setting the
InputFormat and
OutputFormat on the
JobConf via the sinkInit
and sourceInit methods.
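The shape of that responsibility can be sketched with stand-in types. Everything below is hypothetical and exists only for illustration: SketchConf stands in for JobConf, SketchScheme for the Scheme base class, and the two empty KeyValue classes for custom InputFormat and OutputFormat implementations. The real methods take Cascading and Hadoop types, so this shows the pattern, not the actual signatures.

```java
// Stand-in for JobConf: records which format classes a job should use.
class SketchConf {
    private Class<?> inputFormat;
    private Class<?> outputFormat;

    void setInputFormat(Class<?> format) { this.inputFormat = format; }
    void setOutputFormat(Class<?> format) { this.outputFormat = format; }
    Class<?> getInputFormat() { return inputFormat; }
    Class<?> getOutputFormat() { return outputFormat; }
}

// Stand-in for the Scheme base class with its two initialization hooks.
abstract class SketchScheme {
    abstract void sourceInit(SketchConf conf); // called when the resource is read
    abstract void sinkInit(SketchConf conf);   // called when the resource is written
}

// Placeholders for custom InputFormat/OutputFormat implementations.
class KeyValueInputFormat { }
class KeyValueOutputFormat { }

// A custom scheme registers its formats in the matching hook.
class KeyValueScheme extends SketchScheme {
    @Override
    void sourceInit(SketchConf conf) {
        conf.setInputFormat(KeyValueInputFormat.class);
    }

    @Override
    void sinkInit(SketchConf conf) {
        conf.setOutputFormat(KeyValueOutputFormat.class);
    }
}
```

Keeping the two hooks separate means a scheme used only as a source never has to supply an output format, and vice versa.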
For examples of how to implement a custom Tap and Scheme, see the samples on the Cascading Modules page.
Copyright © 2007-2008 Concurrent, Inc. All Rights Reserved.