All input data comes from, and all output data feeds to, a
cascading.tap.Tap instance.
A Tap represents a resource like a data file on the local file system, on a Hadoop distributed file system, or even on Amazon S3. Taps can be read from, which makes it a "source", or written to, which makes it a "sink". Or, more commonly, Taps can act as both sinks and sources when shared between Flows.
All Taps must have a Scheme associated with them. If the Tap is about where the data is, and how to get it, the Scheme is about what the data is. Cascading provides two Scheme classes, TextLine and SequenceFile.
TextLine reads and writes text files and returns Tuples with two field names by default, "offset" and "line". These values are inherited from Hadoop. When written to, all Tuple values are converted to Strings and joined with the TAB character (\t).
SequenceFile is based on the Hadoop Sequence file, which is a binary format. When written or read from, all Tuple values are saved in their native binary form. This is the most efficient file format, but being binary, the result files can only be read by Hadoop applications.
The fundamental difference behind TextLine
and SequenceFile schemes is that tuples stored in
the SequenceFile remain tuples, so when read,
they do not need to be parsed. So a typical Cascading application will
read text files, and parse each line into a Tuple
for processing. The final Tuples are saved via the
SequenceFile scheme so future applications can
just read the file directly into Tuple instances
without the parsing step.
The above example creates a new Hadoop FileSystem Tap that can read/write text files. Since only one field name was provided, the "offset" field is discarded, resulting in an input tuple stream with only "line" values.
The three most common Tap classes used are, Hfs, Dfs, and Lfs. The MultiTap and TemplateTap are utility Taps.
The cascading.tap.Lfs Tap is used to
reference local files. Local files are files on the same machine
your Cascading application is started. Even if a remote Hadoop
cluster is configured, if a Lfs Tap is used as either a source or
sink in a Flow, Cascading will be forced to run in "local mode"
and not on the cluster. This is useful when creating applications
to read local files and import them into the Hadoop distributed
file system.
The cascading.tap.Dfs Tap is used to
reference files on the Hadoop distributed file system.
The cascading.tap.Hfs Tap uses the
current Hadoop default file system. If Hadoop is configured for
"local mode" its default file system will be the local file
system. If configured as a cluster, the default file system is
likely the Hadoop distributed file system. The Hfs is convenient
when writing Cascading applications that may or may not be run on
a cluster. Lhs and Dfs subclass the Hfs Tap.
The cascading.tap.MultiTap is used to
tie multiple Tap instances into a single Tap. The only restriction
is that all the Tap instances passed to a new MultiTap share the
same Scheme classes (not necessarily the same Scheme
instance).
The cascading.tap.TemplateTap is used
to sink tuples into directory paths based on the values in the
Tuple. More can be read below in Template Taps.
Keep in mind Hadoop cannot source data from directories with
nested sub-directories, and it cannot write to directories that already
exist. But you can simply point the Hfs
Tap to a directory of data files and they all
will be used as input, no need to enumate each individual file into a
MultiTap.
To get around the last issue, the Hadoop related Taps allow for a
SinkMode value to be set when constructed.
Example 3.7. Overwriting An Existing Resource
Tap tap = new Hfs( new TextLine( new Fields( "line" ) ), path, SinkMode.REPLACE );
Here are all the modes available by the built-in Tap types.
SinkMode.KEEPThis is the default behavior. If the resource exists, attempting to write to it will fail.
SinkMode.REPLACEThis allows Cascading to delete the file immediately after the Flow is started.
SinkMode.APPENDCurrently unsupported, but future versions of Hadoop will support appends fully.
Copyright © 2007-2008 Concurrent, Inc. All Rights Reserved.