The Tuple class is a generic container for
all java.lang.Object instances (1.0 required all
objects be of type java.lang.Comparable).
Subsequently any primitive value or custom Class can be stored in a
Tuple instance, that is, returned by a
Function, Aggregator, or
Buffer as a result value.
But for this to work any Class that isn't a primitive value or a
Hadoop Writable type will need to have a
corresponding Hadoop 'serialization' class registered in the Hadoop
configuration files for your cluster. Hadoop
Writable types work because there is already a
generic serialization implementation built into Hadoop. See the Hadoop
documentation for registering a new serialization helper or to create
Writable types. Cascading will automatically
inherit any registered serialization implementations.
During serialization and deserialization of
Tuple instances that contain custom types, the
Cascading Tuple serialization framework will need
to store the class name (as a String) before
serializing the custom object. This can be very space inneficient. To
overcome this, custom types can add the
SerializationToken Java annotation to the custom
type class. The SerializationToken annotation
expects two arrays, one of integers named tokens, and one of Class name
strings. Both arrays must be the same size, and no token can be less
than 128 (the first 128 values are for internal use).
During serialization and deserialization, the token values are
used instead of the String Class names to reduce
the amount of storage used.
Serialization tokens may also be stored in the Hadoop config files
or set as a property passed to the FlowConnector,
with the property name cascading.serialization.tokens. The
value of this property is a comma separated list of
token=classname values.
Note Cascading will natively serialize/deserialize all primitives
and byte arrays (byte[]). It also uses the token 127 for
the Hadoop BytesWritable class.
Copyright © 2007-2008 Concurrent, Inc. All Rights Reserved.