6.8 Custom Types and Serialization

The Tuple class is a generic container for all java.lang.Object instances (1.0 required all objects be of type java.lang.Comparable). Subsequently any primitive value or custom Class can be stored in a Tuple instance, that is, returned by a Function, Aggregator, or Buffer as a result value.

But for this to work any Class that isn't a primitive value or a Hadoop Writable type will need to have a corresponding Hadoop 'serialization' class registered in the Hadoop configuration files for your cluster. Hadoop Writable types work because there is already a generic serialization implementation built into Hadoop. See the Hadoop documentation for registering a new serialization helper or to create Writable types. Cascading will automatically inherit any registered serialization implementations.

During serialization and deserialization of Tuple instances that contain custom types, the Cascading Tuple serialization framework will need to store the class name (as a String) before serializing the custom object. This can be very space inneficient. To overcome this, custom types can add the SerializationToken Java annotation to the custom type class. The SerializationToken annotation expects two arrays, one of integers named tokens, and one of Class name strings. Both arrays must be the same size, and no token can be less than 128 (the first 128 values are for internal use).

During serialization and deserialization, the token values are used instead of the String Class names to reduce the amount of storage used.

Serialization tokens may also be stored in the Hadoop config files or set as a property passed to the FlowConnector, with the property name cascading.serialization.tokens. The value of this property is a comma separated list of token=classname values.

Note Cascading will natively serialize/deserialize all primitives and byte arrays (byte[]). It also uses the token 127 for the Hadoop BytesWritable class.

Copyright © 2007-2008 Concurrent, Inc. All Rights Reserved.