Cascading 2.2 is starting to take shape for those interested in test driving emerging features.
Of note is “field type” support, which lets fields read from an input file keep their type information all the way through to where the data is written to a sink file.
This is important for a few reasons:
- Detecting incompatible comparisons during joins and sorting at planner time
- Retaining canonical types in a Tuple
- Reading and writing field type information from/into long term archive files (Avro, Thrift, etc.)
- Reducing intermediate file size by guaranteeing field type information
- Custom type coercion via the CoercibleType interface
The CoercibleType interface is of particular importance.
Consider reading a CSV file with a date column holding values like 28/Dec/2012:16:17:12:931 -0800.
Internally, date information is best handled as a long timestamp. But when externalized as a String, it should read as a date string, not a stringified long value.
The DateType implementation of CoercibleType can be used when declaring the date field. Given the correct date format string, the value of the date field is stored as its canonical type, a long.
So if an Operation or sink Scheme asks for the value as a string by calling tupleEntry.getString("date"), it is automatically converted back to the proper date string.
Or, to store the long value of a date string, the code can call tupleEntry.setString("date", "28/Dec/2012:16:17:12:931 -0800"), after which tupleEntry.getObject("date") instanceof Long is true.
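To make the round trip concrete, here is a minimal sketch of the idea in plain Java. It is a stand-in built on java.text.SimpleDateFormat, not Cascading's DateType: the class name DateCoercionSketch, its methods, and the pattern dd/MMM/yyyy:HH:mm:ss:SSS Z are assumptions chosen to match the example string above.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical stand-in for a coercible date type. This is NOT Cascading's
// DateType; it only illustrates "canonical long inside, date string outside".
class DateCoercionSketch {
    private final String pattern;
    private final TimeZone zone;

    DateCoercionSketch(String pattern, TimeZone zone) {
        this.pattern = pattern;
        this.zone = zone;
    }

    // SimpleDateFormat is not thread-safe, so build a fresh instance per call.
    private SimpleDateFormat newFormat() {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
        format.setTimeZone(zone);
        format.setLenient(false);
        return format;
    }

    // String -> canonical long (epoch milliseconds), roughly what a
    // setString-style call would end up storing.
    long canonical(String text) {
        try {
            return newFormat().parse(text).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("unparseable date: " + text, e);
        }
    }

    // Canonical long -> String, roughly what a getString-style call would return.
    String asString(long epochMillis) {
        return newFormat().format(new Date(epochMillis));
    }
}
```

With the pattern above and a GMT-08:00 zone, canonical("28/Dec/2012:16:17:12:931 -0800") yields an epoch-millisecond long, and asString on that long gives back the original text. The zone matters only for formatting: parsing honors the offset embedded in the string.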
CoercibleType isn’t a replacement for data-cleansing code that can handle contingencies in the data, but for data that is known to be clean, even data emitted from prior Cascading Flows, it is quite handy.
This opens up the door for more complex types that may have multiple representations. Consider a hypothetical Person object that can be serialized as binary to disk, but also has a JSON String representation and a Map object representation in memory at runtime.
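As a rough illustration of that hypothetical Person case, the sketch below keeps one canonical in-memory form and converts to a Map or a JSON String on request. Everything here is made up for illustration: Person, its name and age fields, and PersonCoercionSketch are not Cascading classes, the binary form is left out, and the JSON writer is hand-rolled with only minimal escaping.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical value class; the fields are assumptions for this example.
final class Person {
    final String name;
    final int age;

    Person(String name, int age) {
        this.name = name;
        this.age = age;
    }
}

// Hypothetical coercion helper, loosely modeled on the idea of a coercible
// type: one canonical form, several external representations.
final class PersonCoercionSketch {
    // Accept a supported representation and return the canonical Person.
    static Person canonical(Object value) {
        if (value instanceof Person) {
            return (Person) value;
        }
        if (value instanceof Map) {
            Map<?, ?> map = (Map<?, ?>) value;
            return new Person(String.valueOf(map.get("name")),
                              ((Number) map.get("age")).intValue());
        }
        throw new IllegalArgumentException("unsupported source: " + value);
    }

    // Convert the canonical Person into the requested representation.
    static <T> T coerce(Person person, Class<T> to) {
        if (Person.class.equals(to)) {
            return to.cast(person);
        }
        if (Map.class.equals(to)) {
            Map<String, Object> map = new LinkedHashMap<>();
            map.put("name", person.name);
            map.put("age", person.age);
            return to.cast(map);
        }
        if (String.class.equals(to)) {
            return to.cast(toJson(person));
        }
        throw new IllegalArgumentException("unsupported target: " + to);
    }

    // Minimal JSON writer: escapes only backslashes and double quotes.
    // A real implementation would delegate to a JSON library.
    private static String toJson(Person person) {
        String escaped = person.name.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"name\":\"" + escaped + "\",\"age\":" + person.age + "}";
    }
}
```

The point of the design is that callers ask for the representation they need, and the type decides how to produce it from the single canonical value.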