Reducing intermediate file size by guaranteeing field type information
The CoercibleType interface is of particular importance.
- Custom type coercion via CoercibleType interface
Consider reading a CSV file with a date column, like
Internally date information is best handled as a
long timestamp. But when externalized as a String, it should read as a date string, not a stringified long value.
The DateType implementation of CoercibleType can be used when declaring the date field. Given the correct string date format string, the value of the date field will be stored as its canonical type,
So if an Operation or sink Scheme wants the value as a string, by calling
tupleEntry.getString("date"), it will be automatically converted back to the proper date string.
Or to store a long value of the date string, the code can call
tupleEntry.setString("date", "28/Dec/2012:16:17:12:931 -0800"), resulting in
tupleEntry.getObject("date") instanceof Long is
CoercibleType isn’t a replacement for data-cleansing code that can handle contingencies in the data, but for data that is known to be clean, even data emitted from prior Cascading Flows, it is quite handy.
This opens up the door for more complex types that may have multiple representations. Consider a hypothetical
Person object that can be serialized as binary to disk, but has a JSON String representation, or has a Map Object in memory/runtime representation.
See conjars or the Concurrent site for downloads.