Check out the latest Cascading & Driven news, updates, events, and useful resources such as tutorials, extensions, and more. In this issue, get the latest update on Cascading 3.0—which is coming soon; Learn more about Maestro which is built on Scalding; Writing to Hbase using Scalding; Read how LiveRamp performs Transitive Elimination for large-scale problems; Get an introduction to Apache Tez via the Warsaw HUG slideshare presentation; Pre-order “Learning Cascading” from basics to advanced topics and links to tutorials, extensions and more… Continue reading
Latest News & Updates
We are happy to announce that Cascading Fluid 1.0 is now publicly available.
As a fluent API, Java IDEs, like IntelliJ IDEA and Eclipse, will auto-suggest the next API call based on the prior method call. Only methods that would logically be next in the chain will be suggested. This lowers the burden on new Cascading developers who wish to rapidly create data-processing applications on Apache Hadoop.
The Fluid API is generated directly from Cascading compiled libraries and supports all currently supported Cascading final and WIP releases, including Cascading 3.0 WIP which provides support for Apache Tez.
Current release Java docs can be found here:
To see Fluid in action, checkout part 6 of the Cascading for the Impatient series which has been ported to Fluid:
The source to the complete ported Impatient series:
We are happy to announce that Cascading SDK 2.6 is now publicly available.
The Cascading SDK is a collection of tools, documentation, libraries, tutorials and example projects for the greater Cascading community.
- Cascading 2.6 support for all tools (Lingual 1.2, Multitool 2.6, Load 2.6, Scalding, Cascalog)
- Teradata Tap is now included in the Cascading-JDBC project
- New and updated tutorials: Cobol copybook, ETL, Teradata, Redshift
For more details:
We are happy to announce that Lingual 1.2 is now publicly available.
The purpose of Lingual is to ease migration of SQL based workloads onto Hadoop, and to simplify integration with Hadoop through standards based APIs. Lingual provides an ANSI SQL interface on Cascading for Apache Hadoop.
Lingual 1.2 includes both critical bug fixes, updated to support Cascading 2.6, and improved support for Driven. It is highly recommended that users upgrade to this release.
- Fixed issue where c.l.f.SQLPlanner could execute a flow instead of just planning it
- Fixed issue in shell wrapper where a previously set LINGUAL_CLASSPATH would be lost and therefore the cascading-service.properties file could not be loaded by setting LINGUAL_CLASSPATH before running lingual shell
- Several changes to reduce memory footprint and help GC in long running processes with many flows
For more details:
We are happy to announce that Cascading 2.6 is now publicly available for download.
This release contains new features and bug fixes.
Of note are the new DecoratorTap and DistCacheTap (itself a DecoratorTap sub-class) classes. Working together, Flows can cache data directly into the Hadoop distributed cache when accumulating data for a HashJoin.
And for Driven users, new Java annotations allow for additional meta-data to be sent to the Driven UI when visualizing assemblies. For example, the ‘expression’ in the ExpressionFilter will be visible in the Driven 1.1 EAP.
What’s New in Cascading 2.6:
- Added c.t.DecoratorTap class to simplify wrapping a given c.t.Tap instance with additional meta-data.
- Added c.t.h.DistCacheTap a decorator for a c.t.h.Hfs instance that uses o.a.h.f.DistributedCache to read files transparently from local disk. This is useful for c.p.HashJoins.
- Added new Java Annotations for managing the visibility of data sent to a remote management service.
- Updated Apache Hadoop to 2.4.1 in cascading-hadoop2-mr1
- Updated c.p.a.AggregateBy and c.p.a.Unique to count cache flushes, hits, and misses. Previously only AggregateBy tracked cache flushes.
- Fixed issue where c.p.a.AggregateBy and c.p.a.UniqueBy were not honoring the c.t.Hasher interface
For more details on new features and resolved issues see:
Cascading 1.0 when released, represented a huge milestone. An enterprise friendly Java API, not a syntax, and fail fast planner allowed developers to build robust, maintainable, data-oriented applications that could execute reliably on Apache Hadoop for hours or days.
Cascading 2.0 made a nod towards cluster platform independence with the addition of an in memory ‘local’ mode that proved applications could be built without direct API access to the underlying platform. This provided much needed isolation across the evolving Hadoop platform and improved developer experience. Write once, run on any vendor provided platform.
Cascading 3.0 proves we can abstract away the underlying platform by providing additional platforms users can execute against, not just MapReduce. Apache Tez is the first such addition. Tez is the latest parallel computation fabric from the Apache group. It builds on Hadoop, keeping the Hadoop File System for instance, but uses a model that removes inefficiencies that are baked into the MapReduce model.
But we should point out the Cascading 3.0 abstraction is not just over the platform APIs, but also the semantics of the given platform like MapReduce or a DAG like model (like the one provided by Apache Tez). This is achieved by the Cascading “query planner”, it maps from one model to another.
Another way to restate this is that the Cascading query planner is not ‘mapping’ MapReduce onto other models, but mapping the Cascading model directly onto the underlying platform model, be it MapReduce or a DAG representation.
This makes the Cascading object model and their associated semantics an Intermediate Representation (or Intermediate Language) for data parallel cluster computing. Where new models can be created on top, over Cascading, and new ones can be mapped below, underneath Cascading. All the while retaining the qualities of both. LLVM provides this functionality to the compiler world. For example, Clang’s C, C++, and Objective-C; the Julia language (a technical computing language); and the Swift language from Apple all rely on the LLVM provided Intermediate Representation (IR).
To keep things less ivory tower, we usually call Cascading a framework.
But in practice, Cascading as an IR is quite common. Cascading is the basis for Lingual, Pattern, Scalding, and Cascalog, each of which are higher order syntaxes or languages that are feature rich, reliable, and very useful unto themselves. Cascading is also being embedded in a number of commercial products that expose a different (more vertical) model to the end user.
How is this done? In large part, it’s due to the brand new query planner we have spent the balance of this year developing and testing. First with our ‘local’ mode and the MapReduce model, which were made available earlier this year in a Cascading 3.0 WIP release. And during the last couple of months, adding more degrees of freedom to support a model with, well, more degrees.
We could have just used the MapReduce planner directly on Tez, but the point of the DAG model in Tez is to remove the inefficiencies inherent to MapReduce. For example, no more ‘identity mapper’; added support for multiple outputs; no prefixing data with join ordinality; suppressing sorting when not needed; removing HDFS as an intermediate store between jobs; etc.
Over the years we realized we need to rethink how the prior planners mapped Cascading into new platform models, but more importantly, how we can move control of adding new Cascading primitives, optimizations, or whole new platforms back to the Cascading user. The answer was somewhat obvious when looking at similar systems — create a rule language users can use to create rules and to create new rule sets.
But there are some reasonably complicated difficulties we had to address. Recognizing that what a Cascading user creates is itself a DAG (Directed Acyclic Graph) of elements that the user would like to execute remotely is key.
From there, the key difficulties we had to address boiled down to:
- Allowing rules to be abstract enough so that they can target patterns of nodes and edges in the user DAG without having to account for all possible permutations
- Isolating the structural and compatibility testing of the target DAG from the structural transformation of the DAG
- Allow for high levels of code reusability
- Allow rules to offer syntactic context on errors
- Allow rules to leverage all available meta-data, including the structural qualities of the DAG itself
- Present a simple ‘language’ that allows users to develop custom rules
Most of the above translates to having “rules” that can assert a DAG is a valid structure, transform a DAG into a new DAG, or partition a DAG into sub-DAGs.
In terms of MapReduce, we want to define rules that verify aggregations always follow a grouping; identify and group “map” side and “reduce” side elements; insert temporary files between the “reduce” side elements and the “map” side elements; break the graph up into physical jobs.
The root of the solution is to rely on a way to match sub-DAGs (or sub-graphs) within a larger graph. That is, we need a regular expression like “language”, not for strings and text, but for graphs of nodes and edges. In the literature this is called Sub-Graph Isomorphism Matching.
So we created what we call the Expression Graph API, a basis for building higher order complex logic allowing us to reason about or manipulate large complex graphs.
Fundamentally what we have is an API that allows a ‘match graph’ (think regular expression string) to be constructed that then can be applied to a larger graph to see if there is a match or, more generally useful, similarity between both structures. And if so, we can then perform an Assertion, Transformation, or Partition against the matched elements.
Consider the expression graph below, a simple DAG of two elements with some meta-data:
It will match the following elements in gray and not those with a dotted outline.
The above expression graph knows it is looking for a circle that is also a ‘join’ between two incoming elements. If there is no ‘join’, then don’t match.
But the world isn’t that simple. We had to go further to allow for matching of sub-graphs of distinguished elements.
That is, some elements in a graph can be ignored in a given case, but not in others. Thus the elements that matter, the distinguished elements, must be the only ones that match the graph expression. But the transformation must be applied to the full sub-graph bound by any matched distinguished elements.
Consider the following expression graph, a circle or a triangle will match:
It will match the gray elements in the following graph:
These would be the distinguished elements. Matching them allows us to remove all the other elements, or actually, mask them from subsequent matches, resulting in a contracted graph of the form:
Now we can apply a new expression graph (not shown) to find our target sub-graph of distinguished elements, the result is as follows:
If we isolate the new sub-graph, above in gray, we can expand it to include it original intermediate nodes and edges, that is, unmask previously masked elements.
Finally we can provide yet another expression graph to assert or transform this final graph within the original full graph. In the graph above, we could provide a new expression graph looking for a square with horizontal lines, and if found we an insert a new element after it, along the dotted edge.
Alternately, we can extract the new sub-graph from the full original graph effectively letting us partition the larger graph into smaller ones.
We have only touched on the basics, but the language is still evolving while already quite powerful.
One last value of having assertions, transformations, and partitioning implemented as basic building blocks for re-usability; users can create new “rule sets” to optimize specific applications, or create a set of “rule sets” with a voting strategy to choose the best resulting execution plan based on cost or some other metric.
Cascading 3.0 with Apache Tez support is out now (see our announcement), using the new query planner to its fullest extent, please give it a spin and keep watching our site for announcements for new features in 3.0. You can find the latest Cascading 3.0 WIP release here. For more details about the Apache Tez developer release, see the announcement from Hortonworks.
We are happy to announce that the latest Cascading 3.0 WIP now adds Apache Tez as a supported runtime platform. We are making this release available so interested parties can begin testing Tez deployments against existing Cascading applications.
A downloadable version of Cascading 3.0 WIP and its documentation are available at:
For more information about Cascading on Apache Tez:
- Cascading on Apache Tez Notes: https://github.com/cwensel/cascading/tree/wip-3.0/cascading-hadoop2-tez
- Sample Applications: https://github.com/Cascading/cascading.samples/tree/wip-3.0
Get notified when Cascading 3.0 support for Apache Tez is final (if you’re not already in the loop):
For more details about the Apache Tez developer release, see the announcement from Hortonworks: