News

Latest News & Updates

Cascading SDK 2.6

We are happy to announce that Cascading SDK 2.6 is now publicly available.

The Cascading SDK is a collection of tools, documentation, libraries, tutorials and example projects for the greater Cascading community.

http://cascading.org/sdk

What’s New:

  • Cascading 2.6 support for all tools (Lingual 1.2, Multitool 2.6, Load 2.6, Scalding, Cascalog)
  • Teradata Tap is now included in the Cascading-JDBC project
  • New and updated tutorials: Cobol copybook, ETL, Teradata, Redshift

For more details:

https://github.com/Cascading/CascadingSDK#cascading-26-sdk

Lingual 1.2

We are happy to announce that Lingual 1.2 is now publicly available.

The purpose of Lingual is to ease migration of SQL based workloads onto Hadoop, and to simplify integration with Hadoop through standards based APIs. Lingual provides an ANSI SQL interface on Cascading for Apache Hadoop.

Lingual 1.2 includes both critical bug fixes, updated to support Cascading 2.6, and improved support for Driven. It is highly recommended that users upgrade to this release.

http://www.cascading.org/downloads

What’s New:

  • Fixed issue where c.l.f.SQLPlanner could execute a flow instead of just planning it
  • Fixed issue in shell wrapper where a previously set LINGUAL_CLASSPATH would be lost and therefore the cascading-service.properties file could not be loaded by setting LINGUAL_CLASSPATH before running lingual shell
  • Several changes to reduce memory footprint and help GC in long running processes with many flows

For more details:

https://github.com/Cascading/lingual

Cascading 2.6

We are happy to announce that Cascading 2.6 is now publicly available for download.

This release contains new features and bug fixes.

Of note are the new DecoratorTap and DistCacheTap (itself a DecoratorTap sub-class) classes. Working together, Flows can cache data directly into the Hadoop distributed cache when accumulating data for a HashJoin.

And for Driven users, new Java annotations allow for additional meta-data to be sent to the Driven UI when visualizing assemblies. For example, the ‘expression’ in the ExpressionFilter will be visible in the Driven 1.1 EAP.

http://cascading.org/downloads/

What’s New in Cascading 2.6:

  • Added c.t.DecoratorTap class to simplify wrapping a given c.t.Tap instance with additional meta-data.
  • Added c.t.h.DistCacheTap a decorator for a c.t.h.Hfs instance that uses o.a.h.f.DistributedCache to read files transparently from local disk. This is useful for c.p.HashJoins.
  • Added new Java Annotations for managing the visibility of data sent to a remote management service.
  • Updated Apache Hadoop to 2.4.1 in cascading-hadoop2-mr1
  • Updated c.p.a.AggregateBy and c.p.a.Unique to count cache flushes, hits, and misses. Previously only AggregateBy tracked cache flushes.
  • Fixed issue where c.p.a.AggregateBy and c.p.a.UniqueBy were not honoring the c.t.Hasher interface

For more details on new features and resolved issues see:
https://github.com/Cascading/cascading/blob/2.6/CHANGES.txt

The Cascading 3.0 Query Planner

Cascading 1.0 when released, represented a huge milestone. An enterprise friendly Java API, not a syntax, and fail fast planner allowed developers to build robust, maintainable, data-oriented applications that could execute reliably on Apache Hadoop for hours or days.

Cascading 2.0 made a nod towards cluster platform independence with the addition of an in memory ‘local’ mode that proved applications could be built without direct API access to the underlying platform. This provided much needed isolation across the evolving Hadoop platform and improved developer experience. Write once, run on any vendor provided platform.

Cascading 3.0 proves we can abstract away the underlying platform by providing additional platforms users can execute against, not just MapReduce. Apache Tez is the first such addition. Tez is the latest parallel computation fabric from the Apache group. It builds on Hadoop, keeping the Hadoop File System for instance, but uses a model that removes inefficiencies that are baked into the MapReduce model.

But we should point out the Cascading 3.0 abstraction is not just over the platform APIs, but also the semantics of the given platform like MapReduce or a DAG like model (like the one provided by Apache Tez). This is achieved by the Cascading “query planner”, it maps from one model to another.

Another way to restate this is that the Cascading query planner is not ‘mapping’ MapReduce onto other models, but mapping the Cascading model directly onto the underlying platform model, be it MapReduce or a DAG representation.

This makes the Cascading object model and their associated semantics an Intermediate Representation (or Intermediate Language) for data parallel cluster computing. Where new models can be created on top, over Cascading, and new ones can be mapped below, underneath Cascading. All the while retaining the qualities of both. LLVM provides this functionality to the compiler world. For example, Clang’s C, C++, and Objective-C; the Julia language (a technical computing language); and the Swift language from Apple all rely on the LLVM provided Intermediate Representation (IR).

To keep things less ivory tower, we usually call Cascading a framework.

But in practice, Cascading as an IR is quite common. Cascading is the basis for Lingual, Pattern, Scalding, and Cascalog, each of which are higher order syntaxes or languages that are feature rich, reliable, and very useful unto themselves. Cascading is also being embedded in a number of commercial products that expose a different (more vertical) model to the end user.

How is this done? In large part, it’s due to the brand new query planner we have spent the balance of this year developing and testing. First with our ‘local’ mode and the MapReduce model, which were made available earlier this year in a Cascading 3.0 WIP release. And during the last couple of months, adding more degrees of freedom to support a model with, well, more degrees.

We could have just used the MapReduce planner directly on Tez, but the point of the DAG model in Tez is to remove the inefficiencies inherent to MapReduce. For example, no more ‘identity mapper’; added support for multiple outputs; no prefixing data with join ordinality; suppressing sorting when not needed; removing HDFS as an intermediate store between jobs; etc.

Over the years we realized we need to rethink how the prior planners mapped Cascading into new platform models, but more importantly, how we can move control of adding new Cascading primitives, optimizations, or whole new platforms back to the Cascading user. The answer was somewhat obvious when looking at similar systems — create a rule language users can use to create rules and to create new rule sets.

But there are some reasonably complicated difficulties we had to address. Recognizing that what a Cascading user creates is itself a DAG (Directed Acyclic Graph) of elements that the user would like to execute remotely is key.

From there, the key difficulties we had to address boiled down to:

  • Allowing rules to be abstract enough so that they can target patterns of nodes and edges in the user DAG without having to account for all possible permutations
  • Isolating the structural and compatibility testing of the target DAG from the structural transformation of the DAG
  • Allow for high levels of code reusability
  • Allow rules to offer syntactic context on errors
  • Allow rules to leverage all available meta-data, including the structural qualities of the DAG itself
  • Present a simple ‘language’ that allows users to develop custom rules

Most of the above translates to having “rules” that can assert a DAG is a valid structure, transform a DAG into a new DAG, or partition a DAG into sub-DAGs.

In terms of MapReduce, we want to define rules that verify aggregations always follow a grouping; identify and group “map” side and “reduce” side elements; insert temporary files between the “reduce” side elements and the “map” side elements; break the graph up into physical jobs.

The root of the solution is to rely on a way to match sub-DAGs (or sub-graphs) within a larger graph. That is, we need a regular expression like “language”, not for strings and text, but for graphs of nodes and edges. In the literature this is called Sub-Graph Isomorphism Matching.

So we created what we call the Expression Graph API, a basis for building higher order complex logic allowing us to reason about or manipulate large complex graphs.

Fundamentally what we have is an API that allows a ‘match graph’ (think regular expression string) to be constructed that then can be applied to a larger graph to see if there is a match or, more generally useful, similarity between both structures. And if so, we can then perform an Assertion, Transformation, or Partition against the matched elements.

Consider the expression graph below, a simple DAG of two elements with some meta-data:

Cascading 3.0 Query Planner - IMG_1

It will match the following elements in gray and not those with a dotted outline.

Cascading 3.0 Query Planner - IMG_2

The above expression graph knows it is looking for a circle that is also a ‘join’ between two incoming elements. If there is no ‘join’, then don’t match.

But the world isn’t that simple. We had to go further to allow for matching of sub-graphs of distinguished elements.

That is, some elements in a graph can be ignored in a given case, but not in others. Thus the elements that matter, the distinguished elements, must be the only ones that match the graph expression. But the transformation must be applied to the full sub-graph bound by any matched distinguished elements.

Consider the following expression graph, a circle or a triangle will match:

Cascading 3.0 Query Planner - IMG_3

It will match the gray elements in the following graph:

Cascading 3.0 Query Planner - IMG_4

These would be the distinguished elements. Matching them allows us to remove all the other elements, or actually, mask them from subsequent matches, resulting in a contracted graph of the form:

Cascading 3.0 Query Planner - IMG_5

Now we can apply a new expression graph (not shown) to find our target sub-graph of distinguished elements, the result is as follows:

Cascading 3.0 Query Planner - IMG_6

If we isolate the new sub-graph, above in gray, we can expand it to include it original intermediate nodes and edges, that is, unmask previously masked elements.

Cascading 3.0 Query Planner - IMG_7

Finally we can provide yet another expression graph to assert or transform this final graph within the original full graph. In the graph above, we could provide a new expression graph looking for a square with horizontal lines, and if found we an insert a new element after it, along the dotted edge.

Alternately, we can extract the new sub-graph from the full original graph effectively letting us partition the larger graph into smaller ones.

We have only touched on the basics, but the language is still evolving while already quite powerful.

One last value of having assertions, transformations, and partitioning implemented as basic building blocks for re-usability; users can create new “rule sets” to optimize specific applications, or create a set of “rule sets” with a voting strategy to choose the best resulting execution plan based on cost or some other metric.

Cascading 3.0 with Apache Tez support is out now (see our announcement), using the new query planner to its fullest extent, please give it a spin and keep watching our site for announcements for new features in 3.0. You can find the latest Cascading 3.0 WIP release here. For more details about the Apache Tez developer release, see the announcement from Hortonworks.

Cascading 3.0 WIP Now Supports Apache Tez

We are happy to announce that the latest Cascading 3.0 WIP now adds Apache Tez as a supported runtime platform. We are making this release available so interested parties can begin testing Tez deployments against existing Cascading applications.

A downloadable version of Cascading 3.0 WIP and its documentation are available at:

http://www.cascading.org/wip

For more information about Cascading on Apache Tez:

Get notified when Cascading 3.0 support for Apache Tez is final (if you’re not already in the loop):

http://www.cascading.org/new-fabric-support

For more details about the Apache Tez developer release, see the announcement from Hortonworks:

http://hortonworks.com/blog/introducing-apache-tez-0-5

Fluid — A Fluent API for Cascading

We have announced our new Fluid project and need your help in testing it out.

Fluid is an API library exposing the Cascading library as a Fluent API. Fluid’s primary goal is not only to make hard things possible, but also to keep simple things simple. Fluid generates the builders for every current Cascading release (even those in WIP).

The project is currently a work in progress and we are looking for feedback on the API before we make it 1.0.

IntelliJ Plugin for Cascading

We have published an initial IntelliJ plugin for Cascading designed to improve the experience of developing data-oriented applications in modern IDEs.

The first version of the plugin allows developers to quickly visualize and debug their Cascading code with Driven when developing with IntelliJ.

Also, we are planning on improving this plugin in future releases and are maintaining a to-do list on GitHub

If interested in contributing, please open an issue on the project to discuss, fork, and submit a pull request.

Driven 1.0

We are happy to announce that Driven 1.0 is now generally available. Driven provides developers with operational visibility for Cascading applications, including those built with Cascading dynamic programming languages (i.e. Scalding, Cascalog, Lingual, Pattern, etc.). Existing users should also make sure they have the latest version of the plugin (see link below).

Driven Visualization

Key Features:

  • Visualize your Data App
  • Diagnose App Failures
  • Optimize App Performance
  • Collaborate in Teams
  • Track Application History

Get started with the latest Driven plugin:

http://docs.cascading.io/driven/1.0/getting-started

Learn more about Driven with our new screencasts:

http://cascading.io/driven/#Get_Started

Cascading-Hive

We are happy to announce a new open source project that we have been working on: Cascading-Hive. Now, you can use Cascading and Apache Hive together.

Key Features:

  • Run Hive queries within a Cascade
  • Read from Hive tables within a Cascading Flow
  • Write/create Hive tables from a Cascading Flow
  • Write/create partitioned Hive tables from a Cascading Flow
  • Deconstruct a Hive view into Taps

Cascading-Hive works with Cascading 2.5.4+ and Apache Hive 0.10-0.13 on Hadoop 1.x and Hadoop 2.x. You can find more information on the GitHub project:

http://github.com/cascading/cascading-hive

Driven Updates – App Performance, Teams

We pushed out a new Driven update this week with a some major enhancements:

  • Graphical view of performance metrics at the step level for each application
  • Team management capabilities to allow team members to view the same applications
  • Numerous improvements to graph rendering and visualization

Performance View

To get started with the plugin:

If you have any issues or questions, drop us a note on the Driven Forums: