Cascading

Cascading is a proven application development platform for building data applications on Apache Hadoop. Whether the data problem is simple or complex, Cascading balances a high level of abstraction with the necessary degrees of freedom by combining a computation engine, a systems integration framework, and data processing and scheduling capabilities.

  • Build Enterprise Data Applications
    Cascading is designed from the ground up so you can easily test, maintain, deploy and manage your data applications as part of your standard business process. Leverage existing skill sets and tools to build comprehensive data applications on Hadoop.
  • Easily Extensible
    From dynamic programming languages like Scalding and Cascalog, to interoperability with tools like Jaspersoft and R, Cascading’s easily extensible framework supports a variety of extensions, tools, and other integrations.
  • Test-Driven Development
    Test code and process local files efficiently with Cascading’s local or in-memory mode before you deploy to a cluster. Add inline data assertions at any point in your pipeline to define the results you expect; failed assertions remain available for analysis.
  • Application Portability
    Write once, then run on different computation fabrics. Business requirements change constantly and data applications need to stay flexible, so applications written in Cascading are portable across any fabric that Cascading supports.
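
The inline data assertions mentioned under Test-Driven Development can be pictured with a small plain-Java sketch. This is an illustration only and does not use Cascading's API: the class `AssertionStageSketch`, the `Result` holder, and the `assertEach` method are hypothetical names chosen for this example. It shows one way a pipeline stage might check each record against a rule and set aside the records that fail so they can be examined later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch of an inline data assertion; not Cascading's API. */
final class AssertionStageSketch {

    /** Holds the records that passed the assertion and the ones that failed it. */
    static final class Result<T> {
        final List<T> passed = new ArrayList<>();
        final List<T> failed = new ArrayList<>();
    }

    /** Applies the assertion to every record and sets failures aside for analysis. */
    static <T> Result<T> assertEach(List<T> records, Predicate<T> assertion) {
        Result<T> result = new Result<>();
        for (T record : records) {
            if (assertion.test(record)) {
                result.passed.add(record);
            } else {
                result.failed.add(record);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A small in-memory data set stands in for a local test file.
        List<Integer> ages = Arrays.asList(34, -1, 27, 150);
        Result<Integer> result = assertEach(ages, age -> age >= 0 && age <= 120);
        System.out.println("passed: " + result.passed); // passed: [34, 27]
        System.out.println("failed: " + result.failed); // failed: [-1, 150]
    }
}
```

Because the stage runs entirely in memory, a check like this can be exercised in a unit test long before anything is deployed to a cluster.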

Cascading Benefits

  • Quickly build robust, reliable, data-oriented applications
  • Develop testable and reusable integrations, data processing code and algorithms
  • Leverage existing best practices, skill sets and tools

KEY FEATURES

  • Java API
    Cascading is a Java library and does not require installation. It fits directly into a standard development process; you don’t have to do anything beyond using its APIs.
  • Data Processing API
    The data processing APIs define data processing flows. They provide a rich set of capabilities, such as sort, average, filter, and merge, that let you think in terms of the data and the business problem.
  • Data Integration API
    The data integration API allows you to isolate your integration dependencies from your business logic. You can easily read data from a variety of external systems into Hadoop and then write the results to another system.
  • Scheduler API
    Scheduler APIs can schedule work from third-party applications. The Process Scheduler, coupled with the Riffle life-cycle annotations, allows Cascading to schedule units of work from any third-party application.
  • Process Planner
    Cascading’s physical planner automatically creates MapReduce jobs ready for processing on your cluster.
  • Taps and Schemes
    Taps and Schemes enable reading from and writing to any source, in any format. Cascading comes with several pre-built taps and schemes and also gives you the flexibility to quickly build your own.
  • Standard Relational Operations
    Many common operations used in relational environments, such as regular expression operations, Java expression operations, XML operations, and logical filter operations, are available in Cascading.
  • Scriptable Interface
    Any Java-compatible scripting language can import and instantiate Cascading classes, create pipe assemblies and flows, and execute those flows. Users can also create their own DSLs to handle common idioms.
  • Local mode / In-Memory mode
    On a single node, Cascading’s local mode can be used to efficiently test code and process local files before deploying to a cluster. The built-in testability allows debugging before production deployment.
  • Dynamic Programming Languages
    The Cascading community has built dynamic programming languages on top of the Java API for greater productivity. There are several to choose from: Lingual (ANSI SQL), Pattern (PMML), Scalding (Scala), Cascalog (Clojure) and more!
  • Hadoop Support
    Cascading runs on all popular Hadoop distributions and Hadoop-as-a-service providers. We ensure that Cascading can run on-premise or in the cloud to meet your deployment needs.
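
The division of labor described under Taps and Schemes, where a tap says where the data lives and a scheme says how it is formatted, can be sketched in plain Java. The sketch below is an assumption-laden illustration and not Cascading's own classes: `TapSchemeSketch`, `LineScheme`, `MemoryTap`, `delimited`, and `copy` are hypothetical names, and the "tap" is just an in-memory list of lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of the tap/scheme split; these are not Cascading's classes. */
final class TapSchemeSketch {

    /** A scheme describes the record format: how a line becomes fields and back. */
    interface LineScheme {
        String[] parse(String line);
        String format(String[] fields);
    }

    /** A delimited-text scheme, for example comma- or tab-separated. */
    static LineScheme delimited(String delimiter) {
        return new LineScheme() {
            @Override
            public String[] parse(String line) {
                // Quote the delimiter so it is treated literally, not as a regex.
                return line.split(Pattern.quote(delimiter), -1);
            }

            @Override
            public String format(String[] fields) {
                return String.join(delimiter, fields);
            }
        };
    }

    /** A tap describes where the data lives; here, an in-memory list of lines. */
    static final class MemoryTap {
        final List<String> lines = new ArrayList<>();
        final LineScheme scheme;

        MemoryTap(LineScheme scheme, String... initialLines) {
            this.scheme = scheme;
            this.lines.addAll(Arrays.asList(initialLines));
        }

        List<String[]> readAll() {
            List<String[]> records = new ArrayList<>();
            for (String line : lines) {
                records.add(scheme.parse(line));
            }
            return records;
        }

        void write(String[] fields) {
            lines.add(scheme.format(fields));
        }
    }

    /** Copies every record from one tap to another, converting the format on the way. */
    static void copy(MemoryTap from, MemoryTap to) {
        for (String[] fields : from.readAll()) {
            to.write(fields);
        }
    }

    public static void main(String[] args) {
        MemoryTap csv = new MemoryTap(delimited(","), "alice,30", "bob,25");
        MemoryTap tsv = new MemoryTap(delimited("\t"));
        copy(csv, tsv);
        tsv.lines.forEach(System.out::println);
    }
}
```

Keeping location and format as separate pieces means either one can be swapped without touching the other, which is the flexibility the feature description points to.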

THE SECRET SAUCE

WHAT MAKES CASCADING SO EFFECTIVE

Division of Logic

Cascading allows you to develop your business logic separately from your integration logic. Develop complete applications and write unit tests without touching a single Hadoop API. It gives you the degrees of freedom to easily move through the application development life-cycle and separately deal with integrating existing systems.
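
As a rough illustration of this separation, the plain-Java sketch below keeps a piece of business logic (a word count) apart from the code that reads input and writes output. It does not use Cascading's API; `DivisionOfLogicSketch`, `Source`, `Sink`, `countWords`, and `run` are hypothetical names chosen for the example. The point is only that the business logic can be unit tested with in-memory stand-ins, without touching any Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: business logic kept apart from integration code. Not Cascading's API. */
final class DivisionOfLogicSketch {

    /** Integration side: where records come from. */
    interface Source {
        List<String> read();
    }

    /** Integration side: where results go. */
    interface Sink {
        void write(Map<String, Integer> counts);
    }

    /** Business logic: counts words and knows nothing about files, clusters, or Hadoop. */
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Wires a source and a sink around the business logic. */
    static void run(Source source, Sink sink) {
        sink.write(countWords(source.read()));
    }

    public static void main(String[] args) {
        // In-memory source and sink stand in for real systems during a unit test.
        Source source = () -> Arrays.asList("to be or not", "to be");
        List<Map<String, Integer>> captured = new ArrayList<>();
        run(source, captured::add);
        System.out.println(captured.get(0)); // {be=2, not=1, or=1, to=2}
    }
}
```

Swapping the in-memory source and sink for ones backed by real systems would leave `countWords` and its tests unchanged.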

Think in Business Terms

Cascading provides a rich API, with capabilities such as sort, average, filter, and merge, that allows you to think in terms of data and business problems. The computation engine and process planner convert your business logic into efficient parallel jobs and deliver the optimal plan at run time to your Hadoop installation.

Systems Integration

Hadoop is never used alone. Cascading allows you to easily read data from a variety of external systems into Hadoop and then write the results to another system. The Cascading SDK comes with many pre-built and supported integrations, with many more provided by the community.

FAQ

  • Who is Cascading for?
    Enterprise Development
    Cascading was designed to fit into any enterprise development environment. With a clear separation between “data processing” and “data integration”, a clean Java API, and support for the JUnit testing framework, Cascading applications can easily be tested and deployed at any scale.

    Data Science
    Because Cascading is Java-based, it works naturally with JVM-based languages like Scala, Clojure, JRuby, Jython, and Groovy. In several of these languages, the Cascading community has created scripting and query languages that simplify ad hoc and production-ready analytics as well as machine learning applications.
  • What are typical use cases for Cascading?
    Typical use cases for Cascading include, but are not limited to: data processing/ETL, data aggregation, data discovery, marketing funnel analytics, customer engagement analysis, fraud detection, web crawlers, social recommender systems, retail pricing, climate analysis, geolocation, genomics, and a variety of other machine learning and optimization problems.
  • How is Cascading different from Pig and Hive?

    When it comes to application development, both Pig and Hive have shortcomings in dealing with complexity and testing, while Cascading applications are built to scale. With Pig, easy problems are easily solved, but harder problems become quite complicated. Hive’s language is not compliant with ANSI SQL standards, which makes it challenging to apply and to interoperate with existing SQL systems. Hive is also non-deterministic, which makes its behavior difficult to predict compared to Cascading’s deterministic planner. Both Pig and Hive generally require significantly more code for integration and for incorporating business logic, whereas Cascading separates business logic from integration logic and has system integration capabilities built in. Finally, with Pig and Hive it is notoriously difficult to build complex data workflows and equally hard to troubleshoot your data applications. Cascading allows developers to write unit tests and apply test-driven deployment best practices at scale.

    Features compared for Cascading, Pig, and Hive:
      • Ad hoc queries
      • Complex workflows
      • Create unit tests
      • Application portability
      • Pluggable data sources
      • Extensible into other languages
      • Support for non-Hadoop platforms
      • No installation required