Cascading is a proven application development platform for building Big Data applications on Apache Hadoop. Whether solving simple or complex data problems, Cascading balances an optimal level of abstraction with the necessary degrees of freedom through a computation engine, a systems integration framework, and data processing and scheduling capabilities.

Uniquely, Cascading offers Hadoop development teams portability. As new compute fabrics emerge, teams need the ability to move existing applications without incurring the cost of a rewrite. With Cascading, changing a few lines of code is enough to port an application to another supported compute fabric. Today, Cascading applications run on, and can be ported between, MapReduce, Apache Tez and Apache Flink.

  • Build Data Applications that are Scale-free
    With Cascading, developers can build and test their applications locally, then deploy them at scale in production. The underlying query planner accounts for scale, so your applications run properly even when they encounter large data sets or small clusters.
  • Systems Integration
    Hadoop never lives alone. Easily build applications that integrate Hadoop with your existing legacy systems. Many community-supported projects allow your application to move data in and out of various sources (e.g. Elasticsearch, HBase, Cassandra, MongoDB, and more).
  • Resolve Staffing Bottlenecks
    You don’t need to become a MapReduce expert to build applications on top of Hadoop. Cascading’s Java API allows organizations to use Java to build robust data-driven applications, and Cascading’s extensions give users the ability to leverage SQL, Scala and data modeling skills.
  • Test-Driven
    Efficiently test code and process local files before you deploy on a cluster with Cascading’s local or in-memory mode. Incorporate inline data assertions to define results at any point in your pipeline, where failed assertions are available for analysis.
  • Application Portability
    Write once, then run on different computation fabrics. Business requirements change and data applications must stay flexible; applications written in Cascading are portable across any fabric that Cascading supports.
  • Reduced Operational Complexity
    Reduce the operational complexity required to get your application production-ready. With Cascading, the process is simple. Cascading applications are packaged into a single JAR file ready to hand over to operations.
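The test-driven bullet above mentions inline data assertions; here is a minimal sketch of one in Cascading's local mode (class and file names are illustrative, and the Cascading libraries must be on the classpath):

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.local.LocalFlowConnector;
import cascading.operation.AssertionLevel;
import cascading.operation.assertion.AssertNotNull;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.scheme.local.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.local.FileTap;
import cascading.tuple.Fields;

public class AssertingFlow {
  public static void main(String[] args) throws Exception {
    // tiny illustrative input file
    Files.write(Paths.get("events.txt"), Arrays.asList("alpha", "beta"));

    Tap source = new FileTap(new TextLine(new Fields("line")), "events.txt");
    Tap sink = new FileTap(new TextLine(new Fields("line")), "checked.txt", SinkMode.REPLACE);

    Pipe pipe = new Pipe("check");
    // a STRICT assertion stays in the plan in every mode; a VALID
    // assertion can be planned out of production flows
    pipe = new Each(pipe, AssertionLevel.STRICT, new AssertNotNull());

    Flow flow = new LocalFlowConnector(new Properties()).connect(source, sink, pipe);
    flow.complete(); // fails the flow if any tuple value is null
  }
}
```

A failed assertion fails the flow, so bad data is caught at the point in the pipeline where the assertion was placed rather than propagating downstream.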

Cascading Benefits

  • Quickly build robust, reliable, data-oriented applications
  • Eliminate compute fabric lock-in
  • Develop testable and reusable integrations, data processing code and algorithms
  • Leverage existing best practices, skill sets and tools


Cascading Features

  • Java API
    Cascading is a Java library and does not require installation. It fits directly into a standard development process; nothing extra is needed beyond using its APIs.
  • Data Processing API
    The data processing APIs define data processing flows, providing a rich set of capabilities that let you think in terms of the data and the business problem: sort, average, filter, merge, and so on.
  • Data Integration API
    The data integration API allows you to isolate your integration dependencies from your business logic. You can easily read/write from a variety of external systems to Hadoop, and then write those results to another system.
  • Scheduler API
    The Scheduler API schedules work from third-party applications. The Process Scheduler, coupled with the Riffle life-cycle annotations, allows Cascading to schedule units of work from any third-party application.
  • Query Process Planner
    Cascading’s physical planner automatically creates MapReduce, Apache Tez or Apache Flink jobs ready for processing on your cluster.
  • Taps and Schemes
    Taps and Schemes enable reading and writing between any source and sink, in any format. Cascading comes with several pre-built taps and schemes, and also gives you the flexibility to quickly build your own.
  • Standard Relational Operations
    Many common operations used in relational environments such as regular expression operations, Java expression operations, XML operations and logical filter operations are available in Cascading.
  • Scriptable Interface
    Any Java-compatible scripting language can import and instantiate Cascading classes, create pipe assemblies and flows, and execute those flows. Users can also create their own DSLs to handle common idioms.
  • Local mode / In-Memory mode
    On a single node, Cascading’s local mode can be used to efficiently test code and process local files before being deployed on a cluster. The built-in testability allows debugging before production deployment.
  • Dynamic Programming Languages
    The Cascading community has built dynamic programming languages on top of the Java API for greater productivity. There are several to choose from: Lingual (ANSI SQL), Pattern (PMML), Scalding (Scala), Cascalog (Clojure) and more!
  • Hadoop Support
    Cascading runs on all popular Hadoop distributions and Hadoop-as-a-service providers. We ensure that Cascading can run on-premise or in the cloud to meet your deployment needs.
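Several of the pieces above — the Java API, taps and schemes, local mode, and the query planner — come together in the canonical word-count flow. A minimal sketch (file names and the tiny sample input are illustrative; the Cascading libraries must be on the classpath):

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.local.LocalFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.local.TextDelimited;
import cascading.scheme.local.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.local.FileTap;
import cascading.tuple.Fields;

public class WordCount {
  public static void main(String[] args) throws Exception {
    // tiny illustrative input file
    Files.write(Paths.get("input.txt"),
        Arrays.asList("hadoop is elephant", "hadoop is yellow"));

    // taps and schemes isolate integration: where data lives, how it is formatted
    Tap source = new FileTap(new TextLine(new Fields("line")), "input.txt");
    Tap sink = new FileTap(new TextDelimited(new Fields("word", "count"), "\t"),
        "counts.tsv", SinkMode.REPLACE);

    // pipes hold the business logic: split, group, count
    Pipe pipe = new Pipe("wordcount");
    pipe = new Each(pipe, new Fields("line"),
        new RegexSplitGenerator(new Fields("word"), "\\s+"));
    pipe = new GroupBy(pipe, new Fields("word"));
    pipe = new Every(pipe, new Count(new Fields("count")));

    // local mode runs in-process; the planner produces the actual jobs, so
    // swapping the connector (and taps) retargets the same assembly at a cluster
    Flow flow = new LocalFlowConnector(new Properties()).connect(source, sink, pipe);
    flow.complete();
  }
}
```

The same pipe assembly, handed to a Hadoop, Tez or Flink flow connector with cluster taps, is what the portability claims above refer to.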



Division of Logic

Cascading allows you to develop your business logic separately from your integration logic. Develop complete applications and write unit tests without touching a single Hadoop API. It gives you the degrees of freedom to easily move through the application development life-cycle and separately deal with integrating existing systems.


Cascading provides a rich API that allows you to think in terms of data and business problems with capabilities such as sort, average, filter, merge, etc. The computation engine and process planner convert your business logic into efficient parallel jobs, delivering the optimal plan at run-time to your compute fabric of choice.
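One way to sketch this division of logic (class and method names are hypothetical): keep the business logic as a plain Pipe-to-Pipe method, and let taps supply the integration at the edges.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.local.LocalFlowConnector;
import cascading.operation.regex.RegexFilter;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.scheme.local.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.local.FileTap;
import cascading.tuple.Fields;

public class ErrorReport {
  // business logic: a pure Pipe -> Pipe transformation, no I/O, no Hadoop APIs
  static Pipe keepErrors(Pipe pipe) {
    // RegexFilter keeps matching tuples by default
    return new Each(pipe, new Fields("line"), new RegexFilter("^ERROR.*"));
  }

  public static void main(String[] args) throws Exception {
    // tiny illustrative input file
    Files.write(Paths.get("app.log"),
        Arrays.asList("INFO start", "ERROR disk full", "INFO stop"));

    // integration logic: taps decide where data comes from and goes to
    Tap source = new FileTap(new TextLine(new Fields("line")), "app.log");
    Tap sink = new FileTap(new TextLine(new Fields("line")), "errors.txt", SinkMode.REPLACE);

    Flow flow = new LocalFlowConnector(new Properties())
        .connect(source, sink, keepErrors(new Pipe("errors")));
    flow.complete();
  }
}
```

Because keepErrors(...) never touches a tap, the same assembly can be connected to cluster taps (e.g. cascading.tap.hadoop.Hfs) without changing the business logic, which is also what makes it straightforward to unit test.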

Systems Integration

Hadoop is never used alone and Cascading allows you to easily read and write from a variety of external systems to Hadoop and then write results to another system. The Cascading SDK comes with many pre-built and supported integrations, with many more provided by the community.


  • Who is Cascading for?
    Enterprise Development
    Cascading was designed to fit into any Enterprise development environment. With a clear separation between “data processing” and “data integration”, its clean Java API, and JUnit testing framework, Cascading can easily be tested and deployed at any scale.

    Data Science
    Because Cascading is Java-based, it naturally fits into JVM-based languages like Scala, Clojure, JRuby, Jython, and Groovy. In many of these languages, the Cascading community has created scripting and query languages that simplify ad hoc and production-ready analytics, as well as machine learning applications.
  • What are typical use cases for Cascading?
    Typical use cases for Cascading range from the complex (e.g. data processing/ETL applications) to the cutting-edge (e.g. geolocation and genomics).

    For a list, see Use Cases

    For specific examples from leading companies, check out our Case Studies
  • How is Cascading different from Pig and Hive?

    When it comes to application development, both Pig and Hive have shortcomings in handling complexity and testing. With Pig, easy problems are solved easily, but harder problems quickly become complicated. Hive’s language is not compliant with ANSI SQL standards, which makes interoperability with existing SQL systems challenging, and its planner is non-deterministic, so its behavior is difficult to predict compared with Cascading’s deterministic planner. Both Pig and Hive generally require significantly more code for integration and business logic; Cascading separates business logic from integration logic and has system integration capabilities built in. Finally, complex data workflows are notoriously difficult to build, and equally difficult to troubleshoot, in Pig and Hive, while Cascading allows developers to unit test and follow test-driven deployment best practices at scale.

    A feature comparison of Cascading, Pig and Hive covers the following dimensions:

      • Ad hoc queries
      • Complex workflows
      • Creating unit tests
      • Application portability
      • Pluggable data sources
      • Extensibility into other languages
      • Support for non-Hadoop platforms
      • No installation required