Groovy DSL Scripting
Cascading can be used from any JVM-based scripting language. This page documents one particular Domain Specific Language (DSL), written in Groovy, for creating Cascading applications. Unfortunately, applications written with this DSL have proven harder to debug, so we recommend scripting Cascading applications against the API directly, which is quite simple, if not simpler.
This Groovy DSL has since been moved to the Cascading Modules page and is now maintained separately from Cascading Core. To download and build it, visit the Modules page.
With the Cascading.groovy DSL module, users can create sophisticated data processing applications using the Groovy scripting language.
This DSL was designed for groups that need to expose Hadoop to the 'casual' user: someone who needs to get and manipulate valuable data on an organization's Hadoop cluster, but who may not have the time to learn Java, the Hadoop API, or how to think in MapReduce in order to solve their problems and access data.
No Groovy code runs in the cluster; Groovy is used only as a means to build complex Cascading Flows and Cascades. Think of a script as an Ant build file: Ant is configured by XML build files that represent the internal graph of tasks the Ant tool needs to execute. We find XML tedious for such applications and consider Groovy a better alternative.
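To make the build-file comparison concrete, here is a minimal, self-contained Groovy sketch of how a builder-style script can record a plan instead of doing the work itself. The PlanRecorder class and the describe method are hypothetical names invented for this illustration. They are not part of Cascading.groovy, whose builder is considerably richer and produces real Flows rather than strings.

```groovy
// Hypothetical recorder, not part of Cascading.groovy: every unknown
// method call made inside the closure is stored as a step, not executed.
class PlanRecorder {
    List<String> steps = []

    def methodMissing(String name, args) {
        def rendered = args.collect { it.toString() }.join(', ')
        steps << "${name}(${rendered})".toString()
        return null
    }
}

// Runs the closure with a recorder as its delegate and returns the steps.
def describe(Closure body) {
    def recorder = new PlanRecorder()
    body.delegate = recorder
    body.resolveStrategy = Closure.DELEGATE_FIRST
    body()
    recorder.steps
}

// The step names below only mimic the look of the DSL; nothing is processed.
def plan = describe {
    source("input.txt")
    tokenize(/\s+/)
    group()
    count()
    sink("output")
}

plan.each { println it }
```

Running the script only prints the recorded step names. No data is read or written, which is the sense in which the Groovy code describes the work rather than performing it.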
Here is what the canonical 'word count' example looks like in Cascading.groovy syntax:
Flow flow = builder.flow("wordcount") {
  // specify the input file, and its scheme, text
  // the text scheme reads and processes one line at a time
  source(input, scheme: text())
  // break up each line into words
  tokenize(/[.,]*\s+/)
  // group each unique word
  group()
  // count the size of each unique word group
  // stuff the results in a new field named 'count'
  count()
  // sort the results in reverse (descending) order
  // by the 'count' value
  group(["count"], reverse: true)
  // write out the results, same scheme as the source
  // here we delete the sink if it exists already
  sink(output)
}
Word count examples in the wild typically don't sort the results, but since sorting takes only one additional line, we added it.
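As an illustration of what the flow above computes, the following plain Groovy sketch applies the same steps (tokenize, group, count, sort descending) to a small in-memory list. It does not use Cascading or Hadoop, and it says nothing about how Cascading executes the flow. The sample lines and variable names are made up for the example; only the tokenizing pattern is taken from the flow.

```groovy
// In-memory illustration only; nothing here touches Cascading or Hadoop.
def lines = [
    "the quick brown fox",
    "the lazy dog, the fox"
]

// tokenize: split each line using the same pattern as the flow
def words = lines.collectMany { line -> line.split(/[.,]*\s+/).toList() }

// group and count: one entry per unique word, valued by its frequency
def counts = words.countBy { it }

// sort in descending order of count
def sorted = counts.sort { a, b -> b.value <=> a.value }

sorted.each { word, count -> println "${word}\t${count}" }
```

Note that the pattern only strips periods and commas that are followed by whitespace, so punctuation at the very end of a line would stay attached to the last word.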
The Cascading.groovy extension provides a Groovy builder for assembling Cascading pipe assemblies, tap maps, flows, and cascades. It allows the script writer to be as formal or informal as the situation merits. The above example is brief and informal because the problem at hand is trivial.
But there are cases where the arrangement of sinks and sources is complex, the data processing includes many splits and joins, the discrete processing routines need to be organized into many logical groupings, and the number of routines becomes difficult to manage and keep available for reuse. In these cases, a more verbose and formal representation may be necessary.
One possible expansion of the informal form above can be seen here:
def assembly = builder.assembly(name: "wordcount") {
  eachTuple(args: ["line"], results: ["word"]) {
    regexSplitGenerator(declared: ["word"], pattern: /[.,]*\s+/) // tokenize is an alias
  }
  group(["word"])
  everyGroup(args: ["word"], results: ["word", "count"]) {
    count()
  }
  group(["count"], reverse: true)
}
def map = builder.map() {
  source(name: "wordcount") {
    hfs(input) {
      text(["line"])
    }
  }
  sink(name: "wordcount") {
    hfs(output) {
      text()
    }
  }
}
Flow flow = builder.flow(name: "wordcount", map: map, assembly: assembly)
Those more familiar with the Cascading Java API will find this form more natural, but should still see the benefits of the more abbreviated one.
For a much more complex example, see our implementation of the Wide Finder 2 benchmark in the git repository. Note that this example uses stream assertions to validate the data in the stream.
Having 'two' formats may seem confusing, but the intent of this builder is to allow for simple things to be simple. As the writer adds more complexity to their application, the Cascading builder is in a position to absorb it.
Getting started is simple. It only requires that HADOOP_HOME be set in the user's environment, that Groovy be installed, and that Cascading.groovy be downloaded and installed via the ./cascading.groovy/setup.groovy script file.
Once installed, users can execute any of the samples in the ./cascading.groovy/samples directory via the groovy interpreter.
On startup, Hadoop is automatically added to the Groovy interpreter's CLASSPATH, as determined by the current HADOOP_HOME and, optionally, HADOOP_CONF environment variables.
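Because everything depends on these environment variables, it can help to check them before running a sample. The following is a small, hedged sketch of such a check in plain Groovy. The checkHadoopEnv method is a hypothetical helper written for this page; it does not reproduce the logic of setup.groovy or of the Cascading.groovy startup code, and it assumes only what is stated above: HADOOP_HOME is required and HADOOP_CONF is optional.

```groovy
// Hypothetical helper, not part of Cascading.groovy.
// Returns a list of human-readable problems; an empty list means the
// variables named in this section look usable.
List<String> checkHadoopEnv(Map<String, String> env) {
    def problems = []
    def home = env['HADOOP_HOME']
    if (!home) {
        problems << 'HADOOP_HOME is not set'
    } else if (!new File(home).isDirectory()) {
        problems << "HADOOP_HOME is not a directory: ${home}".toString()
    }
    // HADOOP_CONF is optional, so it is validated only when present
    def conf = env['HADOOP_CONF']
    if (conf && !new File(conf).isDirectory()) {
        problems << "HADOOP_CONF is not a directory: ${conf}".toString()
    }
    problems
}

def found = checkHadoopEnv(System.getenv())
if (found) {
    found.each { System.err.println it }
} else {
    println 'Hadoop environment looks usable'
}
```

Passing the environment in as a map, rather than reading it inside the method, keeps the check easy to exercise with made-up values.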
See the API doc for the cascading.groovy package for details on the supported syntax and operations.