<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>Cascading</title>
    <link rel="alternate" type="text/html" href="http://www.cascading.org/" />
    <link rel="self" type="application/atom+xml" href="http://www.cascading.org/atom.xml" />
    <id>tag:www.cascading.org,2008-04-05://2</id>
    <updated>2008-06-17T20:10:18Z</updated>
    
    <generator uri="http://www.sixapart.com/movabletype/">Movable Type Open Source 4.1</generator>

<entry>
    <title>Cascading.groovy 0.2.0 Released</title>
    <link rel="alternate" type="text/html" href="http://www.cascading.org/2008/06/cascadinggroovy-020-released.html" />
    <id>tag:www.cascading.org,2008://2.410</id>

    <published>2008-06-17T20:10:18Z</published>
    <updated>2008-06-17T20:10:18Z</updated>

    <summary>We are pleased to announce that the 0.2.0 release of Cascading.groovy, our Groovy language interpreter extension, is available for download. This release makes some minor additions to the base DSL syntax and adds support for the new Cascading features stream assertions...</summary>
    <author>
        <name></name>
        <uri>http://chris.wensel.net/</uri>
    </author>
    
        <category term="News" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en" xml:base="http://www.cascading.org/">
        <![CDATA[We are pleased to announce that the 0.2.0 release of Cascading.groovy, our Groovy language interpreter extension, is available for <a href="http://code.google.com/p/cascading/downloads/list">download</a>. This release makes some minor additions to the base DSL syntax and adds support for the new Cascading features, stream assertions and traps, providing for highly fault-tolerant, scriptable data processing applications.]]>
        A companion release, Cascading 0.6.1, is also available. It contains no significant changes.
    </content>
</entry>

<entry>
    <title>Cascading 0.6.0 Released</title>
    <link rel="alternate" type="text/html" href="http://www.cascading.org/2008/06/cascading-060-released.html" />
    <id>tag:www.cascading.org,2008://2.408</id>

    <published>2008-06-12T17:50:21Z</published>
    <updated>2008-06-12T17:50:21Z</updated>

    <summary>Version 0.6.0 of Cascading is now available for download. For details on new features and bug fixes, see the CHANGES.txt file. For a quick summary, read on....</summary>
    <author>
        <name></name>
        <uri>http://chris.wensel.net/</uri>
    </author>
    
        <category term="News" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en" xml:base="http://www.cascading.org/">
        <![CDATA[<p>Version 0.6.0 of Cascading is now available for <a href="http://code.google.com/p/cascading/downloads/list">download</a>. For details on new features and bug fixes, see the <a href="http://code.google.com/p/cascading/source/browse/tags/cascading-0.6.0/CHANGES.txt">CHANGES.txt</a> file. For a quick summary, read on.</p>]]>
        <![CDATA[<p>This release provides two major features: Stream Assertions and Trap Taps.</p>

<p>Stream Assertions are used much like the Java language <code>assert</code> statement.</p>

<p>As the developer assembles more complex assemblies, it makes sense to inline assertions on the data expected in the stream. Assertions can test that a given source is clean, or verify that certain functions, filters, or aggregators are working as expected.</p>

<p>Assertions can be applied in two scopes, Strict or Validating. Strict assertions make sense as regression or unit style tests, and validating can be used as sanity checks during staging or production. </p>

<p>When a given assembly is planned into a Flow using the FlowConnector, unwanted assertions can be planned out completely so they offer no performance penalty. So re-usable assemblies can have loads of assertions internally, but they won't translate into any overhead if unwanted during runtime.</p>
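
<p>The planning step can be pictured with a small standalone Java sketch. It does not use the Cascading API: the <code>Level</code> enum, its ordering, and <code>isPlannedIn</code> are hypothetical names, and the rule that a stricter plan also keeps the Validating assertions is an assumption made only for illustration.</p>

<pre>
```java
// Standalone sketch; none of these names come from the Cascading API.
final class AssertionPlanningSketch {
    // Hypothetical levels, ordered from "no assertions" to "all assertions".
    enum Level { NONE, VALID, STRICT }

    // Assumed rule: an assertion stays in the plan only when the level chosen
    // at planning time is at least as strict as the assertion's own level.
    static boolean isPlannedIn(Level assertionLevel, Level plannedLevel) {
        return plannedLevel.compareTo(assertionLevel) >= 0;
    }

    public static void main(String[] args) {
        for (Level planned : Level.values()) {
            System.out.println(planned
                + " keeps VALID=" + isPlannedIn(Level.VALID, planned)
                + " STRICT=" + isPlannedIn(Level.STRICT, planned));
        }
    }
}
```
</pre>

<p>The decision is made once, while planning, so an assertion that is planned out never runs against the stream at all.</p>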

<p>The next feature is Trap Taps. They are similar to sinks and sources, except instead of being bound to the head or tail of a given assembly, they are bound to pipes within an assembly. If an operation invoked by a given Pipe instance (Each or Every) fails, the incoming Tuple will be saved to the named trap Tap.</p>

<p>This allows systems to continue running with no data loss if bad data leaks into the stream causing an operation to fail. This is extremely useful for low fidelity processes like web crawling and indexing. If a page just can't be parsed, it can be saved for later and the job continues its work without it.</p>
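
<p>The idea behind a trap can be sketched in plain Java. This is not Cascading code: <code>TrapSketch</code>, <code>sumParsable</code>, and the <code>StringBuilder</code> standing in for the trap Tap are hypothetical, chosen only to show a failing record being set aside while the rest of the stream is processed.</p>

<pre>
```java
// Standalone sketch; a StringBuilder stands in for the trap Tap.
final class TrapSketch {
    // Parses each record as an integer and sums the results. A record that
    // fails to parse is written to the trap instead of aborting the run.
    static int sumParsable(String[] records, StringBuilder trap) {
        int sum = 0;
        for (String record : records) {
            try {
                sum += Integer.parseInt(record.trim());
            } catch (NumberFormatException e) {
                // Keep the offending input so it can be inspected later.
                trap.append(record).append('\n');
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        StringBuilder trap = new StringBuilder();
        int sum = sumParsable(new String[] {"1", "oops", "3"}, trap);
        System.out.println("sum=" + sum + " trapped=" + trap.toString().trim());
    }
}
```
</pre>

<p>Here the unparsable record ends up in the trap, and the two good records are still summed.</p>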

<p>Please note this release has only been tested with Hadoop 0.16.x. </p>]]>
    </content>
</entry>

<entry>
    <title>Cascading.groovy 0.1.0 Released</title>
    <link rel="alternate" type="text/html" href="http://www.cascading.org/2008/05/cascadinggroovy-010-released.html" />
    <id>tag:www.cascading.org,2008://2.402</id>

    <published>2008-05-05T22:34:48Z</published>
    <updated>2008-05-05T22:34:48Z</updated>

    <summary>We are pleased to announce the 0.1.0 release of Cascading.groovy, our Groovy language interpreter extension. With Cascading.groovy, Hadoop applications can be scripted by both advanced and casual Hadoop users without thinking in MapReduce. Read our Groovy Scripting Overview for more...</summary>
    <author>
        <name></name>
        <uri>http://chris.wensel.net/</uri>
    </author>
    
        <category term="News" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en" xml:base="http://www.cascading.org/">
        <![CDATA[<p>We are pleased to announce the 0.1.0 release of Cascading.groovy, our Groovy language interpreter extension. With Cascading.groovy, Hadoop applications can be scripted by both advanced and casual Hadoop users without thinking in MapReduce. Read our <a href="http://www.cascading.org/documentation/groovy.html">Groovy Scripting Overview</a> for more details.</p>]]>
        <![CDATA[<p>As this is our first release, we consider it a usable alpha.</p>

<p>The underlying core, Cascading, is very stable and feature rich. But the Groovy builder will still likely undergo various changes as we get more feedback from the community.</p>

<p>Our expertise is not in writing <a href="http://en.wikipedia.org/wiki/Domain_Specific_Language">DSLs</a>, but nevertheless we have one now, and we would love feedback regarding the syntax and features.</p>]]>
    </content>
</entry>

<entry>
    <title>Cascading 0.5.0 Released</title>
    <link rel="alternate" type="text/html" href="http://www.cascading.org/2008/05/cascading-050-released.html" />
    <id>tag:www.cascading.org,2008://2.400</id>

    <published>2008-05-05T19:55:26Z</published>
    <updated>2008-05-05T19:55:26Z</updated>

    <summary>Version 0.5.0 of Cascading is now available for download. For details on new features and bug fixes, see the CHANGES.txt file. For a quick summary, read on....</summary>
    <author>
        <name></name>
        <uri>http://chris.wensel.net/</uri>
    </author>
    
        <category term="News" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en" xml:base="http://www.cascading.org/">
        <![CDATA[<p>Version 0.5.0 of Cascading is now available for <a href="http://code.google.com/p/cascading/downloads/list">download</a>. For details on new features and bug fixes, see the <a href="http://code.google.com/p/cascading/source/browse/tags/cascading-0.5.0/CHANGES.txt">CHANGES.txt</a> file. For a quick summary, read on.</p>]]>
        <![CDATA[<p>By far, the biggest change is support for sorting via the GroupBy operator. By default the 'groupFields' fields are sorted, but to sort fields that are not being grouped on, set the 'sortFields' argument. This will allow the values of every grouping key to be sorted before being handed to an Aggregator function. </p>

<p>Unfortunately, all the fields must be sorted in the same direction, ascending or descending. You cannot have fields sort in differing orders. This is a limitation in Hadoop, and we will submit a patch when possible.</p>
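
<p>To make the effect concrete, here is a standalone Java sketch that does not use the Cascading or Hadoop APIs. The class and method names are hypothetical, and each record is simply a two-element array of a grouping field and a sort field. Both fields sort ascending, in line with the single-direction limitation.</p>

<pre>
```java
import java.util.Arrays;
import java.util.Comparator;

// Standalone sketch; it does not use the Cascading or Hadoop APIs.
final class GroupSortSketch {
    // Each record is {groupField, sortField}. Ordering by the group field
    // first and the sort field second means the values of every grouping
    // key are already in order when they reach an aggregator.
    static String[][] sortForGrouping(String[][] records) {
        String[][] sorted = records.clone();
        Arrays.sort(sorted, Comparator
            .comparing((String[] r) -> r[0])
            .thenComparing(r -> r[1]));
        return sorted;
    }

    public static void main(String[] args) {
        String[][] records = {{"b", "2"}, {"a", "9"}, {"b", "1"}, {"a", "3"}};
        System.out.println(Arrays.deepToString(sortForGrouping(records)));
    }
}
```
</pre>

<p>After sorting, all records for group "a" come first, with their values in order, followed by the records for group "b".</p>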

<p>Two new operations have been added, RegexSplitGenerator and ExpressionFilter. The first allows a regex pattern to split a Tuple value into multiple Tuple instances, rather than into new fields in a single Tuple instance as RegexSplit does. The second allows simple Java expressions to be used as a filter, in the same manner as ExpressionFunction.</p>
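
<p>The difference between the two split styles can be shown with plain arrays. This sketch does not call the Cascading operations themselves: <code>splitIntoFields</code> and <code>splitIntoTuples</code> are hypothetical helpers, with a <code>String[]</code> standing in for one tuple and a <code>String[][]</code> for a sequence of tuples.</p>

<pre>
```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Standalone sketch; arrays stand in for Tuple instances.
final class SplitSketch {
    // RegexSplit style: one input value becomes ONE tuple with several fields.
    static String[] splitIntoFields(String value, Pattern pattern) {
        return pattern.split(value);
    }

    // RegexSplitGenerator style: one input value becomes SEVERAL tuples,
    // each carrying a single field.
    static String[][] splitIntoTuples(String value, Pattern pattern) {
        return Arrays.stream(pattern.split(value))
            .map(part -> new String[] {part})
            .toArray(String[][]::new);
    }

    public static void main(String[] args) {
        Pattern comma = Pattern.compile(",");
        System.out.println(Arrays.toString(splitIntoFields("a,b,c", comma)));
        System.out.println(Arrays.deepToString(splitIntoTuples("a,b,c", comma)));
    }
}
```
</pre>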

<p>Flows can now be skipped if the sink resource exists (stale or not stale). This is useful if the source 'modified' date is unreliable and it might be too costly to re-retrieve and re-process the source data to re-create the result data in the sink. This is especially true when fetching large data from a remote site over HTTP.</p>

<p>Lastly, there were a few improvements to HttpFileSystem, specifically a fix for query string parameters being munged. Now a source to a Flow can be a dynamically generated page with a query string. The only limitation is that the CONTENT-LENGTH header must be returned with a valid value. Sadly this isn't true for many web services.</p>

<p>This release also cleans up a few corner cases in the API that were unearthed during the development of our Groovy based scripting extension for Cascading and Hadoop. Our first release should be Real Soon Now.</p>]]>
    </content>
</entry>

<entry>
    <title>Cascading 0.4.0 Released</title>
    <link rel="alternate" type="text/html" href="http://www.cascading.org/2008/04/cascading-040.html" />
    <id>tag:mt.cascading.org,2008://2.389</id>

    <published>2008-04-02T19:42:15Z</published>
    <updated>2008-04-05T20:15:44Z</updated>

    <summary>Version 0.4.0 of Cascading is now available for download. See below for a review of the major changes. For more details, see the changes.txt file....</summary>
    <author>
        <name></name>
        <uri>http://chris.wensel.net/</uri>
    </author>
    
        <category term="News" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en" xml:base="http://www.cascading.org/">
        <![CDATA[<p>Version 0.4.0 of <a href="http://www.cascading.org/">Cascading</a> is now available for <a href="http://code.google.com/p/cascading/downloads/list">download</a>. See below for a review of the major changes. For more details, see the <a href="http://cascading.googlecode.com/svn/tags/cascading-0.4.0/CHANGES.TXT">changes.txt</a> file.</p>]]>
        <![CDATA[<p>Foremost, most changes fall under the heading of performance improvements. I can't offer scientifically valid metrics, but let's say my projects are running noticeably faster. The following enhancements account for the difference.</p>

<p>One, we now skip the reducer if there is no <a href="http://www.cascading.org/javadoc/cascading/pipe/Group.html">Group</a> in the assembly. This has been on the list for a couple releases, but it is there now and works great.</p>

<p>Two, when writing out 'key' and 'value' tuples to the reducer from the mapper, the 'key' values are removed from the 'value' tuple as they are redundant. This reduces the bandwidth on the 'copy' phase, and improves the 'sort'. At the reducer, the 'key' and 'value' tuples are merged back. This was also on the list for a bit, but we needed to do a bit of refactoring internally before it could be done reliably. </p>
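
<p>A rough standalone Java sketch of that round trip follows. It is not the Cascading implementation: the helper names are hypothetical, a <code>String[]</code> stands in for a tuple, and the sketch assumes, purely for simplicity, that the key fields are the leading fields of the tuple.</p>

<pre>
```java
import java.util.Arrays;

// Standalone sketch; assumes the key fields lead the tuple.
final class KeyStripSketch {
    // Mapper side: the key is the first keyWidth fields of the tuple.
    static String[] keyOf(String[] tuple, int keyWidth) {
        return Arrays.copyOfRange(tuple, 0, keyWidth);
    }

    // Mapper side: the value keeps only the fields that are not in the key,
    // so the key data is not copied to the reducer twice.
    static String[] stripKey(String[] tuple, int keyWidth) {
        return Arrays.copyOfRange(tuple, keyWidth, tuple.length);
    }

    // Reducer side: key and value are joined back into the full tuple.
    static String[] merge(String[] key, String[] value) {
        String[] merged = Arrays.copyOf(key, key.length + value.length);
        System.arraycopy(value, 0, merged, key.length, value.length);
        return merged;
    }

    public static void main(String[] args) {
        String[] tuple = {"user42", "2008-04-02", "GET /index.html"};
        String[] key = keyOf(tuple, 1);
        String[] value = stripKey(tuple, 1);
        System.out.println(Arrays.toString(value)
            + " merged back: " + Arrays.toString(merge(key, value)));
    }
}
```
</pre>

<p>The value sent across the wire is one field shorter, and the merge restores the original tuple.</p>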

<p>Three, the tuple stream pipeline has been optimized. Every 'collect' immediately passes through to the next operator. That is, every tuple in the stream is committed to the output stream before the next is handled. This has mostly been the case for some time, but we finally refactored out the final inefficiencies that were in place to support some edge cases.</p>

<p>Four, <a href="http://www.cascading.org/javadoc/cascading/cascade/Cascade.html">Cascades</a> now execute <a href="http://www.cascading.org/javadoc/cascading/flow/Flow.html">Flows</a> in parallel if there are no dependencies. Flows have run Hadoop jobs in parallel for some time, but now this behavior is shared at the Cascade abstraction.</p>

<p>Also there are some new features.</p>

<p>The <a href="http://www.cascading.org/javadoc/cascading/tap/hadoop/S3HttpFileSystem.html">S3HttpFileSystem</a> is now read-write. This filesystem is used to access 'normal' files on S3, unlike the Hadoop S3FileSystem.</p>

<p>Flows now support <a href="http://www.cascading.org/javadoc/cascading/flow/FlowListener.html">FlowListeners</a> that can be notified of various events during a Flow life-cycle. We use it to post messages to SQS when a flow completes.</p>

<p><a href="http://www.cascading.org/javadoc/cascading/tap/Tap.html">Taps</a> can now specify that they should be written to directly, bypassing the 'default' Hadoop collector (in the map or reduce phase). This is useful if you need to write data to a special file type or location and don't want to write your own Hadoop FileSystem class. This is also a workaround for a <a href="https://issues.apache.org/jira/browse/HADOOP-3021">bug in Hadoop preventing custom FileSystems from being used</a> if loaded from user-space libraries. A side effect is that user code can write out tuples directly via a Tap instance (great for tests or scripts).</p>

<p>Any <a href="http://www.cascading.org/javadoc/cascading/tap/Hfs.html">Hfs</a> <a href="http://www.cascading.org/javadoc/cascading/tap/Tap.html">Tap</a> instance referencing a file path starting with file:// (<a href="http://www.cascading.org/javadoc/cascading/tap/Lfs.html">Lfs</a> does this by default) will force the current Hadoop job to run in 'local' mode. If this job ran on a cluster, the local file would not be visible to the remote job task, so it must run locally. Note that only the one Hadoop job runs locally, not the whole Flow or Cascade. This allows developers to write applications that can load HDFS from local files and then spawn clustered jobs. These local loading Flows can also incorporate filters and operations that clean and format the data on load. If the file passed to the script isn't a file:// but http:// or s3tp://, the job will be run in clustered mode.</p>

<p>Finally there are a few incompatible changes. The major change was that the Cascade class was moved to the cascading.cascade package. The other API changes are less likely to show up in user code.</p>

<p>We are very happy with this release, and we trust you will be too. </p>]]>
    </content>
</entry>

<entry>
    <title>Cascading 0.3.0 Released</title>
    <link rel="alternate" type="text/html" href="http://www.cascading.org/2008/03/cascading-030.html" />
    <id>tag:mt.cascading.org,2008://2.388</id>

    <published>2008-03-05T00:39:58Z</published>
    <updated>2008-04-05T20:15:16Z</updated>

    <summary>Cascading 0.3.0 has just been packaged and is available for download from our downloads page. It incorporates many great changes, read on for more....</summary>
    <author>
        <name></name>
        <uri>http://chris.wensel.net/</uri>
    </author>
    
        <category term="News" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en" xml:base="http://www.cascading.org/">
        <![CDATA[<p><a href="http://www.cascading.org/">Cascading</a> 0.3.0 has just been packaged and is available for download from our <a href="http://code.google.com/p/cascading/downloads/list">downloads page</a>. It incorporates many great changes, read on for more.</p>]]>
        <![CDATA[<p>The biggest additions are read-only support for <a href="http://www.cascading.org/javadoc/cascading/tap/hadoop/HttpFileSystem.html">HTTP</a> and <a href="http://www.cascading.org/javadoc/cascading/tap/hadoop/S3HttpFileSystem.html">S3</a>. This support was pushed down into Hadoop, so any Hfs Tap instance can include remote resources with http(s):// or s3tp:// urls. The s3tp url is similar to the s3:// url, where the authority part of the url includes an AWS account key and secret.</p>

<p>There is also now experimental support for zip files. They aren't recommended, but if you must read zip files from a remote source, and they are line oriented, a <a href="http://www.cascading.org/javadoc/cascading/tap/Tap.html">Tap</a> using a <a href="http://www.cascading.org/javadoc/cascading/scheme/TextLine.html">TextLine</a> scheme will automatically try to unzip the stream. If there is more than one file in the zip, it will iterate over each entry serially. We may push this down into a Hadoop codec in a future release.</p>

<p>There are a number of new handy operations like <a href="http://www.cascading.org/javadoc/cascading/operation/text/FieldFormatter.html">FieldFormatter</a>, <a href="http://www.cascading.org/javadoc/cascading/operation/aggregator/Last.html">Last</a>, <a href="http://www.cascading.org/javadoc/cascading/operation/Insert.html">Insert</a>, <a href="http://www.cascading.org/javadoc/cascading/operation/Debug.html">Debug</a>, and <a href="http://www.cascading.org/javadoc/cascading/operation/text/DateFormatter.html">DateFormatter</a>.</p>

<p>Finally there are a few API changes, but most users won't notice them.</p>

<p>See the CHANGES.TXT file for a comprehensive list of bugs fixed and features added.</p>

<p>Btw, we are successfully using <a href="http://www.cascading.org/">Cascading</a>, and by extension Hadoop, on largish clusters (10-50 nodes) in Amazon EC2. We currently have <a href="http://www.cascading.org/javadoc/cascading/Cascade.html">Cascades</a> that execute a dozen or so <a href="http://www.cascading.org/javadoc/cascading/flow/Flow.html">Flows</a>, and subsequently, double digit numbers of unique MapReduce jobs. And it only took a few days of actual coding time to assemble everything with unit tests, without a single hand-written Mapper or Reducer.</p>]]>
    </content>
</entry>

<entry>
    <title>Cascading 0.2.0 Released</title>
    <link rel="alternate" type="text/html" href="http://www.cascading.org/2008/02/cascading-020.html" />
    <id>tag:mt.cascading.org,2008://2.387</id>

    <published>2008-02-06T17:33:39Z</published>
    <updated>2008-04-05T20:14:31Z</updated>

    <summary>Just uploaded the 0.2.0 release of Cascading. You can download it from here....</summary>
    <author>
        <name></name>
        <uri>http://chris.wensel.net/</uri>
    </author>
    
        <category term="News" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en" xml:base="http://www.cascading.org/">
        <![CDATA[<p>Just uploaded the 0.2.0 release of <a href="http://www.cascading.org/">Cascading</a>. You can download it from <a href="http://code.google.com/p/cascading/downloads/list">here</a>.</p>]]>
        <![CDATA[<p>The most significant change is a "spillable" list added to CoGroup that allows it to operate on co-groupings of any size.</p>

<p>Note there are no limitations with normal GroupBy calls as they stream directly through the stack. CoGrouping must accumulate all the groups before emitting them through some join policy (inner, outer, etc).</p>

<p>Also wanted to point out we have had 91 downloads since Jan 21. This is great news for such a young project.</p>

<p>Enjoy!</p>]]>
    </content>
</entry>

<entry>
    <title>Cascading 0.1.0 Released</title>
    <link rel="alternate" type="text/html" href="http://www.cascading.org/2008/01/cascading-010.html" />
    <id>tag:mt.cascading.org,2008://2.386</id>

    <published>2008-01-22T01:42:16Z</published>
    <updated>2008-04-05T20:13:46Z</updated>

    <summary>A little note to let everyone know Cascading is now available for download and includes the full source. Please visit our project site for more information....</summary>
    <author>
        <name></name>
        <uri>http://chris.wensel.net/</uri>
    </author>
    
        <category term="News" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en" xml:base="http://www.cascading.org/">
        <![CDATA[<p>A little note to let everyone know <a href="http://www.cascading.org/">Cascading</a> is now available for download and includes the full source. Please visit our project site for more information.</p>]]>
        <![CDATA[<p>By no means is this release feature complete or to be considered final. There is still much work to do, but we believe it to be stable and useful, if under-documented.</p>]]>
    </content>
</entry>

</feed>
