Hadoop and the MapReduce paradigm make it easy to write parallel data-processing programs. However, many applications require a number of MapReduce jobs that join, clean, aggregate, and analyze large volumes of data. Such a set of connected jobs forms a pipeline. Writing and managing these pipelines can be tricky and a major impediment to developer productivity.
I will be showcasing Apache Crunch, a Java library that aims to make MapReduce pipelines easy to write, test, and run. Crunch lets developers build MapReduce pipelines in plain Java, without low-level MapReduce constructs such as Mapper/Reducer classes or Writables, and without imposing a separate type system the way Pig and Hive do. For programmers this means less wrestling with MapReduce, Pig, and Hive concepts, and more focus on solving their actual problems. Crunch offers a higher level of flexibility than the other Apache-licensed MapReduce tools currently available.
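To make this concrete, here is a minimal sketch of the classic word-count pipeline written with Crunch's public API; the class name and input/output paths are illustrative, and running it requires Hadoop and the Crunch library on the classpath:

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class WordCount {
  public static void main(String[] args) {
    // A pipeline backed by MapReduce; Crunch plans the actual jobs.
    Pipeline pipeline = new MRPipeline(WordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    // Split each line into words -- a plain Java function,
    // no Mapper class or Writable handling required.
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());

    // count() performs the grouping and summing that would otherwise
    // need a hand-written Reducer.
    PTable<String, Long> counts = words.count();
    pipeline.writeTextFile(counts, args[1]);
    pipeline.done();
  }
}
```

Note how the developer writes ordinary Java logic while Crunch decides how many MapReduce jobs to run and how to wire them together.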
As an active contributor to the project, I know Crunch's core concepts, its type system, and its pipeline architecture quite well, and I will explain them to the audience. I will also demonstrate how to create pipelines in Crunch using basic operations such as joins and aggregations. Finally, because Crunch is highly extensible, I will showcase how easy it is to write and build a library of reusable custom functions for your pipelines.
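As a small illustration of that extensibility, here is a sketch of a reusable custom function built on Crunch's FilterFn base class; the class name is hypothetical:

```java
import org.apache.crunch.FilterFn;

// Hypothetical reusable cleaning function: drops blank or null lines.
// Once written, it can be shared across every pipeline in a codebase.
public class NonEmptyLines extends FilterFn<String> {
  @Override
  public boolean accept(String line) {
    return line != null && !line.trim().isEmpty();
  }
}
```

A pipeline would apply it with `lines.filter(new NonEmptyLines())`, and the resulting collection composes with joins and aggregations like any built-in operation.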