Crunch: MapReduce Pipelines made easy

Hadoop's MapReduce paradigm makes it easy to write parallel data-processing code. However, many applications require a number of MapReduce jobs that join, clean, aggregate, and analyze large volumes of data. Such a set of connected jobs forms a pipeline. Programming and managing such pipelines can be tricky and a major impediment to developer productivity.

I will be showcasing Apache Crunch, a Java library that aims to make writing, testing, and running MapReduce pipelines easy. Crunch lets you build MapReduce pipelines in plain Java, without writing low-level MapReduce constructs such as Mapper/Reducer classes or custom Writables. Unlike Pig and Hive, it does not impose a type system of its own. For programmers this means less wrestling with MapReduce, Pig, and Hive concepts, and more focus on solving the actual problem. Crunch offers a higher degree of flexibility than the other MapReduce tools currently available under the Apache license.
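To illustrate the point, here is a minimal word-count sketch using Crunch's in-memory pipeline (`MemPipeline`), which runs the same user code that an `MRPipeline` would execute on a Hadoop cluster. The class name and sample data are illustrative; the user code is a plain `DoFn` plus built-in `PType`s, with no Mapper/Reducer classes or hand-rolled Writables:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.impl.mem.MemPipeline;
import org.apache.crunch.types.writable.Writables;

public class WordCountSketch {

    // Split lines into words and count them -- no Mapper/Reducer classes
    // and no custom Writables in user code.
    static Map<String, Long> countWords(String... lines) {
        // In-memory input; a real job would use pipeline.readTextFile(...).
        PCollection<String> input = MemPipeline.typedCollectionOf(
                Writables.strings(), lines);

        // A plain DoFn replaces the Mapper: one input line, many output words.
        PCollection<String> words = input.parallelDo(new DoFn<String, String>() {
            @Override
            public void process(String line, Emitter<String> emitter) {
                for (String word : line.split("\\s+")) {
                    emitter.emit(word);
                }
            }
        }, Writables.strings());

        // count() plans the grouping/aggregation step for us.
        PTable<String, Long> counts = words.count();

        Map<String, Long> result = new HashMap<String, Long>();
        for (Pair<String, Long> p : counts.materialize()) {
            result.put(p.first(), p.second());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(countWords("hello crunch", "hello hadoop"));
    }
}
```

Swapping `MemPipeline` for an `MRPipeline` is all it takes to move the same logic from a unit test onto a cluster, which is what makes Crunch pipelines easy to test.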

As an active contributor to the project, I know the concepts behind Crunch, its type system, and its pipelined architecture quite well, and I will explain them to the audience. I will also demonstrate how to create pipelines in Crunch using basic operations such as joins and aggregations. Since Crunch is quite extensible, I will showcase how easy it is to write and build a library of reusable custom functions for our pipelines.
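As a taste of those basic operations, here is a hedged sketch of an inner join using Crunch's `Join` library, again on an in-memory pipeline. The table names and sample records are made up for illustration; Crunch plans the underlying shuffle from the shared key:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.impl.mem.MemPipeline;
import org.apache.crunch.lib.Join;
import org.apache.crunch.types.writable.Writables;

public class JoinSketch {

    // Inner-join two keyed datasets on their shared String key.
    static Map<String, Pair<String, Long>> joinUsersAndClicks() {
        // Hypothetical sample data: user id -> name, user id -> click count.
        PTable<String, String> users = MemPipeline.typedTableOf(
                Writables.tableOf(Writables.strings(), Writables.strings()),
                "u1", "Alice", "u2", "Bob");
        PTable<String, Long> clicks = MemPipeline.typedTableOf(
                Writables.tableOf(Writables.strings(), Writables.longs()),
                "u1", 3L, "u2", 5L);

        // Join.join performs an inner join, pairing the two values per key.
        PTable<String, Pair<String, Long>> joined = Join.join(users, clicks);

        Map<String, Pair<String, Long>> result =
                new HashMap<String, Pair<String, Long>>();
        for (Pair<String, Pair<String, Long>> p : joined.materialize()) {
            result.put(p.first(), p.second());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(joinUsersAndClicks());
    }
}
```

A reusable custom function for a pipeline library would typically be packaged the same way: a `DoFn` or a small static method over `PCollection`/`PTable` that can be dropped into any pipeline.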


Rahul Sharma is a Senior Consultant at Xebia IT Architects. He has 7 years of experience in the software industry and has worked on several projects using Java/J2EE as the primary technology. He is inclined towards open source technologies and likes to explore new frameworks. He has been an active contributor to Crunch.

