doug cutting | Barton's Blog

Hadoop World, a belated summary

February 13, 2012

With O’Reilly’s big data conference Strata coming up in just a couple of weeks, I thought I might as well finally get around to writing up my notes from Hadoop World. The event, which was put on by Cloudera, was held last November 8-9 in New York City. There were over 1,400 attendees from 580 companies and 27 countries, with two-thirds of the audience being technical.

Growing beyond geek fest

The event itself has picked up significant momentum over the last three years, going from 500 attendees the first year, to 900 the second, to over 1,400 this past year. The tone has shifted from geek fest to an event that also focuses on business problems; for example, one of the keynotes was given by Larry Feinsmith, managing director of the office of the CIO at JP Morgan Chase. Besides Dell, other large companies such as HP, Oracle and Cisco also participated.

As a platinum sponsor, Dell had both a booth and a technical presentation. At the event we announced that we would be open sourcing the Crowbar barclamp for Hadoop, and at our booth we showed off the Dell | Hadoop Big Data Solution, which is based on Cloudera Enterprise.

Cutting’s observations

Doug Cutting, the father of Hadoop, Cloudera employee and chairman of the Apache Software Foundation, gave a much-anticipated keynote. Here are some of the key things I caught:

- Still young: While Cutting felt that Hadoop had made tremendous progress, he saw it as still young, with lots of missing parts and niches to be filled.
- Bigtop: He talked about the Apache “Bigtop” project, which is an open source program to pull together the various pieces of the Hadoop ecosystem. He explained that Bigtop is intended to serve as the basis for the Cloudera Distribution of Hadoop (CDH), much the same way Fedora is the basis for RHEL (Red Hat Enterprise Linux).
- “Hadoop” as “Linux”: Cutting also talked about how Hadoop has become the kernel of the distributed OS for big data. He explained that, much the same way that “Linux” is technically only the kernel of the GNU/Linux operating system, people are using the word Hadoop to mean the entire Hadoop ecosystem, including utilities.

Interviews from the event

To get more of the flavor of the event here is a series of interviews I conducted at the show, plus one where I got the camera turned on me:

Hadoop World: What Dell is up to with Big Data, Open Source and Developers

Extra-credit reading

Cloudera Teams With O’Reilly Media to Merge Hadoop World and Strata Conferences

Blogs regarding Dell’s crowbar announcement

Hadoop Glossary

Hadoop ecosystem
- Hadoop: An open source platform, developed at Yahoo, that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. It is particularly suited to large volumes of unstructured data such as Facebook comments and Twitter tweets, email and instant messages, and security and application logs.
- MapReduce: A software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. Hadoop acts as a platform for executing MapReduce. MapReduce came out of Google.
- HDFS: The Hadoop Distributed File System splits large files into smaller data blocks that are replicated and distributed across a cluster of commodity hardware, so that application workloads can be processed faster and in parallel.
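
To make the MapReduce programming model a little more concrete, here is a minimal single-machine sketch of the classic word-count example in plain Python. It does not use Hadoop or any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration. On a real cluster the framework would run the map and reduce steps in parallel across many machines and handle the grouping step itself.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key (done by the framework on a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all the values seen for one key into a total."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

The point of the split is that each map call only looks at one document and each reduce call only looks at one key, which is what lets the work be spread across a cluster.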

Major Hadoop utilities:
- HBase: The Hadoop database that supports structured data storage for large tables. It provides real-time read/write access to your big data.
- Hive: A data warehousing solution built on top of Hadoop. An Apache project.
- Pig: A platform for analyzing large data sets that leverages parallel computation. An Apache project.
- ZooKeeper: Allows Hadoop administrators to track and coordinate distributed applications. An Apache project.
- Oozie: A workflow engine for Hadoop.
- Flume: A service designed to collect data and put it into your Hadoop environment.
- Whirr: A set of libraries for running cloud services. It’s ideal for running temporary Hadoop clusters to carry out a proof of concept, or to run a few one-time jobs.
- Sqoop: A tool designed to transfer data between Hadoop and relational databases. An Apache project.
- Hue: A browser-based desktop interface for interacting with Hadoop.