Hadoop World, a belated summary

February 13, 2012

With O’Reilly’s big data conference Strata coming up in just a couple of weeks, I figured it was high time I finally wrote up my notes from Hadoop World. The event, which was put on by Cloudera, was held last November 8-9 in New York City. It drew over 1,400 attendees from 580 companies and 27 countries, and two-thirds of the audience was technical.

Growing beyond geek fest

The event itself has picked up significant momentum over the last three years, growing from 500 attendees the first year, to 900 the second, to over 1,400 this past year.  The tone has also shifted from pure geek-fest to an event equally focused on business problems; one of the keynotes, for example, was delivered by Larry Feinsmith, managing director of the office of the CIO at JP Morgan Chase.  Besides Dell, other large companies like HP, Oracle and Cisco also participated.

As a platinum sponsor, Dell had both a booth and a technical presentation.  At the event we announced that we would be open sourcing the Crowbar barclamp for Hadoop, and at our booth we showed off the Dell | Hadoop Big Data Solution, which is based on Cloudera Enterprise.

Cutting’s observations

Doug Cutting, the father of Hadoop, Cloudera employee and chairman of the Apache Software Foundation, gave a much anticipated keynote.  Here are some of the key things I caught:

  • Still young: While Cutting felt that Hadoop had made tremendous progress, he saw it as still young, with lots of missing parts and niches to be filled.
  • Bigtop: He talked about the Apache “Bigtop” project, an open source effort to pull together the various pieces of the Hadoop ecosystem.  He explained that Bigtop is intended to serve as the basis for Cloudera’s Distribution including Apache Hadoop (CDH), much the same way Fedora is the basis for RHEL (Red Hat Enterprise Linux).
  • “Hadoop” as “Linux”: Cutting also talked about how Hadoop has become the kernel of the distributed OS for big data.  He explained that, much the same way “Linux” technically refers only to the kernel of the GNU/Linux operating system, people are using the word Hadoop to mean the entire Hadoop ecosystem, utilities included.

Interviews from the event

To get more of the flavor of the event, here is a series of interviews I conducted at the show, plus one where I got the camera turned on me:

Extra-credit reading

Blogs regarding Dell’s Crowbar announcement

Hadoop Glossary

  • Hadoop ecosystem
    • Hadoop: An open source platform, originally developed at Yahoo, that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.  It is particularly suited to large volumes of unstructured data such as Facebook comments and Twitter tweets, email and instant messages, and security and application logs.
    • MapReduce: A software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner.  Hadoop acts as a platform for executing MapReduce.  MapReduce originated at Google (see the word-count sketch after this glossary).
    • HDFS: The Hadoop Distributed File System allows large application workloads to be broken into smaller data blocks that are replicated and distributed across a cluster of commodity hardware for faster processing.
  • Major Hadoop utilities:
    • HBase: The Hadoop database that supports structured data storage for large tables.  It provides real-time read/write access to your big data.
    • Hive: A data warehousing solution built on top of Hadoop.  An Apache project.
    • Pig: A platform for analyzing large data sets that leverages parallel computation.  An Apache project.
    • ZooKeeper: Allows Hadoop administrators to track and coordinate distributed applications.  An Apache project.
    • Oozie: A workflow engine for Hadoop.
    • Flume: A service designed to collect data and put it into your Hadoop environment.
    • Whirr: A set of libraries for running cloud services.  It’s ideal for running temporary Hadoop clusters to carry out a proof of concept, or to run a few one-time jobs.
    • Sqoop: A tool designed to transfer data between Hadoop and relational databases.  An Apache project.
    • Hue: A browser-based desktop interface for interacting with Hadoop.
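
To make the MapReduce entry above a bit more concrete, here is a minimal sketch of the canonical word-count job written against Hadoop’s Java MapReduce API: the mapper emits a (word, 1) pair for every word it sees, and the reducer sums the counts per word. The class and path names are illustrative, and it assumes the Hadoop client libraries (e.g. from CDH) are on your classpath.

```java
// WordCount.java -- a minimal, illustrative MapReduce job.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: for each word in an input line, emit (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: sum the 1s emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregate on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged as a jar, it would be launched with something like hadoop jar wordcount.jar WordCount /user/me/input /user/me/output, with HDFS taking care of distributing the blocks the mappers read.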

How to create a Basic or Advanced Crowbar build for Hadoop

November 29, 2011

As I mentioned in my previous entry, the code for the Hadoop barclamps is now available at our GitHub repo.

To help you through the process, Crowbar lead architect Rob Hirschfeld has put together the two videos below.  The first, Crowbar Build (on cloud server), shows you how to use a cloud server to create a Crowbar ISO using the standard build process.  The second, Advanced Crowbar Build (local), shows how to build a Crowbar v1.2 ISO using advanced techniques on a local desktop in a virtual machine.

Crowbar Build (on cloud server)

Advanced Crowbar Build (local)

Pau for now…


Open source Crowbar code now available for Hadoop

November 29, 2011

Earlier this month we announced that Dell would be open sourcing the Crowbar “barclamps” for Hadoop.  Well, today is the day: the code is now available at our GitHub repo.

What’s a Crowbar barclamp?

If you haven’t heard of project Crowbar, it’s a software framework developed at Dell that started out as an installation tool for OpenStack.  As the project grew beyond installation to include monitoring capabilities, network discovery, performance data gathering, etc., the developers behind it, Rob Hirschfeld and Greg Althaus, decided to rewrite it to allow modules to plug into the basic Crowbar functionality.  These modules, or “barclamps,” allow the framework to be used by a variety of projects.  Besides the OpenStack and Hadoop barclamps written by Dell, VMware created a Cloud Foundry barclamp and DreamHost created a Ceph barclamp.
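
To picture what that modularity buys you, here is a tiny, purely illustrative sketch of the plug-in pattern the barclamp model embodies.  Crowbar itself is built with Ruby and Chef, so none of the names below are real Crowbar APIs; the Java is just a compact way to show a framework core delegating deployment steps to registered modules.

```java
// Illustrative sketch only -- not actual Crowbar code (Crowbar is Ruby/Chef based).
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical contract that every "barclamp" (plug-in module) implements.
interface Barclamp {
  String name();
  void apply(List<String> nodes); // deploy/configure this component on the nodes
}

// A made-up Hadoop barclamp; in real life this work is done by Chef recipes.
class HadoopBarclamp implements Barclamp {
  public String name() { return "hadoop"; }
  public void apply(List<String> nodes) {
    for (String node : nodes) {
      System.out.println("configuring " + node + " via the " + name() + " barclamp");
    }
  }
}

// The framework core: barclamps register with it, and it drives them in order.
public class CrowbarSketch {
  private final List<Barclamp> barclamps = new ArrayList<Barclamp>();

  void register(Barclamp b) { barclamps.add(b); }

  void deploy(List<String> nodes) {
    for (Barclamp b : barclamps) {
      b.apply(nodes);
    }
  }

  public static void main(String[] args) {
    CrowbarSketch core = new CrowbarSketch();
    core.register(new HadoopBarclamp());
    core.deploy(Arrays.asList("node1", "node2", "node3"));
  }
}
```

The point is simply that the core stays generic while each barclamp encapsulates everything specific to one component, which is why OpenStack, Hadoop, Cloud Foundry and Ceph can all ride on the same framework.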

To help you get your bearings

As I mentioned in the opening paragraph, the code for the Hadoop barclamps is now available.  To help you get started, below are a couple of videos that Rob put together.  The first walks you through how to install Crowbar, and the second explains how to use Crowbar to deploy Hadoop.

Extra-credit reading

Pau for now…


Dell to open source software to ease Hadoop install & management

November 8, 2011

You might be surprised to learn that Dell is developing software.  To say that this is an area we haven’t been known for in the past would be an understatement.  While we may not pose a direct threat to Microsoft any time soon, we have been coding in a few focused areas.  One of those areas, cloud installation and management, is represented by our project Crowbar.  While Crowbar began life simply as a way to install OpenStack on Dell hardware, it has expanded from there.

Today’s news is that we have developed and will be open sourcing “barclamps” (modules that sit on top of Crowbar) for Cloudera CDH/Enterprise, ZooKeeper, Pig, HBase, Flume and Sqoop.  All these modules will speed and ease the deployment, configuration and operation of Hadoop clusters.  But don’t take my word for it.  Take a listen to Crowbar’s architect Rob Hirschfeld as he explains Crowbar and today’s announcement:

Look for the code in the Crowbar GitHub repo by the last week of November.  If you want to get involved, learn how.

Extra-credit reading:

Pau for now…

