Hadoop World, a belated summary

February 13, 2012

With O’Reilly’s big data conference Strata coming up in just a couple of weeks, I thought I might as well get around to finally writing up my notes from Hadoop World. The event, which was put on by Cloudera, was held last November 8-9 in New York City. There were over 1,400 attendees from 580 companies and 27 countries, with two-thirds of the audience being technical.

Growing beyond geek fest

The event itself has picked up significant momentum over the last three years, going from 500 attendees the first year, to 900 the second, to over 1,400 this past year. The tone has gone from geek fest to an event that also focuses on business problems; for example, one of the keynotes was given by Larry Feinsmith, managing director of the office of the CIO at JP Morgan Chase. Besides Dell, other large companies like HP, Oracle and Cisco also participated.

As a platinum sponsor, Dell had both a booth and a technical presentation. At the event we announced that we would be open sourcing the Crowbar barclamp for Hadoop, and at our booth we showed off the Dell | Hadoop Big Data Solution, which is based on Cloudera Enterprise.

Cutting’s observations

Doug Cutting, the father of Hadoop, Cloudera employee and chairman of the Apache Software Foundation, gave a much-anticipated keynote. Here are some of the key things I caught:

  • Still young: While Cutting felt that Hadoop had made tremendous progress he saw it as still young with lots of missing parts and niches to be filled.
  • Bigtop: He talked about the Apache “Bigtop” project, which is an open source program to pull together the various pieces of the Hadoop ecosystem. He explained that Bigtop is intended to serve as the basis for the Cloudera Distribution of Hadoop (CDH), much the same way Fedora is the basis for RHEL (Red Hat Enterprise Linux).
  • “Hadoop” as “Linux”: Cutting also talked about how Hadoop has become the kernel of the distributed OS for big data. He explained that, much the same way that “Linux” is technically only the kernel of the GNU/Linux operating system, people are using the word Hadoop to mean the entire Hadoop ecosystem, including utilities.

Interviews from the event

To get more of the flavor of the event here is a series of interviews I conducted at the show, plus one where I got the camera turned on me:

Extra-credit reading

Blogs regarding Dell’s Crowbar announcement

Hadoop Glossary

  • Hadoop ecosystem
    • Hadoop: An open source platform, developed at Yahoo, that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. It is particularly suited to large volumes of unstructured data such as Facebook comments and Twitter tweets, email and instant messages, and security and application logs.
    • MapReduce: A software framework for easily writing applications which process vast amounts of data (multi-terabyte data sets) in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. Hadoop acts as a platform for executing MapReduce. MapReduce came out of Google.
    • HDFS: The Hadoop Distributed File System splits large files into smaller data blocks that are replicated and distributed across a cluster of commodity hardware, so that application workloads can be processed in parallel, close to the data.
  • Major Hadoop utilities:
    • HBase: The Hadoop database that supports structured data storage for large tables. It provides real-time read/write access to your big data.
    • Hive: A data warehousing solution built on top of Hadoop. An Apache project.
    • Pig: A platform for analyzing large data that leverages parallel computation. An Apache project.
    • ZooKeeper: Allows Hadoop administrators to track and coordinate distributed applications. An Apache project.
    • Oozie: A workflow engine for Hadoop.
    • Flume: A service designed to collect data and put it into your Hadoop environment.
    • Whirr: A set of libraries for running cloud services. It’s ideal for running temporary Hadoop clusters to carry out a proof of concept, or to run a few one-time jobs.
    • Sqoop: A tool designed to transfer data between Hadoop and relational databases. An Apache project.
    • Hue: a browser-based desktop interface for interacting with Hadoop
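
To make the MapReduce entry above a bit more concrete, here is a minimal sketch of the programming model in plain Python. It is not Hadoop's API and it runs on a single machine; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration. The idea is that you write only the map and reduce steps, and the framework handles the grouping in between and spreads the work across the cluster.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key (the framework's job in a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle and reduce in sequence on an in-memory list of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical log lines standing in for the application logs mentioned above.
    logs = ["error disk full", "warning disk slow", "error network down"]
    print(word_count(logs))
```

In a real Hadoop job the records would be read from HDFS blocks, and many map and reduce tasks would run in parallel on different machines; only the shape of the computation is the same.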

Hadoop World: What Dell is up to with Big Data, Open Source and Developers

December 18, 2011

Besides interviewing a bunch of people at Hadoop World, I also got a chance to sit on the other side of the camera. On the first day of the conference I got a slot on SiliconANGLE’s the Cube and was interviewed by Dave Vellante, co-founder of Wikibon, and John Furrier, founder of SiliconANGLE.

-> Check out the video here.

Some of the ground we cover

  • How Dell got into the cloud/scale-out arena and how that led us to Big Data
  • (2:08) The details behind the Dell|Cloudera solution for Apache Hadoop and our “secret sauce,” project Crowbar
  • (4:00) Dell’s involvement in and affinity for open source software
  • (5:31) Dell’s interest in and strategy around courting developers
  • (7:35) Dell’s strategy of Make, Partner or Buy in the cloud space
  • (11:10) How real is OpenStack and how is it evolving.

Extra-credit reading

Pau for now…

Hadoop World: Talking to Splunk’s Co-founder

December 4, 2011

Last but not least in the 10 interviews I conducted while at Hadoop World is my talk with Splunk’s CTO and co-founder Erik Swan. If you’re not familiar with Splunk, think of it as a search engine for machine data, allowing you to monitor and analyze what goes on in your systems. To learn more, listen to what Erik has to say:

Some of the ground Erik covers:

  • What is Splunk and what do they do?
  • (1:43) The announcement they made at Hadoop World about integrating with Hadoop and what that means.
  • (4:25) How Erik and Rob Das got the idea to get involved in the wacky world of machine data and to create Splunk.

Extra-credit reading

Pau for now…

Hadoop World: NoSQL database MongoDB

November 28, 2011

I’m getting near the end of the interviews that I did while at Hadoop World earlier this month, just one more after this (with Splunk’s CTO and co-founder).

Today’s entry features a talk I had with Nosh Petigara, director of product strategy at 10gen, the company behind MongoDB.

Some of the ground that Nosh covers

  • Who is 10gen and what is MongoDB
  • (0:29) How does Nosh define NoSQL
  • (1:20) What use cases is Mongo best at
  • (2:14) Some examples of customers using Mongo (foursquare, Disney and MTV) and what they’re using it for
  • (3:08) How Mongo and Hadoop work together
  • (4:03) What’s in Mongo’s future that Nosh is excited about

Extra-credit reading

  • Mongo Conference: MongoSV (Dec 9 in Silicon Valley)

Pau for now…

Hadoop World: Ubuntu, Hadoop and Juju

November 14, 2011

I’m always interested in what’s happening at Canonical and with Ubuntu.  Last week at Hadoop World I ran into a couple of folks from the company (coincidentally both named Mark but neither Mr. Shuttleworth).  Mark Mims from the server team was willing to chat so I grabbed some time with him to learn about what he was doing at Hadoop World and what in the heck is this “charming” Juju?

Some of the ground Mark covers

  • Making the next version of Ubuntu server better for Hadoop and big data
  • (0:34) What are “charms” and what do they have to do with service orchestration
  • (2:05) Charm school and learning to write Juju charms
  • (2:54)  Where does “Orchestra” fit in and how can it be used to spin up OpenStack
  • (3:40) What’s next for Juju

But wait, there’s more!

Stay tuned for more interviews from last week’s Hadoop World. On tap are:

  • Todd Papaioannou from Battery Ventures
  • John Gray of Facebook
  • Erik Swan of Splunk
  • Nosh Petigara of 10gen/MongoDB.

Extra-credit reading

Pau for now…

Hadoop World: Karmasphere and big data intelligence

November 14, 2011

One thing Hadoop isn’t great at right out of the box is data analytics; that’s where a company like Karmasphere comes in. Karmasphere provides business intelligence software that data analysts can use to mine the data that Hadoop sucks up.

Last week at Hadoop World I grabbed some time with Karmasphere’s Chairman and co-founder, Martin Hall, to learn more about where he and his company play in the wild world of big data.

Some of the ground Martin covers

  • Where does Karmasphere play in the big data stack, how is it used and by whom
  • (0:38) Where did the idea for developing Karmasphere come from
  • (1:58) What is the Karmasphere “secret sauce”
  • (2:18) What are the main industries and use cases where their offerings are used
  • (3:40) What can we look forward to in future releases

But wait, there’s more!

Stay tuned for more interviews from last week’s Hadoop World. On tap are: Mark Mims of Canonical, Todd Papaioannou from Battery Ventures, John Gray of Facebook, Erik Swan of Splunk and Nosh Petigara of 10gen/MongoDB.

Extra-credit reading

Pau for now…
