Next week, Michael Cote, a whole bunch of other Dell folks, and I will be heading out to Portland for the 15th annual OSCON-a-palooza. We will have two talks that you might want to check out:
And speaking of Project Sputnik, we will be giving away three of our XPS 13 developer editions: one as a door prize at the OpenStack birthday party, one in a drawing at our booth, and one at James and Joseph’s talk listed above.
We will also have a limited number of the shirts shown to the right, so stop by the booth.
But wait, there’s more….
To learn firsthand about Dell’s open source solutions, be sure to swing by booth #719, where we will have experts on hand to talk to you about our wide array of solutions:
OpenStack cloud solutions
Hadoop big data solutions
Crowbar
Project Sputnik (the client to cloud developer platform)
Dell Multi-Cloud Manager (the platform formerly known as “Enstratius”)
Back in September I posted an entry about the Modular Data Center that we set up in the Dell parking lot. Here is a time lapse video showing the MDC and the location being built out.
The MDC allows customers to test solutions at scale. It is running OpenStack and various Big Data goodies such as Hadoop, HBase, Cassandra, MongoDB, Gluster, etc.
Customers can tap into the MDC from Dell’s solution centers around the world and run proofs of concept as well as competitive bake-offs between various big data technologies so they can determine which might best suit their environment and use case.
Why use valuable internal real estate when you can just set up a Modular Data Center (MDC) in your parking lot? The point wasn’t lost on the Dell Solution Center team who, with help from our partner Intel, is doing just that here in Round Rock.
The new MDC, which should be online in a few weeks, will host Dell’s OpenStack-Powered Cloud and Apache Hadoop solutions for customers to test drive and build POCs in Dell Solution Centers around the world.
Here’s the MDC being lowered into place yesterday.
Here are some pics I snapped this morning when I went down to get my coffee.
At our sales kickoff in Vegas, Rob Hirschfeld chose a unique vehicle to succinctly convey our Big Data story here at Dell. Check out the video below to hear one of our chief software architects for our Big Data and OpenStack solutions explain, in less than 90 seconds, what we are up to in the space and the value it brings to customers.
Here is part two of three of the Web glossary I compiled. As I mentioned in my last two entries, in compiling this I pulled information from various and sundry sources across the Web, including Wikipedia, community and company web sites, and the brain of Cote.
Enjoy
General terms
Structured data: Data that can be organized in a structure, e.g. rows or columns, so that it is identifiable. The most common form of structured data is a relational database such as MySQL or Access.
Unstructured data: Data that has no identifiable structure. Unstructured data typically includes bitmap images/objects, text and other data types that are not part of a database. Most enterprise data today can actually be considered unstructured; an email, for example, is unstructured data.
Big Data: Data characterized by one or more of the following characteristics: Volume – A large amount of data, growing at rapid rates; Velocity – The speed at which the data must be processed and a decision made; Variety – The range of data types and structures in the data
Relational Database Management Systems (RDBMS): These databases are the incumbents in enterprises today and store data in rows and columns. They are created and queried using a special computer language, Structured Query Language (SQL), which is the standard for database interoperability. Examples: IBM DB2, MySQL, Microsoft SQL Server, PostgreSQL, Oracle RDBMS, Informix, Oracle Rdb, etc.
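To make the rows-and-columns model concrete, here is a minimal sketch using Python’s built-in sqlite3 module (a lightweight RDBMS, not one of the heavyweights listed above); the table and data are made up, but essentially the same SQL would run, with minor dialect differences, on any of them:

```python
import sqlite3

# An in-memory relational database: data lives in typed rows and columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Alice", "Austin"))
conn.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Bob", "Boston"))

# SQL is the (mostly) portable way to query it, whatever the vendor.
for (name,) in conn.execute("SELECT name FROM customers WHERE city = ?", ("Austin",)):
    print(name)  # -> Alice
```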
NoSQL: refers to a class of databases that 1) are intended to perform at internet (Facebook, Twitter, LinkedIn) scale and 2) reject the relational model in favor of other (key-value, document, graph) models. They often achieve performance by having far fewer features than SQL databases and focus on a subset of use cases. Examples: Cassandra, Hadoop, MongoDB, Riak
Recommendation engine: A recommendation engine takes a collection of frequent itemsets as input and generates a recommendation set for a user by matching the current user’s activity against the discovered patterns, e.g. people who bought X often also bought Y. The recommendation engine is an online process, so its efficiency and scalability are key.
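To make the matching step concrete, here is a toy Python sketch. It assumes the frequent itemsets (the “people who bought X also bought Y” patterns) have already been mined offline; the itemsets and item names are purely illustrative:

```python
# Toy recommendation step: match the current user's activity against
# previously discovered frequent-itemset patterns.
frequent_itemsets = [            # assumed output of an offline mining job
    ({"X"}, {"Y"}),              # people who bought X often also bought Y
    ({"X", "Y"}, {"Z"}),         # people who bought X and Y often also bought Z
]

def recommend(current_items):
    recommendations = set()
    for antecedent, consequent in frequent_itemsets:
        if antecedent <= current_items:              # the user's activity matches this pattern
            recommendations |= consequent - current_items
    return recommendations

print(recommend({"X"}))       # -> {'Y'}
print(recommend({"X", "Y"}))  # -> {'Z'}
```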
Geo-spatial targeting: the practice of mapping advertising, offers and information based on geo location.
Behavioral targeting: a technique used by online publishers and advertisers to increase the effectiveness of their campaigns. Behavioral targeting uses information collected on an individual’s web-browsing behavior, such as the pages they have visited or the searches they have made, to select which advertisements to display to that individual.
Clickstream analysis: On a Web site, clickstream analysis is the process of collecting, analyzing, and reporting aggregate data about which pages visitors visit in what order – which are the result of the succession of mouse clicks each visitor makes (that is, the clickstream). There are two levels of clickstream analysis, traffic analysis and e-commerce analysis.
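Here is a small, self-contained Python sketch of the traffic-analysis side: counting which page tends to follow which across visitors’ clickstreams. The log layout and page names are made up for illustration:

```python
from collections import Counter

# Assumed input: each visitor's pages, in the order they were clicked.
clickstreams = {
    "visitor1": ["/home", "/products", "/cart", "/checkout"],
    "visitor2": ["/home", "/products", "/home"],
}

# Aggregate page-to-page transitions -- the heart of clickstream traffic analysis.
transitions = Counter()
for pages in clickstreams.values():
    for here, there in zip(pages, pages[1:]):
        transitions[(here, there)] += 1

for (here, there), n in transitions.most_common():
    print(f"{here} -> {there}: {n}")
```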
Projects/Entities
Gluster: a software company acquired by Red Hat that provides an open source platform for scale-out Public and Private Cloud Storage.
Relational Databases
MySQL: the most popular open source RDBMS. It represents the “M” in the LAMP stack. It is now owned by Oracle.
Drizzle: A version of MySQL that is specifically targeted at the cloud. It is currently an open source project without a commercial entity behind it.
Percona: A MySQL support and consulting company that also supports Drizzle.
PostgreSQL: aka Postgres, an object-relational database management system (ORDBMS) available for many platforms including Linux, FreeBSD, Solaris, Windows and Mac OS X.
Oracle DB: not used so much in new WebTech companies, but still a major database in the development world.
SQL Server: Microsoft’s RDBMS.
NoSQL Databases
MongoDB: an open source, high-performance database written in C++. Many Linux distros include a MongoDB package, including CentOS, Fedora, Debian, Ubuntu and Gentoo. Prominent users include Disney Interactive Media Group, The New York Times, foursquare, bit.ly and Etsy. 10gen is the commercial backer of MongoDB.
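For a taste of what working with MongoDB looks like, here is a minimal sketch using the pymongo driver; the server address, database, collection and field names are all placeholders:

```python
from pymongo import MongoClient  # pip install pymongo

# Connect to a local mongod instance (address and names are placeholders).
client = MongoClient("localhost", 27017)
db = client["demo"]

# Documents are schemaless, JSON-like dicts -- no table definition required.
db.checkins.insert_one({"user": "alice", "venue": "coffee shop", "points": 5})

# Query by example.
for doc in db.checkins.find({"user": "alice"}):
    print(doc["venue"], doc["points"])
```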
Riak: a NoSQL database/datastore written in Erlang from the company Basho. Originally used for Akamai’s content delivery network.
Couchbase: formed from the merger of CouchOne and Membase. It offers Couchbase server powered by Apache CouchDB and is available in both Enterprise and Community editions. The author of CouchDB was a prominent Lotus Notes architect.
Cassandra: A scalable NoSQL database with no single point of failure. A high-scale key/value database originating at Facebook, where it was built to handle their message inboxes. Backed by DataStax, which came out of Rackspace.
Mahout: A scalable machine learning and data mining library. An analytics engine for doing machine learning (e.g., recommendation engines and scenarios where you want to infer relationships).
Hadoop ecosystem
Hadoop: An open source platform, developed at Yahoo, that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. It is particularly suited to large volumes of unstructured data such as Facebook comments and Twitter tweets, email and instant messages, and security and application logs.
MapReduce: a software framework for easily writing applications which process vast amounts of data (multi-terabyte data sets) in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. Hadoop acts as a platform for executing MapReduce. MapReduce came out of Google.
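The canonical MapReduce example is word count. Below is a sketch of it as a pair of Hadoop Streaming-style Python scripts; on a real cluster you would submit them through Hadoop’s streaming jar, but you can simulate the data flow locally with: cat input.txt | python mapper.py | sort | python reducer.py

```python
#!/usr/bin/env python
# mapper.py -- the "map" half: emit (word, 1) for every word seen on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")
```

```python
#!/usr/bin/env python
# reducer.py -- the "reduce" half: sum the counts for each word.
# Hadoop sorts the mapper output by key, so all lines for a word arrive together.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip("\n").split("\t")
    if word != current_word and current_word is not None:
        print(current_word + "\t" + str(count))
        count = 0
    current_word = word
    count += int(n)
if current_word is not None:
    print(current_word + "\t" + str(count))
```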
HDFS: Hadoop’s Distributed File system allows large application workloads to be broken into smaller data blocks that are replicated and distributed across a cluster of commodity hardware for faster processing.
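To see the core idea, here is a toy, pure-Python illustration of HDFS-style block splitting and replication; the block size, node names and round-robin placement are made up for the example (real HDFS uses much larger blocks and rack-aware placement):

```python
# Toy illustration of HDFS: split data into fixed-size blocks and
# replicate each block across several nodes in the cluster.
BLOCK_SIZE = 16                  # toy value; real HDFS blocks are tens of MB
NODES = ["node1", "node2", "node3", "node4"]
REPLICATION = 3                  # HDFS's default replication factor

def place_blocks(data: bytes):
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    placement = {}
    for idx, block in enumerate(blocks):
        # Round-robin placement: each block lands on REPLICATION distinct nodes.
        replicas = [NODES[(idx + r) % len(NODES)] for r in range(REPLICATION)]
        placement[idx] = (block, replicas)
    return placement

for idx, (block, replicas) in place_blocks(b"a large application workload").items():
    print(f"block {idx}: {block!r} -> {replicas}")
```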
Major Hadoop utilities:
HBase: The Hadoop database that supports structured data storage for large tables. It provides real-time read/write access to your big data.
Hive: A data warehousing solution built on top of Hadoop. An Apache project
Pig: A platform for analyzing large data that leverages parallel computation. An Apache project
ZooKeeper: Allows Hadoop administrators to track and coordinate distributed applications. An Apache project
Oozie: a workflow engine for Hadoop
Flume: a service designed to collect data and put it into your Hadoop environment
Whirr: a set of libraries for running cloud services. It’s ideal for running temporary Hadoop clusters to carry out a proof of concept, or to run a few one-time jobs.
Sqoop: a tool designed to transfer data between Hadoop and relational databases. An Apache project
Hue: a browser-based desktop interface for interacting with Hadoop
Cloudera: a company that provides a Hadoop distribution similar to the way Red Hat provides a Linux distribution. Dell is using Cloudera’s distribution of Hadoop for its Hadoop solution.
Solr: an open source enterprise search platform from the Apache Lucene project. Backed by the commercial company Lucid Imagination.
Elastic Search: an open source, distributed, search engine built on top of Lucene (raw search middleware).
Besides interviewing a bunch of people at Hadoop World, I also got a chance to sit on the other side of the camera. On the first day of the conference I got a slot on SiliconANGLE’s the Cube and was interviewed by Dave Vellante, co-founder of Wikibon, and John Furrier, founder of SiliconANGLE.
As I mentioned in my previous entry, the code for the Hadoop barclamps is now available at our GitHub repo.
To help you through the process, Crowbar lead architect Rob Hirschfeld has put together the two videos below. The first, Crowbar Build (on cloud server), shows you how to use a cloud server to create a Crowbar ISO using the standard build process. The second, Advanced Crowbar Build (local) shows how to build a Crowbar v1.2 ISO using advanced techniques on a local desktop using a virtual machine.
Earlier this month we announced that Dell would be open sourcing the Crowbar “barclamps” for Hadoop. Well, today is the day and the code is now available at our GitHub repo.
What’s a Crowbar barclamp?
If you haven’t heard of project Crowbar, it’s a software framework developed at Dell that started out as an installation tool for OpenStack. As the project grew beyond installation to include monitoring capabilities, network discovery, performance data gathering, etc., the developers behind it, Rob Hirschfeld and Greg Althaus, decided to rewrite it to allow modules to plug into the basic Crowbar functionality. These modules, or “barclamps,” allow the framework to be used by a variety of projects. Besides the OpenStack and Hadoop barclamps written by Dell, VMware created a Cloud Foundry barclamp and DreamHost created a Ceph barclamp.
To help you get your bearings
As I mentioned in the opening paragraph, the code for the Hadoop barclamp is now available. To help you get started, below are a couple of videos that Rob put together. The first walks you through how to install Crowbar and the second one explains how to use Crowbar to deploy Hadoop.
The next in my series of video interviews from Hadoop World is with Mark Azad, who covers technical solutions for Couchbase. If you’re not familiar with Couchbase, it’s a NoSQL database provider; the company was formed when, earlier this year, CouchOne and Membase merged.
Here’s what Mark had to say.
Some of the ground Mark covers
What is Couchbase and what is NoSQL
How Couchbase works with Hadoop
What its product line-up looks like and the new combined offering coming next year
Some of Couchbase’s customers and how Zynga uses them
What excites Mark the most about the upcoming year in Big Data
Yesterday, Hadoop World 2011 wrapped here in New York. During the event I was able to catch up with a bunch of folks representing a wide variety of members of the ecosystem. On the first day I caught up with Ed Dumbill of O’Reilly Media who writes about big data for O’Reilly Radar and also is the GM for O’Reilly’s big data conference, Strata.
Here’s what Ed had to say.
Some of the ground Ed covers
What is Strata and what does it cover
How will this year’s conference differ from last year’s
Which customer types are making the best use of Hadoop, and will Strata verticalize going forward
What is Ed looking forward to most in the upcoming Strata.
In the previous entry I mentioned that we have developed and will be opensourcing “barclamps” (modules that sit on top of Crowbar) for Cloudera CDH/Enterprise, ZooKeeper, Pig, HBase, Flume and Sqoop. All these modules will speed and ease the deployment, configuration and operation of Hadoop clusters.
If you would like to get involved, check out this one-minute video from Rob Hirschfeld talking about how:
You might well be surprised to learn that Dell is developing software. To say that this is an area we haven’t been known for in the past would be an understatement. While we may not pose a direct threat to Microsoft any time soon, we have been coding in a few focused areas. One of those areas, cloud installation and management, is represented by our project Crowbar. While Crowbar began life simply as a way to install OpenStack on Dell hardware, it has expanded from there.
Today’s news is that we have developed and will be opensourcing “barclamps” (modules that sit on top of Crowbar) for Cloudera CDH/Enterprise, ZooKeeper, Pig, HBase, Flume and Sqoop. All these modules will speed and ease the deployment, configuration and operation of Hadoop clusters. But don’t take my word for it. Take a listen to Crowbar’s architect Rob Hirschfeld as he explains Crowbar and today’s announcement:
Rob Hirschfeld, aka “Commander Crowbar,” recently posted a blog entry looking back at how Crowbar came to be, how it’s grown and where he hopes it will go from here.
What’s a Crowbar?
If you’re not familiar with Crowbar, it’s an open source software framework that began life as a tool to speed the installation of OpenStack on Dell hardware. The project incorporates the Opscode Chef Server tool and was originally created here at Dell by Rob and Greg Althaus. Just four short months ago at OSCON 2011 the project took a big step forward when, along with the announcement of our OpenStack solution, we announced that we were opensourcing it.
DevOps-ilicious
As Rob points out in his blog, as we were delivering Crowbar as an installer, a collective light bulb went off and we realized the role that Chef and tools like it play in a larger movement taking place in many Web shops today: the DevOps movement.
The DevOps approach to deployment builds up systems in a layered model rather than using packaged images…Crowbar’s use of a DevOps layered deployment model provides flexibility for BOTH modularized and integrated cloud deployments.
On beyond installation and OpenStack
As the team began working more with Crowbar, it occurred to them that its use could be expanded in two ways: it could be used to do more than installation and it could be expanded to work with projects beyond OpenStack.
As for functionality, Crowbar now does more than install and configure: once the initial deployment is complete, it can be used to maintain, expand, and architect the instance, including BIOS configuration, network discovery, status monitoring, performance data gathering, and alerting.
The first project beyond OpenStack that we used Crowbar on was Hadoop. In order to expand Crowbar’s usage we created the concept of “barclamps” which are in essence modules that sit on top of the basic Crowbar functionality. After we created the Hadoop barclamp, others picked up the charge and VMware created a Cloud Foundry barclamp and DreamHost created a Ceph barclamp.
It takes a community
Crowbar development has recently been moved out into the open. As Rob explains,
Big Data represents the next not-completely-understood got-to-have strategy. This first dawned on me about a year ago and has continued to become clearer as the phenomenon has gained momentum. Contributing to Big Data-mania is Hadoop, today’s weapon of choice in the taming and harnessing of mountains of unstructured data, a project that has its own immense gravitational pull of celebrity.
So what
But what is the value of slogging through these mountains of data? In a recent Forrester blog, Brian Hopkins lays it out very simply:
We estimate that firms effectively utilize less than 5% of available data. Why so little? The rest is simply too expensive to deal with. Big data is new because it lets firms affordably dip into that other 95%. If two companies use data with the same effectiveness but one can handle 15% of available data and one is stuck at 5%, who do you think will win?
The only problem is that while unstructured data (email, clickstream data, photos, web logs, etc.) makes up the vast majority of today’s data, the majority of the incumbent data solutions aren’t designed to handle it. So what do you do?
Deal with it
Hadoop, which I mentioned above, is your first line of offense when attacking big data. Hadoop is an open source, highly scalable compute and storage platform. It can be used to collect, tidy up and store boatloads of structured and unstructured data. In the case of enterprises it can be combined with a data warehouse and then linked to analytics (web companies typically forgo the warehouse).
And speaking of web companies, Hopkins explains:
Google, Yahoo, and Facebook used big data to deal with web scale search, content relevance, and social connections, and we see what happened to those markets. If you are not thinking about how to leverage big data to get the value from the other 95%, your competition is.
So will Big Data truly displace Cloud as the current must-have buzz-tastic phenomenon in IT? I’m thinking in many circles it will. While less of a tectonic shift, Big Data’s more “modest” goals and concrete application make it easier to draw a direct line between effort and business return. This in turn will drive greater interest, tire kicking and then implementation. But I wouldn’t kick the tires for too long for as the web players have learned, Big Data is a mountain of straw just waiting to be spun into gold.
Dell has been working for the last four-plus years outfitting the biggest of the big web superstars like Facebook and Microsoft Azure with infrastructure. More recently we have been layering software such as Hadoop, OpenStack and Crowbar on top of that infrastructure. This has not gone unnoticed by web pub GigaOm:
Want to become the next Amazon Web Services or Facebook? Dell could have sold you the hardware all along, but now it has the software to make those servers and storage systems really hum.
They also made the following observation:
Because [Dell] doesn’t have a legacy [software] business to defend, it can blaze a completely new trail that has its trailhead where Oracle, IBM and HP leave off.
Letting customers focus on what matters most
It’s a pretty exciting time to be at Dell as we continue to move up the stack outfitting web players big and small. The idea is to get these players established and growing in an agile and elastic way so they can concentrate on serving customers rather than building out their underpinning software and systems.
A few weeks ago we announced that Dell, with a little help from Cloudera, was delivering a complete Apache Hadoop solution. Well, as of last week it’s now officially available!
As a refresher:
The solution is comprised of Cloudera’s distribution of Hadoop, running on optimized Dell PowerEdge C2100 servers with the Dell PowerConnect 6248 switch, delivered with joint service and support from both companies. You can buy it either pre-integrated and good-to-go, or you can take the DIY route and set it up yourself with the help of
Dell’s chief architect for big data, Aurelian Dumitru (aka A.D.), presented a talk at OSCON the week before last with the heady title “Hadoop – Enterprise Data Warehouse Data Flow Analysis and Optimization.” The session, which was well attended, explored the integration between Hadoop and the Enterprise Data Warehouse. A.D. posted a fairly detailed overview of his session on his blog but if you want a great high-level summary, check this out:
Some of the ground AD covers
Mapping out the data life cycle: Generate -> Capture -> Store -> Analyze -> Present
Where does Hadoop play and where does the data warehouse? Where do they overlap?
Data continues to grow at an exponential rate, and nowhere is this more obvious than in the Web space. Not only is the amount of data exploding but so is the variety of forms it takes, whether transactional, documents, IT/OT, images, audio, text, video, etc. Additionally, much of this new data is unstructured/semi-structured, which traditional relational databases were not built to deal with.
Enter Hadoop, an Apache open source project which, when combined with MapReduce, allows the analysis of entire data sets, rather than sample sizes, of structured and unstructured data types. Hadoop lets you chomp through mountains of data faster and get to insights that drive business advantage quicker. It can provide near real-time data analytics for click-stream data, location data, logs, rich data, marketing analytics, image processing, social media association, text processing, etc. More specifically, Hadoop is particularly suited for applications such as:
Search Quality — search attempts vs. structured data analysis; pattern recognition
Recommendation engine — batch processing; filtering and prediction (i.e. use information to predict what similar users like)
Ad-targeting – batch processing; linear scalability
Threat analysis for spam fighting and detecting click fraud — batch processing of huge datasets; pattern recognition
Data “sandbox” – “dump” all data in Hadoop; batch processing (i.e. analysis, filtering, aggregations, etc.); pattern recognition
The Dell | Cloudera solution
Although Hadoop is a very powerful tool, it can be a bit daunting to implement and use. This fact wasn’t lost on the founders of Cloudera, who set up the company to make Hadoop easier to use by packaging it and offering support. Dell has joined with this Hadoop pioneer to provide the industry’s first complete Hadoop solution (aptly named “the Dell | Cloudera solution for Apache Hadoop”).
The solution is comprised of Cloudera’s distribution of Hadoop, running on optimized Dell PowerEdge C2100 servers with the Dell PowerConnect 6248 switch, delivered with joint service and support. Dell offers two flavors of this big data solution: one based on Cloudera’s free-to-download distribution of Hadoop, and one based on Cloudera’s paid enterprise version.
It comes with its own “crowbar” and DIY option
The Dell | Cloudera solution for Apache Hadoop also comes with Crowbar, the recently open-sourced Dell-developed software, which provides the necessary tools and automation to manage the complete lifecycle of Hadoop environments. Crowbar manages the Hadoop deployment from the initial server boot to the configuration of the main Hadoop components allowing users to complete bare metal deployment of multi-node Hadoop environments in a matter of hours, as opposed to days. Once the initial deployment is complete, Crowbar can be used to maintain, expand, and architect a complete data analytics solution, including BIOS configuration, network discovery, status monitoring, performance data gathering, and alerting.
The solution also comes with a reference architecture and deployment guide, so you can assemble it yourself, or Dell can build and deploy the solution for you, including rack and stack, delivery and implementation.
I saw a great talk today here at OSCON Data up in Portland, Oregon. The talk was Practical Data Storage: MongoDB @ foursquare and was given by foursquare‘s head of server engineering, Harry Heymann. The talk was particularly impressive since, due to AV issues, Harry had to wing it and go slideless. (He did post his slides to Twitter so folks with access could follow along.)
After the talk I grabbed a few minutes with Harry and did the following interview:
Some of the ground Harry covers
What is foursquare and how it feeds your data back to you
Here is the final entry in my interview series from the Hadoop Summit.
The night before the summit, I was impressed when I heard Ken Krugler speak at the BigDataCamp unconference. Turns out Ken has been a part of the Hadoop scene since before there was a Hadoop: his 2005 start-up Krugle utilized Nutch, which split and evolved into Hadoop. He now runs a Hadoop consulting practice, Bixo Labs, and offers training.
I ran into Ken the next day at the summit and sat down with him to get his thoughts on Hadoop and the ecosystem around it.
Some of the ground Ken covers
How he first began using Hadoop many moons ago
(0:53) How Hadoop has crossed the chasm over the last half decade
(1:53) The classes he teaches, one very technical and the other an intro class
(2:23) What the heck is Hadoop anyway?
(3:30) What trends Ken has seen recently in the Hadoop world (the rise of the fat node)