Today Dell is announcing our new PowerEdge C8000 shared infrastructure chassis which allows you to mix and match compute, GPU/coprocessor and storage sleds all within the same enclosure. What this allows Web companies to do is to have one common building block that can be used to provide support across the front-, mid- and back end tiers that make up a web company’s architecture.
To give you a better feel for the C8000 check out the three videos below.
Why — Product walk thru: Product manager for the C8000, Armando Acosta takes you through the system and explains how this chassis and the accompanying sleds better server our Web customers.
Evolving — How we got here: Drew Schulke, marketing director for Dell Data Center solutions explains the evolution of our shared infrastructure systems and what led us to develop the C8000.
Super Computing — Customer Example: Dr. Dan Stanzione, deputy director at the Texas Advanced Computing Center talks about the Stampede supercomputer and the role the C8000 plays.
Here is part two of three of the Web glossary I complied. As I mentioned in my last two entries, in compiling this I pulled information from various and sundry sources across the Web including wikipedia, community and company web sites and the brain of Cote.
Structured data: Data that can be organized in a structure e.g. rows or columns so that it is identifiable. The most universal form of structured data is a database like SQL or Access.
Unstructured data: Data that has no identifiable structure. Unstructured data typically includes bitmap images/objects, text and other data types that are not part of a database. Most enterprise data today can actually be considered unstructured. An email is considered unstructured data.
Big Data: Data characterized by one or more of the following characteristics: Volume – A large amount of data, growing at large rates; Velocity – The speed at which the data must be processed and a decision made; Variety – The range of data, types and structure to the data
Relational Databases (RDBMS) Management Systems: These databases are the incumbents in enterprises today and store data in rows and columns. They are created using a special computer language, structured query language (SQL), that is the standard for database interoperability. Examples: IBM DB2, MySQL, Microsoft SQL Server, PostgreSQL, Oracle RDBMS, Informix, Oracle Rdb, etc.
NoSQL: refers to a class of databases that 1) are intended to perform at internet (Facebook, Twitter, LinkedIn) scale and 2) reject the relational model in favor of other (key-value, document, graph) models. They often achieve performance by having far fewer features than SQL databases and focus on a subset of use cases. Examples: Cassandra, Hadoop, MongoDB, Riak
Recommendation engine: A recommendation engine takes a collection of frequent itemsets as input and generates a recommendation set for a user by matching the current user’s activity against the discovered patterns. The recommendation engine is on-line process, therefore its efficiency and scalability are key, e.g. people who bought X often also bought Y.
Geo-spatial targeting: the practice of mapping advertising, offers and information based on geo location.
Behavioral targeting: a technique used by online publishers and advertisers to increase the effectiveness of their campaigns. Behavioral targeting uses information collected on an individual’s web-browsing behavior, such as the pages they have visited or the searches they have made, to select which advertisements to display to that individual.
Clickstream analysis: On a Web site, clickstream analysis is the process of collecting, analyzing, and reporting aggregate data about which pages visitors visit in what order – which are the result of the succession of mouse clicks each visitor makes (that is, the clickstream). There are two levels of clickstream analysis, traffic analysis and e-commerce analysis.
Gluster: a software company acquired by Red Hat that provides an open source platform for scale-out Public and Private Cloud Storage.
MySQL: the most popular open source RDBMS. It represents the “M” in the LAMP stack. It is now owned by Oracle.
Drizzle: A version of MySQL that is specifically targeted the cloud. It is currently an open source project without a commercial entity behind it.
Percona: A MySQL support and consulting company that also supports Drizzle.
PostgreSQL: aka Postgres is is an object-relational database management system (ORDBMS) available for many platforms including Linux, FreeBSD, Solaris, Windows and Mac OS X.
Oracle DB – not used so much in new WebTech companies, but still a major database in the development world.
SQL Server – Microsoft’ s RDBMS
MongoDB: an open source, high-performance, database written in C++. Many Linux distros include a MongoDB package, including CentOS, Fedora, Debian, Ubuntu and Gentoo. Prominent users include Disney interactive media group, New York Times, foursquare, bit.ly, Etsy. 10gen is the commercial backer of MongoDB.
Riak: a NoSQL database/datastore written in Erlang from the company Basho. Originally used for the Content Delivery Network Akamai.
Couchbase: formed from the merger of CouchOne and Membase. It offers Couchbase server powered by Apache CouchDB and is available in both Enterprise and Community editions. The author of CouchDB was a prominent Lotus Notes architect.
Cassandra: A scalable NoSQL database with no single points of failure. A high-scale, key/value database originating from Facebook to handle their message inboxes. Backed by DataStax, which came out of Rackspace.
Mahout: A Scalable machine learning and data mining library. An analytics engine for doing machine learning (e.g., recommendation engines and scenarios where you want to infer relationships).
Hadoop: An open source platform, developed at Yahoo that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. It is particularly suited to large volumes of unstructured data such as Facebook comments and Twitter tweets, email and instant messages, and security and application logs.
MapReduce: a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. Hadoop acts as a platform for executing MapReduce. MapReduce came out of Google
HDFS: Hadoop’s Distributed File system allows large application workloads to be broken into smaller data blocks that are replicated and distributed across a cluster of commodity hardware for faster processing.
Major Hadoop utilities:
HBase: The Hadoop database that supports structured data storage for large tables. It provides real time read/write access to your big data.
Hive: A data warehousing solution built on top of Hadoop. An Apache project
Pig: A platform for analyzing large data that leverages parallel computation. An Apache project
ZooKeeper: Allows Hadoop administrators to track and coordinate distributed applications. An Apache project
Oozie: a workflow engine for Hadoop
Flume: a service designed to collect data and put it into your Hadoop environment
Whirr: a set of libraries for running cloud services. It’s ideal for running temporary Hadoop clusters to carry out a proof of concept, or to run a few one-time jobs.
Sqoop: a tool designed to transfer data between Hadoop and relational databases. An Apache project
Hue: a browser-based desktop interface for interacting with Hadoop
Cloudera: a company that provides a Hadoop distribution similar to the way Red Hat provides a Linux distribution. Dell is using Cloudera’s distribution of Hadoop for its Hadoop solution.
Solr: an open source enterprise search platform from the Apache Lucene project. Backed by the commercial company Lucid Imagination.
Elastic Search: an open source, distributed, search engine built on top of Lucene (raw search middleware).
As I mentioned in my last post, one of the ways we are helping our teams get a better understanding of the wild and wacky world of the Web and Web developers is via a glossary we’ve created. In compiling this I pulled information from various and sundry sources across the Web including wikipedia, community and company web sites and the brain of Cote.
Over the next several entries I will be posting the glossary. Feel free to bookmark it, delete it, offer corrections, comments or additions.
Today I present to you, the Application tier.
Application framework : Provides re-usable templates, methods, and ways of programming applications. Often, these frameworks will provide “widgets” and “libraries” that developers use to create various parts of their application – they may also include the actual tools to create, deploy, and run the final application. Some application frameworks create whole sub-cultures of developers, such as Rails which supports the Ruby programming language. Most application frameworks are open source and free, though there are also many closed source, not-free ones.
Continuous code development lifecycle: releasing software at more frequent intervals (30 days or less) by (a.) doing smaller batches of code, and, (b.) using tools and processes that enable a more lean approach to development. Software released in such a cycle tends to release many small features instead of, in contrast, “traditional” development where 100s of features are bundled up in one version of the software and released every 1-2 years.
Java/.NET: The incumbent enterprise development languages. Very powerful but relatively difficult to learn and take time to program in.
PHP: a server-side scripting language originally designed for web development to produce dynamic web pages. WordPress is written in PHP, as well as Facebook and countless web sites. PHP is infamous for being very quick and easy to get started with (which it is) but turning into a mess of “spaghetti code” after years of work and different programmers. PHP is open source, though Zend, the patron company behind PHP, and others sell “commercial” versions.
Perl: One of the original programming languages of the web, Perl emphasizes a very “Unix way” of programming. Perl can be quick and elegant, but like PHP can result in a pile of hard to maintain code in the long term. While Perl was extremely popular in the first Internet bubble, it has sense taken a back-seat to more popular development worlds such as PHP, Java, and Rails. Perl is open source and there are few, if any, commercial companies behind it.
Python: Like all dynamic languages, Python emphasizes speed of development and code readability. Its an object-oriented language. Python is something of an evolution of Perl, but it not that closely tied to it. Python emphases broadness of functionality while at the same time being a proper, object oriented programing language (not just a way to write “scripts”). Python enjoys steady popularity; Google uses Python as one of its primary programming languages.
Ruby: Ruby and Python are very similar in ethos: emphasizing fast coding with a more human-readable syntax. Ruby became famous with the rise of Rails in the mid-2000s which was a rebellion against the “heavy weight” practices that Java imposed on web development. Ruby is still very popular. Ruby can also be run on-top of the Java virtual machine (via JRuby), providing a good bridge to the Java world. Salesforce’s acquired PaaS, Heroku, uses Ruby, and most modern development platforms use Ruby.
Ruby on Rails: a popular web application framework written in Ruby. Rails is frequently credited with making Ruby “famous”.
Scala: A somewhat exotic language, but it has quite a buzz around it. It’s good for massive scale systems that need to be concurrent (lots of people changing lots of things, often the same things, at the same time). Erlang is another language in this area. Scala runs on the Java Virtual Machine and Common Language Runtime. In April 2009 Twitter announced they had switched large portions of their backend from Ruby to Scala and intended to convert the rest. In addition, Foursquare uses Scala and Lift (Lift is a framework for Scala much in the same way Rails is a framework for Ruby.)
R: a programming language and software environment for statistical computing and graphics.
Clojure: A recent dialect of the Lisp programming language and is good for data intense applications. It runs on the Java Virtual Machine and Common Language Runtime
Runtimes and Platforms
Common Language Runtime (CLR): is the virtual machine component of Microsoft’s .NET framework and is responsible for managing the execution of .NET programs.
Java Virtual Machine (JVM) – the underlying execution engine that the Java language runs on-top of. It controls access to the hardware, networks, and other “infrastructure” and services outside of the main application written in Java. Of special note is that many languages other than Java can run on the JVM (as with the CLR), e.g., Scala, Ruby, etc. There are many JVMs and ISVs (IBM, Oracle, etc.) will use their custom JVMs as key differentiators for middle ware, mostly around performance, scale-out, and security.
Openshift: Red Hat’s Platform as a Service (PaaS) offering. More specifically, OpenShift is a PaaS software layer that Red Hat runs and manages on top of third party providers – Amazon first with more to follow.
Heroku: A Platform as a Service (PaaS) offering that was acquired by Salesforce.com. It supports development of Ruby on Rails, Java, PHP and Python.
CloudFoundry: A Platform as a Service (PaaS) offering and VMware-led project. Cloud Foundry provides a platform for building, deploying, and running cloud apps using the Spring Framework for Java developers, Rails and Sinatra for Ruby developers, Node.js and other JVM languages/frameworks including Groovy, Grails and Scala.
Joyent: Offers PaaS and IaaS capabilities through the public cloud. Dell resells this capability as turnkey solution under the name The Dell Cloud Solution for Web applications. Joyent also sponsors the development of node.js and employs its creator.
GitHub: a web-based hosting service for software development projects that use the Gitrevision control system. GitHub offers both commercial plans and free accounts for open source projects.
But wait there’s more…
Stay tuned for the next couple of entries when I will cover first the Database tier and then the Infrastructure tier.
I have “long” thought that data is the currency of the Internet, it is what is monetized and when aggregated, parsed and made accessible, where the value lies for businesses and individuals. I was interested then to read in Rich Miller’s article about another analogy that had recently been drawn:
“Data is the new oil,” said Andreas Weigend, social data guru and former chief scientist at Amazon.com. “Oil needs to be refined before it can be useful. Big data startups are the new refineries.”
Another way of putting this might be to say that the internet runs on data.
Just a few thoughts to share over my much needed morning coffee. Talk amongst yourselves.
Dell has been working for the last four plus years outfitting the biggest of the big web superstars like Facebook and Microsoft Azure with infrastructure. More recently we have been layering software such as Hadoop, OpenStack and crowbar on top of that infrastructure. This has not gone unnoticed by web pub GigaOm:
Want to become the next Amazon Web Services or Facebook? Dell could have sold you the hardware all along, but now it has the software to make those servers and storage systems really hum.
They also made the following observation:
Because [Dell] doesn’t have a legacy [software] business to defend, it can blaze a completely new trail that has its trailhead where Oracle, IBM and HP leave off.
Letting customers focus on what matters most
Its a pretty exciting time to be at Dell as we continue to move up the stack outfitting web players big and small. The idea is to get these players established and growing in an agile and elastic way so they can concentrate on serving customers rather than building out their underpinning software and systems.
Over the past three years Dell’s Data Center Solutions group has been designing custom microservers for a select group of web hosters. The first generation allowed one of France’s largest hosters, Online.net to enter a new market and gain double digit market share. The second generation brought additional capabilities to the original design along with greater performance.
A few months ago we announced that we were taking our microserver designs beyond our custom clients and making these systems available to a wider audience. Last month the AMD-based PowerEdge C5125 microserver became available and yesterday the Intel-based PowerEdge C5220 microserver made its debut. Both are ultra-dense 3U systems that pack up to twelve individual servers into one enclosure.
To get a great overview of both the 12 sled and 8 sled versions of the new C5220 system, let product manager Deania Davidson take you on a quick tour:
Target use-cases and environments
Hosting applications such as dedicated, virtualized, shared, static content, and cloud hosting
Web 2.0 applications such as front-end web servers
Power, space, weight and performance constrained data center environments such as co-los and large public organizations such as universities, and government agencies