Here is part two of three of the Web glossary I complied. As I mentioned in my last two entries, in compiling this I pulled information from various and sundry sources across the Web including wikipedia, community and company web sites and the brain of Cote.
- Structured data: Data that can be organized in a structure e.g. rows or columns so that it is identifiable. The most universal form of structured data is a database like SQL or Access.
- Unstructured data: Data that has no identifiable structure. Unstructured data typically includes bitmap images/objects, text and other data types that are not part of a database. Most enterprise data today can actually be considered unstructured. An email is considered unstructured data.
- Big Data: Data characterized by one or more of the following characteristics: Volume – A large amount of data, growing at large rates; Velocity – The speed at which the data must be processed and a decision made; Variety – The range of data, types and structure to the data
- Relational Databases (RDBMS) Management Systems: These databases are the incumbents in enterprises today and store data in rows and columns. They are created using a special computer language, structured query language (SQL), that is the standard for database interoperability. Examples: IBM DB2, MySQL, Microsoft SQL Server, PostgreSQL, Oracle RDBMS, Informix, Oracle Rdb, etc.
- NoSQL: refers to a class of databases that 1) are intended to perform at internet (Facebook, Twitter, LinkedIn) scale and 2) reject the relational model in favor of other (key-value, document, graph) models. They often achieve performance by having far fewer features than SQL databases and focus on a subset of use cases. Examples: Cassandra, Hadoop, MongoDB, Riak
- Recommendation engine: A recommendation engine takes a collection of frequent itemsets as input and generates a recommendation set for a user by matching the current user’s activity against the discovered patterns. The recommendation engine is on-line process, therefore its efficiency and scalability are key, e.g. people who bought X often also bought Y.
- Geo-spatial targeting: the practice of mapping advertising, offers and information based on geo location.
- Behavioral targeting: a technique used by online publishers and advertisers to increase the effectiveness of their campaigns. Behavioral targeting uses information collected on an individual’s web-browsing behavior, such as the pages they have visited or the searches they have made, to select which advertisements to display to that individual.
- Clickstream analysis: On a Web site, clickstream analysis is the process of collecting, analyzing, and reporting aggregate data about which pages visitors visit in what order – which are the result of the succession of mouse clicks each visitor makes (that is, the clickstream). There are two levels of clickstream analysis, traffic analysis and e-commerce analysis.
- Gluster: a software company acquired by Red Hat that provides an open source platform for scale-out Public and Private Cloud Storage.
- Relational Databases
- MySQL: the most popular open source RDBMS. It represents the “M” in the LAMP stack. It is now owned by Oracle.
- Drizzle: A version of MySQL that is specifically targeted the cloud. It is currently an open source project without a commercial entity behind it.
- Percona: A MySQL support and consulting company that also supports Drizzle.
- PostgreSQL: aka Postgres is is an object-relational database management system (ORDBMS) available for many platforms including Linux, FreeBSD, Solaris, Windows and Mac OS X.
- Oracle DB – not used so much in new WebTech companies, but still a major database in the development world.
- SQL Server – Microsoft’ s RDBMS
- MongoDB: an open source, high-performance, database written in C++. Many Linux distros include a MongoDB package, including CentOS, Fedora, Debian, Ubuntu and Gentoo. Prominent users include Disney interactive media group, New York Times, foursquare, bit.ly, Etsy. 10gen is the commercial backer of MongoDB.
- Riak: a NoSQL database/datastore written in Erlang from the company Basho. Originally used for the Content Delivery Network Akamai.
- Couchbase: formed from the merger of CouchOne and Membase. It offers Couchbase server powered by Apache CouchDB and is available in both Enterprise and Community editions. The author of CouchDB was a prominent Lotus Notes architect.
- Cassandra: A scalable NoSQL database with no single points of failure. A high-scale, key/value database originating from Facebook to handle their message inboxes. Backed by DataStax, which came out of Rackspace.
- Mahout: A Scalable machine learning and data mining library. An analytics engine for doing machine learning (e.g., recommendation engines and scenarios where you want to infer relationships).
- Hadoop ecosystem
- Hadoop: An open source platform, developed at Yahoo that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. It is particularly suited to large volumes of unstructured data such as Facebook comments and Twitter tweets, email and instant messages, and security and application logs.
- MapReduce: a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. Hadoop acts as a platform for executing MapReduce. MapReduce came out of Google
- HDFS: Hadoop’s Distributed File system allows large application workloads to be broken into smaller data blocks that are replicated and distributed across a cluster of commodity hardware for faster processing.
- Major Hadoop utilities:
- HBase: The Hadoop database that supports structured data storage for large tables. It provides real time read/write access to your big data.
- Hive: A data warehousing solution built on top of Hadoop. An Apache project
- Pig: A platform for analyzing large data that leverages parallel computation. An Apache project
- ZooKeeper: Allows Hadoop administrators to track and coordinate distributed applications. An Apache project
- Oozie: a workflow engine for Hadoop
- Flume: a service designed to collect data and put it into your Hadoop environment
- Whirr: a set of libraries for running cloud services. It’s ideal for running temporary Hadoop clusters to carry out a proof of concept, or to run a few one-time jobs.
- Sqoop: a tool designed to transfer data between Hadoop and relational databases. An Apache project
- Hue: a browser-based desktop interface for interacting with Hadoop
- Cloudera: a company that provides a Hadoop distribution similar to the way Red Hat provides a Linux distribution. Dell is using Cloudera’s distribution of Hadoop for its Hadoop solution.
- Solr: an open source enterprise search platform from the Apache Lucene project. Backed by the commercial company Lucid Imagination.
- Elastic Search: an open source, distributed, search engine built on top of Lucene (raw search middleware).
Pau for now…
Thanks for putting this glossary together, really looking forward for the next parts. For me, the best way to understand things is by getting my hands dirty – but the most difficult question at the beginning is: “Where do I get started?” Hence, I would appreciate greatly if you could provide some “For Starters” info at the end of each chapter. Where do I get tutorials & workbooks? Which developer tools do I need and where do I get them? Which blogs should I read? Books? Which people / companies should I follow on Twitter? Events worth attending? Online-Communities to join?