The Path Beyond HADOOP: Software Database Systems for Big Data Analytics
Feature Article: January 2013


The waves of big data are massive and arriving continuously, demanding a processing platform that is expansive, economical, and accessible. By “big data” we mean traditional business data such as orders, transactions, and customer profiles, as well as new data sources flowing from machines, sensors, and social networks. This is data that is measured in exabytes and beyond. If studies are correct, this is just the beginning: according to IBM, 90% of the world’s digital data has been created in just the last two years.

Heralding Hadoop
Hadoop offers an affordable, elastic, and truly scalable open-source software stack that supports data-intensive distributed applications. The combination of the Hadoop Distributed File System (HDFS) and the MapReduce framework has improved how data is stored and manipulated and has helped organizations keep up with explosive data growth. Hadoop provides affordable storage for large data volumes and enables fast data ingest, and its schema-free orientation makes it well suited to storing vast quantities of data in the most granular form. Good for batch processing, Hadoop supports massive, simple ETL (extract-transform-load) jobs while providing the ability to readily scale up to handle more data or to shorten processing time. Whether data is unstructured, semi-structured, or structured, the Hadoop environment enables big data to be captured in its native format and parsed later, as necessary.
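The idea of capturing data in its native format and parsing it later can be illustrated with a minimal sketch. This is plain Python rather than Hadoop code; the sample records, the `RAW_STORE` list, and the `read_orders` function are all hypothetical names invented for the illustration. The point is only that nothing is validated when data lands, and a schema is applied when a reader needs one.

```python
import json

# Raw events are kept exactly as they arrived; no schema is enforced on write.
# (Hypothetical sample data standing in for files landed in a raw store.)
RAW_STORE = [
    '{"type": "order", "customer": "c1", "amount": 25.0}',
    '{"type": "sensor", "device": "d7", "temp_c": 21.5}',
    '{"type": "order", "customer": "c2", "amount": 40.0, "coupon": "X1"}',
    'not valid json at all',
]

def read_orders(raw_lines):
    """Apply an 'order' schema at read time, skipping records that do not fit."""
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # the malformed line stays in the store; this reader skips it
        if isinstance(record, dict) and record.get("type") == "order":
            # Keep only the fields this particular analysis cares about.
            yield {"customer": record["customer"], "amount": float(record["amount"])}

orders = list(read_orders(RAW_STORE))
total = sum(order["amount"] for order in orders)
```

A different reader could later apply a sensor schema to the same stored lines without any reload, which is the flexibility the schema-free approach buys at the cost of doing the parsing work on every read.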

 

Businesses of all sizes have contributed to Hadoop's growth, with IDC predicting Hadoop to grow 60 percent through 2016. The business case for storing all of this big data is that it will yield game-changing insights. A recent Harvard Business Review poll found that 85 percent of the executives surveyed expect to gain substantial business and IT advantage from big data. The reality is that, to deliver on that expectation, an organization must integrate and analyze big data as well as collect it. This is where many Hadoop users have begun to look for alternatives.
 
A growing number of advanced big data practitioners are finding that Hadoop tools are marginal for big data integration and for analytics on structured data. Relative to the expectations of the business intelligence community, Hadoop query times are slow, its access methods are arcane, it isolates data subject areas from each other, and it lacks a rich third-party ecosystem of tools for analysis, reporting, and presentation.

Driving Business Value from Big Data: DBMS Analytics at Hadoop Scale

In comparison to Hadoop and its NoSQL offshoots such as Hive, HBase, Cassandra, Pig, and others, big data DBMS (database management system) solutions, i.e., SQL engines, are attractive for a variety of reasons. Compared to Hadoop solutions, big data DBMS solutions have fast query response times and make it easy to join disparate data. SQL skills are ubiquitous, and the third-party tool ecosystem is robust.
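To make the point about joining disparate data concrete, here is a small sketch using Python's built-in `sqlite3` module as a stand-in for a SQL engine. The `customers` and `clicks` tables and their contents are invented for illustration, and SQLite is of course a single-node database rather than a big data DBMS; the example only shows how little code a declarative join takes.

```python
import sqlite3

# An in-memory database stands in for the SQL engine (illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id TEXT PRIMARY KEY, region TEXT);
    CREATE TABLE clicks (customer_id TEXT, page TEXT);
    INSERT INTO customers VALUES ('c1', 'East'), ('c2', 'West');
    INSERT INTO clicks VALUES ('c1', '/home'), ('c1', '/cart'), ('c2', '/home');
""")

# Combine legacy customer profiles with newer clickstream data in one statement.
rows = conn.execute("""
    SELECT c.region, COUNT(*) AS click_count
    FROM clicks AS k
    JOIN customers AS c ON c.customer_id = k.customer_id
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()
conn.close()
```

The engine, not the analyst, decides how to execute the join, which is why existing SQL skills and reporting tools carry over directly.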

To deliver maximum business value, big data needs to be integrated with existing legacy data from corporate systems and external data providers. The overwhelming majority of these data sets reside in SQL data stores, such as data warehouses, data hubs, and operational data stores. But Hadoop has raised expectations about cost, elasticity, and scalability. These are areas where Hadoop shines. By providing a platform for ingesting and storing big data, Hadoop has defined the features of the next-generation analytics DBMS.

Software Database Management Systems
The data warehousing community has always made room for high-performance database management systems (DBMSs) that used proprietary hardware, because massive ingest rates and fast response times for big data analytics were not achievable on standard hardware. Today, however, standard x86 hardware combined with next-generation software DBMSs can deliver the goods at a much lower cost and with many other advantages that are inherent to software running on standard hardware.
 
With few exceptions, big data warehouse appliances (i.e., DBMS software configured on vendor-engineered hardware) use massively parallel processing (MPP) architectures to reliably deliver performance for big data analytics and information management. By “big” we mean systems handling terabytes to petabytes of data. MPP architectures divide a workload among independent processors so that adding nodes delivers linear speed-up gains.
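The way an MPP system divides a workload can be sketched in a few lines. The following is a simplified, single-process simulation and not any vendor's implementation: the function names (`partition`, `local_aggregate`, `mpp_sum`) and the sample rows are hypothetical, and the "nodes" are just Python lists. Rows are hash-distributed by key, each node aggregates only its own slice, and a coordinator merges the partial results.

```python
from collections import Counter, defaultdict
import zlib

def partition(rows, num_nodes):
    """Hash-distribute (key, value) rows so each key lands on exactly one node."""
    nodes = defaultdict(list)
    for key, value in rows:
        node_id = zlib.crc32(key.encode("utf-8")) % num_nodes
        nodes[node_id].append((key, value))
    return nodes

def local_aggregate(node_rows):
    """Work done independently on one node: sum values per key."""
    totals = Counter()
    for key, value in node_rows:
        totals[key] += value
    return totals

def mpp_sum(rows, num_nodes):
    """Scatter rows to nodes, aggregate locally, then merge at a coordinator."""
    partials = [local_aggregate(r) for r in partition(rows, num_nodes).values()]
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds values for matching keys
    return dict(merged)

# Hypothetical sample data: (region, sales amount).
sales = [("east", 10), ("west", 5), ("east", 20), ("west", 7)]
result = mpp_sum(sales, num_nodes=4)
```

Because the nodes share nothing while aggregating, each one handles roughly 1/N of the rows. In a real cluster that per-node work runs concurrently, which is where the speed-up from adding nodes comes from; this sketch runs serially and demonstrates only the data flow, not the performance gain.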

The secret sauce of the first generation of MPP data warehouse appliances in the 1990s (e.g., Teradata and Tandem) was their proprietary high-speed networking between processing nodes. But today, affordable high-speed network solutions, including 10 Gigabit Ethernet (10 GbE) and InfiniBand, make proprietary network solutions unnecessary.

The next generation of turbo-charged appliances (e.g., Netezza and Kickfire) amped up performance even further with field-programmable gate arrays (FPGAs), reconfigurable chips used as specialized co-processors for DBMS tasks. CPU evolution after the early 2000s, however, turned toward multi-core designs. After a slow start, this movement picked up momentum, first with two-core, then four-core, and now eight-core CPUs shipping in volume. This is only the beginning, as further advances are on the way. With more cores, the advantages of co-processors such as FPGAs for computation decrease, and the disadvantages of moving data to and from co-processors increase. Memory bandwidth has increased in step with multi-core processors, making the CPU a much more powerful computation engine.

Shipments of servers and storage to host Hadoop farms are growing at a rate of nearly 60% per year: rack upon rack of off-the-shelf, relatively inexpensive x86 servers running Linux. Much of this gear is being used to land and assimilate big data, both structured and unstructured. The newer class of software-only MPP DBMSs can run on the same hardware. These solutions offer a variety of methods for sharing data between Hadoop and database processes. Like Hadoop, software-based DBMSs allow customers to scale at the level needed, for example, by adding a single node or 100 nodes.

Business Intelligence & Data Warehousing Moves to the Cloud
Cloud computing is the biggest IT game-changer in decades. Estimates of the market growth rates for public cloud products and services keep getting revised upward. In fact, conservative estimates predict 20% annual growth on a base of roughly $110 billion in 2012 in the U.S. alone. On the face of it, the case for moving to the cloud is killer: pay for use, capacity on demand, always-on operations, economies of scale through resource pooling and sharing, price and performance competition between cloud providers (versus captive data centers), IT headcount reduction, and more.

Countless young companies are growing up knowing only public clouds as their IT infrastructure. Their IT departments are skeletal or non-existent, relying instead on software as a service (SaaS), platform as a service (PaaS), and database as a service (DBaaS). Established companies are moving toward public clouds more cautiously but measurably, starting with one-time projects and shadow IT operations. Private clouds are growing rapidly as well: spending on private cloud infrastructure in 2011 was about $11 billion, compared to $6 billion just a year earlier.

New BI tools are emerging that are built strictly for cloud-based business intelligence, and several established tools are pivoting hard into the cloud business. For a data mart, a data warehouse, or an analytic sandbox to run in the cloud, public or private, the DBMS must be software that runs on virtualized hardware. Data warehouse appliances don’t fit the cloud paradigm. The good news is that there are MPP software solutions out there to fit the bill.

A New Era for Big Data Warehouse DBMS
The combination of really big data (Hadoop scale in terms of variety, velocity, and volume) and the move to cloud and virtualized platforms will cause heartburn for traditional MPP vendors whose solutions are tightly coupled with proprietary hardware. Data warehouse appliances have generated rich revenue streams for their parent companies, and big data warehouses are usually worth even more than their costs. Now customers can look to a new generation of software MPP DBMSs to deliver big data-driven solutions on standard hardware and in the cloud. This new breed of SQL engine is radically scalable, cloud-enabled, and always on, allowing organizations to ride the wave of today's big data deluge.