Big Data Management Platforms:

Architecting Heterogeneous Solutions

Ravi Chandran

Ravi Chandran is the CTO of XtremeData, Inc., which provides massively scalable, high-performance DBMS for big data analytics. ravi.chandran@xtremedata.com

At the same time, no enterprise can afford to ignore the value of big data analysis. Big data has been democratized: where it was once the domain of global B2C enterprises, now businesses of all sizes have easy access to, and can benefit from, big data. The traditional IT model—purchasing excess capacity to meet the next few years’ growth—will no longer work.


Furthermore, CIOs are being challenged to do more with less. Data volumes are growing much faster than budgets can possibly increase, so incremental evolution of legacy technologies cannot meet demands. In recent years, market pressures have spawned a wide range of new technologies, many of which are narrowly focused on specific applications. This article focuses on structured data sets, essentially the traditional ecosystem of data warehouses and data marts extended to include new big data sources. Unstructured or loosely structured data such as text files, documents, audio, and video have their
own specialized solutions. The challenge of structured big data analytics is to deploy solutions that scale affordably in the context of loading, joining, aggregating, analyzing, and reporting on billions of records quickly enough to meet business demands. A one-size-fits-all approach is not possible today; consequently, a big data platform strategy needs to take a heterogeneous approach.

All components of a heterogeneous solution need to meet one common criterion: the ability to be deployed on today’s converged, sharable hardware infrastructure. This is the era of commodity (Linux on x86) hardware, scalable horizontally as clusters of servers and virtualized to enable sharing by multiple OSes and applications. This converged architecture applies equally to public clouds (such as Amazon Web Services and Google Compute Engine) and private clouds within the enterprise data center.

Figure 1

The technologies that apply to structured big data analysis may be thought of as falling into three broad classes: Hadoop, massively parallel processing (MPP) SQL engines, and specialized reporting solutions that include in-memory and column-store databases. This is shown in Figure 1. The data flow for big data travels naturally from left to right in a funnel shape, getting smaller in volume as it travels. This diagram illustrates one particular example of data flow; not all processes will require this same sequence, but the three classes of solutions are typically the necessary components. The strengths and weaknesses of each class are described in more detail later in this article.

As depicted, Hadoop is meant to include all variants and distributions, and incorporates all the core components: the Hadoop Distributed File System (HDFS), MapReduce, Hive, Pig, ZooKeeper, and so on.

Hadoop is designed for large scale, and it is not unusual for systems to comprise thousands of nodes and store hundreds of terabytes. MPP SQL engines are also designed for scale, but are typically smaller: hundreds of nodes and tens of terabytes. The solutions for specialized reporting are the smallest: typically non-MPP, single-node implementations of one terabyte or less. Together, these three classes of solutions form a heterogeneous mix of tools in a big data management platform (BDMP), in contrast to a traditional, homogeneous data warehouse.

Figure 2

The SQL Cycle
One common thread linking the three classes of solutions is SQL: all but Hadoop support it natively. Even though Hadoop initially kicked off the NoSQL movement a few years ago, things have come full circle, as shown in Figure 2.

Hadoop with MapReduce was the initial solution, but the market quickly realized that the skills for MapReduce programming were scarce in comparison to SQL skills. A “lite-SQL” API was then implemented on top of the MapReduce framework in the form of Pig and Hive. Performance became an issue because of the limitations of the underlying MapReduce framework.
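To make the "lite-SQL" layer concrete, here is a minimal sketch of the kind of query Hive accepts (table and column names are hypothetical); Hive compiles statements like this into one or more MapReduce jobs, which is where the performance limitations arise.

```sql
-- Hypothetical HiveQL: a simple aggregation over a web-log table.
-- Hive translates this declarative query into MapReduce jobs behind
-- the scenes, so even simple queries inherit batch-oriented latency.
SELECT page_url,
       COUNT(*) AS hits
FROM web_logs
WHERE log_date = '2013-06-01'
GROUP BY page_url
ORDER BY hits DESC
LIMIT 100;
```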

Today, there are numerous efforts under way to bypass MapReduce, such as Cloudera’s Impala. However, “lite-SQL” is not enough; SQL programmers want more. The bottom line is that the market is asking for a full-featured MPP SQL engine that can complement Hadoop.

Understanding Solution Strengths and How They Coexist
As we mentioned, SQL skills are plentiful and SQL programmers have certain expectations about their environment, as shown in Table 1.

In the context of big data analytics, not all of these skills or expectations may be strictly necessary for a successful solution. In fact, one or more of these constraints may be relaxed in order to achieve better performance and scalability. These expectations are noted as a backdrop to the more detailed analysis of the three classes of solutions—Hadoop, MPP SQL engines, and specialized reporting solutions—we discuss now.

Figure 3

Hadoop
Hadoop solutions are ideal for the landing and staging of large data volumes, both structured and unstructured. The underlying HDFS brings excellent scalability and high availability to storage on inexpensive commodity hardware. Loading data into Hadoop is nothing more than writing a file into the file system, as opposed to the onerous checking and validation process involved in a “load” into a relational database management system (RDBMS). The higher layers of the Hadoop stack (MapReduce, Pig, Hive) do not insist on a rigid, predetermined schema. The data is interpreted on demand by the higher layers. This feature makes Hadoop a great environment to rapidly ingest and store large volumes of data.
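As a hedged illustration of this schema-on-read model (file paths and column names are hypothetical), data can simply be copied into HDFS as delimited files, and a Hive external table imposes a schema only when the data is queried:

```sql
-- Hypothetical Hive DDL: the files already sit in HDFS (for example,
-- copied in with `hdfs dfs -put`); no validation happens at write time.
-- The schema below is applied only when the table is read.
CREATE EXTERNAL TABLE clicks (
  event_ts  STRING,
  user_id   STRING,
  page_url  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/landing/clicks/';
```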

However, Hadoop is not a relational database engine and was never intended to be. A SQL API and/or a SQL execution engine does not transform Hadoop into a true ACID-compliant RDBMS (atomicity, consistency, isolation, durability). Maintaining complex relationships between tables and managing user access permissions and concurrent accesses are difficult in the Hadoop environment. These solutions also struggle with data-intensive processing such as grouping, aggregating, and non-trivial joins between large tables. The pros and cons are summarized in Table 2.

For more sophisticated metadata management and complex SQL-based operations, true relational DBMS solutions offer a richer environment and much faster performance. Faster performance translates to smaller hardware footprints and, therefore, cost savings. DBMS solutions offer a well-understood, easy-to-use environment for the rapid development of analytics applications.

MPP SQL Engines and Specialized Reporting Solutions
The last two classes of solutions (MPP SQL engines and specialized reporting solutions) are both typically true RDBMS engines, differentiated mainly by whether they take a row-store, column-store, or in-memory approach.

Let’s look at the last approach first. In-memory solutions offer extremely fast response, but are limited by the amount of physical memory installed. High-performance, in-memory DBMS solutions are typically built on the assumption of using a single shared-memory space. Other specialized solutions offer sharable distributed memory (such as Memcached, Druid, and Hazelcast), but these solutions are non-RDBMS and do not easily meet the requirements of the business intelligence reporting ecosystem (presentation and analysis tools such as MicroStrategy, Cognos, BusinessObjects, and so on).

Today, in-memory RDBMS solutions are mostly restricted to the size of memory available on a single server. This constrains in-memory DBMS solutions to relatively small sizes. A new twist to the in-memory versus on-disk alternatives is the rise of solid-state devices (SSDs), discussed later.

In summary, DBMS solutions that have been specifically engineered for in-memory operation have memory-resident data structures, and make certain assumptions about latency and speed of memory access. If these assumptions are violated, for example, by non-uniform memory access (NUMA) in a distributed system, then performance will tend to be unpredictable and erratic. On the other hand, a natively parallel (MPP) DBMS engine, designed for a shared-nothing, distributed architecture, will much better leverage the resources in a modern cluster.
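As a rough sketch of what a shared-nothing, distributed design means in practice, consider the DDL below. The table and columns are hypothetical, and the DISTRIBUTED BY clause is Greenplum-style syntax used here only as one example; other engines use different keywords for the same idea. Rows are hash-distributed across the cluster so that joins and aggregations on the distribution key run locally, in parallel, on each node.

```sql
-- Hypothetical MPP table definition (Greenplum-style syntax; other
-- engines use keywords such as DISTKEY or PRIMARY INDEX instead).
-- Rows are hash-distributed across the cluster by customer_id, so each
-- node owns a disjoint slice and can join or aggregate on that key
-- without shuffling data between nodes.
CREATE TABLE impressions (
  impression_id BIGINT,
  customer_id   BIGINT,
  ad_id         BIGINT,
  event_ts      TIMESTAMP,
  cost_usd      NUMERIC(10,4)
)
DISTRIBUTED BY (customer_id);
```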


That leaves us with row-store versus column-store database engines. Column-store engines have been around for decades (such as Sybase IQ), but have not succeeded in grabbing major market share. The reasons for this are fairly straightforward. Most incoming data is naturally organized as rows (records) with multiple columns (fields). Most query results return rows (groupings) with multiple columns (aggregates).

The advantages of a column-store are obvious when input data has many columns and queries touch only a few: performance is much faster because only those few columns are read from disk. However, there are inherent costs: each row must be decomposed into columns during load, and some mechanism is needed to reconstruct the row when required. That mechanism typically involves either maintaining the columns in a known order or maintaining keys/indexes. This overhead means column-stores are typically unable to support write-heavy operations, such as rapid, continuous ingests; inserts/updates/deletes; and the iterative creation and modification of temporary tables.
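A minimal sketch of the read-side advantage (the wide table and its columns are hypothetical): when a query touches two columns out of, say, two hundred, a column-store reads only those two from disk, while a row-store must scan every complete row.

```sql
-- Hypothetical query against a very wide customer table.
-- A column-store fetches only region and lifetime_value from disk;
-- a row-store reads entire rows, including all the unused columns.
SELECT region,
       AVG(lifetime_value) AS avg_lifetime_value
FROM customer_profile
GROUP BY region;
```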

Column-stores deliver efficient data compression and are best suited for “write once/read many” workloads with predetermined queries, as shown in Table 3. The flip side to this summary of column-store databases is that row-store databases do well at everything that the column-stores are not good at; the designs complement each other.

Many big data analytic processes require data-intensive processing (joins, groups, aggregates), as well as frequent or continuous data ingest and iterative write operations (temporary staging tables). Row-oriented database solutions fit the bill. Traditionally, these types of data-intensive processing were performed either using an ETL tool or via batch SQL within an RDBMS.
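A hedged sketch of such an in-database ELT step (all table and column names are hypothetical): an intermediate result is staged in a temporary table and then aggregated, exactly the kind of iterative, write-heavy pattern that favors a row-oriented engine.

```sql
-- Hypothetical batch-SQL ELT: stage an enriched intermediate result,
-- then roll it up into a reporting table. The temporary table is an
-- iterative write operation that column-stores handle poorly.
CREATE TEMPORARY TABLE enriched_clicks AS
SELECT c.user_id,
       c.page_url,
       c.event_ts,
       p.customer_segment
FROM raw_clicks c
JOIN customer_profile p ON p.user_id = c.user_id;

INSERT INTO daily_clicks_by_segment
SELECT CAST(e.event_ts AS DATE) AS click_date,
       e.customer_segment,
       COUNT(*) AS clicks
FROM enriched_clicks e
GROUP BY CAST(e.event_ts AS DATE), e.customer_segment;
```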

With the advent of big data, this is increasingly difficult to perform using tools and environments that are separate from the converged, scalable hardware infrastructure. These ETL-like processes are now being ported to Hadoop MapReduce, and when complex processing is required, they are directed to an MPP row-based RDBMS deployed on the same hardware environment.

Cohabitation Brings Real-World Results
As we mentioned, Hadoop-based solutions are great for landing, staging, and transforming massive volumes of structured and unstructured data. Column-oriented databases are ideal for write once/read many data stores that require fast querying. Row-oriented MPP databases are good for mixed read/write workloads and for ongoing, heavy-lifting ELT (extract, load, then transform) within the database.

Consider a real-world example of a rapidly growing firm in the exploding business of online digital advertising.

The company is building out a data management platform that relies on all three of these solutions to maximize productivity at a reasonable cost. This company collects data about online advertising activity, aligns the data with offline customer information, and uses analytics to optimize ad placement, pricing, and campaigns. The data management platform ingests nearly 10 TB of granular data each day and uses Hadoop to stage the data.

The platform also uses an MPP row-oriented database for its strengths: complicated data integration that relates online impression and clickthrough activity to offline customer information and performs various roll-ups, both on a daily and an intra-hour basis. The database engine runs on the same inexpensive Hadoop cluster hardware. Finally, the system creates multiple data marts to be published to a column-store database. These data marts exploit their fast query response times to enable external customer access for reporting purposes. Each weekend the column-store data marts are rebuilt with new data.
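As a hedged sketch of that weekly rebuild step (table and column names are hypothetical), the heavy roll-up runs in the row-oriented MPP engine and the resulting mart table is then published to the column-store for fast, read-only customer queries:

```sql
-- Hypothetical data mart build: aggregate in the MPP row-store, then
-- publish the result as a read-mostly table in the column-store.
-- clicked is assumed to be a 0/1 flag on each ad event record.
CREATE TABLE mart_campaign_weekly AS
SELECT campaign_id,
       DATE_TRUNC('week', event_ts) AS week_start,
       COUNT(*)      AS impressions,
       SUM(clicked)  AS clicks,
       SUM(cost_usd) AS spend_usd
FROM ad_events
GROUP BY campaign_id, DATE_TRUNC('week', event_ts);
```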

By using best-of-breed heterogeneous tools rather than one-size-fits-all tools, this company runs a near-real-time, analytics-driven business, providing customers with subsecond query responses at a manageable cost. They expect that today’s 20 TB data warehouse will be 200 TB in 18 months. As a sign of the times, this digital advertiser is moving portions of its system to a public cloud where some of its data sources also reside.

Cloud on the Horizon
Cloud-based computing is dramatically changing the enterprise data center landscape. A 2011 survey by North Bridge Venture Partners projected that cloud spending will increase at a compound annual growth rate of 67 percent through 2016 (Skok et al., 2011). The report forecast that 36 percent of data center budgets will be spent on cloud infrastructures by 2016. More than half the respondents in this survey expected to deploy hybrid clouds consisting of both public and private components. According to this study and others like it, the top three reasons reported for cloud adoption are agility, scalability, and cost (in that order). Heavyweights such as HP, Google, and AT&T are making major pushes into the public cloud market.

Big data platforms are already embracing public cloud computing in two increasingly common cases. The first is where an intercompany data supply chain (e.g., digital advertising chains linking website publishers to online ad networks, or healthcare chains linking providers to payers) involves companies whose data is already cloud-based, so moving it off the cloud creates an extra hop. In another common case, cloud platforms are being used to create agile and economical hot backup environments. These use cases are the tip of the arrow for the growth of cloud computing in big data analytics.


Freeware at What Price?
Amid their efforts to contain costs, today’s architects and planners need to be wary of the allure of freeware and open source software. The major caution is not about the need for support, for which there are many commercial options. Instead, there is a hidden cost: most free software solutions were designed to provide functionality rather than high performance. It is not unusual for optimized commercial MPP RDBMS engines to outperform Hadoop-like solutions by factors of 10x to 50x, which translates to needing 10x to 50x the hardware to achieve the same performance. This is often overlooked.

Because Hadoop solutions are relatively new, CIOs and other decision makers do not have established yardsticks to measure relative costs. As long as the Hadoop solution functions correctly, the costs go unchallenged. This would be unimaginable in a traditional data warehouse vendor evaluation, where even a 20 percent cost differential would be significant, much less a factor of 10x to 50x!

Technology Trends and Column-Stores
As noted earlier, non-volatile memory (SSD) technology is rapidly evolving: storage capacity is going up and prices are going down. In the hardware comparison of volatile memory versus non-volatile memory versus disk, the most dramatic changes are occurring in the middle with SSDs. How will this impact the three DBMS solutions of in-memory, column-store, and row-store? In-memory solutions rely on a uniformly accessible read/write memory space (main memory). SSDs, unlike main memory, have a block structure and different read and write characteristics. Data structures designed for main memory do not port easily to SSDs, so leveraging SSDs requires major reengineering.


Column-stores were fundamentally designed to work around the penalty of low disk bandwidth and the penalty of random disk read/writes (the “seek” penalty). Accessing only columns of interest (compressed) reduces both the bandwidth and the random-seek requirements of the disk system. However, SSDs fundamentally remove these same penalties, as they offer very high bandwidth (about 10 times higher than disks) and no penalty for random seeks.

Of course, column-stores can also use SSDs and get even faster performance, but at some point (perhaps sub-second), faster performance may not provide value because microsecond responses cannot be digested by human clients. Furthermore, network latencies will become greater and more significant than query times. This same logic also applies to main memory: very high bandwidth (about 100 times that of disks) and no penalty for seeks.

As main memory sizes continue to increase (modestly) and SSD sizes increase (greatly), the solutions that will benefit enterprises the most are those that can use these storage layers in a distributed, shared-nothing, non-uniform manner. This applies equally to in-memory, row-store, and column-store solutions.

Conclusion
In the time it took to read this article, tens of billions of new business transactions have taken place and many exabytes of machine-readable data have been generated. This phenomenal growth is driving major shifts in the computing marketplace. The market has responded with a wide spectrum of new technologies, and today the arena is still immature and in a state of rapid evolution. Consolidation and convergence are to be expected over the next few years. In the meantime, for structured big data analytics, a heterogeneous architecture is the best approach. ■