Massively Scalable Big Data Analytics in the Cloud: A Q&A with Mike Lamble of XtremeData
December 17, 2012, by Ron Powell. Business Intelligence & Analytics
This BeyeNETWORK Spotlight features Ron Powell's interview with Mike Lamble, President of XtremeData. Ron and Mike discuss XtremeData’s massively scalable SQL database engine, which runs in the cloud or on premises and provides the high performance required for industrial-strength analytic applications.
Mike, for our readers who may not be familiar with XtremeData, could you give us an overview of the company and your market focus?
Mike Lamble: We offer a massively scalable database management system (DBMS) for big data analytics. That means we're the DBMS for big and/or rapidly growing data warehouses, data marts or analytic sandboxes. These would typically be in the range of half a terabyte to hundreds of terabytes. They’re used for all kinds of purposes, such as cross-business-unit reporting, investigative analytics, or predictive analytics, in virtually every industry, including healthcare, digital media, communications, and financial services. Companies that want to be fast-paced and predictive rather than reactive are using data-driven architectures, integrating data from lots of different pockets of the business to see what's going on and forecast what's likely to go on.
What is your definition of big data?
Mike Lamble: The concept is that big data is big enough that it requires specialized, non-generic tools, techniques, and technologies. Twelve years ago big data was, say, a terabyte to 10 terabytes. Now big data is probably a terabyte to a petabyte or more. The idea of big data has been around for a long time. It's only recently that it made the cover of Harvard Business Review. To make big data applicable today, you have to be working with technologies that scale linearly yet are still affordable at staggering volumes. Linear scaling means that as you add computing resources, you get a proportionate gain in performance, which lets you either cut processing times in proportion to the added resources or handle more data in the same amount of time. Massively parallel architectures are the time-tested type of solution for big data applications.
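To make the linear-scaling idea concrete, the arithmetic can be sketched in a few lines of Python. This is an illustration only and not XtremeData code: the function name `ideal_runtime` and its parameters are hypothetical, and the sketch assumes perfectly linear scaling with no coordination overhead, which real systems only approximate.

```python
def ideal_runtime(base_runtime_s, base_nodes, nodes, data_growth=1.0):
    """Estimate runtime under an idealized linear-scaling assumption.

    Work is assumed to grow in proportion to data volume, and throughput
    in proportion to node count. Real clusters fall short of this ideal.
    """
    if base_nodes <= 0 or nodes <= 0:
        raise ValueError("node counts must be positive")
    return base_runtime_s * data_growth * base_nodes / nodes


# A job that takes one hour on 4 nodes:
same_data_more_nodes = ideal_runtime(3600, base_nodes=4, nodes=8)
double_data_double_nodes = ideal_runtime(3600, base_nodes=4, nodes=8, data_growth=2.0)

print(same_data_more_nodes)      # 1800.0: twice the nodes, half the time
print(double_data_double_nodes)  # 3600.0: twice the data in the same time
```

The two calls correspond to the two outcomes named in the answer: shorter processing time for the same data, or the same processing time for more data.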
Some have come to equate big data with Hadoop, but my colleagues and I have been using the term “big data” since the year 2000 at least, long before Hadoop was popular. In many, if not most cases, what's being done with Hadoop involves big data, but most big data applications are developed outside of Hadoop in data management systems for big data like XtremeData, Teradata, and Netezza.
Sometimes we're talking about structured data, sometimes we're talking about unstructured data, and sometimes we're talking about semi-structured data. In our view of the world, big data involves all of that.
So it seems to me the integration of unstructured, structured, and semi-structured data would give you the best value from an ROI perspective. Would you agree?
Mike Lamble: Well, yes. In almost all cases, you have to turn unstructured data into structured data before you can perform analytics. Cases where people are doing analytics directly on unstructured data, pattern recognition on bitmaps, for example, are relatively few compared to the more abundant business cases that involve parsing data, creating structure, and then performing analytics on it.
There are several other big data trends that we're seeing, and one of the newest involves the cloud. Can big data really be effectively handled with a cloud architecture?
Mike Lamble: Absolutely. There are some kinks to be worked out, but it seems that every week cloud providers are offering new configurations of storage disk and network capacity that are removing the limitations. We know of specific situations that are working toward a 100-terabyte data warehouse that will be run on the cloud. We have situations where we can create a virtual data warehouse appliance in the cloud that creates the kind of capability that you can get from the major appliance vendors at a fraction of the cost. The bigger limitations right now aren't technical. They are policy-related: company policies that restrict placing customer or client proprietary information in a public environment.
It’s fascinating to see that we can create a virtual data warehouse appliance on Amazon in about 15 minutes that would support a 20-terabyte data warehouse with reliable and efficient throughput.
That seems absolutely amazing.
Mike Lamble: That’s how I feel every time I see it done. With a hardware-based appliance, it can take three months or more to order, manufacture, ship, configure, and install the box – compared to minutes on the cloud!
You mentioned policy limitations with regard to the cloud. We've really seen a shift over the last three to four years where those limitations have become less and less of an obstacle as people trust the cloud more. Is that correct?
Mike Lamble: Absolutely. I was with a prospect, a major Wall Street bank, working with them over the past quarter on a data mart application. We asked if they’d be interested in doing this in the cloud. At that point, somebody in the room said, "Over my dead body." Now a quarter later, they still can't get the hardware they want into their data center because there are standards debates. That fellow is now saying, "Maybe we should pilot this on the cloud."
By the way, that's a major category of growth for the cloud. The first adopters were small companies, and some of those companies only know the cloud. They don't have an internal data center. But a major category of cloud growth involves projects from Fortune 1000 companies that are either doing projects on a one-at-a-time basis or a shadow IT basis because they can't get those projects into the data center.
XtremeData makes the claim of having the fastest SQL engine and delivering linear scalability at any size. Is it really possible? And if so, how are you the fastest?
Mike Lamble: It’s possible and true. It comes from rewriting the Postgres execution engine. Postgres is the relational database management codebase that most major players started with. We rewrote the backend to make it truly peer-to-peer, thoroughly multi-threaded, and free of the overhead that rival Postgres MPP implementations had.
What several of the MPP players for the big data warehouse did was create federations of Postgres instances. That approach is not capable of generating the kind of linear scalability that we deliver. It's not sufficiently multi-threaded, has inefficiencies on the data nodes, and creates bottlenecks on the head node. It took us years to get this right, but we did it. What also makes us different from some of our competitors is that we process data, both reads and writes, in big blocks. We do things in thousands of records at a time rather than a record at a time.
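The contrast between record-at-a-time and block-at-a-time processing can be illustrated with a toy Python sketch. It is not XtremeData's implementation: the helper names (`iter_blocks`, `total_row_at_a_time`, `total_block_at_a_time`), the `amount` field, and the block size of 4096 are all assumptions chosen for illustration. The point is only that both loops compute the same result, while the blocked version pays its per-step overhead once per few thousand records instead of once per record.

```python
from itertools import islice


def iter_blocks(records, block_size=4096):
    """Yield successive lists of up to block_size records."""
    it = iter(records)
    while True:
        block = list(islice(it, block_size))
        if not block:
            return
        yield block


def total_row_at_a_time(records):
    """Aggregate by handling one record per loop step."""
    total = 0
    for record in records:
        total += record["amount"]
    return total


def total_block_at_a_time(records, block_size=4096):
    """Aggregate by handling thousands of records per loop step."""
    total = 0
    for block in iter_blocks(records, block_size):
        # One bulk operation per block stands in for a block-oriented
        # read or write in a real engine.
        total += sum(record["amount"] for record in block)
    return total


# Hypothetical sample data: 10,000 records with a numeric column.
records = [{"amount": i} for i in range(10_000)]
blocks = list(iter_blocks(records))
print(len(blocks))  # 3 blocks: 4096 + 4096 + 1808 records
print(total_row_at_a_time(records) == total_block_at_a_time(records))  # True
```

In a real database engine the gain comes from amortizing I/O and function-call overhead across each block; this sketch shows only the control flow, not the performance difference.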
Now from an installation perspective, how difficult is it to implement XtremeData?
Mike Lamble: On cloud, on Amazon AWS, for example, it is just really simple. You can use a tool we have out there to set up a database of arbitrary size in terms of processors and storage variations that Amazon offers, and set this up in literally minutes. It's just a breakthrough capability. At that point, you need to load your schemas and load your tables. It could take an hour or a couple days depending on how big your data warehouse is, whether it's 10 tables or 10,000 tables. But that's just breakthrough capability. I've been in this business for a long time, and I'm just anxious for the market to see what a difference this makes to the lifecycle of their project.
On bare metal, it's not too hard. To set up a multi-node configuration, say a 12-node configuration, it takes an hour or so. Then you're off and ready to go.
What do you feel differentiates XtremeData from all of these other major players that are addressing big data analytics?
Mike Lamble: That is a great question. At an early-stage company, I'm challenged every day with questions about what differentiates us from the pack.
XtremeData is the only row-oriented MPP DBMS that runs on the cloud and bare metal at all scales. Teradata, Exadata, and Netezza all require proprietary hardware. Running on commodity hardware, we scale by one or more nodes at a time. That is, you’re not facing a “fork lift” when you’ve outgrown your current box.
XtremeData is for workloads that require a mix of both read and write processing; for example, for customers with significant ELT (extract, load, transform) processing requirements. Column-store DBMS solutions such as ParAccel, Vertica, and InfoBright are meant for “write once, read many” data marts.
In contrast to Hadoop, our product is ready for industrial-strength requirements that involve heavy SQL usage, integration with the third-party BI ecosystem of tools for statistics and reporting, and low-latency performance. Hadoop is aiming in that direction but has a long way to go.
You’re right: for the less initiated, it appears to be a crowded field. However, when you pare it down, when we're at the table competing for the business, we're typically up against the data warehouse appliances.
Mike, could you share some specific customer examples that illustrate the benefits of XtremeData?
Mike Lamble: We have household-name consumer packaged goods companies using a “demand signal repository” (DSR) that uses our DBMS. The DSR monitors how each company’s products are selling across national retailers and gives them dashboards to show how product sales are responding to demand-generating events. It might be a price change, it might be a coupon drop, or it might be an ad campaign, and the DSR continuously reports how sales around the world are moving in response to those stimulation events. XtremeData is running four multi-terabyte data warehouses 24 hours a day, 365 days a year and providing the analytics, the dashboards, and the data sets that the DSR needs on demand, at a fraction of the cost of alternatives.
XtremeData is also used by a major bioinformatics lab that does iterative, investigative, ad hoc analytics on terabytes of clinical trial results. A third example is a client in the digital advertising community that is using us on a real-time basis to assimilate information from customers’ activity and activity on various websites to create real-time views of customers that are then passed on to different advertising platforms to drive the pricing of ad placements.
Mike, thank you for taking the time to educate our readers about massively scalable SQL databases for big data analytics.