Parallel Everything Architecture
Like many competitors, XtremeData began with an open-source database software package. But unlike others, we then re-engineered the core query execution code with a truly parallel, vectorized SQL engine developed from first principles. The reasons for this are simple. Legacy database software, including all open-source packages, was developed decades ago and is not optimized for the key computing resources of today: many-core CPUs, large amounts of memory and high-speed networks.
Unlike “federated” systems, where multiple complete instances of a database run in parallel, XtremeData offers a single instance of a database that within itself contains a truly parallel SQL execution engine. The core software layer manages all peer-to-peer communication and data exchange between nodes. It has been designed to excel at what other databases find difficult or impossible to do: handle the big data problem of complex SQL against complex schemas. XtremeData is data-model agnostic and does not require careful data partitioning or placement to deliver performance. This enables us to excel at performing complex n-way joins and aggregates against multiple big tables, at scales from 1 TB to hundreds of TB.
At XtremeData we have benchmarked our engine against federation-based competitors and also against “NoSQL” solutions like Hive, and measured performance gains of 10x. What does this mean? Put simply, the federated systems will need 10x the hardware resources in order to match XtremeData.
SQL Acceleration Model
In addition to the vector-oriented execution model, XtremeData implements acceleration of SQL operators using modern techniques such as real-time code generation and just-in-time compilation. Highly optimized libraries have been built for the key operators required to implement SQL query plans.
These optimized libraries take full advantage of modern CPUs:
- many cores
- multi-level cache architecture
- and internal SIMD (Single Instruction Multiple Data) vector units
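To make the idea of real-time code generation concrete, here is a minimal sketch in Python. It is an illustration only, not XtremeData's implementation: the function names (`compile_filter`, `apply_selection`) and the column-batch layout are hypothetical. It shows the general technique of generating source for one specific predicate at query time, compiling it once, and then running it over a whole batch of column values.

```python
# Hypothetical sketch of runtime code generation for a SQL filter.
# None of these names are XtremeData APIs; they exist only for illustration.

ALLOWED_OPS = {"<", "<=", ">", ">=", "==", "!="}

def compile_filter(column, op, constant):
    """Generate and compile a function that filters one batch of columns."""
    if op not in ALLOWED_OPS:
        raise ValueError(f"unsupported operator: {op}")
    if not column.isidentifier():
        raise ValueError(f"invalid column name: {column}")
    if not isinstance(constant, (int, float, str)):
        raise TypeError("constant must be a number or a string")
    # The column, operator and constant are baked into the generated source,
    # so the per-value loop does no interpretation of the predicate.
    source = (
        "def _filter(batch):\n"
        f"    col = batch[{column!r}]\n"
        f"    return [i for i, v in enumerate(col) if v {op} {constant!r}]\n"
    )
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["_filter"]

def apply_selection(batch, selected):
    """Keep only the selected row positions in every column of the batch."""
    return {name: [values[i] for i in selected] for name, values in batch.items()}

# A batch holds columns, not rows: each operator call handles many values.
batch = {"order_id": [1, 2, 3, 4], "amount": [50, 500, 75, 900]}
big_orders = compile_filter("amount", ">", 100)   # compiled once per query
selected = big_orders(batch)                      # reused for every batch
result = apply_selection(batch, selected)
```

A production engine would emit machine code that uses SIMD instructions rather than Python source, but the structure is the same: specialize once, then run the specialized operator over every batch.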
Dynamic Data Redistribution
Legacy databases were architected to minimize data movement and exchange. This legacy remains in today's market solutions and shows up as a significant performance penalty whenever data is exchanged between nodes in a parallel system. This penalty has imposed implementation constraints: data models that are query-aware, and data placement that is sympathetic to queries. The consequences are serious: large amounts of time, effort and money are spent developing and supporting one-off point solutions. Worse, these point solutions are still unable to support newer types of analyses and ad-hoc queries.
XtremeData implements an innovative and highly efficient system for dynamically redistributing data as needed, using industry-standard network technology. Data exchanges between nodes occur as transient peer-to-peer transfers at runtime, driven by the needs of each query. The data exchanges are carefully pipelined with processing stages, such that the transfer times on the network are effectively hidden and do not significantly affect query execution time. XtremeData ensures that all joins perform at near the speeds of co-located joins. Dynamic data redistribution eliminates the need for query-aware data models and sympathetic placement of data, thus significantly reducing implementation time, labor and costs.
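The redistribution step can be pictured with a small simulation. The sketch below is an assumption-laden illustration, not the product's code: nodes are modeled as Python lists, the helper names are hypothetical, and the network transfer and its pipelining with processing are not modeled at all. It shows only the core idea that rows loaded with arbitrary placement can be reshuffled by join key at query time, after which each node performs an ordinary local join.

```python
from collections import defaultdict

def redistribute(partitions, key_index, num_nodes):
    """Reshuffle rows so that equal join keys land on the same node."""
    out = [[] for _ in range(num_nodes)]
    for rows in partitions:                 # rows currently held by one node
        for row in rows:
            out[hash(row[key_index]) % num_nodes].append(row)
    return out

def local_hash_join(left_rows, right_rows, left_key, right_key):
    """Classic build/probe hash join on the rows of a single node."""
    table = defaultdict(list)
    for row in left_rows:                   # build phase
        table[row[left_key]].append(row)
    return [l + r                           # probe phase
            for r in right_rows
            for l in table.get(r[right_key], [])]

def distributed_join(left_parts, right_parts, left_key, right_key):
    """Redistribute both inputs by join key, then join node by node."""
    n = len(left_parts)
    left = redistribute(left_parts, left_key, n)
    right = redistribute(right_parts, right_key, n)
    joined = []
    for node in range(n):
        joined.extend(local_hash_join(left[node], right[node], left_key, right_key))
    return joined

# customers (id, name) and orders (order_id, customer_id) on two nodes,
# loaded without regard to how they will later be joined.
customers = [[(1, "Ann"), (2, "Bob")], [(3, "Cy")]]
orders = [[(10, 3), (11, 1)], [(12, 1), (13, 2)]]
rows = sorted(distributed_join(customers, orders, 0, 1))
```

Because the placement of `customers` and `orders` never has to match the join key in advance, the same tables can serve a different join tomorrow without being reloaded.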
XtremeData allows users to simply "load and go" with any data model and any placement. No longer does a team of DBAs need to fully understand the usage patterns and try to match the placement with queries to obtain performance. XtremeData provides high performance out of the box, at all scales.
Automatic Load Balancing
In today's data centers, it is not unusual to find large clusters of Linux computers numbering in the hundreds or even thousands. Scaling the hardware is relatively simple, but scaling software across this hardware infrastructure can be extremely challenging. One of the critical factors that limit the scalability of parallel processing systems is load balancing: the ability to ensure all nodes perform an equal amount of work. If the load is unbalanced, a few nodes end up performing most of the work while the rest remain idle. This is especially true for large parallel databases, where the load distribution for a particular SQL query depends on the profile of the data. The profiles of intermediate data at stages within a query execution plan are largely unknown to the database software. Therefore the database engine makes some rough guesses and tries to implement load balancing across nodes in a primitive manner. More often than not, these attempts fail, resulting in highly unbalanced load distribution and poor performance.
XtremeData implements automatic load balancing by collecting detailed statistics in real time on all data being processed, and then using these statistics to dynamically distribute the workload. This is a unique strength of our technology and enables dbX to scale without side effects, providing predictable and consistent performance at all scales.
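As a rough illustration of statistics-driven balancing, the sketch below counts how many rows carry each key and then assigns keys to nodes greedily, heaviest key first, always choosing the least-loaded node. This greedy heuristic is a stand-in chosen for the example; the source does not say which algorithm dbX uses, and all function names here are hypothetical.

```python
import heapq
from collections import Counter

def collect_stats(keys):
    """Count rows per key; stands in for statistics gathered at runtime."""
    return Counter(keys)

def balanced_assignment(stats, num_nodes):
    """Greedy assignment: heaviest keys first, each to the least-loaded node."""
    heap = [(0, node) for node in range(num_nodes)]   # (current load, node id)
    heapq.heapify(heap)
    assignment = {}
    # Sort by descending count; break ties by key so the result is deterministic.
    for key, count in sorted(stats.items(), key=lambda kv: (-kv[1], kv[0])):
        load, node = heapq.heappop(heap)
        assignment[key] = node
        heapq.heappush(heap, (load + count, node))
    return assignment

def node_loads(stats, assignment, num_nodes):
    """Total number of rows each node would process under an assignment."""
    loads = [0] * num_nodes
    for key, count in stats.items():
        loads[assignment[key]] += count
    return loads

# Skewed data: key 7 accounts for half of all rows.
keys = [7] * 60 + [1] * 20 + [2] * 20 + [3] * 10 + [4] * 10
stats = collect_stats(keys)

# Placement that ignores the data profile (key modulo node count).
naive = node_loads(stats, {k: k % 2 for k in stats}, 2)
# Placement driven by the collected statistics.
balanced = node_loads(stats, balanced_assignment(stats, 2), 2)
```

With this sample data the profile-blind placement puts 90 of the 120 rows on one node, while the statistics-driven placement gives each node 60.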
Multi-Tenancy
Most database solutions in the market today lack support for multi-tenancy, which is the ability to isolate the data and resources of different groups of users. Such isolation is extremely important, as it is often needed to meet today’s confidentiality requirements. XtremeData was designed to natively support multi-tenancy, enabling complete isolation of data and resources. This gives administrators the flexibility to carve out dedicated database systems to meet differing service level agreements (SLAs), and yet manage all systems as a single entity via a single administration tool.
XtremeData supports the concept of "node sets" within the cluster of nodes. This means that a 100-node system can act like a single 100-node system, two different 50-node systems, fifty different 2-node systems, or any combination you desire. The administrator can easily carve up the system to isolate certain users from others. In cloud environments, you can now separate higher-paying users (on more nodes) from lower-paying clients (on fewer nodes). Both types of users get the same feature sets, but different performance and service levels can be applied based on usage fees.
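The node-set idea amounts to partitioning the cluster's node IDs into disjoint groups. The following sketch shows that bookkeeping under stated assumptions: `carve_node_sets` and the tenant names are hypothetical and do not reflect the actual administration tool or its interface.

```python
def carve_node_sets(total_nodes, sizes):
    """Split node IDs 0..total_nodes-1 into disjoint, named node sets.

    `sizes` maps a node-set name to the number of nodes it should receive.
    """
    if sum(sizes.values()) > total_nodes:
        raise ValueError("requested node sets exceed the cluster size")
    node_sets, start = {}, 0
    for name, size in sizes.items():
        if size <= 0:
            raise ValueError(f"node set {name!r} needs at least one node")
        node_sets[name] = range(start, start + size)   # contiguous block of IDs
        start += size
    return node_sets

# A 100-node cluster carved into three isolated systems of different sizes.
node_sets = carve_node_sets(100, {"premium": 50, "standard": 30, "trial": 20})
```

Because the ranges never overlap, a query running in one node set cannot consume nodes that belong to another, which is the isolation property described above.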