The architectural changes introduced with multi-core processors have triggered a redesign of main-memory join algorithms for relational database systems. Join processing in database systems is a complex and a demanding operation. Traditionally, there have been sorting and hashing based approaches for implementing joins. However, in the last few years several diverging views have appeared regarding how to design and implement main-memory joins on multi-core processors.
In this project we have investigated an extensive set of join approaches, both in terms of algorithms and implementations on a broad range of recent multi-core platforms. Our investigations have provided conclusive answers to the existing controversies in the literature.
On the one hand, there are two camps on how to implement main-memory hash joins on multi-core. Hardware-oblivious hash join variants do not depend on hardware-specific parameters. Rather, they consider qualitative characteristics of modern hardware and are expected to achieve good performance on any technologically similar platform. The assumption that these algorithms make is that hardware has now become good enough at hiding its own limitations—through automatic hardware prefetching, out-of-order execution or simultaneous multi- threading (SMT)—to make hardware-oblivious algorithms competitive without the overhead of carefully tuning to the underlying hardware. Hardware-conscious implementations, such as (parallel) radix join, aim to maximally exploit a given architecture by tuning the algorithm parameters (e.g., hash table sizes) to the particular features of that architecture. The assumption here is that explicit parameter tuning yields enough performance advantages to warrant the effort required.
In this project, we compare the two approaches for hash joins under a wide range of workloads (relative table sizes, tuple sizes, effects of sorted data, etc.) and configuration parameters (VM page sizes, number of threads, number of cores, SMT, SIMD, prefetching, etc.). The results show that hardware-conscious algorithms generally outperform hardware-oblivious ones. However, on specific workloads and special architectures with aggressive simultaneous multi-threading, hardware-oblivious algorithms become competitive. As an answer to the controversy on how to implement hash joins on existing multi-core architectures, the project shows that it is still important to carefully tailor algorithms to the underlying hardware to get the necessary performance.
With the advent of modern multi-core architectures, it has been argued that sort-merge join is now a better choice than radix-hash join. This claim is justified based on the width of SIMD instructions (sort-merge outperforms radix-hash join once SIMD is sufficiently wide), and NUMA awareness (sort-merge is superior to hash join in NUMA architectures). Through extensive experiments on the original and optimized versions of these algorithms, we show that, contrary to these claims, radix-hash join is still clearly superior, and sort-merge approaches to performance of radix only when very large amounts of data are involved. Thus, in this project we resolve another controversy in the literature regarding the relative performance of sort and hash based join algorithms. The project provides the fastest implementations of these algorithms, and covers many aspects of modern hardware architectures relevant not only for joins but for any parallel data processing operator.
Cagri Balkesen (now with Oracle Labs; e-mail)