
Parallel Radix Sort - SIGMOD 2019 Programming Contest
This post documents my participation in the programming contest at the ACM SIGMOD conference 2019. The contest was first hosted at SIGMOD 2009 and has been held annually since then. Our team made the finalist list this year, winning the chance to present the design and implementation of our sorter, Leyenda, in Amsterdam. What follows is a brief breakdown and summary of our approach.
Setup
The topic of this year’s programming contest was sorting. Contestants were asked to sort 10/20/60 GB files on the testbed machine. Initially the dataset sizes were roughly 10/20/40/100 GB, but sorting a 100 GB file on the testbed took so long that no team could make last-minute changes, so the organizers grew the third dataset from 40 GB to 60 GB of skewed data and dropped the fourth one.
For the testbed, there were roughly 30 GB of memory and 20 cores (40 hyperthreads). Sounds normal, right? What we weren’t told is that the devil was in the details: all memory actually resided on NUMA node 0, which meant threads on other nodes paid longer access latency to memory. This became our Achilles’ heel once every millisecond counted.
Radix Sort
Radix sort is a round-based, non-comparative sorting algorithm. It iterates through sorting keys digit by digit, separating records into buckets so that records with the same digit value at the same position end up grouped together. For example, in the first round the records “abc”, “bcd”, “cde”, “cfe”, and “dfg” are separated into four buckets according to their first characters: “a”, “b”, “c”, and “d”. Subsequent rounds divide the buckets further until all records are in sorted order.
Each round of radix sort runs in O(n) time, and there are at most w rounds, where w is the number of digits in the sorting key, so the overall time complexity is O(w × n). This lets radix sort outperform comparative algorithms like QuickSort or MergeSort on workloads with fixed-size keys per record. Variants such as least-significant-digit (LSD) and most-significant-digit (MSD) radix sort suit different data distributions.
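To make the bucketing concrete, here is a minimal MSD radix sort sketch in C++ over fixed-length string keys. It is an illustration rather than our contest implementation; the function name and the out-of-place scratch buffer are choices made for clarity.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Minimal MSD radix sort over fixed-length string keys: histogram the
// byte at `depth`, scatter stably into per-bucket offsets, then recurse
// into each bucket on the next byte.
void msd_radix_sort(std::vector<std::string>& keys, size_t lo, size_t hi,
                    size_t depth, size_t key_len) {
    if (hi - lo <= 1 || depth >= key_len) return;

    std::array<size_t, 256> count{};                  // digit histogram
    for (size_t i = lo; i < hi; ++i)
        ++count[static_cast<uint8_t>(keys[i][depth])];

    std::array<size_t, 256> offset{};                 // bucket start positions
    size_t sum = lo;
    for (size_t b = 0; b < 256; ++b) { offset[b] = sum; sum += count[b]; }

    std::vector<std::string> scratch(hi - lo);        // stable scatter pass
    std::array<size_t, 256> next = offset;
    for (size_t i = lo; i < hi; ++i) {
        uint8_t b = static_cast<uint8_t>(keys[i][depth]);
        scratch[next[b]++ - lo] = std::move(keys[i]);
    }
    std::move(scratch.begin(), scratch.end(), keys.begin() + lo);

    for (size_t b = 0; b < 256; ++b)                  // recurse per bucket
        if (count[b] > 1)
            msd_radix_sort(keys, offset[b], offset[b] + count[b],
                           depth + 1, key_len);
}
```

For example, records with 10-byte keys would be sorted with `msd_radix_sort(keys, 0, keys.size(), 0, 10)`, touching each record at most 10 times.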
Key Points
TL;DR
- NUMA awareness
- Page cache optimization
- Custom radix sort implementation
For small and medium datasets, an important observation is that you don’t need to write a single byte to disk. There’s enough memory to write directly to the page cache, which is much faster. For large datasets, it’s more complex, but a vanilla external sort with merging of intermediate results from disk could achieve decent results, assuming you overlap reading input with sorting intermediate results and keep the last intermediate result in memory without writing to disk.
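To make the page-cache trick concrete, here is a minimal Linux sketch: pre-size the output file and mmap it MAP_SHARED so sorted records are stored directly into file-backed pages. The helper name `map_output` and the collapsed error handling are my own simplifications, not our contest code.

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Sketch: map a pre-sized output file and write sorted records straight
// into the mapping. With enough free memory the dirty pages sit in the
// page cache and the kernel flushes them asynchronously, so the timed
// run never blocks on disk writes.
char* map_output(const char* path, size_t bytes) {
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return nullptr;
    if (ftruncate(fd, bytes) != 0) { close(fd); return nullptr; }
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);  // the mapping keeps the file alive
    return p == MAP_FAILED ? nullptr : static_cast<char*>(p);
}
```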
The same principle applies to small and medium datasets: every byte saved in memory reduces I/O cost, so aggressively unmapping and freeing allocations as soon as they are no longer needed, at small (page-level) granularity, pays off. It’s also worth writing intermediate results to disk in reverse order, from end to start, so that the last-written result is still in the page cache when you begin merging from the beginning.
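A sketch of the eager-freeing idea, under the assumption that the sort buffer lives in one large anonymous mmap region on Linux, where page-aligned sub-ranges can be unmapped independently (the helper name `release_consumed` is hypothetical):

```cpp
#include <cstddef>
#include <sys/mman.h>
#include <unistd.h>

// Sketch: return consumed chunks of the sort buffer to the OS right
// away, one page-aligned slice at a time, so freed pages can be reused
// by the page cache. Partial pages at either end stay mapped.
void release_consumed(char* buf, size_t consumed_begin, size_t consumed_end) {
    long page = sysconf(_SC_PAGESIZE);
    size_t begin = (consumed_begin + page - 1) & ~size_t(page - 1);
    size_t end = consumed_end & ~size_t(page - 1);
    if (end > begin) munmap(buf + begin, end - begin);
}
```

For already-consumed file-backed input, `posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED)` asks the kernel to drop those pages from the cache in the same spirit.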
For more general optimizations, huge pages for large buffers might help, and NUMA awareness is crucial. Pinning all threads to NUMA node 0, where all the memory resides, can provide a significant improvement on small/medium datasets when the extra cores on the other nodes aren’t needed; threads running there pay remote-access latency and mostly hurt effective memory bandwidth. We also ended up writing our own radix sort implementation, which gave us a decent advantage on the small and medium workloads.
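Below is a minimal sketch of both knobs on Linux. The helper names are mine, and the CPU ids belonging to node 0 would come from somewhere like /sys/devices/system/node/node0/cpulist on the actual machine.

```cpp
#include <cstddef>
#include <pthread.h>
#include <sched.h>
#include <sys/mman.h>

// Pin the calling thread to a single CPU (one belonging to NUMA node 0).
// Compile with g++ -pthread; glibc's _GNU_SOURCE is on by default there.
bool pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

// Hint transparent huge pages for a large, long-lived buffer to cut
// TLB misses during scatter-heavy radix passes.
void hint_huge_pages(void* buf, size_t len) {
    madvise(buf, len, MADV_HUGEPAGE);
}
```

Without touching code, `numactl --cpunodebind=0 --membind=0` achieves a similar confinement for the whole process.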
In a Nutshell
Some points are quoted from our discussion forum, as all the finalist teams’ implementations share some similarities. A big thanks to Zhaoxing, without whom our two-person team could never have made the finalist list. Even though one of the professors, who wasn’t much help, thought our work wasn’t “research” enough for a senior-level CS graduate class project, we nailed it anyway.
It’s interesting to see that many renowned research projects are spin-offs of programming contests, such as Apache Spark, AlexNet, and the Heterogeneous Parallel Virtual Machine. It’s been a fun experience. Also, if you find this post helpful, feel free to cite our report:
@misc{shi2019leyendaadaptivehybridsorting,
    title={Leyenda: An Adaptive, Hybrid Sorting Algorithm for Large Scale Data with Limited Memory},
    author={Yuanjing Shi and Zhaoxing Li},
    year={2019},
    eprint={1909.08006},
    archivePrefix={arXiv},
    primaryClass={cs.DB},
    url={https://arxiv.org/abs/1909.08006},
}