HotSort

HotSort is a high-performance GPU-accelerated integer sorting library for Vulkan, CUDA and OpenCL compute APIs.

HotSort's advantages include:

  • Ultra-fast sorting of 32‑bit or 64‑bit keys
  • Reaches peak throughput on small arrays
  • In-place sorting for low-memory environments
  • Strong scaling with number of multiprocessors
  • Low memory transactions relative to array size
  • A concurrency-friendly dense kernel grid
  • Support for GPU post-processing of sorted results

HotSort is typically significantly faster than other GPU-accelerated implementations when sorting arrays of fewer than roughly 500K-2M keys (the crossover point depends on the device).

Benchmarks

Throughput

Here are results for HotSort on Vulkan and CUDA sorting 32-bit and 64-bit keys on a 640-core Quadro M1200.

Here are results for HotSort on Vulkan and OpenCL on an Intel HD 630.

Execution time

Note that these sorting rates translate to sub-millisecond to multi-millisecond execution times on small GPUs.

Usage

There are HotSort implementations for Vulkan, CUDA and OpenCL.

Note that HotSort is a comparison sort and supports in-place sorting.

There are also benchmarking examples for the Vulkan, CUDA and OpenCL implementations.

Not all targeted architectures have been tested.

Vulkan

The following architectures are supported:

| Vendor | Architecture | 32‑bit | 64‑bit | 32+32‑bit | Notes |
| --- | --- | :---: | :---: | :---: | --- |
| NVIDIA | sm_35, sm_37, sm_50, sm_52, sm_60, sm_61, sm_70 | :white_check_mark: | :white_check_mark: | :x: | Performance matches CUDA |
| NVIDIA | sm_30, sm_32, sm_53, sm_62 | :x: | :x: | :x: | Need to generate properly shaped kernels |
| AMD | GCN | :x: | :x: | :x: | TODO |
| Intel | GEN8+ | :white_check_mark: | :white_check_mark: | :x: | Good, but the assumed best kernels are not being generated at this time due to a compiler issue |
| Intel | APL/GLK using a 2x9 or 1x12 thread pool | :x: | :x: | :x: | Need to generate properly shaped kernels |

Add an arch-specific HotSort algorithm (aka "target") to your project by including a .c source and .h header file:

| Key Size | Source | Header |
| --- | --- | --- |
| 32‑bit | hs/vk/<vendor>/<arch>/u32/hs_<vendor>_<arch>_u32.c | hs/vk/<vendor>/<arch>/u32/hs_target.h |
| 64‑bit | hs/vk/<vendor>/<arch>/u64/hs_<vendor>_<arch>_u64.c | hs/vk/<vendor>/<arch>/u64/hs_target.h |

To sort count keys on Vulkan:

#include "hs/vk/intel/gen8/u32/hs_target.h"

...

// create the Vulkan HotSort target
struct hs_vk * hs = hs_vk_create(<address of target>,...);

// allocate a descriptor set from a pool
VkDescriptorSet hs_ds = hs_vk_ds_alloc(hs,descriptor_pool);

...

// command buffer begin

...

// bind buffer(s) to a command buffer
hs_vk_ds_bind(hs,hs_ds,cb,vin,vout); // or (...,vin,VK_NULL_HANDLE) for in-place sorting

// see how much padding may be required
hs_vk_pad(hs,count,&count_padded_in,&count_padded_out);

// append compute shaders to command buffer
hs_vk_sort(hs,cb,...,vin,...,vout,...); // hs_vk_sort() and hs_vk_ds_bind() must have matching vin/vout args

...

// command buffer end and queue submit

...

// release the Vulkan HotSort target
hs_vk_release(hs,...);

The hs_vk.h header file describes these functions in greater detail.
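
The padded counts reported by hs_vk_pad() determine how much room the key buffers need. The sizing sketch below is illustrative only: the assumption that each buffer simply holds its padded count of keys, and the storage-buffer usage flag, are not taken from hs_vk.h.

// illustrative sketch: size the vin/vout buffers from the padded counts
uint32_t count_padded_in, count_padded_out;

hs_vk_pad(hs,count,&count_padded_in,&count_padded_out);

// assumption: each buffer holds its padded count of 32-bit keys
VkDeviceSize const vin_size  = (VkDeviceSize)count_padded_in  * sizeof(uint32_t);
VkDeviceSize const vout_size = (VkDeviceSize)count_padded_out * sizeof(uint32_t);

// assumption: create the vin/vout VkBuffer objects with at least these sizes
// and VK_BUFFER_USAGE_STORAGE_BUFFER_BIT usage before calling hs_vk_ds_bind()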

CUDA

The following architectures are supported:

| Vendor | Architecture | 32‑bit | 64‑bit | 32+32‑bit | Notes |
| --- | --- | :---: | :---: | :---: | --- |
| NVIDIA | sm_35, sm_37, sm_50, sm_52, sm_60, sm_61, sm_70 | :white_check_mark: | :white_check_mark: | :x: | |
| NVIDIA | sm_30, sm_32, sm_53, sm_62 | :x: | :x: | :x: | Need to generate properly shaped kernels |

Add an arch-specific HotSort target to your project by including a .cu CUDA source and .h header file:

| Key Size | Source | Header |
| --- | --- | --- |
| 32‑bit | hs/cuda/sm_35/u32/hs_cuda_u32.cu | hs/cuda/sm_35/u32/hs_cuda.h |
| 64‑bit | hs/cuda/sm_35/u64/hs_cuda_u64.cu | hs/cuda/sm_35/u64/hs_cuda.h |

Usage on CUDA is very simple.

For example, to sort count keys:

#include "hs/cuda/sm_35/u32/hs_cuda.h"

...

uint32_t count_padded_in, count_padded_out;

hs_cuda_pad_u32(count,&count_padded_in,&count_padded_out);

...

hs_cuda_sort_u32(vin,vout, // or (vin,NULL,...) for in-place sorting
                 count,count_padded_in,count_padded_out,
                 true,
                 stream0,stream1,stream2);

HotSort on CUDA requires two auxiliary streams in order to maximize concurrency.

The algorithm is guaranteed to complete on stream0.
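
Putting the calls above together, a minimal host-side sketch might look as follows. The helper name sort_keys_u32, the single-buffer sizing, and the pad-fill step are assumptions made for this sketch rather than requirements documented by the library; error checking is omitted.

#include <cuda_runtime.h>
#include <stdint.h>

#include "hs/cuda/sm_35/u32/hs_cuda.h"

// hypothetical helper: sort `count` 32-bit keys in place on the device
static void sort_keys_u32(uint32_t * const host_keys, uint32_t const count)
{
  uint32_t count_padded_in, count_padded_out;

  hs_cuda_pad_u32(count,&count_padded_in,&count_padded_out);

  // assumption: one buffer sized for the larger padded count suffices
  // for an in-place sort
  uint32_t const count_padded_max = count_padded_out > count_padded_in
                                  ? count_padded_out : count_padded_in;

  uint32_t * d_keys;

  cudaMalloc((void **)&d_keys,count_padded_max * sizeof(uint32_t));
  cudaMemcpy(d_keys,host_keys,count * sizeof(uint32_t),cudaMemcpyHostToDevice);

  // assumption: fill the pad region with maximal keys so it sorts to the end
  cudaMemset(d_keys + count,0xFF,(count_padded_max - count) * sizeof(uint32_t));

  // the two auxiliary streams let independent merge kernels overlap
  cudaStream_t stream0, stream1, stream2;

  cudaStreamCreate(&stream0);
  cudaStreamCreate(&stream1);
  cudaStreamCreate(&stream2);

  // in-place sort: vout is NULL; the boolean flag follows the example above
  hs_cuda_sort_u32(d_keys,NULL,
                   count,count_padded_in,count_padded_out,
                   true,
                   stream0,stream1,stream2);

  // the sort is guaranteed to be complete once stream0 drains
  cudaStreamSynchronize(stream0);

  cudaMemcpy(host_keys,d_keys,count * sizeof(uint32_t),cudaMemcpyDeviceToHost);

  cudaStreamDestroy(stream2);
  cudaStreamDestroy(stream1);
  cudaStreamDestroy(stream0);
  cudaFree(d_keys);
}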

OpenCL

The following architectures are supported:

| Vendor | Architecture | 32‑bit | 64‑bit | 32+32‑bit | Notes |
| --- | --- | :---: | :---: | :---: | --- |
| Intel | GEN8+ | :white_check_mark: | :white_check_mark: | :x: | Due to a fragile compiler, the assumed best kernels are not being generated at this time |
| Intel | APL/GLK using a 2x9 or 1x12 thread pool | :x: | :x: | :x: | Need to generate properly shaped kernels |

Add an arch-specific HotSort target to your project by including a .c source and .h header file:

| Key Size | Source | Header |
| --- | --- | --- |
| 32‑bit | hs/cl/<vendor>/<arch>/u32/hs_<vendor>_<arch>_u32.c | hs/cl/<vendor>/<arch>/u32/hs_target.h |
| 64‑bit | hs/cl/<vendor>/<arch>/u64/hs_<vendor>_<arch>_u64.c | hs/cl/<vendor>/<arch>/u64/hs_target.h |

Note that if the symbol HS_DUMP_SOURCE is not defined then the pre-compiled GEN9 binaries will be included. These binaries may not be compatible with all GEN8+ devices and drivers.

To sort count keys on OpenCL:

// create the OpenCL HotSort target
struct hs_cl * hs = hs_cl_create(<address of target>,...);

...

// see how much padding may be required
hs_cl_pad(hs,count,&count_padded_in,&count_padded_out);

// enqueue compute kernels
hs_cl_sort(hs,cq,...,vin,vout,...); // or (...,vin,NULL,...) for in-place sorting

...

// release the OpenCL HotSort target
hs_cl_release(hs,...);

The hs_cl.h header file describes these functions in greater detail.

Background

The HotSort sorting algorithm was created in 2012 and generalized in 2015 to support GPUs that benefit from non-power-of-two workgroups.

The goal of HotSort was to achieve high throughput as early as possible on small GPUs when sorting modestly sized arrays of thousands to hundreds of thousands of 64‑bit keys.

HotSort uses both well-known and obscure properties of bitonic sequences to create a novel mapping of keys onto data-parallel devices like GPUs.

Overview

The overall flow of the HotSort algorithm is below. Kernel launches are in italics.

  1. For each workgroup of slabs:
    1. For each slab in the workgroup:
      1. *Slab Load*
      2. *Slab Sort*
    2. Until all slabs in the workgroup are merged:
      1. *Multi-Slab Flip Merge*
    3. *Slab Store*
  2. Until all slabs are merged:
    1. *Streaming Flip Merge*
    2. If necessary, *Streaming Half Merge*
    3. If necessary, *Multi-Slab Half Merge*
    4. If necessary, *Slab Half Merge*
    5. If complete:
      1. Optionally, *Report Key Changes*
      2. Optionally, *Slab Transpose & Store*
    6. Otherwise: *Slab Store*
  3. Done

Sorting

The algorithm begins with a very dense per-multiprocessor block sorting algorithm that loads a "slab" of keys into each subgroup's registers, sorts each slab, merges all the slabs in the workgroup, and stores them back to global memory.

In the slab sorting phase, each lane of a subgroup executes a bitonic sorting network on its registers and successively merges lanes until the slab of registers is sorted in serpentine order.
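
For readers unfamiliar with bitonic networks, the sketch below shows the compare-exchange schedule of a bitonic sorting network in plain scalar C. It is illustrative only: the generated kernels apply the same schedule to per-lane registers and subgroup shuffles rather than to an indexed array.

#include <stdint.h>

// swap *a and *b if they are out of ascending order
static void cx(uint32_t * const a, uint32_t * const b)
{
  if (*a > *b)
  {
    uint32_t const t = *a; *a = *b; *b = t;
  }
}

// sort n keys (n must be a power of two) with a bitonic sorting network
static void bitonic_sort(uint32_t * const keys, uint32_t const n)
{
  for (uint32_t k = 2; k <= n; k <<= 1)        // size of the bitonic runs
    for (uint32_t j = k >> 1; j > 0; j >>= 1)  // compare-exchange distance
      for (uint32_t i = 0; i < n; i++)
      {
        uint32_t const l = i ^ j;

        if (l > i)
        {
          if ((i & k) == 0)
            cx(keys + i,keys + l);             // ascending run
          else
            cx(keys + l,keys + i);             // descending run
        }
      }
}

Because the compare-exchange schedule is data-independent, it can be fully unrolled into registers, which is what makes it a good fit for GPU subgroups.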

Merging

HotSort has several different merging strategies.

The merging kernels leverage the multiprocessor's register file by loading, merging and storing a large number of strided slab rows without using local memory.

The merging kernels exploit the bitonic sequence property that interleaved subsequences of a bitonic sequence are also bitonic sequences. This property also holds for non-power-of-two sequences.
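
The sketch below is a small, self-contained check of that property (illustrative only, not library code): a bitonic sequence is deinterleaved into its even- and odd-indexed subsequences, and each half, despite having a non-power-of-two length, is itself bitonic.

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

// a sequence is bitonic if some rotation of it first ascends then descends;
// equivalently, walking it cyclically, the comparison direction changes at
// most twice
static bool is_bitonic(uint32_t const * const v, uint32_t const n)
{
  uint32_t changes = 0;
  int      prev    = 0;

  for (uint32_t i = 0; i < n; i++)
  {
    uint32_t const a   = v[i];
    uint32_t const b   = v[(i + 1) % n];
    int      const dir = (a < b) ? +1 : (a > b) ? -1 : 0;

    if (dir != 0)
    {
      if (prev != 0 && dir != prev)
        changes += 1;

      prev = dir;
    }
  }

  return changes <= 2;
}

int main(void)
{
  // a bitonic sequence: it ascends then descends
  uint32_t const seq[10] = { 1, 4, 7, 9, 12, 11, 8, 6, 3, 0 };

  // deinterleave into even- and odd-indexed subsequences
  uint32_t even[5], odd[5];

  for (uint32_t i = 0; i < 5; i++)
  {
    even[i] = seq[2 * i];      // { 1, 7, 12, 8, 3 }
    odd [i] = seq[2 * i + 1];  // { 4, 9, 11, 6, 0 }
  }

  assert(is_bitonic(seq, 10));
  assert(is_bitonic(even, 5)); // an interleaved subsequence is bitonic
  assert(is_bitonic(odd,  5)); // ... even with a non-power-of-two length

  printf("interleaved subsequences are bitonic\n");

  return 0;
}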

An example of this strategy is the Streaming Flip Merge kernel.

Future Enhancements

Hybrid improved merging

HotSort's initial sorting and merging phases are performed on bitonic sequences. Because of this, throughput decreases as the problem size increases: a bitonic merge of n keys performs O(n log n) work, whereas a conventional merge needs only O(n).

A hybrid algorithm that combined HotSort's block sorter and several rounds of merging with a state-of-the-art GPU merging algorithm would probably improve the algorithm's performance on larger arrays.

Reenable support for devices lacking shuffle functions

The original version of HotSort ran on pre-Kepler GPUs without intra-warp/inter-lane shuffling ― reenable this capability.

Metal support

Modify the HotSort generator to support Metal targets.