Abstract: Similarity search finds application in specialized database systems handling complex data such as images or videos, which are typically represented by high-dimensional features and require specific indexing structures. This paper tackles the problem of better utilizing GPUs for this task. While GPUs excel at data-parallel tasks, prior approaches are bottlenecked by algorithms that expose less parallelism, such as k-min selection, or make poor use of the memory hierarchy. We propose a design for k-selection that operates at up to 55% of theoretical peak performance, enabling a nearest neighbor implementation that is 8.5x faster than prior GPU state of the art. We apply it in different similarity search scenarios, by proposing optimized design for brute-force, approximate and compressed-domain search based on product quantization. In all these setups, we outperform the state of the art by large margins. Our implementation enables the construction of a high accuracy k-NN graph on 95 million images from the Yfcc100M dataset in 35 minutes, and of a graph connecting 1 billion vectors in less than 12 hours on 4 Maxwell Titan X GPUs. We have open-sourced our approach for the sake of comparison and reproducibility.

Authors: Jeff Johnson, Matthijs Douze, Hervé Jégou 2017

–

Paper Summary

Introduction

Past resarch has focused on deriving complex vectors to represent text, images etc. using GPUs
However, how to manipulate that data effectively for indexing and searching using GPUs is unclear
Numerical similarity vs. structured relations: numerical simlarity is more suitable (memory, dimensionality etc)
Vector compression: binary codes vs. quantization methods - this paper uses product quantization as it was shown to be more effective

Contributions

a GPU k-selection algorithm, operating in fast register memory and flexible enough to be fusable with other kernels, for which we provide a complexity analysis;
a near-optimal algorithmic layout for exact and approximate k-nearest neighbor search on GPU;
a range of experiments that show that these improvements outperform previous art by a large margin on mid- to large-scale nearest-neighbor search tasks, in single or multi-GPU configurations.

Problem Statement

$L = k-argmin_{i=0:l}||x-y_{i}||_{2}$

$x$: query vector
$y$: collection
We search the k nearest neighbors of $x$ in terms of L2 distance.
The lowest distances are collected by k-selection.
Batching: searches are performed in batches of $n_q$ query vectors
Exact Search: The exact solution computes the full pair- wise distance matrix D
Compressed-domain Search: uses the IVFADC indexing structure which approxiamtes databse vector $y$ using two quantizers (functions that output an element from a finite set.). Asymmetric Distance Computation (ADC) search method returns an approximate result.

GPU Overview and K-Selection

Architecture

The Nvidia GPU is a general-purpose computer that executes instruction streams using a 32-wide vector of CUDA threads (the warp); individual threads in the warp are referred to as lanes, with a lane ID from 0 – 31.
A user-configurable collection of 1 to 32 warps comprises a block or a co-operative thread array (CTA). Each block has a high speed shared memory, up to 48 KiB in size. Individual CUDA threads have a block-relative ID, called a thread id, which can be used to partition and assign work. Each block is run on a single core of the GPU called a streaming multiprocessor (SM).
Blocks are organized in a grid of blocks in a kernel.
The number of blocks executing concurrently depends upon shared memory and register resources used by each block. Per-CUDA thread register usage is determined at compilation time, while shared memory usage can be chosen at runtime.
Different blocks and kernels communicate through global memory, typically 4 – 32 GB in size, with 5 – 10× higher bandwidth than CPU main memory. Shared memory is analogous to CPU L1 cache in terms of speed.

GPU register file usage

As the GPU register file is very large, storing structured data (not just temporary operands) is useful. A single lane can use its (scalar) registers to solve a local task, but with limited parallelism and storage.
A common pattern to achieve this is a lane-stride register array.

k-selection on CPU versus GPU

In k-selection, selection via max-heap is a typical choice on the CPU, but heaps do not expose much data parallelism (due to serial tree update) and cannot saturate SIMD execution units.
Heaps can be similarly implemented on a GPU. However, a straightforward GPU heap implementation suffers from high warp divergence and irregular, data-dependent memory movement, since the path taken for each inserted element depends upon other values in the heap.
Other more novel GPU algorithms are available for small k, namely the selection algorithm in the fgknn library.
We take inspiration from this particular algorithm through the use of parallel merges as seen in their merge queue structure.

Fast K-Selection on the GPU

For input from global memory, k-selection cannot run faster than the time required to scan the input once at peak memory bandwidth. We aim to get as close to this limit as possible.
Solution: WARPSELECT

Computation Layout

Exact Search: used for IVFADC coarse quantizer $q_1$
IVFADC indexing: At its core, the IVFADC requires computing the distance from a vector to a set of product quantization reproduction values.
GPU Implementation: A kernel is responsible for scanning the $τ$ closest inverted lists for each query, and calculating the per-vector pair distances using the lookup tables $T_i$. The $T_i$ are stored in shared memory. Each $nq × τ$ pairs of query against inverted list can be processed independently.
Multi-GPU parallelism: If an index instance fits in the memory of a single GPU, it can be replicated across $R$ different GPUs. If an index instance does not fit in the memory of a single GPU, an index can be sharded across $S$ different GPUs. Replication and sharding can be used together ($S$ shards, each with $R$ replicas for $S × R$ GPUs in total).

Experimentation & Applications

k-selection performance: all state is maintained in registers (no shared memory), no inter-warp synchronization or buffering is used, no “hierarchical partition”, the k- selection can be fused into other kernels, and it uses odd-size networks for efficient merging and sorting.
k-means clustering: our implementation is more than 2× faster than that of BIDMach when applied to MNIST8m images
Exact nearest neighbor search: Sift1M dataset used for evaluation; shows more efficient computation using our methods

My Thoughts

Sort of skimmed this paper becuase I was interested in seeing how faiss actually worked (instead of just using the package) but I found the paper to be a little bit too detailed and not enough high-level for me. Still, it was interesting to read about the GPU structure and how their algorithms take that into very careful consideration. I think this may be due to the fact that the paper is realtively old… Or are all algorithm papers written like this? Very different from usual ML model papers.
To be honest, I think the content I was looking for is actually in their blog, which I looked up halfway through the paper because I wasn’t really getting the information that I needed. I also linked the Medium article that I referenced in my application. The package itself is very straightforward and easy to use.

Billion-scale similarity search with GPUs