Billion-scale similarity search with GPUs

 · 6 mins read

Abstract: Similarity search finds application in specialized database systems handling complex data such as images or videos, which are typically represented by high-dimensional features and require specific indexing structures. This paper tackles the problem of better utilizing GPUs for this task. While GPUs excel at data-parallel tasks, prior approaches are bottlenecked by algorithms that expose less parallelism, such as k-min selection, or make poor use of the memory hierarchy. We propose a design for k-selection that operates at up to 55% of theoretical peak performance, enabling a nearest neighbor implementation that is 8.5x faster than prior GPU state of the art. We apply it in different similarity search scenarios, by proposing optimized design for brute-force, approximate and compressed-domain search based on product quantization. In all these setups, we outperform the state of the art by large margins. Our implementation enables the construction of a high accuracy k-NN graph on 95 million images from the Yfcc100M dataset in 35 minutes, and of a graph connecting 1 billion vectors in less than 12 hours on 4 Maxwell Titan X GPUs. We have open-sourced our approach for the sake of comparison and reproducibility.

Original Paper

Blog

Tutorial(Medium)

Authors: Jeff Johnson, Matthijs Douze, Hervé Jégou 2017

Paper Summary

Introduction

  • Past resarch has focused on deriving complex vectors to represent text, images etc. using GPUs
  • However, how to manipulate that data effectively for indexing and searching using GPUs is unclear
  • Numerical similarity vs. structured relations: numerical simlarity is more suitable (memory, dimensionality etc)
  • Vector compression: binary codes vs. quantization methods - this paper uses product quantization as it was shown to be more effective

Contributions

  • a GPU k-selection algorithm, operating in fast register memory and flexible enough to be fusable with other kernels, for which we provide a complexity analysis;
  • a near-optimal algorithmic layout for exact and approximate k-nearest neighbor search on GPU;
  • a range of experiments that show that these improvements outperform previous art by a large margin on mid- to large-scale nearest-neighbor search tasks, in single or multi-GPU configurations.

Problem Statement

$L = k-argmin_{i=0:l}||x-y_{i}||_{2}$

  • $x$: query vector
  • $y$: collection
  • We search the k nearest neighbors of $x$ in terms of L2 distance.
  • The lowest distances are collected by k-selection.
  • Batching: searches are performed in batches of $n_q$ query vectors
  • Exact Search: The exact solution computes the full pair- wise distance matrix D
  • Compressed-domain Search: uses the IVFADC indexing structure which approxiamtes databse vector $y$ using two quantizers (functions that output an element from a finite set.). Asymmetric Distance Computation (ADC) search method returns an approximate result.

GPU Overview and K-Selection

Architecture

  • The Nvidia GPU is a general-purpose computer that executes instruction streams using a 32-wide vector of CUDA threads (the warp); individual threads in the warp are referred to as lanes, with a lane ID from 0 – 31.
  • A user-configurable collection of 1 to 32 warps comprises a block or a co-operative thread array (CTA). Each block has a high speed shared memory, up to 48 KiB in size. Individual CUDA threads have a block-relative ID, called a thread id, which can be used to partition and assign work. Each block is run on a single core of the GPU called a streaming multiprocessor (SM).
  • Blocks are organized in a grid of blocks in a kernel.
  • The number of blocks executing concurrently depends upon shared memory and register resources used by each block. Per-CUDA thread register usage is determined at compilation time, while shared memory usage can be chosen at runtime.
  • Different blocks and kernels communicate through global memory, typically 4 – 32 GB in size, with 5 – 10× higher bandwidth than CPU main memory. Shared memory is analogous to CPU L1 cache in terms of speed.

GPU register file usage

  • As the GPU register file is very large, storing structured data (not just temporary operands) is useful. A single lane can use its (scalar) registers to solve a local task, but with limited parallelism and storage.
  • A common pattern to achieve this is a lane-stride register array.

k-selection on CPU versus GPU

  • In k-selection, selection via max-heap is a typical choice on the CPU, but heaps do not expose much data parallelism (due to serial tree update) and cannot saturate SIMD execution units.
  • Heaps can be similarly implemented on a GPU. However, a straightforward GPU heap implementation suffers from high warp divergence and irregular, data-dependent memory movement, since the path taken for each inserted element depends upon other values in the heap.
  • Other more novel GPU algorithms are available for small k, namely the selection algorithm in the fgknn library.
  • We take inspiration from this particular algorithm through the use of parallel merges as seen in their merge queue structure.

Fast K-Selection on the GPU

  • For input from global memory, k-selection cannot run faster than the time required to scan the input once at peak memory bandwidth. We aim to get as close to this limit as possible.
  • Solution: WARPSELECT

Computation Layout

  • Exact Search: used for IVFADC coarse quantizer $q_1$
  • IVFADC indexing: At its core, the IVFADC requires computing the distance from a vector to a set of product quantization reproduction values.
  • GPU Implementation: A kernel is responsible for scanning the $τ$ closest inverted lists for each query, and calculating the per-vector pair distances using the lookup tables $T_i$. The $T_i$ are stored in shared memory. Each $nq × τ$ pairs of query against inverted list can be processed independently.
  • Multi-GPU parallelism: If an index instance fits in the memory of a single GPU, it can be replicated across $R$ different GPUs. If an index instance does not fit in the memory of a single GPU, an index can be sharded across $S$ different GPUs. Replication and sharding can be used together ($S$ shards, each with $R$ replicas for $S × R$ GPUs in total).

Experimentation & Applications

  • k-selection performance: all state is maintained in registers (no shared memory), no inter-warp synchronization or buffering is used, no “hierarchical partition”, the k- selection can be fused into other kernels, and it uses odd-size networks for efficient merging and sorting.
  • k-means clustering: our implementation is more than 2× faster than that of BIDMach when applied to MNIST8m images
  • Exact nearest neighbor search: Sift1M dataset used for evaluation; shows more efficient computation using our methods

My Thoughts

  • Sort of skimmed this paper becuase I was interested in seeing how faiss actually worked (instead of just using the package) but I found the paper to be a little bit too detailed and not enough high-level for me. Still, it was interesting to read about the GPU structure and how their algorithms take that into very careful consideration. I think this may be due to the fact that the paper is realtively old… Or are all algorithm papers written like this? Very different from usual ML model papers.
  • To be honest, I think the content I was looking for is actually in their blog, which I looked up halfway through the paper because I wasn’t really getting the information that I needed. I also linked the Medium article that I referenced in my application. The package itself is very straightforward and easy to use.