Searching billions of vectors in a millisecond, and an application

Deep Dive
Recent foundation models are quite effective at generating semantically meaningful embeddings for arbitrary objects. Whether it is a word, document, image, product, or database table, most things can be summarized as a vector.
This compression is possible because most objects of interest, despite the infinite complexity of reality, tend to have some structure and sparsity. As such, an embedding can serve as an approximate address for an object.
Due to these developments, vector embeddings are now as central to data management as tuples in relations. But, of course, the universe is vast, which leads to the challenge of wrangling an equally vast number of vector embeddings.
Vector databases
Vectors cannot be managed and queried in the same manner as relational data. Instead, vectors are managed by vector databases and queried through the framework of vector similarity search.
Let's use Alibaba as an example. Each product listing is encoded into a vector by an in-house machine learning model. Whenever a user searches for an item, their query is encoded into the same vector space. A vector database then retrieves a few dozen to a few hundred listings that are semantically similar to the query. Finally, a reranking model performs a more fine-grained, exhaustive evaluation of the returned items to remove false positives and push the most relevant ones to the top. The entire pipeline completes within milliseconds, lest user retention plummet.
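For concreteness, here is a minimal sketch of such a retrieve-then-rerank pipeline. The encoder, ANN index, and reranker objects are hypothetical stand-ins for the production components, not any specific library's API.

```python
# A minimal sketch of a retrieve-then-rerank pipeline. The encoder,
# ANN index, and reranker are hypothetical stand-ins for production
# components, not a particular library's API.

def search_products(query_text, encoder, ann_index, reranker,
                    n_candidates=200, n_results=10):
    # 1. Encode the user query into the same vector space as the listings.
    query_vec = encoder.encode(query_text)

    # 2. The ANN index retrieves a few hundred semantically similar listings.
    candidate_ids = ann_index.search(query_vec, k=n_candidates)

    # 3. A heavier reranking model scores each candidate exhaustively.
    scored = [(reranker.score(query_text, cid), cid) for cid in candidate_ids]
    scored.sort(reverse=True)

    # 4. Only the most relevant listings are returned to the user.
    return [cid for _, cid in scored[:n_results]]
```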
Vector ANN indexes
The backbone of the blazing-fast nearest-neighbor search is a wide array of vector approximate nearest neighbor (ANN) indexes, highly optimized for the similarity search query model.
There are three families of such indexes: locality-sensitive hashing (LSH), trees, and graphs. LSH is a family of hash functions that, with high probability, place similar items into the same hash bucket. Trees, on the other hand, recursively split the search space in half.
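As a toy illustration of the LSH idea, here is a random-hyperplane hash for cosine similarity; the dimensionality and number of hyperplanes are arbitrary choices made for this sketch.

```python
import numpy as np

# A toy illustration of LSH: random-hyperplane hashing for cosine
# similarity. Dimensionality and number of planes are arbitrary.
rng = np.random.default_rng(0)
dim, n_planes = 128, 16
planes = rng.normal(size=(n_planes, dim))   # one random hyperplane per hash bit

def lsh_bucket(vec):
    # The bucket key is the sign pattern of the projections onto the
    # hyperplanes; similar vectors agree on most signs, so they tend
    # to land in the same bucket with high probability.
    signs = planes @ vec > 0
    return signs.tobytes()

# Nearby vectors usually hash to the same bucket.
a = rng.normal(size=dim)
b = a + 0.01 * rng.normal(size=dim)         # a small perturbation of a
print(lsh_bucket(a) == lsh_bucket(b))       # True with high probability
```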
LSH and trees excel when the vectors live in a low-dimensional space with high density. However, modern embeddings range from hundreds to thousands of dimensions. At the same time, the vectors populate that space sparsely, occupying a region of relatively low doubling dimension. To visualize this, imagine a line segment, a one-dimensional object, living in three-dimensional space. This sparsity is often enforced by the model training process through contrastive representation learning.
As such, most production vector databases use graph-based indexes, which aim to build navigable graphs.
A graph \(G = (V, E)\) is navigable if, given some query \(q\) and a starting node \(s \in V\), a greedy traversal from \(s\) returns the nearest neighbor to \(q\).
A greedy traversal is a simple algorithm: Look at a node's out-neighbors, move to the one closest to \(q\), and repeat. Of course, a complete graph is navigable. However, it has a very high number of edges, which makes traversal inefficient.
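Here is a minimal sketch of that greedy traversal, assuming the graph is a plain dictionary mapping each node id to its list of out-neighbors and `vectors[i]` holds node i's embedding.

```python
import numpy as np

# A minimal sketch of greedy traversal over a navigable graph.
# `graph` maps node id -> list of out-neighbor ids; `vectors[i]` is
# node i's embedding; `query` is the query vector.

def greedy_search(graph, vectors, query, start):
    current = start
    current_dist = np.linalg.norm(vectors[current] - query)
    while True:
        # Among the out-neighbors, find the one closest to the query.
        best, best_dist = current, current_dist
        for nbr in graph[current]:
            d = np.linalg.norm(vectors[nbr] - query)
            if d < best_dist:
                best, best_dist = nbr, d
        if best == current:
            return current                           # no neighbor is closer: stop
        current, current_dist = best, best_dist      # move to it and repeat
```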
In the worst case, a navigable graph requires \(\Omega(\sqrt{n})\) out-neighbors per node. That is a lot of edges: with a billion vectors, it amounts to over 30,000 edges per node. As such, most practical graph ANN indexes aim to approximate a navigable graph.
Two graph indexes are the most popular: hierarchical navigable small world (HNSW) and DiskANN.
Hierarchical Navigable Small World (HNSW)

The HNSW index, as the name suggests, is hierarchical. In the base layer, HNSW constructs an approximately navigable graph by retaining only a small subset of the edges that full navigability would require.
This in itself had been done before HNSW. The main bottleneck is that keeping only short edges leads to very long navigation times if the starting node is far away from the query. As such, HNSW adds long edges in higher layers of the hierarchy.
To do so, the construction procedure selects a random subset of vertices to keep in the next layer and builds an approximately navigable graph over them. This process repeats, keeping an ever smaller random subset of vertices to construct each subsequent layer.
The result of this iterative procedure is a birthday cake of sorts. At the top, there are only a few nodes and a few long edges. At the bottom, there are many nodes with numerous short edges. By starting the greedy search at the top layer to rapidly approach the general vicinity, and then traversing down the graph, the search procedure is accelerated greatly.
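A sketch of the resulting search procedure, built on the `greedy_search` function above, looks as follows. It omits details of the real algorithm (for instance, the candidate list of size `ef` used in the base layer) and assumes `layers` is ordered from the sparse top layer down to the dense base layer.

```python
# A sketch of HNSW-style layered search, reusing greedy_search above.
# Real HNSW additionally maintains a candidate list (parameter ef);
# this sketch only shows the top-down descent.

def hnsw_search(layers, vectors, query, entry_point):
    node = entry_point
    for graph in layers:
        # The result of each layer seeds the search in the next,
        # finer-grained layer below it.
        node = greedy_search(graph, vectors, query, start=node)
    return node   # approximate nearest neighbor from the base layer
```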
DiskANN

HNSW is widely successful, but it has a downside: the data structure is large and lives entirely in memory. As such, large-scale vector collections can require terabytes of RAM.
We could instead store HNSW on fast disk drives such as NVMe SSDs. But this balloons the latency, as HNSW was not designed for disk-resident operation. DiskANN is a disk-resident graph index that offers an alternative for memory-constrained environments.
If you are not familiar with SSD I/O characteristics, just remember this: issuing one page-sized read request at a time is slow, but requesting a large number of pages at once is nearly as cheap as requesting one.
The key idea behind DiskANN is to aggressively reduce the number of hops the greedy procedure takes. Instead of long sequential search paths, it relies on parallel beam search to reach satisfactory recall. This change amortizes SSD I/O latency across batched reads.
To achieve this goal, DiskANN takes a flat, approximately navigable graph, and then systematically replaces less useful short edges with very long edges. It eschews the hierarchy, as each layer of the hierarchy is yet another disk read request.
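Below is a rough sketch of beam search over a disk-resident graph. The assumptions are mine, not the actual implementation: `approx_vectors` is an in-memory array of (possibly compressed) vectors used to rank candidates, and `fetch_neighbors(ids)` reads the adjacency lists of a batch of nodes from SSD in one round of parallel page requests.

```python
import numpy as np

# A rough sketch of DiskANN-style beam search. `approx_vectors` is an
# in-memory array used for ranking; `fetch_neighbors(ids)` is a
# hypothetical function that reads the adjacency lists of a batch of
# nodes from SSD in one round of parallel page requests.

def beam_search(fetch_neighbors, approx_vectors, query, start,
                beam_width=4, list_size=64):
    dist = lambda i: float(np.linalg.norm(approx_vectors[i] - query))
    candidates = [(dist(start), start)]   # (distance, node id), kept sorted
    visited = set()

    while True:
        # Pick the closest unvisited candidates; their pages are read
        # from SSD together, which hides per-request latency.
        beam = [nid for _, nid in candidates if nid not in visited][:beam_width]
        if not beam:
            break                          # every candidate has been expanded
        visited.update(beam)

        known = {nid for _, nid in candidates}
        for neighbors in fetch_neighbors(beam):
            for nbr in neighbors:
                if nbr not in known:
                    candidates.append((dist(nbr), nbr))
                    known.add(nbr)

        # Keep only the `list_size` closest candidates found so far.
        candidates = sorted(candidates)[:list_size]

    return [nid for _, nid in candidates]
```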
One noteworthy characteristic of DiskANN is that it comes with a worst-case theoretical performance guarantee, a guarantee that has been shown not to hold for HNSW. In that sense, DiskANN's construction process is more principled.
Applications
Both HNSW and DiskANN are useful in practice, reaching nearly perfect recall within milliseconds on billions of vectors. The most common use case for these indexes is online search. Some versions of these indexes power traditional search engines' question-answering, as well as product recommendations for the world's largest retailers.
Big tech need not be the only users of vector ANN indexes. I will highlight one application within my domain: dataset search. Data repositories run by governments, research centers, or data vendors have the daunting task of managing an enormous collection of tables, typically CSV files, and serving the right dataset to users' queries.
Recent results suggest that HNSW and DiskANN crush the competition, whether the dataset search uses natural language queries or other criteria such as joinability.
Paper Highlight
Incremental SliceLine for Iterative ML Model Debugging under Updates

A slice is an interpretable subgroup of tabular data, defined as all rows that match a given combination of column values. Examples of slices include products priced between 10 and 20 USD, or Caucasian female students. SliceLine is an algorithm for efficiently finding relevant slices within an exponentially large lattice of candidates, and I have used it in one of my papers. The algorithm's creators have now extended their work to support incremental tuple insertions and deletions.
LLM-Powered Proactive Data Systems

There's much debate in academia and industry about how best to incorporate LLMs into data management systems. This paper advances one major line of thought: the query interface should become entirely natural-language-based. Moreover, the data management system should proactively ask the user clarifying questions to pin down intent and better answer the question.
Personal Notes
I will be taking the qualifying exam in less than a month. If I pass, I will receive a master's degree in computer science and officially become a Ph.D. candidate. New writings might be sparse until summer for that reason.
References
- Diwan et al. Navigable Graphs for High-Dimensional Nearest Neighbor Search: Constructions and Limits. NeurIPS 2024.
- Indyk and Xu. Worst-case Performance of Popular Approximate Nearest Neighbor Search Implementations: Guarantees and Limitations. NeurIPS 2023.
- Malkov and Yashunin. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. TPAMI 2018.
- Subramanya et al. DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node. NeurIPS 2019.
- Ibraheem et al. A Study on Efficient Indexing for Table Search in Data Lakes. ICSC 2024.