Today we dive into the subject of vector databases. These databases are often used in search engines, where we search over the vector representations of the items. We dig into the different algorithms that allow us to search for vectors among billions or trillions of documents. We cover:
What is a Vector database?
The rise of Vector databases
Different vector databases
Indexing and searching a vector space
Product Quantization
Locality-sensitive hashing
Hierarchical Navigable Small World
Similarity Measures
Euclidean distance
Dot product
Cosine similarity
Beyond indexing
What is a Vector database?
The rise of Vector databases
We have recently seen a surge of vector databases in this era of generative AI. The idea behind vector databases is to index the data with vectors that relate to that data.
Vector databases are often used for recommender engines, where we learn vector representations of users and of the items we want to recommend.
This allows us to quickly find similar items by using an approximate nearest neighbors search.
As long as we can learn a vector representation of a piece of data, we can index it in a vector database. With the recent advent of LLMs, it has become easier to compute vector representations of text documents that capture the semantic meaning of the text.
Vector databases make it easier to find semantically similar text documents.
Different vector databases
There are tons of vector database providers. Here is a small list of such databases:
Pinecone: A vector database that is designed for machine learning applications. It supports a variety of machine learning algorithms and is built on top of Faiss, an open source library by Meta for efficient similarity search of dense vectors.
Deep Lake: Deep Lake is a Database for AI powered by a unique storage format optimized for deep-learning and Large Language Model (LLM) based applications. It simplifies the deployment of enterprise-grade LLM-based products by offering storage for all data types (embeddings, audio, text, videos, images, pdfs, annotations, etc.), querying and vector search, data streaming while training models at scale, data versioning and lineage for all workloads, and integrations with popular tools such as LangChain, LlamaIndex, Weights & Biases, and many more.
Milvus: Milvus is an open-source vector database built to power embedding similarity search and AI applications. Milvus makes unstructured data search more accessible, and provides a consistent user experience regardless of the deployment environment.
Qdrant: Qdrant is a vector similarity search engine and vector database. It provides a production-ready service with a convenient API to store, search, and manage points (vectors with an additional payload). Qdrant is tailored for extended filtering support, which makes it useful for all sorts of neural-network or semantic-based matching, faceted search, and other applications.
Weaviate: Weaviate is an open source vector database that is robust, scalable, cloud-native, and fast. With Weaviate, you can turn your text, images and more into a searchable vector database using state-of-the-art ML models.
These are just a small sample of the available vector databases. I found that the LangChain vector store integration documentation provides a wider list of available vector databases.
Indexing and searching a vector space
Indexing a vector database is very different from indexing most databases. The goal of a query is to return the nearest neighbors as measured by a similarity metric. The typical time complexity of the K-nearest neighbor algorithm is O(ND), where N is the number of vectors and D is the vector dimension. Because the time complexity scales with the number of items, exact search makes it unmanageable to build a scalable and fast database. The typical approach is to rely on Approximate Nearest Neighbor (ANN) algorithms to make the search blazing fast.
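To make the O(ND) cost concrete, here is a minimal sketch of exact (brute-force) K-nearest neighbor search with numpy; the dataset sizes and the `knn` helper are illustrative, not part of any particular database:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 10_000, 128  # N vectors of dimension D (illustrative sizes)
vectors = rng.normal(size=(N, D)).astype(np.float32)

def knn(query, k=5):
    # One full pass over all N vectors: O(N * D) work per query.
    dists = ((vectors - query) ** 2).sum(axis=1)
    # Sorting the distances and keeping the k smallest.
    return np.argsort(dists)[:k]

neighbors = knn(vectors[0])  # exact search: vector 0 is its own nearest neighbor
```

Every query touches every vector, which is exactly what ANN algorithms try to avoid.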
Here we go over 3 different algorithms used for vector search: Product Quantization, Locality-sensitive hashing, and Hierarchical Navigable Small World. These are typical algorithms used in most vector databases, and they are often combined for optimal retrieval speed.
Product Quantization
When looking for the nearest neighbors, it is often not important to be perfectly accurate. Product Quantization (PQ) is a way to quantize the vector space to represent vectors with less precision. The idea is to cluster vectors and index the cluster centroids instead of the vectors themselves. When looking for the nearest neighbors to a query vector, we just need to pull the vectors from the closest clusters. The search is faster, and indexing the vectors takes much less memory space.
We first need to partition each vector into smaller sub-vectors and run a k-means algorithm on each partition.
Instead of indexing the vectors, we index the centroids of the clusters they belong to. If we use 2 clusters per partition and have 6 vectors, that is a 3X data compression. Obviously, the compression would be much higher with more vectors. Each vector now maps to a set of clusters and their related centroids.
If we want to find the nearest neighbors of a query vector, we measure the squared Euclidean distance to each cluster centroid in each partition and return the vectors with the lowest summed squared Euclidean distances.
Instead of having to iterate through each vector, we just need to iterate through the clusters’ centroids. There is a balance between search latency and accuracy. The more clusters we use, the finer the quantization and the more accurate the returned nearest neighbors, but the higher the search latency, as we will need to iterate through more clusters.
This is still a brute force approach as the algorithm scales with the number of clusters but it can be used in combination with other algorithms.
Locality-sensitive hashing
Locality-sensitive hashing (LSH) aims to group vectors together based on similarity. For example, we could partition the vector space into multiple buckets.
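One common way to build such buckets, sketched here as an assumption rather than the article's specific method, is to hash each vector by which side of a few random hyperplanes it falls on; the parameters and helper names are illustrative:

```python
from collections import defaultdict

import numpy as np

rng = np.random.default_rng(0)
D, n_planes = 64, 8

# Each random hyperplane through the origin splits the space in two;
# the n_planes sign bits together form the bucket id.
planes = rng.normal(size=(n_planes, D))

def lsh_hash(v):
    return tuple((planes @ v > 0).astype(int))

# Index: bucket id -> list of vector ids in that bucket.
vectors = rng.normal(size=(1000, D))
index = defaultdict(list)
for i, v in enumerate(vectors):
    index[lsh_hash(v)].append(i)

# At query time, only vectors in the query's bucket are candidates.
candidates = index[lsh_hash(vectors[42])]
```

Nearby vectors tend to fall on the same side of most hyperplanes, so they are likely to land in the same bucket, shrinking the search to a small candidate set.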