Joe Barrow VOL_01 / N1

tinyhnsw / introduction

Build Your own Vector Database · part 1 of 1

Introduction to TinyHNSW

The first post in the TinyHNSW series, introducing the tutorial and the library.

By Joe Barrow

Build Your Own Vector Database

Do you wonder how a “generative search” feature like Google’s works? How Spotify finds similar music to recommend? How you can search images on the web with just a description? All of these applications are built on top of vector databases.

Vector databases have rapidly grown in popularity recently, thanks to concepts like retrieval augmented generation (RAG). This has led to a flurry of start-ups in the space, like Pinecone, Weaviate, etc., which come on top of an already rich and thriving open source ecosystem.

As vector databases grow in popularity, I think it’s worthwhile to take a step back and understand how the tools we’re employing work. The goal of this tutorial is to help you understand what vector databases are, why they’re important, and how they work.

In particular, we will walk through an hierarchical navigable small worlds (HNSW), a new(-ish) approach to fast approximate nearest neighbor search. After spending a few evenings going through this, you will have a deep grasp of HNSW and a fully working vector database of your own.

⚠️ Much like cryptography, it's probably best that you stick
with current, known implementations (e.g. FAISS, hnswlib,
uSearch, etc.) rather than rolling your own for production.

What this tutorial is

This tutorial will walk you through the process of implementing a simple vector database, and then making it more efficient by understanding an implementing an approximate nearest neighbors (ANN) algorithm. We build our vector database around Hierarchical Navigable Small Worlds (HNSW) [1], a very popular ANN algorithm

What this tutorial isn’t

Although I will try to cover the introductory concepts at a high-level, this tutorial probably isn’t the best introduction to concepts like: “what is a vector”.

Similarly, I will not go into advanced optimization, serialization, or memory management strategies. Once you’ve finished building tinyhnsw you’ll have a small, working, and reasonably efficient vector database. However, you should see this as a launching off point more than anything! You could make it more efficient, or you could add different approximate nearest neighbors algorithm, or seek to optimize the serialization and space requirements, or make it resistant to crashes, or so many other things! The neat thing about small libraries that you can fully understand is that they’re wonderful experimental test-beds.

Some Concepts

References

  1. Yu A Malkov, Dmitry A Yashunin. (2018). Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE transactions on pattern analysis and machine intelligence.