g. vector databases

Chinou Gea
3 min readJul 15, 2023

g1) We have seen recently a surge in vector databases in this era of generative AI. The idea behind vector databases is to index the data with vectors that relate to that data. Hierarchical Navigable Small World (HNSW) is one of the most efficient ways to build indexes for vector databases. The idea is to build a similarity graph and traverse that graph to find the nodes that are the closest to a query vector.

Navigable Small World (NSW) is a process to build efficient graphs for search. We build a graph by adding vectors one after the others and connecting each new node to the most similar neighbors.

When building the graph, we need to decide on a metric for similarity such that the search is optimized for the specific metric used to query items. Initially, when adding nodes, the density is low and the edges will tend to capture nodes that are far apart in similarity. Little by little, the density increases and the edges start to be shorter and shorter. As a consequence the graph is composed of long edges that allow us to traverse long distances in the graph, and short edges that capture closer neighbors. Because of it, we can quickly traverse the graph from one side to the other and look for nodes at a specific location in the vector space.

When we want to find the nearest neighbor to a query vector, we initiate the search by starting at one node (i.e. node A in that case). Among its neighbors (D, G, C), we look for the closest node to the query (D). We iterate over that process until there are no closer neighbors to the query. Once we cannot move anymore, we found a close neighbor to the query. The search is approximate and the found node may not be the closest as the algorithm may be stuck in a local minima.

The problem with NSW, is we spend a lot of iterations traversing the graph to arrive at the right node. The idea for Hierarchical Navigable Small World is to build multiple graph layers where each layer is less dense compared to the next. Each layer represents the same vector space, but not all vectors are added to the graph. Basically, we include a node in the graph at layer L with a probability P(L). We include all the nodes in the final layer (if we have N layers, we have P(N) = 1) and the probability gets smaller as we get toward the first layers. We have a higher chance of including a node in the following layer and we have P(L) < P(L + 1).

The first layer allows us to traverse longer distances at each iteration where in the last layer, each iteration will tend to capture shorter distances. When we search for a node, we start first in layer 1 and go to the next layer if the NSW algorithm finds the closest neighbor in that layer. This allows us to find the approximate nearest neighbor in less iterations in average.

最近,我们看到在这个生成人工智能时代,矢量数据库激增。矢量数据库背后的想法是使用与该数据相关的矢量来索引数据。分层可导航小世界(HNSW)是为矢量数据库构建索引的最有效方法之一。这个想法是构建一个相似度图并遍历该图以查找最接近查询向量的节点。

可导航小世界(Navigable Small World, NSW)是一个构建高效搜索图的过程。我们通过依次添加向量并将每个新节点连接到最相似的邻居来构建一个图。

构建图表时,我们需要决定相似性指标,以便针对用于查询项目的特定指标优化搜索。最初,当添加节点时,密度较低,边缘倾向于捕获相似度相距较远的节点。渐渐地,密度增加,边缘开始越来越短。因此,该图由允许我们在图中遍历长距离的长边和捕获更近邻居的短边组成。正因为如此,我们可以快速地将图从一侧遍历到另一侧,并在向量空间中的特定位置查找节点。

当我们想要找到查询向量的最近邻居时,我们从一个节点(即这种情况下的节点A)开始搜索。在其邻居(D、G、C)中,我们寻找最接近查询的节点 (D)。我们迭代该过程,直到没有更接近查询的邻居为止。一旦我们不能再移动,我们就找到了查询的近邻。搜索是近似的,找到的节点可能不是最接近的,因为算法可能陷入局部最小值。

NSW的问题是我们花费大量迭代来遍历图表才能到达正确的节点。分层可导航小世界的想法是构建多个图形层,其中每一层的密度都低于下一层。每层代表相同的向量空间,但并非所有向量都添加到图中。基本上,我们在图中的 L 层包含一个概率为P(L)的节点。我们将所有节点包含在最后一层中(如果我们有 N层,则P(N)=1),并且随着我们进入第一层,概率会变小。我们有更高的机会将节点包含在下一层中,并且P(L) < P(L + 1)。

第一层允许我们在每次迭代中遍历更长的距离,而在最后一层中,每次迭代将倾向于捕获更短的距离。当我们搜索节点时,我们首先从第 1 层开始,如果NSW算法找到该层中最近的邻居,则转到下一层。这使我们能够平均以较少的迭代次数找到近似的最近邻。

– –

信源:TheAiEdge.io时事通讯,达米安·本文尼斯特(DAMIEN BENVENISTE);下一期机器学习工程大师班(MasterClass.TheAiEdge.io)将于7月29日开始。Credit: newsletter on TheAiEdge.io; Next ML engineering Masterclass starting July 29th: MasterClass.TheAiEdge.io. Share & Translate: Chinou Gea (秦陇纪) @2023 DSS-SDS, IFS-AHSC. Data Simplicity Community Facebook Group https://m.facebook.com/groups/290760182638656/ #DataSimp #DataScience #DataComputing #computer #program #AI #ArtificialIntelligence #MachineLearning #ML #VectorDB.

--

--

Chinou Gea
Chinou Gea

Written by Chinou Gea

Chinou Gea Studio -- open academic researching and sharing in information and data specialties by Chinou Gea; also follow me at www.facebook.com/aaron.gecai