大模型-向量检索算法HNSW

2025-03-09 | 阅读：次

介绍

HNSW, 即Hierarchical Navigable Small World graphs(分层-可导航-小世界-图)的缩写,

可以说是在工业界影响力最大的基于图的近似最近邻搜索算法

后续可优化点

2025年1月7日：在 HNSW 中提前终止，以便更快地进行近似 KNN 搜索

elasticsearch博客

https://www.elastic.co/search-labs/blog/hnsw-knn-search-early-termination

JavaLib实现：

https://github.com/jelmerk/hnswlib

    <dependency>
        <groupId>com.github.jelmerk</groupId>
        <artifactId>hnswlib-core</artifactId>
        <version>1.2.0</version>
    </dependency>

    <dependency>
        <groupId>com.github.jelmerk</groupId>
        <artifactId>hnswlib-utils</artifactId>
        <version>1.2.0</version>
    </dependency>

数据结构

class Node<TItem>{

        final int id; // 自己的Node#id

        final int[] connections; // 关联的相邻的Node#id， 数组的长度为 randomLevel()

        volatile TItem item; // 原始的业务数据

        volatile boolean deleted; // 软删除，俗称逻辑删除，延迟删除。

}

每一行业务记录，都会包装为Node<TItem>对象。

connections.length 最大的那个Node会被作为查询入口，每次查询都从它开始。

主要参数

HNSW的主要参数

m 参数：
- 对应代码：Node对象中connections数组的容量，即 new connections[m]
- 含义：每个节点在HNSW图中的最大连接数。
- 默认值：16
- 调优建议：
  - 较小的m值可以减少索引的大小，但可能会降低搜索精度。
  - 较大的m值可以提高搜索精度，但会增加索引的大小和构建时间。
ef_construction 参数：
- 对应代码：searchBaseLayer(currObj, item.vector(), efConstruction, level);
- 含义：在构建索引时，每个节点的候选邻居数量。
- 默认值：100
- 调优建议：
  - 较大的ef_construction值可以提高索引的质量，但会增加索引构建时间。
  - 较小的ef_construction值可以加快索引构建速度，但可能会降低索引质量。
ef_search 参数：
- 对应代码：searchBaseLayer(currObj, destination, Math.max(ef_search, k), 0);
- 对应代码：搜索时候，跳表遍历图查出max(k,efSearch)
- 含义：在搜索时，每个节点的候选邻居数量。
- 默认值：100
- 调优建议：
  - 较大的ef_search值可以提高搜索精度，但会增加搜索时间。
  - 较小的ef_search值可以加快搜索速度，但可能会降低搜索精度。

查询

先从查询入口的Node进行跳表查询出最相似的那条记录

最多循环最大进行N次入口Node.connections.length-1次，

直到找不到比connections[i–]还相似的。

找出最相似的k条

从最相似的那条记录为起点，从前往后，即从下标0（第0层），跳表遍历图查出max(k,efSearch)条最相似的记录

遍历时，算法逐层进行，探索和扩展候选节点，直到用尽所有可能的路径。

返回k条最相似的记录

插入

参数： m： connections的容量，即 new connections[m]

参数： efConstruction：每次变更时，更新关联相邻节点的个数

每次有新业务记录插入时，

先从查询入口的Node进行跳表查询出最相似的那条记录
从最相似那条记录为起点，进行双向关联

layer = min(新.connections.length-1, 起点.connections.length - 1)

从后往前，进行layer– 次跳表遍历图查出efConstruction条最相似的记录

找出efConstruction条最相似记录,双向关联相邻节点至connections数组中

删除

删除动作复杂度是O1，非常快

直接从idMap中取出引用Node,将deleted改为true做软删除，后续由应用做合并。

如何使用

    public static void main(String[] args) throws Exception {

        Path file = TMP_PATH.resolve("cc.en.300.vec.gz");

        if (!Files.exists(file)) {
            downloadFile(WORDS_FILE_URL, file);
        } else {
            System.out.printf("Input file already downloaded. Using %s%n", file);
        }

        List<Word> words = loadWordVectors(file);

        System.out.println("Constructing index.");

        HnswIndex<String, float[], Word, Float> hnswIndex = HnswIndex
                .newBuilder(300, DistanceFunctions.FLOAT_INNER_PRODUCT, words.size())
                .withM(16)
                .withEf(200)
                .withEfConstruction(200)
                .build();

        long start = System.currentTimeMillis();

        hnswIndex.addAll(words, (workDone, max) -> System.out.printf("Added %d out of %d words to the index.%n", workDone, max));

        long end = System.currentTimeMillis();

        long duration = end - start;

        System.out.printf("Creating index with %d words took %d millis which is %d minutes.%n", hnswIndex.size(), duration, MILLISECONDS.toMinutes(duration));

        Index<String, float[], Word, Float> groundTruthIndex = hnswIndex.asExactIndex();

        Console console = System.console();

        int k = 10;

        while (true) {
            System.out.println("Enter an english word : ");

            String input = console.readLine();

            List<SearchResult<Word, Float>> approximateResults = hnswIndex.findNeighbors(input, k);

            List<SearchResult<Word, Float>> groundTruthResults = groundTruthIndex.findNeighbors(input, k);

            System.out.println("Most similar words found using HNSW index : %n%n");

            for (SearchResult<Word, Float> result : approximateResults) {
                System.out.printf("%s %.4f%n", result.item().id(), result.distance());
            }

            System.out.printf("%nMost similar words found using exact index: %n%n");

            for (SearchResult<Word, Float> result : groundTruthResults) {
                System.out.printf("%s %.4f%n", result.item().id(), result.distance());
            }

            int correct = groundTruthResults.stream().mapToInt(r -> approximateResults.contains(r) ? 1 : 0).sum();

            System.out.printf("%nAccuracy : %.4f%n%n", correct / (double) groundTruthResults.size());
        }
    }

elasticsearch索引文件预读优化

    PUT /my_index/_settings
    {
        "index": {
            "store": {
                "preload": ["vec", "vem", "vex"]
            }
        }
    }