-
Notifications
You must be signed in to change notification settings - Fork 90
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
1 changed file
with
210 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,210 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "e328c745-2057-4d4e-b8c9-9c1b085d2b5e", | ||
"metadata": {}, | ||
"source": [ | ||
"# 10分钟入门faiss相似向量索引" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "ec98c697-d7d1-4a0d-9770-886d79688fc1", | ||
"metadata": {}, | ||
"source": [ | ||
"公众号算法美食屋后台回复关键词:源码,获取本文notebook源代码。" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "29f4e643-fabc-4cf0-ab5f-4ef66f5a300d", | ||
"metadata": {}, | ||
"source": [ | ||
"## 一,faiss简介\n", | ||
"\n", | ||
"faiss全称 Facebook AI Similarity Search,是FaceBook的AI团队针对大规模向量 进行 TopK 相似向量 检索 的一个工具,使用C++编写,有python接口,对10亿量级的索引可以做到毫秒级检索的性能。\n", | ||
"\n", | ||
"使用faiss 搭配合适的model和embedding函数,可以帮助我们 构建 人脸识别,相似图片检索,LLM知识库问答 等应用。\n", | ||
"\n", | ||
"faiss的主要原理是构建base vectors向量数据的index索引,然后利用索引对search vectors 实现 TopK 相似向量检索。\n", | ||
"\n", | ||
"faiss支持许多不同的构建索引的方式,以下是一些较推荐使用的类型。\n", | ||
"\n", | ||
"1,Flat:暴力精确检索,全局最优,适合数十万级。\n", | ||
"\n", | ||
"2,IVF100,Flat:倒排暴力检索(100聚类后暴力检索),非全局最优但召回高,适合数百万级。\n", | ||
"\n", | ||
"3, HNSW64: 图网络检索,Hierarchical NSW(Navigable Small World),每个节点64条边,检索复杂度log(logn),适合千万上亿规模以及更大规模的图索引,缺点是构建索引过程较慢,占用很大的存储。" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "65f6bf79-b571-4e20-b1ba-8b3d9e22d06f", | ||
"metadata": {}, | ||
"source": [ | ||
"## 二,Faiss安装" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "fb672c21-a9ba-4deb-9dd6-e22810027844", | ||
"metadata": {}, | ||
"source": [ | ||
"```\n", | ||
"pip install faiss-cpu # faiss-gpu (一般来说cpu足够快了)\n", | ||
"```" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "b34b86d6-dc41-4da4-b05e-db449f0b6287", | ||
"metadata": {}, | ||
"source": [ | ||
"## 三,代码范例" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"id": "0bcee28c-a3bc-41b0-9de8-7456c435fc9f", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"============================== 1,base vector ==============================\n", | ||
"xb.shape = (100000, 64) \n", | ||
"\n", | ||
"============================== 2,query vector ==============================\n", | ||
"xq.shape = (1000, 64) \n", | ||
"\n", | ||
"============================== 3,create&train ==============================\n", | ||
"index.is_trained= True\n", | ||
"index.ntotal= 100000 \n", | ||
"\n", | ||
"============================== 4, search ==============================\n", | ||
"nearest vector ids:\n", | ||
" [[ 207 381 1394 1019]\n", | ||
" [ 300 911 142 526]\n", | ||
" [ 838 1541 527 148]\n", | ||
" [ 196 359 184 466]\n", | ||
" [ 526 120 917 765]] \n", | ||
"\n", | ||
"metric(distances/scores) to query:\n", | ||
" [[0.87687665 0.861282 0.8566772 0.85451 ]\n", | ||
" [0.8702938 0.86668825 0.8593493 0.8523142 ]\n", | ||
" [0.862915 0.8580744 0.8538497 0.84994495]\n", | ||
" [0.8691999 0.86600447 0.8647547 0.8634621 ]\n", | ||
" [0.8539624 0.8491495 0.84744585 0.8432568 ]] \n", | ||
"\n", | ||
"============================== 5, add&remove ==============================\n", | ||
"after add, index.ntotal= 110000\n", | ||
"after remove, index.ntotal= 109889 \n", | ||
"\n", | ||
"============================== 6, write&read ==============================\n", | ||
"index_loaded.ntotal= 109889\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"import faiss\n", | ||
"import numpy as np \n", | ||
"\n", | ||
"# 〇,基本参数设置\n", | ||
"d = 64 # 向量维度\n", | ||
"nb = 100000 # index向量库的数据量\n", | ||
"nq = 1000 # 待检索query的数目\n", | ||
"index_type = 'Flat' # index 类型\n", | ||
"metric_type = faiss.METRIC_INNER_PRODUCT # 度量(相似度/距离)类型\n", | ||
"\n", | ||
"\n", | ||
"# 一,准备向量库向量\n", | ||
"print('============================== 1,base vector ==============================')\n", | ||
"np.random.seed(1234) \n", | ||
"xb = np.random.random((nb, d)).astype('float32')\n", | ||
"xb[:, 0] += np.arange(nb) / 1000. # index向量库的向量\n", | ||
"faiss.normalize_L2(xb)\n", | ||
"print('xb.shape = ',xb.shape,'\\n')\n", | ||
"\n", | ||
"\n", | ||
"# 二,准备查询向量\n", | ||
"print('============================== 2,query vector ==============================')\n", | ||
"xq = np.random.random((nq, d)).astype('float32')\n", | ||
"xq[:, 0] += np.arange(nq) / 1000. # 待检索的query向量\n", | ||
"faiss.normalize_L2(xq)\n", | ||
"print('xq.shape = ',xq.shape,'\\n')\n", | ||
"\n", | ||
"# 三,构建向量库索引\n", | ||
"print('============================== 3,create&train ==============================')\n", | ||
"index = faiss.index_factory(d,index_type,metric_type) #等价于 faiss.IndexFlatIP(d) \n", | ||
"print('index.is_trained=',index.is_trained) # 输出为True,代表该类index不需要训练,只需要add向量进去即可\n", | ||
"index.train(xb)\n", | ||
"index.add(xb) # 将向量库中的向量加入到index中\n", | ||
"print('index.ntotal=',index.ntotal,'\\n') # 输出index中包含的向量总数,为100000 \n", | ||
"\n", | ||
"\n", | ||
"# 四,相似向量查询\n", | ||
"print('============================== 4, search ==============================')\n", | ||
"k = 4 # topK的K值\n", | ||
"D, I = index.search(xq, k) # xq为待检索向量,返回的I为每个待检索query最相似TopK的索引list,D为其对应的距离\n", | ||
"\n", | ||
"print('nearest vector ids:\\n',I[:5],'\\n')\n", | ||
"print('metric(distances/scores) to query:\\n',D[-5:],'\\n')\n", | ||
"\n", | ||
"# 五,增删索引向量\n", | ||
"print('============================== 5, add&remove ==============================')\n", | ||
"xa = np.random.random((10000, d)).astype('float32')\n", | ||
"xa[:, 0] += np.arange(len(xa)) / 1000. \n", | ||
"faiss.normalize_L2(xa)\n", | ||
"index.add(xa)\n", | ||
"print('after add, index.ntotal=',index.ntotal) \n", | ||
"index.remove_ids(np.arange(1000,1111))\n", | ||
"print('after remove, index.ntotal=',index.ntotal,'\\n') \n", | ||
"\n", | ||
"# 六,保存加载索引\n", | ||
"print('============================== 6, write&read ==============================')\n", | ||
"faiss.write_index(index, \"large.index\")\n", | ||
"index_loaded = faiss.read_index('large.index')\n", | ||
"print('index_loaded.ntotal=', index_loaded.ntotal)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "5dd53f5b-38da-4236-9790-1357b0a1742c", | ||
"metadata": {}, | ||
"source": [ | ||
"参考文章:\n", | ||
"1,《Faiss入门以及应用经验记录》\n", | ||
"https://zhuanlan.zhihu.com/p/357414033\n", | ||
"\n", | ||
"2,《ANN召回算法之HNSW》\n", | ||
"https://zhuanlan.zhihu.com/p/379372268\n", | ||
"\n", | ||
"公众号算法美食屋后台回复关键词:源码,获取本文notebook源代码。" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.9.0" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |