Skip to content

Commit

Permalink
add faiss
Browse files Browse the repository at this point in the history
  • Loading branch information
lyhue1991 authored Aug 11, 2023
1 parent b0947dd commit 1832137
Showing 1 changed file with 210 additions and 0 deletions.
210 changes: 210 additions & 0 deletions 10分钟入门faiss相似向量检索.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,210 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "e328c745-2057-4d4e-b8c9-9c1b085d2b5e",
"metadata": {},
"source": [
"# 10分钟入门faiss相似向量索引"
]
},
{
"cell_type": "markdown",
"id": "ec98c697-d7d1-4a0d-9770-886d79688fc1",
"metadata": {},
"source": [
"​公众号算法美食屋后台回复关键词:源码,获取本文notebook源代码。"
]
},
{
"cell_type": "markdown",
"id": "29f4e643-fabc-4cf0-ab5f-4ef66f5a300d",
"metadata": {},
"source": [
"## 一,faiss简介\n",
"\n",
"faiss全称 Facebook AI Similarity Search,是FaceBook的AI团队针对大规模向量 进行 TopK 相似向量 检索 的一个工具,使用C++编写,有python接口,对10亿量级的索引可以做到毫秒级检索的性能。\n",
"\n",
"使用faiss 搭配合适的model和embedding函数,可以帮助我们 构建 人脸识别,相似图片检索,LLM知识库问答 等应用。\n",
"\n",
"faiss的主要原理是构建base vectors向量数据的index索引,然后利用索引对search vectors 实现 TopK 相似向量检索。\n",
"\n",
"faiss支持许多不同的构建索引的方式,以下是一些较推荐使用的类型。\n",
"\n",
"1,Flat:暴力精确检索,全局最优,适合数十万级。\n",
"\n",
"2,IVF100,Flat:倒排暴力检索(100聚类后暴力检索),非全局最优但召回高,适合数百万级。\n",
"\n",
"3, HNSW64: 图网络检索,Hierarchical NSW(Navigable Small World),每个节点64条边,检索复杂度log(logn),适合千万上亿规模以及更大规模的图索引,缺点是构建索引过程较慢,占用很大的存储。"
]
},
{
"cell_type": "markdown",
"id": "65f6bf79-b571-4e20-b1ba-8b3d9e22d06f",
"metadata": {},
"source": [
"## 二,Faiss安装"
]
},
{
"cell_type": "markdown",
"id": "fb672c21-a9ba-4deb-9dd6-e22810027844",
"metadata": {},
"source": [
"```\n",
"pip install faiss-cpu # faiss-gpu (一般来说cpu足够快了)\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "b34b86d6-dc41-4da4-b05e-db449f0b6287",
"metadata": {},
"source": [
"## 三,代码范例"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "0bcee28c-a3bc-41b0-9de8-7456c435fc9f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"============================== 1,base vector ==============================\n",
"xb.shape = (100000, 64) \n",
"\n",
"============================== 2,query vector ==============================\n",
"xq.shape = (1000, 64) \n",
"\n",
"============================== 3,create&train ==============================\n",
"index.is_trained= True\n",
"index.ntotal= 100000 \n",
"\n",
"============================== 4, search ==============================\n",
"nearest vector ids:\n",
" [[ 207 381 1394 1019]\n",
" [ 300 911 142 526]\n",
" [ 838 1541 527 148]\n",
" [ 196 359 184 466]\n",
" [ 526 120 917 765]] \n",
"\n",
"metric(distances/scores) to query:\n",
" [[0.87687665 0.861282 0.8566772 0.85451 ]\n",
" [0.8702938 0.86668825 0.8593493 0.8523142 ]\n",
" [0.862915 0.8580744 0.8538497 0.84994495]\n",
" [0.8691999 0.86600447 0.8647547 0.8634621 ]\n",
" [0.8539624 0.8491495 0.84744585 0.8432568 ]] \n",
"\n",
"============================== 5, add&remove ==============================\n",
"after add, index.ntotal= 110000\n",
"after remove, index.ntotal= 109889 \n",
"\n",
"============================== 6, write&read ==============================\n",
"index_loaded.ntotal= 109889\n"
]
}
],
"source": [
"import faiss\n",
"import numpy as np \n",
"\n",
"# 〇,基本参数设置\n",
"d = 64 # 向量维度\n",
"nb = 100000 # index向量库的数据量\n",
"nq = 1000 # 待检索query的数目\n",
"index_type = 'Flat' # index 类型\n",
"metric_type = faiss.METRIC_INNER_PRODUCT # 度量(相似度/距离)类型\n",
"\n",
"\n",
"# 一,准备向量库向量\n",
"print('============================== 1,base vector ==============================')\n",
"np.random.seed(1234) \n",
"xb = np.random.random((nb, d)).astype('float32')\n",
"xb[:, 0] += np.arange(nb) / 1000. # index向量库的向量\n",
"faiss.normalize_L2(xb)\n",
"print('xb.shape = ',xb.shape,'\\n')\n",
"\n",
"\n",
"# 二,准备查询向量\n",
"print('============================== 2,query vector ==============================')\n",
"xq = np.random.random((nq, d)).astype('float32')\n",
"xq[:, 0] += np.arange(nq) / 1000. # 待检索的query向量\n",
"faiss.normalize_L2(xq)\n",
"print('xq.shape = ',xq.shape,'\\n')\n",
"\n",
"# 三,构建向量库索引\n",
"print('============================== 3,create&train ==============================')\n",
"index = faiss.index_factory(d,index_type,metric_type) #等价于 faiss.IndexFlatIP(d) \n",
"print('index.is_trained=',index.is_trained) # 输出为True,代表该类index不需要训练,只需要add向量进去即可\n",
"index.train(xb)\n",
"index.add(xb) # 将向量库中的向量加入到index中\n",
"print('index.ntotal=',index.ntotal,'\\n') # 输出index中包含的向量总数,为100000 \n",
"\n",
"\n",
"# 四,相似向量查询\n",
"print('============================== 4, search ==============================')\n",
"k = 4 # topK的K值\n",
"D, I = index.search(xq, k) # xq为待检索向量,返回的I为每个待检索query最相似TopK的索引list,D为其对应的距离\n",
"\n",
"print('nearest vector ids:\\n',I[:5],'\\n')\n",
"print('metric(distances/scores) to query:\\n',D[-5:],'\\n')\n",
"\n",
"# 五,增删索引向量\n",
"print('============================== 5, add&remove ==============================')\n",
"xa = np.random.random((10000, d)).astype('float32')\n",
"xa[:, 0] += np.arange(len(xa)) / 1000. \n",
"faiss.normalize_L2(xa)\n",
"index.add(xa)\n",
"print('after add, index.ntotal=',index.ntotal) \n",
"index.remove_ids(np.arange(1000,1111))\n",
"print('after remove, index.ntotal=',index.ntotal,'\\n') \n",
"\n",
"# 六,保存加载索引\n",
"print('============================== 6, write&read ==============================')\n",
"faiss.write_index(index, \"large.index\")\n",
"index_loaded = faiss.read_index('large.index')\n",
"print('index_loaded.ntotal=', index_loaded.ntotal)"
]
},
{
"cell_type": "markdown",
"id": "5dd53f5b-38da-4236-9790-1357b0a1742c",
"metadata": {},
"source": [
"参考文章:\n",
"1,《Faiss入门以及应用经验记录》\n",
"https://zhuanlan.zhihu.com/p/357414033\n",
"\n",
"2,《ANN召回算法之HNSW》\n",
"https://zhuanlan.zhihu.com/p/379372268\n",
"\n",
"​公众号算法美食屋后台回复关键词:源码,获取本文notebook源代码。"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

0 comments on commit 1832137

Please sign in to comment.