Showing 6 changed files with 99 additions and 11 deletions.

````diff
@@ -1,4 +1,4 @@
-# Modin and Xorbits
+# Xorbits
 
 ```{tableofcontents}
 ```
````
(sec-xinference)=
# Xinference

Xorbits Inference (Xinference) is an inference platform for large models; it supports large language models, embedding models, text-to-image models, and more. Under the hood it builds on the distributed capabilities provided by [Xoscar](https://github.com/xorbitsai/xoscar), so models can be deployed on a cluster, while on top it exposes an OpenAI-style API through which users can deploy and call open-source large models. Xinference integrates the serving API, the inference engine, and the hardware, so there is no need to write code to manage the model inference service as there is with Ray Serve.

## Inference Engines

Xinference works with different inference engines, including Hugging Face Transformers, vLLM, and GGML (llama.cpp), so the matching engine must be installed alongside it, for example `pip install "xinference[transformers]"`. Transformers is built entirely on PyTorch and supports the most models with the fastest adaptation to new ones, but its performance is comparatively poor; other engines such as vLLM and GGML focus on performance optimization, yet their model coverage is not as broad as Transformers.
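
As a reference, installing the other backends follows the same extras pattern; only the `transformers` extra appears in the text above, so the remaining extra names are assumptions and may differ across Xinference versions:

```shell
# Install Xinference with one inference backend (pick the one you need).
pip install "xinference[transformers]"  # Hugging Face Transformers backend (from the text above)
pip install "xinference[vllm]"          # vLLM backend (assumed extra name)
pip install "xinference[ggml]"          # GGML / llama.cpp backend (assumed extra name)
```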

## Cluster

Before using Xinference, you need to start an inference cluster, which can be a single machine with multiple GPUs or multiple machines with multiple GPUs. On a single machine, start it from the command line like this:

```shell
xinference-local --host 0.0.0.0 --port 9997
```

The cluster scenario is similar to Xorbits Data: start a Supervisor first, then start the Workers:

```shell
# start the Supervisor
xinference-supervisor -H <supervisor_ip>

# start a Worker
xinference-worker -e "http://<supervisor_ip>:9997" -H <worker_ip>
```

After that, the Xinference service can be accessed at http://<supervisor_ip>:9997.
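
Because the service exposes an OpenAI-style API (shown below), a quick sanity check is to query the model list over HTTP; the `/v1/models` route here is an assumption based on OpenAI API conventions:

```shell
# Assumed OpenAI-style route for listing models; verify against your Xinference version.
curl http://<supervisor_ip>:9997/v1/models
```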

## Using Models

Xinference manages the whole lifecycle of a model deployment: launching the model, using it, and shutting it down.
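
As a minimal sketch of that lifecycle through the Python client, assuming the `xinference.client.Client` interface with `launch_model`, `get_model`, and `terminate_model` (the model name and parameters below are illustrative and may differ by version):

```python
from xinference.client import Client

# Connect to the running Xinference service (endpoint from the cluster section).
client = Client("http://127.0.0.1:9997")

# Launch a model; the model name and size below are illustrative placeholders.
model_uid = client.launch_model(model_name="llama-2-chat", model_size_in_billions=7)

# Obtain a handle to the launched model and use it (e.g. for chat).
model = client.get_model(model_uid)

# Shut the model down when finished.
client.terminate_model(model_uid)
```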

Once the Xinference service is running, we can launch and call models. Xinference integrates a large number of open-source models; the user can pick one in the web UI, and Xinference downloads and launches it in the background. Every launched model provides both a web chat interface and an OpenAI-compatible API. For example, interacting with a launched model through the OpenAI API:

```python
from openai import OpenAI

# Point the OpenAI client at the local Xinference endpoint; the API key is not actually checked.
client = OpenAI(base_url="http://127.0.0.1:9997/v1", api_key="not used actually")
response = client.chat.completions.create(
    model="my-llama-2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the largest animal?"}
    ]
)
print(response)
```