incorporate #118 from redhat-et/ilab-on-ocp
Signed-off-by: Michael Clifford <mcliffor@redhat.com>

Co-authored-by: Michael Clifford <mcliffor@redhat.com>
Co-authored-by: Sébastien Han <seb@redhat.com>
2 people authored and openshift-merge-bot[bot] committed Oct 22, 2024
1 parent cbf4a40 commit f3697a9
Showing 2 changed files with 139 additions and 1 deletion.
138 changes: 138 additions & 0 deletions instructlab/standalone/README.md
@@ -9,6 +9,144 @@ of models without relying on centralized orchestration tools like KubeFlow.
The `standalone.py` tool provides support for fetching generated SDG (Synthetic Data Generation) data from an AWS S3 compatible object store.
While AWS S3 is supported, alternative object storage solutions such as Ceph, NooBaa, and MinIO are also compatible.
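
As a rough illustration of what "S3 compatible" means in practice, the sketch below points a standard S3 client at a custom endpoint. It is not taken from `standalone.py`: the function names `s3_client_kwargs` and `download_sdg_data`, their parameters, and the example endpoint URL are all hypothetical, and it assumes the `boto3` package is available when the download is actually performed.

```python
from typing import Any, Dict, Optional


def s3_client_kwargs(
    access_key: str,
    secret_key: str,
    endpoint_url: Optional[str] = None,
    verify_tls: bool = True,
) -> Dict[str, Any]:
    """Build keyword arguments for boto3.client("s3", ...).

    Hypothetical helper: leave endpoint_url unset for AWS S3 itself, or set
    it to the address of an S3-compatible store (Ceph, NooBaa, MinIO, ...).
    """
    kwargs: Dict[str, Any] = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
        "verify": verify_tls,
    }
    if endpoint_url:
        kwargs["endpoint_url"] = endpoint_url
    return kwargs


def download_sdg_data(bucket: str, key: str, dest: str, **client_kwargs: Any) -> None:
    """Download one object to a local path (hypothetical helper)."""
    # Imported lazily so the kwargs helper above works without boto3 installed.
    import boto3

    client = boto3.client("s3", **client_kwargs)
    client.download_file(bucket, key, dest)
```

With a MinIO deployment, for example, the same download call would only differ in the `endpoint_url` passed to `s3_client_kwargs`; the bucket and key handling stay the same.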

## Overall end-to-end workflow

```text
+-------------------------------+
| Kubernetes Job                |
| "data-download"               |
+-------------------------------+
| Init Container                |
| "download-data-object-store"  |
| (Fetches data from object     |
|  storage)                     |
+-------------------------------+
| Main Container                |
| "sdg-data-preprocess"         |
| (Processes the downloaded     |
|  data)                        |
+-------------------------------+
                |
                v
+-------------------------------+
| "watch for completion"        |
+-------------------------------+
                |
                v
+-----------------------------------+
| PyTorchJob CR training phase 1    |
|                                   |
|      +---------------------+      |
|      | Master Pod          |      |
|      | (Trains and         |      |
|      |  Coordinates the    |      |
|      |  distributed        |      |
|      |  training)          |      |
|      +---------------------+      |
|                 |                 |
|                 v                 |
|      +---------------------+      |
|      | Worker Pod 1        |      |
|      | (Handles part of    |      |
|      |  the training)      |      |
|      +---------------------+      |
|                 |                 |
|                 v                 |
|      +---------------------+      |
|      | Worker Pod 2        |      |
|      | (Handles part of    |      |
|      |  the training)      |      |
|      +---------------------+      |
+-----------------------------------+
                |
                v
+-------------------------------+
| "wait for completion"         |
+-------------------------------+
                |
                v
+-----------------------------------+
| PyTorchJob CR training phase 2    |
|                                   |
|      +---------------------+      |
|      | Master Pod          |      |
|      | (Trains and         |      |
|      |  Coordinates the    |      |
|      |  distributed        |      |
|      |  training)          |      |
|      +---------------------+      |
|                 |                 |
|                 v                 |
|      +---------------------+      |
|      | Worker Pod 1        |      |
|      | (Handles part of    |      |
|      |  the training)      |      |
|      +---------------------+      |
|                 |                 |
|                 v                 |
|      +---------------------+      |
|      | Worker Pod 2        |      |
|      | (Handles part of    |      |
|      |  the training)      |      |
|      +---------------------+      |
+-----------------------------------+
                |
                v
+-------------------------------+
| "wait for completion"         |
+-------------------------------+
                |
                v
+-------------------------------+
| Kubernetes Job                |
| "eval-mt-bench"               |
+-------------------------------+
| Init Container                |
| "run-eval-mt-bench"           |
| (Runs evaluation on MT Bench) |
+-------------------------------+
| Main Container                |
| "output-eval-mt-bench-scores" |
| (Outputs evaluation scores)   |
+-------------------------------+
                |
                v
+-------------------------------+
| "wait for completion"         |
+-------------------------------+
                |
                v
+-------------------------------+
| Kubernetes Job                |
| "eval-final"                  |
+-------------------------------+
| Init Container                |
| "run-eval-final"              |
| (Runs final evaluation)       |
+-------------------------------+
| Main Container                |
| "output-eval-final-scores"    |
| (Outputs final evaluation     |
|  scores)                      |
+-------------------------------+
                |
                v
+-------------------------------+
| "wait for completion"         |
+-------------------------------+
                |
                v
+-------------------------------+
| Kubernetes Job                |
| "trained-model-upload"        |
+-------------------------------+
| Main Container                |
| "upload-data-object-store"    |
| (Uploads the trained model to |
|  the object storage)          |
+-------------------------------+
```
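
The diagram boils down to a strictly sequential pipeline: submit a stage, wait for it to complete, then move on. The sketch below shows that control flow only and is not the logic in `standalone.py`. The `Stage` class, `run_pipeline`, and the stage names `training-phase-1` and `training-phase-2` are hypothetical, and the `submit` and `wait_for_completion` callables stand in for whatever actually creates and watches the Kubernetes resources. Unlike the diagram, the sketch also waits on the final upload stage.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Stage:
    name: str  # resource name (training names are hypothetical labels)
    kind: str  # "Job" or "PyTorchJob"


# Stage order as drawn in the workflow diagram.
PIPELINE: List[Stage] = [
    Stage("data-download", "Job"),
    Stage("training-phase-1", "PyTorchJob"),
    Stage("training-phase-2", "PyTorchJob"),
    Stage("eval-mt-bench", "Job"),
    Stage("eval-final", "Job"),
    Stage("trained-model-upload", "Job"),
]


def run_pipeline(
    stages: Sequence[Stage],
    submit: Callable[[Stage], None],
    wait_for_completion: Callable[[Stage], bool],
) -> List[str]:
    """Run stages one at a time; stop at the first stage that does not complete."""
    completed: List[str] = []
    for stage in stages:
        submit(stage)
        if not wait_for_completion(stage):
            raise RuntimeError(
                f"stage {stage.name!r} did not complete; finished so far: {completed}"
            )
        completed.append(stage.name)
    return completed


if __name__ == "__main__":
    # Dry run with stand-in callables instead of real cluster calls.
    run_pipeline(
        PIPELINE,
        submit=lambda s: print(f"submit {s.kind} {s.name}"),
        wait_for_completion=lambda s: True,
    )
```

Passing the two callables in keeps the ordering logic separate from the cluster client, so the sequence can be exercised without a cluster.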

## Requirements

The `standalone.py` script is designed to run within a Kubernetes environment. The following requirements must be met:
2 changes: 1 addition & 1 deletion instructlab/standalone/standalone.py
@@ -3212,4 +3212,4 @@ def upload_trained_model(ctx: click.Context):
logger.info("Failed to load kube config. Trying in-cluster config")
kubernetes.config.load_incluster_config()

-cli()
+cli()
