Kubeflow Training Operator is currently at v1.
- Go (1.20 or later)
Create a symbolic link inside your GOPATH to the location where you checked out the code
mkdir -p $(go env GOPATH)/src/github.com/kubeflow
ln -sf ${GIT_TRAINING} $(go env GOPATH)/src/github.com/kubeflow/training-operator
- GIT_TRAINING should be the location where you checked out https://github.com/kubeflow/training-operator
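The two commands above can be wrapped in a small guard so that an unset or wrong GIT_TRAINING does not produce a dangling link. The following is a minimal sketch, not part of the repository: link_training_operator is a hypothetical helper name, and it adds -n to ln so that re-running it replaces the existing link rather than creating a second link inside the checkout.

```shell
# Hypothetical helper (not part of the repository).
# $1: directory where training-operator was checked out (GIT_TRAINING)
# $2: GOPATH root, e.g. the output of `go env GOPATH`
link_training_operator() {
    if [ -z "$1" ] || [ ! -d "$1" ]; then
        echo "checkout directory not found: '$1'" >&2
        return 1
    fi
    mkdir -p "$2/src/github.com/kubeflow" || return 1
    # -n replaces an existing link instead of creating a new link inside its target
    ln -sfn "$1" "$2/src/github.com/kubeflow/training-operator"
}

# Usage (requires Go on the PATH):
#   link_training_operator "${GIT_TRAINING}" "$(go env GOPATH)"
```

Passing the GOPATH root as an argument keeps the helper independent of the Go toolchain, which makes it easy to try against a scratch directory first.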
Install dependencies
go mod tidy
Build it
go install github.com/kubeflow/training-operator/cmd/training-operator.v1
Running the operator locally (as opposed to deploying it on a K8s cluster) is convenient for debugging/development.
First, you need to run a Kubernetes cluster locally. There are lots of choices, including local-up-cluster.sh and Minikube.
local-up-cluster.sh runs a single-node Kubernetes cluster locally, whereas Minikube runs a single-node Kubernetes cluster inside a VM. Both are compatible with the controller, but the Kubernetes version should be 1.8 or above.
Notice: If you use local-up-cluster.sh, please make sure that kube-dns is up; see kubernetes/kubernetes#47739 for more details.
We can configure the operator to run locally using the configuration available in your kubeconfig to communicate with a K8s cluster. Set your environment:
export KUBECONFIG=$(echo ~/.kube/config)
export KUBEFLOW_NAMESPACE=<your_namespace>
- KUBEFLOW_NAMESPACE is used when the operator is deployed on Kubernetes; the operator uses this variable to create other internal resources (e.g. the resource lock) in the same namespace. It is optional; the default namespace is used if it is not set.
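The fallback described above can be mirrored in the shell so that the value in use is visible before the operator starts. This is only a sketch: effective_namespace is a hypothetical helper, and the operator applies its own default independently of it.

```shell
# Hypothetical helper: print the namespace that would be in effect.
# Falls back to "default" when the argument is unset or empty,
# matching the behaviour described for KUBEFLOW_NAMESPACE.
effective_namespace() {
    printf '%s\n' "${1:-default}"
}

echo "Namespace in use: $(effective_namespace "${KUBEFLOW_NAMESPACE:-}")"
```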
After the cluster is up, the TFJob CRD should be created on the cluster.
make install
Now we are ready to run the operator locally:
make run
To verify that the local operator is working, create an example job; you should then see the jobs it creates.
cd ./examples/tensorflow/dist-mnist
docker build -f Dockerfile -t kubeflow/tf-dist-mnist-test:1.0 .
kubectl create -f ./tf_job_mnist.yaml
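One way to confirm that the job was picked up is to list TFJob resources and pods with kubectl. This is a sketch under assumptions: show_tfjob_status is a hypothetical helper, the namespace argument is assumed to be default when omitted, and the names that appear depend on the contents of tf_job_mnist.yaml.

```shell
# Hypothetical helper: list TFJobs and pods in a namespace.
# $1: namespace to inspect (assumed to be "default" when omitted)
show_tfjob_status() {
    ns="${1:-default}"
    kubectl get tfjobs --namespace "$ns" || return 1
    kubectl get pods --namespace "$ns"
}

# Usage (requires a running cluster with the TFJob CRD installed):
#   show_tfjob_status default
```

If the first command fails because the tfjobs resource type is unknown, the CRD has most likely not been installed on the cluster yet.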
On Ubuntu, the default go package appears to be gccgo-go, which has problems (see issue). The golang-go package is also really old, so install Go from the golang tarballs instead.
To generate the Python SDK for the operator, run:
./hack/python-sdk/gen-sdk.sh
This command will re-generate the api and model files together with the documentation and model tests.
The following files/folders in sdk/python are auto-generated and should not be modified directly:
sdk/python/docs
sdk/python/kubeflow/training/models
sdk/python/kubeflow/training/*.py
sdk/python/test/*.py
The Training Operator client and public APIs are located here:
sdk/python/kubeflow/training/api
- Use black to format Python code.
- Run the following to install black:
pip install black==23.9.1
- To check your code:
black --check --exclude '/*kubeflow_org_v1*|__init__.py|api_client.py|configuration.py|exceptions.py|rest.py' sdk/