MPI Operator

A big part of this project is based on the Kubeflow MPI Operator. This project is a stripped-down version, rewritten with kubebuilder according to my own understanding.

The MPI Operator makes it easy to run allreduce-style distributed training on Kubernetes. Please check out this blog post for an introduction to MPI Operator and its industry adoption.

Installation

You’ll need a Kubernetes cluster to run against. You can use KIND to get a local cluster for testing, or run against a remote cluster. You’ll also need kustomize installed. Note: the controller will automatically use the current context in your kubeconfig file (i.e. whatever cluster kubectl cluster-info shows).
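
For local testing, a minimal sketch of spinning up a KIND cluster (assuming the kind CLI is installed; the cluster name here is arbitrary):

kind create cluster --name mpi-operator-test
kubectl cluster-info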

You can deploy the operator by running the following commands. By default, they create a namespace 'sw-mpi-operator' and deploy everything into it.

git clone https://github.com/FFFFFaraway/MPI-Operator
cd MPI-Operator
make deploy

You can check whether the MPI Job custom resource is installed via:

kubectl get crd

The output should include mpijobs.batch.test.bdap.com like the following:

NAME                                       AGE
...
mpijobs.batch.test.bdap.com                4d
...

You can check whether the MPI Job Operator is running via:

kubectl get pod -n sw-mpi-operator
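
The output should show the controller manager Pod in the Running state. Assuming the kubebuilder default naming convention (the exact name and hash suffixes will differ), it will look something like:

NAME                                                 READY   STATUS    RESTARTS   AGE
sw-mpi-operator-controller-manager-6b9f5c7d8-abcde   1/1     Running   0          2m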

Creating an MPI Job

You can create an MPI job by defining an MPIJob config file. For example:

apiVersion: batch.test.bdap.com/v1
kind: MPIJob
metadata:
  name: simple-train-cpu
  namespace: sw-mpi-operator
spec:
  numWorkers: 5
  launcherTemplate:
    spec:
      containers:
        - args:
            - mkdir sample-python-train &&
              cd sample-python-train &&
              horovodrun -np 2 --hostfile $OMPI_MCA_orte_default_hostfile python generate_data.py &&
              horovodrun -np 2 --hostfile $OMPI_MCA_orte_default_hostfile python main.py
          command:
            - /bin/sh
            - -c
          image: farawaya/horovod-torch-cpu
          name: horovod-master
      restartPolicy: Never
  workerTemplate:
    spec:
      containers:
        - args:
            - git clone https://github.com/FFFFFaraway/sample-python-train.git &&
              cd sample-python-train &&
              pip install -r requirements.txt &&
              touch /ready.txt &&
              sleep infinity
          command:
            - /bin/sh
            - -c
          image: farawaya/horovod-torch-cpu
          name: horovod-worker
          readinessProbe:
            exec:
              command:
                - cat
                - /ready.txt
            initialDelaySeconds: 30
            periodSeconds: 5

Deploy the MPIJob resource:

kubectl apply -f config/samples/training_job_cpu.yaml
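
You can then watch the launcher and the numWorkers worker Pods come up:

kubectl get pods -n sw-mpi-operator

Note that the worker Pods clone the training code and install its dependencies before touching /ready.txt, so it may take a while before they report Ready and training starts.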

Note that the launcher pod will use all workers (numWorkers in the spec); the -np parameter passed to horovodrun does not seem to take effect.

Monitoring an MPI Job

You can inspect the logs to see the training progress. When the job starts, access the logs from the launcher pod:

kubectl logs simple-train-cpu-launcher -n sw-mpi-operator
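
To stream the logs while training is running, add the -f flag:

kubectl logs -f simple-train-cpu-launcher -n sw-mpi-operator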

Editing an MPI Job

Modify the MPIJob YAML file and apply it again.

  • However, if the launcher template is modified, you need to manually delete the existing launcher Pod to trigger the update, as shown below.
  • If the worker template is modified, there is no need to delete the worker Pods manually; they will be updated automatically.
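
For example, a sketch of triggering a launcher update for the sample job above (the Pod name follows the <job-name>-launcher pattern used in the monitoring section):

kubectl apply -f config/samples/training_job_cpu.yaml
kubectl delete pod simple-train-cpu-launcher -n sw-mpi-operator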

Deleting an MPI Job

Delete the MPIJob resource. All Pods, ConfigMaps, and RBAC objects created for it will be deleted automatically.

Note that the worker Pods keep running (sleep infinity) after training finishes, so you need to delete the MPIJob manually to avoid occupying GPU resources.
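
For the sample job above, either of the following should work:

kubectl delete -f config/samples/training_job_cpu.yaml
kubectl delete mpijob simple-train-cpu -n sw-mpi-operator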

Uninstall

make undeploy

TODO List

  • Add MPIJob Status
  • Add Defaulter and Validator Webhook
  • Add scheduler

Docker Images

The sample jobs in this README use the farawaya/horovod-torch-cpu image.