This sample demonstrates how to debug a pipeline with ParallelRunStep and with distributed Tensorflow steps using MPI.
With ParallelRunStep, to make sure we only debug it on a single node, we check that AZ_BATCH_IS_CURRENT_NODE_MASTER
environment variable equals true
.
With MPIConfiguration, it depends on the distributed training framework. Ultimately, you need identify an instance with *rank equal to zero. Look into this guide to understand how to detect the rank.
For example, in Horovod MPI Tensorflow steps in would be horovod.tensorflow.rank()
which must be zero.
We debug each step using a separate port (5678, 5679, 5680).
VS Code compound configuration Python: AML Advanced 3 Listeners
starts 3 listeners.
With that, we can even debug 3 simultaneously running nodes,
even though in this sample it is only a matter of convenience.
If you need to debug a distributed step across multiple nodes or processes per node,
you may need to add your own code for picking individual debugging ports for every instance of your training steps. Instead of passing port number as a parameter, steps can choose ports and "reserve" them by adding an Azure ML Run property - if a property with a certain name was already added, the port has been utilized and therefore cannot be used.
Multiple processes per node require first starting process to initialize a DebugRelay
object with a list of ports to connect to.
You may need to employ a shared data structure and a locking mechanism to make sure processes know when DebugRelay
has been already initialized (checking for azrelay process also works).
Create an environment (see .env.sample
) or set the following environment variables:
WORKSPACE_NAME
- Azure Machine Learning Workspace name (will be created if doesn't exist)TENANT_ID
- Azure Tenant IdSUBSCRIPTION_ID
- Existing Azure Subscription Id (in Azure Tenant above)RESOURCE_GROUP
- Existing Azure Resource Group (in the subscription above)APP_ID
- Azure Active Directory Registered Application Id (Service Principal)APP_SECRET
- Service Principal Password for the App Id aboveREGION
- An Azure region to create a workspace in (if one doesn't exist)COMPUTE_NAME
- name of Azure Machine Learning Compute Cluster (will be created if doesn't exist)PIPELINE_NAME
- name of an Azure Machine Learning Pipeline to publishDEBUG_GLOBAL_AZRELAY_CONNECTION_STRING
- Azure Relay Shared Access Policy connection string (must haveListen
andSend
permissions)DEBUG_GLOBAL_CONNECTION_SECRET_NAME
- AML Key Vault secret name to store connection string in.
- Start debugging with
Python: AML Advanced 3 Listeners
configuration. - Run
python3 samples/azure_ml_advanced/remote_pipeline_demo.py --is-debug true --debug-relay-connection-name <hybrid-connection-name>
in terminal on the same machine. Here hybrid-connection-name is a name of Azure Relay Hybrid Connection which Azure Relay Shared Access Policy above hasListen
andSend
permissions on.