Meta Reinforcement Learning for steering tasks
Use case: AWAKE beamline at CERN
Implementation example for the RL4AA'24 workshop
Simon Hirländer, Jan Kaiser, Chenran Xu, Andrea Santamaria Garcia
In this tutorial notebook we will implement all the basic components of a Meta Reinforcement Learning (meta-RL) algorithm to solve a steering task in a linear accelerator.

- Getting started
- Part I: Quick introduction
- Part II: Running PPO on our problem
- Part III: Running MAML on our problem
- Part IV: Model-based RL
Getting started

- The code is tested on Python 3.9 and 3.10.
- You should have conda installed.
- Start by cloning the tutorial repository locally:

git clone https://github.com/RL4AA/rl4aa24-tutorial.git
Getting started
Using Conda

conda env create -f environment.yml

This should create an environment named rl-tutorial and install the necessary packages inside it. Afterwards, activate the environment using:

conda activate rl-tutorial
Getting started
Using venv

If you don't have conda installed, you can alternatively create a virtual environment with

python -m venv rl-tutorial

and activate it with source <venv>/bin/activate (bash) or <venv>\Scripts\activate.bat (Windows).

Then, install the packages with pip within the activated environment:

python -m pip install -r requirements.txt
Afterwards, you should be able to run the provided scripts.
Part I: Quick introduction
AWAKE (The Advanced Proton Driven Plasma Wakefield Acceleration Experiment)
AWAKE is an accelerator R&D project based at CERN. It investigates the use of plasma wakefields driven by a proton bunch to accelerate charged particles.

- The proton beam from the SPS is used to drive wakefields in a plasma cell.
- The wakefields in the plasma accelerate electrons coming from another beamline, like a surfer is accelerated by ocean waves.
- Plasmas can support extremely strong electric fields, with accelerating gradients of GV/m over meter-scale distances, which can reduce the size of future accelerators.
AWAKE (The Advanced Proton Driven Plasma Wakefield Acceleration Experiment)

- Momentum: 10–20 MeV/c
- Electrons per bunch: 1.2e9
- Bunch length: 4 ps
- Pulse repetition rate: 10 Hz

Reference
"Acceleration of electrons in the plasma wakefield of a proton bunch", Nature 561, 363–367 (2018)

The accelerator problem we want to solve
The goal is to minimize the distance $\Delta x_i$ between an initial beam trajectory and a target trajectory at different points $i$ (here marked as "position") in the accelerator, in as few steps as possible.

Reference
"Ultra fast reinforcement learning demonstrated at CERN AWAKE", IPAC (2023)

Formulating the RL problem
The problem is formulated in an episodic manner.

Actions
The actions are the strengths of the 10 corrector magnets that can steer the beam. They are normalized to [-1, 1], corresponding to $\pm$ 100 mm.

States/Observations
The observations are the readings of ten beam position monitors (BPMs), each of which reads the position of the beam at a particular point in the beamline.

Reward
The reward is the negative RMS value of the distance to the target trajectory.
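As a rough sketch (not the tutorial's actual environment code), the action/observation spaces and the reward could be expressed as follows, assuming a Gymnasium-style interface; `delta_x` stands for the distances to the target trajectory at the BPMs:

```python
import numpy as np
from gymnasium import spaces

n_correctors, n_bpms = 10, 10

# Both actions and observations are normalized to [-1, 1]
action_space = spaces.Box(low=-1.0, high=1.0, shape=(n_correctors,), dtype=np.float32)
observation_space = spaces.Box(low=-1.0, high=1.0, shape=(n_bpms,), dtype=np.float32)

def compute_reward(delta_x: np.ndarray) -> float:
    """Negative RMS of the distances to the target trajectory at the BPMs."""
    return -float(np.sqrt(np.mean(delta_x**2)))
```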
Formulating the RL problem

Convergence condition
If the reward surpasses a threshold RMS (-10 mm in our case, 0.1 in normalized scale), the episode ends successfully.

Termination (safety) condition
If the beam hits the wall (any state ≤ -1 or ≥ 1 in normalized scale, i.e. 10 cm), the episode is terminated unsuccessfully.

Episode initialization
All episodes are initialized such that the RMS of the distance to the target trajectory is large. This ensures that the task is not too easy and that the beam starts relatively close to the boundaries, probing the safety settings.
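A minimal sketch of the episode-end logic described above, assuming normalized BPM readings measured relative to the target trajectory (the function and constant names are placeholders, not the tutorial's code):

```python
import numpy as np

RMS_THRESHOLD = 0.1  # 10 mm in normalized scale
WALL_LIMIT = 1.0     # 10 cm in normalized scale

def episode_status(bpm_readings: np.ndarray) -> str:
    """Return 'terminated' (beam lost), 'converged' (target reached) or 'running'."""
    if np.any(np.abs(bpm_readings) >= WALL_LIMIT):
        return "terminated"  # safety condition: beam hit the wall
    if np.sqrt(np.mean(bpm_readings**2)) <= RMS_THRESHOLD:
        return "converged"   # RMS below threshold: episode ends successfully
    return "running"
```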
Agents
In this tutorial we will use:

- PPO (Proximal Policy Optimization)
- MAML (Model-Agnostic Meta-Learning)
Formulating the RL problem
Environments/Tasks

- In this tutorial we will use a variety of environments or tasks:
  - Fixed tasks for evaluation.
  - Randomly sampled tasks for meta-training.
- They are defined by the particular strengths of the quadrupole magnets of the AWAKE beamline. This means that different magnet strength combinations give rise to different environments/tasks:
  - A set of fixed quadrupole strengths is called an "optics".
  - We generate them from the original optics by applying a random scaling factor, as sketched below.

These generated optics might be different from what we expect in real life.
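As an illustration of this task-generation idea (a sketch only; the nominal strengths and the scaling range are made-up placeholders, not the values used in the tutorial):

```python
import numpy as np

def sample_optics(nominal_quads: np.ndarray, rng: np.random.Generator,
                  scale_range=(0.9, 1.1)) -> np.ndarray:
    """Create a new task/optics by randomly scaling the nominal quadrupole strengths."""
    return nominal_quads * rng.uniform(*scale_range, size=nominal_quads.shape)

# Example: sample one random task from a (hypothetical) nominal optics
rng = np.random.default_rng(42)
nominal_quads = np.ones(10)  # placeholder for the AWAKE quadrupole strengths
task_optics = sample_optics(nominal_quads, rng)
```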
Formulating the RL problem
Environments/Tasks
The environment dynamics are determined by the response matrix, which in linear systems encapsulates the dynamics of the problem.
More specifically, given the response matrix $\mathbf{R}$, the change in actions $\Delta a$ (corrector magnet strengths), and the change in states $\Delta s$ (BPM readings), we have:
\begin{align}
    \Delta s &= \mathbf{R}\,\Delta a
\end{align}
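To make this concrete, a minimal numpy illustration of the linear model (the random matrix below is only a stand-in for a task's actual 10×10 response matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.normal(size=(10, 10))         # stand-in for a task's response matrix

delta_a = rng.uniform(-0.1, 0.1, 10)  # change of the 10 corrector strengths
delta_s = R @ delta_a                 # resulting change of the 10 BPM readings
```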
Defining a benchmark policy
During this tutorial we want to compare the trained policies obtained with different methods to a benchmark policy.
For this problem, our "benchmark policy" is simply the inverse of the environment's response matrix. More specifically, we have:
\begin{align}
    \Delta a &= \mathbf{R}^{-1}\,\Delta s
\end{align}
$\implies$ In theory, the problem can be solved by applying the inverse response matrix $\mathbf{R}^{-1}$ directly to the system.
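A minimal sketch of such a benchmark controller, continuing the numpy illustration above; the `scale` factor is an assumption reflecting the safety scaling of the actions mentioned later:

```python
import numpy as np

def benchmark_action(R: np.ndarray, delta_s: np.ndarray, scale: float = 0.5) -> np.ndarray:
    """One benchmark step: Delta a = R^{-1} Delta s, scaled down for safety."""
    delta_a = np.linalg.solve(R, delta_s)  # solves R @ delta_a = delta_s without forming R^{-1}
    return scale * delta_a
```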
+Cheatsheet on RL training 🧐
Consult this at will during the tutorial if you are not familiar with training RL agents!

Training stage
During the training phase, experience is gathered in a buffer that is used to update the weights of the policy through gradient descent. The samples in the buffer can be passed to the gradient descent algorithm in batches, and gradient descent is performed for a number of epochs. This is how the agent "learns".

Evaluation/validation stage
The policy is fixed (no weight updates) and only forward passes are performed.

So how do we compare policies in the evaluation stage? 🧐
- At the beginning of each episode we reset the environment to randomly chosen, suboptimal corrector strengths.
- For each step within the episode we use the inverse of the response matrix (benchmark) or the trained policy to compute the next action (forward passes) until the episode ends (convergence or termination).
- This is performed for different evaluation tasks, to assess how the policy performs on different lattices.
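Schematically, this evaluation loop could look as follows (a sketch assuming a Gymnasium-style `env` and a `policy(obs)` callable, both placeholders for the tutorial's actual objects):

```python
def evaluate(env, policy, n_episodes: int = 10):
    """Roll out a fixed policy (no weight updates) and record episode statistics."""
    episode_lengths, episode_returns = [], []
    for _ in range(n_episodes):
        obs, _ = env.reset()              # random, suboptimal corrector settings
        done, length, ret = False, 0, 0.0
        while not done:
            action = policy(obs)          # forward pass only
            obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated  # convergence or safety termination
            length += 1
            ret += reward
        episode_lengths.append(length)
        episode_returns.append(ret)
    return episode_lengths, episode_returns
```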
Side note:

- The benchmark policy will not immediately find the settings for the target trajectory, because the actions are scaled down for safety reasons.
- We can then compare metrics of both policies.
So how do we compare policies in the evaluation stage? 🧐
- There are 5 fixed evaluation tasks.
- We can choose to evaluate our policy on one of them, several, or all of them.
Part II: Running PPO on our problem
Files relevant to the PPO agent

- ppo.py: runs the training and evaluation stages sequentially.
- configs/maml/verification_tasks.pkl: contains 5 tasks (environments/optics) upon which the policies will be evaluated.
PPO agent settings 🧐
- n_env = 1
- n_steps = 2048 (default parameter)
- buffer_size = n_steps x n_env = 2048
- n_epochs = 10 (default parameter)
- We backpropagate 10 times every time we fill the buffer:
  - backprops = int(total_timesteps / buffer_size) * n_epochs
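For reference, a minimal Stable-Baselines3 sketch with these settings, including the back-of-the-envelope count of gradient-update rounds (the environment construction is omitted; `ppo.py` is where this is actually configured in the tutorial):

```python
from stable_baselines3 import PPO

total_timesteps = 50_000
n_env, n_steps, n_epochs = 1, 2048, 10
buffer_size = n_steps * n_env

# Each time the rollout buffer is filled, PPO runs n_epochs of minibatch gradient descent.
backprops = int(total_timesteps / buffer_size) * n_epochs
print(f"Gradient-update rounds: {backprops}")  # 240 for 50,000 steps

# model = PPO("MlpPolicy", env, n_steps=n_steps, n_epochs=n_epochs)  # env: the AWAKE environment
# model.learn(total_timesteps=total_timesteps)
```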
Questions 💻
Go to ppo.py and change total_timesteps to 100. This can be done by providing the command line argument --steps [num_steps].
Run it in the terminal with python ppo.py --steps 100

$\implies$ Considering the PPO agent settings: will we fill the buffer? What do you expect to happen?
$\implies$ What is the difference in episode length between the benchmark policy and PPO?
$\implies$ Look at the cumulative episode length: which policy takes longer?
$\implies$ Compare both cumulative rewards: which reward is higher and why?
$\implies$ Look at the final reward (-10 * RMS(BPM readings)) and consider the convergence (in red) and termination conditions mentioned before. What can you say about how the episode ended?
If you didn't manage to run it, here is the plot.
Questions 💻
Set total_timesteps to 50,000 this time. Run it in the terminal with python ppo.py --steps 50000

$\implies$ What are the main differences between the untrained and trained PPO policies?

Questions
Part III: Running MAML on our problem
Meta RL
Meta-learning occurs when one learning system progressively adjusts the operation of a second learning system, such that the latter operates with increasing speed and efficiency.
This scenario is often described in terms of two ‘loops’ of learning: an ‘outer loop’ that uses its experiences over many task contexts to gradually adjust parameters that govern the operation of an ‘inner loop’, so that the inner loop can adjust rapidly to new tasks. Meta-RL refers to the case where both the inner and outer loops implement RL algorithms, learning from reward outcomes and optimizing toward behaviors that yield maximal reward.
There are MANY flavors of meta RL.

Optimization-based meta RL in this tutorial
In this tutorial we will adapt the parameters of our model (policy) through gradient descent with the MAML algorithm.

- We have a meta policy $\phi(\theta)$, where $\theta$ are the weights of a neural network. The meta policy starts untrained as $\phi_0$.

Step 1: outer loop
We randomly sample a number of tasks $i$ (in our case $i \in \{1,\dots,8\}$ different lattices, called meta-batch-size in the code) from a task distribution, each one with its particular initial task policy $\varphi_{0}^i = \phi_0$.

Step 2: inner loop
For each task, we gather experience for several episodes, store the experience, and use it to perform gradient descent and update the weights of each task policy $\varphi_{0}^i \rightarrow \varphi_{k}^i$ for $k$ gradient descent steps.

Optimization-based meta RL in this tutorial
Step 3: outer loop
We sum the losses calculated for each task policy and perform gradient descent on the meta policy: $\phi_0 \rightarrow \phi_1$.
$\beta$ is the meta learning rate, $\alpha$ is the fast learning rate (for the inner-loop gradient updates).
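Written out for a single inner-loop gradient step, the standard MAML updates that these steps describe are (with task loss $\mathcal{L}_i$; note that the outer loop in this tutorial actually uses TRPO rather than plain gradient descent, but the structure is the same):
\begin{align}
    \text{inner loop (task } i\text{):} \quad \varphi_{1}^{i} &= \theta - \alpha \nabla_{\theta} \mathcal{L}_{i}(\theta) \\
    \text{outer loop (meta update):} \quad \theta &\leftarrow \theta - \beta \nabla_{\theta} \sum_{i} \mathcal{L}_{i}\!\left(\varphi_{1}^{i}\right)
\end{align}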
Important files

- train.py: performs the meta-training on the AWAKE problem.
- test.py: performs the evaluation of the trained policy.
- configs/: stores the YAML files for the training configurations.
Evaluation of random policy 💻
- We will look at the inner loop only.
- We consider only 1 task for now (task 0), out of the 5 fixed evaluation tasks.
- The policy $\varphi_0^0$ starts as random and adapts for 500 steps (showing the progress every 50 steps).

Run the following code to train the task policy $\varphi_0^0$ for 500 steps:

python test.py --experiment-name tutorial --experiment-type adapt_from_scratch --num-batches=500 --plot-interval=50 --task-ids 0

Once it has run, you can look at the adaptation progress by running:

python read_out_train.py --experiment-name tutorial --experiment-type adapt_from_scratch

You can now run several tasks.
+Evaluation of random policy
- If the code didn't work for you, this is the plot you should get (see below).
- We can see that it fails at the beginning, but it learns with time.
Meta training
Training
We will now train the meta policy $\phi_0$ using randomly sampled tasks.

You can run the meta-training via (but don't run it now!):

python train.py --experiment-name <give_a_meaningful_name>

Note: the meta-training takes about 30 minutes for the current configuration. Therefore, we have provided a pre-trained policy which can be used for evaluation later.
+Evaluation of the trained meta-policy 💻
We will now use the pre-trained policy located in awake/pretrained_policy.th and evaluate it against a certain number of fixed tasks.

python test.py --experiment-name tutorial --experiment-type test_meta --use-meta-policy --policy awake/pretrained_policy.th --num-batches=500 --plot-interval=50 --task-ids 0 1 2 3 4

- Use --task-ids 0 1 2 3 4 to run the evaluation against all 5 tasks, or e.g. --task-ids 0 to evaluate only task 0.
- Here we set the flag --use-meta-policy so that the pre-trained policy is used.

Afterwards, you can look at the adaptation progress by running:

python read_out_train.py --experiment-name tutorial --experiment-type test_meta
Evaluation of the trained meta-policy
$\implies$ What difference can you see compared to the untrained policy?

We can observe that the meta policy can solve the problem for different tasks (i.e. lattices)!

Overall, meta RL shows better performance from the start.

MAML logic 🧐
This part is important if you want to gain a deeper understanding of the MAML algorithm.

- maml_rl/metalearners/maml_trpo.py: implements the TRPO algorithm for the outer loop.
- maml_rl/policies/normal_mlp.py: implements a simple MLP policy for the RL agent.
- maml_rl/utils/reinforcement_learning.py: implements the REINFORCE algorithm for the inner loop.
- maml_rl/samplers/: handles the sampling of the meta-trajectories of the environment using the multiprocessing package.
- maml_rl/baseline.py: a linear baseline for the advantage calculation in RL.
- maml_rl/episodes.py: a custom class to store the results and statistics of the episodes for meta-training.
Further Resources
Getting started in RL

- OpenAI Spinning Up - Very understandable explanations of RL and the most popular algorithms, accompanied by easy-to-read Python implementations.
- Reinforcement Learning with Stable Baselines 3 - YouTube playlist giving a good introduction to RL using Stable Baselines3.
- Build a Doom AI Model with Python - Detailed 3-hour tutorial on applying RL, using DOOM as an example.
- An introduction to Reinforcement Learning - Brief introduction to RL.
- An introduction to Policy Gradient methods - Deep Reinforcement Learning - Brief introduction to PPO.
Further Resources
Papers

- Learning-based optimisation of particle accelerators under partial observability without real-world training - Tuning of electron beam properties on a diagnostic screen using RL.
- Sample-efficient reinforcement learning for CERN accelerator control - Beam trajectory steering using RL with a focus on sample-efficient training.
- Autonomous control of a particle accelerator using deep reinforcement learning - Beam transport through a drift tube linac using RL.
- Basic reinforcement learning techniques to control the intensity of a seeded free-electron laser - RL-based laser alignment and drift recovery.
- Real-time artificial intelligence for accelerator control: A study at the Fermilab Booster - Regulation of a gradient magnet power supply using RL and real-time implementation of the trained agent using field-programmable gate arrays (FPGAs).
- Magnetic control of tokamak plasmas through deep reinforcement learning - Landmark paper on RL for controlling a real-world physical system (plasma in a tokamak fusion reactor).
Further Resources
Literature

- Reinforcement Learning: An Introduction - Standard textbook on RL.

Packages

- Gym - De facto standard for implementing custom environments. Also provides a library of RL tasks widely used for benchmarking.
- Stable Baselines3 - Provides reliable, benchmarked and easy-to-use implementations of the most important RL algorithms.
- Ray RLlib - Part of the Ray Python package, providing implementations of various RL algorithms with a focus on distributed training.