diff --git a/img/learn_to_learn.png b/img/learn_to_learn.png new file mode 100644 index 0000000..7ad6b90 Binary files /dev/null and b/img/learn_to_learn.png differ diff --git a/img/mdp_distribution.png b/img/mdp_distribution.png new file mode 100644 index 0000000..a985e49 Binary files /dev/null and b/img/mdp_distribution.png differ diff --git a/tutorial.ipynb b/tutorial.ipynb index 6f0bc61..35e1cae 100644 --- a/tutorial.ipynb +++ b/tutorial.ipynb @@ -186,7 +186,19 @@ "

Actions

\n", "The actuators are the strengths of 10 corrector magnets that can steer the beam.\n", "They are normalized to [-1, 1]. \n", - "In this tutorial, we apply the action by adding a delta change $\\Delta a$ to the current magnet strengths .\n", + "In this tutorial, we apply the action by adding a delta change $\\Delta a$ to the current magnet strengths.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "

Formulating the RL problem

\n", + "\n", "\n", "

States/Observations

\n", "The observations are the readings of ten beam position monitors (BPMs), which read the position of the beam at a particular point in the beamline. The states are also normalized to [-1,1], corresponding to $\\pm$ 100 mm in the real accelerator.\n", @@ -210,10 +222,10 @@ "The reward is the negative RMS value of the distance to the target trajectory. \n", "\n", "$$\n", - "r(x) = - \\sqrt{ \\frac{1}{10} \\sum_{i=1}^{10} (x_{i} - x^{\\text{target}}_{i})^2},\n", + "r(x) = - \\sqrt{ \\frac{1}{10} \\sum_{i=1}^{10} \\Delta x_{i}^2} \\,, \\ \\ \\ \\Delta x_{i} = x_{i} - x^{\\text{target}}_{i}\n", "$$\n", "\n", - "where $x^{\\text{target}}=\\vec{0}$ for a centered orbit.\n", + "where $x^{\\text{target}}=\\vec{0}$ for a centered trajectory.\n", "\n", "
\n", "\n", @@ -230,17 +242,21 @@ "source": [ "

Formulating the RL problem

\n", "\n", - "

Convergence condition

\n", + "

Successful termination condition

\n", + "\n", "If a threshold RMS (-10 mm in our case, 0.1 in normalized scale) is surpassed,\n", - "the episode ends successfully. \n", + "the episode ends successfully. We cannot measure _exactly_ 0 because of the resolution of the BPMs.\n", + "\n", + "

Unsuccessful termination (safety) condition

\n", "\n", - "

Termination (safety) condition

\n", - "If the beam hits the wall (any state ≤ -1 or ≥ 1 in normalized scale, 10 cm), the episode is terminated unsuccessfully. \n", + "If the beam hits the wall (any state ≤ -1 or ≥ 1 in normalized scale, 10 cm), the episode is terminated unsuccessfully. In this case, the agent receives a large negative reward (all BPMs afterwards are set to the largest value) to discourage the agent.\n", "\n", "

Episode initialization

\n", + "\n", "All episodes are initialised such that the RMS of the distance to the target trajectory is large. This ensures that the task is not too easy and relatively close to the boundaries to probe the safety settings.\n", "\n", "

Agents

\n", + "\n", "In this tutorial we will use:\n", "\n", "- PPO (Proximal Policy Optimization)\n", @@ -261,21 +277,24 @@ "\n", "

\n", "

\n", - " \n", - " 1 task / 1 environment = 1 set of fixed quadrupole strengths\n", + " \n", + " 1 task or 1 environment = 1 set of fixed quadrupole strengths = 1 MDP\n", " \n", "
\n", "

\n", "\n", + "\n", + "\n", + "\n", + "


\n", + "\n", "In this tutorial we will use a variety of environments or tasks:\n", "-

Fixed tasks for evaluation ❗

\n", "-

Randomly sampled tasks from a task distribution for meta-training ❗

\n", "\n", "We generate them from the original, nominal optics, adding a random scaling factor to the quadrupole strengths.\n", "\n", - "
\n", - "\n", - "
" + "\n" ] }, { @@ -323,7 +342,7 @@ " \\Delta a &= \\mathbf{R}^{-1}\\Delta s\n", "\\end{align}\n", "\n", - "$\\implies$ Actions from **RL policy**:\n", + "$\\implies$ Actions from deep **RL policy**:\n", "With the policy we get the actions:\n", "
\n", "\n", @@ -364,7 +383,7 @@ "- This will be performed for different evaluation tasks, just to assess how the policy performs in different lattices.\n", "\n", "Side note:\n", - "- The benchmark policy will not immediately find the settings for the target trajectory, because the actions are scaled down for safety reasons so that the maximum step is within $[-1,1]$ in the normalized space.\n", + "- The benchmark policy will not immediately find the settings for the target trajectory, because the actions are limited so that the maximum step is within $[-1,1]$ in the normalized space.\n", "- We can then compare the metrics of both policies.\n", "
\n", "\n", @@ -455,7 +474,7 @@ "

$\\implies$ What is the difference in episode length between the benchmark policy and PPO?

\n", "

$\implies$ Look at the cumulative episode length: which policy takes longer?

\n", "

$\implies$ Compare both cumulative rewards: which reward is higher, and why?

\n", - "

$\\implies$ Look at the final reward (-10*RMS(BPM readings)) and consider the convergence (in red) and termination conditions mentioned before. What can you say about how the episode was ended?

" + "

$\implies$ Look at the final reward (-10*RMS(BPM readings)) and consider the successful (in red) and unsuccessful termination conditions mentioned before. What can you say about how the episode ended?

" ] }, { @@ -555,10 +574,14 @@ "- We have a meta policy $\\phi(\\theta)$, where $\\theta$ are the weights of a neural network. The meta policy starts untrained $\\phi_0$.\n", "\n", "

Step 1: outer loop

\n", - "We randomly sample a number of tasks $i$ (in our case $i\\in \\{1,\\dots,8\\}$ different lattices, called meta-batch-size in the code) from a task distribution, each one with its particular initial task policy $\\varphi_{0}^i=\\phi_0$.\n", + "\n", + "We randomly sample a number of tasks $i$ (in our case $i\\in \\{1,\\dots,8\\}$ different lattices, called `meta-batch-size` in the code) from a task distribution, each one with its particular initial task policy $\\varphi_{0}^i=\\phi_0$.\n", "\n", "

Step 2: inner loop (adaptation)

\n", - "For each task, we gather experience for several episodes, store the experience, and use it to perform gradient descent and update the weights of each task policy $\\varphi_{0}^i \\rightarrow \\varphi_{k}^i$ for $k$ gradient descent steps." + "\n", + "For each task, we gather experience for several episodes, store the experience, and use it to perform gradient descent and update the weights of each task policy $\\varphi_{0}^i \\rightarrow \\varphi_{1}^i$\n", + "\n", + "This is repeated for $k$ gradient descent steps to generate $\\varphi_{k}^i$." ] }, { @@ -573,7 +596,7 @@ "\n", "

Step 3: outer loop (meta training)

\n", "\n", - "We sum the losses calculated for each **task policy** and perform gradient descent on the **meta policy**\n", + "We generate episodes with the adapted **task policies** $\\varphi_{k}^i$. We sum the losses calculated for each task $\\tau_{i}$ and perform gradient descent on the **meta policy**\n", "$\\phi_0 \\rightarrow \\phi_1$\n", "\n", "
\n", @@ -600,10 +623,10 @@ "We start with a random meta policy, and we initialize the task policies with it: $\\phi_0 = \\varphi_{0}^i$\n", "\n", "```python\n", - "meta_step 0:\n", + "1 meta_step: # Outer loop\n", " sample 8 tasks\n", - " for t in tasks:\n", - " for i in num_steps:\n", + " for task in tasks:\n", + " for fast_step in num_steps: # Inner loop\n", " for fast_batch in fast_batch_size:\n", " rollout 1 episode:\n", " reset corrector_strength\n", @@ -751,7 +774,7 @@ } }, "source": [ - "

We can observe that the meta policy can solve the problem for different tasks (i.e. lattices)!

\n", + "

We can observe that the pre-trained meta policy can solve the problem for different tasks (i.e. lattices) within a few adaptation steps!

\n", "\n", "
\n", "\n",