SARSA, or State-Action-Reward-State-Action, is an on-policy TD(0) control algorithm in reinforcement learning. It follows the Generalised Policy Iteration (GPI) strategy: as the policy π is made greedy with respect to the state-action value function, the state-action value function in turn moves closer to the optimal one. Our aim is to estimate Qπ(s, a) under the current policy π for all state-action (s, a) pairs.
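To make "greedy with respect to the state-action value function" concrete, here is a minimal sketch of an epsilon-greedy policy read off a tabular Q estimate. The dictionary-keyed Q table and the `epsilon_greedy` helper are illustrative assumptions, not part of any particular library.

```python
import random
from collections import defaultdict

# Tabular estimate Q(s, a), zero-initialised for every state-action pair.
Q = defaultdict(float)

def epsilon_greedy(state, actions, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)                      # exploration
    return max(actions, key=lambda a: Q[(state, a)])       # exploitation: greedy w.r.t. Q
```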
The algorithm works as follows:

- We learn the state-action value function Q(s, a) rather than the state-value function V(s).
- Here, Qπ(s, a) is the estimate under the current behavior policy π for all state-action pairs (s, a).
- Initialise a starting state S (S should not be a terminal state).
- Choose an action A in S using an epsilon-greedy (or epsilon-soft) policy.
- Take action A, then record the next state S′ and the reward R.
- Choose the next action A′ from S′ using the same policy, and update the function:
  Q(S, A) ← Q(S, A) + α[R + γQ(S′, A′) − Q(S, A)]
- This loop runs until a terminal state is reached, where Q(S′, A′) = 0. A sketch of the full loop follows below.
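Putting the steps above together, one SARSA episode might look like the sketch below. The `env.reset()` / `env.step(action)` interface (returning the next state, the reward, and a done flag) and the hyperparameter values are assumptions for illustration, not a fixed API.

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.99, 0.1     # assumed hyperparameters
Q = defaultdict(float)                     # Q(s, a), zero-initialised

def epsilon_greedy(state, actions):
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_episode(env, actions):
    """One episode of on-policy TD(0) control (SARSA)."""
    state = env.reset()                        # initialise a non-terminal state S
    action = epsilon_greedy(state, actions)    # choose A in S under the policy
    done = False
    while not done:
        next_state, reward, done = env.step(action)         # observe R and S'
        next_action = epsilon_greedy(next_state, actions)   # choose A' from S' under the same policy
        # Terminal states contribute nothing, i.e. Q(S', A') = 0 once the episode ends.
        target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state, action = next_state, next_action
```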
Q-learning, similar to SARSA, is based on the TD(0) control method, but it is off-policy. Both algorithms aim to estimate the Qπ(s, a) values for all state-action pairs involved in the task.
The only difference lies in how the next action enters the update. In SARSA, the action A′ used in the update is the one actually selected in state S′ by the same policy π (the behavior policy). In Q-learning, the update instead uses the greedy action at S′, i.e., the action with the maximum estimated value, so a random action has less chance of influencing the update. Hence, the update involves more exploitation than exploration.
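For contrast, a minimal Q-learning episode is sketched below, reusing the assumed Q table, hyperparameters, `epsilon_greedy` helper, and `env` interface from the SARSA sketch above. The only change is the bootstrap target, which takes the maximum over actions at S′ rather than the value of the action the policy will actually take.

```python
def q_learning_episode(env, actions):
    """One episode of off-policy TD(0) control (Q-learning)."""
    state = env.reset()
    done = False
    while not done:
        action = epsilon_greedy(state, actions)          # behavior policy still explores
        next_state, reward, done = env.step(action)
        # Target uses the greedy action value at S', not the next action the policy selects.
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state
```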