DQN is an algorithm that addresses the question: how do we make reinforcement learning look more like supervised learning? Common problems that consistently show up in value-based reinforcement learning are:
- The data is not independent and identically distributed (the IID assumption is violated)
- The targets are non-stationary
To overcome these problems, we need to tweak the algorithm so that the data looks more IID and the targets stay fixed.
Solutions (which form part of the DQN we see today):
To make the target values more stationary, we can maintain a separate target network: a copy of the online network that is frozen for multiple steps and used only for computing the targets.
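A minimal sketch of the target-network idea, assuming a PyTorch setup; the names `policy_net`, `target_net`, `TARGET_UPDATE_EVERY`, and `maybe_sync_target` are illustrative, not from any particular implementation:

```python
import copy
import torch.nn as nn

# Hypothetical online (policy) network; any nn.Module would do here.
policy_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

# Target network: starts as an exact copy of the online network.
target_net = copy.deepcopy(policy_net)
target_net.eval()  # no gradients are ever taken through the target network

TARGET_UPDATE_EVERY = 1_000  # assumed sync interval, in gradient steps


def maybe_sync_target(step: int) -> None:
    """Copy the online weights into the target network every N steps;
    between syncs, the targets computed from target_net stay fixed."""
    if step % TARGET_UPDATE_EVERY == 0:
        target_net.load_state_dict(policy_net.state_dict())
```

Between syncs, the bootstrapped targets computed from `target_net` do not move as the online network is updated, which is what makes them (approximately) stationary.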
Use "replay" of already seen experiences (Experience Replay) , often referred to as the replay buffer or a replay memory and holds experience samples for several steps, allowing the sampling of mini-batches from a broad-set of past experiences.
DQN with Replay memory