This is an attempt at Neural Architecture Search in Deep Reinforcement Learning. As a start, I tried it on Lunar Lander with a DQN function approximator.
- The REINFORCE rule with a baseline is used as the controller's loss function, following this paper on NAS: Paper link (a minimal sketch of the update is shown after this list).
- Implemented ENAS (Efficient Neural Architecture Search) by Google Brain for this. Paper link
- Also, this DQN Tutorial was extremely helpful!
- Modified the DQN to support manual weight initialisation, which is key to ENAS's weight sharing.
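The controller update is a REINFORCE step scaled by the advantage (reward minus a moving-average baseline). A minimal PyTorch sketch, with illustrative names rather than the repo's actual code:

```python
import torch

def controller_loss(log_probs, reward, baseline):
    # REINFORCE with baseline: `log_probs` is a list of 0-dim tensors,
    # the log-probabilities of each decision the controller sampled;
    # `reward` is the score of the resulting child model.
    advantage = reward - baseline
    # Minimising this maximises the expected reward of sampled children.
    return -torch.stack(log_probs).sum() * advantage

def update_baseline(baseline, reward, decay=0.95):
    # Exponential moving average of past rewards reduces gradient variance.
    return decay * baseline + (1.0 - decay) * reward
```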
The environment is considered solved when you reach an average score of 200 over 100 consecutive episodes!
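A minimal sketch of that check, assuming episode scores are kept in a rolling window:

```python
from collections import deque
import numpy as np

scores = deque(maxlen=100)  # scores of the last 100 episodes

def is_solved(scores):
    # Solved: average score of at least 200 over the last 100 episodes.
    return len(scores) == 100 and np.mean(scores) >= 200
```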
Two-layer feedforward neural networks with no skip connections; each layer chooses a dense size from (64, 128, 256, 1024, 2048) and an activation from (sigmoid, relu) => (5 × 2)² = 100 architectures
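A sketch of how this search space can be enumerated and turned into a child Q-network (assuming PyTorch; the helper names are illustrative):

```python
import itertools
import torch.nn as nn

LAYER_SIZES = (64, 128, 256, 1024, 2048)
ACTIVATIONS = {"sigmoid": nn.Sigmoid, "relu": nn.ReLU}

# One (size, activation) choice per layer: (5 * 2) ** 2 = 100 architectures.
SEARCH_SPACE = list(itertools.product(
    itertools.product(LAYER_SIZES, ACTIVATIONS), repeat=2))
assert len(SEARCH_SPACE) == 100

def build_child(arch, n_inputs=8, n_actions=4):
    # LunarLander has an 8-dim state and 4 discrete actions. In ENAS the
    # Linear layers are initialised from the shared weight pool rather
    # than randomly, which is why manual weight initialisation matters.
    (size1, act1), (size2, act2) = arch
    return nn.Sequential(
        nn.Linear(n_inputs, size1), ACTIVATIONS[act1](),
        nn.Linear(size1, size2), ACTIVATIONS[act2](),
        nn.Linear(size2, n_actions),
    )
```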
(I didn't do much tuning here; the reward design is somewhat arbitrary, except that I divide the average score by 30 for normalisation.)
- If the model converges, the trained model is run on a different env seed for 500 episodes and the average score is calculated: reward = average_score / 30 (see the sketch after these points)
- If the model doesn't converge, reward = 1e-5
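Put together, the reward computation looks roughly like this (a sketch assuming the classic gym API and a PyTorch child model; the seed value and helper names are illustrative):

```python
import gym
import torch

def run_episode(model, env):
    # Roll out one greedy episode and return its total score
    # (classic gym API: reset() returns obs, step() returns a 4-tuple).
    state, done, score = env.reset(), False, 0.0
    while not done:
        with torch.no_grad():
            action = int(model(torch.as_tensor(state, dtype=torch.float32)).argmax())
        state, reward, done, _ = env.step(action)
        score += reward
    return score

def architecture_reward(converged, model, n_episodes=500, seed=123):
    if not converged:
        return 1e-5  # tiny positive reward for non-converged children
    env = gym.make("LunarLander-v2")
    env.seed(seed)  # evaluate on a seed different from the training one
    scores = [run_episode(model, env) for _ in range(n_episodes)]
    return sum(scores) / n_episodes / 30  # average score, normalised by 30
```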
In our example, after ~15 iterations the controller starts sampling only a set of well-performing architectures, since its sampling policy has improved; their performance is comparable. Obviously I don't get a single best model, because the policy is stochastic and it is highly unlikely to sample just one best model at the end of training. Plot -
~12 hrs on a single NVIDIA GTX 1050 Ti => 83.33% faster than brute force (exhaustive search would have taken ~72 hrs). The advantage is not that big here because this is a simple example with a search space of just 100 architectures; the difference would be huge for a larger search space.
Run the search and plot the controller's performance:
- `python enas_contoller.py`
- `python plot_controller_performance.py`
Directly watch one of the best models from our search space perform, i.e., one that the controller samples at the end of training:
- `sudo apt-get install ffmpeg`
- `cd best_model`
- `python play_lunar_video.py`
*Tested across various environment seeds.
2) NAS techniques so far are not good enough to find a globally optimal or state-of-the-art architecture, and they still carry human bias, but they do help in efficiently finding the best architecture from a given search space. Since the neural networks used in RL applications so far are not overly complex, this can be a good option!
- Randomness in the training data due to the epsilon-greedy approach used to handle the exploration-exploitation trade-off (see the sketch after this list).
- Scalability issues still exist for more complex and larger networks such as CNNs, where training even a single CNN takes a lot of time.
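For reference, the epsilon-greedy action selection that introduces this randomness (a standard sketch, not the repo's exact code):

```python
import random
import torch

def epsilon_greedy_action(q_net, state, epsilon, n_actions=4):
    # Explore with probability epsilon, otherwise exploit the greedy action.
    # This randomness makes each child's training data, and therefore its
    # reward signal, noisy.
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(torch.as_tensor(state, dtype=torch.float32)).argmax())
```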