Benchmarking the ChainerRL Library in OpenAI Gym Environments
- Benchmarked RL algorithms: Deep Deterministic Policy Gradient (DDPG), Trust Region Policy Optimization (TRPO), and Proximal Policy Optimization (PPO).

OpenAI Gym is an open-source interface to reinforcement learning tasks: the gym library provides an easy-to-use suite of reinforcement learning environments behind a common API.
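As a rough illustration of that common interface, the sketch below implements a hypothetical toy task, `LineWalkEnv`, using the classic Gym calling convention that ChainerRL-era code relies on: `reset()` returns an observation and `step(action)` returns `(observation, reward, done, info)`. The environment and the `run_episode` helper are invented for this example; a real Gym environment would be created with `gym.make(...)` instead. Gym 0.26 and later changed `reset` and `step` to return additional values, so this sketch follows the older API only.

```python
import random


class LineWalkEnv:
    """Hypothetical toy task following the classic Gym calling convention."""

    def __init__(self, goal=3, max_steps=20):
        self.goal = goal
        self.max_steps = max_steps
        self.position = 0
        self.t = 0

    def reset(self):
        """Start a new episode and return the first observation."""
        self.position = 0
        self.t = 0
        return self.position

    def step(self, action):
        """Apply action (-1 = left, +1 = right); every step costs -1 reward."""
        self.position += action
        self.t += 1
        reached = self.position == self.goal
        done = reached or self.t >= self.max_steps
        return self.position, -1.0, done, {"reached_goal": reached}


def run_episode(env, policy):
    """Agent-environment loop: observe, act, and accumulate reward until done."""
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(observation)
        observation, reward, done, info = env.step(action)
        total_reward += reward
    return total_reward


if __name__ == "__main__":
    env = LineWalkEnv()
    print(run_episode(env, lambda obs: 1))                       # always move right
    print(run_episode(env, lambda obs: random.choice([-1, 1])))  # random actions
```

Any agent that only calls `reset` and `step` can be swapped between environments, which is what makes a benchmark across several Gym tasks straightforward.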

OpenAI Gym offers many environments; we use the classic-control Pendulum environment and the Box2D Bipedal Walker environment (BipedalWalker in Gym, referred to here as Bipedal Walker2D).
- Pendulum
  - States: cosine and sine of the pendulum angle (measured from the upright position), plus the angular velocity.
  - Action: joint effort (torque).
- Bipedal Walker2D
  - Observations: 14 body-state values (hull angle, hull angular velocity, hip joint angle, hip joint speed, knee joint angle, knee joint speed, etc.) plus 10 lidar rangefinder readings.
  - 4 actions: Hip_1 (torque / velocity), Hip_2 (torque / velocity), Knee_1 (torque / velocity), and Knee_2 (torque / velocity).
  - Reward: 300+ points for walking to the far end; if the robot falls, it gets -100.
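The Pendulum state, action, and reward can be sketched in plain Python. The constants and update rule below approximately follow Gym's classic Pendulum implementation, but they are stated here as assumptions rather than copied from a specific Gym release, and `random_rollout` (including its 200-step episode length) is a hypothetical helper. Because every term of the cost is non-negative, the per-step reward is at most 0, which is why all Pendulum rewards in the results tables are negative.

```python
import math
import random

# Assumed constants, modeled on Gym's classic Pendulum environment.
GRAVITY, MASS, LENGTH, DT = 10.0, 1.0, 1.0, 0.05
MAX_SPEED, MAX_TORQUE = 8.0, 2.0


def clip(value, low, high):
    return max(low, min(high, value))


def angle_normalize(theta):
    """Wrap an angle to the interval [-pi, pi)."""
    return ((theta + math.pi) % (2 * math.pi)) - math.pi


def observe(theta, theta_dot):
    """Observation: cos and sin of the angle (0 = upright) and the angular velocity."""
    return (math.cos(theta), math.sin(theta), theta_dot)


def pendulum_step(theta, theta_dot, torque):
    """Advance one time step and return (theta, theta_dot, reward)."""
    u = clip(torque, -MAX_TORQUE, MAX_TORQUE)  # joint effort is bounded
    # Cost: distance from upright + angular speed + effort; the reward is its negative.
    cost = angle_normalize(theta) ** 2 + 0.1 * theta_dot ** 2 + 0.001 * u ** 2
    # Gravity pushes the pendulum away from upright; the torque counteracts it.
    theta_dot = theta_dot + (3 * GRAVITY / (2 * LENGTH) * math.sin(theta)
                             + 3.0 / (MASS * LENGTH ** 2) * u) * DT
    theta_dot = clip(theta_dot, -MAX_SPEED, MAX_SPEED)
    theta = theta + theta_dot * DT
    return theta, theta_dot, -cost


def random_rollout(steps=200, seed=0):
    """Total reward of an episode in which every action is a random torque."""
    rng = random.Random(seed)
    theta, theta_dot = rng.uniform(-math.pi, math.pi), rng.uniform(-1.0, 1.0)
    total = 0.0
    for _ in range(steps):
        action = rng.uniform(-MAX_TORQUE, MAX_TORQUE)
        theta, theta_dot, reward = pendulum_step(theta, theta_dot, action)
        total += reward
    return total


if __name__ == "__main__":
    print(random_rollout())
```

With these assumed constants the worst per-step cost is about pi^2 + 0.1 * 8^2 + 0.001 * 2^2, roughly 16.27, so a 200-step episode cannot score below about -3255.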

DDPG is a model-free, off-policy actor-critic algorithm that uses deep function approximators to learn policies in high-dimensional, continuous action spaces. It is based on the deterministic policy gradient (DPG) algorithm and combines the actor-critic approach with insights from the success of the Deep Q-Network (DQN), namely a replay buffer and slowly updated target networks.
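The DQN-inspired ingredients can be sketched without a deep learning framework. In this minimal sketch, assuming the actor and critic are plain Python callables (hypothetical stand-ins for neural networks), `ReplayBuffer` stores transitions for off-policy reuse, `critic_target` computes the bootstrapped critic target from the target actor and target critic, and `soft_update` moves target parameters slowly toward the online ones. This is not ChainerRL's implementation; all names and default values are illustrative.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def append(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling decorrelates consecutive transitions.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def critic_target(reward, next_state, done, target_actor, target_critic, gamma=0.99):
    """Bootstrapped target y = r + gamma * (1 - done) * Q'(s', mu'(s'))."""
    next_action = target_actor(next_state)
    return reward + gamma * (1.0 - float(done)) * target_critic(next_state, next_action)


def soft_update(target_params, online_params, tau=0.001):
    """Polyak averaging: theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * w + (1.0 - tau) * w_t for w_t, w in zip(target_params, online_params)]


if __name__ == "__main__":
    buffer = ReplayBuffer(capacity=1000)
    for t in range(10):
        buffer.append(state=t, action=0.1 * t, reward=-1.0, next_state=t + 1, done=False)
    batch = buffer.sample(4, rng=random.Random(0))
    # Hypothetical stand-ins for the target actor and critic networks.
    actor = lambda s: 0.5 * s
    critic = lambda s, a: -abs(s) - abs(a)
    print([critic_target(r, s2, d, actor, critic) for (_, _, r, s2, d) in batch])
```

In a full agent, the critic is regressed toward `critic_target` on sampled batches, the actor follows the critic's gradient with respect to the action, and `soft_update` is applied to both target networks after each update.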

PPO is an on-policy policy optimization method that performs each policy update with multiple epochs of stochastic gradient ascent on a clipped surrogate objective, which keeps the new policy close to the old one.
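A minimal sketch of PPO's two ingredients, assuming probability ratios and advantage estimates have already been computed for a rollout: the clipped surrogate objective (with the commonly used epsilon = 0.2) and the loop that reuses the same rollout for several epochs of shuffled minibatches. The gradient step itself is left out because it would require an autodiff framework; the function names are hypothetical.

```python
import random


def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over a minibatch.

    ratios[i] = pi_new(a_i | s_i) / pi_old(a_i | s_i); advantages[i] = A_i.
    """
    total = 0.0
    for r, adv in zip(ratios, advantages):
        clipped_r = max(1.0 - epsilon, min(1.0 + epsilon, r))
        # The minimum is a pessimistic bound: pushing r beyond the clip range brings no extra gain.
        total += min(r * adv, clipped_r * adv)
    return total / len(ratios)


def minibatch_epochs(num_samples, batch_size, epochs, seed=0):
    """Yield shuffled index minibatches, passing over the same rollout several times."""
    rng = random.Random(seed)
    indices = list(range(num_samples))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, num_samples, batch_size):
            yield indices[start:start + batch_size]


if __name__ == "__main__":
    print(ppo_clip_objective([1.5, 0.5], [1.0, -1.0]))
    for batch in minibatch_epochs(num_samples=8, batch_size=4, epochs=2):
        print(batch)  # a gradient-ascent step on the objective would go here
```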
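
TRPO is a model-free, on-policy policy optimization method that is effective for optimizing large nonlinear policies such as neural networks. Each update maximizes a surrogate objective subject to a limit on the KL divergence between the old and new policies (the trust region).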
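TRPO solves its constrained update approximately: it computes a search direction x close to H^-1 g with the conjugate gradient algorithm, using only products of the Fisher matrix H with vectors, then scales the step s so that the quadratic approximation of the KL divergence, 0.5 * s^T H s, equals the limit `max_kl`. The sketch below assumes a caller-supplied `fisher_matvec` function (hypothetical) and omits the backtracking line search that full TRPO applies afterwards.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(matvec, b, iters=10, tol=1e-10):
    """Approximately solve H x = b for symmetric positive-definite H,
    using only matrix-vector products v -> H v."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - H x, with x = 0 initially
    p = list(b)  # current search direction
    rs_old = dot(r, r)
    for _ in range(iters):
        if rs_old < tol:
            break
        hp = matvec(p)
        alpha = rs_old / dot(p, hp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * hpi for ri, hpi in zip(r, hp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def trpo_step(grad, fisher_matvec, max_kl=0.01):
    """Step s = beta * x with x ~ H^-1 g, scaled so that 0.5 * s^T H s = max_kl."""
    x = conjugate_gradient(fisher_matvec, grad)
    beta = math.sqrt(2.0 * max_kl / dot(x, fisher_matvec(x)))
    return [beta * xi for xi in x]


if __name__ == "__main__":
    # Hypothetical 2x2 Fisher matrix H = [[4, 1], [1, 3]] and policy gradient g.
    fisher = lambda v: [4 * v[0] + v[1], v[0] + 3 * v[1]]
    print(conjugate_gradient(fisher, [1.0, 2.0]))  # close to [1/11, 7/11]
    print(trpo_step([1.0, 2.0], fisher))
```

Only matrix-vector products are needed, so the Fisher matrix of a large neural-network policy never has to be formed explicitly.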

- Pendulum

| | TRPO | PPO | DDPG |
|---|---|---|---|
| Mean Reward | -1216 | -1252 | -594 |
| Maximum Reward | -986 | -489 | -371 |

- Bipedal Walker2D

| | TRPO | PPO | DDPG |
|---|---|---|---|
| Mean Reward | 120 | 163 | -96 |
| Maximum Reward | 183 | 262 | -25 |

- Random Actions
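
DDPG achieves the best reward on Pendulum. A plausible reason is that it is designed for continuous action spaces and, as an off-policy method, reuses past transitions from its replay buffer, which generally makes it more sample-efficient than the on-policy TRPO and PPO.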

On Bipedal Walker2D, PPO achieves the best reward, followed by TRPO, while DDPG's rewards remain negative.
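These comparisons can be checked directly against the numbers reported in the two results tables. The snippet only re-enters those values and picks the best algorithm per environment and metric; `RESULTS` and `best_algorithm` are names introduced for this illustration.

```python
# Rewards copied from the Pendulum and Bipedal Walker2D results tables.
RESULTS = {
    "Pendulum": {
        "TRPO": {"mean": -1216, "max": -986},
        "PPO": {"mean": -1252, "max": -489},
        "DDPG": {"mean": -594, "max": -371},
    },
    "Bipedal Walker2D": {
        "TRPO": {"mean": 120, "max": 183},
        "PPO": {"mean": 163, "max": 262},
        "DDPG": {"mean": -96, "max": -25},
    },
}


def best_algorithm(env_name, metric="mean"):
    """Return the algorithm with the highest reported value of `metric`."""
    scores = RESULTS[env_name]
    return max(scores, key=lambda algo: scores[algo][metric])


for env_name in RESULTS:
    print(env_name, best_algorithm(env_name, "mean"), best_algorithm(env_name, "max"))
```

It prints `Pendulum DDPG DDPG` and `Bipedal Walker2D PPO PPO`. The Pendulum table also shows that PPO's maximum reward (-489) beats TRPO's (-986) even though PPO's mean is slightly lower.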
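
PPO reaches its best reward faster than TRPO because it optimizes its clipped objective with first-order stochastic gradient ascent instead of using the conjugate gradient algorithm, which TRPO needs to approximately solve its KL-constrained update.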
Install the OpenAI Gym environments (BipedalWalker also needs the Box2D extra):
```
pip3 install gym
pip3 install 'gym[box2d]'
```
Install the ChainerRL library:
```
pip3 install chainerrl
```
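After installing, one quick way to confirm that the packages are importable is to ask Python's import system to locate them without importing them. `check_packages` is a hypothetical helper, not part of Gym or ChainerRL; `Box2D` is included because Gym's BipedalWalker environment depends on the Box2D physics engine.

```python
import importlib.util


def check_packages(names):
    """Map each top-level module name to whether Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages(["gym", "chainerrl", "Box2D"]).items():
        print(f"{name}: {'installed' if found else 'missing'}")
```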