Benchmark-ChainerRL-library-in-Gym-Environments

Benchmark ChainerRL library in OpenAI Gym Environments

Objectives

  • Benchmark three reinforcement learning algorithms: Deep Deterministic Policy Gradient (DDPG), Trust Region Policy Optimization (TRPO), and Proximal Policy Optimization (PPO).

OpenAI Gym Environments

  • OpenAI Gym is an open-source interface to reinforcement learning tasks; the gym library provides an easy-to-use suite of environments.

  • OpenAI Gym offers many environments. This benchmark uses the classic-control Pendulum environment and the Box2D Bipedal Walker2D environment.
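
A minimal sketch of the Gym interaction loop, here with random actions (Pendulum-v0 is the environment ID from the Gym versions contemporary with ChainerRL; newer Gym releases rename it Pendulum-v1):

import gym

# Create the environment and run one episode with random actions.
env = gym.make('Pendulum-v0')
obs = env.reset()
done = False
episode_return = 0.0
while not done:
    action = env.action_space.sample()       # uniformly random torque
    obs, reward, done, info = env.step(action)
    episode_return += reward
print('random-policy return:', episode_return)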


Code: the implementations are included in this repository.

Observations

Pendulum

  • 3 observations: the cosine and sine of the pendulum's angle, plus its angular velocity.

Bipedal Walker2D

  • 24 observations: 14 body readings (hull angle, hull angular velocity, horizontal and vertical velocity, hip joint angles and speeds, knee joint angles and speeds, leg-ground contact flags) plus 10 lidar rangefinder measurements.

Actions

Pendulum

  • 1 action: the joint effort (a single continuous torque applied to the pendulum).

Bipedal Walker2D

  • 4 actions: Hip_1, Hip_2, Knee_1, and Knee_2, each controlled by torque (or velocity, depending on the control mode).
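
Both spaces can be inspected directly from Gym; a short sketch (the dimension comments reflect the Gym versions contemporary with ChainerRL):

import gym

pendulum = gym.make('Pendulum-v0')
print(pendulum.observation_space)  # Box(3,): cos(angle), sin(angle), angular velocity
print(pendulum.action_space)       # Box(1,): joint torque in [-2, 2]

walker = gym.make('BipedalWalker-v2')
print(walker.observation_space)    # Box(24,): 14 body readings + 10 lidar readings
print(walker.action_space)         # Box(4,): Hip_1, Hip_2, Knee_1, Knee_2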

Reward

Pendulum

  • reward = -(theta^2 + 0.1 * theta_dot^2 + 0.001 * action^2), where theta is the angle from upright normalized to [-pi, pi]; the best achievable per-step reward is 0.

Bipedal Walker2D

  • Over 300 points are awarded for walking to the far end of the terrain; if the robot falls, it receives -100 and the episode ends.
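
The Pendulum cost above can be reproduced directly; a sketch of the per-step reward (the angle is measured from upright and normalized to [-pi, pi]):

import numpy as np

def pendulum_reward(theta, theta_dot, torque):
    # Penalize deviation from upright, angular velocity, and control effort.
    theta = ((theta + np.pi) % (2 * np.pi)) - np.pi  # normalize to [-pi, pi]
    return -(theta ** 2 + 0.1 * theta_dot ** 2 + 0.001 * torque ** 2)

print(pendulum_reward(0.0, 0.0, 0.0))    # 0.0: upright and at rest (best case)
print(pendulum_reward(np.pi, 0.0, 0.0))  # about -9.87: hanging straight down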

Algorithms and Hyperparameters

  • DDPG is a model-free, off-policy actor-critic algorithm that uses deep function approximators to learn policies in high-dimensional, continuous action spaces. It builds on the deterministic policy gradient (DPG) algorithm, combining the actor-critic approach with insights from the success of the Deep Q-Network (DQN).

  • PPO is a policy optimization method that uses multiple epochs of stochastic gradient ascent to perform each policy update.

  • TRPO is a model-free, on-policy optimization method that is effective for optimizing large nonlinear policies such as neural networks.
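
As an illustration of how such an agent is assembled, here is a sketch of a DDPG agent built with ChainerRL for Pendulum. The network sizes, learning rates, noise scale, and buffer size are illustrative placeholders, not the hyperparameters behind the results below:

import chainer
import gym
import numpy as np
from chainerrl import explorers, policies, q_functions, replay_buffer
from chainerrl.agents.ddpg import DDPG, DDPGModel

env = gym.make('Pendulum-v0')
obs_size = env.observation_space.shape[0]
action_size = env.action_space.shape[0]

# Deterministic actor bounded to the action range, and a Q(s, a) critic.
pi = policies.FCDeterministicPolicy(
    obs_size, action_size=action_size,
    n_hidden_layers=2, n_hidden_channels=64,
    min_action=env.action_space.low, max_action=env.action_space.high,
    bound_action=True)
q_func = q_functions.FCSAQFunction(
    obs_size, action_size, n_hidden_layers=2, n_hidden_channels=64)
model = DDPGModel(policy=pi, q_func=q_func)

actor_opt = chainer.optimizers.Adam(alpha=1e-4)
critic_opt = chainer.optimizers.Adam(alpha=1e-3)
actor_opt.setup(model['policy'])
critic_opt.setup(model['q_function'])

agent = DDPG(
    model, actor_opt, critic_opt,
    replay_buffer.ReplayBuffer(capacity=10 ** 5),   # off-policy replay buffer
    gamma=0.99,
    explorer=explorers.AdditiveGaussian(scale=0.1), # exploration noise
    replay_start_size=1000, minibatch_size=64,
    target_update_method='soft', target_update_interval=1,
    soft_update_tau=1e-2,
    phi=lambda x: x.astype(np.float32, copy=False)) # Gym returns float64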

Results

  • Pendulum

                         TRPO     PPO    DDPG
     Mean Reward        -1216   -1252    -594
     Maximum Reward      -986    -489    -371

[Plot: Pendulum results]

  • Bipedal Walker2D

                         TRPO     PPO    DDPG
     Mean Reward          120     163     -96
     Maximum Reward       183     262     -25

[Plot: Bipedal Walker2D results]

Demo

  • Animations of a random policy and of the trained TRPO, PPO, and DDPG agents.

Discussion

  • DDPG achieves the best reward on Pendulum because it is designed for high-dimensional, continuous action spaces and uses a replay buffer.

  • PPO and TRPO achieve the best rewards on Bipedal Walker2D.

  • PPO reaches its best reward faster than TRPO because it uses a first-order gradient approximation instead of TRPO's conjugate gradient algorithm.

Installing

Install the OpenAI Gym environments

pip3 install gym

The Bipedal Walker2D environment may additionally require the Box2D extra (pip3 install 'gym[box2d]').

Install the ChainerRL library

pip3 install chainerrl
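
With both packages installed, training follows ChainerRL's classic agent interface; a minimal sketch, assuming agent and env were built as in the DDPG example above (Pendulum episodes end via Gym's 200-step time limit):

n_episodes = 200
for episode in range(1, n_episodes + 1):
    obs = env.reset()
    reward = 0.0
    done = False
    episode_return = 0.0
    while not done:
        action = agent.act_and_train(obs, reward)  # explore and update
        obs, reward, done, _ = env.step(action)
        episode_return += reward
    agent.stop_episode_and_train(obs, reward, done)
    if episode % 10 == 0:
        print('episode', episode, 'return', episode_return,
              'stats', agent.get_statistics())
agent.save('ddpg_pendulum')  # persist the trained model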
