Benchmark-ChainerRL-library-in-Gym-Environments

Benchmark ChainerRL library in OpenAI Gym Environments

Objectives

Benchmark three RL algorithms: Deep Deterministic Policy Gradient (DDPG), Trust Region Policy Optimization (TRPO), and Proximal Policy Optimization (PPO).

OpenAI Gym Environment

OpenAI Gym is an open-source interface to reinforcement learning tasks; the gym library provides an easy-to-use suite of them.
OpenAI Gym has several environments. We use the classic control environment Pendulum and the Bipedal Walker2D environment.
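
As a rough illustration of how an agent interacts with such an environment, the sketch below runs one episode with random actions. It assumes the classic Gym interface, where `reset()` returns an observation and `step(action)` returns `(observation, reward, done, info)`; newer Gym releases return different tuples. `CountdownEnv` and `run_episode` are hypothetical names introduced here so the example runs without Gym installed.

```python
import random


class CountdownEnv:
    """Hypothetical stand-in exposing the classic Gym reset/step interface."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0]

    def step(self, action):
        self.t += 1
        reward = -abs(action)          # larger actions cost more
        done = self.t >= self.horizon  # episode ends after `horizon` steps
        return [float(self.t)], reward, done, {}


def run_episode(env, policy, max_steps=1000):
    """Run one episode and return the sum of rewards."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        obs, reward, done, _info = env.step(policy(obs))
        total += reward
        if done:
            break
    return total


rng = random.Random(0)
episode_return = run_episode(CountdownEnv(horizon=5),
                             lambda obs: rng.uniform(-1.0, 1.0))
print("random-policy return:", episode_return)
```

With Gym installed, the same loop should work on a real environment by passing the object returned by `gym.make(...)` and a policy that calls `env.action_space.sample()`; the environment ID to use depends on the installed Gym version.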

Codes:

Observations

Pendulum

States: cosine and sine of the angle between the center and the pendulum, plus the pendulum's angular velocity.
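
A minimal sketch of how such a state vector can be built from the raw angle, and how the angle can be recovered from it. The function names are hypothetical, and the third component (angular velocity) follows Gym's documented Pendulum observation rather than anything in this repository's code.

```python
import math


def pendulum_observation(theta, theta_dot):
    """Encode the angle as (cos, sin) so the observation is continuous at +/- pi."""
    return [math.cos(theta), math.sin(theta), theta_dot]


def angle_from_observation(obs):
    """Recover the angle in radians from the (cos, sin) pair."""
    cos_theta, sin_theta, _theta_dot = obs
    return math.atan2(sin_theta, cos_theta)


obs = pendulum_observation(theta=0.0, theta_dot=0.5)
print(obs)
```

Encoding the angle as a cosine and sine pair avoids the jump between -pi and pi that a raw angle would show when the pendulum swings through the bottom.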

Bipedal Walker2D

14 observations: hull angle, hull angular velocity, hip joint angle, hip joint speed, knee joint angle, knee joint speed, etc.

Actions

Pendulum

Joint effort

Bipedal Walker2D

4 Actions: Hip_1 (Torque / Velocity), Hip_2 (Torque / Velocity), Knee_1 (Torque / Velocity) and Knee_2 (Torque / Velocity)

Reward

Pendulum
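
This README does not spell out the Pendulum reward. As a hedged sketch, the function below follows the cost that Gym documents for its Pendulum task: the squared angle from upright, plus small penalties on angular velocity and applied torque. The function names are illustrative, not taken from this repository.

```python
import math


def angle_normalize(theta):
    """Wrap an angle in radians into the interval [-pi, pi)."""
    return ((theta + math.pi) % (2 * math.pi)) - math.pi


def pendulum_reward(theta, theta_dot, torque):
    """Negative cost: zero is best (upright, still, no torque applied)."""
    cost = (angle_normalize(theta) ** 2
            + 0.1 * theta_dot ** 2
            + 0.001 * torque ** 2)
    return -cost


print(pendulum_reward(theta=math.pi, theta_dot=0.0, torque=0.0))
```

Under this formula every step yields a reward of at most zero, which is why the Pendulum scores reported in Results are all negative.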

Bipedal Walker2D

The robot earns 300+ points for walking all the way to the far end. If it falls, it receives -100.

Algorithms and Hyperparameters

DDPG is a model-free, off-policy actor-critic algorithm that uses deep function approximators and can learn policies in high-dimensional, continuous action spaces. It is based on the deterministic policy gradient (DPG) algorithm and combines the actor-critic approach with insights from the success of Deep Q-Network (DQN).
PPO is a policy optimization method that uses multiple epochs of stochastic gradient ascent to perform each policy update.
TRPO is a model-free, on-policy optimization method that is effective for optimizing large nonlinear policies such as neural networks.
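
To make the PPO description above more concrete, here is a minimal sketch of the clipped surrogate objective that PPO commonly maximizes over several epochs per batch. It is plain Python for illustration only, not ChainerRL's implementation; the function name is hypothetical, and `clip_eps=0.2` is a commonly used value assumed here, not a hyperparameter taken from this benchmark.

```python
def ppo_clipped_objective(ratios, advantages, clip_eps=0.2):
    """Average clipped surrogate objective over a batch.

    ratios:     new_policy_prob / old_policy_prob for each sampled action
    advantages: advantage estimate for each sampled action
    """
    if len(ratios) != len(advantages) or not ratios:
        raise ValueError("ratios and advantages must be non-empty and equal length")
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum removes any incentive to move the ratio
        # outside [1 - clip_eps, 1 + clip_eps].
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(ratios)


print(ppo_clipped_objective([1.5, 0.5], [1.0, -1.0]))
```

The clipping plays the role that the explicit trust-region constraint plays in TRPO: it keeps each update close to the policy that collected the data, while needing only ordinary gradient steps.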

Results

Pendulum

	TRPO	PPO	DDPG
Mean Reward	-1216	-1252	-594
Maximum Reward	-986	-489	-371

Bipedal Walker2D

	TRPO	PPO	DDPG
Mean Reward	120	163	-96
Maximum Reward	183	262	-25
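
The tables above report a mean and a maximum reward per algorithm. A minimal sketch of how such summary statistics could be computed from per-episode returns follows; the function name and the sample numbers are hypothetical and are not data from these runs.

```python
from statistics import mean


def summarize_returns(episode_returns):
    """Return the mean and maximum of a list of per-episode returns."""
    if not episode_returns:
        raise ValueError("need at least one episode return")
    return {"mean": mean(episode_returns), "max": max(episode_returns)}


# Made-up returns, for illustration only.
example_returns = [-10.0, -20.0, -30.0]
print(summarize_returns(example_returns))
```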

Demo

Random Actions

TRPO

PPO

DDPG

Discussion

DDPG achieves the best reward in Pendulum because it is designed for high-dimensional continuous spaces and it uses a replay buffer.
PPO and TRPO achieve the best rewards in Bipedal Walker2D.
PPO reaches its best reward faster than TRPO because it uses a gradient-based approximation instead of the conjugate gradient algorithm.
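
The replay buffer mentioned above can be sketched as a fixed-size store of past transitions that is sampled uniformly at random for each update. The class below is a simplified stand-in written for illustration, not ChainerRL's replay buffer; the class and method names are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self._storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self._storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        """Draw a uniform random batch without replacement."""
        return rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(state=step, action=0.0, reward=-1.0, next_state=step + 1, done=False)
batch = buffer.sample(2, rng=random.Random(0))
print(len(buffer), batch)
```

Reusing stored transitions across many updates is what makes an off-policy method like DDPG sample-efficient compared with on-policy methods, which discard data after each policy update.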

Installing

Install the OpenAI Gym environment

pip3 install gym

Install the ChainerRL library

pip3 install chainerrl
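
After installing, a quick check like the sketch below can confirm that both packages are importable. It uses only the standard library; `missing_packages` is a hypothetical helper name, and the module names `gym` and `chainerrl` are assumed to match the pip package names above.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(["gym", "chainerrl"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("gym and chainerrl are both importable")
```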


montaserFath/Benchmark-ChainerRL-library-in-Gym-Environments
