Conservative Q-Learning (CQL) is an offline RL algorithm designed to address the overestimation problem in standard Q-learning when learning from a fixed dataset.
Key features:
- Conservatism: CQL adds a regularization term to the standard Q-learning loss, which penalizes Q-values of out-of-distribution actions.
- Offline Learning: It learns from a pre-collected dataset without interacting with the environment during training.
- Overestimation Mitigation: By being conservative, it helps prevent the overoptimistic value estimates that can occur in offline RL.
In the GridWorld context:
- The agent learns to navigate a grid to reach a goal position.
- The learning process uses only pre-collected data of random trajectories.
- The CQL regularization helps the agent avoid choosing actions that weren't well represented in the dataset (see the sketch below).
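As a rough illustration, a tabular version of this update might look like the following sketch. The grid size, the names (`q_table`, `alpha_cql`), and the exact penalty form (a CQL(H)-style log-sum-exp term over actions minus the Q-value of the dataset action) are assumptions for illustration, not the repository's actual implementation.

```python
import numpy as np

# Illustrative sizes and hyperparameters (assumed, not taken from this repo):
# a 5x5 grid flattened to 25 states, 4 actions (up/down/left/right).
n_states, n_actions = 25, 4
lr, gamma = 0.1, 0.99       # learning rate and discount factor
alpha_cql = 1.0             # strength of the conservative penalty

q_table = np.zeros((n_states, n_actions))

def cql_update(s, a, r, s_next, done):
    """One conservative update on a single dataset transition (s, a, r, s_next, done)."""
    # Standard temporal-difference step toward the Bellman target.
    target = r if done else r + gamma * q_table[s_next].max()
    q_table[s, a] += lr * (target - q_table[s, a])

    # Conservative regularization: minimize logsumexp_a' Q(s, a') - Q(s, a).
    # Its gradient w.r.t. the Q-row is softmax(Q(s, .)) minus a one-hot at the
    # dataset action, so Q-values of unseen actions are pushed down while the
    # action actually taken in the data is pushed back up.
    z = q_table[s] - q_table[s].max()
    soft = np.exp(z) / np.exp(z).sum()
    q_table[s] -= lr * alpha_cql * soft
    q_table[s, a] += lr * alpha_cql
```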
Q-Learning (QL) is a model-free reinforcement learning algorithm that learns the value of taking each action in each state.
Key features:
- Temporal-Difference Updates: It iteratively updates Q-values using the reward received and the estimated value of the next state.
- Off-policy: It can learn from data collected by any policy, not just the one it's currently following.
- Exploration-Exploitation: Typically uses an epsilon-greedy strategy to balance between exploring new actions and exploiting known good actions.
In the GridWorld context:
- The agent learns to associate each state-action pair with an expected cumulative reward (Q-value).
- It updates these Q-values based on the immediate rewards and the maximum Q-value of the next state.
- The learned Q-values are used to determine the best action in each state (a minimal version of this update is sketched below).
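For comparison with the CQL sketch above, the core tabular Q-learning step is just the temporal-difference update plus an epsilon-greedy action-selection rule. As before, the sizes and names are illustrative assumptions rather than this project's exact code.

```python
import numpy as np

n_states, n_actions = 25, 4          # assumed 5x5 grid, 4 moves
lr, gamma = 0.1, 0.99                # learning rate and discount factor
q_table = np.zeros((n_states, n_actions))

def q_learning_update(s, a, r, s_next, done):
    """Q(s, a) <- Q(s, a) + lr * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    target = r if done else r + gamma * q_table[s_next].max()
    q_table[s, a] += lr * (target - q_table[s, a])

def epsilon_greedy(s, epsilon=0.1):
    """Take a random action with probability epsilon, otherwise act greedily."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(q_table[s].argmax())
```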
Key differences between CQL and standard Q-Learning:
Data Usage:
- CQL is designed for offline learning from a fixed dataset.
- Standard QL typically learns through online interaction, but it can be adapted for offline use (a sketch of collecting such a fixed dataset follows).
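For concreteness, the fixed dataset of random trajectories mentioned above could be collected with something like the sketch below. The environment interface (`reset()`, `step(action)` returning `(next_state, reward, done)`, and an `n_actions` attribute) is an assumption; adapt the calls to whatever GridWorld API this project actually exposes.

```python
import random

def collect_random_dataset(env, n_episodes=500, max_steps=100):
    """Roll out a uniform-random behavior policy and store transitions.

    Returns a list of (state, action, reward, next_state, done) tuples that
    an offline learner (CQL, or Q-learning run offline) can replay.
    """
    dataset = []
    for _ in range(n_episodes):
        s = env.reset()
        for _ in range(max_steps):
            a = random.randrange(env.n_actions)     # random behavior policy
            s_next, r, done = env.step(a)           # assumed step() signature
            dataset.append((s, a, r, s_next, done))
            s = s_next
            if done:
                break
    return dataset
```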
Conservatism:
- CQL explicitly penalizes choosing actions not well-represented in the dataset.
- QL doesn't have this built-in conservatism, which can lead to overoptimistic estimates in offline settings.
Complexity:
- CQL introduces extra complexity through its conservative regularization term.
- QL is generally simpler in its update rule.
Performance in Offline Settings:
- CQL often performs better in purely offline scenarios due to its conservative nature.
- QL may struggle with offline data, especially if the dataset doesn't cover the state-action space well.
Both algorithms, when implemented in GridWorld, aim to learn a policy for navigating the grid efficiently. The main difference lies in how they handle the challenges of learning from a fixed dataset, with CQL being more suited to this offline learning scenario.
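As a final usage sketch, either learner's Q-table can be turned into behavior by acting greedily at evaluation time. The environment interface below is the same assumed `reset()`/`step()` API as in the dataset-collection sketch.

```python
import numpy as np

def greedy_rollout(env, q_table, max_steps=100):
    """Follow the greedy policy implied by a learned Q-table and return the total reward."""
    s = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        a = int(np.argmax(q_table[s]))   # best action according to the learned Q-values
        s, r, done = env.step(a)         # assumed step() signature
        total_reward += r
        if done:
            break
    return total_reward
```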