Offline Reinforcement Learning Algorithms

Simple Conservative Q-Learning (CQL) with GridWorld

Conservative Q-Learning (CQL) is an offline RL algorithm designed to address the overestimation problem in standard Q-learning when learning from a fixed dataset.

Key features:

  1. Conservatism: CQL adds a regularization term to the standard Q-learning loss that penalizes the Q-values of out-of-distribution actions (a tabular sketch of this update appears after the GridWorld notes below).
  2. Offline Learning: It learns from a pre-collected dataset without interacting with the environment during training.
  3. Overestimation Mitigation: By being conservative, it helps prevent the overoptimistic value estimates that can occur in offline RL.

In the GridWorld context:

  • The agent learns to navigate a grid to reach a goal position.
  • The learning process uses only a pre-collected dataset of random trajectories.
  • The CQL regularization helps the agent avoid choosing actions that weren't well-represented in the dataset.
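
The repository's own implementation isn't reproduced here, but as a rough illustration of the idea, a tabular CQL-style update on one offline transition could look like the following. The function name `cql_update`, the `(s, a, r, s_next, done)` tuple layout, and the hyperparameter defaults are assumptions for this sketch, not the repo's actual API.

```python
import numpy as np

def cql_update(Q, s, a, r, s_next, done, alpha=1.0, lr=0.1, gamma=0.99):
    """One gradient step of a tabular CQL-style loss on a single offline transition.

    Q is an (n_states, n_actions) array; alpha controls how strongly
    out-of-distribution actions are pushed down.
    """
    # Bellman target from the dataset transition (treated as a constant).
    target = r + (0.0 if done else gamma * np.max(Q[s_next]))
    td_error = Q[s, a] - target

    # Conservative term: logsumexp over all actions minus the dataset action's
    # Q-value. Its gradient lowers the Q-values of actions the dataset never
    # took in state s and raises the value of the action it did take.
    soft = np.exp(Q[s] - np.max(Q[s]))
    soft /= soft.sum()                 # softmax over the actions of state s
    grad = alpha * soft                # gradient of alpha * logsumexp(Q[s])
    grad[a] -= alpha                   # gradient of -alpha * Q[s, a]

    # Add the gradient of the usual 0.5 * TD-error^2 term and take a step.
    grad[a] += td_error
    Q[s] -= lr * grad
    return Q
```

Setting alpha to zero recovers a plain gradient-based Q-learning step, which makes the conservatism term the only difference between the two updates.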

Q-Learning (QL) with GridWorld

Q-Learning is a model-free reinforcement learning algorithm that learns the value of taking each action in each state.

Key features:

  1. Iterative Value Updates: It repeatedly updates Q-values based on observed rewards and estimates of future value (bootstrapping).
  2. Off-policy: It can learn from data collected by any policy, not just the one it's currently following.
  3. Exploration-Exploitation: Typically uses an epsilon-greedy strategy to balance between exploring new actions and exploiting known good actions.

In the GridWorld context:

  • The agent learns to associate each state-action pair with an expected cumulative reward (Q-value).
  • It updates these Q-values using the immediate reward and the maximum Q-value of the next state (the standard update rule is sketched below).
  • The learned Q-values are used to determine the best action in each state.
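
For comparison, a minimal tabular Q-learning update and an epsilon-greedy action rule might look like this; again, the names and defaults are illustrative, not taken from the repository.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, lr=0.1, gamma=0.99):
    """Standard tabular Q-learning update for one observed transition."""
    target = r + (0.0 if done else gamma * np.max(Q[s_next]))
    Q[s, a] += lr * (target - Q[s, a])
    return Q

def epsilon_greedy(Q, s, epsilon=0.1, rng=None):
    """Epsilon-greedy action selection over the Q-values of state s."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: uniform random action
    return int(np.argmax(Q[s]))                # exploit: current greedy action
```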

Comparison

  1. Data Usage:

    • CQL is designed for offline learning from a fixed dataset.
    • Standard QL typically learns through online interaction, but it can be adapted for offline use (a minimal offline training loop for both update rules is sketched at the end of this comparison).
  2. Conservatism:

    • CQL explicitly penalizes choosing actions not well-represented in the dataset.
    • QL doesn't have this built-in conservatism, which can lead to overoptimistic estimates in offline settings.
  3. Complexity:

    • CQL adds additional complexity with its conservatism regularization term.
    • QL is generally simpler in its update rule.
  4. Performance in Offline Settings:

    • CQL often performs better in purely offline scenarios due to its conservative nature.
    • QL may struggle with offline data, especially if the dataset doesn't cover the state-action space well.

Both algorithms, when implemented in GridWorld, aim to learn a policy for navigating the grid efficiently. The main difference lies in how they handle the challenges of learning from a fixed dataset, with CQL being more suited to this offline learning scenario.
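
As a sketch of how the two algorithms can be compared on the same fixed data, the loop below runs either update rule over an offline dataset of transitions. `train_offline`, `n_states`, `n_actions`, and `dataset` are placeholder names for this illustration and do not refer to symbols in the repository.

```python
import numpy as np

def train_offline(Q, dataset, update_fn, epochs=50, **update_kwargs):
    """Apply an update rule to every transition in a fixed offline dataset.

    dataset is an iterable of (s, a, r, s_next, done) tuples, e.g. random
    trajectories collected in the GridWorld ahead of time.
    """
    for _ in range(epochs):
        for s, a, r, s_next, done in dataset:
            Q = update_fn(Q, s, a, r, s_next, done, **update_kwargs)
    return Q

# Hypothetical usage, assuming the sketches above and a GridWorld dataset:
# Q_cql = train_offline(np.zeros((n_states, n_actions)), dataset, cql_update, alpha=1.0)
# Q_ql  = train_offline(np.zeros((n_states, n_actions)), dataset, q_learning_update)
```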
