This repository provides documentation for various reinforcement learning concepts and algorithms. Below, you'll find brief explanations of each topic covered.
Bandit algorithms are used in scenarios where an agent must decide between multiple actions, each having an unknown reward distribution.
- Multi-Armed Bandits: Multi-Armed Bandits represent a class of problems where an agent faces multiple choices, often called "arms" or "actions." Each action has an associated reward distribution, but the agent initially has limited knowledge about these distributions. The challenge is to maximize the cumulative reward while exploring and exploiting the available actions effectively.
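As a concrete (hypothetical) setup, the sketch below models a k-armed bandit whose arms pay Gaussian rewards with means hidden from the agent. The `GaussianBandit` class is purely illustrative and not part of this repository.

```python
import numpy as np

class GaussianBandit:
    """Illustrative k-armed bandit: each arm pays a Gaussian reward
    whose mean is unknown to the agent."""

    def __init__(self, true_means, rng=None):
        self.true_means = np.asarray(true_means, dtype=float)
        self.rng = rng or np.random.default_rng()

    @property
    def n_arms(self):
        return len(self.true_means)

    def pull(self, arm):
        # Reward is sampled around the arm's true mean with unit variance.
        return self.rng.normal(self.true_means[arm], 1.0)

# Example: three arms with different (hidden) mean rewards.
bandit = GaussianBandit([0.1, 0.5, 0.9])
print(bandit.pull(2))
```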
- Epsilon-Greedy Algorithm: The Epsilon-Greedy algorithm is a strategy for addressing the exploration-exploitation dilemma in Multi-Armed Bandits. The key idea is to balance two modes of behavior:
- Exploration (with probability ε): The agent chooses a random action to learn more about its reward distribution.
- Exploitation (with probability 1-ε): The agent selects the action with the highest estimated reward based on its current knowledge. By tuning the value of ε, you can control the trade-off between exploration and exploitation: a higher ε promotes exploration, while a lower ε favors exploitation (see the sketch after this list).
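A minimal sketch of the rule, assuming Bernoulli reward distributions and incremental sample-mean estimates (the function name `epsilon_greedy` and the toy arm probabilities are illustrative, not from this repository):

```python
import numpy as np

def epsilon_greedy(true_probs, epsilon=0.1, steps=1000, seed=0):
    """Run epsilon-greedy on a Bernoulli bandit and return estimated values."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_probs)
    counts = np.zeros(n_arms)          # how often each arm was pulled
    values = np.zeros(n_arms)          # running mean reward per arm

    for _ in range(steps):
        if rng.random() < epsilon:     # explore: pick a random arm
            arm = int(rng.integers(n_arms))
        else:                          # exploit: pick the best estimate so far
            arm = int(np.argmax(values))
        reward = float(rng.random() < true_probs[arm])
        counts[arm] += 1
        # Incremental update of the sample mean for the pulled arm.
        values[arm] += (reward - values[arm]) / counts[arm]
    return values

print(epsilon_greedy([0.2, 0.5, 0.8]))
```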
- UCB (Upper Confidence Bound): The Upper Confidence Bound (UCB) algorithm is another technique for Multi-Armed Bandits. It balances exploration and exploitation by accounting for uncertainty in the reward estimates. For each action it computes an upper confidence bound on the expected reward, and at each step it selects the action with the highest bound, since that action could plausibly yield the highest reward. This encourages exploration, because actions with uncertain estimates still get a chance to be selected. A parameter (typically denoted c) controls the balance between exploration and exploitation: higher values of c result in more exploration.
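A minimal sketch of a UCB1-style selection rule, under the same Bernoulli-bandit assumption as above (the helper name `ucb_select` and the toy probabilities are illustrative):

```python
import numpy as np

def ucb_select(counts, values, t, c=2.0):
    """Pick an arm by Upper Confidence Bound: estimate + exploration bonus."""
    untried = np.where(counts == 0)[0]
    if untried.size > 0:               # any arm never tried gets priority
        return int(untried[0])
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(values + bonus))

# Run the rule on a toy Bernoulli bandit.
rng = np.random.default_rng(0)
true_probs = [0.2, 0.5, 0.8]
counts = np.zeros(3)
values = np.zeros(3)
for t in range(1, 1001):
    arm = ucb_select(counts, values, t)
    reward = float(rng.random() < true_probs[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]
print(values)   # estimates concentrate on the best arm
```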
- Thompson Sampling: Thompson Sampling is an algorithm for Multi-Armed Bandits that employs Bayesian probability to make decisions. It maintains a posterior distribution over each action's expected reward. At each step it draws one sample from each action's posterior and selects the action whose sample is highest. This probabilistic approach incorporates uncertainty naturally: it tends to favor actions with higher estimated rewards while still exploring actions whose posteriors remain wide. Thompson Sampling is known for its strong theoretical guarantees and is an effective strategy for the exploration-exploitation problem.
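A minimal Beta-Bernoulli sketch, assuming binary rewards so that each arm's posterior over its success probability stays a Beta distribution (the function name and toy probabilities are illustrative):

```python
import numpy as np

def thompson_sampling(true_probs, steps=1000, seed=0):
    """Beta-Bernoulli Thompson Sampling: sample from each arm's posterior
    and pull the arm with the highest sampled mean."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_probs)
    alpha = np.ones(n_arms)   # observed successes + 1 (uniform Beta prior)
    beta = np.ones(n_arms)    # observed failures  + 1

    for _ in range(steps):
        samples = rng.beta(alpha, beta)        # one posterior draw per arm
        arm = int(np.argmax(samples))
        reward = rng.random() < true_probs[arm]
        if reward:
            alpha[arm] += 1
        else:
            beta[arm] += 1
    # Posterior means approximate the true success probabilities.
    return alpha / (alpha + beta)

print(thompson_sampling([0.2, 0.5, 0.8]))
```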
These strategies and algorithms are fundamental in solving problems involving uncertainty and trade-offs between exploration and exploitation, such as in recommendation systems, online advertising, and clinical trials.
MDPs model scenarios where an agent interacts with an environment, taking actions to maximize cumulative rewards. Optimal policy algorithms help find the best strategy for the agent.
- MDP Basics: MDPs (Markov Decision Processes) are mathematical models used in reinforcement learning. They consist of key components (a small sketch of this structure follows the list):
- States: Represent possible situations or configurations.
- Actions: Choices or decisions available to the agent.
- Transition Probabilities: Describe the likelihood of moving from one state to another when taking an action.
- Reward Functions: Specify immediate rewards for transitions.
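One possible way to store these components, sketched here with plain NumPy arrays (`P[s, a, s']` for transition probabilities, `R[s, a]` for immediate rewards); the toy numbers are made up for illustration and are not from this repository:

```python
import numpy as np

# A toy 2-state, 2-action MDP:
#   P[s, a, s'] = probability of landing in s' after taking a in s
#   R[s, a]     = immediate reward for taking a in s
n_states, n_actions = 2, 2
P = np.zeros((n_states, n_actions, n_states))
R = np.zeros((n_states, n_actions))

# Action 0 tends to stay in place, action 1 tends to switch states.
P[0, 0] = [0.9, 0.1];  R[0, 0] = 0.0
P[0, 1] = [0.2, 0.8];  R[0, 1] = 1.0
P[1, 0] = [0.1, 0.9];  R[1, 0] = 2.0
P[1, 1] = [0.8, 0.2];  R[1, 1] = 0.0

gamma = 0.95  # discount factor for future rewards
```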
- Value Iteration: Value Iteration is an algorithm used to find optimal policies in MDPs. It works by iteratively applying the Bellman optimality update to state values until they converge; the converged values identify the best action to take in each state to maximize cumulative reward.
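A minimal sketch of Value Iteration, assuming the `P[s, a, s']` / `R[s, a]` array layout shown above and a small random MDP just to exercise the function:

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Apply Bellman optimality backups until the value function stops changing."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
        Q = R + gamma * P @ V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # optimal values and greedy policy
        V = V_new

# Tiny random MDP: each P[s, a, :] is a valid probability distribution.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))
R = rng.random((3, 2))
V, policy = value_iteration(P, R)
print(V, policy)
```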
- Policy Iteration: Policy Iteration is an algorithm for solving MDPs. It alternates between two steps:
- Policy Evaluation: Computes the expected cumulative reward for a given policy.
- Policy Improvement: Adjusts the policy by making it greedy with respect to the current value estimates, i.e., selecting the actions that maximize expected rewards. This process repeats until the policy converges to an optimal one (see the sketch after this list).
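A minimal sketch, assuming the same `P[s, a, s']` / `R[s, a]` array layout and solving the policy-evaluation step exactly with a linear solve:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.95):
    """Alternate exact policy evaluation and greedy policy improvement."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[np.arange(n_states), policy]     # (n_states, n_states)
        R_pi = R[np.arange(n_states), policy]     # (n_states,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to V.
        Q = R + gamma * P @ V
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return V, policy
        policy = new_policy

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))
R = rng.random((3, 2))
print(policy_iteration(P, R))
```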
- Linear Programming: Linear Programming is a technique for finding optimal policies in MDPs. It formulates the problem as a linear program whose constraints encode the Bellman optimality conditions, so a standard LP solver can recover the optimal value function directly. It provides a systematic way to solve MDPs exactly.
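A minimal sketch of the primal LP formulation (minimize the sum of state values subject to the Bellman optimality inequalities), assuming SciPy's `linprog` is available; the function name `solve_mdp_lp` is illustrative:

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, R, gamma=0.95):
    """Solve: minimize sum_s V(s)
       subject to V(s) >= R(s, a) + gamma * sum_s' P(s, a, s') V(s') for all (s, a)."""
    n_states, n_actions, _ = P.shape
    c = np.ones(n_states)                 # objective: sum of state values
    A_ub, b_ub = [], []
    for s in range(n_states):
        for a in range(n_actions):
            # Rearranged constraint: gamma * P[s, a] @ V - V[s] <= -R[s, a]
            row = gamma * P[s, a].copy()
            row[s] -= 1.0
            A_ub.append(row)
            b_ub.append(-R[s, a])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n_states)
    return res.x                          # optimal state values

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))
R = rng.random((3, 2))
print(solve_mdp_lp(P, R))
```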
Q-learning and SARSA are reinforcement learning algorithms used for finding optimal policies in MDPs. Linear interpolation is a technique used to approximate value functions.
- Q-learning: Q-learning is a model-free reinforcement learning algorithm used to estimate the values of state-action pairs (known as Q-values) and find optimal policies. It operates in discrete state and action spaces and learns by iteratively updating Q-values based on observed rewards and transitions. Q-learning is off-policy: it learns the value of the greedy policy even while the agent explores with a different behavior policy. It is known for its simplicity and efficiency in solving problems with unknown dynamics.
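A minimal tabular sketch on a hypothetical chain environment; the `chain_step` helper and its dynamics are invented for illustration and are not part of this repository:

```python
import numpy as np

def chain_step(state, action, rng, n_states=5):
    """Toy chain: action 1 moves right, action 0 moves left, with a 10%
    chance of slipping the other way; the last state pays +1 and ends."""
    move = 1 if action == 1 else -1
    if rng.random() < 0.1:
        move = -move
    next_state = int(np.clip(state + move, 0, n_states - 1))
    done = next_state == n_states - 1
    return next_state, (1.0 if done else 0.0), done

def q_learning(episodes=500, n_states=5, n_actions=2,
               alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = chain_step(state, action, rng, n_states)
            # Off-policy target: bootstrap from the best next-state action.
            target = reward + gamma * (0.0 if done else np.max(Q[next_state]))
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q

print(q_learning())
```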
- SARSA Learning: SARSA (State-Action-Reward-State-Action) is another model-free reinforcement learning algorithm that interleaves action selection and learning. It executes actions according to its current policy and updates Q-values using the reward received and the next action actually chosen. SARSA is an on-policy method: the policy being learned is the same policy used to explore the environment.
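A matching tabular sketch on the same kind of toy chain, showing that the update bootstraps from the action actually taken next (all names and dynamics are illustrative):

```python
import numpy as np

def sarsa(episodes=500, n_states=5, n_actions=2,
          alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular SARSA on a toy slippery chain: action 1 moves right,
    action 0 moves left, and the last state pays +1 and ends the episode."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))

    def pick(state):
        # Epsilon-greedy selection from the current Q estimates.
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[state]))

    def step(state, action):
        move = 1 if action == 1 else -1
        if rng.random() < 0.1:            # occasional slip the other way
            move = -move
        next_state = int(np.clip(state + move, 0, n_states - 1))
        done = next_state == n_states - 1
        return next_state, (1.0 if done else 0.0), done

    for _ in range(episodes):
        state, done = 0, False
        action = pick(state)
        while not done:
            next_state, reward, done = step(state, action)
            next_action = pick(next_state)
            # On-policy target: bootstrap from the action actually taken next.
            target = reward + gamma * (0.0 if done else Q[next_state, next_action])
            Q[state, action] += alpha * (target - Q[state, action])
            state, action = next_state, next_action
    return Q

print(sarsa())
```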
- Linear Interpolation: Linear interpolation is a technique used to approximate value functions in continuous state spaces within reinforcement learning. It estimates the value of a state that lies between two states whose values are known, assuming a linear relationship between them. This gives a simple, continuous approximation and is especially valuable when values can only be stored at a finite set of points in a continuous state space.
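A minimal sketch using `np.interp` to approximate a one-dimensional value function stored only at a handful of grid points (the grid and values below are made up for illustration):

```python
import numpy as np

# Values are stored only at a coarse grid of known states.
grid = np.linspace(0.0, 1.0, 6)                            # known states
grid_values = np.array([0.0, 0.2, 0.5, 0.7, 0.9, 1.0])     # their values

def interpolated_value(state):
    """Approximate V(state) by linearly interpolating between the two
    nearest grid points (np.interp clamps queries outside the grid)."""
    return float(np.interp(state, grid, grid_values))

print(interpolated_value(0.35))   # lies between the grid points 0.2 and 0.4
```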
Feel free to explore the documentation for each topic in this repository. Each section provides a brief overview of the concept or algorithm, making it easier to understand.