This repository includes implementations and performance reports of several bandit algorithms. The study covers:

1. Epsilon-Greedy (multi-armed and linear)
2. Upper Confidence Bound (UCB)
3. Explore-then-Commit (ETC)
4. Thompson Sampling (Gaussian MAB)
5. Linear UCB (LinUCB)
6. Linear Thompson Sampling (LinTS)
7. Generalized Linear Model Bandit (GLM)
To simulate a specific algorithm, edit the `Simulation.py` script, enabling the desired algorithm and commenting out the others. For example, to run the UCB algorithm, keep only the `UCBBandit` line uncommented:
```python
## Initiate Bandit Algorithms ##
algorithms = {}
#algorithms['EpsilonGreedyLinearBandit'] = EpsilonGreedyLinearBandit(dimension=context_dimension, lambda_=0.1, epsilon=None)
#algorithms['EpsilonGreedyMultiArmedBandit'] = EpsilonGreedyMultiArmedBandit(num_arm=n_articles, epsilon=0.1)
#algorithms['ExplorethenCommit'] = ExplorethenCommit(num_arm=n_articles, m=30)
algorithms['UCBBandit'] = UCBBandit(num_arm=n_articles, alpha=0.5)
#algorithms['ThompsonSamplingGaussianMAB'] = ThompsonSamplingGaussianMAB(num_arm=n_articles)
#algorithms['LinearUCBBandit'] = LinearUCBBandit(dimension=context_dimension, lambda_=0.1, alpha=0.5) #delta=0.05, alpha=2.358
#algorithms['LinearThompsonSamplingMAB'] = LinearThompsonSamplingMAB(dimension=context_dimension, lambda_=0.1)
```
After selecting your algorithm, run the `Simulation.py` script.
| Hyperparameter (m) | Cumulative Regret |
|---|---|
| 10 | 1001.40 |
| 20 | 214.90 |
| 30 | 334.02 |
Figure 1: Explore then Commit accumulated regret; panels: (a) m = 10, (b) m = 20, (c) m = 30
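The `m` hyperparameter is the number of exploration pulls per arm before the algorithm commits, which matches the regret pattern in the table: a small `m` risks committing to the wrong arm, while a large `m` wastes rounds over-exploring. A minimal standalone sketch of the two phases (the `pull` callback and function name are illustrative, not the repo's `ExplorethenCommit` API):

```python
def explore_then_commit(pull, num_arm, m, horizon):
    """Pull each arm m times round-robin, then commit to the arm
    with the best empirical mean for the remaining rounds."""
    totals = [0.0] * num_arm
    rewards = []
    # Exploration phase: m pulls per arm.
    for t in range(num_arm * m):
        arm = t % num_arm
        r = pull(arm)
        totals[arm] += r
        rewards.append(r)
    # Commit phase: exploit the empirically best arm.
    best = max(range(num_arm), key=lambda a: totals[a] / m)
    for _ in range(horizon - num_arm * m):
        rewards.append(pull(best))
    return best, rewards
```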
| Hyperparameter (α) | Cumulative Regret |
|---|---|
| 0.1 | 256.50 |
| 0.5 | 977.03 |
| 1.0 | 1906.65 |
Figure 2: UCB Bandit accumulated regret
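In UCB, α scales the exploration bonus added to each arm's empirical mean, which is why cumulative regret grows with α in the table above. A hedged sketch of the index (the exact bonus formula in the repo's `UCBBandit` may differ):

```python
import math

def ucb_index(mean, count, t, alpha):
    """UCB1-style index: empirical mean plus an alpha-scaled
    exploration bonus; larger alpha means more exploration."""
    return mean + alpha * math.sqrt(2.0 * math.log(t) / count)

def select_arm(means, counts, t, alpha=0.5):
    # Pull any arm that has not been tried yet.
    for a, c in enumerate(counts):
        if c == 0:
            return a
    # Otherwise pick the arm with the largest UCB index.
    return max(range(len(means)),
               key=lambda a: ucb_index(means[a], counts[a], t, alpha))
```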
| Hyperparameter (α) | Cumulative Regret |
|---|---|
| 0.5 | 24.43 |
| 1.5 | 177.89 |
| 2.5 | 487.73 |
Figure 4: Linear UCB accumulated regret
Figure 5: Linear UCB estimation error
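LinUCB scores each arm by a ridge-regression estimate of the reward plus an α-scaled confidence width, with `lambda_` acting as the ridge regularizer that initializes the design matrix. A standalone sketch under those assumptions (not the repo's `LinearUCBBandit` interface):

```python
import numpy as np

def linucb_scores(A, b, contexts, alpha):
    """LinUCB: theta = A^{-1} b (ridge estimate), score each context
    by predicted reward plus an alpha-scaled confidence width."""
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b
    return [float(x @ theta + alpha * np.sqrt(x @ A_inv @ x))
            for x in contexts]

def linucb_update(A, b, x, reward):
    # Rank-one update of the design matrix and response vector.
    A += np.outer(x, x)
    b += reward * x
```

Here `A` starts as `lambda_ * np.eye(dimension)` and `b` as zeros; a large width on a rarely seen context direction is what drives exploration.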
| Cumulative Regret |
|---|
| 1098.24 |
Figure 6: Linear Thompson Sampling accumulated regret and estimation error
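Linear Thompson Sampling replaces LinUCB's deterministic confidence width with posterior sampling: it draws a parameter vector from a Gaussian posterior around the ridge estimate and acts greedily on the sample. A minimal sketch, assuming a Gaussian posterior with scale `v` (the name `v` and the function are illustrative, not the repo's `LinearThompsonSamplingMAB` API):

```python
import numpy as np

def lints_select(A, b, contexts, v=1.0, rng=None):
    """Linear TS: sample theta_tilde ~ N(A^{-1} b, v^2 A^{-1}) and
    pick the arm whose context maximizes the sampled linear reward."""
    rng = rng or np.random.default_rng()
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b
    theta_tilde = rng.multivariate_normal(theta_hat, v**2 * A_inv)
    return int(np.argmax([x @ theta_tilde for x in contexts]))
```

As with LinUCB, `A` starts as `lambda_ * np.eye(dimension)`; randomness in the sampled `theta_tilde` provides the exploration.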