
Comparison of Bandit Algorithms

Overview

This repository includes implementations and performance reports of several bandit algorithms. The study covers:

1. Explore-then-Commit

2. Upper Confidence Bound (UCB)

3. Thompson Sampling

4. Linear UCB (LinUCB)

5. Linear Thompson Sampling (LinTS)

6. Generalized Linear Model Bandit (GLM)

Implementation

To simulate a specific algorithm, edit the Simulation.py script: uncomment the algorithm you want to run and comment out the others.

For example, to run the UCB algorithm with $\alpha = 0.5$, update the code as follows:

```python
## Initiate Bandit Algorithms ##
algorithms = {}

#algorithms['EpsilonGreedyLinearBandit'] = EpsilonGreedyLinearBandit(dimension=context_dimension, lambda_=0.1, epsilon=None)
#algorithms['EpsilonGreedyMultiArmedBandit'] = EpsilonGreedyMultiArmedBandit(num_arm=n_articles, epsilon=0.1)
#algorithms['ExplorethenCommit'] = ExplorethenCommit(num_arm=n_articles, m=30)
algorithms['UCBBandit'] = UCBBandit(num_arm=n_articles, alpha=0.5)
#algorithms['ThompsonSamplingGaussianMAB'] = ThompsonSamplingGaussianMAB(num_arm=n_articles)
#algorithms['LinearUCBBandit'] = LinearUCBBandit(dimension=context_dimension, lambda_=0.1, alpha=0.5) #delta=0.05, alpha=2.358
#algorithms['LinearThompsonSamplingMAB'] = LinearThompsonSamplingMAB(dimension=context_dimension, lambda_=0.1)
```

After selecting your algorithm, run the Simulation.py script.

1. Explore-then-Commit

Result

| Hyperparameter (m) | Cumulative Regret |
|---|---|
| 10 | 1001.40 |
| 20 | 214.90 |
| 30 | 334.02 |

Figure 1: Explore-then-Commit accumulated regret for (a) m = 10, (b) m = 20, (c) m = 30
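Here m is the number of exploration pulls per arm before committing to the empirically best arm. Below is a minimal sketch of that logic; the class and attribute names are illustrative assumptions, not the repository's exact ExplorethenCommit implementation.

```python
import numpy as np

class ExploreThenCommitSketch:
    """Pull every arm m times, then commit to the arm with the best empirical mean.
    A simplified illustration, not the repository's exact code."""

    def __init__(self, num_arm, m):
        self.num_arm = num_arm
        self.m = m                              # exploration pulls per arm
        self.counts = np.zeros(num_arm)         # pulls so far, per arm
        self.means = np.zeros(num_arm)          # empirical mean reward, per arm
        self.t = 0                              # total pulls so far

    def select_arm(self):
        if self.t < self.num_arm * self.m:      # exploration phase: round-robin
            return self.t % self.num_arm
        return int(np.argmax(self.means))       # commit phase: best empirical arm

    def update(self, arm, reward):
        self.t += 1
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```

Under this scheme a small m risks committing to a suboptimal arm, while a large m wastes pulls on exploration; the table above shows the resulting trade-off, with m = 20 performing best.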

2. Upper Confidence Bound (UCB)

Reward Estimation + Confidence Bound

$$ \text{UCB} = \hat u_{t-1,i} + \alpha \sqrt{\frac{2 \ln t}{S_{t-1,i}}} $$
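As a reference, here is a minimal sketch of this arm-selection rule. The function and argument names are illustrative, and the assumption is that $\alpha$ scales the confidence bonus, consistent with the `alpha` argument of `UCBBandit`.

```python
import numpy as np

def ucb_select(means, counts, t, alpha=0.5):
    """Pick the arm maximizing the UCB index above.
    means, counts: per-arm empirical means and pull counts; t: current round.
    A sketch assuming alpha scales the confidence bonus."""
    untried = np.where(counts == 0)[0]
    if untried.size > 0:                          # play each arm once before using the index
        return int(untried[0])
    bonus = np.sqrt(2.0 * np.log(t) / counts)     # confidence bound term
    return int(np.argmax(means + alpha * bonus))
```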

Result

| Hyperparameter (α) | Cumulative Regret |
|---|---|
| 0.1 | 256.50 |
| 0.5 | 977.03 |
| 1.0 | 1906.65 |

Figure 2: UCB Bandit accumulated regret for (a) $\alpha$ = 0.1, (b) $\alpha$ = 0.5, (c) $\alpha$ = 1.0

3. Thompson Sampling

Posterior Distribution

$$ \mathcal{N}\left( \hat u_{t-1,i},\ \frac{1}{S_{t-1,i} + 1} \right) $$
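A minimal sketch of one Thompson Sampling step with this Gaussian posterior follows; the function name and the per-arm state layout are illustrative assumptions.

```python
import numpy as np

def thompson_select(means, counts, rng=None):
    """Sample one value per arm from N(mean, 1 / (count + 1)) and play the argmax.
    A sketch of the Gaussian posterior above, not the repository's exact code."""
    if rng is None:
        rng = np.random.default_rng()
    std = np.sqrt(1.0 / (counts + 1.0))   # posterior standard deviation per arm
    samples = rng.normal(means, std)      # one posterior draw per arm
    return int(np.argmax(samples))
```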

Result

| Cumulative Regret |
|---|
| 100 |

Figure 3: Thompson Sampling accumulated regret

4. Linear UCB (LinUCB)

Parameter Estimation

$$ \hat \theta_{t+1} = A_{t+1}^{-1} b_{t+1} $$

Reward Estimation + Confidence Bound

$$ \text{UCB} = x^T \hat \theta_t + \alpha \sqrt{x^T A^{-1} x} $$
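The sketch below combines the two steps above: the ridge estimate $\hat\theta = A^{-1} b$ and the $\alpha$-scaled confidence width. The argument names mirror the `LinearUCBBandit(dimension=..., lambda_=..., alpha=...)` call in Simulation.py, but the internals are an assumption, not the repository's exact code; contexts are assumed to be NumPy arrays of shape `(dimension,)`.

```python
import numpy as np

class LinUCBSketch:
    """Maintain A = lambda*I + sum(x x^T) and b = sum(r x); score arms with the UCB above.
    A simplified illustration of LinUCB."""

    def __init__(self, dimension, lambda_=0.1, alpha=0.5):
        self.A = lambda_ * np.eye(dimension)   # regularized design matrix
        self.b = np.zeros(dimension)           # sum of reward-weighted contexts
        self.alpha = alpha

    def select_arm(self, contexts):
        A_inv = np.linalg.inv(self.A)
        theta_hat = A_inv @ self.b             # theta_hat = A^{-1} b
        scores = [x @ theta_hat + self.alpha * np.sqrt(x @ A_inv @ x) for x in contexts]
        return int(np.argmax(scores))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x
```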

Result

| Hyperparameter (α) | Cumulative Regret |
|---|---|
| 0.5 | 24.43 |
| 1.5 | 177.89 |
| 2.5 | 487.73 |

Figure 4: Linear UCB accumulated regret for (a) $\alpha$ = 0.5, (b) $\alpha$ = 1.5, (c) $\alpha$ = 2.5


Figure 5: Linear UCB estimation error for (a) $\alpha$ = 0.5, (b) $\alpha$ = 1.5, (c) $\alpha$ = 2.5

5. Linear Thompson Sampling (LinTS)

Posterior Distribution

$$ \mathcal{N}\left( \hat{\theta}_t, A_t^{-1} \right) $$
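A minimal sketch of one LinTS selection step with this posterior follows, assuming $A$ and $b$ are maintained as in the LinUCB sketch above; function and variable names are illustrative.

```python
import numpy as np

def lin_ts_select(A, b, contexts, rng=None):
    """Sample theta_tilde ~ N(theta_hat, A^{-1}) and play the arm maximizing x^T theta_tilde.
    A sketch, not the repository's exact LinearThompsonSamplingMAB code."""
    if rng is None:
        rng = np.random.default_rng()
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b                                   # posterior mean
    theta_tilde = rng.multivariate_normal(theta_hat, A_inv) # one posterior draw
    scores = [float(x @ theta_tilde) for x in contexts]
    return int(np.argmax(scores))
```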

Result

| Cumulative Regret |
|---|
| 1098.24 |

Figure 6: Linear Thompson Sampling accumulated regret and estimation error

6. Generalized Linear Model (GLM) Bandit: Non-linear Bandit

Modified Non-Linear Reward Function for Testing

$$ R = (x^T \theta)^2 + \epsilon, \text{ where } \epsilon \sim N(\mu, \sigma^2) $$

GLM Parameter Estimation (MLE)

$$ \hat \theta_{t+1} = \arg\max_{\theta} P(r \mid \theta) = A_{t+1}^{-1} b_{t+1} $$

GLM UCB

$$ \text{UCB}_{\text{GLM}} = f(x^T \hat \theta_t) + \alpha \sqrt{x^T A^{-1} x} = (x^T \hat \theta_t)^2 + \alpha \sqrt{x^T A^{-1} x} $$
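A minimal sketch of this index with the link $f(z) = z^2$ from the modified reward above follows. It assumes $\hat\theta$ and $A^{-1}$ are maintained as in the LinUCB sketch; the function name is illustrative, not the repository's exact code.

```python
import numpy as np

def glm_ucb_select(theta_hat, A_inv, contexts, alpha=0.1):
    """GLM-UCB index with link f(z) = z^2, matching the modified reward above."""
    scores = []
    for x in contexts:
        mean = (x @ theta_hat) ** 2              # f(x^T theta_hat)
        width = alpha * np.sqrt(x @ A_inv @ x)   # confidence bonus
        scores.append(mean + width)
    return int(np.argmax(scores))
```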

Result

| Hyperparameter (α) | Cumulative Regret |
|---|---|
| 0.1 | 62.16 |
| 0.5 | 727.63 |
| 1.5 | 5948.48 |

Figure 7: GLM-UCB accumulated regret for (a) $\alpha$ = 0.1, (b) $\alpha$ = 0.5, (c) $\alpha$ = 1.5


Figure 8: GLM-UCB estimation error for (a) $\alpha$ = 0.1, (b) $\alpha$ = 0.5, (c) $\alpha$ = 1.5