Chapter 2: Multi-Armed Bandit
Handling Exploitation vs. Exploration. Notes from Sutton & Barto.
Yash Bonde . 2019-05-19 . 6 min read
Reinforcement Learning · Notes

These are just my notes on the book Reinforcement Learning: An Introduction; all the credit for the book goes to the authors and other contributors. My solutions to the exercises are on this page.

We do not know what the rules of the game are; all we are allowed to do is watch the playing. Of course if we are allowed to watch long enough we may catch on a few rules. - Richard Feynman

2.1 K-armed Bandit Problem

Assume that you are repeatedly faced with a choice among $k$ options, and after each choice you receive a numerical reward drawn from a stationary probability distribution. Your objective is to maximise the total reward over some time period. The value of an arbitrary action $a$ is given by $q_*(a) = \mathbb{E}[R_t \mid A_t = a]$. We denote the estimated value of action $a$ at timestep $t$ as $Q_t(a)$, which we would like to make close to $q_*(a)$.
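For concreteness, here is a minimal sketch of this setup in Python, assuming (as in the book's 10-armed testbed) that each $q_*(a)$ is drawn from $\mathcal{N}(0,1)$ and each reward from $\mathcal{N}(q_*(a), 1)$; the class and method names are my own:

```python
import numpy as np

class KArmedBandit:
    """A stationary k-armed bandit testbed.

    Each true action value q_*(a) is drawn once from N(0, 1);
    each pull of arm a returns a reward drawn from N(q_*(a), 1).
    """
    def __init__(self, k=10, seed=None):
        self.rng = np.random.default_rng(seed)
        self.q_star = self.rng.normal(0.0, 1.0, size=k)  # true action values

    def step(self, action):
        # Reward R_t with E[R_t | A_t = a] = q_*(a)
        return self.rng.normal(self.q_star[action], 1.0)
```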

2.2 Action Value Methods

Since we need a running estimate of the reward, we compute it from experience. One way is to average all the rewards received when the action was taken:

$$Q_t(a) \doteq \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}$$

where $\mathbb{1}_{A_i = a}$ is $1$ for all cases where $A_i = a$ and $0$ otherwise. Once we have these estimates, we can act with $\operatorname{argmax}_a Q_t(a)$, i.e. take the action with the highest estimated value; this is also called the "greedy method". But this kind of selection prevents "exploration" of the other values and can get stuck. To prevent this we use the $\epsilon$-greedy method, where with probability $\epsilon$ we take a random action instead.
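A sketch of sample-average estimation with $\epsilon$-greedy selection, reusing the `KArmedBandit` sketch above (the helper name `run_epsilon_greedy` and the default $\epsilon = 0.1$ are my choices):

```python
import numpy as np

def run_epsilon_greedy(bandit, k=10, steps=1000, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    rewards = [[] for _ in range(k)]   # every reward observed for each action

    def estimate(a):
        # sample-average Q_t(a); defaults to 0 for actions never taken
        return np.mean(rewards[a]) if rewards[a] else 0.0

    for _ in range(steps):
        if rng.random() < eps:
            action = int(rng.integers(k))                             # explore
        else:
            action = int(np.argmax([estimate(a) for a in range(k)]))  # exploit (greedy)
        rewards[action].append(bandit.step(action))
    return [estimate(a) for a in range(k)]
```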

2.4 Incremental Implementation

We now build upon the previous equation to make it recursive. (Ideally we would like as many of our algorithms as possible to be incremental like this, since it makes more sense computationally: constant memory and constant per-step work.)

$$Q_{n+1} \doteq \frac{1}{n}\sum_{i=1}^{n} R_i$$

$$Q_{n+1} = \frac{1}{n}\left(R_n + \sum_{i=1}^{n-1} R_i\right) = \frac{1}{n}\left(R_n + (n-1)Q_n\right)$$

$$Q_{n+1} = Q_n + \frac{1}{n}\left(R_n - Q_n\right)$$

This gives us a general form of update rule that we will use throughout reinforcement learning:

$$NewEstimate \gets OldEstimate + StepSize \times \left(Target - OldEstimate\right)$$

where $Target - OldEstimate$ is the error, which is reduced by taking a step of size $StepSize$ towards the $Target$. Note that here $StepSize = \alpha = \frac{1}{n}$. This might look similar to the gradient descent update, and later we will see how we can use gradients to get proper values. The bandit algorithm can then be given as follows:

Bandit Learning Algorithm (pseudocode figure)
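The figure's pseudocode corresponds roughly to the following sketch, combining $\epsilon$-greedy selection with the incremental update (the variable names follow the book's pseudocode; the rest is my own):

```python
import numpy as np

def simple_bandit(bandit, k=10, steps=1000, eps=0.1, seed=0):
    """Simple bandit algorithm: epsilon-greedy + incremental sample averages."""
    rng = np.random.default_rng(seed)
    Q = np.zeros(k)              # estimates Q(a), initialised to 0
    N = np.zeros(k, dtype=int)   # N(a): number of times each action was selected

    for _ in range(steps):
        if rng.random() < eps:
            A = int(rng.integers(k))   # explore
        else:
            A = int(np.argmax(Q))      # exploit
        R = bandit.step(A)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]      # NewEstimate <- Old + StepSize * (Target - Old)
    return Q, N
```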

2.5 Tracking a Non-Stationary Process

When the probability distribution does not change with time it is called a stationary process, but most problems in reinforcement learning have non-stationary distributions. With a constant step-size $\alpha$ we can expand the update and calculate the values as follows:

$$Q_{n+1} = Q_n + \alpha\left(R_n - Q_n\right)$$

$$Q_{n+1} = \alpha R_n + (1 - \alpha)Q_n$$

$$Q_{n+1} = \alpha R_n + (1 - \alpha)\left(\alpha R_{n-1} + (1 - \alpha)Q_{n-1}\right) = \dots$$

$$Q_{n+1} = (1 - \alpha)^n Q_1 + \sum_{i=1}^{n}\alpha(1 - \alpha)^{n-i} R_i$$

Note that we call this a weighted average because the weights sum to one: $(1 - \alpha)^n + \sum_{i=1}^{n}\alpha(1 - \alpha)^{n-i} = 1$. Since the weight given to a reward decays exponentially with how long ago it was received, this is called an "exponential recency-weighted average".
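A minimal sketch of the constant step-size update, plus one (assumed) way to make the testbed non-stationary by letting each $q_*(a)$ take a small random walk, as in the book's exercises:

```python
import numpy as np

def constant_alpha_update(Q, action, reward, alpha=0.1):
    # Exponential recency-weighted average: recent rewards carry more weight
    Q[action] += alpha * (reward - Q[action])
    return Q

def random_walk_drift(q_star, rng, std=0.01):
    # Assumed non-stationarity: every true value takes a small independent
    # random-walk step after each pull (the setup used in the book's Exercise 2.5)
    q_star += rng.normal(0.0, std, size=q_star.shape)
    return q_star
```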

For convergence, the step-size sequence $\alpha_n(a)$ must satisfy two conditions (remember these, as we will need them later on):

1. $\sum_{n=1}^{\infty} \alpha_n(a) = \infty$ (the steps are large enough to eventually overcome any initial conditions)

2. $\sum_{n=1}^{\infty} \alpha_n^2(a) < \infty$ (the steps eventually become small enough for the estimates to converge)

The sample-average step-size $\alpha_n(a) = \frac{1}{n}$ satisfies both. A constant $\alpha$ violates the second condition, so the estimates never fully converge and instead keep tracking the most recent rewards, which is exactly what we want in a non-stationary problem.

2.6 Optimistic Initial Values

The methods discussed so far depend on the initial action-value estimates $Q_1(a)$; statistically speaking, they are biased by their initial estimates. The initial values can be chosen so as to encourage exploration. Suppose $q_*(a) \sim \mathcal{N}(0,1)$ but we set $Q_1(a) = +5$. Whichever action is chosen, the reward is less than the starting estimate, so the learner keeps switching to other actions after being "disappointed" with the rewards it receives. We call this trick optimistic initial values. However, it does not perform well on non-stationary problems because its drive for exploration is inherently temporary: if the task changes, creating a renewed need for exploration, this method cannot help.
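A sketch of this trick: the only changes from the earlier learner are the optimistic initial value and purely greedy selection ($Q_1 = +5$ and $\alpha = 0.1$ follow the book's example; everything else is my own):

```python
import numpy as np

def optimistic_greedy(bandit, k=10, steps=1000, q1=5.0, alpha=0.1):
    Q = np.full(k, q1)               # optimistic initial estimates, e.g. Q_1(a) = +5
    for _ in range(steps):
        A = int(np.argmax(Q))        # purely greedy selection, no epsilon needed
        R = bandit.step(A)
        Q[A] += alpha * (R - Q[A])   # constant step-size update; "disappointment" lowers Q[A]
    return Q
```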

2.7 Upper Confidence Bound Action Selection

$\epsilon$-greedy action selection forces the non-greedy actions to be tried, but it does not discriminate among them. It would be better to select among the non-greedy actions according to their potential for actually being optimal.

$$A_t \doteq \operatorname{argmax}_a \left[ Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}} \right]$$

where $N_t(a)$ is the number of times $a$ has been selected prior to $t$. If $N_t(a) = 0$, then $a$ is considered a maximising action. The idea behind Upper Confidence Bound (UCB) action selection is that the square-root term measures the uncertainty in the estimate of $a$'s value, with $c$ controlling the confidence level. Each time $a$ is selected, its uncertainty decreases.
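A sketch of UCB selection, treating untried actions as maximising per the rule above ($c = 2$ is the value the book uses in its figure; the function name is mine):

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Upper Confidence Bound action selection."""
    untried = np.where(N == 0)[0]
    if untried.size > 0:
        return int(untried[0])              # untried actions count as maximising
    ucb = Q + c * np.sqrt(np.log(t) / N)    # uncertainty bonus shrinks as N_t(a) grows
    return int(np.argmax(ucb))
```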

UCB is more difficult to extend beyond bandits to more general RL settings:

  • For non-stationary distributions more complex methods would be needed
  • When state spaces are huge, it will fail

2.8 Gradient Bandit Algorithm

Selecting actions based on value estimates is not the only option; we can instead learn a numerical preference for each action $a$, denoted $H_t(a)$. The probability distribution over actions induced by these preferences is called the policy and is denoted $\pi_t(a)$:

$$\Pr\{A_t = a\} \doteq \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}} \doteq \pi_t(a)$$

where $H_1(a) = 0$ for all $a$. This is useful because all that matters is the relative preference of each action over the others. The stochastic gradient ascent update is given as:

$$H_{t+1}(a) = H_t(a) + \alpha\left(R_t - \bar{R}_t\right)\left(\mathbb{1}_{a = A_t} - \pi_t(a)\right) \quad \text{for all } a$$

Here $\mathbb{1}_{a = A_t}$ is $1$ if $a = A_t$ and $0$ otherwise. $\bar{R}_t \in \mathbb{R}$ is the average of all the rewards up through and including time $t$, and it serves as a baseline against which the reward is compared: if the reward is above the baseline, the probability of taking $A_t$ in the future is increased, and vice versa.
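A sketch of one gradient bandit update (softmax over preferences, with the running average reward as baseline; function names are my own):

```python
import numpy as np

def softmax(H):
    z = np.exp(H - H.max())      # numerically stable softmax
    return z / z.sum()

def gradient_bandit_step(H, R_bar, action, reward, alpha=0.1):
    """One stochastic-gradient-ascent step on the preferences H."""
    pi = softmax(H)                    # pi_t(a)
    one_hot = np.zeros_like(H)
    one_hot[action] = 1.0              # the indicator 1_{a = A_t}
    H += alpha * (reward - R_bar) * (one_hot - pi)   # update all preferences at once
    return H
```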

Additional Topic: Bandit Gradient as Stochastic Gradient Ascent

This additional topic, showing that the gradient bandit algorithm can be written as a form of stochastic gradient ascent (SGA), is covered in detail on another page here.

2.9 Associative Search (Contextual Bandits)

Until now we have had a single k-armed bandit problem; what if there are several different k-armed bandit problems, and at each step you confront one of them chosen at random? The bandit task then becomes a step-by-step task whose true action values change randomly, and unless they change slowly the methods described earlier do not work well in practice. If, however, each step comes with a clue about which task you are facing (say, the colour of the slot machine), you can learn a policy that associates each clue with the best action for that task; this is associative search, also called a contextual bandit. Contextual bandits are like the full RL problem in that they involve learning a policy, but like the k-armed bandit in that each reward applies only to the immediate action. If actions were also allowed to affect the next situation as well as the reward, it would become the full RL problem.
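A minimal sketch of associative search: keep a separate set of estimates per context and act $\epsilon$-greedily within the context we currently face (the environment interface with an observable `context()` is an assumption of mine, not from the book):

```python
import numpy as np

def contextual_epsilon_greedy(env, n_contexts, k, steps=10000, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_contexts, k))            # one row of estimates per context
    N = np.zeros((n_contexts, k), dtype=int)

    for _ in range(steps):
        s = env.context()                    # which bandit we face this step (assumed API)
        if rng.random() < eps:
            a = int(rng.integers(k))         # explore
        else:
            a = int(np.argmax(Q[s]))         # exploit within the current context
        r = env.step(a)                      # reward depends on the current context
        N[s, a] += 1
        Q[s, a] += (r - Q[s, a]) / N[s, a]   # per-context incremental sample average
    return Q
```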