RL Chapter 2 Exercises
Solutions to the exercises from Chapter 2 of Sutton and Barto.
Yash Bonde . 2019-05-19 . 3 min read
Reinforcement Learning · Notes

NOTE: This part requires some basic understanding of calculus.

These are just my solutions to the book Reinforcement Learning: An Introduction; all the credit for the book goes to the authors and other contributors. Complete notes can be found here. If there are any problems with the solutions, or you have some ideas, ping me at bonde.yash97@gmail.com.

2.6 Mysterious Spikes

The Q-value update rule is $Q \leftarrow Q + \frac{1}{n}(R - Q)$, and the initial estimates are set optimistically to $Q_1(a) = +5$. Initially this results in a lot of exploration, because every action the agent tries is "disappointing" relative to the optimistic estimate. Once it has sampled all the actions, it briefly settles on the one that has looked best so far, which is often the optimal action $q_*(a)$, and the reward curve spikes. But the updates $(R - Q)$ are still negative at this point, so the estimates keep slipping and the agent switches away again, before the curve eventually starts to increase.
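This is easy to see by simulating it. Below is a minimal sketch, assuming a standard 10-armed Gaussian testbed with purely greedy selection and the sample-average update above; the seed, run length, and arm distribution are my own choices, not from the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_optimistic(q_init=5.0, n_arms=10, steps=1000):
    """Greedy agent with optimistic initial Q-values on a Gaussian bandit."""
    q_true = rng.normal(0.0, 1.0, n_arms)   # assumed 10-armed Gaussian testbed
    Q = np.full(n_arms, q_init)             # optimistic initial estimates
    N = np.zeros(n_arms)                    # action counts
    rewards = np.empty(steps)
    for t in range(steps):
        a = np.argmax(Q)                    # purely greedy selection
        r = rng.normal(q_true[a], 1.0)      # noisy reward for the chosen arm
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]           # sample-average update from the text
        rewards[t] = r
    return rewards

print(run_optimistic()[:20].round(2))       # early rewards show the initial churn
```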

2.7 Unbiased Constant Step-Size Trick

Sample averages are not a completely satisfactory solution because they do not perform well on non-stationary problems. One way to avoid this is to use a step size of $\beta_n \doteq \frac{\alpha}{\bar{o}_n}$ where

$$\bar{o}_n \doteq \bar{o}_{n-1} + \alpha(1 - \bar{o}_{n-1}), \qquad \bar{o}_0 \doteq 0 \tag{1}$$

We have to show that $Q_n$ is an exponential recency-weighted average without initial bias. The solution can be given in the following manner:

$$Q_{n+1} = Q_n + \beta_n(R_n - Q_n)$$

$$= (1 - \beta_n)Q_n + \beta_n R_n$$

$$= \left(1 - \frac{\alpha}{\bar{o}_n}\right)Q_n + \frac{\alpha}{\bar{o}_n}R_n$$

Multiplying both sides by $\bar{o}_n$,

$$\bar{o}_n Q_{n+1} = \alpha R_n + (\bar{o}_n - \alpha)Q_n \tag{2}$$

Now we substitute the value of $\bar{o}_n$ from $(1)$ into $(2)$:

$$\bar{o}_n Q_{n+1} = \alpha R_n + \left[\bar{o}_{n-1} + \alpha(1 - \bar{o}_{n-1}) - \alpha \right]Q_n$$

$$\bar{o}_n Q_{n+1} = \alpha R_n + (1 - \alpha)\bar{o}_{n-1}Q_n$$

The expansion of Qn+1Q_{n+1} can be given as

$$Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n}\alpha(1-\alpha)^{n-i}R_i \tag{3}$$

Note that $\bar{o}_n Q_{n+1}$ in $(2)$ obeys the same recursion as $Q_{n+1}$ in the constant-$\alpha$ case, with $\bar{o}_0 Q_1$ taking the place of $Q_1$. Unrolling it as in $(3)$ gives

$$\bar{o}_n Q_{n+1} = (1-\alpha)^n\bar{o}_0 Q_1 + \sum_{i=1}^{n}\alpha(1-\alpha)^{n-i}R_i$$

Since $\bar{o}_0 = 0$, the first term vanishes and

$$\bar{o}_n Q_{n+1} = \sum_{i=1}^{n}\alpha(1-\alpha)^{n-i}R_i$$
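One extra step the exercise leaves implicit: $(1)$ is the same recursion applied to a constant "reward" of $1$, so $\bar{o}_n$ has the closed form

$$\bar{o}_n = \sum_{i=1}^{n}\alpha(1-\alpha)^{n-i} = 1 - (1-\alpha)^n$$

which means the weights $\alpha(1-\alpha)^{n-i}/\bar{o}_n$ that $Q_{n+1}$ places on the rewards sum exactly to one.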

Thus we have shown that $Q_{n+1}$ carries no initial bias, since $Q_1$ drops out entirely; it is also an exponential recency-weighted average, as rewards further in the past are given exponentially less weight.
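As a quick numerical sanity check (my own addition, not part of the exercise), the sketch below applies the $\beta_n = \alpha/\bar{o}_n$ update to the same reward sequence starting from two very different initial values; the rewards, $\alpha$, and initial values are arbitrary.

```python
import numpy as np

def unbiased_constant_step(rewards, alpha=0.1, q_init=0.0):
    """Q-update with step size beta_n = alpha / o_bar_n (equation 1)."""
    Q, o_bar = q_init, 0.0
    for r in rewards:
        o_bar += alpha * (1.0 - o_bar)   # o_bar_1 = alpha, so beta_1 = 1
        beta = alpha / o_bar             # hence Q_2 = R_1 regardless of Q_1
        Q += beta * (r - Q)
    return Q

rewards = np.random.default_rng(1).normal(size=50)
# Two very different initial values give the same final estimate: no initial bias.
print(unbiased_constant_step(rewards, q_init=+123.0))
print(unbiased_constant_step(rewards, q_init=-7.0))
```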

2.8 Spikes in UCB Algorithm

The Upper Confidence Bound (UCB) action selection rule is

$$A_t \doteq \arg\max_a \left[ Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}} \right]$$

The spike occurs on the 11th step, and the problem is a 10-armed bandit. Initially $N_t(a) = 0$ for every action, and an action with $N_t(a) = 0$ is treated as a maximizing action, so the counts increment one by one as each of the 10 actions is taken once. On the 11th step all the actions have been tried, and the agent can finally take the action that has looked best so far, which produces the spike. (Note that $N_t(a) = 1$ for every action at this point.)

From the 12th step onwards, the estimates are still far from $q_*(a)$, and the growing uncertainty term forces the agent to diversify its choices again, so the reward subsequently decreases. However, since all the possible options have already been explored once, UCB gets a head start over the $\epsilon$-greedy method, and as $t \rightarrow \infty$ the rewards start to increase.
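For completeness, here is a minimal sketch of UCB action selection on the same kind of assumed 10-armed Gaussian testbed as above; the choice $c = 2$, the handling of untried arms, and the run length are my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def run_ucb(c=2.0, n_arms=10, steps=1000):
    """UCB selection; an arm with N == 0 counts as maximizing and is tried first."""
    q_true = rng.normal(0.0, 1.0, n_arms)
    Q = np.zeros(n_arms)
    N = np.zeros(n_arms)
    rewards = np.empty(steps)
    for t in range(1, steps + 1):
        untried = np.flatnonzero(N == 0)
        if untried.size:
            a = untried[0]                                 # first 10 steps: one pull per arm
        else:
            a = np.argmax(Q + c * np.sqrt(np.log(t) / N))  # UCB score
        r = rng.normal(q_true[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                          # sample-average update
        rewards[t - 1] = r
    return rewards

print(run_ucb()[:12].round(2))   # step 11 is the first fully informed (greedy-looking) pick
```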

2.9 Softmax and Sigmoid

Let there be two actions $\{A_1, A_2\}$ with corresponding preferences $\{a_1, a_2\}$. The logistic (sigmoid) probability of selecting $A_1$ can then be given as

$$\mathrm{sigmoid}(a_1) = \frac{1}{1 + e^{-a_1}}$$

The winning case is when $\Pr(A_1) > \Pr(A_2)$, i.e. when $a_1 > a_2$. Similarly, the softmax probability can be given as

$$\mathrm{softmax}(a_1) = \frac{e^{a_1}}{e^{a_1} + e^{a_2}}$$

$$\mathrm{softmax}(a_1) = \frac{1}{1 + e^{a_2 - a_1}}$$

Here too the probability of selecting $A_1$ exceeds $1/2$ exactly when $a_1 > a_2$; in fact $\mathrm{softmax}(a_1) = \mathrm{sigmoid}(a_1 - a_2)$. Thus softmax and the logistic function behave in the same way for a two-action problem.
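A quick numerical check (my own addition) that the two expressions agree; the preference values are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(prefs):
    e = np.exp(prefs - np.max(prefs))   # shift for numerical stability
    return e / e.sum()

a1, a2 = 0.7, -1.3                      # arbitrary preferences for the two actions
print(softmax(np.array([a1, a2]))[0])   # probability of selecting A_1 under softmax
print(sigmoid(a1 - a2))                 # same value from the logistic function
```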