NOTE: This part requires some basic understanding of calculus.
These are just my solutions to the book Reinforcement Learning: An Introduction; all the credit for the book goes to the authors and other contributors. Complete notes can be found here. If there are any problems with the solutions, or if you have some ideas, ping me at bonde.yash97@gmail.com.
6.1 Changing Value function
We are given the TD update equation as

$$V(S_t) \leftarrow V(S_t) + \alpha\big[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big]$$

and since the value function now changes during the episode, we can write it with a time subscript, treating $V_t$ as the array of values in use at step $t$:

$$V_{t+1}(S_t) = V_t(S_t) + \alpha\delta_t, \qquad V_{t+1}(s) = V_t(s) \;\text{ for } s \neq S_t$$

and we write the error that is applied on state $S_t$ as

$$\delta_t = R_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t)$$

we also know that

$$G_t = R_{t+1} + \gamma G_{t+1}$$

and, keeping track of which version of $V$ is used at each step (the first update applied is the one on state $S_t$), we get the equation

$$G_t - V_t(S_t) = \delta_t + \gamma\big(G_{t+1} - V_{t+1}(S_{t+1})\big) + \gamma\big(V_{t+1}(S_{t+1}) - V_t(S_{t+1})\big)$$

So after expanding and collecting terms we get the equation

$$G_t - V_t(S_t) = \sum_{k=t}^{T-1} \gamma^{k-t}\delta_k + \sum_{k=t}^{T-1} \gamma^{k-t+1}\big(V_{k+1}(S_{k+1}) - V_k(S_{k+1})\big)$$

So the additional error added is $\sum_{k=t}^{T-1} \gamma^{k-t+1}\big(V_{k+1}(S_{k+1}) - V_k(S_{k+1})\big)$, where each term equals $\gamma^{k-t+1}\alpha\delta_k$ when $S_{k+1} = S_k$ and zero otherwise.
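To sanity-check this identity, here is a small Python sketch. The trajectory, rewards, $\gamma$ and $\alpha$ are made up for illustration; the state is revisited once ($S_1 = S_0$) so the extra correction term is nonzero:

```python
# Numerical check of the Exercise 6.1 identity on a tiny hand-made episode.
gamma, alpha = 0.9, 0.5
states  = ["A", "A", "B"]          # S_0, S_1, S_2 (terminal after B)
rewards = [1.0, 2.0, 3.0]          # R_1, R_2, R_3
V = {"A": 0.0, "B": 0.0}

deltas, snapshots = [], []
T = len(states)
for k in range(T):
    s = states[k]
    next_v = V[states[k + 1]] if k + 1 < T else 0.0   # terminal value is 0
    snapshots.append(dict(V))                          # V_k: values before update k
    delta = rewards[k] + gamma * next_v - V[s]
    deltas.append(delta)
    V[s] += alpha * delta                              # online TD(0) update
snapshots.append(dict(V))                              # V_T

# Monte Carlo return G_0 and the left-hand side G_0 - V_0(S_0)
G0 = sum(gamma ** k * r for k, r in enumerate(rewards))
lhs = G0 - snapshots[0][states[0]]

# Discounted sum of TD errors plus the correction from V changing mid-episode
td_sum = sum(gamma ** k * d for k, d in enumerate(deltas))
extra = sum(
    gamma ** (k + 1) * (snapshots[k + 1][states[k + 1]] - snapshots[k][states[k + 1]])
    for k in range(T - 1)   # V(S_T) = 0 throughout, so the k = T-1 term vanishes
)
print(lhs, td_sum + extra)   # the two sides agree
```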
6.2 TD vs. MC
I think a lot of problems that require an estimate over long sequences, where the estimate must be updated as the task continues, would benefit from TD learning. E.g., we all love to eat and often have a tendency to overeat. Say we go to a buffet and our priority is to eat a diversity of food while enjoying the good dishes more: after every plate we should revise our estimate of how much appetite is left, rather than waiting for the meal to end. This is a setup where TD learning would be beneficial.
6.3 Analysis of random walks
We know that in the first episode only $V(A)$ changed, which means the random walk reached $A$ and from there went to the terminal state on the left; otherwise the values of other states would have changed as well. So, because the walk ended on the terminal state to the left of $A$, only $V(A)$ changed. We can calculate the new value very easily:

$$V(A) \leftarrow V(A) + \alpha\big[R + \gamma V(\text{terminal}) - V(A)\big]$$

Now we have $V(A) = 0.5$, $\alpha = 0.1$ and $R = 0$, so we get $V(A) = 0.5 + 0.1\,(0 + 0 - 0.5) = 0.45$, a decrease of $0.05$. Similarly, if the walk had gone the other way and ended on the right terminal from $E$, only $V(E)$ would have changed: $V(E) = 0.5 + 0.1\,(1 + 0 - 0.5) = 0.55$.
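The same arithmetic in code, using the book's $\alpha = 0.1$ and the $0.5$ initial values:

```python
# Exercise 6.3: first-episode update for the 5-state random walk
# (alpha = 0.1, all values initialised to 0.5, reward +1 only on the right).
alpha = 0.1
V_A = 0.5 + alpha * (0 + 0 - 0.5)   # episode ends on the LEFT terminal from A
V_E = 0.5 + alpha * (1 + 0 - 0.5)   # had it ended on the RIGHT terminal from E
print(V_A, V_E)   # approximately 0.45 and 0.55
```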
6.4 Analysis of random walks
It is clear that with a wider set of $\alpha$'s TD would still be a better algorithm than MC. As seen in the figure, TD achieves good performance at small step sizes, while MC starts to deteriorate as $\alpha$ grows. TD learning still occurs when $\alpha$ is very small, but it becomes slower, taking more episodes to converge; so individual fixed-$\alpha$ curves shift, but the relative ordering of the two methods would not change.
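As a rough experiment (my own choice of step sizes and episode count, not from the book), one can sweep a wider range of $\alpha$'s on the same 5-state walk and compare RMS errors:

```python
import random

# Sketch: RMS error of TD(0) and constant-alpha MC on the 5-state random walk.
random.seed(0)
TRUE = [1/6, 2/6, 3/6, 4/6, 5/6]          # true values of states A..E

def run_episode():
    s, traj = 2, []                        # start at C (index 2)
    while 0 <= s <= 4:
        s2 = s + random.choice((-1, 1))
        r = 1.0 if s2 == 5 else 0.0        # +1 only on exiting right
        traj.append((s, r, s2))
        s = s2
    return traj

def rms_error(alpha, method, episodes=100):
    V = [0.5] * 5
    for _ in range(episodes):
        traj = run_episode()
        if method == "TD":
            for s, r, s2 in traj:
                v_next = V[s2] if 0 <= s2 <= 4 else 0.0
                V[s] += alpha * (r + v_next - V[s])
        else:
            # Constant-alpha MC: rewards occur only at termination, so the
            # undiscounted return from every visited state equals the final reward.
            G = sum(r for _, r, _ in traj)
            for s, _, _ in traj:
                V[s] += alpha * (G - V[s])
    return (sum((v - t) ** 2 for v, t in zip(V, TRUE)) / 5) ** 0.5

for alpha in (0.01, 0.05, 0.2, 0.5):
    print(alpha, rms_error(alpha, "TD"), rms_error(alpha, "MC"))
```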
6.5 Analysis of random walks
We can see that an increase in $\alpha$ increases the MSE over later episodes, which likely happens because the update noise is too high: with a large constant step size the value estimate starts to oscillate around the true value function instead of settling down on it.
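The oscillation intuition can be isolated deterministically: run the constant-$\alpha$ update against an alternating 0/1 target (a crude stand-in for noisy returns; this toy construction is mine, not the book's). The steady-state swing works out to $\alpha/(2-\alpha)$, which grows with $\alpha$:

```python
# Deterministic illustration of why a large constant step size oscillates.
def steady_swing(alpha, iters=1000):
    v = 0.5
    hi = lo = v
    for i in range(iters):
        target = 1.0 if i % 2 == 0 else 0.0   # alternating "noisy" target
        v += alpha * (target - v)             # constant-alpha update
        hi, lo = (v, lo) if i % 2 == 0 else (hi, v)
    return hi - lo          # converges to alpha / (2 - alpha)

print(steady_swing(0.05), steady_swing(0.5))  # bigger alpha, bigger swing
```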
6.8 Sarsa Convergence
The Sarsa update has exactly the same form as the TD(0) update, so we can define the action-value TD error as

$$\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)$$

and, assuming $Q$ does not change during the episode (and $Q$ of the terminal state is zero), write

$$G_t - Q(S_t, A_t) = R_{t+1} + \gamma G_{t+1} - Q(S_t, A_t) + \gamma Q(S_{t+1}, A_{t+1}) - \gamma Q(S_{t+1}, A_{t+1}) = \delta_t + \gamma\big(G_{t+1} - Q(S_{t+1}, A_{t+1})\big)$$

we can now continue to substitute the values and see the expansion as

$$G_t - Q(S_t, A_t) = \delta_t + \gamma\delta_{t+1} + \gamma^2\delta_{t+2} + \cdots = \sum_{k=t}^{T-1}\gamma^{k-t}\delta_k$$

which is the action-value analogue of (6.6).
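A quick numerical confirmation of the telescoping sum, on a made-up episode with $Q$ held fixed:

```python
# With Q fixed during the episode, G_0 - Q(S_0, A_0) equals the discounted
# sum of Sarsa TD errors. States, actions, values and rewards are illustrative.
gamma = 0.9
Q = {("s0", "a"): 0.3, ("s1", "b"): -0.2, ("s2", "a"): 0.7}
sa_pairs = [("s0", "a"), ("s1", "b"), ("s2", "a")]   # (S_0,A_0) .. (S_2,A_2)
rewards  = [1.0, -1.0, 2.0]                           # R_1, R_2, R_3; then terminal

T = len(sa_pairs)
deltas = []
for k in range(T):
    q_next = Q[sa_pairs[k + 1]] if k + 1 < T else 0.0  # Q of terminal is 0
    deltas.append(rewards[k] + gamma * q_next - Q[sa_pairs[k]])

G0 = sum(gamma ** k * r for k, r in enumerate(rewards))
lhs = G0 - Q[sa_pairs[0]]
rhs = sum(gamma ** k * d for k, d in enumerate(deltas))
print(lhs, rhs)   # the two agree
```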
6.11 Q-learning
There is a subtle difference between Sarsa and Q-learning: Sarsa's update uses the value of the next action actually taken, $Q(S_{t+1}, A_{t+1})$, whereas Q-learning's update uses $\max_a Q(S_{t+1}, a)$, so only the state matters ($S_{t+1}$) and a new action is selected at each step independently. Thus learning happens independently of which action the behaviour policy actually takes at $S_{t+1}$, which is what makes Q-learning off-policy.
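The difference fits in two lines of code (illustrative numbers; `a1`/`a2` are hypothetical actions):

```python
# Same transition, different update targets: Sarsa uses the action actually
# taken at S', Q-learning uses the max over actions at S'.
gamma = 1.0
Q_next = {"a1": 2.0, "a2": 5.0}   # action values at S_{t+1} (illustrative)
R = 1.0
A_next = "a1"                      # the behaviour policy happened to pick a1

sarsa_target  = R + gamma * Q_next[A_next]         # 1 + 2 = 3
qlearn_target = R + gamma * max(Q_next.values())   # 1 + 5 = 6
print(sarsa_target, qlearn_target)
```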
6.12 Q-learning Sarsa
The question is: if we make the action selection greedy, are the two algorithms then the same? The answer is no. Consider a state $s$ with, say, $Q(s, a_1) = 1.0$ and $Q(s, a_2) = 0.9$ (illustrative values); greedy action selection gives $A_t = a_1$. Suppose the transition loops back to the same state, $S_{t+1} = S_t = s$. On observing the reward and new state we update the value of this state; say we got a large negative reward, then after the update the values may become $Q(s, a_1) < Q(s, a_2)$.

Action: In Sarsa the next action $A_{t+1}$ is selected before $Q(S_t, A_t)$ is updated, so it still picks $a_1$ and the action cannot change; in Q-learning the value is updated first, and the subsequent greedy selection will not choose $a_1$. Thus, no, the actions taken by the two will not always be the same.

Weight Update: The updates will be the same, as we basically have the same variables: under greedy selection $Q(S_{t+1}, A_{t+1}) = \max_a Q(S_{t+1}, a)$, so if both algorithms take the same action $A_t$ and observe the same $R_{t+1}$ and $S_{t+1}$, we do get the same weight updates.
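A sketch of this scenario in Python (the numbers and the self-loop transition are my own illustration):

```python
# Greedy Sarsa vs Q-learning on a transition that loops back to the same state
# with a large negative reward: the updates are IDENTICAL, but the next
# actions DIFFER, because Sarsa picks A' before updating Q(S, A) while
# Q-learning picks it after.
alpha, gamma = 0.5, 1.0

def greedy(Q):
    return max(Q, key=Q.get)

R = -10.0
Q_sarsa  = {"a1": 1.0, "a2": 0.9}   # illustrative action values at state s
Q_qlearn = {"a1": 1.0, "a2": 0.9}
A = "a1"                             # both start with the greedy action

# Sarsa: choose A' from the CURRENT values, then update Q(s, A)
A_sarsa = greedy(Q_sarsa)
Q_sarsa[A] += alpha * (R + gamma * Q_sarsa[A_sarsa] - Q_sarsa[A])

# Q-learning: update Q(s, A) with the max first, then choose the next action
Q_qlearn[A] += alpha * (R + gamma * max(Q_qlearn.values()) - Q_qlearn[A])
A_qlearn = greedy(Q_qlearn)

print(Q_sarsa, Q_qlearn)        # identical tables after the update
print(A_sarsa, A_qlearn)        # a1 vs a2: the selected actions differ
```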