NOTE: This part requires some basic understanding of calculus.
These are just my solutions to the book Reinforcement Learning: An Introduction; all credit for the book goes to the authors and other contributors. Complete notes can be found here. If there are any problems with the solutions, or you have some ideas, ping me at bonde.yash97@gmail.com.

Grid for problems 4.1 - 4.3
4.1 Q-Values
qπ(11,down)=−1+vπ(T)=−1+0=−1
qπ(7,down)=−1+vπ(11)=−1−14=−15
4.2 State Values
vπ(15)=−1+0.25(vπ(12)+vπ(13)+vπ(14)+vπ(15))=−1+0.25(−22−20−14+vπ(15))
0.75vπ(15)=−15⟹vπ(15)=−20
No, the changed dynamics of state 13 do not change the value of state 15: it is still just as far from the terminal states, and re-solving the self-consistent equation gives the same vπ(15)=−20.
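The self-consistent equation above can be checked numerically by fixed-point iteration; a minimal sketch, plugging in the book's gridworld values vπ(12)=−22, vπ(13)=−20, vπ(14)=−14:

```python
# Solve v15 = -1 + 0.25 * (v12 + v13 + v14 + v15) by fixed-point iteration;
# convergence is guaranteed since the coefficient on v15 is 0.25 < 1.
v12, v13, v14 = -22.0, -20.0, -14.0
v15 = 0.0
for _ in range(100):
    v15 = -1 + 0.25 * (v12 + v13 + v14 + v15)
print(round(v15, 6))  # -20.0
```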
4.3 Q-Value Functions
qπ(s,a)=E[Gt∣St=s,At=a]
qπ(s,a)=E[Rt+1+γGt+1∣St=s,At=a]
qπ(s,a)=E[Rt+1+γqπ(St+1,At+1)∣St=s,At=a]
qπ(s,a)=∑s′,rp(s′,r∣s,a)[r+γ∑a′π(a′∣s′)qπ(s′,a′)]
qk+1(s,a)=∑s′,rp(s′,r∣s,a)[r+γ∑a′π(a′∣s′)qk(s′,a′)]
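The final update rule can be sketched as a single expected backup; a minimal sketch, where `P[s][a]` (a hypothetical name) is assumed to be a list of `(prob, next_state, reward, done)` tuples and `pi[s][a]` a table of action probabilities:

```python
def q_backup(P, pi, q, s, a, gamma=1.0):
    """One expected update:
    q_{k+1}(s,a) = sum_{s',r} p(s',r|s,a) * [r + gamma * sum_{a'} pi(a'|s') q_k(s',a')]
    """
    total = 0.0
    for prob, s_next, r, done in P[s][a]:
        # Terminal successors contribute no future value.
        future = 0.0 if done else sum(
            pi[s_next][a2] * q[s_next][a2] for a2 in range(len(q[s_next]))
        )
        total += prob * (r + gamma * future)
    return total
```

Sweeping this backup over every (s, a) pair gives one iteration of policy evaluation for q-values.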
4.4 Subtle bug
The policy iteration algorithm has a subtle bug as follows. Imagine we are in a state s where two actions a1 and a2 are equally good under π(a∣s), for example both lead to the same state s′ (assume it is terminal and there are multiple ways to reach it). In this case the improvement step can keep switching between a1 and a2, so the policy oscillates and the algorithm may never terminate.
π(s)←argmaxa∑s′,rp(s′,r∣s,a)[r+γV(s′)]∀s∈S
To fix this we replace the second-to-last line of the algorithm, i.e. "if old-action=π(s) then policy-stable←true", because that check fails when the policy oscillates between equally good actions. The way to solve this is to compare values instead of actions: if the state values are unchanged after improvement, the policy is stable.
if vπ′(s)=vπ(s)∀s∈S then policy-stable←true
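The value-based check can be sketched as follows; a minimal sketch, assuming hypothetical arrays `V_old` and `V` holding the state values before and after the improvement step:

```python
def policy_stable_by_value(V_old, V, tol=1e-9):
    """Declare the policy stable when improvement leaves every state value
    unchanged (up to tolerance), even if the greedy action itself flipped
    between equally good alternatives."""
    return all(abs(V_old[s] - V[s]) <= tol for s in range(len(V)))
```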
4.5 Action Value Algorithm
The problem is to convert the policy iteration algorithm from v∗ to q∗. This can be done by adding an inner loop over all a∈A(s) in step 2 (Policy Evaluation):
q(s,a)←Q(s,a)
Q(s,a)←∑s′,rp(s′,r∣s,a)[r+γ∑a′π(a′∣s′)Q(s′,a′)]
Δ←max(Δ,∣q(s,a)−Q(s,a)∣)
4.6 Epsilon-soft policy
The problem is to modify the policy iteration algorithm when the policy is epsilon-soft, i.e. the minimum probability of any action is ϵ/∣A(s)∣. Since the policy is now stochastic, Policy Evaluation must average over all actions:
v(s)←V(s)
V(s)←∑aπ(a∣s)∑s′,rp(s′,r∣s,a)[r+γV(s′)]
In Policy Improvement, instead of acting greedily, the greedy action gets probability 1−ϵ+ϵ/∣A(s)∣ and every other action gets ϵ/∣A(s)∣.
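The modified improvement step can be sketched per state; a minimal sketch, assuming `q_s` holds the action values for one state:

```python
def eps_soft_improve(q_s, eps):
    """Return epsilon-soft policy probabilities for one state: the greedy
    action gets 1 - eps + eps/|A(s)|, every other action gets eps/|A(s)|."""
    n = len(q_s)
    greedy = max(range(n), key=lambda a: q_s[a])
    probs = [eps / n] * n
    probs[greedy] += 1.0 - eps
    return probs
```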
4.8 Gambler's Problem - 1
The reason it bets all the money at a capital of 50 is that a single all-in bet from there wins the whole game with probability ph. Since the coin is unfavourable (ph&lt;0.5), every additional flip compounds the disadvantage, so when one flip can decide the game the policy risks everything at once. The value function here is exactly the probability of winning from each capital, and at 50 the all-in bet achieves the best such probability.
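This can be checked with value iteration on the gambler's problem; a minimal sketch, assuming capital states 1..99, a goal of 100, and heads probability `ph` (0.4 as in the figure):

```python
def gambler_value_iteration(ph=0.4, theta=1e-9):
    """Value iteration for the gambler's problem; V[s] approximates the
    probability of reaching capital 100 starting from capital s."""
    V = [0.0] * 101
    V[100] = 1.0  # reaching the goal counts as reward 1
    while True:
        delta = 0.0
        for s in range(1, 100):
            # Stakes range from 1 up to min(s, 100 - s).
            best = max(
                ph * V[s + a] + (1 - ph) * V[s - a]
                for a in range(1, min(s, 100 - s) + 1)
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V
```

With ph=0.4 this yields V[50]=0.4, exactly the value of betting everything once at capital 50.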

Value Distribution for ph=0.4
4.9 Gambler's Problem - 2

Value Distribution for ph=0.25

Value Distribution for ph=0.55
No, the results are not stable as θ→0: the Δ value starts to increase and learning deteriorates. I am unable to generate the optimal policies; I suspect it is because of floating-point precision.