For this post, we will assume the policy π is deterministic.
To find an optimal policy, the key is to find the true state-value function V(s).
There are two methods to achieve this goal:
Policy-Iteration
Value-Iteration
Policy Iteration
1. Randomly initialize the policy, and initialize the state-value function to zero.
2. Calculate the improved policy from the current state-value function.
3. Apply the policy and recalculate the value function.
4. Repeat steps 2 and 3 until the policy no longer changes.
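As a preview, here is a minimal sketch of this loop for a small tabular MDP. The array names `P` (with `P[s, a, s']` = p(s'|s,a)), `R` (with `R[s, a]` = R(s,a)) and the discount `gamma` are illustrative assumptions, not part of the original post; policy evaluation is approximated with a fixed number of backups.

```python
import numpy as np

def policy_iteration(P, R, gamma, eval_iters=500):
    """Policy iteration on a tabular MDP.
    Assumes P[s, a, s2] = p(s2 | s, a) and R[s, a] = R(s, a)."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)   # arbitrary initial policy
    V = np.zeros(n_states)                   # state-value function initialized to zero
    idx = np.arange(n_states)
    while True:
        # Evaluate the current policy: V(s) = R(s, pi(s)) + gamma * sum_s' p(s'|s,pi(s)) V(s')
        for _ in range(eval_iters):
            V = R[idx, policy] + gamma * P[idx, policy] @ V
        # Improve the policy: act greedily with respect to Q(s, a)
        Q = R + gamma * P @ V                # shape (n_states, n_actions)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V                 # policy no longer changes -> done
        policy = new_policy
```

The evaluation and improvement steps used here are unpacked in the sections below.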
Value Iteration
1. Update the state-value function using the current state-value function.
2. Repeat step 1 until the state-value function converges.
3. Calculate the policy from the final state-value function.
Policy Improvement with Policy-Iteration
Let's define an action-state value function Q(s,a):
$$Q^{\pi}(s,a) = R(s,a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V^{\pi}(s')$$
The action-state value function gives the expected return if we take action a in a given state s and then follow policy π.
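As a small sketch (using the same hypothetical `P`, `R`, `gamma` arrays as above), Q can be computed for all state-action pairs at once from a value estimate V:

```python
import numpy as np

def q_from_v(P, R, V, gamma):
    """Q(s, a) = R(s, a) + gamma * sum_{s'} p(s' | s, a) * V(s')."""
    # P @ V sums over the trailing s' axis, giving a (n_states, n_actions) array.
    return R + gamma * P @ V
```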
To compute the improved policy for a given state s, we calculate the following:
$$\pi_{i+1}(s) = \arg\max_{a} Q^{\pi_i}(s,a)$$
The policy is updated to pick the action with the highest action value, which can only increase the state-value function.
// Pseudo code for the policy improvement step of policy iteration
1. policy_{i+1} = policy_{i}                // copy the previous policy
2. policy_{i+1}(s) = argmax_{a} Q(s, a)     // partially update the policy at state s
// Repeating 1. and 2. keeps improving the policy
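A vectorized version of this step might look as follows (again assuming the hypothetical `P`, `R`, `gamma` arrays; `V` is the value estimate of the current policy $\pi_i$):

```python
import numpy as np

def improve_policy(P, R, V, gamma):
    """pi_{i+1}(s) = argmax_a Q^{pi_i}(s, a), computed for every state at once."""
    Q = R + gamma * P @ V      # Q^{pi_i}(s, a) under the current value estimate V = V^{pi_i}
    return Q.argmax(axis=1)    # greedy (deterministic) policy: one action index per state
```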
Proof of the Policy-Iteration algorithm
So we are going to keep improving the policy $\pi_i$.
But how can we ensure that partially updating the policy will keep improving the overall value?
Definition of the term "Policy is improved"
$$V^{\pi_1} \ge V^{\pi_2} \iff V^{\pi_1}(s) \ge V^{\pi_2}(s) \quad \forall s \in S$$
We say the policy is improved if the inequality above is satisfied.
Starting from the current policy $\pi_i$, we can bound $V^{\pi_i}(s)$ step by step:

$$
\begin{aligned}
V^{\pi_i}(s) &= R(s,\pi_i(s)) + \gamma \sum_{s' \in S} p(s' \mid s, \pi_i(s))\, V^{\pi_i}(s') && (1) \\
&\le \max_{a}\Big[R(s,a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V^{\pi_i}(s')\Big] && (2) \\
&= \max_{a} Q^{\pi_i}(s,a) && (3) \\
&= R(s,\pi_{i+1}(s)) + \gamma \sum_{s' \in S} p(s' \mid s, \pi_{i+1}(s))\, V^{\pi_i}(s') && (4) \\
&\le R(s,\pi_{i+1}(s)) + \gamma \sum_{s' \in S} p(s' \mid s, \pi_{i+1}(s)) \max_{a'} Q^{\pi_i}(s',a') && (5) \\
&\le \cdots \le V^{\pi_{i+1}}(s) && (6)
\end{aligned}
$$

From (1) to (2), we picked the action a that maximizes the value. By the definition of Q, (3) is exactly (2). In (4), we created a new policy $\pi_{i+1}(s)$ by picking the action that maximizes (3); note that all future actions are still chosen by $\pi_i$. If we keep expanding the equation in the same way and always pick the action that maximizes $Q^{\pi_i}(\cdot,\cdot)$, then by (6) every $\pi_i$ has been replaced by $\pi_{i+1}$, and we can say that the new policy $\pi_{i+1}$ is better than (or equal to) the previous policy.
Getting the Optimal Policy with Value Iteration
In the Value-Iteration method, we iteratively update the state-value function.
Since the value is computed over multiple timesteps, the update can be expressed with the Bellman equation:
$$V(s) = R(s) + \gamma \sum_{s' \in S} P(s' \mid s)\, V(s') \qquad \ldots (7)$$
Let's define a Bellman backup operator B as follows:
$$BV(s) = \max_{a}\Big[R(s,a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V(s')\Big]$$
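In code, the backup operator is a single line on a tabular MDP (same hypothetical `P`, `R`, `gamma` arrays as before):

```python
import numpy as np

def bellman_backup(P, R, V, gamma):
    """B V(s) = max_a [ R(s, a) + gamma * sum_{s'} p(s' | s, a) V(s') ]."""
    return (R + gamma * P @ V).max(axis=1)   # max over the action axis -> shape (n_states,)
```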
Then, using the Bellman equation (7) and the backup operator B, we can build an algorithm that computes the state-value function:
// Pseudo code for the Value Iteration algorithm
1. Initialize the state-value function to zero: V_0(s) = 0 for all s
2. Calculate the next state-value function by applying the backup operator: V_{k+1}(s) = B V_k(s)
3. Repeat step 2 until the state-value function converges
4. Extract the policy from the final state-value function
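A minimal, self-contained sketch of steps 1-3, with the Bellman backup written inline (the policy extraction of step 4 is shown in the next section); the `P`, `R`, `gamma` arrays are the same illustrative assumptions as before:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate V_{k+1} = B V_k until the sup-norm change is below tol."""
    n_states = P.shape[0]
    V = np.zeros(n_states)                          # step 1: V_0(s) = 0 for all s
    while True:
        V_next = (R + gamma * P @ V).max(axis=1)    # step 2: one Bellman backup
        if np.max(np.abs(V_next - V)) < tol:        # step 3: stop when V has converged
            return V_next
        V = V_next
```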
Extracting policy from final state-value function
Once we have the final state-value function, we can extract the policy from it!
$$\pi(s) = \arg\max_{a}\Big[R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V_{\text{final}}(s')\Big]$$
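With the same hypothetical arrays, the extraction is a single greedy step over the Q-values built from $V_{\text{final}}$:

```python
import numpy as np

def extract_policy(P, R, V_final, gamma):
    """pi(s) = argmax_a [ R(s, a) + gamma * sum_{s'} P(s' | s, a) V_final(s') ]."""
    return (R + gamma * P @ V_final).argmax(axis=1)
```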
Proof that the state-value function converges in the Value Iteration method
How do we know that iterating the value-iteration update will make the state-value function converge to a fixed point?
This can be proved as follows:
Let's define a notion of distance between state-value functions (the sup-norm):
$$\| V - V' \| = \max_{s} \big| V(s) - V'(s) \big|$$
Our goal is to prove the following:
$$\| B V_j - B V_k \| \le \gamma \| V_j - V_k \|$$
Contraction Operator
Let O be an operator, and let $\|x\|$ denote a norm of x.
If $\|OV - OV'\| \le \gamma \|V - V'\|$ for some $0 \le \gamma < 1$, then O is a contraction operator.
As a result, every time we apply the Bellman backup operator, the gap between two consecutive state-value functions shrinks by at least a factor of $\gamma$. Since $\gamma < 1$, this shows that the Value Iteration method makes the state-value function V converge.
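As an informal sanity check (not a proof), we can verify the contraction inequality numerically on a randomly generated MDP; the sizes and array layouts below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random tabular MDP: normalize transitions so each p(. | s, a) is a distribution.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)
R = rng.random((n_states, n_actions))

def bellman_backup(V):
    return (R + gamma * P @ V).max(axis=1)

# Check ||B V_j - B V_k|| <= gamma * ||V_j - V_k|| in the sup-norm.
V_j = rng.normal(size=n_states)
V_k = rng.normal(size=n_states)
lhs = np.max(np.abs(bellman_backup(V_j) - bellman_backup(V_k)))
rhs = gamma * np.max(np.abs(V_j - V_k))
print(f"{lhs:.4f} <= {rhs:.4f}: {lhs <= rhs}")
```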