Policy-Improvement Algorithm

Summary

For this post, we will assume the policy $\pi$ is deterministic.

To find an optimal policy, the key is to find the true state-value function.

There are two methods to achieve the goal.

  1. Policy-Iteration

  2. Value-Iteration

Policy Iteration

  1. Randomly initialize the policy, and initialize the state-value function to zero.

  2. Calculate an improved policy from the current state-value function.

  3. Apply the new policy and recalculate the state-value function.

  4. Repeat steps 2 and 3 until the policy no longer changes.

Value Iteration

  1. Update the state-value function using the current state-value function (a Bellman backup).

  2. Iterate step 1 until the state-value function converges.

  3. Calculate the policy from the final state-value function.

Policy Improvement with Policy-Iteration

Let's define the state-action value function $Q(s, a)$:

$$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a) \, V^\pi(s')$$

The state-action value function gives the expected return when we take action $a$ in state $s$ and then follow the policy $\pi$.
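As a small concrete illustration (not from the original post), here is how $Q^\pi$ could be computed for a tabular MDP stored as NumPy arrays. The names R, P, V, and gamma are assumptions for this sketch: R[s, a] holds rewards, P[s, a, s'] holds transition probabilities, and V is the current estimate of $V^\pi$.

```python
import numpy as np

# Hypothetical tabular MDP with 4 states and 2 actions (placeholder values).
S, A, gamma = 4, 2, 0.9
R = np.zeros((S, A))                 # R[s, a]
P = np.full((S, A, S), 1.0 / S)      # P[s, a, s'], each row sums to 1
V = np.zeros(S)                      # current estimate of V^pi

# Q[s, a] = R(s, a) + gamma * sum_{s'} p(s' | s, a) * V(s')
Q = R + gamma * P @ V                # shape (S, A)
```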

To compute the improved policy at a given state $s$, we calculate the following:

$$\pi_{i+1}(s) = \underset{a}{\operatorname{argmax}} \, Q^{\pi_i}(s, a)$$

The policy is updated greedily: at each state we pick the action with the largest state-action value under the current value estimate.

// Pseudo code for the policy improvement step
1. policy_{i+1} = policy_{i}                        // copy the previous policy

2. policy_{i+1}(s) = argmax_{a} Q^{policy_i}(s, a)  // partially update the policy at state s

// Repeating 1. and 2. (re-evaluating V in between) keeps improving the policy
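For a runnable version of the whole loop (policy evaluation followed by the improvement step above), here is a minimal NumPy sketch under the same assumed tabular representation; the function name policy_iteration and the exact linear-system evaluation are my own choices, not something specified in this post.

```python
import numpy as np

def policy_iteration(R, P, gamma=0.9):
    """Tabular policy iteration. R: (S, A) rewards, P: (S, A, S) transitions."""
    S, A = R.shape
    policy = np.zeros(S, dtype=int)              # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi V exactly.
        R_pi = R[np.arange(S), policy]           # shape (S,)
        P_pi = P[np.arange(S), policy]           # shape (S, S)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Policy improvement: greedy with respect to Q^pi.
        Q = R + gamma * P @ V                    # shape (S, A)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):   # policy stopped changing
            return policy, V
        policy = new_policy
```

Here the evaluation step solves the Bellman equation for the current policy exactly; iterating the evaluation until it stabilizes, as in the summary above, would work just as well.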

Proof of Policy-iteration algorithm

So we are going to keep improving the policy $\pi_i$.

But how can we ensure that partially updating the policy will keep improving the overall value?

Definition of the term "Policy is improved"

$$V^{\pi_1} \ge V^{\pi_2} : \quad V^{\pi_1}(s) \ge V^{\pi_2}(s) \quad \forall s$$

We say the policy is improved if the above inequality is satisfied for every state.

Proof

Recall that we assume the policy is deterministic.

$$
\begin{aligned}
V^{\pi_i}(s) &= R(s, \pi_i(s)) + \gamma \sum_{s' \in S} p(s' \mid s, \pi_i(s)) \, V^{\pi_i}(s') \qquad (1) \\
&\le \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a) \, V^{\pi_i}(s') \right] \qquad (2) \\
&= \max_a \, Q^{\pi_i}(s, a) \qquad (3) \\
&= R(s, \pi_{i+1}(s)) + \gamma \sum_{s' \in S} p(s' \mid s, \pi_{i+1}(s)) \, V^{\pi_i}(s') \qquad (4) \\
&\le R(s, \pi_{i+1}(s)) + \gamma \sum_{s' \in S} p(s' \mid s, \pi_{i+1}(s)) \, \max_{a'} \left[ R(s', a') + \gamma \sum_{s'' \in S} p(s'' \mid s', a') \, V^{\pi_i}(s'') \right] \qquad (5) \\
&= R(s, \pi_{i+1}(s)) + \gamma \sum_{s' \in S} p(s' \mid s, \pi_{i+1}(s)) \, \max_{a'} \, Q^{\pi_i}(s', a') \\
&\;\; \vdots \\
&\le V^{\pi_{i+1}}(s) \qquad (6)
\end{aligned}
$$

From $(1)$ to $(2)$, we pick the action $a$ that maximizes the value.

We can see that $(2)$ is, by definition, exactly $\max_a Q^{\pi_i}(s, a)$, which is $(3)$.

In $(4)$, we define the new policy $\pi_{i+1}(s)$ as the action that attains the maximum in $(3)$. Note that all future actions are still chosen by $\pi_i$.

In $(5)$ we apply the same bound again at $s'$. Repeating this expansion replaces every $\pi_i$ by $\pi_{i+1}$, which gives $V^{\pi_{i+1}}(s)$ in $(6)$.

In other words, if we build the policy $\pi_{i+1}$ by picking, at every state, the action $a$ that maximizes $Q^{\pi_i}(s, a)$, the new policy is guaranteed to be better than (or equal to) the previous policy.
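This guarantee is easy to check empirically on a random MDP. The sketch below (all names are my own, and exact policy evaluation is used for convenience) performs one improvement step and verifies that the value did not decrease at any state.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
R = rng.normal(size=(S, A))                      # random rewards R[s, a]
P = rng.dirichlet(np.ones(S), size=(S, A))       # P[s, a] is a distribution over s'

def evaluate(policy):
    """Exact V^pi for a deterministic policy (vector of actions)."""
    R_pi = R[np.arange(S), policy]
    P_pi = P[np.arange(S), policy]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

pi_old = rng.integers(A, size=S)                 # arbitrary deterministic policy
V_old = evaluate(pi_old)
pi_new = (R + gamma * P @ V_old).argmax(axis=1)  # one policy-improvement step
V_new = evaluate(pi_new)
print(np.all(V_new >= V_old - 1e-12))            # True: V^{pi_{i+1}} >= V^{pi_i} everywhere
```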

Getting Optimal policy with Value Iteration

In the Value-Iteration method, we iteratively update the state-value function itself.

Since the values are computed by backing them up over multiple timesteps, the update can be written as a Bellman equation.

$$V(s) = R(s) + \gamma \sum_{s' \in S} P(s' \mid s) \, V(s') \qquad (7)$$

Let's define the Bellman backup operator $B$ as follows:

$$B \, V(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a) \, V(s') \right]$$

Then, applying this backup repeatedly, $V_{k+1} = B \, V_k$ (the action-dependent form of $(7)$), we can make an algorithm that calculates the state-value function:

// Pseudo code for the Value Iteration algorithm
1. Initialize the state-value function to zero: V_0(s) = 0 for all s
2. Calculate the next state-value function with the Bellman backup: V_{k+1} = B V_k
3. Iterate step 2 until the state-value function converges.
4. Extract the policy from the final state-value function.

Extracting policy from final state-value function

If we get the final state-value function, we can extract the policy from it!

$$\pi(s) = \underset{a}{\operatorname{argmax}} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \, V_{final}(s') \right]$$
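Putting the backup loop and the extraction step together, a minimal NumPy sketch might look like the following, again assuming the tabular R[s, a] and P[s, a, s'] arrays used in the earlier sketches; the tolerance-based stopping rule is my own choice.

```python
import numpy as np

def value_iteration(R, P, gamma=0.9, tol=1e-8):
    """Tabular value iteration. R: (S, A) rewards, P: (S, A, S) transitions."""
    S, A = R.shape
    V = np.zeros(S)                              # V_0(s) = 0 for all s
    while True:
        Q = R + gamma * P @ V                    # Bellman backup, shape (S, A)
        V_new = Q.max(axis=1)                    # (B V)(s) = max_a Q(s, a)
        diff = np.max(np.abs(V_new - V))         # sup-norm distance ||V_new - V||
        V = V_new
        if diff < tol:                           # converged
            break
    policy = (R + gamma * P @ V).argmax(axis=1)  # extract the greedy policy
    return policy, V
```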

Proof that state-value function will converge in Value Iteration method

How do we know that iterating the Bellman backup will make the state-value function converge to some point?

This can be proved as follows:

Let's define a notion of distance between state-value functions:

$$\left\| V - V' \right\| = \max_s \left| V(s) - V'(s) \right|$$

Our goal is to prove the following:

$$\left\| B V_j - B V_k \right\| \le \gamma \left\| V_j - V_k \right\|$$

Contraction Operator

Let $O$ be an operator, and let $\left\| x \right\|$ denote a norm of $x$.

If $\left\| O V - O V' \right\| \le \alpha \left\| V - V' \right\|$ for some constant $0 \le \alpha < 1$, then $O$ is a contraction operator.

$$
\begin{aligned}
\left\| B V_k - B V_j \right\| &= \max_s \left| B V_k(s) - B V_j(s) \right| \\
&= \max_s \left| \max_a \left\{ R(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a) V_k(s') \right\} - \max_{a'} \left\{ R(s, a') + \gamma \sum_{s' \in S} p(s' \mid s, a') V_j(s') \right\} \right| \\
&\le \max_s \max_a \left| R(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a) V_k(s') - R(s, a) - \gamma \sum_{s' \in S} p(s' \mid s, a) V_j(s') \right| \\
&= \max_{s, a} \, \gamma \left| \sum_{s' \in S} p(s' \mid s, a) \left( V_k(s') - V_j(s') \right) \right| \\
&\le \max_{s, a} \, \gamma \sum_{s' \in S} p(s' \mid s, a) \left\| V_k - V_j \right\| \\
&= \gamma \left\| V_k - V_j \right\|
\end{aligned}
$$

The third line uses $\left| \max_a f(a) - \max_{a'} g(a') \right| \le \max_a \left| f(a) - g(a) \right|$, and the last equality holds because the transition probabilities sum to one.
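As a quick numerical sanity check (not a substitute for the proof), one can verify the $\gamma$-contraction property on a random MDP and arbitrary value vectors; the arrays and the helper function B below are assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 5, 3, 0.9
R = rng.normal(size=(S, A))                      # random rewards R[s, a]
P = rng.dirichlet(np.ones(S), size=(S, A))       # P[s, a] is a distribution over s'

def B(V):
    """Bellman backup: (B V)(s) = max_a [R(s, a) + gamma * sum_s' p(s'|s, a) V(s')]."""
    return (R + gamma * P @ V).max(axis=1)

V_j, V_k = rng.normal(size=S), rng.normal(size=S)
lhs = np.max(np.abs(B(V_k) - B(V_j)))            # ||B V_k - B V_j||
rhs = gamma * np.max(np.abs(V_k - V_j))          # gamma * ||V_k - V_j||
print(lhs <= rhs + 1e-12)                        # True: B is a gamma-contraction
```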

As a result, every time we apply the Bellman backup operator, the distance between two state-value functions shrinks by at least a factor of $\gamma < 1$. This means the Value Iteration method makes the state-value function $V$ converge.
