Policy-Improvement Algorithm

Summary

For this post, we will assume the policy $\pi$ is deterministic.

To find an optimal policy, the key is to find the true state-value function.

There are two methods to achieve the goal.

  1. Policy-Iteration

  2. Value-Iteration

Policy Iteration

  1. Randomly initialize the policy, and initialize the state-value function to zero.

  2. Calculate an improved policy from the current state-value function.

  3. Apply the new policy and recalculate the state-value function.

  4. Repeat steps 2 and 3 until the policy no longer changes.

Value Iteration

  1. Update the state-value function using the current state-value function (a Bellman backup).

  2. Iterate step 1 until the state-value function converges.

  3. Calculate the policy from the final state-value function.

Policy Improvement with Policy-Iteration

Let's define the state-action value function $Q(s, a)$:

$$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a) \, V^\pi(s')$$

The state-action value function gives the expected return when we take action $a$ in state $s$ and then follow the policy $\pi$.
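As a small concrete illustration (not from the original post), here is how $Q^\pi$ could be computed for a tabular MDP stored as NumPy arrays. The names R, P, V, and gamma are assumptions for this sketch: R[s, a] holds rewards, P[s, a, s'] holds transition probabilities, and V is the current estimate of $V^\pi$.

```python
import numpy as np

# Hypothetical tabular MDP with 4 states and 2 actions (placeholder values).
S, A, gamma = 4, 2, 0.9
R = np.zeros((S, A))                 # R[s, a]
P = np.full((S, A, S), 1.0 / S)      # P[s, a, s'], each row sums to 1
V = np.zeros(S)                      # current estimate of V^pi

# Q[s, a] = R(s, a) + gamma * sum_{s'} p(s' | s, a) * V(s')
Q = R + gamma * P @ V                # shape (S, A)
```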

To compute the improved policy at a given state $s$, we calculate the following:

$$\pi_{i+1}(s) = \underset{a}{\operatorname{argmax}} \, Q^{\pi_i}(s, a)$$

The policy is updated greedily: at each state we pick the action with the largest state-action value under the current value estimate.

// Pseudo code for the policy improvement step
1. policy_{i+1} = policy_{i}                        // copy the previous policy

2. policy_{i+1}(s) = argmax_{a} Q^{policy_i}(s, a)  // partially update the policy at state s

// Repeating 1. and 2. (re-evaluating V in between) keeps improving the policy
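For a runnable version of the whole loop (policy evaluation followed by the improvement step above), here is a minimal NumPy sketch under the same assumed tabular representation; the function name policy_iteration and the exact linear-system evaluation are my own choices, not something specified in this post.

```python
import numpy as np

def policy_iteration(R, P, gamma=0.9):
    """Tabular policy iteration. R: (S, A) rewards, P: (S, A, S) transitions."""
    S, A = R.shape
    policy = np.zeros(S, dtype=int)              # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi V exactly.
        R_pi = R[np.arange(S), policy]           # shape (S,)
        P_pi = P[np.arange(S), policy]           # shape (S, S)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Policy improvement: greedy with respect to Q^pi.
        Q = R + gamma * P @ V                    # shape (S, A)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):   # policy stopped changing
            return policy, V
        policy = new_policy
```

Here the evaluation step solves the Bellman equation for the current policy exactly; iterating the evaluation until it stabilizes, as in the summary above, would work just as well.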

Proof of Policy-iteration algorithm

So we are going to keep improving the policy $\pi_i$.

But how can we ensure that partially updating the policy will keep improving the overall value?

Definition of the term "Policy is improved"

$$V^{\pi_1} \ge V^{\pi_2} : \quad V^{\pi_1}(s) \ge V^{\pi_2}(s) \quad \forall s$$

We say the policy is improved if the above inequality is satisfied for every state.

Proof

Recall that we assume the policy is deterministic.

$$
\begin{aligned}
V^{\pi_i}(s) &= R(s, \pi_i(s)) + \gamma \sum_{s' \in S} p(s' \mid s, \pi_i(s)) \, V^{\pi_i}(s') \qquad (1) \\
&\le \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a) \, V^{\pi_i}(s') \right] \qquad (2) \\
&= \max_a \, Q^{\pi_i}(s, a) \qquad (3) \\
&= R(s, \pi_{i+1}(s)) + \gamma \sum_{s' \in S} p(s' \mid s, \pi_{i+1}(s)) \, V^{\pi_i}(s') \qquad (4) \\
&\le R(s, \pi_{i+1}(s)) + \gamma \sum_{s' \in S} p(s' \mid s, \pi_{i+1}(s)) \, \max_{a'} \left[ R(s', a') + \gamma \sum_{s'' \in S} p(s'' \mid s', a') \, V^{\pi_i}(s'') \right] \qquad (5) \\
&= R(s, \pi_{i+1}(s)) + \gamma \sum_{s' \in S} p(s' \mid s, \pi_{i+1}(s)) \, \max_{a'} \, Q^{\pi_i}(s', a') \\
&\;\; \vdots \\
&\le V^{\pi_{i+1}}(s) \qquad (6)
\end{aligned}
$$

From $(1)$ to $(2)$, we pick the action $a$ that maximizes the value.

We can see that $(2)$ is, by definition, exactly $\max_a Q^{\pi_i}(s, a)$, which is $(3)$.

In $(4)$, we define the new policy $\pi_{i+1}(s)$ as the action that attains the maximum in $(3)$. Note that all future actions are still chosen by $\pi_i$.

In $(5)$ we apply the same bound again at $s'$. Repeating this expansion replaces every $\pi_i$ by $\pi_{i+1}$, which gives $V^{\pi_{i+1}}(s)$ in $(6)$.

In other words, if we build the policy $\pi_{i+1}$ by picking, at every state, the action $a$ that maximizes $Q^{\pi_i}(s, a)$, the new policy is guaranteed to be better than (or equal to) the previous policy.
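This guarantee is easy to check empirically on a random MDP. The sketch below (all names are my own, and exact policy evaluation is used for convenience) performs one improvement step and verifies that the value did not decrease at any state.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
R = rng.normal(size=(S, A))                      # random rewards R[s, a]
P = rng.dirichlet(np.ones(S), size=(S, A))       # P[s, a] is a distribution over s'

def evaluate(policy):
    """Exact V^pi for a deterministic policy (vector of actions)."""
    R_pi = R[np.arange(S), policy]
    P_pi = P[np.arange(S), policy]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

pi_old = rng.integers(A, size=S)                 # arbitrary deterministic policy
V_old = evaluate(pi_old)
pi_new = (R + gamma * P @ V_old).argmax(axis=1)  # one policy-improvement step
V_new = evaluate(pi_new)
print(np.all(V_new >= V_old - 1e-12))            # True: V^{pi_{i+1}} >= V^{pi_i} everywhere
```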

Getting Optimal policy with Value Iteration

In the Value-Iteration method, we iteratively update the state-value function itself.

Since the values are computed by backing them up over multiple timesteps, the update can be written as a Bellman equation.

$$V(s) = R(s) + \gamma \sum_{s' \in S} P(s' \mid s) \, V(s') \qquad (7)$$

Let's define the Bellman backup operator $B$ as follows:

$$B \, V(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a) \, V(s') \right]$$

Then, applying this backup repeatedly, $V_{k+1} = B \, V_k$ (the action-dependent form of $(7)$), we can make an algorithm that calculates the state-value function:

// Pseudo code for the Value Iteration algorithm
1. Initialize the state-value function to zero: V_0(s) = 0 for all s
2. Calculate the next state-value function with the Bellman backup: V_{k+1} = B V_k
3. Iterate step 2 until the state-value function converges.
4. Extract the policy from the final state-value function.

Extracting policy from final state-value function

If we get the final state-value function, we can extract the policy from it!

$$\pi(s) = \underset{a}{\operatorname{argmax}} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \, V_{final}(s') \right]$$
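Putting the backup loop and the extraction step together, a minimal NumPy sketch might look like the following, again assuming the tabular R[s, a] and P[s, a, s'] arrays used in the earlier sketches; the tolerance-based stopping rule is my own choice.

```python
import numpy as np

def value_iteration(R, P, gamma=0.9, tol=1e-8):
    """Tabular value iteration. R: (S, A) rewards, P: (S, A, S) transitions."""
    S, A = R.shape
    V = np.zeros(S)                              # V_0(s) = 0 for all s
    while True:
        Q = R + gamma * P @ V                    # Bellman backup, shape (S, A)
        V_new = Q.max(axis=1)                    # (B V)(s) = max_a Q(s, a)
        diff = np.max(np.abs(V_new - V))         # sup-norm distance ||V_new - V||
        V = V_new
        if diff < tol:                           # converged
            break
    policy = (R + gamma * P @ V).argmax(axis=1)  # extract the greedy policy
    return policy, V
```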

Proof that state-value function will converge in Value Iteration method

How do we know that iterating the Bellman backup will make the state-value function converge to some point?

This can be proved as follows:

Let's define a notion of distance between state-value functions:

$$\left\| V - V' \right\| = \max_s \left| V(s) - V'(s) \right|$$

Our goal is to prove the following:

$$\left\| B V_j - B V_k \right\| \le \gamma \left\| V_j - V_k \right\|$$

Contraction Operator

Let $O$ be an operator, and let $\left\| x \right\|$ denote a norm of $x$.

If $\left\| O V - O V' \right\| \le \alpha \left\| V - V' \right\|$ for some constant $0 \le \alpha < 1$, then $O$ is a contraction operator.

$$
\begin{aligned}
\left\| B V_k - B V_j \right\| &= \max_s \left| B V_k(s) - B V_j(s) \right| \\
&= \max_s \left| \max_a \left\{ R(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a) V_k(s') \right\} - \max_{a'} \left\{ R(s, a') + \gamma \sum_{s' \in S} p(s' \mid s, a') V_j(s') \right\} \right| \\
&\le \max_s \max_a \left| R(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a) V_k(s') - R(s, a) - \gamma \sum_{s' \in S} p(s' \mid s, a) V_j(s') \right| \\
&= \max_{s, a} \, \gamma \left| \sum_{s' \in S} p(s' \mid s, a) \left( V_k(s') - V_j(s') \right) \right| \\
&\le \max_{s, a} \, \gamma \sum_{s' \in S} p(s' \mid s, a) \left\| V_k - V_j \right\| \\
&= \gamma \left\| V_k - V_j \right\|
\end{aligned}
$$

The third line uses $\left| \max_a f(a) - \max_{a'} g(a') \right| \le \max_a \left| f(a) - g(a) \right|$, and the last equality holds because the transition probabilities sum to one.
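As a quick numerical sanity check (not a substitute for the proof), one can verify the $\gamma$-contraction property on a random MDP and arbitrary value vectors; the arrays and the helper function B below are assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 5, 3, 0.9
R = rng.normal(size=(S, A))                      # random rewards R[s, a]
P = rng.dirichlet(np.ones(S), size=(S, A))       # P[s, a] is a distribution over s'

def B(V):
    """Bellman backup: (B V)(s) = max_a [R(s, a) + gamma * sum_s' p(s'|s, a) V(s')]."""
    return (R + gamma * P @ V).max(axis=1)

V_j, V_k = rng.normal(size=S), rng.normal(size=S)
lhs = np.max(np.abs(B(V_k) - B(V_j)))            # ||B V_k - B V_j||
rhs = gamma * np.max(np.abs(V_k - V_j))          # gamma * ||V_k - V_j||
print(lhs <= rhs + 1e-12)                        # True: B is a gamma-contraction
```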

As a result, every time we apply the Bellman backup operator, the distance between two state-value functions shrinks by at least a factor of $\gamma < 1$. This means the Value Iteration method makes the state-value function $V$ converge.
