Markov

Markov Assumption

Let h_t be a sufficient statistic of the history.

Then we say a state s_t is Markov if and only if the probability distribution of the next state s_{t+1} depends only on the current state s_t:

p(s_{t+1}|s_t)=p(s_{t+1}|h_t)

This Markov assumption keeps the model simple.

Markov Chain

Model without decisions or rewards

A Markov chain gives the probability of the next state given the current state. It can be represented as a transition matrix, where the i-th row is the distribution over next states given s_i:

P = \begin{bmatrix} p(s_1|s_1) & p(s_2|s_1) & \dots & p(s_n|s_1) \\ p(s_1|s_2) & p(s_2|s_2) & \dots & p(s_n|s_2) \\ \vdots & \vdots & \ddots & \vdots \\ p(s_1|s_n) & p(s_2|s_n) & \dots & p(s_n|s_n) \end{bmatrix}
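
To make the matrix concrete, here is a minimal Python/NumPy sketch of a hypothetical 3-state Markov chain (the states and probabilities are made up for illustration): each row of P is the distribution over next states, so we can sample a trajectory or propagate a state distribution with mu @ P.

```python
import numpy as np

# Hypothetical 3-state Markov chain: row i of P is p(s' | s_i), so each row sums to 1.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.6, 0.2],
    [0.0, 0.3, 0.7],
])

rng = np.random.default_rng(0)

def sample_trajectory(P, s0, steps):
    """Sample a trajectory by repeatedly drawing the next state from the current state's row."""
    states = [s0]
    for _ in range(steps):
        states.append(int(rng.choice(len(P), p=P[states[-1]])))
    return states

print(sample_trajectory(P, s0=0, steps=10))

# A distribution over states evolves as mu_{t+1} = mu_t @ P.
mu = np.array([1.0, 0.0, 0.0])
for _ in range(50):
    mu = mu @ P
print(mu)  # converges toward the chain's stationary distribution
```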

Markov Reward Process

Markov Chain with rewards

It is defined with the following components:

  • S is a finite set of states
  • P is the dynamics/transition model, specifying p(s_{t+1}|s_t)
  • R(s) is the reward function for the current state s

We should also account for the rewards of future states, so we introduce the state-value function V(s) for the MRP:

V(s)=E[r_t+\gamma \cdot r_{t+1} + \gamma^2 \cdot r_{t+2} + ...]

where r_t is the immediate reward received at timestep t and γ is the discount factor. Expanding the expectation one step gives the recursive (Bellman) form:

V(s) = R(s)+\gamma \sum_{s' \in S}{p(s'|s)\cdot V(s')}

Calculating the State-Value Function with an Algorithm

We can calculate the state-value function by iterative computation: we keep updating the state-value function V(s) (or, viewed as a state-value vector V) with dynamic programming.
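
As a sketch of this dynamic-programming loop (not from the original post, and with made-up numbers), the snippet below evaluates V for a small hypothetical MRP by iterating V ← R + γ·P·V until the change is negligible; for a problem this small, the fixed point can also be obtained by solving the linear Bellman equation directly.

```python
import numpy as np

# Hypothetical MRP: row-stochastic dynamics P, per-state reward R(s), discount gamma.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.6, 0.2],
    [0.0, 0.3, 0.7],
])
R = np.array([0.0, 1.0, 10.0])
gamma = 0.9

def mrp_value(P, R, gamma, tol=1e-8):
    """Iterate the Bellman update V <- R + gamma * P @ V until it converges."""
    V = np.zeros(len(R))
    while True:
        V_new = R + gamma * P @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

print(mrp_value(P, R, gamma))

# Equivalent closed form for a small MRP: V = (I - gamma * P)^{-1} R.
print(np.linalg.solve(np.eye(len(R)) - gamma * P, R))
```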

Markov Decision Process

MRP with a decision process (actions)

It is defined with the following components:

  • S is a finite set of states
  • A is a set of actions
  • P is the dynamics model for each action: P(s'|s, a)
  • R is the reward function: R(s, a)
  • Policy π

In an MDP, we can define a policy as follows. A policy π is a function that returns which action to take for a given state s.

A deterministic policy returns a single action a for the current state s:

a = \pi(s)

A stochastic policy returns a probability distribution over the action space:

\pi(a|s)
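
One possible way to hold these components in code (a sketch with made-up numbers, not the post's own implementation): store the dynamics as an array indexed by action, state, and next state, the reward as a state-by-action table, and the policy either as a state-to-action vector (deterministic) or as a per-state distribution over actions (stochastic).

```python
import numpy as np

n_states, n_actions = 3, 2

# Hypothetical dynamics: P[a, s, s_next] = P(s' | s, a); each P[a, s] row sums to 1.
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[0.9, 0.1, 0.0], [0.2, 0.6, 0.2], [0.0, 0.3, 0.7]]
P[1] = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]

# Hypothetical rewards: R[s, a] = R(s, a).
R = np.array([[0.0, 0.5],
              [1.0, 0.0],
              [10.0, 2.0]])

# Deterministic policy: one action per state, a = pi(s).
pi_det = np.array([1, 0, 0])

# Stochastic policy: pi(a | s), one distribution over actions per state.
pi_stoch = np.array([[0.3, 0.7],
                     [1.0, 0.0],
                     [0.5, 0.5]])
```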

MDP + π = MRP

We can view MDP + π as an MRP, because π gives the information needed to obtain P(s'|s):

R^{\pi}(s)=\sum_{a\in A}\pi(a|s)R(s, a)

P^{\pi}(s'|s)=\sum_{a\in A}\pi(a|s) P(s'|s, a)

Since an MDP with a policy can be viewed as an MRP, we can define a state-value function:

V(s) = \sum_{a \in A} \pi(a|s) \left[ R(s, a) + \gamma \sum_{s' \in S}{p(s'|s, a)V(s')} \right] \ ...(1)

For a deterministic policy, this takes a simpler form that we can turn into an iterative update:

V_{k+1}^\pi(s)= R(s, \pi(s))+\gamma \sum_{s' \in S} p(s'|s, \pi(s))V_k^\pi(s') \ ...(2)

By iterating equation (2) until it converges, we can get the value function for every state.
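
To tie the pieces together, here is a minimal sketch (reusing the same made-up MDP arrays as in the sketch above) that builds the induced MRP quantities R^π and P^π from a stochastic policy and then runs iterative policy evaluation; its fixed point is the V defined by equations (1) and (2).

```python
import numpy as np

# Hypothetical MDP: P[a, s, s_next] = P(s'|s, a), R[s, a] = R(s, a), discount gamma.
P = np.zeros((2, 3, 3))
P[0] = [[0.9, 0.1, 0.0], [0.2, 0.6, 0.2], [0.0, 0.3, 0.7]]
P[1] = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]
R = np.array([[0.0, 0.5], [1.0, 0.0], [10.0, 2.0]])
gamma = 0.9

# Stochastic policy pi(a|s), one row per state.
pi = np.array([[0.3, 0.7],
               [1.0, 0.0],
               [0.5, 0.5]])

# Induced MRP: R_pi(s) = sum_a pi(a|s) R(s,a), P_pi(s'|s) = sum_a pi(a|s) P(s'|s,a).
R_pi = (pi * R).sum(axis=1)
P_pi = np.einsum('sa,asn->sn', pi, P)

# Iterative policy evaluation: V <- R_pi + gamma * P_pi @ V until convergence.
V = np.zeros(len(R_pi))
while True:
    V_new = R_pi + gamma * P_pi @ V
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
print(V_new)
```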

References

[1] https://www.youtube.com/watch?v=WsvFL-LjA6U
[2] https://www.youtube.com/watch?v=gHdsUUGcBC0
[3] https://web.stanford.edu/class/cs234/modules.html