Markov

Markov Assumption

Let h_t be a sufficient statistic of the history.

Then we say a state s_t is Markov if and only if the probability distribution of the next state s_{t+1} depends only on the current state s_t:

p(s_{t+1}|s_t)=p(s_{t+1}|h_t)

This Markov assumption keeps the model simple.

Markov Chain

Model without decisions or rewards

A Markov chain gives the probability of the next state given the current state. It can be represented as a transition matrix, where the i-th row is the distribution over next states given s_i:

P = \begin{bmatrix} p(s_1|s_1) & p(s_2|s_1) & \dots & p(s_n|s_1) \\ p(s_1|s_2) & p(s_2|s_2) & \dots & p(s_n|s_2) \\ \vdots & \vdots & \ddots & \vdots \\ p(s_1|s_n) & p(s_2|s_n) & \dots & p(s_n|s_n) \end{bmatrix}
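
To make the matrix concrete, here is a minimal Python/NumPy sketch of a hypothetical 3-state Markov chain (the states and probabilities are made up for illustration): each row of P is the distribution over next states, so we can sample a trajectory or propagate a state distribution with mu @ P.

```python
import numpy as np

# Hypothetical 3-state Markov chain: row i of P is p(s' | s_i), so each row sums to 1.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.6, 0.2],
    [0.0, 0.3, 0.7],
])

rng = np.random.default_rng(0)

def sample_trajectory(P, s0, steps):
    """Sample a trajectory by repeatedly drawing the next state from the current state's row."""
    states = [s0]
    for _ in range(steps):
        states.append(int(rng.choice(len(P), p=P[states[-1]])))
    return states

print(sample_trajectory(P, s0=0, steps=10))

# A distribution over states evolves as mu_{t+1} = mu_t @ P.
mu = np.array([1.0, 0.0, 0.0])
for _ in range(50):
    mu = mu @ P
print(mu)  # converges toward the chain's stationary distribution
```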

Markov Reward Process

Markov Chain with rewards

It is defined with the following components:

  • S is a finite set of states
  • P is the dynamics/transition model, specifying p(s_{t+1}|s_t)
  • R(s) is the reward function for the current state s

We should also account for the rewards of future states, so we introduce the state-value function V(s) for the MRP:

V(s)=E[r_t+\gamma \cdot r_{t+1} + \gamma^2 \cdot r_{t+2} + ...]

where r_t is the immediate reward received at timestep t and γ is the discount factor. Expanding the expectation one step gives the recursive (Bellman) form:

V(s) = R(s)+\gamma \sum_{s' \in S}{p(s'|s)\cdot V(s')}

Calculating the State-Value Function with an Algorithm

We can calculate the state-value function by iterative computation: we keep updating the state-value function V(s) (or, viewed as a state-value vector V) with dynamic programming.
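
As a sketch of this dynamic-programming loop (not from the original post, and with made-up numbers), the snippet below evaluates V for a small hypothetical MRP by iterating V ← R + γ·P·V until the change is negligible; for a problem this small, the fixed point can also be obtained by solving the linear Bellman equation directly.

```python
import numpy as np

# Hypothetical MRP: row-stochastic dynamics P, per-state reward R(s), discount gamma.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.6, 0.2],
    [0.0, 0.3, 0.7],
])
R = np.array([0.0, 1.0, 10.0])
gamma = 0.9

def mrp_value(P, R, gamma, tol=1e-8):
    """Iterate the Bellman update V <- R + gamma * P @ V until it converges."""
    V = np.zeros(len(R))
    while True:
        V_new = R + gamma * P @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

print(mrp_value(P, R, gamma))

# Equivalent closed form for a small MRP: V = (I - gamma * P)^{-1} R.
print(np.linalg.solve(np.eye(len(R)) - gamma * P, R))
```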

Markov Decision Process

MRP with a decision process (actions)

It is defined with the following components:

  • S is a finite set of states
  • A is a set of actions
  • P is the dynamics model for each action: P(s'|s, a)
  • R is the reward function: R(s, a)
  • Policy π

In an MDP, we can define a policy as follows. A policy π is a function that returns which action to take for a given state s.

A deterministic policy returns a single action a for the current state s:

a = \pi(s)

A stochastic policy returns a probability distribution over the action space:

\pi(a|s)
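
One possible way to hold these components in code (a sketch with made-up numbers, not the post's own implementation): store the dynamics as an array indexed by action, state, and next state, the reward as a state-by-action table, and the policy either as a state-to-action vector (deterministic) or as a per-state distribution over actions (stochastic).

```python
import numpy as np

n_states, n_actions = 3, 2

# Hypothetical dynamics: P[a, s, s_next] = P(s' | s, a); each P[a, s] row sums to 1.
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[0.9, 0.1, 0.0], [0.2, 0.6, 0.2], [0.0, 0.3, 0.7]]
P[1] = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]

# Hypothetical rewards: R[s, a] = R(s, a).
R = np.array([[0.0, 0.5],
              [1.0, 0.0],
              [10.0, 2.0]])

# Deterministic policy: one action per state, a = pi(s).
pi_det = np.array([1, 0, 0])

# Stochastic policy: pi(a | s), one distribution over actions per state.
pi_stoch = np.array([[0.3, 0.7],
                     [1.0, 0.0],
                     [0.5, 0.5]])
```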

MDP + π = MRP

We can view MDP + π as an MRP, because π gives the information needed to obtain P(s'|s):

R^{\pi}(s)=\sum_{a\in A}\pi(a|s)R(s, a)

P^{\pi}(s'|s)=\sum_{a\in A}\pi(a|s) P(s'|s, a)

Since an MDP with a policy can be viewed as an MRP, we can define a state-value function:

V(s) = \sum_{a \in A} \pi(a|s) \left[ R(s, a) + \gamma \sum_{s' \in S}{p(s'|s, a)V(s')} \right] \ ...(1)

For a deterministic policy, this takes a simpler form that we can turn into an iterative update:

V_{k+1}^\pi(s)= R(s, \pi(s))+\gamma \sum_{s' \in S} p(s'|s, \pi(s))V_k^\pi(s') \ ...(2)

By iterating equation (2) until it converges, we can get the value function for every state.
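
To tie the pieces together, here is a minimal sketch (reusing the same made-up MDP arrays as in the sketch above) that builds the induced MRP quantities R^π and P^π from a stochastic policy and then runs iterative policy evaluation; its fixed point is the V defined by equations (1) and (2).

```python
import numpy as np

# Hypothetical MDP: P[a, s, s_next] = P(s'|s, a), R[s, a] = R(s, a), discount gamma.
P = np.zeros((2, 3, 3))
P[0] = [[0.9, 0.1, 0.0], [0.2, 0.6, 0.2], [0.0, 0.3, 0.7]]
P[1] = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]
R = np.array([[0.0, 0.5], [1.0, 0.0], [10.0, 2.0]])
gamma = 0.9

# Stochastic policy pi(a|s), one row per state.
pi = np.array([[0.3, 0.7],
               [1.0, 0.0],
               [0.5, 0.5]])

# Induced MRP: R_pi(s) = sum_a pi(a|s) R(s,a), P_pi(s'|s) = sum_a pi(a|s) P(s'|s,a).
R_pi = (pi * R).sum(axis=1)
P_pi = np.einsum('sa,asn->sn', pi, P)

# Iterative policy evaluation: V <- R_pi + gamma * P_pi @ V until convergence.
V = np.zeros(len(R_pi))
while True:
    V_new = R_pi + gamma * P_pi @ V
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
print(V_new)
```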

References

[1] https://www.youtube.com/watch?v=WsvFL-LjA6U
[2] https://www.youtube.com/watch?v=gHdsUUGcBC0
[3] https://web.stanford.edu/class/cs234/modules.html