Markov
Let the state $s_t$ be a sufficient statistic of the history $h_t = (s_1, s_2, \dots, s_t)$.
Then we say a state $s_t$ is Markov if and only if the probability distribution of the next state depends only on the current state:
$$P(s_{t+1} \mid s_t) = P(s_{t+1} \mid h_t)$$
This Markov assumption keeps the model simple: to predict the next state we only need the current state, not the whole history.
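As a small illustration of the assumption (the states and probabilities below are made up), a Markov simulator only ever needs the current state to sample the next one:

```python
import random

# Hypothetical two-state weather chain; the next-state distribution depends
# only on the current state, never on how we got there.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def step(state):
    """Sample the next state using the current state alone (Markov assumption)."""
    dist = TRANSITIONS[state]
    return random.choices(list(dist), weights=list(dist.values()))[0]

state, trajectory = "sunny", []
for _ in range(10):
    trajectory.append(state)
    state = step(state)
print(trajectory)
```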
Model without decisions or rewards
A Markov Chain specifies the probability of the next state given the current state, $P(s_{t+1} = s' \mid s_t = s)$. It can be represented as a transition matrix $P$, where entry $P_{ij}$ is the probability of moving from state $s_i$ to state $s_j$.
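As a small illustration (the chain below is made up), the transition matrix can be stored as a NumPy array, and a distribution over states is propagated by multiplying it with the matrix:

```python
import numpy as np

# Hypothetical 3-state Markov chain; row i holds P(s_{t+1} = j | s_t = i).
P = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.6, 0.2],
    [0.0, 0.3, 0.7],
])

# Distribution over states at time t, starting entirely in state 0.
mu = np.array([1.0, 0.0, 0.0])

# Propagate the distribution forward: mu_{t+1} = mu_t P.
for _ in range(5):
    mu = mu @ P
print(mu)  # distribution over states after 5 steps
```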
Markov Chain with rewards
A Markov Reward Process (MRP) is defined with the following components:
- $S$: a finite set of states.
- $P$: a dynamics/transition model that specifies $P(s_{t+1} = s' \mid s_t = s)$.
- $R$: a reward function for the current state $s$, $R(s) = \mathbb{E}[r_t \mid s_t = s]$.

We should also account for the rewards of future states, so we introduce the concept of a state-value function for an MRP:
$$V(s) = \mathbb{E}[G_t \mid s_t = s], \qquad G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$$
where $r_t$ is the immediate reward received at timestep $t$ and $\gamma \in [0, 1]$ is a discount factor.

We can calculate the state-value function with an iterative calculation. Using the Bellman equation for an MRP,
$$V(s) = R(s) + \gamma \sum_{s' \in S} P(s' \mid s)\, V(s'),$$
we keep updating the state-value function (or, viewed as a state-value vector, $V \leftarrow R + \gamma P V$) with dynamic programming.
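As a minimal sketch of this dynamic-programming calculation (the transition matrix and rewards below are made up), the Bellman backup $V \leftarrow R + \gamma P V$ can be repeated until the values stop changing:

```python
import numpy as np

def mrp_value(P, R, gamma=0.9, tol=1e-8):
    """Bellman backup V <- R + gamma * P @ V, repeated until convergence."""
    V = np.zeros(len(R))
    while True:
        V_new = R + gamma * P @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Hypothetical 3-state MRP: transition matrix and expected immediate rewards.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.6, 0.2],
    [0.0, 0.3, 0.7],
])
R = np.array([0.0, 1.0, 10.0])  # expected reward in each state
print(mrp_value(P, R))
```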
MRP with a decision process (actions)
A Markov Decision Process (MDP) is defined with the following components:
- $S$: a finite set of states.
- $A$: a finite set of actions.
- $P$: a dynamics model for each action, $P(s_{t+1} = s' \mid s_t = s, a_t = a)$.
- $R$: a reward function, $R(s, a) = \mathbb{E}[r_t \mid s_t = s, a_t = a]$.

In an MDP, we can define a policy as follows: a policy $\pi$ is a function that returns which action to take for a given state $s$.
- A deterministic policy returns a single action for the current state: $a = \pi(s)$.
- A stochastic policy returns a probability distribution over the action space: $\pi(a \mid s) = P(a_t = a \mid s_t = s)$.

Since an MDP combined with a policy can be viewed as an MRP (the policy $\pi$ gives the information of which action $a$ is taken in each state), we can define a state-value function. For a deterministic policy,
$$V^{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s))\, V^{\pi}(s').$$
We can write it in a more general form that also covers stochastic policies:
$$V^{\pi}(s) = \sum_{a \in A} \pi(a \mid s) \Big( R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^{\pi}(s') \Big)$$
By iterating this equation until it converges, we can get the value function for every state. A code sketch follows below.
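A minimal sketch of this policy evaluation, with made-up MDP arrays `P[s, a, s']`, `R[s, a]` and a stochastic policy `pi[s, a]`: the policy collapses the MDP into an induced MRP, and iterating the Bellman expectation backup converges to $V^{\pi}$.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Evaluate V^pi by viewing MDP + pi as an MRP and iterating the Bellman backup.

    P:  shape (S, A, S), P[s, a, s'] = P(s' | s, a)
    R:  shape (S, A),    R[s, a]     = expected immediate reward
    pi: shape (S, A),    pi[s, a]    = probability of taking action a in state s
    """
    # Induced MRP: P_pi[s, s'] = sum_a pi[s, a] P[s, a, s'],  R_pi[s] = sum_a pi[s, a] R[s, a]
    P_pi = np.einsum('sa,sat->st', pi, P)
    R_pi = np.einsum('sa,sa->s', pi, R)

    V = np.zeros(P.shape[0])
    while True:
        V_new = R_pi + gamma * P_pi @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Hypothetical 2-state, 2-action MDP.
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # transitions from state 0 under actions 0 and 1
    [[0.5, 0.5], [0.0, 1.0]],   # transitions from state 1 under actions 0 and 1
])
R = np.array([
    [0.0, 1.0],
    [2.0, 0.0],
])
pi = np.array([
    [0.5, 0.5],   # uniform random policy in state 0
    [1.0, 0.0],   # always take action 0 in state 1
])
print(policy_evaluation(P, R, pi))
```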