Reinforcement Learning/Markov Decision Process
A Markov Decision Process (MDP) is a Markov Chain + reward function + actions.
The Markov decision process is reduced to a Markov reward process by choosing a "policy" $\pi(s)$ that specifies the action taken given the state $s$.
Definition
A Markov decision process is a 4-tuple $(S, A, P_a, R_a)$, where
- $S$ is a finite set of states,
- $A$ is a finite set of actions (alternatively, $A_s$ is the finite set of actions available from state $s$),
- $P_a(s, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$ is the probability that action $a$ in state $s$ at time $t$ will lead to state $s'$ at time $t+1$,
- $R_a(s, s')$ is the immediate reward (or expected immediate reward) received after transitioning from state $s$ to state $s'$, due to action $a$.
(Note: The theory of Markov decision processes does not state that $S$ or $A$ are finite, but the basic algorithms below assume that they are finite.)
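As a concrete illustration, here is a minimal sketch of a small finite MDP encoded as plain Python data. The two-state, two-action example and the names `states`, `actions`, `P`, and `R` are hypothetical, chosen only to show the shape of the four components.

```python
# A minimal sketch of a finite MDP (S, A, P_a, R_a) as plain Python data.
# The two-state, two-action example is hypothetical.

states = ["s0", "s1"]          # S: finite set of states
actions = ["stay", "move"]     # A: finite set of actions

# P[a][s][s'] = probability that action a taken in state s leads to state s'
P = {
    "stay": {"s0": {"s0": 0.9, "s1": 0.1},
             "s1": {"s0": 0.1, "s1": 0.9}},
    "move": {"s0": {"s0": 0.2, "s1": 0.8},
             "s1": {"s0": 0.8, "s1": 0.2}},
}

# R[a][s][s'] = immediate reward received after moving from s to s' under action a
R = {
    "stay": {"s0": {"s0": 0.0, "s1": 0.0},
             "s1": {"s0": 0.0, "s1": 1.0}},
    "move": {"s0": {"s0": 0.0, "s1": 1.0},
             "s1": {"s0": 0.0, "s1": 0.0}},
}

# Sanity check: each row of P must be a probability distribution over S.
for a in actions:
    for s in states:
        assert abs(sum(P[a][s][s2] for s2 in states) - 1.0) < 1e-9
```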
Policy Specification
A policy $\pi$ is a function that specifies the action $\pi(s)$ that the decision maker will choose when it is in state $s$.
Once a Markov decision process is combined with a policy, this fixes the action for each state and the resulting combination behaves like a Markov chain.
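This reduction can be made concrete: once a deterministic policy assigns an action to every state, the action-indexed transition probabilities collapse into a single transition matrix. The sketch below reuses the hypothetical `states`, `P`, and `R` from the previous example; the particular policy shown is an arbitrary choice.

```python
# A minimal sketch: fixing a policy pi(s) turns the MDP into a Markov chain
# (with rewards), because the action in each state is no longer a free choice.
# Reuses the hypothetical `states`, `P`, `R` defined in the earlier sketch.

policy = {"s0": "move", "s1": "stay"}   # pi: state -> action (arbitrary choice)

# Transition matrix of the induced Markov chain: P_pi[s][s'] = P[pi(s)][s][s']
P_pi = {s: P[policy[s]][s] for s in states}

# Expected one-step reward in each state under the policy:
# r_pi(s) = sum_{s'} P[pi(s)][s][s'] * R[pi(s)][s][s']
r_pi = {s: sum(P[policy[s]][s][s2] * R[policy[s]][s][s2] for s2 in states)
        for s in states}

print(P_pi)   # behaves like an ordinary Markov chain transition matrix
print(r_pi)
```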
The goal is to choose a policy $\pi$ that will maximize some cumulative function of the random rewards.
Typically, the expected cumulative reward is a discounted sum over a potentially infinite horizon:
- $E\left[\sum_{t=0}^{\infty} \gamma^t R_{a_t}(s_t, s_{t+1})\right]$ (where we choose $a_t = \pi(s_t)$, i.e. actions given by the policy). And the expectation is taken over $s_{t+1} \sim P_{a_t}(s_t, s_{t+1})$,
where $\gamma$ is the discount factor satisfying $0 \le \gamma \le 1$, which is usually close to 1. (For example, $\gamma = 1/(1+r)$ for some discount rate $r$.)
Because of the Markov property, the optimal policy for this particular problem can indeed be written as a function of $s$ only, as assumed above.
The discount factor motivates the decision maker to favor taking actions early, rather than postponing them indefinitely.
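To tie the pieces together, here is a hedged sketch of policy evaluation: computing the expected discounted cumulative reward of a fixed policy by repeatedly applying the Bellman expectation backup $V(s) \leftarrow r_\pi(s) + \gamma \sum_{s'} P_\pi(s, s') V(s')$ until it converges. It again reuses the hypothetical `states`, `P_pi`, and `r_pi` from the sketches above; the discount factor 0.9 is an arbitrary choice.

```python
# A minimal sketch of policy evaluation for the induced Markov chain:
# V(s) = E[ sum_t gamma^t * R_{a_t}(s_t, s_{t+1}) | s_0 = s, a_t = pi(s_t) ].
# Reuses the hypothetical `states`, `P_pi`, `r_pi` from the sketches above.

gamma = 0.9                      # discount factor, 0 <= gamma <= 1 (arbitrary)
V = {s: 0.0 for s in states}     # initial value estimate

for _ in range(1000):            # iterate the Bellman expectation backup
    V_new = {s: r_pi[s] + gamma * sum(P_pi[s][s2] * V[s2] for s2 in states)
             for s in states}
    if max(abs(V_new[s] - V[s]) for s in states) < 1e-10:
        break
    V = V_new

print(V)   # expected discounted cumulative reward from each start state
```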