Reinforcement Learning/Markov Decision Process


A Markov Decision Process (MDP) is a Markov chain + a reward function + actions.

The Markov decision process is reduced to a Markov reward process by choosing a "policy" $\pi$ that specifies the action taken given the state, $a = \pi(s)$.

Definition

A Markov decision process is a 4-tuple $(S, A, P_a, R_a)$, where

  • $S$ is a finite set of states,
  • $A$ is a finite set of actions (alternatively, $A_s$ is the finite set of actions available from state $s$),
  • $P_a(s, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$ is the probability that action $a$ in state $s$ at time $t$ will lead to state $s'$ at time $t+1$,
  • $R_a(s, s')$ is the immediate reward (or expected immediate reward) received after transitioning from state $s$ to state $s'$, due to action $a$.

(Note: The theory of Markov decision processes does not require that $S$ or $A$ be finite, but the basic algorithms below assume that they are finite.)
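
As a concrete illustration of this 4-tuple, the following Python sketch writes down a hypothetical two-state, two-action MDP directly as data structures. The state names, action names, probabilities, and rewards are made up for illustration only.

  # A minimal sketch of the 4-tuple (S, A, P_a, R_a) for a toy two-state MDP.
  # All names and numbers below are illustrative, not taken from the text.

  S = ["s0", "s1"]              # finite set of states
  A = ["stay", "move"]          # finite set of actions

  # P[a][s][s'] = probability that action a taken in state s leads to state s'
  P = {
      "stay": {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s0": 0.1, "s1": 0.9}},
      "move": {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.8, "s1": 0.2}},
  }

  # R[a][s][s'] = immediate reward received after transitioning from s to s' under action a
  R = {
      "stay": {"s0": {"s0": 0.0, "s1": 0.0}, "s1": {"s0": 0.0, "s1": 1.0}},
      "move": {"s0": {"s0": 0.0, "s1": 1.0}, "s1": {"s0": 0.0, "s1": 0.0}},
  }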

Policy Specification

A policy $\pi$ is a function that specifies the action $\pi(s)$ that the decision maker will choose when it is in state $s$.

Once a Markov decision process is combined with a policy, this fixes the action for each state, and the resulting combination behaves like a Markov chain: the state transitions follow the fixed probabilities $P_{\pi(s)}(s, s')$.
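
As a sketch of this reduction, the snippet below (reusing the hypothetical S, P, and R from the earlier sketch) fixes a deterministic policy and collapses the action-indexed transition probabilities into an ordinary Markov chain with per-state expected rewards.

  # Fixing a policy pi: state -> action collapses the MDP into a Markov reward process.
  # S, P, and R are the hypothetical structures from the previous sketch.

  pi = {"s0": "move", "s1": "stay"}   # an arbitrary deterministic policy

  # Transition probabilities of the induced Markov chain: P_pi(s, s') = P_{pi(s)}(s, s')
  P_pi = {s: P[pi[s]][s] for s in S}

  # Expected one-step reward in each state under the policy:
  # r_pi(s) = sum over s' of P_{pi(s)}(s, s') * R_{pi(s)}(s, s')
  r_pi = {s: sum(P[pi[s]][s][s2] * R[pi[s]][s][s2] for s2 in S) for s in S}

  print(P_pi)   # Markov chain transition probabilities
  print(r_pi)   # Markov reward process rewards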


The goal is to choose a policy $\pi$ that maximizes some cumulative function of the random rewards.

Typically the expected cumulative reward is a discounted sum over a potentially infinite horizon:

   $E\left[\sum_{t=0}^{\infty} \gamma^{t} R_{a_t}(s_t, s_{t+1})\right]$ (where we choose $a_t = \pi(s_t)$, i.e. actions given by the policy), and the expectation is taken over the random state transitions $s_{t+1} \sim P_{a_t}(s_t, \cdot)$,

where $\gamma$ is the discount factor satisfying $0 \le \gamma \le 1$, which is usually close to 1 (for example, $\gamma = 1/(1+r)$ for some discount rate $r$).
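
As a sketch, this expected discounted return can be estimated by Monte Carlo simulation, truncating the infinite horizon at a finite number of steps (a good approximation when $\gamma < 1$). The snippet below reuses the hypothetical S, P, R, and pi from the earlier sketches.

  import random

  # Estimate the expected discounted return of policy pi starting from a given state,
  # by simulating trajectories and truncating the infinite horizon at T steps.
  def discounted_return(start, gamma=0.95, T=200):
      s, total = start, 0.0
      for t in range(T):
          a = pi[s]                                               # a_t = pi(s_t)
          s_next = random.choices(S, weights=[P[a][s][s2] for s2 in S])[0]
          total += (gamma ** t) * R[a][s][s_next]                 # gamma^t * R_{a_t}(s_t, s_{t+1})
          s = s_next
      return total

  # Average over many simulated trajectories to approximate the expectation
  estimate = sum(discounted_return("s0") for _ in range(1000)) / 1000
  print(estimate)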

Because of the Markov property, the optimal policy for this particular problem can indeed be written as a function of the current state $s$ only, as assumed above.

The discount factor motivates the decision maker to favor taking actions early, rather than postponing them indefinitely.
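
For instance, with $\gamma = 0.9$, a unit reward received at time $t = 10$ contributes only $0.9^{10} \approx 0.35$ to the discounted sum, whereas the same reward received at $t = 0$ contributes its full value.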