Reinforcement Learning/Markov Decision Process


A Markov Decision Process (MDP) is a Markov chain + a reward function + actions.

The Markov decision process is reduced to a Markov reward process by choosing a "policy" $\pi$ that specifies the action taken given the state, $a = \pi(s)$.

Definition

A Markov decision process is a 4-tuple $(S, A, P_a, R_a)$, where

  • $S$ is a finite set of states,
  • $A$ is a finite set of actions (alternatively, $A_s$ is the finite set of actions available from state $s$),
  • $P_a(s, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$ is the probability that action $a$ in state $s$ at time $t$ will lead to state $s'$ at time $t+1$,
  • $R_a(s, s')$ is the immediate reward (or expected immediate reward) received after transitioning from state $s$ to state $s'$, due to action $a$.

(Note: The theory of Markov decision processes does not require that $S$ or $A$ be finite, but the basic algorithms below assume that they are finite.)
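
As a concrete illustration of this 4-tuple, the following Python sketch writes down a hypothetical two-state, two-action MDP directly as data structures. The state names, action names, probabilities, and rewards are made up for illustration only.

  # A minimal sketch of the 4-tuple (S, A, P_a, R_a) for a toy two-state MDP.
  # All names and numbers below are illustrative, not taken from the text.

  S = ["s0", "s1"]              # finite set of states
  A = ["stay", "move"]          # finite set of actions

  # P[a][s][s'] = probability that action a taken in state s leads to state s'
  P = {
      "stay": {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s0": 0.1, "s1": 0.9}},
      "move": {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.8, "s1": 0.2}},
  }

  # R[a][s][s'] = immediate reward received after transitioning from s to s' under action a
  R = {
      "stay": {"s0": {"s0": 0.0, "s1": 0.0}, "s1": {"s0": 0.0, "s1": 1.0}},
      "move": {"s0": {"s0": 0.0, "s1": 1.0}, "s1": {"s0": 0.0, "s1": 0.0}},
  }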

Policy Specification

A policy $\pi$ is a function that specifies the action $\pi(s)$ that the decision maker will choose when it is in state $s$.

Once a Markov decision process is combined with a policy, this fixes the action for each state, and the resulting combination behaves like a Markov chain: the state transitions follow the fixed probabilities $P_{\pi(s)}(s, s')$.
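
As a sketch of this reduction, the snippet below (reusing the hypothetical S, P, and R from the earlier sketch) fixes a deterministic policy and collapses the action-indexed transition probabilities into an ordinary Markov chain with per-state expected rewards.

  # Fixing a policy pi: state -> action collapses the MDP into a Markov reward process.
  # S, P, and R are the hypothetical structures from the previous sketch.

  pi = {"s0": "move", "s1": "stay"}   # an arbitrary deterministic policy

  # Transition probabilities of the induced Markov chain: P_pi(s, s') = P_{pi(s)}(s, s')
  P_pi = {s: P[pi[s]][s] for s in S}

  # Expected one-step reward in each state under the policy:
  # r_pi(s) = sum over s' of P_{pi(s)}(s, s') * R_{pi(s)}(s, s')
  r_pi = {s: sum(P[pi[s]][s][s2] * R[pi[s]][s][s2] for s2 in S) for s in S}

  print(P_pi)   # Markov chain transition probabilities
  print(r_pi)   # Markov reward process rewards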


The goal is to choose a policy $\pi$ that maximizes some cumulative function of the random rewards.

Typically the expected cumulative reward is a discounted sum over a potentially infinite horizon:

   $E\left[\sum_{t=0}^{\infty} \gamma^{t} R_{a_t}(s_t, s_{t+1})\right]$ (where we choose $a_t = \pi(s_t)$, i.e. actions given by the policy), and the expectation is taken over the random state transitions $s_{t+1} \sim P_{a_t}(s_t, \cdot)$,

where $\gamma$ is the discount factor satisfying $0 \le \gamma \le 1$, which is usually close to 1 (for example, $\gamma = 1/(1+r)$ for some discount rate $r$).
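
As a sketch, this expected discounted return can be estimated by Monte Carlo simulation, truncating the infinite horizon at a finite number of steps (a good approximation when $\gamma < 1$). The snippet below reuses the hypothetical S, P, R, and pi from the earlier sketches.

  import random

  # Estimate the expected discounted return of policy pi starting from a given state,
  # by simulating trajectories and truncating the infinite horizon at T steps.
  def discounted_return(start, gamma=0.95, T=200):
      s, total = start, 0.0
      for t in range(T):
          a = pi[s]                                               # a_t = pi(s_t)
          s_next = random.choices(S, weights=[P[a][s][s2] for s2 in S])[0]
          total += (gamma ** t) * R[a][s][s_next]                 # gamma^t * R_{a_t}(s_t, s_{t+1})
          s = s_next
      return total

  # Average over many simulated trajectories to approximate the expectation
  estimate = sum(discounted_return("s0") for _ in range(1000)) / 1000
  print(estimate)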

Because of the Markov property, the optimal policy for this particular problem can indeed be written as a function of the current state $s$ only, as assumed above.

The discount factor motivates the decision maker to favor taking actions early, rather than postponing them indefinitely.
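
For instance, with $\gamma = 0.9$, a unit reward received at time $t = 10$ contributes only $0.9^{10} \approx 0.35$ to the discounted sum, whereas the same reward received at $t = 0$ contributes its full value.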