Reinforcement Learning/Policy iteration

Policy Iteration (PI) is an algorithm for finding the optimal policy of a Markov decision process (the MDP control problem).

Policy iteration is a model-based algorithm: it requires knowledge of the MDP's transition model and reward function.

The complexity of the algorithm is $O\bigl(n\,(|S|^2|A| + |S|^3)\bigr)$, where $n$ is the number of iterations needed for convergence ($|S|^3$ for exact policy evaluation and $|S|^2|A|$ for the improvement step in each iteration). Theoretically, the maximum number of iterations is $|A|^{|S|}$, since that is the number of distinct deterministic policies and policy iteration never revisits a policy.

The algorithm is guaranteed to converge to the globally optimal policy.

State-action value Q

The state-action value $Q^{\pi}(s, a)$ of a policy $\pi$ is calculated by taking the specified action $a$ immediately, then following the policy $\pi$:

$$Q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^{\pi}(s')$$

Here, $R(s, a)$ is the reward function in the MDP and $P(s' \mid s, a)$ is the transition model; $\gamma$ is the discount factor and $V^{\pi}(s') = Q^{\pi}\bigl(s', \pi(s')\bigr)$ is the value of continuing to follow $\pi$ from $s'$.
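
The formula maps directly onto array operations. A minimal NumPy sketch, assuming the MDP is stored as dense arrays (the array shapes and the function name are illustrative assumptions, not part of the original text):

```python
import numpy as np

def q_values(R, P, V, gamma):
    """Q^pi(s, a) = R(s, a) + gamma * sum_{s'} P(s'|s, a) * V^pi(s').

    R: (S, A) reward function, P: (S, A, S) transition model,
    V: (S,) state values V^pi of the policy being evaluated.
    """
    # P @ V sums over the trailing s' axis: the expected next-state value.
    return R + gamma * P @ V  # shape (S, A)
```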

Algorithm

  • Set $i = 0$
  • Initialize $\pi_0(s)$ randomly for all states $s$
  • While $i = 0$ or $\lVert \pi_i - \pi_{i-1} \rVert_1 > 0$ (the L1-norm measures whether the policy changed for any state):
    • Compute the state-action value of policy $\pi_i$: $Q^{\pi_i}(s, a)$ for all $s \in S$ and all $a \in A$
    • Compute the new policy $\pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s, a)$ for all $s$ by choosing the action that returns the maximum state-action value for each state
    • Increment $i$
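
A minimal NumPy sketch of this loop, assuming the MDP is given as dense arrays (the array layout, the function name, and the use of an exact linear solve for policy evaluation are illustrative choices, not prescribed by the text above):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Policy iteration for a finite MDP.

    P: (S, A, S) transition model, P[s, a, s2] = P(s2 | s, a)
    R: (S, A) reward function
    Returns the optimal deterministic policy (S,) and its value (S,).
    """
    S, A = R.shape
    pi = np.random.default_rng(0).integers(A, size=S)  # random pi_0
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi V exactly.
        P_pi = P[np.arange(S), pi]                     # (S, S) under pi
        R_pi = R[np.arange(S), pi]                     # (S,) under pi
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to Q^{pi_i}.
        Q = R + gamma * P @ V                          # (S, A)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):                 # policy unchanged
            return pi, V
        pi = new_pi
```

For example, on a hypothetical two-state, two-action MDP:

```python
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])  # P[s, a, s2]
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])                # R[s, a]
pi, V = policy_iteration(P, R, gamma=0.9)
```

The exact solve evaluates $V^{\pi_i}$ in $O(|S|^3)$; iterative evaluation sweeps could be substituted without changing the structure of the loop.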

Explanation

In each iteration, by definition of the greedy update we have, for every state $s$,

$$Q^{\pi_i}\bigl(s, \pi_{i+1}(s)\bigr) = \max_a Q^{\pi_i}(s, a) \;\ge\; Q^{\pi_i}\bigl(s, \pi_i(s)\bigr) = V^{\pi_i}(s)$$
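
Unrolling this inequality suggests why each new policy is at least as good as the last; a sketch of the standard policy-improvement argument in the notation above (the rigorous version is the subject of the proof below):

$$\begin{aligned}
V^{\pi_i}(s) &\le Q^{\pi_i}\bigl(s, \pi_{i+1}(s)\bigr) \\
&= R\bigl(s, \pi_{i+1}(s)\bigr) + \gamma \sum_{s'} P\bigl(s' \mid s, \pi_{i+1}(s)\bigr)\, V^{\pi_i}(s') \\
&\le R\bigl(s, \pi_{i+1}(s)\bigr) + \gamma \sum_{s'} P\bigl(s' \mid s, \pi_{i+1}(s)\bigr)\, Q^{\pi_i}\bigl(s', \pi_{i+1}(s')\bigr) \\
&\le \dots \le V^{\pi_{i+1}}(s).
\end{aligned}$$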

Proof