Reinforcement Learning/Monte Carlo Policy Evaluation

The goal is to estimate the state-value function $V^\pi(s)$ by generating many episodes under policy $\pi$.

  • An episode is a series of states, actions, and rewards $(s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_T)$ generated from a Markov Decision Process (MDP) under policy $\pi$.
  • In this method, we simply simulate many trajectories (decision processes) and calculate the average returns.
  • The error of the calculated return estimate decreases as $O(1/\sqrt{n})$, where $n$ is the number of trajectories created (illustrated in the sketch after this list).
  • This method can be used only for episodic decision processes, meaning that the trajectories are finite and terminate after a finite number of states.
  • The evaluation does NOT require formal derivation of dynamics and rewards models.
  • This method does NOT assume states to be Markov.
  • MC is generally a high-variance estimator, and reducing the variance can require a lot of data. Therefore, in cases where data is expensive to acquire or the stakes are high, MC may be impractical.
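
To illustrate the averaging idea and the $1/\sqrt{n}$ error scaling mentioned above, here is a minimal sketch, assuming a toy one-step episodic process whose true expected return is 1.0; the process, its reward distribution, and the function name sample_return are illustrative assumptions, not part of the article.

import random

# Toy episodic process (illustrative assumption): each episode is a single
# step whose return is 1.0 plus Gaussian noise, so the true value is 1.0.
def sample_return():
    return 1.0 + random.gauss(0.0, 1.0)

# Average the returns of n simulated episodes; the absolute error of the
# estimate typically shrinks on the order of 1/sqrt(n).
for n in (10, 100, 1000, 10000):
    estimate = sum(sample_return() for _ in range(n)) / n
    print(n, abs(estimate - 1.0))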

There are different types of Monte Carlo policy evaluation:

  1. First-visit Monte Carlo
  2. Every-visit Monte Carlo
  3. Incremental Monte Carlo


First-visit Monte Carlo

Algorithm:

Initialize $N(s) \leftarrow 0$, $G(s) \leftarrow 0$ for all states $s$

Loop:

  • Sample episode $i$ as $s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, a_{i,2}, r_{i,2}, \ldots, s_{i,T_i}$
  • Define $G_{i,t} = r_{i,t} + \gamma r_{i,t+1} + \gamma^2 r_{i,t+2} + \cdots + \gamma^{T_i - t} r_{i,T_i}$ as the return from time step $t$ onwards in the $i$th episode
  • For each state $s$ visited in episode $i$
    • For the first time $t$ that state $s$ is visited in episode $i$
      • Increment counter of total first visits: $N(s) \leftarrow N(s) + 1$
      • Increment total return: $G(s) \leftarrow G(s) + G_{i,t}$
      • Update estimate: $V^\pi(s) = G(s) / N(s)$
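
Below is a minimal Python sketch of the first-visit algorithm above, assuming each episode has already been sampled under the policy and is supplied as a list of (state, reward) pairs; the function name first_visit_mc and this episode format are assumptions for illustration.

from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    N = defaultdict(int)          # N(s): number of first visits to s
    G_total = defaultdict(float)  # G(s): sum of returns following first visits
    V = defaultdict(float)        # value estimate per state
    for episode in episodes:      # episode = [(s_1, r_1), ..., (s_T, r_T)]
        # Backwards pass: compute the return G_t from every time step onward.
        returns = [0.0] * len(episode)
        G = 0.0
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G
        # Update statistics only at the first visit to each state.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                N[s] += 1
                G_total[s] += returns[t]
                V[s] = G_total[s] / N[s]
    return dict(V)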

Properties:

  • The first-visit MC estimator is an unbiased estimator of the true $V^\pi(s)$. (Read more about Bias of an estimator.)
  • By the law of large numbers, the estimate $G(s)/N(s)$ converges to the true $V^\pi(s)$ as $N(s) \to \infty$

Every-visit Monte Carlo

Algorithm:

Initialize $N(s) \leftarrow 0$, $G(s) \leftarrow 0$ for all states $s$

Loop:

  • Sample episode $i$ as $s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, a_{i,2}, r_{i,2}, \ldots, s_{i,T_i}$
  • Define $G_{i,t} = r_{i,t} + \gamma r_{i,t+1} + \gamma^2 r_{i,t+2} + \cdots + \gamma^{T_i - t} r_{i,T_i}$ as the return from time step $t$ onwards in the $i$th episode
  • For each state $s$ visited in episode $i$
    • For every time $t$ that state $s$ is visited in episode $i$
      • Increment counter of total visits: $N(s) \leftarrow N(s) + 1$
      • Increment total return: $G(s) \leftarrow G(s) + G_{i,t}$
      • Update estimate: $V^\pi(s) = G(s) / N(s)$
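
A minimal sketch of the every-visit variant, mirroring the first-visit sketch above; the only change is that statistics are updated at every occurrence of a state in an episode, not just the first. The function name every_visit_mc and the episode format are again assumptions for illustration.

from collections import defaultdict

def every_visit_mc(episodes, gamma=1.0):
    N = defaultdict(int)          # N(s): total number of visits to s
    G_total = defaultdict(float)  # G(s): sum of returns following every visit
    V = defaultdict(float)
    for episode in episodes:      # episode = [(s_1, r_1), ..., (s_T, r_T)]
        returns = [0.0] * len(episode)
        G = 0.0
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G
        for t, (s, _) in enumerate(episode):
            N[s] += 1                 # count every visit, not just the first
            G_total[s] += returns[t]
            V[s] = G_total[s] / N[s]
    return dict(V)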

Properties:

  • The every-visit MC estimator is a biased estimator of the true $V^\pi(s)$. (Read more about Bias of an estimator.)
  • The every-visit MC estimator often has lower MSE (variance + bias²) than the first-visit estimator, because we collect far more data when we count every visit.
  • The every-visit estimator is a consistent estimator, meaning that the estimate converges to the true value as the number of simulated episodes grows; in particular, its bias asymptotically goes to zero with increasing sample size.

Incremental Monte Carlo

Incremental MC policy evaluation is a more general form of policy evaluation that can be applied to both the first-visit and every-visit algorithms.

The benefit of the incremental MC algorithm is that it can be applied to cases where the system is non-stationary. It does this by giving higher weight to newer data.

In both the first-visit and every-visit MC algorithms, the value function is updated by the following equation

$V^\pi(s) \leftarrow V^\pi(s) + \frac{1}{N(s)} \left( G_{i,t} - V^\pi(s) \right)$

This equation is easily derivable by tracking the values of $G(s)$, $N(s)$, and $V^\pi(s)$ each time the value function is updated: since $V^\pi(s) = G(s)/N(s)$, the new estimate after a visit is $\frac{(N(s)-1) V^\pi(s) + G_{i,t}}{N(s)} = V^\pi(s) + \frac{1}{N(s)} \left( G_{i,t} - V^\pi(s) \right)$.

If we change the update equation to the following, we arrive at the incremental MC algorithm, which can have both first-visit and every-visit variations

$V^\pi(s) \leftarrow V^\pi(s) + \alpha \left( G_{i,t} - V^\pi(s) \right)$

If we set $\alpha = \frac{1}{N(s)}$, we recover the original first-visit or every-visit MC algorithms, but if we set $\alpha > \frac{1}{N(s)}$ (for example, a constant step size), we have an algorithm that gives more weight to newer data and is more suitable for non-stationary domains.
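
Below is a minimal Python sketch of the incremental update in its every-visit form, assuming the same episode format as the sketches above; passing alpha=None reproduces the running-average $1/N(s)$ update, while a constant alpha (e.g. 0.1) weights recent episodes more heavily. The function name incremental_mc and its signature are assumptions for illustration.

from collections import defaultdict

def incremental_mc(episodes, gamma=1.0, alpha=None):
    N = defaultdict(int)
    V = defaultdict(float)
    for episode in episodes:      # episode = [(s_1, r_1), ..., (s_T, r_T)]
        # Backwards pass: compute the return G_t from every time step onward.
        returns = [0.0] * len(episode)
        G = 0.0
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G
        for t, (s, _) in enumerate(episode):   # every-visit variation
            N[s] += 1
            step = 1.0 / N[s] if alpha is None else alpha
            V[s] += step * (returns[t] - V[s])  # V(s) <- V(s) + step * (G_{i,t} - V(s))
    return dict(V)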