Reinforcement Learning/Monte Carlo Policy Evaluation

The goal is to estimate the state-value function $V^\pi(s)$ by generating many episodes under policy $\pi$.

  • An episode is a series of states, actions, and rewards $(s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_T)$ generated from a Markov Decision Process (MDP) under policy $\pi$.
  • In this method, we simply simulate many trajectories (decision processes) and calculate the average returns.
  • The error of the calculated return estimate decreases as $O(1/\sqrt{n})$, where $n$ is the number of trajectories created (illustrated in the sketch after this list).
  • This method can be used only for episodic decision processes, meaning that the trajectories are finite and terminate after a finite number of states.
  • The evaluation does NOT require formal derivation of dynamics and rewards models.
  • This method does NOT assume states to be Markov.
  • MC is generally a high-variance estimator, and reducing the variance can require a lot of data. Therefore, in cases where data is expensive to acquire or the stakes are high, MC may be impractical.
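
To illustrate the averaging idea and the $1/\sqrt{n}$ error scaling mentioned above, here is a minimal sketch, assuming a toy one-step episodic process whose true expected return is 1.0; the process, its reward distribution, and the function name sample_return are illustrative assumptions, not part of the article.

import random

# Toy episodic process (illustrative assumption): each episode is a single
# step whose return is 1.0 plus Gaussian noise, so the true value is 1.0.
def sample_return():
    return 1.0 + random.gauss(0.0, 1.0)

# Average the returns of n simulated episodes; the absolute error of the
# estimate typically shrinks on the order of 1/sqrt(n).
for n in (10, 100, 1000, 10000):
    estimate = sum(sample_return() for _ in range(n)) / n
    print(n, abs(estimate - 1.0))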

There are different types of Monte Carlo policy evaluation:

  1. First-visit Monte Carlo
  2. Every-visit Monte Carlo
  3. Incremental Monte Carlo


First-visit Monte Carlo

Algorithm:

Initialize $N(s) \leftarrow 0$, $G(s) \leftarrow 0$ for all states $s$

Loop:

  • Sample episode $i$ as $s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, a_{i,2}, r_{i,2}, \ldots, s_{i,T_i}$
  • Define $G_{i,t} = r_{i,t} + \gamma r_{i,t+1} + \gamma^2 r_{i,t+2} + \cdots + \gamma^{T_i - t} r_{i,T_i}$ as the return from time step $t$ onwards in the $i$th episode
  • For each state $s$ visited in episode $i$
    • For the first time $t$ that state $s$ is visited in episode $i$
      • Increment counter of total first visits: $N(s) \leftarrow N(s) + 1$
      • Increment total return: $G(s) \leftarrow G(s) + G_{i,t}$
      • Update estimate: $V^\pi(s) = G(s) / N(s)$
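
Below is a minimal Python sketch of the first-visit algorithm above, assuming each episode has already been sampled under the policy and is supplied as a list of (state, reward) pairs; the function name first_visit_mc and this episode format are assumptions for illustration.

from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    N = defaultdict(int)          # N(s): number of first visits to s
    G_total = defaultdict(float)  # G(s): sum of returns following first visits
    V = defaultdict(float)        # value estimate per state
    for episode in episodes:      # episode = [(s_1, r_1), ..., (s_T, r_T)]
        # Backwards pass: compute the return G_t from every time step onward.
        returns = [0.0] * len(episode)
        G = 0.0
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G
        # Update statistics only at the first visit to each state.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                N[s] += 1
                G_total[s] += returns[t]
                V[s] = G_total[s] / N[s]
    return dict(V)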

Properties:

  • The first-visit MC estimator is an unbiased estimator of the true $V^\pi(s)$. (Read more about Bias of an estimator.)
  • By the law of large numbers, the estimate $G(s)/N(s)$ converges to the true $V^\pi(s)$ as $N(s) \to \infty$

Every-visit Monte Carlo

Algorithm:

Initialize $N(s) \leftarrow 0$, $G(s) \leftarrow 0$ for all states $s$

Loop:

  • Sample episode $i$ as $s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, a_{i,2}, r_{i,2}, \ldots, s_{i,T_i}$
  • Define $G_{i,t} = r_{i,t} + \gamma r_{i,t+1} + \gamma^2 r_{i,t+2} + \cdots + \gamma^{T_i - t} r_{i,T_i}$ as the return from time step $t$ onwards in the $i$th episode
  • For each state $s$ visited in episode $i$
    • For every time $t$ that state $s$ is visited in episode $i$
      • Increment counter of total visits: $N(s) \leftarrow N(s) + 1$
      • Increment total return: $G(s) \leftarrow G(s) + G_{i,t}$
      • Update estimate: $V^\pi(s) = G(s) / N(s)$
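
A minimal sketch of the every-visit variant, mirroring the first-visit sketch above; the only change is that statistics are updated at every occurrence of a state in an episode, not just the first. The function name every_visit_mc and the episode format are again assumptions for illustration.

from collections import defaultdict

def every_visit_mc(episodes, gamma=1.0):
    N = defaultdict(int)          # N(s): total number of visits to s
    G_total = defaultdict(float)  # G(s): sum of returns following every visit
    V = defaultdict(float)
    for episode in episodes:      # episode = [(s_1, r_1), ..., (s_T, r_T)]
        returns = [0.0] * len(episode)
        G = 0.0
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G
        for t, (s, _) in enumerate(episode):
            N[s] += 1                 # count every visit, not just the first
            G_total[s] += returns[t]
            V[s] = G_total[s] / N[s]
    return dict(V)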

Properties:

  • The every-visit MC estimator is a biased estimator of the true $V^\pi(s)$. (Read more about Bias of an estimator.)
  • The every-visit MC estimator often has lower MSE (variance + bias²) than the first-visit estimator, because we collect far more data when we count every visit.
  • The every-visit estimator is a consistent estimator, meaning that the estimate converges to the true value as the number of simulated episodes grows; in particular, its bias asymptotically goes to zero with increasing sample size.

Incremental Monte Carlo

Incremental MC policy evaluation is a more general form of policy evaluation that can be applied to both the first-visit and every-visit algorithms.

The benefit of the incremental MC algorithm is that it can be applied to cases where the system is non-stationary. It does this by giving higher weight to newer data.

In both the first-visit and every-visit MC algorithms, the value function is updated by the following equation

$V^\pi(s) \leftarrow V^\pi(s) + \frac{1}{N(s)} \left( G_{i,t} - V^\pi(s) \right)$

This equation is easily derivable by tracking the values of $G(s)$, $N(s)$, and $V^\pi(s)$ each time the value function is updated: since $V^\pi(s) = G(s)/N(s)$, the new estimate after a visit is $\frac{(N(s)-1) V^\pi(s) + G_{i,t}}{N(s)} = V^\pi(s) + \frac{1}{N(s)} \left( G_{i,t} - V^\pi(s) \right)$.

If we change the update equation to the following, we arrive at the incremental MC algorithm, which can have both first-visit and every-visit variations

$V^\pi(s) \leftarrow V^\pi(s) + \alpha \left( G_{i,t} - V^\pi(s) \right)$

If we set $\alpha = \frac{1}{N(s)}$, we recover the original first-visit or every-visit MC algorithms, but if we set $\alpha > \frac{1}{N(s)}$ (for example, a constant step size), we have an algorithm that gives more weight to newer data and is more suitable for non-stationary domains.
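
Below is a minimal Python sketch of the incremental update in its every-visit form, assuming the same episode format as the sketches above; passing alpha=None reproduces the running-average $1/N(s)$ update, while a constant alpha (e.g. 0.1) weights recent episodes more heavily. The function name incremental_mc and its signature are assumptions for illustration.

from collections import defaultdict

def incremental_mc(episodes, gamma=1.0, alpha=None):
    N = defaultdict(int)
    V = defaultdict(float)
    for episode in episodes:      # episode = [(s_1, r_1), ..., (s_T, r_T)]
        # Backwards pass: compute the return G_t from every time step onward.
        returns = [0.0] * len(episode)
        G = 0.0
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G
        for t, (s, _) in enumerate(episode):   # every-visit variation
            N[s] += 1
            step = 1.0 / N[s] if alpha is None else alpha
            V[s] += step * (returns[t] - V[s])  # V(s) <- V(s) + step * (G_{i,t} - V(s))
    return dict(V)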