Reinforcement Learning/Temporal Difference Learning

Temporal difference (TD) learning is a central and novel idea in reinforcement learning.

  • It combines ideas from Monte Carlo and dynamic programming methods
  • It is a model-free learning algorithm
  • It both bootstraps (builds on top of a previous best estimate) and samples
  • It can be used for both episodic and infinite-horizon (non-episodic) domains
  • It immediately updates its estimate of V after each sampled transition
  • It requires the system to be Markovian
  • It is a biased estimator of the value function, but often has much lower variance than the Monte Carlo estimator (compare the two targets written out after this list)
  • It converges to the true value function in finite-state (tabular) cases, but does not always converge when the number of states is infinite and the value function must be approximated (function approximation)
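
To illustrate the bootstrapping and bias/variance points above, the two update targets can be written side by side (standard definitions, shown here only for comparison). The Monte Carlo target is the full sampled return, which is unbiased but has high variance:

    G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{T-t-1} r_{T-1}

The TD(0) target replaces all but the first sampled reward with the current estimate, which introduces bias but greatly reduces variance:

    r_t + \gamma V(s_{t+1})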

Algorithm Temporal Difference Learning TD(0)

TD learning can be viewed as a spectrum of methods between pure Monte Carlo and pure dynamic programming, but the simplest TD algorithm, TD(0), is as follows (a minimal code sketch is given after the outline):

  • Input: step size \alpha \in (0, 1], discount factor \gamma, and the policy \pi to be evaluated
  • Initialize V^\pi(s) = 0 for all states s
  • Loop
    • Sample tuple (s_t, a_t, r_t, s_{t+1}) by acting according to \pi
    • Update V^\pi(s_t) \leftarrow V^\pi(s_t) + \alpha \left( r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t) \right)
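
A minimal Python sketch of the loop above, assuming a gym-like environment object with reset() and step(action) methods and a fixed policy function (these names and the interface are illustrative assumptions, not part of the original algorithm description):

    import numpy as np

    def td0_policy_evaluation(env, policy, n_states, alpha=0.1, gamma=0.99, n_episodes=1000):
        """Tabular TD(0) evaluation of a fixed policy.

        env    : assumed interface: reset() -> state, step(action) -> (next_state, reward, done)
        policy : function mapping a state index to an action
        """
        V = np.zeros(n_states)                      # Initialize V(s) = 0 for all s
        for _ in range(n_episodes):
            state = env.reset()
            done = False
            while not done:
                action = policy(state)
                next_state, reward, done = env.step(action)
                # Bootstrapped TD target; drop the bootstrap term if the episode terminated
                target = reward + gamma * V[next_state] * (not done)
                delta = target - V[state]           # TD error: r + gamma*V(s') - V(s)
                V[state] += alpha * delta           # update immediately after each transition
                state = next_state
        return V

Because the update happens after every transition, the same inner loop also works for non-episodic tasks if the while-loop is replaced by a fixed number of steps.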

The temporal difference error is defined as

\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)

n-step Return

G_t^{(1)} = r_t + \gamma V(s_{t+1}) is TD(0)

G_t^{(2)} = r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2}), and so on up to infinity

G_t^{(\infty)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{T-t-1} r_{T-1} is MC

The general n-step return
G_t^{(n)} = r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n}),
together with the update V(s_t) \leftarrow V(s_t) + \alpha \left( G_t^{(n)} - V(s_t) \right), is defined as n-step TD learning, TD(n).
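
A small sketch of how the n-step target could be computed from a stored trajectory, assuming lists rewards and states collected while following the policy and the value table V from the TD(0) sketch above (the function name and arguments are illustrative, not from the original page):

    def n_step_return(rewards, states, V, t, n, gamma=0.99):
        """n-step return G_t^(n): n sampled rewards, then bootstrap from V.

        rewards[k] is the reward received on the transition from states[k] to states[k+1].
        If the episode ends within n steps, this reduces to the Monte Carlo return.
        """
        T = len(rewards)                      # number of transitions in the episode
        horizon = min(t + n, T)               # stop early at episode end
        G = 0.0
        for k in range(t, horizon):
            G += gamma ** (k - t) * rewards[k]
        if horizon == t + n:                  # bootstrap only if the episode did not end first
            G += gamma ** n * V[states[t + n]]
        return G

    # n-step TD update for state s_t, mirroring the TD(0) update above:
    # V[states[t]] += alpha * (n_step_return(rewards, states, V, t, n) - V[states[t]])

Setting n = 1 recovers the TD(0) target, while letting n grow past the episode length recovers the Monte Carlo return.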