Reinforcement Learning/Temporal Difference Learning
Temporal difference (TD) learning is a central and novel idea in reinforcement learning.
- It is a combination of Monte Carlo and dynamic programming methods
- It is a model-free learning algorithm
- It both bootstraps (builds on top of the previous best estimate) and samples
- It can be used for both episodic and infinite-horizon (non-episodic) domains
- Immediately updates the estimate of V after each observed transition $(s_t, a_t, r_t, s_{t+1})$
- Requires the system to be Markovian
- Biased estimator of the value function, but often has much lower variance than the Monte Carlo estimator (see the comparison of update targets after this list)
- Converges to the true value function in the tabular (finite-state) case, but does not always converge when the value function is represented with function approximation (e.g., with an infinite or very large number of states)
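To make the bias/variance point concrete, the following contrasts the Monte Carlo and TD(0) update targets (a standard comparison; the notation $G_t$, $\alpha$, $\gamma$ is assumed here rather than taken from this page):

$$V(s_t) \leftarrow V(s_t) + \alpha \left( G_t - V(s_t) \right), \qquad G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \quad \text{(Monte Carlo target: unbiased, high variance)}$$

$$V(s_t) \leftarrow V(s_t) + \alpha \left( r_t + \gamma V(s_{t+1}) - V(s_t) \right) \quad \text{(TD(0) target: biased by the current estimate, lower variance)}$$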
Algorithm Temporal Difference Learning TD(0)
TD learning can be applied as a spectrum between pure Monte Carlo and dynamic programming, but the simplest form, TD(0), is as follows
- Input: policy $\pi$ to evaluate, step size $\alpha$, discount factor $\gamma$
- Initialize $V^\pi(s) = 0$ for all states $s$
- Loop
- Sample tuple $(s_t, a_t, r_t, s_{t+1})$
- Update $V^\pi(s_t) \leftarrow V^\pi(s_t) + \alpha \left( r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t) \right)$
The temporal difference error is defined as $\delta_t = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)$, so the update can also be written as $V^\pi(s_t) \leftarrow V^\pi(s_t) + \alpha \, \delta_t$.
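Below is a minimal sketch of tabular TD(0) policy evaluation in Python, assuming a Gym-style environment interface (`env.reset()` returning a state and `env.step(action)` returning `(next_state, reward, done, info)`) and a `policy(state)` function; these names are illustrative assumptions rather than part of the algorithm statement above.

```python
import collections

def td0_policy_evaluation(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0) policy evaluation (sketch; assumes a Gym-style env and hashable states)."""
    V = collections.defaultdict(float)  # V(s) initialized to 0 for all states
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            # The TD target bootstraps from the current estimate of the next state's value;
            # terminal states are treated as having value 0.
            target = reward if done else reward + gamma * V[next_state]
            td_error = target - V[state]   # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
            V[state] += alpha * td_error   # V(s_t) <- V(s_t) + alpha * delta_t
            state = next_state
    return V
```

Note that, unlike Monte Carlo, the value table is updated inside the episode loop after every single transition.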
n-step Return
- $G_t^{(1)} = r_t + \gamma V(s_{t+1})$ is TD(0)
- $G_t^{(2)} = r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2})$, and so on up to infinity
- $G_t^{(\infty)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{T-t-1} r_{T-1}$ is MC (the full Monte Carlo return)
In general, the $n$-step return $G_t^{(n)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n})$ is what defines $n$-step learning, TD(n), whose update is $V(s_t) \leftarrow V(s_t) + \alpha \left( G_t^{(n)} - V(s_t) \right)$.
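As a complement, here is a short Python sketch that applies the $n$-step return update to every time step of one recorded episode; the trajectory representation (`states` and `rewards` lists) and the dictionary value table are assumptions made for illustration.

```python
def n_step_td_update(V, states, rewards, n, alpha=0.1, gamma=0.99):
    """Apply the n-step TD update V(s_t) <- V(s_t) + alpha * (G_t^(n) - V(s_t))
    to each time step of one recorded episode (sketch).

    V       : dict mapping state -> value estimate, updated in place
    states  : [s_0, s_1, ..., s_T] visited states (s_T terminal)
    rewards : [r_0, r_1, ..., r_{T-1}], where r_t is received after leaving s_t
    """
    T = len(rewards)
    for t in range(T):
        # Sum up to n discounted rewards starting at time t ...
        G = sum((gamma ** k) * rewards[t + k] for k in range(min(n, T - t)))
        # ... and bootstrap from V(s_{t+n}) if the episode has not ended by then.
        if t + n < T:
            G += (gamma ** n) * V.get(states[t + n], 0.0)
        old = V.get(states[t], 0.0)
        V[states[t]] = old + alpha * (G - old)
    return V
```

With n = 1 this reduces to the TD(0) target, and with n larger than the episode length it reduces to the Monte Carlo return; in practice the same update is often applied online using a sliding window of the most recent n transitions.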