Are RL methods stable with function approximation?
The situation is a bit complicated and in flux at present. Stability guarantees depend on the specific algorithm, on the function approximator, and on how the two are used together. This is what we knew as of August 2001:

• For some nonlinear parameterized function approximators (FA), any temporal-difference (TD) learning method (including Q-learning and Sarsa) can become unstable, with the parameters and value estimates going to infinity. [Tsitsiklis & Van Roy 1996]

• TD(lambda) with linear FA converges near the best linear solution when trained on-policy (a minimal sketch of this update follows the list)... [Tsitsiklis & Van Roy 1997]

• ...but may become unstable when trained off-policy, that is, when updating states under a different distribution than the one seen when following the policy. [Baird 1995]

• From which it follows that Q-learning with linear FA can also be unstable. [Baird 1995]

• Sarsa(lambda), on the other hand, is guaranteed stable, although only the weakest of error bounds has been shown. [Gordon 2001]

• New linear TD algorithms for the off-policy case have
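The following is a minimal sketch of the on-policy linear setting referred to above: semi-gradient TD(lambda) with accumulating eligibility traces. The env_step and features helpers and the parameter values are hypothetical placeholders, not part of any particular library; the point is only that updates are applied to states visited while following the policy being evaluated. Applying the same update rule under a different state distribution (the off-policy case, as in Baird's counterexample) can drive the parameters to infinity.

import numpy as np

def linear_td_lambda(env_step, features, theta,
                     alpha=0.01, gamma=0.99, lam=0.9, n_steps=10000):
    # Semi-gradient TD(lambda) with linear function approximation.
    # env_step(s) -> (reward, next_state, done) is assumed to sample
    # transitions while FOLLOWING the policy being evaluated (on-policy).
    # features(s) -> 1-D numpy array phi(s); theta is the weight vector.
    s = 0                                     # assumed integer start state
    z = np.zeros_like(theta)                  # eligibility trace
    for _ in range(n_steps):
        phi = features(s)
        reward, s_next, done = env_step(s)
        v = theta @ phi                       # current estimate V(s)
        v_next = 0.0 if done else theta @ features(s_next)
        delta = reward + gamma * v_next - v   # TD error
        z = gamma * lam * z + phi             # accumulating trace
        theta = theta + alpha * delta * z     # linear semi-gradient update
        if done:                              # start a new episode
            z = np.zeros_like(theta)
            s = 0
        else:
            s = s_next
    return theta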