11,791 Hits in 8.1 sec

A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation

Richard S. Sutton, Csaba Szepesvári, Hamid Reza Maei
2008 Neural Information Processing Systems  
We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, behavior policy, and target  ...  The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L2 norm.  ...  Conclusion: GTD(0) is the first off-policy TD algorithm to converge under general conditions with linear function approximation and linear complexity.  ...
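
A minimal sketch of the GTD(0)-style update described in this abstract, with one weight vector tracking the expected TD(0) update and another descending the resulting objective; the feature vectors, step sizes, and function names are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def gtd0_step(theta, u, phi, reward, phi_next, gamma=0.99, alpha=0.01, beta=0.1):
    """One GTD(0)-style update on a single transition (illustrative sketch).

    theta : weights of the linear value estimate V(s) ~ theta @ phi(s)
    u     : auxiliary weights tracking the expected TD(0) update E[delta * phi]
    Step sizes alpha, beta and the discount gamma are illustrative defaults.
    """
    delta = reward + gamma * phi_next @ theta - phi @ theta       # TD(0) error
    u = u + beta * (delta * phi - u)                              # estimate E[delta * phi]
    theta = theta + alpha * (phi - gamma * phi_next) * (phi @ u)  # descend the L2-norm objective
    return theta, u

# Usage (illustrative): d-dimensional features, one update per observed transition.
# theta, u = np.zeros(d), np.zeros(d)
# theta, u = gtd0_step(theta, u, phi_s, r, phi_s_next)
```
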
dblp:conf/nips/SuttonSM08 fatcat:yqffnwixvzbgtkl3cgcklggvyi

Actor-Critic Algorithms for Learning Nash Equilibria in N-player General-Sum Games [article]

H. L. Prasad, L. A. Prashanth, Shalabh Bhatnagar
2015 arXiv   pre-print
Both algorithms use a critic that estimates the value function for a fixed policy and an actor that performs descent in the policy space using a descent direction that avoids local minima.  ...  We first generalize a non-linear optimization problem from Filar and Vrieze [2004] to an N-player setting and break down this problem into simpler sub-problems that ensure there is no Bellman error for  ...  ., of ON-SGSP), the critic is based on temporal difference (TD) learning [Sutton and Barto, 1998]. Actor recursion: This incrementally updates the policy using gradient descent.  ...
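
As a rough illustration of the critic/actor split described above (not the paper's ON-SGSP or OFF-SGSP recursions), here is a generic one-step actor-critic update with a TD-learning critic and a softmax policy; the parameter shapes, step sizes, and names are assumptions made for the sketch:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def actor_critic_step(theta, w, phi_s, action, reward, phi_s_next,
                      gamma=0.99, alpha_w=0.05, alpha_theta=0.01):
    """Generic one-step actor-critic update (illustrative sketch).

    w     : critic weights, V(s) ~ w @ phi_s, updated by TD(0)
    theta : actor parameters (n_actions x d) of a softmax policy over preferences theta @ phi_s
    """
    delta = reward + gamma * w @ phi_s_next - w @ phi_s   # critic's TD error
    w = w + alpha_w * delta * phi_s                       # critic: TD(0) policy evaluation
    pi = softmax(theta @ phi_s)                           # current action probabilities
    grad_log = -np.outer(pi, phi_s)                       # grad of log pi(action|s) w.r.t. theta
    grad_log[action] += phi_s
    theta = theta + alpha_theta * delta * grad_log        # actor: incremental policy-gradient step
    return theta, w
```
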
arXiv:1401.2086v2 fatcat:gg5fmy4rlbgg3f5v2ata6ng2rq

Multi-Timescale, Gradient Descent, Temporal Difference Learning with Linear Options [article]

Peeyush Kumar, Doina Precup
2017 arXiv   pre-print
We propose here an efficient algorithm which is convergent under linear function approximation while planning using temporally abstract actions.  ...  Moreover using spatial abstractions to learn policies for various situations at once while using temporal abstraction models is an open problem.  ...  Recently various temporal difference algorithms were proposed which are compatible with both linear function approximation and off-policy training using primitive action in MDP setting (GTD algorithm  ... 
arXiv:1703.06471v1 fatcat:ednw3y5xvbfgnkspsb7opl3nia

N-Timescale Stochastic Approximation: Stability and Convergence [article]

Rohan Deb, Shalabh Bhatnagar
2022 arXiv   pre-print
As an application of our results, we study SA algorithms with an added heavy ball momentum term in the context of Gradient Temporal Difference (GTD) algorithms.  ...  This paper presents the first sufficient conditions that guarantee the stability and almost sure convergence of N-timescale stochastic approximation (SA) iterates for any N≥1.  ...  With linear function approximation for policy evaluation (i.e., for a fixed π), the goal is to estimate V^π(s) from samples of the form (s_t, r_{t+1}, s_{t+1}) through a linear model V_θ(s) = θ^⊤φ(s)  ...
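
A small sketch of the policy-evaluation setup spelled out in this snippet, estimating V^π(s) ≈ θ^⊤φ(s) from transition samples with plain linear TD(0); the momentum-augmented GTD variants studied in the paper are not reproduced, and the feature map and constants are illustrative:

```python
import numpy as np

def td0_policy_evaluation(transitions, phi, d, gamma=0.99, alpha=0.05):
    """Estimate V^pi(s) ~ theta @ phi(s) from samples (s_t, r_{t+1}, s_{t+1}).

    transitions : iterable of (s, r_next, s_next) generated under the fixed policy pi
    phi         : feature map from a state to a length-d numpy vector (illustrative)
    """
    theta = np.zeros(d)
    for s, r_next, s_next in transitions:
        delta = r_next + gamma * theta @ phi(s_next) - theta @ phi(s)  # TD error
        theta += alpha * delta * phi(s)                                # semi-gradient TD(0) step
    return theta
```
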
arXiv:2112.03515v2 fatcat:r3eqerwuafgozi3swxazsvvzcy

Fast gradient-descent methods for temporal-difference learning with linear function approximation

Richard S. Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, Eric Wiewiora
2009 Proceedings of the 26th Annual International Conference on Machine Learning - ICML '09  
Sutton, Szepesvári and Maei (2009) recently introduced the first temporal-difference learning algorithm compatible with both linear function approximation and off-policy training, and whose complexity  ...  This algorithm appears to extend linear TD to off-policy learning with no penalty in performance while only doubling computational requirements.  ...  Linear value-function approximation We consider a prototypical case of temporal-difference learning, that of learning a linear approximation to the state-value function for a given policy and Markov decision  ... 
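
A hedged sketch of a TDC-style update ("linear TD with gradient correction"), the kind of two-weight, linear-per-step recursion this line of work develops; the importance weight rho and the step sizes are illustrative assumptions rather than the paper's exact algorithm:

```python
import numpy as np

def tdc_step(theta, w, phi, reward, phi_next, rho=1.0,
             gamma=0.99, alpha=0.01, beta=0.1):
    """TDC-style update on one transition (illustrative sketch).

    rho is an importance-sampling ratio for off-policy samples (1.0 when on-policy).
    The correction term -gamma * phi_next * (phi @ w) is what distinguishes this
    update from plain off-policy TD(0).
    """
    delta = reward + gamma * phi_next @ theta - phi @ theta
    theta = theta + alpha * rho * (delta * phi - gamma * phi_next * (phi @ w))
    w = w + beta * rho * (delta - phi @ w) * phi
    return theta, w
```
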
doi:10.1145/1553374.1553501 dblp:conf/icml/SuttonMPBSSW09 fatcat:lq56cfqofvhmxmxyv5r3i4bw74

A Convergent Off-Policy Temporal Difference Algorithm [article]

Raghuram Bharadwaj Diddigi, Chandramouli Kamanchi, Shalabh Bhatnagar
2019 arXiv   pre-print
In this work, we propose a convergent on-line off-policy TD algorithm under linear function approximation.  ...  This problem is studied under the setting of off-policy prediction. Temporal Difference (TD) learning algorithms are a popular class of algorithms for solving the prediction problem.  ...  The contributions of our paper are as follows: • We derive an online off-policy TD learning algorithm with linear function approximation.  ... 
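
For context, a minimal sketch of the standard importance-sampling-corrected off-policy linear TD(0) recursion that work like this starts from; this plain recursion is not guaranteed to converge, and the paper's convergent modification is not reproduced here. The policy probabilities and features are illustrative:

```python
import numpy as np

def off_policy_td0_step(theta, phi, reward, phi_next,
                        pi_prob, mu_prob, gamma=0.99, alpha=0.01):
    """Plain off-policy linear TD(0) with an importance-sampling ratio (illustrative).

    pi_prob : probability of the taken action under the target policy pi
    mu_prob : probability of the taken action under the behavior policy mu
    """
    rho = pi_prob / mu_prob                                   # importance-sampling ratio
    delta = reward + gamma * phi_next @ theta - phi @ theta   # TD error
    return theta + alpha * rho * delta * phi                  # baseline update; may diverge
```
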
arXiv:1911.05697v1 fatcat:sqgxx34bpvd2jbb33yhicaghp4

MapReduce for Parallel Reinforcement Learning [chapter]

Yuxi Li, Dale Schuurmans
2012 Lecture Notes in Computer Science  
Furthermore, we design parallel reinforcement learning algorithms to deal with large scale problems using linear function approximation, including model-based projection, least squares policy iteration, temporal difference learning and recent gradient temporal difference learning algorithms.  ...  squares policy iteration, temporal difference learning, and the very recent gradient TD algorithms.  ...
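
One way to read the map/reduce decomposition for such linear-function-approximation methods is that each worker builds local sufficient statistics and the reduce step sums them. The sketch below does this for LSTD; the shard layout, names, and ridge term are illustrative assumptions, not the chapter's actual implementation:

```python
import numpy as np

def lstd_map(transitions, phi, d, gamma=0.99):
    """Map step: local LSTD statistics A_i = sum phi (phi - gamma*phi')^T, b_i = sum r*phi."""
    A, b = np.zeros((d, d)), np.zeros(d)
    for s, r_next, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)
        b += r_next * f
    return A, b

def lstd_reduce(partials):
    """Reduce step: sum per-shard statistics and solve for the value-function weights."""
    A = sum(p[0] for p in partials)
    b = sum(p[1] for p in partials)
    return np.linalg.solve(A + 1e-6 * np.eye(len(b)), b)  # small ridge for stability (illustrative)

# theta = lstd_reduce([lstd_map(shard, phi, d) for shard in shards])
```
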
doi:10.1007/978-3-642-29946-9_30 fatcat:jyoovu2waffuzd5v3bxqyjdffm

Averaging n-step Returns Reduces Variance in Reinforcement Learning [article]

Brett Daley, Martha White, Marlos C. Machado
2024 arXiv   pre-print
We additionally prove that this variance-reduction property improves the finite-sample complexity of temporal-difference learning under linear function approximation.  ...  We prove for the first time that any compound return with the same contraction modulus as a given n-step return has strictly lower variance.  ...  Learning is off-policy in this setting, since the agent now predicts returns for a greedy policy that differs from the agent's behavior.  ... 
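
To make the object of study concrete, here is a small sketch of forming a compound return as a weighted average of n-step returns along a sampled trajectory; the particular weights, horizons, and truncation at the end of the trajectory are illustrative choices, not the paper's:

```python
def n_step_return(rewards, values, t, n, gamma=0.99):
    """n-step return: sum_{k=0}^{n-1} gamma^k * r_{t+k}  +  gamma^n * V(s_{t+n})."""
    n = min(n, len(rewards) - t)                      # truncate at the end of the trajectory
    g = sum(gamma ** k * rewards[t + k] for k in range(n))
    if t + n < len(values):
        g += gamma ** n * values[t + n]               # bootstrap from the current value estimate
    return g

def compound_return(rewards, values, t, weights, gamma=0.99):
    """Compound return: a weighted average of n-step returns (weights should sum to 1)."""
    return sum(w * n_step_return(rewards, values, t, n, gamma) for n, w in weights.items())

# e.g. an equal-weight average of the 2-step and 4-step returns from time 0:
# g = compound_return(rewards, values, t=0, weights={2: 0.5, 4: 0.5})
```
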
arXiv:2402.03903v2 fatcat:7padude4n5cg7elqmzeml5jxce

Convergent Tree Backup and Retrace with Function Approximation [article]

Ahmed Touati, Pierre-Luc Bacon, Doina Precup, Pascal Vincent
2018 arXiv   pre-print
Unfortunately, it has been challenging to combine off-policy learning with function approximation and multi-step bootstrapping in a way that leads to both stable and efficient algorithms.  ...  Off-policy learning is key to scaling up reinforcement learning as it allows to learn about a target policy from the experience generated by a different behavior policy.  ...  A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in neural information processing systems, 2009c. Sutton, R.  ...
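
A hedged sketch of the Retrace-style multi-step off-policy return that this paper analyzes with function approximation; the truncated trace coefficients c_t = λ·min(1, π/μ) follow the standard Retrace construction, while the function signatures and tabular-style q are assumptions made for the illustration:

```python
def retrace_target(q, rewards, states, actions, n_actions, pi, mu,
                   gamma=0.99, lam=1.0):
    """Retrace-style multi-step target for Q(states[0], actions[0]) (illustrative sketch).

    q(s, a)  : current action-value estimate
    pi(a, s) : target-policy probability;  mu(a, s): behavior-policy probability
    states has one more entry than rewards/actions (the final bootstrap state).
    """
    g = q(states[0], actions[0])
    trace = 1.0                                   # running product of truncated ratios c_1 ... c_t
    for t in range(len(rewards)):
        exp_q = sum(pi(a, states[t + 1]) * q(states[t + 1], a) for a in range(n_actions))
        delta = rewards[t] + gamma * exp_q - q(states[t], actions[t])
        g += (gamma ** t) * trace * delta
        if t + 1 < len(rewards):
            trace *= lam * min(1.0, pi(actions[t + 1], states[t + 1])
                                    / mu(actions[t + 1], states[t + 1]))
    return g
```
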
arXiv:1705.09322v4 fatcat:i5aq26ao7bga5l4vsuxjrynga4

Hierarchical Average Reward Policy Gradient Algorithms (Student Abstract)

Akshay Dharmavaram, Matthew Riemer, Shalabh Bhatnagar
2020 Proceedings of the AAAI Conference on Artificial Intelligence
Furthermore, we use an ordinary differential equation based approach for our convergence analysis and prove that the parameters of the intra-option policies, termination functions, and value functions,  ...  Option-critic learning is a general-purpose reinforcement learning (RL) framework that aims to address the issue of long term credit assignment by leveraging temporal abstractions.  ...  We have randomly selected one parameter from each function approximator and plotted its value against the steps.  ...
doi:10.1609/aaai.v34i10.7160 fatcat:x3c4puiuvnhu5cmgugfrqg2syi

O^2TD: (Near)-Optimal Off-Policy TD Learning [article]

Bo Liu, Daoming Lyu, Wen Dong, Saad Biaz
2017 arXiv   pre-print
This paper makes the following contributions: (1) A batch algorithm that can help find the approximate optimal off-policy prediction of the true value function V. (2) A linear computational cost (per step) near-optimal algorithm that can learn from a collection of off-policy samples. (3) A new perspective of the emphatic temporal difference learning which bridges the gap between off-policy optimality and  ...  A comparison study of O^2TD is conducted with GTD2 and ETD as three off-policy convergent TD algorithms with linear computational cost per step.  ...
arXiv:1704.05147v2 fatcat:spl5au6jujc4lkktkd4y2lpo3i

Quasi Newton Temporal Difference Learning

Arash Givchi, Maziar Palhang
2014 Asian Conference on Machine Learning  
This paper presents a regularized Quasi Newton Temporal Difference learning algorithm which uses second-order information while maintaining a fast convergence rate.  ...  Fast convergent and computationally inexpensive policy evaluation is an essential part of reinforcement learning algorithms based on policy iteration.  ...  This algorithm works well with linear function approximation and on-policy setting but may diverge in case of non-linear function approximation (Tsitsiklis and van Roy, 1997) or off-policy setting  ...
dblp:conf/acml/GivchiP14 fatcat:rdqaj7ypwbe53fsl5cdm56c6de

On a convergent off-policy temporal difference learning algorithm in on-line learning environment [article]

Prasenjit Karmakar, Rajkumar Maity, Shalabh Bhatnagar
2016 arXiv   pre-print
In this paper we provide a rigorous convergence analysis of an "off"-policy temporal difference learning algorithm with linear function approximation and per time-step linear computational complexity in  ...  We support our theoretical results by providing suitable empirical results for standard off-policy counterexamples.  ...  To the best of our knowledge this is the first time an almost sure convergence proof of an off-policy temporal difference learning algorithm with linear function approximation is presented for step-sizes  ...
arXiv:1605.06076v1 fatcat:n2wx5rl5wbfnxaufcgxkjorg7i

Off-policy Learning with Options and Recognizers

Doina Precup, Richard S. Sutton, Cosmin Paduraru, Anna Koop, Satinder P. Singh
2005 Neural Information Processing Systems  
We introduce a new algorithm for off-policy temporal-difference learning with function approximation that has lower variance and requires less knowledge of the behavior policy than prior methods.  ...  This paper introduces recognizers and their potential advantages, then develops a full algorithm for linear function approximation and proves that its updates are in the same direction as on-policy TD  ...  We thank Csaba Szepesvári and the reviewers of the paper for constructive comments. This research was supported in part by iCore, NSERC, Alberta Ingenuity, and CFI.  ...
dblp:conf/nips/PrecupSPKS05 fatcat:g7swv35yrvhixjick5paxsau4u

Gradient Descent Temporal Difference-difference Learning [article]

Rong J.B. Zhu, James M. Murray
2022 arXiv   pre-print
Off-policy algorithms, in which a behavior policy differs from the target policy and is used to gain experience for learning, have proven to be of great practical value in reinforcement learning.  ...  However, even for simple convex problems such as linear value function approximation, these algorithms are not guaranteed to be stable.  ...  In this work, we introduce a new gradient descent algorithm for temporal difference learning with linear value function approximation.  ... 
arXiv:2209.04624v1 fatcat:spxwrqcgzna7jgm4kfycghcweu
Showing results 1 — 15 out of 11,791 results