A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation
2008
Neural Information Processing Systems
We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, behavior policy, and target ...
The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L2 norm (see the sketch after this entry). ...
Conclusion GTD(0) is the first off-policy TD algorithm to converge under general conditions with linear function approximation and linear complexity. ...
dblp:conf/nips/SuttonSM08
fatcat:yqffnwixvzbgtkl3cgcklggvyi
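The update described in this abstract has a compact two-vector form. Below is a minimal sketch of a GTD(0)-style step, assuming NumPy, hand-supplied feature vectors, and illustrative step sizes; the off-policy importance corrections are omitted for brevity, so this illustrates the mechanism rather than reproducing the paper's exact algorithm.

```python
import numpy as np

def gtd0_step(theta, w, phi, phi_next, reward,
              gamma=0.99, alpha=0.01, beta=0.1):
    """One GTD(0)-style update from a transition (s, r, s').

    Off-policy importance ratios are omitted here for brevity.
    """
    # TD(0) error under the current weights theta.
    delta = reward + gamma * (phi_next @ theta) - phi @ theta
    # Main weights: stochastic gradient step on the squared L2 norm of the
    # expected TD update, using w as a proxy for that expectation.
    theta = theta + alpha * (phi - gamma * phi_next) * (phi @ w)
    # Secondary weights: running estimate of E[delta * phi].
    w = w + beta * (delta * phi - w)
    return theta, w

# Illustrative usage with random 8-dimensional features.
rng = np.random.default_rng(0)
theta, w = np.zeros(8), np.zeros(8)
for _ in range(100):
    phi, phi_next = rng.normal(size=8), rng.normal(size=8)
    theta, w = gtd0_step(theta, w, phi, phi_next, reward=rng.normal())
```

The secondary vector w is what keeps the per-step cost O(n) in the number of features, which is the linear complexity the abstract highlights.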
Actor-Critic Algorithms for Learning Nash Equilibria in N-player General-Sum Games
[article]
2015
arXiv
pre-print
Both algorithms use a critic that estimates the value function for a fixed policy and an actor that performs descent in the policy space using a descent direction that avoids local minima. ...
We first generalize a non-linear optimization problem from Filar and Vrieze [2004] to an N-player setting and break down this problem into simpler sub-problems that ensure there is no Bellman error for ...
... of ON-SGSP), the critic is based on temporal difference (TD) learning [Sutton and Barto, 1998]. Actor recursion: this incrementally updates the policy using gradient descent (a generic two-timescale skeleton follows this entry). ...
arXiv:1401.2086v2
fatcat:gg5fmy4rlbgg3f5v2ata6ng2rq
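The critic/actor split mentioned in the excerpt follows the usual two-timescale pattern. The skeleton below is a generic illustration, not ON-SGSP itself: the critic is a TD(0) recursion for a linear value estimate, the actor nudges policy parameters along a supplied score-function direction, and all names and step sizes are assumptions.

```python
import numpy as np

def actor_critic_step(policy_params, v, phi, phi_next, reward, grad_log_pi,
                      gamma=0.95, alpha_critic=0.05, alpha_actor=0.005):
    # Critic recursion: TD(0) evaluation of the current (fixed) policy,
    # run on the faster timescale (larger step size).
    delta = reward + gamma * (phi_next @ v) - phi @ v
    v = v + alpha_critic * delta * phi
    # Actor recursion: incremental gradient step in policy space on the
    # slower timescale, moving along the supplied score-function direction.
    policy_params = policy_params + alpha_actor * delta * grad_log_pi
    return policy_params, v
```

The step-size separation (alpha_actor much smaller than alpha_critic) is what lets the critic track the value of the slowly changing policy, the standard argument behind two-timescale convergence analyses.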
Multi-Timescale, Gradient Descent, Temporal Difference Learning with Linear Options
[article]
2017
arXiv
pre-print
We propose an efficient algorithm that is convergent under linear function approximation while planning with temporally abstract actions. ...
Moreover, using spatial abstractions to learn policies for various situations at once while using temporal abstraction models is an open problem. ...
Recently, various temporal difference algorithms were proposed that are compatible with both linear function approximation and off-policy training using primitive actions in the MDP setting (GTD algorithm ...
arXiv:1703.06471v1
fatcat:ednw3y5xvbfgnkspsb7opl3nia
N-Timescale Stochastic Approximation: Stability and Convergence
[article]
2022
arXiv
pre-print
As an application of our results, we study SA algorithms with an added heavy ball momentum term in the context of Gradient Temporal Difference (GTD) algorithms. ...
This paper presents the first sufficient conditions that guarantee the stability and almost sure convergence of N-timescale stochastic approximation (SA) iterates for any N≥1. ...
With linear function approximation for policy evaluation (i.e., for a fixed π), the goal is to estimate V^π(s) from samples of the form (s_t, r_{t+1}, s_{t+1}) through a linear model V_θ(s) = θ^T φ(s) ... (see the sketch after this entry)
arXiv:2112.03515v2
fatcat:r3eqerwuafgozi3swxazsvvzcy
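The policy-evaluation setup in the excerpt, together with the paper's heavy-ball application, can be pictured as two coupled recursions. The sketch below assumes NumPy, illustrative step sizes, and a standard heavy-ball momentum accumulator; it shows why adding momentum turns linear TD(0) into a coupled two-recursion iterate, the kind of system an N-timescale stochastic-approximation analysis addresses. It is not the paper's exact scheme.

```python
import numpy as np

def td0_heavy_ball_step(theta, m, phi, phi_next, reward,
                        gamma=0.99, alpha=0.01, mu=0.9):
    # Semi-gradient TD(0) error for the linear model V_theta(s) = theta^T phi(s),
    # computed from a sample (s_t, r_{t+1}, s_{t+1}) via features phi, phi_next.
    delta = reward + gamma * (phi_next @ theta) - phi @ theta
    # Heavy-ball momentum accumulator: a second coupled recursion, which is
    # where the multi-timescale stochastic-approximation machinery enters.
    m = mu * m + delta * phi
    theta = theta + alpha * m
    return theta, m
```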
Fast gradient-descent methods for temporal-difference learning with linear function approximation
2009
Proceedings of the 26th Annual International Conference on Machine Learning - ICML '09
Sutton, Szepesvári and Maei (2009) recently introduced the first temporal-difference learning algorithm compatible with both linear function approximation and off-policy training, and whose complexity ...
This algorithm appears to extend linear TD to off-policy learning with no penalty in performance while only doubling computational requirements (a TDC-style sketch follows this entry). ...
Linear value-function approximation We consider a prototypical case of temporal-difference learning, that of learning a linear approximation to the state-value function for a given policy and Markov decision ...
doi:10.1145/1553374.1553501
dblp:conf/icml/SuttonMPBSSW09
fatcat:lq56cfqofvhmxmxyv5r3i4bw74
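One of the fast gradient methods this paper introduces is TDC, whose second weight vector accounts for the "doubling" of computation mentioned above. The following is a hedged sketch of a common TDC-style update with importance ratio rho; the feature vectors, step sizes, and exact placement of rho are assumptions, so consult the paper for the precise recursions.

```python
import numpy as np

def tdc_step(theta, w, phi, phi_next, reward, rho,
             gamma=0.99, alpha=0.01, beta=0.1):
    # TD error on the sampled transition.
    delta = reward + gamma * (phi_next @ theta) - phi @ theta
    # Main weights: a TD step plus a correction term built from the second
    # weight vector w (the source of the "doubled" per-step computation).
    theta = theta + alpha * rho * (delta * phi - gamma * phi_next * (phi @ w))
    # Second weights: least-squares tracking of the expected TD error
    # conditioned on the features.
    w = w + beta * rho * (delta - phi @ w) * phi
    return theta, w
```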
A Convergent Off-Policy Temporal Difference Algorithm
[article]
2019
arXiv
pre-print
In this work, we propose a convergent on-line off-policy TD algorithm under linear function approximation. ...
This problem is studied under the setting of off-policy prediction. Temporal Difference (TD) learning algorithms are a popular class of algorithms for solving the prediction problem. ...
The contributions of our paper are as follows: • We derive an online off-policy TD learning algorithm with linear function approximation. ...
arXiv:1911.05697v1
fatcat:sqgxx34bpvd2jbb33yhicaghp4
MapReduce for Parallel Reinforcement Learning
[chapter]
2012
Lecture Notes in Computer Science
Furthermore, we design parallel reinforcement learning algorithms to deal with large-scale problems using linear function approximation, including model-based projection, least squares policy iteration, temporal difference learning and recent gradient temporal difference learning algorithms. ...
doi:10.1007/978-3-642-29946-9_30
fatcat:jyoovu2waffuzd5v3bxqyjdffm
Averaging n-step Returns Reduces Variance in Reinforcement Learning
[article]
2024
arXiv
pre-print
We additionally prove that this variance-reduction property improves the finite-sample complexity of temporal-difference learning under linear function approximation. ...
We prove for the first time that any compound return with the same contraction modulus as a given n-step return has strictly lower variance (definitions sketched after this entry). ...
Learning is off-policy in this setting, since the agent now predicts returns for a greedy policy that differs from the agent's behavior. ...
arXiv:2402.03903v2
fatcat:7padude4n5cg7elqmzeml5jxce
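For readers unfamiliar with compound returns, the LaTeX block below states the standard definitions the abstract appears to build on; the notation (weights w_n, contraction modulus as a weighted sum of discount powers) is an assumption of this sketch, not quoted from the paper.

```latex
\[
  G_t^{(n)} = \sum_{i=1}^{n} \gamma^{i-1} R_{t+i} + \gamma^{n} V(S_{t+n})
  \qquad \text{($n$-step return, bootstrapping from the estimate $V$)}
\]
\[
  G_t^{w} = \sum_{n=1}^{\infty} w_n \, G_t^{(n)},
  \qquad \sum_{n=1}^{\infty} w_n = 1,\; w_n \ge 0
  \qquad \text{(compound return)}
\]
\[
  \beta = \sum_{n=1}^{\infty} w_n \, \gamma^{n}
  \qquad \text{(contraction modulus of the associated backup)}
\]
```

Under these definitions, the paper's claim is that fixing $\beta$ equal to $\gamma^n$ for some $n$-step return and spreading weight over several $n$ strictly reduces the return's variance.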
Convergent Tree Backup and Retrace with Function Approximation
[article]
2018
arXiv
pre-print
Unfortunately, it has been challenging to combine off-policy learning with function approximation and multi-step bootstrapping in a way that leads to both stable and efficient algorithms. ...
Off-policy learning is key to scaling up reinforcement learning, as it allows learning about a target policy from the experience generated by a different behavior policy. ...
A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, 2009c. Sutton, R. ...
arXiv:1705.09322v4
fatcat:i5aq26ao7bga5l4vsuxjrynga4
Hierarchical Average Reward Policy Gradient Algorithms (Student Abstract)
2020
Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence
Furthermore, we use an ordinary differential equation based approach for our convergence analysis and prove that the parameters of the intra-option policies, termination functions, and value functions, ...
Option-critic learning is a general-purpose reinforcement learning (RL) framework that aims to address the issue of long term credit assignment by leveraging temporal abstractions. ...
... reward per step with respect to θ is: ... We have randomly selected one parameter from each function approximator and plotted its value against the steps. ...
doi:10.1609/aaai.v34i10.7160
fatcat:x3c4puiuvnhu5cmgugfrqg2syi
O^2TD: (Near)-Optimal Off-Policy TD Learning
[article]
2017
arXiv
pre-print
This paper makes the following contributions: (1) A batch algorithm that can help find the approximate optimal off-policy prediction of the true value function V. (2) A linear computational cost (per step) near-optimal algorithm that can learn from a collection of off-policy samples. (3) A new perspective on emphatic temporal difference learning which bridges the gap between off-policy optimality and ...
A comparison study of O^2TD is conducted with GTD2 and ETD as three off-policy convergent TD algorithms with linear computational cost per step. ...
arXiv:1704.05147v2
fatcat:spl5au6jujc4lkktkd4y2lpo3i
Quasi Newton Temporal Difference Learning
2014
Asian Conference on Machine Learning
This paper presents a regularized Quasi Newton Temporal Difference learning algorithm which uses second-order information while maintaining a fast convergence rate. ...
Fast convergent and computationally inexpensive policy evaluation is an essential part of reinforcement learning algorithms based on policy iteration. ...
This algorithm works well with linear function approximation in the on-policy setting, but may diverge with non-linear function approximation (Tsitsiklis and van Roy, 1997) or in the off-policy setting ( ...
dblp:conf/acml/GivchiP14
fatcat:rdqaj7ypwbe53fsl5cdm56c6de
On a convergent off-policy temporal difference learning algorithm in on-line learning environment
[article]
2016
arXiv
pre-print
In this paper we provide a rigorous convergence analysis of an "off"-policy temporal difference learning algorithm with linear function approximation and per-time-step linear computational complexity in ...
We support our theoretical results by providing suitable empirical results for standard off-policy counterexamples. ...
To the best of our knowledge, this is the first time an almost sure convergence proof of an off-policy temporal difference learning algorithm with linear function approximation is presented for step-sizes ...
arXiv:1605.06076v1
fatcat:n2wx5rl5wbfnxaufcgxkjorg7i
Off-policy Learning with Options and Recognizers
2005
Neural Information Processing Systems
We introduce a new algorithm for off-policy temporal-difference learning with function approximation that has lower variance and requires less knowledge of the behavior policy than prior methods. ...
This paper introduces recognizers and their potential advantages, then develops a full algorithm for linear function approximation and proves that its updates are in the same direction as on-policy TD ...
We thank Csaba Szepesvari and the reviewers of the paper for constructive comments. This research was supported in part by iCore, NSERC, Alberta Ingenuity, and CFI. ...
dblp:conf/nips/PrecupSPKS05
fatcat:g7swv35yrvhixjick5paxsau4u
Gradient Descent Temporal Difference-difference Learning
[article]
2022
arXiv
pre-print
Off-policy algorithms, in which a behavior policy differs from the target policy and is used to gain experience for learning, have proven to be of great practical value in reinforcement learning. ...
However, even for simple convex problems such as linear value function approximation, these algorithms are not guaranteed to be stable. ...
In this work, we introduce a new gradient descent algorithm for temporal difference learning with linear value function approximation. ...
arXiv:2209.04624v1
fatcat:spxwrqcgzna7jgm4kfycghcweu
Showing results 1–15 out of 11,791 results