11,791 Hits in 8.1 sec

A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation

Richard S. Sutton, Csaba Szepesvári, Hamid Reza Maei
2008 Neural Information Processing Systems  
We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, behavior policy, and target  ...  The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L2 norm.  ...  Conclusion: GTD(0) is the first off-policy TD algorithm to converge under general conditions with linear function approximation and linear complexity.  ...
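
A minimal sketch of the GTD(0)-style update described in this abstract, with one weight vector tracking the expected TD(0) update and another descending the resulting objective; the feature vectors, step sizes, and function names are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def gtd0_step(theta, u, phi, reward, phi_next, gamma=0.99, alpha=0.01, beta=0.1):
    """One GTD(0)-style update on a single transition (illustrative sketch).

    theta : weights of the linear value estimate V(s) ~ theta @ phi(s)
    u     : auxiliary weights tracking the expected TD(0) update E[delta * phi]
    Step sizes alpha, beta and the discount gamma are illustrative defaults.
    """
    delta = reward + gamma * phi_next @ theta - phi @ theta       # TD(0) error
    u = u + beta * (delta * phi - u)                              # estimate E[delta * phi]
    theta = theta + alpha * (phi - gamma * phi_next) * (phi @ u)  # descend the L2-norm objective
    return theta, u

# Usage (illustrative): d-dimensional features, one update per observed transition.
# theta, u = np.zeros(d), np.zeros(d)
# theta, u = gtd0_step(theta, u, phi_s, r, phi_s_next)
```
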
dblp:conf/nips/SuttonSM08 fatcat:yqffnwixvzbgtkl3cgcklggvyi

Actor-Critic Algorithms for Learning Nash Equilibria in N-player General-Sum Games [article]

H. L. Prasad, L. A. Prashanth, Shalabh Bhatnagar
2015 arXiv   pre-print
Both algorithms use a critic that estimates the value function for a fixed policy and an actor that performs descent in the policy space using a descent direction that avoids local minima.  ...  We first generalize a non-linear optimization problem from Filar and Vrieze [2004] to an N-player setting and break down this problem into simpler sub-problems that ensure there is no Bellman error for  ...  ., of ON-SGSP), the critic is based on temporal difference (TD) learning [Sutton and Barto, 1998]. Actor recursion: This incrementally updates the policy using gradient descent.  ...
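
As a rough illustration of the critic/actor split described above (not the paper's ON-SGSP or OFF-SGSP recursions), here is a generic one-step actor-critic update with a TD-learning critic and a softmax policy; the parameter shapes, step sizes, and names are assumptions made for the sketch:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def actor_critic_step(theta, w, phi_s, action, reward, phi_s_next,
                      gamma=0.99, alpha_w=0.05, alpha_theta=0.01):
    """Generic one-step actor-critic update (illustrative sketch).

    w     : critic weights, V(s) ~ w @ phi_s, updated by TD(0)
    theta : actor parameters (n_actions x d) of a softmax policy over preferences theta @ phi_s
    """
    delta = reward + gamma * w @ phi_s_next - w @ phi_s   # critic's TD error
    w = w + alpha_w * delta * phi_s                       # critic: TD(0) policy evaluation
    pi = softmax(theta @ phi_s)                           # current action probabilities
    grad_log = -np.outer(pi, phi_s)                       # grad of log pi(action|s) w.r.t. theta
    grad_log[action] += phi_s
    theta = theta + alpha_theta * delta * grad_log        # actor: incremental policy-gradient step
    return theta, w
```
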
arXiv:1401.2086v2 fatcat:gg5fmy4rlbgg3f5v2ata6ng2rq

Multi-Timescale, Gradient Descent, Temporal Difference Learning with Linear Options [article]

Peeyush Kumar, Doina Precup
2017 arXiv   pre-print
We propose here an efficient algorithm which is convergent under linear function approximation while planning using temporally abstract actions.  ...  Moreover using spatial abstractions to learn policies for various situations at once while using temporal abstraction models is an open problem.  ...  Recently various temporal difference algorithms were proposed which are compatible with both linear function approximation and off-policy training using primitive action in MDP setting (GTD algorithm  ... 
arXiv:1703.06471v1 fatcat:ednw3y5xvbfgnkspsb7opl3nia

N-Timescale Stochastic Approximation: Stability and Convergence [article]

Rohan Deb, Shalabh Bhatnagar
2022 arXiv   pre-print
As an application of our results, we study SA algorithms with an added heavy ball momentum term in the context of Gradient Temporal Difference (GTD) algorithms.  ...  This paper presents the first sufficient conditions that guarantee the stability and almost sure convergence of N-timescale stochastic approximation (SA) iterates for any N≥1.  ...  With linear function approximation for policy evaluation (i.e., for a fixed π), the goal is to estimate V^π(s) from samples of the form (s_t, r_{t+1}, s_{t+1}) through a linear model V_θ(s) = θ^⊤φ(s)  ...
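
A small sketch of the policy-evaluation setup spelled out in this snippet, estimating V^π(s) ≈ θ^⊤φ(s) from transition samples with plain linear TD(0); the momentum-augmented GTD variants studied in the paper are not reproduced, and the feature map and constants are illustrative:

```python
import numpy as np

def td0_policy_evaluation(transitions, phi, d, gamma=0.99, alpha=0.05):
    """Estimate V^pi(s) ~ theta @ phi(s) from samples (s_t, r_{t+1}, s_{t+1}).

    transitions : iterable of (s, r_next, s_next) generated under the fixed policy pi
    phi         : feature map from a state to a length-d numpy vector (illustrative)
    """
    theta = np.zeros(d)
    for s, r_next, s_next in transitions:
        delta = r_next + gamma * theta @ phi(s_next) - theta @ phi(s)  # TD error
        theta += alpha * delta * phi(s)                                # semi-gradient TD(0) step
    return theta
```
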
arXiv:2112.03515v2 fatcat:r3eqerwuafgozi3swxazsvvzcy

Fast gradient-descent methods for temporal-difference learning with linear function approximation

Richard S. Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, Eric Wiewiora
2009 Proceedings of the 26th Annual International Conference on Machine Learning - ICML '09  
Sutton, Szepesvári and Maei (2009) recently introduced the first temporal-difference learning algorithm compatible with both linear function approximation and off-policy training, and whose complexity  ...  This algorithm appears to extend linear TD to off-policy learning with no penalty in performance while only doubling computational requirements.  ...  Linear value-function approximation We consider a prototypical case of temporal-difference learning, that of learning a linear approximation to the state-value function for a given policy and Markov decision  ... 
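
A hedged sketch of a TDC-style update ("linear TD with gradient correction"), the kind of two-weight, linear-per-step recursion this line of work develops; the importance weight rho and the step sizes are illustrative assumptions rather than the paper's exact algorithm:

```python
import numpy as np

def tdc_step(theta, w, phi, reward, phi_next, rho=1.0,
             gamma=0.99, alpha=0.01, beta=0.1):
    """TDC-style update on one transition (illustrative sketch).

    rho is an importance-sampling ratio for off-policy samples (1.0 when on-policy).
    The correction term -gamma * phi_next * (phi @ w) is what distinguishes this
    update from plain off-policy TD(0).
    """
    delta = reward + gamma * phi_next @ theta - phi @ theta
    theta = theta + alpha * rho * (delta * phi - gamma * phi_next * (phi @ w))
    w = w + beta * rho * (delta - phi @ w) * phi
    return theta, w
```
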
doi:10.1145/1553374.1553501 dblp:conf/icml/SuttonMPBSSW09 fatcat:lq56cfqofvhmxmxyv5r3i4bw74

A Convergent Off-Policy Temporal Difference Algorithm [article]

Raghuram Bharadwaj Diddigi, Chandramouli Kamanchi, Shalabh Bhatnagar
2019 arXiv   pre-print
In this work, we propose a convergent on-line off-policy TD algorithm under linear function approximation.  ...  This problem is studied under the setting of off-policy prediction. Temporal Difference (TD) learning algorithms are a popular class of algorithms for solving the prediction problem.  ...  The contributions of our paper are as follows: • We derive an online off-policy TD learning algorithm with linear function approximation.  ... 
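
For context, a minimal sketch of the standard importance-sampling-corrected off-policy linear TD(0) recursion that work like this starts from; this plain recursion is not guaranteed to converge, and the paper's convergent modification is not reproduced here. The policy probabilities and features are illustrative:

```python
import numpy as np

def off_policy_td0_step(theta, phi, reward, phi_next,
                        pi_prob, mu_prob, gamma=0.99, alpha=0.01):
    """Plain off-policy linear TD(0) with an importance-sampling ratio (illustrative).

    pi_prob : probability of the taken action under the target policy pi
    mu_prob : probability of the taken action under the behavior policy mu
    """
    rho = pi_prob / mu_prob                                   # importance-sampling ratio
    delta = reward + gamma * phi_next @ theta - phi @ theta   # TD error
    return theta + alpha * rho * delta * phi                  # baseline update; may diverge
```
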
arXiv:1911.05697v1 fatcat:sqgxx34bpvd2jbb33yhicaghp4

MapReduce for Parallel Reinforcement Learning [chapter]

Yuxi Li, Dale Schuurmans
2012 Lecture Notes in Computer Science  
Furthermore, we design parallel reinforcement learning algorithms to deal with large scale problems using linear function approximation, including model-based projection, least squares policy iteration, temporal difference learning and recent gradient temporal difference learning algorithms.  ...  squares policy iteration, temporal difference learning, and the very recent gradient TD algorithms.  ...
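
One way to read the map/reduce decomposition for such linear-function-approximation methods is that each worker builds local sufficient statistics and the reduce step sums them. The sketch below does this for LSTD; the shard layout, names, and ridge term are illustrative assumptions, not the chapter's actual implementation:

```python
import numpy as np

def lstd_map(transitions, phi, d, gamma=0.99):
    """Map step: local LSTD statistics A_i = sum phi (phi - gamma*phi')^T, b_i = sum r*phi."""
    A, b = np.zeros((d, d)), np.zeros(d)
    for s, r_next, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)
        b += r_next * f
    return A, b

def lstd_reduce(partials):
    """Reduce step: sum per-shard statistics and solve for the value-function weights."""
    A = sum(p[0] for p in partials)
    b = sum(p[1] for p in partials)
    return np.linalg.solve(A + 1e-6 * np.eye(len(b)), b)  # small ridge for stability (illustrative)

# theta = lstd_reduce([lstd_map(shard, phi, d) for shard in shards])
```
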
doi:10.1007/978-3-642-29946-9_30 fatcat:jyoovu2waffuzd5v3bxqyjdffm

Averaging n-step Returns Reduces Variance in Reinforcement Learning [article]

Brett Daley, Martha White, Marlos C. Machado
2024 arXiv   pre-print
We additionally prove that this variance-reduction property improves the finite-sample complexity of temporal-difference learning under linear function approximation.  ...  We prove for the first time that any compound return with the same contraction modulus as a given n-step return has strictly lower variance.  ...  Learning is off-policy in this setting, since the agent now predicts returns for a greedy policy that differs from the agent's behavior.  ... 
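
To make the object of study concrete, here is a small sketch of forming a compound return as a weighted average of n-step returns along a sampled trajectory; the particular weights, horizons, and truncation at the end of the trajectory are illustrative choices, not the paper's:

```python
def n_step_return(rewards, values, t, n, gamma=0.99):
    """n-step return: sum_{k=0}^{n-1} gamma^k * r_{t+k}  +  gamma^n * V(s_{t+n})."""
    n = min(n, len(rewards) - t)                      # truncate at the end of the trajectory
    g = sum(gamma ** k * rewards[t + k] for k in range(n))
    if t + n < len(values):
        g += gamma ** n * values[t + n]               # bootstrap from the current value estimate
    return g

def compound_return(rewards, values, t, weights, gamma=0.99):
    """Compound return: a weighted average of n-step returns (weights should sum to 1)."""
    return sum(w * n_step_return(rewards, values, t, n, gamma) for n, w in weights.items())

# e.g. an equal-weight average of the 2-step and 4-step returns from time 0:
# g = compound_return(rewards, values, t=0, weights={2: 0.5, 4: 0.5})
```
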
arXiv:2402.03903v2 fatcat:7padude4n5cg7elqmzeml5jxce

Convergent Tree Backup and Retrace with Function Approximation [article]

Ahmed Touati, Pierre-Luc Bacon, Doina Precup, Pascal Vincent
2018 arXiv   pre-print
Unfortunately, it has been challenging to combine off-policy learning with function approximation and multi-step bootstrapping in a way that leads to both stable and efficient algorithms.  ...  Off-policy learning is key to scaling up reinforcement learning as it allows to learn about a target policy from the experience generated by a different behavior policy.  ...  A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in neural information processing systems, 2009c. Sutton, R.  ...
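
A hedged sketch of the Retrace-style multi-step off-policy return that this paper analyzes with function approximation; the truncated trace coefficients c_t = λ·min(1, π/μ) follow the standard Retrace construction, while the function signatures and tabular-style q are assumptions made for the illustration:

```python
def retrace_target(q, rewards, states, actions, n_actions, pi, mu,
                   gamma=0.99, lam=1.0):
    """Retrace-style multi-step target for Q(states[0], actions[0]) (illustrative sketch).

    q(s, a)  : current action-value estimate
    pi(a, s) : target-policy probability;  mu(a, s): behavior-policy probability
    states has one more entry than rewards/actions (the final bootstrap state).
    """
    g = q(states[0], actions[0])
    trace = 1.0                                   # running product of truncated ratios c_1 ... c_t
    for t in range(len(rewards)):
        exp_q = sum(pi(a, states[t + 1]) * q(states[t + 1], a) for a in range(n_actions))
        delta = rewards[t] + gamma * exp_q - q(states[t], actions[t])
        g += (gamma ** t) * trace * delta
        if t + 1 < len(rewards):
            trace *= lam * min(1.0, pi(actions[t + 1], states[t + 1])
                                    / mu(actions[t + 1], states[t + 1]))
    return g
```
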
arXiv:1705.09322v4 fatcat:i5aq26ao7bga5l4vsuxjrynga4

Hierarchical Average Reward Policy Gradient Algorithms (Student Abstract)

Akshay Dharmavaram, Matthew Riemer, Shalabh Bhatnagar
2020 Proceedings of the AAAI Conference on Artificial Intelligence
Furthermore, we use an ordinary differential equation based approach for our convergence analysis and prove that the parameters of the intra-option policies, termination functions, and value functions,  ...  Option-critic learning is a general-purpose reinforcement learning (RL) framework that aims to address the issue of long term credit assignment by leveraging temporal abstractions.  ...  We have randomly selected one parameter from each function approximator and plotted its value against the steps.  ...
doi:10.1609/aaai.v34i10.7160 fatcat:x3c4puiuvnhu5cmgugfrqg2syi

O^2TD: (Near)-Optimal Off-Policy TD Learning [article]

Bo Liu, Daoming Lyu, Wen Dong, Saad Biaz
2017 arXiv   pre-print
This paper makes the following contributions: (1) A batch algorithm that can help find the approximate optimal off-policy prediction of the true value function V. (2) A linear computational cost (per step) near-optimal algorithm that can learn from a collection of off-policy samples. (3) A new perspective of the emphatic temporal difference learning which bridges the gap between off-policy optimality and  ...  A comparison study of O^2TD is conducted with GTD2 and ETD as three off-policy convergent TD algorithms with linear computational cost per step.  ...
arXiv:1704.05147v2 fatcat:spl5au6jujc4lkktkd4y2lpo3i

Quasi Newton Temporal Difference Learning

Arash Givchi, Maziar Palhang
2014 Asian Conference on Machine Learning  
This paper presents a regularized Quasi Newton Temporal Difference learning algorithm which uses second-order information while maintaining a fast convergence rate.  ...  Fast convergent and computationally inexpensive policy evaluation is an essential part of reinforcement learning algorithms based on policy iteration.  ...  This algorithm works well with linear function approximation and on-policy setting but may diverge in case of non-linear function approximation (Tsitsiklis and van Roy, 1997) or off-policy setting  ...
dblp:conf/acml/GivchiP14 fatcat:rdqaj7ypwbe53fsl5cdm56c6de

On a convergent off-policy temporal difference learning algorithm in on-line learning environment [article]

Prasenjit Karmakar, Rajkumar Maity, Shalabh Bhatnagar
2016 arXiv   pre-print
In this paper we provide a rigorous convergence analysis of an "off"-policy temporal difference learning algorithm with linear function approximation and per time-step linear computational complexity in  ...  We support our theoretical results by providing suitable empirical results for standard off-policy counterexamples.  ...  To the best of our knowledge this is the first time an almost sure convergence proof of an off-policy temporal difference learning algorithm with linear function approximation is presented for step-sizes  ...
arXiv:1605.06076v1 fatcat:n2wx5rl5wbfnxaufcgxkjorg7i

Off-policy Learning with Options and Recognizers

Doina Precup, Richard S. Sutton, Cosmin Paduraru, Anna Koop, Satinder P. Singh
2005 Neural Information Processing Systems  
We introduce a new algorithm for off-policy temporal-difference learning with function approximation that has lower variance and requires less knowledge of the behavior policy than prior methods.  ...  This paper introduces recognizers and their potential advantages, then develops a full algorithm for linear function approximation and proves that its updates are in the same direction as on-policy TD  ...  We thank Csaba Szepesvári and the reviewers of the paper for constructive comments. This research was supported in part by iCore, NSERC, Alberta Ingenuity, and CFI.  ...
dblp:conf/nips/PrecupSPKS05 fatcat:g7swv35yrvhixjick5paxsau4u

Gradient Descent Temporal Difference-difference Learning [article]

Rong J.B. Zhu, James M. Murray
2022 arXiv   pre-print
Off-policy algorithms, in which a behavior policy differs from the target policy and is used to gain experience for learning, have proven to be of great practical value in reinforcement learning.  ...  However, even for simple convex problems such as linear value function approximation, these algorithms are not guaranteed to be stable.  ...  In this work, we introduce a new gradient descent algorithm for temporal difference learning with linear value function approximation.  ... 
arXiv:2209.04624v1 fatcat:spxwrqcgzna7jgm4kfycghcweu
Showing results 1 — 15 out of 11,791 results