A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2022; you can also visit the original URL.
The file type is application/pdf.
Gradient Descent Only Converges to Minimizers
2016
Annual Conference on Computational Learning Theory (COLT)
We show that gradient descent converges to a local minimizer, almost surely with random initialization. This is proved by applying the Stable Manifold Theorem from dynamical systems theory. ...
Acknowledgements The authors would like to thank Chi Jin, Tengyu Ma, Robert Nishihara, Mahdi Soltanolkotabi, Yuekai Sun, Jonathan Taylor, and Yuchen Zhang for their insightful feedback. ...
However, a short-step gradient method will only converge to minimizers. ...
dblp:conf/colt/LeeSJR16
fatcat:3yskjaj27vfevgdqtaoo23iunu
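A minimal sketch of the claim in this abstract: on a toy non-convex function with a saddle at the origin and minimizers at (±1, 0), short-step gradient descent from a random initialization lands at a minimizer, not the saddle. The function and all names here are illustrative choices, not from the paper.

```python
import numpy as np

def grad(p):
    # f(x, y) = x**4/4 - x**2/2 + y**2/2 has a saddle at (0, 0)
    # and minimizers at (+1, 0) and (-1, 0).
    x, y = p
    return np.array([x**3 - x, y])

rng = np.random.default_rng(0)
p = rng.normal(size=2)          # random initialization
for _ in range(5000):
    p = p - 0.1 * grad(p)       # short-step gradient descent

# p ends near one of the minimizers (+/-1, 0), not the saddle
```

The set of initializations attracted to the saddle is the stable manifold of (0, 0), which has measure zero, so a random start almost surely avoids it.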
Gradient Descent Only Converges to Minimizers: Non-Isolated Critical Points and Invariant Regions
[article]
2016
arXiv
pre-print
Given a non-convex twice differentiable cost function f, we prove that the set of initial conditions so that gradient descent converges to saddle points where ∇²f has at least one strictly negative ...
Moreover, this result extends to forward-invariant convex subspaces, allowing for weak (non-globally Lipschitz) smoothness assumptions. Finally, we produce an upper bound on the allowable step-size. ...
We are also thankful to Nisheeth Vishnoi, on whose blog the article appeared, for pointing it out to us. ...
arXiv:1605.00405v2
fatcat:gcq5zhwd4vc6vmocwn5ntb7lku
Gradient Descent Only Converges to Minimizers: Non-Isolated Critical Points and Invariant Regions
2017
Innovations in Theoretical Computer Science
Given a twice continuously differentiable cost function f, we prove that the set of initial conditions so that gradient descent converges to saddle points where ∇²f has at least one strictly negative ...
Moreover, this result extends to forward-invariant convex subspaces, allowing for weak (non-globally Lipschitz) smoothness assumptions. Finally, we produce an upper bound on the allowable step-size. ...
We are also thankful to Nisheeth Vishnoi, on whose blog the article appeared, for pointing it out to us. ...
doi:10.4230/lipics.itcs.2017.2
dblp:conf/innovations/PanageasP17
fatcat:7xw7ljqcpjdjfbzn7llzrfgyru
Minimizing Average of Loss Functions Using Gradient Descent and Stochastic Gradient Descent
2016
Dhaka University Journal of Science
This paper deals with minimizing average of loss functions using Gradient Descent (GD) and Stochastic Gradient Descent (SGD). ...
We present these two algorithms for minimizing average of a large number of smooth convex functions. ...
Gradient descent (also known as steepest descent) is an optimization technique for minimizing an unconstrained multidimensional smooth convex function, which starts with some initial parameters ...
doi:10.3329/dujs.v64i2.54490
fatcat:ary2gaindbgsbd2i2gywndjce4
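A minimal SGD sketch in the spirit of this entry: minimizing the average of simple convex losses (1/n) Σ (w − aᵢ)²/2, whose minimizer is the data mean, by sampling one loss per step with a decaying step size. The data and step schedule are illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=100)        # data; the average loss is minimized at a.mean()

w = 0.0
for t in range(1, 20001):
    i = rng.integers(len(a))    # sample one loss uniformly at random
    g = w - a[i]                # gradient of the sampled loss (w - a[i])**2 / 2
    w -= g / t                  # decaying step size 1/t
```

With the 1/t step on this strongly convex problem, the iterate is exactly the running average of the sampled points, so it concentrates around `a.mean()`.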
Frank-Wolfe optimization for deep networks
[article]
2020
arXiv
pre-print
Although the optimization does converge, it does so slowly and not close to the speed of gradient descent. ...
In this paper, another optimization method, Frank-Wolfe optimization, is applied to a small deep network and compared to gradient descent. ...
They both converge to approximately 95% test accuracy, as do line search and gradient descent. ...
arXiv:2006.03960v1
fatcat:5sebass7jzgxjjik2mbn4zjcem
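A minimal sketch of the Frank-Wolfe method this entry compares against gradient descent, on a small constrained problem rather than a deep network: each step calls a linear minimization oracle over the feasible set instead of projecting. The objective and target here are illustrative assumptions.

```python
import numpy as np

def frank_wolfe(c, iters=2000):
    """Minimize ||x - c||^2 over the probability simplex.

    Each step minimizes a linear model of f over the simplex (the
    minimizer is a vertex) and moves toward that vertex, so iterates
    stay feasible without any projection step.
    """
    n = len(c)
    x = np.full(n, 1.0 / n)                 # start at the simplex center
    for k in range(iters):
        g = 2.0 * (x - c)                   # gradient of ||x - c||^2
        s = np.zeros(n)
        s[np.argmin(g)] = 1.0               # linear minimization oracle: best vertex
        x += 2.0 / (k + 2) * (s - x)        # classic step size 2/(k+2)
    return x

c = np.array([0.2, 0.3, 0.5])               # lies in the simplex, so x* = c
x = frank_wolfe(c)
```

The O(1/k) rate of the 2/(k+2) schedule is consistent with the abstract's observation that convergence is slower than gradient descent.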
Gradient Descent Converges to Minimizers
[article]
2016
arXiv
pre-print
We show that gradient descent converges to a local minimizer, almost surely with random initialization. This is proved by applying the Stable Manifold Theorem from dynamical systems theory. ...
Acknowledgements The authors would like to thank Chi Jin, Tengyu Ma, Robert Nishihara, Mahdi Soltanolkotabi, Yuekai Sun, Jonathan Taylor, and Yuchen Zhang for their insightful feedback. ...
However, a short-step gradient method will only converge to minimizers. Remark 4.2. ...
arXiv:1602.04915v2
fatcat:5tndnl2rffanzlylalwhv6v2uy
Minimizing Quantum Renyi Divergences via Mirror Descent with Polyak Step Size
[article]
2022
arXiv
pre-print
Numerical experiment results show that entropic mirror descent with the Polyak step size converges fast in minimizing quantum Renyi divergences. ...
To compute these quantities requires minimizing some order-α quantum Renyi divergences over the set of quantum states. ...
Previously, the Polyak step size was only considered for gradient descent-type methods. • In practice, numerical results show entropic mirror descent with the Polyak step size converges fast for minimizing ...
arXiv:2109.06054v2
fatcat:pjdtpjh665ghrmhj6pflgdvdle
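A minimal sketch of entropic mirror descent with a Polyak step size, as described in this abstract, applied to a toy linear objective over the probability simplex rather than a quantum Renyi divergence. The cost vector and the Euclidean form of the step size are illustrative assumptions.

```python
import numpy as np

c = np.array([1.0, 0.5, 2.0])           # minimize <c, p> over the probability simplex
f_star = c.min()                        # the Polyak step uses the optimal value

p = np.full(3, 1.0 / 3)
for _ in range(5000):
    g = c                               # gradient of the linear objective
    eta = (c @ p - f_star) / (g @ g)    # Polyak step size: (f(p) - f*) / ||g||^2
    p = p * np.exp(-eta * g)            # entropic mirror descent: multiplicative update
    p /= p.sum()                        # renormalize onto the simplex
```

The exponentiated update keeps iterates strictly inside the simplex, and the Polyak rule removes the need to tune a step-size schedule when f* is known.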
Comparison Between Steepest Descent Method and Conjugate Gradient Method by Using Matlab
2021
Journal of Studies in Science and Engineering
The Steepest descent method and the Conjugate gradient method to minimize nonlinear functions have been studied in this work. ...
On the other hand, the Steepest descent method minimizes a function in less time than the Conjugate gradient method. ...
We obtain that the Steepest descent method requires less time than the Conjugate gradient method to minimize the function. ...
doi:10.53898/josse2021113
fatcat:tko2ztzb4zbd5iwj5pabxnkgsy
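A minimal comparison of the two methods in this entry on a small quadratic f(x) = xᵀAx/2 − bᵀx (equivalently, solving Ax = b): steepest descent with exact line search versus the conjugate gradient method, which is exact in at most n steps on an n-dimensional quadratic. The test matrix is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(5, 5))
A = M @ M.T + 5 * np.eye(5)             # symmetric positive definite
b = rng.normal(size=5)

def steepest_descent(A, b, iters):
    x = np.zeros_like(b)
    for _ in range(iters):
        r = b - A @ x                   # residual = negative gradient
        alpha = (r @ r) / (r @ A @ r)   # exact line search along r
        x += alpha * r
    return x

def conjugate_gradient(A, b, iters):
    x = np.zeros_like(b)
    r = b - A @ x
    d = r.copy()
    for _ in range(iters):
        alpha = (r @ r) / (d @ A @ d)
        x += alpha * d
        r_new = r - alpha * (A @ d)
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d            # next A-conjugate search direction
        r = r_new
    return x

x_cg = conjugate_gradient(A, b, 5)      # exact in at most n = 5 steps
x_sd = steepest_descent(A, b, 5)
```

After the same number of iterations, the conjugate gradient residual is essentially zero while steepest descent still carries an error that depends on the condition number of A.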
Step Size Matters in Deep Learning
[article]
2018
arXiv
pre-print
Training a neural network with the gradient descent algorithm gives rise to a discrete-time nonlinear dynamical system. ...
if the algorithm converges to an orbit. ...
Therefore, given a fixed step size δ, the gradient descent can converge to only a subset of the local optima, and there are always some solutions that the gradient descent cannot converge to independent ...
arXiv:1805.08890v2
fatcat:xypywxiypzcz3g3p4hsqhstbji
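The dependence of reachable solutions on the step size can be seen already in one dimension: for f(x) = x²/2 (curvature L = 1), the gradient descent map is x ↦ (1 − δ)x, which contracts iff 0 < δ < 2/L. A minimal sketch, with illustrative step sizes:

```python
def run_gd(delta, steps=100, x0=1.0):
    # Gradient descent on f(x) = x**2 / 2, whose gradient is f'(x) = x.
    x = x0
    for _ in range(steps):
        x -= delta * x
    return x

x_small = run_gd(1.5)   # |1 - 1.5| = 0.5 < 1  -> converges to 0
x_large = run_gd(2.5)   # |1 - 2.5| = 1.5 > 1  -> diverges
```

A fixed δ therefore rules out any local optimum whose curvature exceeds 2/δ, which is the mechanism behind the abstract's claim.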
A Nonconvex Optimization Framework for Low Rank Matrix Estimation
2015
Advances in Neural Information Processing Systems
In particular, we prove that a broad class of nonconvex optimization algorithms, including alternating minimization and gradient-type methods, geometrically converge to the global optimum and exactly recover ...
Both the alternating exact minimization and alternating gradient descent algorithms attain a linear rate of convergence for d = 600 and d = 900. ...
gradient descent, alternating exact minimization (i.e., alternating least squares or coordinate descent), as well as alternating gradient descent (i.e., coordinate gradient descent), which are shown in ...
pmid:28316458
pmcid:PMC5354472
fatcat:smlyigamd5dvrpb3ey2bu4lanq
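A minimal sketch of the nonconvex low-rank estimation setting in this entry: gradient descent on the factored objective f(x) = ||xxᵀ − M||²_F for a rank-1 target M, from a random initialization. The target vector, step size, and iteration count are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(3)
z = np.array([1.0, 0.5, 0.25])
M = np.outer(z, z)                       # rank-1 target matrix

x = rng.normal(size=3) * 0.5             # random initialization
for _ in range(3000):
    g = 4.0 * (np.outer(x, x) - M) @ x   # gradient of ||x x' - M||_F^2
    x -= 0.02 * g

loss = np.linalg.norm(np.outer(x, x) - M) ** 2
```

Despite nonconvexity (both ±z are global minimizers and 0 is a saddle), gradient descent recovers the target factorization up to sign, illustrating the geometric convergence the abstract describes.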
Distributed Low-rank Matrix Factorization With Exact Consensus
2019
Neural Information Processing Systems
In spite of its nonconvexity, this problem has a well-behaved geometric landscape, permitting local search algorithms such as gradient descent to converge to global minimizers. ...
We identify conditions under which this new problem also has a well-behaved geometric landscape, and we propose an extension of distributed gradient descent (DGD) to solve this problem. ...
Thus, the question arises of whether local search algorithms such as gradient descent actually converge to a global minimizer of (13) . ...
dblp:conf/nips/ZhuLYTW19
fatcat:h5xeheanjvbf3pn7cf6d42mehe
Comparison of the conjugate gradient methods of Liu-Storey and Dai-Yuan
2017
Contemporary Engineering Sciences
The purpose of this paper is to present the capabilities of the conjugate gradient methods based on the theoretical analysis of the gradient method, the precursor of the descent methods. ...
Different test systems are solved in order to determine the speed of convergence of the conjugate direction methods proposed by Liu-Storey and Dai-Yuan [1]. ...
We would like to thank the referee for his valuable suggestions that improved the presentation of this paper and our gratitude to the Department of Mathematics of the Universidad Tecnológica de Pereira ...
doi:10.12988/ces.2017.711189
fatcat:77klin6duzbsjlpyyp6o7gc54a
Non-approximability of constructive global ℒ² minimizers by gradient descent in Deep Learning
[article]
2023
arXiv
pre-print
approximated via the gradient descent flow. ...
We analyze geometric aspects of the gradient descent algorithm in Deep Learning (DL) networks. ...
Therefore, while C[x[Z(s)]] always converges to a stationary value of the cost function under the gradient descent flow, Z(s) cannot generally be assumed to converge to a minimizer Z * . ...
arXiv:2311.07065v1
fatcat:z6ib4exbsbhtxe6bgbel5temla
Accelerating Extreme Search of Multidimensional Functions Based on Natural Gradient Descent with Dirichlet Distributions
2022
Mathematics
We provide experiments on test functions in four- and three-dimensional spaces, where natural gradient descent proves its ability to converge to a neighborhood of the global minimum. ...
The proposed algorithm is equipped with step-size adaptation, which allows it to obtain higher accuracy, taking a small number of iterations in the process of minimization, compared with the usual gradient ...
Acknowledgments: The authors would like to thank the North Caucasus Federal University for their support in the project competitions of scientific groups and individual scientists of the North Caucasus ...
doi:10.3390/math10193556
fatcat:s43iogi2jfbqzlwl6yy3evkryq
Dictionary Learning with Large Step Gradient Descent for Sparse Representations
[chapter]
2012
Lecture Notes in Computer Science
Olshausen and Field's Sparsenet algorithm relies on a fixed step projected gradient descent. With the right step, it can avoid local minima and converge towards the global minimum. ...
The problem then becomes to find the right step size. In this work we provide the expression of the optimal step for the gradient descent but the step we use is twice as large as the optimal step. ...
MOD finds that minimum in only one iteration, but if each Sparsenet dictionary update was allowed to iterate on its gradient descent with a well chosen step, it would converge towards the result of the ...
doi:10.1007/978-3-642-28551-6_29
fatcat:spenb3d5pvastb6jmupza2j3wi
Showing results 1 — 15 out of 121,188 results