1,179 Hits in 5.0 sec

On Nonconvex Optimization for Machine Learning: Gradients, Stochasticity, and Saddle Points [article]

Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M. Kakade, Michael I. Jordan
2019 arXiv   pre-print
Gradient descent (GD) and stochastic gradient descent (SGD) are the workhorses of large-scale machine learning.  ...  While classical theory focused on analyzing the performance of these methods in convex optimization problems, the most notable successes in machine learning have involved nonconvex optimization, and a  ...  Acknowledgements We thank Tongyang Li and Quanquan Gu for valuable discussions.  ... 
arXiv:1902.04811v2 fatcat:rmdh2zan2vhdxbbzi6if2krnwe
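
The GD/SGD updates referenced in the entry above are standard; below is a minimal NumPy sketch contrasting the two, assuming generic `grad` / `stochastic_grad` callables and an illustrative step size (none of this is taken from the paper itself).

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Full-batch gradient descent: x_{t+1} = x_t - lr * grad f(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

def stochastic_gradient_descent(stochastic_grad, x0, lr=0.1, steps=1000, rng=None):
    """SGD: the same update, but using an unbiased stochastic gradient estimate."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * stochastic_grad(x, rng)
    return x

# Example on f(x) = ||x||^2 / 2, whose gradient is x.
x_gd = gradient_descent(lambda x: x, x0=np.ones(3))
x_sgd = stochastic_gradient_descent(lambda x, rng: x + 0.1 * rng.standard_normal(x.shape),
                                    x0=np.ones(3))
```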

On Nonconvex Optimization for Machine Learning

Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M. Kakade, Michael I. Jordan
2021 Journal of the ACM  
Gradient descent (GD) and stochastic gradient descent (SGD) are the workhorses of large-scale machine learning.  ...  While classical theory focused on analyzing the performance of these methods in convex optimization problems, the most notable successes in machine learning have involved nonconvex optimization, and a  ...  Nonconvex optimization problems are intractable in general.  ... 
doi:10.1145/3418526 fatcat:tgzxmy5tmbaw7phpmfzwugt5ne

Adaptive Stochastic Gradient Langevin Dynamics: Taming Convergence and Saddle Point Escape Time [article]

Hejian Sang, Jia Liu
2018 arXiv   pre-print
...  Langevin dynamics (AGLD), for non-convex optimization problems.  ...  In this paper, we propose a new adaptive stochastic gradient Langevin dynamics (ASGLD) algorithmic framework and its two specialized versions, namely adaptive stochastic gradient (ASG) and adaptive gradient  ...  Related work: In this section, we provide an overview of saddle point escape for non-convex learning, which has attracted a lot of attention in the machine learning community recently.  ... 
arXiv:1805.09416v1 fatcat:5qbqlfpfnnfkth3ogmlwzahw4a
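
For orientation, stochastic gradient Langevin dynamics adds Gaussian noise to each stochastic gradient step; the adaptive schedules that define ASGLD/ASG/AGLD in the entry above are not reproduced here. A minimal sketch of the plain SGLD update, with illustrative step size and inverse temperature:

```python
import numpy as np

def sgld(stochastic_grad, x0, lr=1e-3, beta=1.0, steps=1000, seed=0):
    """Plain SGLD: x_{t+1} = x_t - lr * g_t + sqrt(2 * lr / beta) * N(0, I),
    where g_t is a stochastic gradient estimate. (The paper's adaptive
    variants replace the fixed step/noise scale with data-driven schedules.)"""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = stochastic_grad(x, rng)
        noise = np.sqrt(2.0 * lr / beta) * rng.standard_normal(x.shape)
        x = x - lr * g + noise
    return x
```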

Switch and Conquer: Efficient Algorithms By Switching Stochastic Gradient Oracles For Decentralized Saddle Point Problems [article]

Chhavi Sharma, Vishnu Narayanan, P. Balamurugan
2023 arXiv   pre-print
Numerical experiments on two benchmark machine learning applications show C-DPSSG's competitive performance, which validates our theoretical findings.  ...  To tackle this, we develop a simple and effective switching idea, where a generalized stochastic gradient (GSG) computation oracle is employed to hasten the iterates' progress to a saddle point solution  ...  Data Sets: We rely on four binary classification datasets, namely a4a, phishing, and ijcnn1 from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ and sido data from http://www.causality.inf.ethz.ch  ... 
arXiv:2309.00997v2 fatcat:m5crfnh3fzditcxon5uw4g3tta
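
The "switching" idea described in the snippet above, starting with one gradient oracle and switching to a second one partway through, can be sketched generically as below. The oracle names, switch point, and step size are illustrative assumptions, not the paper's C-DPSSG parameters.

```python
import numpy as np

def switch_and_run(oracle_a, oracle_b, x0, switch_at, total_steps, lr=0.01, seed=0):
    """Run updates with oracle_a for the first `switch_at` steps, then switch
    to oracle_b (e.g., a more accurate but costlier gradient estimator)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for t in range(total_steps):
        oracle = oracle_a if t < switch_at else oracle_b
        x = x - lr * oracle(x, rng)
    return x
```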

PA-GD: On the Convergence of Perturbed Alternating Gradient Descent to Second-Order Stationary Points for Structured Nonconvex Optimization

Songtao Lu, Mingyi Hong, Zhengdao Wang
2019 International Conference on Machine Learning  
In this paper, we consider a smooth unconstrained nonconvex optimization problem, and propose a perturbed A-GD (PA-GD) which is able to converge (with high probability) to the second-order stationary points  ...  Alternating gradient descent (A-GD) is a simple but popular algorithm in machine learning, which updates two blocks of variables in an alternating manner using gradient descent steps.  ...  Many recent works have analyzed the saddle points in machine learning problems (Kawaguchi, 2016).  ... 
dblp:conf/icml/LuHW19 fatcat:qdlx3v6hx5bwfnvgecdhkxh25e
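
The snippet above describes A-GD as alternating gradient steps over two blocks of variables; a minimal sketch of that update pattern follows. PA-GD additionally injects perturbations to escape saddle points, which is not reproduced here; all parameters below are illustrative.

```python
import numpy as np

def alternating_gd(grad_x, grad_y, x0, y0, lr=0.05, steps=500):
    """Alternating gradient descent on a smooth f(x, y):
    take a gradient step in x with y fixed, then in y with the new x fixed.
    (PA-GD adds occasional perturbations when gradients are small; not shown.)"""
    x, y = np.asarray(x0, dtype=float), np.asarray(y0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad_x(x, y)
        y = y - lr * grad_y(x, y)
    return x, y
```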

A Cubic Regularization Approach for Finding Local Minimax Points in Nonconvex Minimax Optimization [article]

Ziyi Chen, Zhengyang Hu, Qunwei Li, Zhe Wang, Yi Zhou
2023 arXiv   pre-print
Gradient descent-ascent (GDA) is a widely used algorithm for minimax optimization.  ...  However, GDA has been shown to converge only to stationary points in nonconvex minimax optimization, which are suboptimal compared with local minimax points.  ...  Although GDA can find stationary points in nonconvex minimax optimization, the stationary points may include candidate solutions that are far more sub-optimal than global minimax points (e.g. saddle points  ... 
arXiv:2110.07098v5 fatcat:upsz2vcyhjfzdg5kd5zj6suj7u
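
Gradient descent-ascent, the baseline discussed in the snippet above, descends in the min variable and ascends in the max variable. A minimal sketch with illustrative step sizes (the paper's cubic-regularization method itself is not reproduced):

```python
import numpy as np

def gda(grad_x, grad_y, x0, y0, lr_x=0.05, lr_y=0.05, steps=500):
    """Gradient descent-ascent for min_x max_y f(x, y):
    descend in x and ascend in y using the partial gradients."""
    x, y = np.asarray(x0, dtype=float), np.asarray(y0, dtype=float)
    for _ in range(steps):
        gx, gy = grad_x(x, y), grad_y(x, y)
        x = x - lr_x * gx   # descent step on x
        y = y + lr_y * gy   # ascent step on y
    return x, y

# Example: f(x, y) = 0.5*x**2 - 0.5*y**2 has its minimax point at the origin.
x_star, y_star = gda(lambda x, y: x, lambda x, y: -y,
                     x0=np.array([1.0]), y0=np.array([1.0]))
```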

Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning [article]

Frank E. Curtis, Katya Scheinberg
2017 arXiv   pre-print
The latter half of the tutorial focuses on optimization algorithms, first for convex logistic regression, for which we discuss the use of first-order methods, the stochastic gradient method, variance reducing  ...  The goal of this tutorial is to introduce key models, algorithms, and open questions related to the use of optimization methods for solving problems arising in machine learning.  ...  It remains an open question how one can design stochastic-gradient-type methods to optimize parameters of a DNN so as to find good local minimizers and avoid poor local minimizers and/or saddle points.  ... 
arXiv:1706.10207v1 fatcat:mezejqzn3bgozjhgpafyick3xy
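
As a concrete instance of the convex case discussed in the tutorial above, here is a minimal SGD sketch for l2-regularized logistic regression; the step size and regularization constant are illustrative assumptions.

```python
import numpy as np

def sgd_logistic(X, y, lr=0.1, lam=1e-3, epochs=10, seed=0):
    """SGD for l2-regularized logistic regression.
    X: (n, d) feature matrix, y: (n,) labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w)
            # gradient of log(1 + exp(-margin)) + (lam/2)*||w||^2 w.r.t. w
            g = -y[i] * X[i] / (1.0 + np.exp(margin)) + lam * w
            w = w - lr * g
    return w
```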

Online Partial Least Square Optimization: Dropping Convexity for Better Efficiency and Scalability

Zhehui Chen, Lin F. Yang, Chris Junchi Li, Tuo Zhao
2017 International Conference on Machine Learning  
Such a gap between theory and practice motivates us to study a nonconvex formulation for multiview representation learning, which can be efficiently solved by a simple stochastic gradient descent method  ...  Multiview representation learning is popular for latent factor analysis.  ...  We then provide a real-data experiment comparing the computational performance of our nonconvex stochastic gradient algorithm for solving (2.1) with the convex stochastic gradient algorithm for solving  ... 
dblp:conf/icml/ChenYLZ17 fatcat:mhrlr4fm3vcrtly67fj72a4edi
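
The nonconvex formulation referred to above is solved with simple streaming stochastic updates; the paper's exact algorithm is not reproduced here. What follows is a hedged sketch of a generic stochastic power-iteration-style update for the leading pair of directions of a cross-covariance E[x y^T], which is the flavor of problem described.

```python
import numpy as np

def streaming_cross_cov_directions(stream, dim_x, dim_y, lr=0.01, seed=0):
    """Track leading left/right directions (u, v) of E[x y^T] from a stream of
    paired samples (x, y), using normalized stochastic power-iteration steps.
    Generic sketch only, not the paper's exact update."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(dim_x); u /= np.linalg.norm(u)
    v = rng.standard_normal(dim_y); v /= np.linalg.norm(v)
    for x, y in stream:
        u = u + lr * x * (y @ v)   # stochastic step toward (E[x y^T]) v
        v = v + lr * y * (x @ u)   # stochastic step toward (E[x y^T])^T u
        u /= np.linalg.norm(u)
        v /= np.linalg.norm(v)
    return u, v
```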

Batch and online learning algorithms for nonconvex neyman-pearson classification

Gilles Gasso, Aristidis Pappaioannou, Marina Spivak, Léon Bottou
2011 ACM Transactions on Intelligent Systems and Technology  
We investigated a batch algorithm based on DC programming and a stochastic gradient method well suited for large-scale datasets. Empirical evidence illustrates the potential of the proposed methods.  ...  NP classification is a nonconvex problem involving a constraint on the false negative rate.  ...  The first algorithm leverages modern nonconvex optimization techniques [Tao and An 1998]. The second algorithm is a stochastic gradient algorithm suitable for very large datasets.  ... 
doi:10.1145/1961189.1961200 fatcat:yykk3w5gc5a37cjmvwrlxh3n7i
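
Neyman-Pearson classification, as described above, minimizes one error rate subject to a bound on the other. One common way to attack it with stochastic gradients, sketched below as a hedged substitute rather than the paper's DC-programming or online algorithm, is to move the false-negative constraint into a penalty on a convex surrogate loss; the penalty weight and target rate are illustrative.

```python
import numpy as np

def np_hinge_sgd(X, y, rho=0.1, penalty=10.0, lr=0.05, epochs=20, seed=0):
    """Hedged sketch (not the paper's algorithm): train a linear classifier by
    minimizing a hinge surrogate of the false-positive rate plus a penalty that
    activates when the surrogate false-negative rate exceeds rho.
    Labels y are in {-1, +1}, with y = +1 the positive class."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    n_pos, n_neg = (y == 1).sum(), (y == -1).sum()
    for _ in range(epochs):
        # check the surrogate false-negative rate once per epoch (for simplicity)
        fn_surrogate = np.mean(np.maximum(0.0, 1.0 - X[y == 1] @ w))
        lam = penalty if fn_surrogate > rho else 0.0
        for i in rng.permutation(n):
            if y[i] * (X[i] @ w) < 1.0:  # hinge term is active
                weight = (1.0 / n_neg) if y[i] == -1 else (lam / n_pos)
                w = w + lr * weight * y[i] * X[i]
    return w
```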

Recent Advances in Stochastic Gradient Descent in Deep Learning

Yingjie Tian, Yuqi Zhang, Haibin Zhang
2023 Mathematics  
Among machine learning models, stochastic gradient descent (SGD) is not only simple but also very effective.  ...  Following that, this study introduces several versions of SGD and its variants, which are already available as PyTorch optimizers, including SGD, Adagrad, Adadelta, RMSprop, Adam, AdamW, and so on.  ...  Stochastic gradient descent (SGD) [33] is simple and successful among machine learning models.  ... 
doi:10.3390/math11030682 fatcat:6sqjnyl3xnfnpeyco7uvgyq22a
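
The optimizers named in the snippet above are all available in `torch.optim`; a minimal sketch of constructing each and taking one step follows. The linear model and random data are placeholders, not anything from the survey.

```python
import torch

model = torch.nn.Linear(10, 1)          # placeholder model
x, target = torch.randn(32, 10), torch.randn(32, 1)

optimizers = {
    "SGD":      torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9),
    "Adagrad":  torch.optim.Adagrad(model.parameters(), lr=0.01),
    "Adadelta": torch.optim.Adadelta(model.parameters()),
    "RMSprop":  torch.optim.RMSprop(model.parameters(), lr=0.01),
    "Adam":     torch.optim.Adam(model.parameters(), lr=1e-3),
    "AdamW":    torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01),
}

for name, opt in optimizers.items():
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), target)
    loss.backward()
    opt.step()                           # one update with each optimizer
```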

Distributed Stochastic Nonconvex Optimization and Learning based on Successive Convex Approximation [article]

Paolo Di Lorenzo, Simone Scardapane
2020 arXiv   pre-print
We study distributed stochastic nonconvex optimization in multi-agent networks.  ...  The proposed method hinges on successive convex approximation (SCA) techniques, leveraging dynamic consensus as a mechanism to track the average gradient among the agents, and recursive averaging to recover  ...  control and coordination, and distributed machine learning, just to name a few.  ... 
arXiv:2004.14882v1 fatcat:oat7muwqzvfovjtx5atmmvna7m
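
The "dynamic consensus to track the average gradient" mentioned above is commonly realized as a gradient-tracking recursion; below is a minimal sketch over a fixed mixing matrix W (assumed doubly stochastic), which omits the successive-convex-approximation surrogate step that the paper actually uses.

```python
import numpy as np

def gradient_tracking(grads, x0, W, lr=0.05, steps=200):
    """Distributed gradient tracking over n agents.
    grads: list of n callables, grads[i](x) = gradient of agent i's local loss.
    x0: (n, d) initial local iterates. W: (n, n) doubly stochastic mixing matrix.
    Each agent mixes with neighbors and tracks the network-average gradient y_i."""
    x = np.array(x0, dtype=float)
    g = np.stack([grads[i](x[i]) for i in range(len(grads))])
    y = g.copy()                      # tracker initialized at local gradients
    for _ in range(steps):
        x_new = W @ x - lr * y        # consensus step plus move along tracked gradient
        g_new = np.stack([grads[i](x_new[i]) for i in range(len(grads))])
        y = W @ y + g_new - g         # dynamic consensus: track the average gradient
        x, g = x_new, g_new
    return x
```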

Dropping Convexity for More Efficient and Scalable Online Multiview Learning [article]

Zhehui Chen, Lin F. Yang, Chris J. Li, Tuo Zhao
2019 arXiv   pre-print
Such a gap between theory and practice motivates us to study a nonconvex formulation for multiview representation learning, which can be efficiently solved by a simple stochastic gradient descent (SGD)  ...  It naturally arises in many data analysis, machine learning, and information retrieval applications to model dependent structures among multiple data sources.  ...  Journal of Machine Learning Research 6 1817-1853. Arora, R., Cotter, A., Livescu, K. and Srebro, N. (2012). Stochastic optimization for PCA and PLS.  ... 
arXiv:1702.08134v10 fatcat:nd4hsrzuvnbcpmkuhk3qsjuhiq

A Primer on Zeroth-Order Optimization in Signal Processing and Machine Learning [article]

Sijia Liu, Pin-Yu Chen, Bhavya Kailkhura, Gaoyuan Zhang, Alfred Hero, Pramod K. Varshney
2020 arXiv   pre-print
Zeroth-order (ZO) optimization is a subset of gradient-free optimization that emerges in many signal processing and machine learning applications.  ...  It is used for solving optimization problems similarly to gradient-based methods. However, it does not require the gradient, using only function evaluations.  ...  The works [41] and [63] focused on stochastic optimization and deterministic optimization, respectively. 4) Constrained nonconvex optimization: The criterion for convergence is commonly determined  ... 
arXiv:2006.06224v2 fatcat:fx624eqhifbqpp5hbd5a5cmsny
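
The gradient-free estimation described above is typically built from finite differences of function values along random directions. A minimal sketch of a two-point ZO gradient estimator, with illustrative smoothing parameter and direction count:

```python
import numpy as np

def zo_gradient(f, x, mu=1e-3, num_dirs=20, rng=None):
    """Two-point zeroth-order gradient estimate of f at x:
    average of d * (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u over random unit
    directions u. Uses only function evaluations, never an analytic gradient."""
    rng = rng or np.random.default_rng(0)
    d = x.size
    g = np.zeros(d)
    for _ in range(num_dirs):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        g += d * (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
    return g / num_dirs

# ZO gradient descent on f(x) = ||x||^2, no analytic gradient needed.
x = np.ones(5)
for _ in range(200):
    x = x - 0.05 * zo_gradient(lambda z: z @ z, x)
```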

Scaling-up Distributed Processing of Data Streams for Machine Learning [article]

Matthew Nokleby, Haroon Raja, Waheed U. Bajwa
2020 arXiv   pre-print
Emerging applications of machine learning in numerous areas involve continuous gathering of and learning from streams of data.  ...  In particular, it focuses on methods that solve: (i) distributed stochastic convex problems, and (ii) distributed principal component analysis, which is a nonconvex problem with geometric structure that  ...  Nonconvex Problems. Nonconvex functions can have three types of critical points, defined as points w for which ∇f(w) = 0: saddle points, local minima, and global minima.  ... 
arXiv:2005.08854v2 fatcat:y6fvajvq2naajeqs6lo3trrgwy
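
As a concrete illustration of the critical-point types listed above, the Hessian at a critical point distinguishes them: all-positive eigenvalues give a local (possibly global) minimum, while mixed-sign eigenvalues give a saddle. A small sketch, using f(w) = w1^2 - w2^2, whose only critical point (the origin) is a saddle:

```python
import numpy as np

def classify_critical_point(hessian_eigs, tol=1e-8):
    """Classify a critical point (grad f(w) = 0) from its Hessian eigenvalues."""
    if np.all(hessian_eigs > tol):
        return "local (possibly global) minimum"
    if np.all(hessian_eigs < -tol):
        return "local maximum"
    if np.any(hessian_eigs > tol) and np.any(hessian_eigs < -tol):
        return "saddle point"
    return "degenerate (second-order test inconclusive)"

# f(w) = w1^2 - w2^2 has gradient (2*w1, -2*w2), zero only at the origin,
# where the Hessian is diag(2, -2): a saddle point.
print(classify_critical_point(np.array([2.0, -2.0])))  # -> "saddle point"
```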

Learning Rate Dropout [article]

Huangxing Lin, Weihong Zeng, Xinghao Ding, Yue Huang, Chenxi Huang and John Paisley
2019 arXiv   pre-print
The performance of a deep neural network is highly dependent on its training, and finding better local optimal solutions is the goal of many optimization algorithms.  ...  In this work, we propose Learning Rate Dropout (LRD), a simple gradient descent technique for training, related to coordinate descent.  ...  By adding stochasticity to the loss descent path, this technique helps the model to traverse quickly through the "transient" plateau (e.g. saddle points or local minima) and gives the model more chances to  ... 
arXiv:1912.00144v2 fatcat:drapxgb5uzdsfnlbwernlzloce
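
A hedged reading of the idea described above: at each step, a random binary mask decides which coordinates actually receive their update, so the descent path becomes stochastic. The keep probability, base optimizer, and step size below are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def sgd_with_lr_dropout(grad, x0, lr=0.1, keep_prob=0.5, steps=500, seed=0):
    """Gradient descent where, at every step, each coordinate's learning rate is
    independently dropped (set to 0) with probability 1 - keep_prob.
    Hedged sketch of the learning-rate-dropout idea, not the paper's exact method."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        mask = rng.random(x.shape) < keep_prob   # 1 = keep this coordinate's update
        x = x - lr * mask * grad(x)
    return x
```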
Showing results 1 — 15 out of 1,179 results