ABSTRACT
The communication bottleneck has been identified as a significant issue in distributed optimization of large-scale learning models. Recently, several approaches to mitigate this problem have been proposed, including different forms of gradient compression and computing local models that are mixed iteratively. In this paper we propose the Qsparse-local-SGD algorithm, which combines aggressive sparsification with quantization and local computation, along with error compensation, by keeping track of the difference between the true and compressed gradients. We propose both synchronous and asynchronous implementations of Qsparse-local-SGD. We analyze convergence of Qsparse-local-SGD in the distributed setting, for smooth non-convex and convex objective functions. We demonstrate that Qsparse-local-SGD converges at the same rate as vanilla distributed SGD for many important classes of sparsifiers and quantizers. We use Qsparse-local-SGD to train ResNet-50 on ImageNet, and show that it yields significant savings over the state of the art in the number of bits transmitted to reach a target accuracy.
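To make the abstract's description concrete, the following is a minimal single-worker sketch of the error-compensation mechanism, assuming a Top_k sparsifier composed with a QSGD-style stochastic quantizer (one instance of the compressor classes the paper covers). The function names (`top_k`, `stochastic_quantize`, `qsparse_local_sgd_worker`) and the toy problem are illustrative assumptions, not the paper's code; in the distributed setting the server would average the transmitted deltas across workers.

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k(v, k):
    """Top_k sparsifier: keep the k largest-magnitude coordinates, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def stochastic_quantize(v, s=4):
    """QSGD-style unbiased stochastic quantizer with s levels."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return v.copy()
    scaled = np.abs(v) * s / norm
    lower = np.floor(scaled)
    q = lower + (rng.random(v.shape) < (scaled - lower))  # round up w.p. (scaled - lower)
    return np.sign(v) * q * (norm / s)

def compress(v, k):
    """Composed operator Q(Top_k(v)), a sketch of one allowed compressor."""
    return stochastic_quantize(top_k(v, k))

def qsparse_local_sgd_worker(x0, stoch_grad, lr, H, T, k):
    """Single-worker view of Qsparse-local-SGD (illustrative names).

    Takes H local SGD steps between synchronizations; at each sync only the
    compressed, error-compensated net update is transmitted, and the
    compression residual is carried forward in local memory `m`.
    """
    x_hat = x0.copy()          # last synchronized (global) model
    x = x0.copy()              # local model
    m = np.zeros_like(x0)      # error memory: true minus compressed updates
    for t in range(1, T + 1):
        x = x - lr * stoch_grad(x)
        if t % H == 0:
            update = x - x_hat               # net local progress since last sync
            delta = compress(m + update, k)  # what actually gets transmitted
            m = m + update - delta           # keep the residual for compensation
            x_hat = x_hat + delta            # server would average workers' deltas
            x = x_hat.copy()                 # restart local steps from sync point
    return x_hat

# Toy usage: minimize ||x||^2 from a noisy gradient oracle.
d = 100
noisy_grad = lambda x: 2.0 * x + 0.1 * rng.standard_normal(d)
x_final = qsparse_local_sgd_worker(np.full(d, 5.0), noisy_grad, lr=0.05, H=4, T=400, k=10)
print(np.linalg.norm(x_final))  # should be close to zero
```

The key design point the sketch illustrates is that the compression error is never discarded: whatever Q(Top_k(·)) drops at one synchronization round is added back into the next round's update, which is what lets the method match the convergence rate of uncompressed distributed SGD.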