ABSTRACT
The communication bottleneck has been identified as a significant issue in distributed optimization of large-scale learning models. Recently, several approaches to mitigate this problem have been proposed, including different forms of gradient compression and computing local models that are mixed iteratively. In this paper we propose the Qsparse-local-SGD algorithm, which combines aggressive sparsification with quantization and local computation, along with error compensation, by keeping track of the difference between the true and compressed gradients. We propose both synchronous and asynchronous implementations of Qsparse-local-SGD. We analyze convergence of Qsparse-local-SGD in the distributed setting, for smooth non-convex and convex objective functions. We demonstrate that Qsparse-local-SGD converges at the same rate as vanilla distributed SGD for many important classes of sparsifiers and quantizers. We use Qsparse-local-SGD to train ResNet-50 on ImageNet, and show that it yields significant savings over the state of the art in the number of bits transmitted to reach a target accuracy.
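To make the abstract's description concrete, the following is a minimal single-worker sketch of the error-compensation mechanism, assuming a Top_k sparsifier composed with a QSGD-style stochastic quantizer (one instance of the compressor classes the paper covers). The function names (`top_k`, `stochastic_quantize`, `qsparse_local_sgd_worker`) and the toy problem are illustrative assumptions, not the paper's code; in the distributed setting the server would average the transmitted deltas across workers.

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k(v, k):
    """Top_k sparsifier: keep the k largest-magnitude coordinates, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def stochastic_quantize(v, s=4):
    """QSGD-style unbiased stochastic quantizer with s levels."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return v.copy()
    scaled = np.abs(v) * s / norm
    lower = np.floor(scaled)
    q = lower + (rng.random(v.shape) < (scaled - lower))  # round up w.p. (scaled - lower)
    return np.sign(v) * q * (norm / s)

def compress(v, k):
    """Composed operator Q(Top_k(v)), a sketch of one allowed compressor."""
    return stochastic_quantize(top_k(v, k))

def qsparse_local_sgd_worker(x0, stoch_grad, lr, H, T, k):
    """Single-worker view of Qsparse-local-SGD (illustrative names).

    Takes H local SGD steps between synchronizations; at each sync only the
    compressed, error-compensated net update is transmitted, and the
    compression residual is carried forward in local memory `m`.
    """
    x_hat = x0.copy()          # last synchronized (global) model
    x = x0.copy()              # local model
    m = np.zeros_like(x0)      # error memory: true minus compressed updates
    for t in range(1, T + 1):
        x = x - lr * stoch_grad(x)
        if t % H == 0:
            update = x - x_hat               # net local progress since last sync
            delta = compress(m + update, k)  # what actually gets transmitted
            m = m + update - delta           # keep the residual for compensation
            x_hat = x_hat + delta            # server would average workers' deltas
            x = x_hat.copy()                 # restart local steps from sync point
    return x_hat

# Toy usage: minimize ||x||^2 from a noisy gradient oracle.
d = 100
noisy_grad = lambda x: 2.0 * x + 0.1 * rng.standard_normal(d)
x_final = qsparse_local_sgd_worker(np.full(d, 5.0), noisy_grad, lr=0.05, H=4, T=400, k=10)
print(np.linalg.norm(x_final))  # should be close to zero
```

The key design point the sketch illustrates is that the compression error is never discarded: whatever Q(Top_k(·)) drops at one synchronization round is added back into the next round's update, which is what lets the method match the convergence rate of uncompressed distributed SGD.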