DOI: 10.5555/3454287.3455603
research-article · Free Access

Qsparse-local-SGD: distributed SGD with quantization, sparsification, and local computations

Published: 08 December 2019

ABSTRACT

The communication bottleneck has been identified as a significant issue in the distributed optimization of large-scale learning models. Recently, several approaches to mitigate this problem have been proposed, including various forms of gradient compression and computing local models that are mixed iteratively. In this paper we propose the Qsparse-local-SGD algorithm, which combines aggressive sparsification with quantization and local computation, along with error compensation that keeps track of the difference between the true and compressed gradients. We propose both synchronous and asynchronous implementations of Qsparse-local-SGD. We analyze the convergence of Qsparse-local-SGD in the distributed setting, for smooth non-convex and convex objective functions. We demonstrate that Qsparse-local-SGD converges at the same rate as vanilla distributed SGD for many important classes of sparsifiers and quantizers. We use Qsparse-local-SGD to train ResNet-50 on ImageNet and show that it yields significant savings over the state-of-the-art in the number of bits transmitted to reach a target accuracy.
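
The abstract describes the core mechanism in enough detail to sketch it: each worker sparsifies and quantizes its update while keeping an error-compensation memory equal to the difference between the true and compressed update, which is added back at the next round. The Python sketch below is only an illustration of that idea, assuming a Top-k sparsifier, a scaled-sign quantizer, and hypothetical helper names (top_k, sign_quantize, compress_with_error_feedback); it is not the authors' reference implementation.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of v; zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def sign_quantize(v):
    """Scaled sign quantizer: transmit one scale (mean magnitude) plus signs."""
    return np.mean(np.abs(v)) * np.sign(v)

def compress_with_error_feedback(update, memory, k):
    """Sparsify and quantize `update`, carrying the dropped residual in
    `memory` so it is re-added (error compensation) at the next round."""
    corrected = update + memory                      # add back past residual
    compressed = sign_quantize(top_k(corrected, k))  # sparsify, then quantize
    new_memory = corrected - compressed              # true minus compressed
    return compressed, new_memory
```

In the full algorithm, each worker would take several local SGD steps between synchronization rounds and transmit only the compressed, error-compensated update, which the server averages; the sketch above covers only the per-worker compression step described in the abstract.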

Published in

NIPS'19: Proceedings of the 33rd International Conference on Neural Information Processing Systems
December 2019, 15947 pages
Copyright © 2019 Neural Information Processing Systems Foundation, Inc.

Publisher

Curran Associates Inc., Red Hook, NY, United States

Publication History

• Published: 8 December 2019

Qualifiers

• research-article
• Research
• Refereed limited
