DOI: 10.5555/3291168.3291210
Article

Ray: a distributed framework for emerging AI applications

Published: 08 October 2018

ABSTRACT

The next generation of AI applications will continuously interact with the environment and learn from these interactions. These applications impose new and demanding systems requirements in terms of both performance and flexibility. In this paper, we consider these requirements and present Ray, a distributed system that addresses them. Ray implements a unified interface that can express both task-parallel and actor-based computations, supported by a single dynamic execution engine. To meet the performance requirements, Ray employs a distributed scheduler and a distributed and fault-tolerant store to manage the system's control state. In our experiments, we demonstrate scaling beyond 1.8 million tasks per second and better performance than existing specialized systems for several challenging reinforcement learning applications.
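
To make the distinction between task-parallel and actor-based computations concrete, here is a minimal single-process sketch built only on Python's standard library. It is not Ray's API or implementation: the names `remote_task`, `Actor`, `Counter`, and `run_demo` are hypothetical, and giving each actor its own one-worker executor is a simplification that departs from the single execution engine described in the abstract.

```python
from concurrent.futures import ThreadPoolExecutor

# Shared pool for stateless tasks (an illustrative stand-in for an
# execution engine; this is NOT how Ray schedules work).
_task_pool = ThreadPoolExecutor(max_workers=4)


def remote_task(fn):
    """Wrap a stateless function so that calling it returns a future."""
    def submit(*args):
        return _task_pool.submit(fn, *args)
    return submit


class Actor:
    """Wrap a stateful object; its method calls run one at a time, in order."""

    def __init__(self, instance):
        self._instance = instance
        # A single worker serializes all calls made on this actor.
        self._mailbox = ThreadPoolExecutor(max_workers=1)

    def call(self, method, *args):
        return self._mailbox.submit(getattr(self._instance, method), *args)


class Counter:
    """Hypothetical stateful object used as the actor's state."""

    def __init__(self):
        self.total = 0

    def add(self, x):
        self.total += x
        return self.total


square = remote_task(lambda x: x * x)


def run_demo():
    # Task-parallel part: four independent, stateless computations.
    squares = [square(i) for i in range(4)]
    # Actor part: one stateful counter accumulates the task results in order.
    counter = Actor(Counter())
    totals = [counter.call("add", f.result()) for f in squares]
    return [t.result() for t in totals]
```

Calling `run_demo()` returns `[0, 1, 5, 14]`, the running sum of the squares 0, 1, 4, and 9. The tasks may complete in any order because they share no state, while the actor's calls are applied strictly in submission order because they mutate shared state.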

Published in

OSDI'18: Proceedings of the 13th USENIX conference on Operating Systems Design and Implementation
October 2018
815 pages
ISBN: 9781931971478

Publisher

USENIX Association
United States