Article

Ray: a distributed framework for emerging AI applications

Authors:
Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, Ion Stoica
University of California, Berkeley

OSDI'18: Proceedings of the 13th USENIX conference on Operating Systems Design and Implementation, October 2018, Pages 561–577

Published: 08 October 2018

ABSTRACT

The next generation of AI applications will continuously interact with the environment and learn from these interactions. These applications impose new and demanding systems requirements, both in terms of performance and flexibility. In this paper, we consider these requirements and present Ray, a distributed system that addresses them. Ray implements a unified interface that can express both task-parallel and actor-based computations, supported by a single dynamic execution engine. To meet the performance requirements, Ray employs a distributed scheduler and a distributed, fault-tolerant store to manage the system's control state. In our experiments, we demonstrate scaling beyond 1.8 million tasks per second and better performance than existing specialized systems for several challenging reinforcement learning applications.
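
As a rough illustration of what a single interface for both task-parallel and actor-based computation can look like, here is a minimal in-process Python sketch. It is not Ray's API or implementation: it uses only the standard library's thread pools, involves no distribution, scheduling, or fault tolerance, and the names submit_task, ActorHandle, square, Counter, and run_demo are hypothetical, chosen for this example only.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# Assumption: a local thread pool stands in for a set of stateless workers.
_task_pool = ThreadPoolExecutor(max_workers=4)


def submit_task(fn, *args) -> Future:
    """Run a stateless function asynchronously and return a future for its result."""
    return _task_pool.submit(fn, *args)


class ActorHandle:
    """Wraps a stateful object whose method calls run one at a time, in order."""

    def __init__(self, cls, *args):
        # A single dedicated thread serializes all calls, so the state needs no locks.
        self._mailbox = ThreadPoolExecutor(max_workers=1)
        self._instance = self._mailbox.submit(cls, *args).result()

    def call(self, method_name, *args) -> Future:
        """Invoke a method on the wrapped object and return a future for its result."""
        method = getattr(self._instance, method_name)
        return self._mailbox.submit(method, *args)


def square(x):
    # Stateless work: the result depends only on the argument.
    return x * x


class Counter:
    # Stateful work: each call depends on the calls that came before it.
    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value
        return self.total


def run_demo():
    # Task-parallel part: independent invocations that may run concurrently.
    futures = [submit_task(square, i) for i in range(4)]
    squares = [f.result() for f in futures]

    # Actor part: the same future-based interface, applied to mutable state.
    counter = ActorHandle(Counter)
    running = [counter.call("add", s).result() for s in squares]
    return squares, running
```

In this sketch both kinds of computation return futures, which is what lets a caller mix them freely: run_demo() feeds the results of the stateless square tasks into the stateful Counter actor and collects the running totals.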

Published in
OSDI'18: Proceedings of the 13th USENIX conference on Operating Systems Design and Implementation
October 2018
815 pages
ISBN: 9781931971478
Program Chairs: Andrea Arpaci-Dusseau (University of Wisconsin-Madison), Geoff Voelker (University of California, San Diego)
Publisher: USENIX Association, United States