A Survey of Multi-Tenant Deep Learning Inference on GPU
[article]
2022
arXiv
pre-print
This survey aims to summarize and categorize the emerging challenges and optimization opportunities for multi-tenant DL inference on GPU. ...
With such strong computing scaling of GPUs, multi-tenant deep learning inference by co-locating multiple DL models onto the same GPU becomes widely deployed to improve resource utilization, enhance serving ...
However, as we introduced before, fine-grained resource partitioning was not achievable until recently, when GPU vendors released a series of resource-sharing and partitioning mechanisms such as multi-streams ...
arXiv:2203.09040v3
fatcat:utvpoyvvajfhfghgpf45nxnbne
FULL-W2V: Fully Exploiting Data Reuse for W2V on GPU-Accelerated Systems
[article]
2023
arXiv
pre-print
Our prototype implementation achieves 2.97X speedup when ported from Nvidia Pascal P100 to Volta V100 cards, and outperforms the state-of-the-art by 5.72X on V100 cards with the same embedding quality. ...
In-depth analysis indicates that the reduction of memory accesses through register and shared memory caching and high-throughput shared memory reduction leads to a significantly improved arithmetic intensity ...
ACKNOWLEDGEMENTS This work is supported in part by the U.S. National Science Foundation under Grants CCF-1551511 and CNS-1551262. ...
arXiv:2312.07743v1
fatcat:36ixvocghbdjnm6ioogsqjkefy
Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications
[article]
2019
arXiv
pre-print
However, unlike traditional resources such as CPU or the network, modern GPUs do not natively support fine-grained sharing primitives. ...
Salus implements an efficient, consolidated execution service that exposes the GPU to different DL applications, and enforces fine-grained sharing by performing iteration scheduling and addressing associated ...
Prior works on fine-grained GPU sharing fall into several categories. ...
arXiv:1902.04610v1
fatcat:a4l66d2zcbd23jwwzdez2qitfq
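The Salus snippet above describes enforcing fine-grained GPU sharing through iteration scheduling. A minimal sketch of that idea, in pure Python and with illustrative names (this is not Salus's actual implementation): co-located DL jobs take turns, each getting the GPU for exactly one iteration per scheduling round.

```python
from collections import deque

def schedule(jobs):
    """Round-robin iteration-level time slicing.

    jobs: list of (name, num_iterations) tuples for co-located DL jobs.
    Returns the order in which single iterations would run on the GPU.
    """
    queue = deque(jobs)
    order = []
    while queue:
        name, remaining = queue.popleft()
        order.append(name)                  # run exactly one iteration
        if remaining > 1:
            queue.append((name, remaining - 1))
    return order

# two co-located jobs: A needs 3 iterations, B needs 2
print(schedule([("A", 3), ("B", 2)]))      # interleaved at iteration granularity
```

Scheduling at iteration boundaries (rather than whole-job granularity) is what makes the sharing "fine-grained": a latency-sensitive job never waits for another job's entire run, only for one iteration.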
ProvDeploy: Provenance-oriented Containerization of High Performance Computing Scientific Workflows
[article]
2024
arXiv
pre-print
This complexity increases if the user needs to add provenance data capture services to the workflow. ...
This manuscript introduces ProvDeploy to assist the user in configuring containers for scientific workflows with integrated provenance data capture. ...
Acknowledgments This work was supported in part by CNPq, FAPERJ, and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brazil (CAPES) - Finance Code 001. ...
arXiv:2403.15324v2
fatcat:hd6gtot4azghvpi55znbclps74
IMPROVING PERFORMANCE IN HPC SYSTEM UNDER POWER CONSUMPTIONS LIMITATIONS
2019
International Journal of Advanced Research in Computer Science
However, the primary focus of this study is to analyse how to enhance performance under power consumption limitations for emerging technologies. ...
Today's High-Performance Computing (HPC) systems require significant usage of "supercomputers" and extensive parallel processing approaches for solving complicated computational tasks at the Petascale level ...
This model achieves coarse grain parallelism through MPI and fine-grain parallelism through GPU computations. ...
doi:10.26483/ijarcs.v10i2.6397
fatcat:k3l3lk5kuzhnldn5b2qzkh4eia
Balanced Sparsity for Efficient DNN Inference on GPU
[article]
2018
arXiv
pre-print
Our approach adapts to the high-parallelism property of GPUs, showing great potential for sparsity in the wide deployment of deep learning services. ...
Experiment results show that balanced sparsity achieves up to 3.1x practical speedup for model inference on GPU, while retaining the same high model accuracy as fine-grained sparsity. ...
Shijie Cao was partly supported by the National Natural Science Foundation of China (No. 61772159). ...
arXiv:1811.00206v4
fatcat:3yptunrdnzchlepiikqwszhizu
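The balanced-sparsity snippet above contrasts fine-grained and coarse-grained pruning; the balanced scheme splits each weight row into equal-sized blocks and keeps the same number of nonzeros in every block, so GPU threads get uniform work. A minimal pure-Python sketch of that idea (names and block layout are illustrative, not the paper's code):

```python
def balanced_prune_row(row, num_blocks, keep_per_block):
    """Keep the keep_per_block largest-magnitude weights in each block.

    Assumes len(row) is evenly divisible by num_blocks.
    """
    n = len(row) // num_blocks
    out = [0.0] * len(row)
    for b in range(num_blocks):
        block = list(enumerate(row[b * n:(b + 1) * n]))
        block.sort(key=lambda iv: abs(iv[1]))       # smallest magnitude first
        for i, v in block[-keep_per_block:]:
            out[b * n + i] = v                      # retain only the top-k
    return out

# each half of the row keeps exactly its single largest-magnitude weight,
# so every block ends up with identical sparsity (balanced, GPU-friendly)
print(balanced_prune_row([1.0, -2.0, 3.0, 4.0], num_blocks=2, keep_per_block=1))
```

Because every block retains the same nonzero count, each GPU thread processing one block does the same amount of work, avoiding the load imbalance of unstructured fine-grained sparsity.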
NURA
2022
Abstract Proceedings of the 2022 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems
This paper proposes a new multi-application paradigm for GPUs, called NURA, that provides high potential to improve resource utilization and ensure fairness and Quality-of-Service (QoS). ...
Some pieces of prior work (e.g. spatial multitasking) have limited opportunity to improve resource utilization, while others, e.g. simultaneous multi-kernel, provide fine-grained resource sharing at the ...
To ensure the quality of service (QoS) of the primary kernel, we slightly modify the warp scheduler to always prioritize CTAs of the primary kernel over the CTAs of the Helper Kernel. ...
doi:10.1145/3489048.3522656
fatcat:xcmtppre3rer3etvsjnrvzjtei
Balanced Sparsity for Efficient DNN Inference on GPU
2019
PROCEEDINGS OF THE THIRTIETH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE AND THE TWENTY-EIGHTH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE
Our approach adapts to the high-parallelism property of GPUs, showing great potential for sparsity in the wide deployment of deep learning services. ...
Another trend accelerates sparse model inference on general-purpose hardware by adopting coarse-grained sparsity to prune or regularize consecutive weights for efficient computation. ...
Shijie Cao was partly supported by the National Natural Science Foundation of China (No. 61772159). ...
doi:10.1609/aaai.v33i01.33015676
fatcat:uq6pbptj5bayhbhkmng3fzy2xi
ShareRender
2017
Proceedings of the 2017 ACM on Multimedia Conference - MM '17
Thanks to the flexible workload assignment among multiple render agents, ShareRender enables fine-grained resource sharing at the frame-level to significantly improve GPU utilization. ...
For each game running in a VM, ShareRender starts a graphics wrapper to intercept frame rendering requests and assign them to render agents responsible for frame rendering on GPUs. ...
CONCLUSION In this paper, we present ShareRender, a cloud gaming system that bypasses GPU virtualization and enables fine-grained resource sharing in cloud gaming. ...
doi:10.1145/3123266.3123306
dblp:conf/mm/ZhangLLJL17
fatcat:ytpngwkigfdtjdi34a5fvo3dju
Guest Editorial: Big Traffic Data Analysis and Mining
2018
IET Intelligent Transport Systems
The authors propose a solution called DRPRS for fine-grained pedestrian recognition using deep learning techniques supported by stream processing from Apache Storm. ...
Timetable performance evaluation is critical for improving the train service quality. ...
doi:10.1049/iet-its.2018.0116
fatcat:jjxp3lvyz5d6bho5irovoolk6y
Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision
[article]
2022
arXiv
pre-print
However, traditional approaches designed for big data or high-performance computing workloads cannot enable DL workloads to fully utilize GPU resources. ...
Recently, numerous schedulers have been proposed that are tailored to DL workloads in GPU datacenters. This paper surveys existing research efforts for both training and inference workloads. ...
Analogously, Liquid [47] also supports fine-grained GPU sharing for further resource utilization improvement using a random forest model. ...
arXiv:2205.11913v3
fatcat:fnbinueyijb4nc75fpzd6hzjgq
Construction of College Students' Mental Health Education Model Based on Data Analysis
2022
Scientific Programming
This paper presents an in-depth study and analysis of the model of college students' mental health education using fine-grained parallel computational programming. ...
for future intervention studies. ...
Acknowledgments The study was supported by the Anqing Normal University. ...
doi:10.1155/2022/7044526
fatcat:6ngexqt4snbrjpd43m3rqx2era
Clover: Toward Sustainable AI with Carbon-Aware Machine Learning Inference Service
[article]
2023
arXiv
pre-print
We introduce Clover, a carbon-friendly ML inference service runtime system that balances performance, accuracy, and carbon emissions through mixed-quality models and GPU resource partitioning. ...
This paper presents a solution to the challenge of mitigating carbon emissions from hosting large-scale machine learning (ML) inference services. ...
This is because fine-grained partitioning allows a higher degree of hardware sharing, and hence, better resource utilization. This leads to lower carbon emissions per request. ...
arXiv:2304.09781v1
fatcat:epbri7k5hnhptprsibgxsviop4
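The Clover snippet above describes balancing accuracy and carbon emissions through mixed-quality models and GPU partitioning. A hypothetical pure-Python sketch of the selection step (this is not Clover's actual algorithm; the variant names, energy figures, and budget are invented for illustration): pick the highest-accuracy model variant whose estimated per-request emissions fit a carbon budget, given the current grid carbon intensity.

```python
def pick_variant(variants, carbon_intensity_g_per_kwh, budget_g_per_request):
    """variants: list of (name, accuracy, energy_kwh_per_request) tuples.

    Returns the name of the most accurate variant whose per-request
    emissions (energy x grid carbon intensity) fit the budget; falls
    back to the lowest-energy variant when nothing fits.
    """
    feasible = [
        (name, acc) for name, acc, energy in variants
        if energy * carbon_intensity_g_per_kwh <= budget_g_per_request
    ]
    if not feasible:
        return min(variants, key=lambda v: v[2])[0]
    return max(feasible, key=lambda v: v[1])[0]

variants = [
    ("full",    0.95, 2e-4),   # high accuracy, more energy per request
    ("distill", 0.92, 5e-5),   # mixed-quality fallback, 4x less energy
]
# dirty grid (500 g/kWh) forces the low-carbon variant under a 0.05 g budget
print(pick_variant(variants, 500.0, 0.05))
# clean grid (100 g/kWh) lets the full model back in
print(pick_variant(variants, 100.0, 0.05))
```

The same structure extends naturally to the GPU-partitioning dimension the snippet mentions: each (variant, partition-size) pair would get its own energy estimate, and the feasibility check stays identical.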
SLA-Driven ML Inference Framework for Clouds with Heterogeneous Accelerators
2022
Conference on Machine Learning and Systems
In addition, our framework enables efficient shares of GPU accelerators with multiple functions to increase resource efficiency with minimal overhead. ...
This homogeneity assumption causes two challenges in running ML workloads like Deep Neural Network (DNN) inference services on these frameworks. ...
ACKNOWLEDGEMENT We thank the anonymous reviewers for their feedback on earlier drafts of this paper. We wish to thank Eric Wu in Hewlett Packard Labs for his support in setting up the testbed. ...
dblp:conf/mlsys/ChoTCS22
fatcat:uxfzaro2lza3ti7bfhe3onhqcq
Midpoint routing algorithms for Delaunay triangulations
2010
2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)
While this clustering solution is gaining momentum in recent years, efficient runtime support for fine-grained object sharing over the distributed JVM remains a challenge. ...
run-time overheads of fine-grained threading. ...
We map these fine-grain computations onto multithreaded GPUs in such a way that the processing cost per element is shown to be close to the best possible. ...
doi:10.1109/ipdps.2010.5470471
dblp:conf/ipps/SiZ10
fatcat:yuchdc4zp5borm5vs7j4rqgmzy
Showing results 1 — 15 out of 4,276 results