18,744 Hits in 4.4 sec

ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models [article]

Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai
2024 arXiv   pre-print
This paper presents ServerlessLLM, a locality-enhanced serverless inference system for Large Language Models (LLMs).  ...  ; and (iii) locality-aware server allocation, enabling ServerlessLLM to evaluate the status of each server in a cluster and effectively schedule model startup time to capitalize on local checkpoint placement  ...  For selecting the optimal server for model migration, ServerlessLLM employs a dynamic programming approach to minimize migration time.  ...
arXiv:2401.14351v1 fatcat:amjnmrzjx5cnnh7lbnzw3i3f3a
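
For context on the locality-aware allocation this entry describes, below is a minimal Python sketch of checkpoint-locality-based server selection: estimate startup time on each server from where its checkpoint resides, then pick the fastest. The `Server` class, tier names, bandwidth numbers, and model size are illustrative assumptions, not ServerlessLLM's actual API or measurements.
```python
from dataclasses import dataclass

# Assumed checkpoint-loading bandwidths (GB/s) per storage tier, fastest first.
TIER_BANDWIDTH_GBPS = {"gpu": float("inf"), "dram": 25.0, "ssd": 3.0, "remote": 1.0}

@dataclass
class Server:
    name: str
    checkpoint_tier: str   # where this server currently holds the checkpoint
    queue_delay_s: float   # time until a GPU frees up on this server

def estimated_startup_s(server: Server, model_size_gb: float) -> float:
    """Startup = wait for a free GPU + load the checkpoint from its tier."""
    bw = TIER_BANDWIDTH_GBPS[server.checkpoint_tier]
    load_s = 0.0 if bw == float("inf") else model_size_gb / bw
    return server.queue_delay_s + load_s

def pick_server(servers: list[Server], model_size_gb: float) -> Server:
    """Locality-aware allocation: choose the server with the fastest startup."""
    return min(servers, key=lambda s: estimated_startup_s(s, model_size_gb))

cluster = [
    Server("a", "remote", queue_delay_s=0.0),
    Server("b", "ssd", queue_delay_s=1.0),
    Server("c", "dram", queue_delay_s=12.0),
]
print(pick_server(cluster, model_size_gb=26.0).name)  # "b": short queue + SSD load wins
```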

Toward application-specific memory reconfiguration for energy efficiency

Pietro Cicotti, Laura Carrington, Andrew Chien
2013 Proceedings of the 1st International Workshop on Energy Efficient Supercomputing - E2SC '13  
, and 3) Can we use application characterization to automatically select an energy-optimal memory hierarchy configuration?  ...  Finally, as a first step towards automatic reconfiguration, we explore application characterization via reuse distance as a guide to select the best memory hierarchy configuration; we show that reuse distance  ...  The contents do not necessarily reflect the position or the policy of the United States Government, and no official endorsement should be inferred.  ... 
doi:10.1145/2536430.2536434 dblp:conf/sc/CicottiCC13 fatcat:ssw2vucenzdm7fk2452j5p4z3i
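
Since the snippet leans on reuse distance as the characterization signal, here is a toy sketch of how reuse distances are computed from an address trace. The naive O(n·m) stack scan and the made-up trace are for illustration only; production tools use tree-based O(n log n) algorithms.
```python
def reuse_distances(trace):
    """For each access, count distinct addresses touched since its last use."""
    stack = []                                          # LRU stack: most recent address last
    distances = []
    for addr in trace:
        if addr in stack:
            depth = len(stack) - 1 - stack.index(addr)  # distinct addresses seen since last use
            distances.append(depth)
            stack.remove(addr)
        else:
            distances.append(float("inf"))              # cold miss: first reference
        stack.append(addr)
    return distances

# Small distances imply a small working set, so a smaller (lower-power) cache may suffice.
print(reuse_distances(["A", "B", "A", "C", "B"]))  # [inf, inf, 1, inf, 2]
```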

Improving the Performance of DNN-based Software Services using Automated Layer Caching [article]

Mohammadamin Abedi, Yanni Iouannou, Pooyan Jamshidi, Hadi Hemmati
2022 arXiv   pre-print
The proposed solution is an automated online layer caching mechanism that allows early exiting of a large model during inference time if the cache model in one of the early exits is confident enough for  ...  However, the trend in DNN design is toward larger models with many layers and parameters to achieve more accurate results.  ...  The work of Pooyan Jamshidi has been partially supported by NSF (Awards 2007202, 2107463, and 2233873) and NASA (Award 80NSSC20K1720).  ... 
arXiv:2209.08625v1 fatcat:qkeq5q5pnjhp7lryjbympig7cm
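
A minimal PyTorch sketch of the early-exit idea the entry describes: a cheap classifier attached to an intermediate layer answers directly whenever its confidence clears a threshold. Layer sizes, the 0.9 threshold, and the batch-size-1 handling are assumptions for illustration, not the paper's implementation.
```python
import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    """Toy model with one intermediate 'cache' exit (batch size 1 assumed)."""
    def __init__(self, num_classes: int = 10, threshold: float = 0.9):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
        self.exit1 = nn.Linear(64, num_classes)   # cheap cache model at exit 1
        self.block2 = nn.Sequential(nn.Linear(64, 64), nn.ReLU())
        self.head = nn.Linear(64, num_classes)    # full model head
        self.threshold = threshold

    def forward(self, x):
        h = self.block1(x)
        early = self.exit1(h).softmax(dim=-1)
        if early.max().item() >= self.threshold:  # confident enough: exit early
            return early
        return self.head(self.block2(h)).softmax(dim=-1)

with torch.no_grad():
    print(EarlyExitNet()(torch.randn(1, 32)).argmax(dim=-1))
```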

Towards Content-Centric Geometric Routing

Wouter Tavernier, Sahel Sahhaf, Didier Colle, Mario Pickavet, Piet Demeester
2014 2014 IEEE 21st Symposium on Communications and Vehicular Technology in the Benelux (SCVT)  
In this paper we propose the use of a routing system-inferred coordinate system to improve: i) content server selection upon receiving content requests, and ii) the mapping of content to servers/caches  ...  The proposed approach can be further extended in order to include alternate geometric systems for example supporting hyperbolic geometries.  ...  ACKNOWLEDGMENT This work is partly funded by the European Commission through the EULER project (Grant 258307), part of the Future Internet Research and Experimentation (FIRE) objective of the Seventh  ...
doi:10.1109/scvt.2014.7046722 dblp:conf/scvt/TavernierSCPD14 fatcat:l2iggmmrpbdfdbxgbthaamakiu
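
A small sketch of coordinate-based content server selection, assuming every replica and client is embedded in the coordinate space the routing system infers; swapping in a hyperbolic distance would cover the alternate geometries the snippet mentions. The coordinates and names are made up.
```python
import math

def distance(a, b):
    # Euclidean here; a hyperbolic metric would support alternate geometric systems.
    return math.dist(a, b)

def select_server(client_coord, replicas):
    """Pick the replica holding the content that is geometrically closest."""
    return min(replicas, key=lambda r: distance(client_coord, r[1]))

replicas = [("cache-1", (0.1, 0.9)), ("cache-2", (0.7, 0.2)), ("cache-3", (0.4, 0.5))]
print(select_server((0.6, 0.3), replicas)[0])  # cache-2 is nearest to this client
```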

Unified performance and power modeling of scientific workloads

Shuaiwen Leon Song, Kevin Barker, Darren Kerbyson
2013 Proceedings of the 1st International Workshop on Energy Efficient Supercomputing - E2SC '13  
doi:10.1145/2536430.2536435 dblp:conf/sc/SongBK13 fatcat:al4dkkcccrettiv3cmaacktety

Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems [article]

Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia
2023 arXiv   pre-print
system optimizations.  ...  In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data.  ...  A concurrent work [312] jointly optimizes model multiplexing and query caching and also analyzes the optimality of minimizing inference cost.  ... 
arXiv:2312.15234v1 fatcat:4g62xtm4e5futizreet7myp66a

A Survey of Serverless Machine Learning Model Inference [article]

Kamil Kojs
2023 arXiv   pre-print
Large machine learning models often demand GPU resources for efficient inference to meet SLOs.  ...  This survey aims to summarize and categorize the emerging challenges and optimization opportunities for large-scale deep learning serving systems.  ...  A similar approach implemented in [38] employs a caching mechanism for machine learning models in GPU memory, enhancing model inference efficiency and optimizing GPU memory management.  ... 
arXiv:2311.13587v1 fatcat:srlgwozhhrfbvnpcwaezwh465y
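
For the GPU model caching mechanism the snippet points to, here is an illustrative LRU cache keyed by model name with a memory budget; the `load_model` hook, sizes, and eviction policy are assumptions for the sketch, not any specific system's design.
```python
from collections import OrderedDict

class GPUModelCache:
    """Keep recently used models resident; evict LRU entries past a memory budget."""
    def __init__(self, capacity_gb: float, load_model):
        self.capacity_gb = capacity_gb
        self.load_model = load_model        # callable: name -> (model, size_gb)
        self.cache = OrderedDict()          # name -> (model, size_gb), LRU order

    def get(self, name: str):
        if name in self.cache:
            self.cache.move_to_end(name)    # cache hit: mark most recently used
            return self.cache[name][0]
        model, size_gb = self.load_model(name)   # cache miss: load into GPU memory
        self.cache[name] = (model, size_gb)
        while sum(s for _, s in self.cache.values()) > self.capacity_gb and len(self.cache) > 1:
            self.cache.popitem(last=False)  # evict the least recently used model
        return model

cache = GPUModelCache(capacity_gb=24.0, load_model=lambda n: (f"<{n} weights>", 10.0))
for name in ["bert-large", "resnet50", "llama-7b"]:
    cache.get(name)
print(list(cache.cache))  # the oldest model was evicted to stay within 24 GB
```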

PACSET (Packed Serialized Trees): Reducing Inference Latency for Tree Ensemble Deployment [article]

Meghana Madhyastha, Kunal Lillaney, James Browne, Joshua Vogelstein, Randal Burns
2020 arXiv   pre-print
We present methods to serialize and deserialize tree ensembles that optimize inference latency when models are not already loaded into memory.  ...  The layout interleaves correlated nodes across multiple trees, uses leaf cardinality to collocate the nodes on the most popular paths and is optimized for the I/O blocksize.  ...  This is a straightforward goal, but it diverges from existing systems that are optimized for large batches and load the entire model into memory.  ... 
arXiv:2011.05383v1 fatcat:suklohptlveffpklgeom5gte7e
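
A rough sketch of the layout idea behind this entry: order ensemble nodes so the most frequently visited ones land in the earliest I/O blocks. The visit-count input, 32-byte node size, and 4 KiB block size are illustrative assumptions, not PACSET's actual serialization format.
```python
BLOCK_BYTES = 4096   # assumed I/O block size
NODE_BYTES = 32      # assumed serialized size of one tree node

def pack_nodes(visit_counts):
    """visit_counts: {(tree_id, node_id): count}. Returns blocks of node ids,
    with the hottest nodes (most popular paths) packed into the earliest blocks."""
    hot_first = sorted(visit_counts, key=visit_counts.get, reverse=True)
    per_block = BLOCK_BYTES // NODE_BYTES
    return [hot_first[i:i + per_block] for i in range(0, len(hot_first), per_block)]

# Shallow nodes are visited most often, so the roots of all trees end up in block 0.
counts = {(t, n): 1000 // (n + 1) for t in range(3) for n in range(300)}
blocks = pack_nodes(counts)
print(len(blocks), blocks[0][:3])
```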

Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference [article]

Haiyang Huang, Newsha Ardalani, Anna Sun, Liu Ke, Hsien-Hsin S. Lee, Anjali Sridhar, Shruti Bhosale, Carole-Jean Wu, Benjamin Lee
2023 arXiv   pre-print
However, deploying such models for inference is difficult due to their large model size and complex communication pattern.  ...  We show that dynamic gating improves execution time by 1.25-4× for LM, 2-5× for MT Encoder and 1.09-1.5× for MT Decoder. It also reduces memory usage by up to 1.36× for LM and up to 1.1× for MT.  ...  Comparison of MoE and Dense models on inference latency.  ... 
arXiv:2303.06182v1 fatcat:r5bjcac6ijg43a3mx24tx2ujtu
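
For readers unfamiliar with the gating the entry refers to, a minimal single-device top-k MoE layer is sketched below; the dimensions, expert count, and k are arbitrary, and this dense loop stands in for the distributed all-to-all communication the paper actually targets.
```python
import torch
import torch.nn as nn

class TopKGate(nn.Module):
    """Route each token to its top-k experts and mix their outputs."""
    def __init__(self, d_model=64, num_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_experts))
        self.k = k

    def forward(self, x):                          # x: (tokens, d_model)
        scores = self.router(x).softmax(dim=-1)    # per-token routing probabilities
        weights, idx = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                 # dispatch tokens to their experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

print(TopKGate()(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```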

Transformer-based Planning for Symbolic Regression [article]

Parshin Shojaee, Kazem Meidani, Amir Barati Farimani, Chandan K. Reddy
2023 arXiv   pre-print
Recent advancements in SR have demonstrated the effectiveness of pre-trained transformer-based models in generating equations as sequences, leveraging large-scale pre-training on synthetic datasets and  ...  However, these models primarily rely on supervised pre-training goals borrowed from text generation and overlook equation discovery objectives like accuracy and complexity.  ...  To achieve this, we utilize Monte Carlo Tree Search (MCTS) during inference time to guide the decoder towards optimal solutions for fitting and complexity objectives (as shown in Figure 2(c) ).  ... 
arXiv:2303.06833v5 fatcat:bkq2voqddnef3cmiqajk3ryfim
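
A toy illustration of the two objectives the planner balances, fitting accuracy and expression complexity, scored over a handful of hand-written candidate equations. The reward weighting, candidates, and data are made up, and the actual method searches decoder token sequences with MCTS rather than ranking a fixed list.
```python
import numpy as np

def reward(expr_fn, complexity, X, y, lam=0.05):
    """Higher is better: goodness of fit minus a complexity penalty."""
    mse = float(np.mean((expr_fn(X) - y) ** 2))
    return 1.0 / (1.0 + mse) - lam * complexity

X = np.linspace(-2, 2, 100)
y = X**2 + 0.1 * np.random.randn(100)

candidates = {
    "x**2":     (lambda x: x**2, 3),
    "x**2 + x": (lambda x: x**2 + x, 5),
    "sin(x)":   (np.sin, 2),
}
best = max(candidates, key=lambda name: reward(*candidates[name], X, y))
print(best)  # "x**2": accurate fit at low complexity
```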

Clipper: A Low-Latency Online Prediction Serving System [article]

Daniel Crankshaw, Xin Wang, Giulio Zhou, Michael J. Franklin, Joseph E. Gonzalez, Ion Stoica
2017 arXiv   pre-print
Furthermore, by introducing caching, batching, and adaptive model selection techniques, Clipper reduces prediction latency and improves prediction throughput, accuracy, and robustness without modifying  ...  Finally, we compare Clipper to the TensorFlow Serving system and demonstrate that we are able to achieve comparable throughput and latency while enabling model composition and online learning to improve  ...  Acknowledgments We would like to thank Peter Bailis, Alexey Tumanov, Noah Fiedel, Chris Olston, our shepherd Mike Dahlin, and the anonymous reviewers for their feedback.  ... 
arXiv:1612.03079v2 fatcat:fe2w5dhxsnazhd62gxcx4ielpu
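
To make the adaptive model selection concrete, here is a simple epsilon-greedy selector over deployed models updated from delayed feedback; it stands in for, and is not, Clipper's actual bandit algorithm, and the model names and reward definition are placeholders.
```python
import random

class AdaptiveSelector:
    """Epsilon-greedy selection among deployed models based on observed feedback."""
    def __init__(self, model_names, epsilon=0.1):
        self.scores = {m: 0.0 for m in model_names}   # running reward estimates
        self.counts = {m: 0 for m in model_names}
        self.epsilon = epsilon

    def choose(self) -> str:
        if random.random() < self.epsilon:            # explore occasionally
            return random.choice(list(self.scores))
        return max(self.scores, key=self.scores.get)  # otherwise exploit the best

    def feedback(self, model: str, reward: float):
        """reward: e.g. 1.0 if the served prediction was later confirmed correct."""
        self.counts[model] += 1
        self.scores[model] += (reward - self.scores[model]) / self.counts[model]

selector = AdaptiveSelector(["sklearn_rf", "tf_cnn", "xgboost"])
chosen = selector.choose()
selector.feedback(chosen, reward=1.0)
```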

CacheNet: A Model Caching Framework for Deep Learning Inference on the Edge [article]

Yihao Fang, Shervin Manzuri Shalmani, Rong Zheng
2020 arXiv   pre-print
Inference of uncompressed large scale DNN models can only run in the cloud with extra communication latency back and forth between cloud and end devices, while compressed DNN models achieve real-time inference  ...  CacheNet caches low-complexity models on end devices and high-complexity (or full) models on edge or cloud servers.  ...  ACKNOWLEDGMENTS This work is in part supported by the Discovery Grant and Collaborative Research Development Grant from Natural Science and Engineering Council, Canada.  ... 
arXiv:2007.01793v1 fatcat:ygoeavc6d5cubbtgewylyaeb5y
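
A two-tier sketch of the device/server split the entry describes: consult the small cached model on the end device and pay the network round trip to the full model only when it is not confident. Both models here are stubs and the 0.8 threshold is an assumption.
```python
import numpy as np

def device_model(x):
    """Small cached model on the end device: returns (label, confidence). Stub."""
    probs = np.array([0.55, 0.45])        # pretend softmax output
    return int(probs.argmax()), float(probs.max())

def server_model(x):
    """Full model on the edge/cloud server (stands in for an RPC)."""
    return 1

def predict(x, threshold=0.8):
    label, conf = device_model(x)
    if conf >= threshold:
        return label                      # served locally, no network round trip
    return server_model(x)                # uncertain: pay the communication cost

print(predict(np.zeros(8)))               # confidence 0.55 < 0.8, so it falls back
```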

Accelerating Deep Learning Inference via Learned Caches [article]

Arjun Balasubramanian, Adarsh Kumar, Yuhan Liu, Han Cao, Shivaram Venkataraman, Aditya Akella
2021 arXiv   pre-print
However, traditional caching approaches incur high memory overheads and lookup latencies, leading us to design learned caches - caches that consist of simple ML models that are continuously updated.  ...  We present the design of GATI, an end-to-end prediction serving system that incorporates learned caches for low-latency DNN inference.  ...  Then, we use these as inputs to select a subset of these learned cache variants for inference.  ... 
arXiv:2101.07344v1 fatcat:cgpq66oh45g7zhi6ayhxhkspnq
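
A toy version of the "learned cache" idea: a small classifier predicts the final output from an intermediate activation, is consulted first, and is periodically refit on recent (activation, output) pairs so it stays up to date. The scikit-learn model, confidence threshold, and retrain cadence are illustrative assumptions, not GATI's design.
```python
from collections import deque
import numpy as np
from sklearn.linear_model import LogisticRegression

class LearnedCache:
    def __init__(self, threshold=0.85, buffer_size=1000, refit_every=200):
        self.model = None
        self.threshold = threshold
        self.refit_every = refit_every
        self.seen = 0
        self.buffer = deque(maxlen=buffer_size)   # recent (activation, label) pairs

    def lookup(self, activation):
        """Return a cached prediction if the cache is confident, else None."""
        if self.model is None:
            return None
        probs = self.model.predict_proba(activation.reshape(1, -1))[0]
        return int(probs.argmax()) if probs.max() >= self.threshold else None

    def record(self, activation, final_label):
        """Log the full model's answer and periodically refresh the cache."""
        self.buffer.append((activation, final_label))
        self.seen += 1
        if self.seen % self.refit_every == 0:
            X = np.stack([a for a, _ in self.buffer])
            y = np.array([label for _, label in self.buffer])
            self.model = LogisticRegression(max_iter=200).fit(X, y)
```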

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs [article]

Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao
2024 arXiv   pre-print
In this study, we introduce adaptive KV cache compression, a plug-and-play method that reduces the memory footprint of generative inference for Large Language Models (LLMs).  ...  heads centered on special tokens, and only employing the standard KV cache for attention heads that broadly attend to all tokens.  ...  These results indicate that it is suboptimal to apply the same KV cache to all layers without adaptation, and that it is beneficial to detect the structure of each attention head so as to select the optimal  ... 
arXiv:2310.01801v3 fatcat:3p7h6idxl5dqnbth73ytw2wq3i
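
An illustrative per-head pruning rule in the spirit of the adaptive compression the entry describes: if the attention mass on special plus recent tokens already covers a head, keep only those cache entries, otherwise keep the full cache. The profiling signal, thresholds, and single-head shapes are simplified assumptions.
```python
import torch

def compress_kv(keys, values, attn, special_idx, recent=32, coverage=0.9):
    """keys/values: (seq, d). attn: (seq,) average attention this head places on
    each cached position. Keep only special+recent entries if they cover the head."""
    seq = keys.shape[0]
    keep = torch.zeros(seq, dtype=torch.bool)
    keep[special_idx] = True
    keep[-recent:] = True
    if attn[keep].sum() >= coverage:          # special + recent explain this head
        return keys[keep], values[keep]       # compressed cache
    return keys, values                       # broadly-attending head: keep everything

seq, d = 512, 64
attn = torch.full((seq,), 1e-4)
attn[0] = 0.5                                 # mass concentrated on the special token
attn[-32:] = 0.45 / 32                        # mass on recent tokens
ck, cv = compress_kv(torch.randn(seq, d), torch.randn(seq, d), attn, torch.tensor([0]))
print(ck.shape)  # torch.Size([33, 64]): BOS plus the 32 most recent positions
```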

ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching [article]

Youpeng Zhao, Di Wu, Jun Wang
2024 arXiv   pre-print
The Transformer architecture has significantly advanced natural language processing (NLP) and has been foundational in developing large language models (LLMs) such as LLaMA and OPT, which have come to  ...  Thanks to the autoregressive characteristic of LLM inference, KV caching for the attention layers in Transformers can effectively accelerate LLM inference by substituting quadratic-complexity computation  ...  DeepSpeed-ZeRO is a deep learning optimization software developed to improve the computation and memory efficiency of training and inference for large models.  ... 
arXiv:2403.17312v1 fatcat:ctmbmyq7kfgdtp6bu3l3hkfjri
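
A minimal sketch of sparsity-aware KV pruning: keep the cached tokens that received the most attention plus the newest ones, bounding the cache that autoregressive decoding otherwise grows with sequence length. The budget, the always-keep window, and the single-head shapes are assumptions for illustration, not ALISA's algorithm.
```python
import torch

def sparsify_kv(keys, values, attn_weights, budget=128, keep_recent=16):
    """keys/values: (seq, d); attn_weights: (seq,) attention mass per cached token.
    Retain the newest tokens plus the highest-attention older tokens."""
    seq = keys.shape[0]
    if seq <= budget:
        return keys, values
    newest = torch.arange(seq - keep_recent, seq)
    topk = attn_weights[: seq - keep_recent].topk(budget - keep_recent).indices
    keep = torch.cat([topk, newest]).sort().values    # preserve positional order
    return keys[keep], values[keep]

seq, d = 1024, 64
ck, cv = sparsify_kv(torch.randn(seq, d), torch.randn(seq, d), torch.rand(seq))
print(ck.shape)  # torch.Size([128, 64]): the cache is capped at the budget
```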
Showing results 1 — 15 out of 18,744 results