18,744 Hits in 4.4 sec

ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models [article]

Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai
2024 arXiv   pre-print
This paper presents ServerlessLLM, a locality-enhanced serverless inference system for Large Language Models (LLMs).  ...  ; and (iii) locality-aware server allocation, enabling ServerlessLLM to evaluate the status of each server in a cluster and effectively schedule model startup time to capitalize on local checkpoint placement  ...  For selecting the optimal server for model migration, ServerlessLLM employs a dynamic programming approach to minimize migration time.  ...
arXiv:2401.14351v1 fatcat:amjnmrzjx5cnnh7lbnzw3i3f3a
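
For context on the locality-aware allocation this entry describes, below is a minimal Python sketch of checkpoint-locality-based server selection: estimate startup time on each server from where its checkpoint resides, then pick the fastest. The `Server` class, tier names, bandwidth numbers, and model size are illustrative assumptions, not ServerlessLLM's actual API or measurements.
```python
from dataclasses import dataclass

# Assumed checkpoint-loading bandwidths (GB/s) per storage tier, fastest first.
TIER_BANDWIDTH_GBPS = {"gpu": float("inf"), "dram": 25.0, "ssd": 3.0, "remote": 1.0}

@dataclass
class Server:
    name: str
    checkpoint_tier: str   # where this server currently holds the checkpoint
    queue_delay_s: float   # time until a GPU frees up on this server

def estimated_startup_s(server: Server, model_size_gb: float) -> float:
    """Startup = wait for a free GPU + load the checkpoint from its tier."""
    bw = TIER_BANDWIDTH_GBPS[server.checkpoint_tier]
    load_s = 0.0 if bw == float("inf") else model_size_gb / bw
    return server.queue_delay_s + load_s

def pick_server(servers: list[Server], model_size_gb: float) -> Server:
    """Locality-aware allocation: choose the server with the fastest startup."""
    return min(servers, key=lambda s: estimated_startup_s(s, model_size_gb))

cluster = [
    Server("a", "remote", queue_delay_s=0.0),
    Server("b", "ssd", queue_delay_s=1.0),
    Server("c", "dram", queue_delay_s=12.0),
]
print(pick_server(cluster, model_size_gb=26.0).name)  # "b": short queue + SSD load wins
```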

Toward application-specific memory reconfiguration for energy efficiency

Pietro Cicotti, Laura Carrington, Andrew Chien
2013 Proceedings of the 1st International Workshop on Energy Efficient Supercomputing - E2SC '13  
, and 3) Can we use application characterization to automatically select an energy-optimal memory hierarchy configuration?  ...  Finally, as a first step towards automatic reconfiguration, we explore application characterization via reuse distance as a guide to select the best memory hierarchy configuration; we show that reuse distance  ...  The contents do not necessarily reflect the position or the policy of the United States Government, and no official endorsement should be inferred.  ... 
doi:10.1145/2536430.2536434 dblp:conf/sc/CicottiCC13 fatcat:ssw2vucenzdm7fk2452j5p4z3i
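
Since the snippet leans on reuse distance as the characterization signal, here is a toy sketch of how reuse distances are computed from an address trace. The naive O(n·m) stack scan and the made-up trace are for illustration only; production tools use tree-based O(n log n) algorithms.
```python
def reuse_distances(trace):
    """For each access, count distinct addresses touched since its last use."""
    stack = []                                          # LRU stack: most recent address last
    distances = []
    for addr in trace:
        if addr in stack:
            depth = len(stack) - 1 - stack.index(addr)  # distinct addresses seen since last use
            distances.append(depth)
            stack.remove(addr)
        else:
            distances.append(float("inf"))              # cold miss: first reference
        stack.append(addr)
    return distances

# Small distances imply a small working set, so a smaller (lower-power) cache may suffice.
print(reuse_distances(["A", "B", "A", "C", "B"]))  # [inf, inf, 1, inf, 2]
```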

Improving the Performance of DNN-based Software Services using Automated Layer Caching [article]

Mohammadamin Abedi, Yanni Iouannou, Pooyan Jamshidi, Hadi Hemmati
2022 arXiv   pre-print
The proposed solution is an automated online layer caching mechanism that allows early exiting of a large model during inference time if the cache model in one of the early exits is confident enough for  ...  However, the trend in DNN design is toward larger models with many layers and parameters to achieve more accurate results.  ...  The work of Pooyan Jamshidi has been partially supported by NSF (Awards 2007202, 2107463, and 2233873) and NASA (Award 80NSSC20K1720).  ... 
arXiv:2209.08625v1 fatcat:qkeq5q5pnjhp7lryjbympig7cm
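
A minimal PyTorch sketch of the early-exit idea the entry describes: a cheap classifier attached to an intermediate layer answers directly whenever its confidence clears a threshold. Layer sizes, the 0.9 threshold, and the batch-size-1 handling are assumptions for illustration, not the paper's implementation.
```python
import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    """Toy model with one intermediate 'cache' exit (batch size 1 assumed)."""
    def __init__(self, num_classes: int = 10, threshold: float = 0.9):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
        self.exit1 = nn.Linear(64, num_classes)   # cheap cache model at exit 1
        self.block2 = nn.Sequential(nn.Linear(64, 64), nn.ReLU())
        self.head = nn.Linear(64, num_classes)    # full model head
        self.threshold = threshold

    def forward(self, x):
        h = self.block1(x)
        early = self.exit1(h).softmax(dim=-1)
        if early.max().item() >= self.threshold:  # confident enough: exit early
            return early
        return self.head(self.block2(h)).softmax(dim=-1)

with torch.no_grad():
    print(EarlyExitNet()(torch.randn(1, 32)).argmax(dim=-1))
```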

Towards Content-Centric Geometric Routing

Wouter Tavernier, Sahel Sahhaf, Didier Colle, Mario Pickavet, Piet Demeester
2014 2014 IEEE 21st Symposium on Communications and Vehicular Technology in the Benelux (SCVT)  
In this paper we propose the use of a routing system-inferred coordinate system to improve: i) content server selection upon receiving content requests, and ii) the mapping of content to servers/caches  ...  The proposed approach can be further extended in order to include alternate geometric systems for example supporting hyperbolic geometries.  ...  ACKNOWLEDGMENT This work is partly funded by the European Commission through the EULER project (Grant 258307), part of the Future Internet Research and Experimentation (FIRE) objective of the Seventh  ...
doi:10.1109/scvt.2014.7046722 dblp:conf/scvt/TavernierSCPD14 fatcat:l2iggmmrpbdfdbxgbthaamakiu
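
A small sketch of coordinate-based content server selection, assuming every replica and client is embedded in the coordinate space the routing system infers; swapping in a hyperbolic distance would cover the alternate geometries the snippet mentions. The coordinates and names are made up.
```python
import math

def distance(a, b):
    # Euclidean here; a hyperbolic metric would support alternate geometric systems.
    return math.dist(a, b)

def select_server(client_coord, replicas):
    """Pick the replica holding the content that is geometrically closest."""
    return min(replicas, key=lambda r: distance(client_coord, r[1]))

replicas = [("cache-1", (0.1, 0.9)), ("cache-2", (0.7, 0.2)), ("cache-3", (0.4, 0.5))]
print(select_server((0.6, 0.3), replicas)[0])  # cache-2 is nearest to this client
```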

Unified performance and power modeling of scientific workloads

Shuaiwen Leon Song, Kevin Barker, Darren Kerbyson
2013 Proceedings of the 1st International Workshop on Energy Efficient Supercomputing - E2SC '13  
doi:10.1145/2536430.2536435 dblp:conf/sc/SongBK13 fatcat:al4dkkcccrettiv3cmaacktety

Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems [article]

Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia
2023 arXiv   pre-print
system optimizations.  ...  In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data.  ...  A concurrent work [312] jointly optimizes model multiplexing and query caching and also analyzes the optimality of minimizing inference cost.  ... 
arXiv:2312.15234v1 fatcat:4g62xtm4e5futizreet7myp66a

A Survey of Serverless Machine Learning Model Inference [article]

Kamil Kojs
2023 arXiv   pre-print
Large machine learning models often demand GPU resources for efficient inference to meet SLOs.  ...  This survey aims to summarize and categorize the emerging challenges and optimization opportunities for large-scale deep learning serving systems.  ...  A similar approach implemented in [38] employs a caching mechanism for machine learning models in GPU memory, enhancing model inference efficiency and optimizing GPU memory management.  ... 
arXiv:2311.13587v1 fatcat:srlgwozhhrfbvnpcwaezwh465y
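
For the GPU model caching mechanism the snippet points to, here is an illustrative LRU cache keyed by model name with a memory budget; the `load_model` hook, sizes, and eviction policy are assumptions for the sketch, not any specific system's design.
```python
from collections import OrderedDict

class GPUModelCache:
    """Keep recently used models resident; evict LRU entries past a memory budget."""
    def __init__(self, capacity_gb: float, load_model):
        self.capacity_gb = capacity_gb
        self.load_model = load_model        # callable: name -> (model, size_gb)
        self.cache = OrderedDict()          # name -> (model, size_gb), LRU order

    def get(self, name: str):
        if name in self.cache:
            self.cache.move_to_end(name)    # cache hit: mark most recently used
            return self.cache[name][0]
        model, size_gb = self.load_model(name)   # cache miss: load into GPU memory
        self.cache[name] = (model, size_gb)
        while sum(s for _, s in self.cache.values()) > self.capacity_gb and len(self.cache) > 1:
            self.cache.popitem(last=False)  # evict the least recently used model
        return model

cache = GPUModelCache(capacity_gb=24.0, load_model=lambda n: (f"<{n} weights>", 10.0))
for name in ["bert-large", "resnet50", "llama-7b"]:
    cache.get(name)
print(list(cache.cache))  # the oldest model was evicted to stay within 24 GB
```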

PACSET (Packed Serialized Trees): Reducing Inference Latency for Tree Ensemble Deployment [article]

Meghana Madhyastha, Kunal Lillaney, James Browne, Joshua Vogelstein, Randal Burns
2020 arXiv   pre-print
We present methods to serialize and deserialize tree ensembles that optimize inference latency when models are not already loaded into memory.  ...  The layout interleaves correlated nodes across multiple trees, uses leaf cardinality to collocate the nodes on the most popular paths and is optimized for the I/O blocksize.  ...  This is a straightforward goal, but it diverges from existing systems that are optimized for large batches and load the entire model into memory.  ... 
arXiv:2011.05383v1 fatcat:suklohptlveffpklgeom5gte7e
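
A rough sketch of the layout idea behind this entry: order ensemble nodes so the most frequently visited ones land in the earliest I/O blocks. The visit-count input, 32-byte node size, and 4 KiB block size are illustrative assumptions, not PACSET's actual serialization format.
```python
BLOCK_BYTES = 4096   # assumed I/O block size
NODE_BYTES = 32      # assumed serialized size of one tree node

def pack_nodes(visit_counts):
    """visit_counts: {(tree_id, node_id): count}. Returns blocks of node ids,
    with the hottest nodes (most popular paths) packed into the earliest blocks."""
    hot_first = sorted(visit_counts, key=visit_counts.get, reverse=True)
    per_block = BLOCK_BYTES // NODE_BYTES
    return [hot_first[i:i + per_block] for i in range(0, len(hot_first), per_block)]

# Shallow nodes are visited most often, so the roots of all trees end up in block 0.
counts = {(t, n): 1000 // (n + 1) for t in range(3) for n in range(300)}
blocks = pack_nodes(counts)
print(len(blocks), blocks[0][:3])
```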

Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference [article]

Haiyang Huang, Newsha Ardalani, Anna Sun, Liu Ke, Hsien-Hsin S. Lee, Anjali Sridhar, Shruti Bhosale, Carole-Jean Wu, Benjamin Lee
2023 arXiv   pre-print
However, deploying such models for inference is difficult due to their large model size and complex communication pattern.  ...  We show that dynamic gating improves execution time by 1.25-4× for LM, 2-5× for MT Encoder and 1.09-1.5× for MT Decoder. It also reduces memory usage by up to 1.36× for LM and up to 1.1× for MT.  ...  Comparison of MoE and Dense models on inference latency.  ... 
arXiv:2303.06182v1 fatcat:r5bjcac6ijg43a3mx24tx2ujtu
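
For readers unfamiliar with the gating the entry refers to, a minimal single-device top-k MoE layer is sketched below; the dimensions, expert count, and k are arbitrary, and this dense loop stands in for the distributed all-to-all communication the paper actually targets.
```python
import torch
import torch.nn as nn

class TopKGate(nn.Module):
    """Route each token to its top-k experts and mix their outputs."""
    def __init__(self, d_model=64, num_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_experts))
        self.k = k

    def forward(self, x):                          # x: (tokens, d_model)
        scores = self.router(x).softmax(dim=-1)    # per-token routing probabilities
        weights, idx = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                 # dispatch tokens to their experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

print(TopKGate()(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```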

Transformer-based Planning for Symbolic Regression [article]

Parshin Shojaee, Kazem Meidani, Amir Barati Farimani, Chandan K. Reddy
2023 arXiv   pre-print
Recent advancements in SR have demonstrated the effectiveness of pre-trained transformer-based models in generating equations as sequences, leveraging large-scale pre-training on synthetic datasets and  ...  However, these models primarily rely on supervised pre-training goals borrowed from text generation and overlook equation discovery objectives like accuracy and complexity.  ...  To achieve this, we utilize Monte Carlo Tree Search (MCTS) during inference time to guide the decoder towards optimal solutions for fitting and complexity objectives (as shown in Figure 2(c) ).  ... 
arXiv:2303.06833v5 fatcat:bkq2voqddnef3cmiqajk3ryfim
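
A toy illustration of the two objectives the planner balances, fitting accuracy and expression complexity, scored over a handful of hand-written candidate equations. The reward weighting, candidates, and data are made up, and the actual method searches decoder token sequences with MCTS rather than ranking a fixed list.
```python
import numpy as np

def reward(expr_fn, complexity, X, y, lam=0.05):
    """Higher is better: goodness of fit minus a complexity penalty."""
    mse = float(np.mean((expr_fn(X) - y) ** 2))
    return 1.0 / (1.0 + mse) - lam * complexity

X = np.linspace(-2, 2, 100)
y = X**2 + 0.1 * np.random.randn(100)

candidates = {
    "x**2":     (lambda x: x**2, 3),
    "x**2 + x": (lambda x: x**2 + x, 5),
    "sin(x)":   (np.sin, 2),
}
best = max(candidates, key=lambda name: reward(*candidates[name], X, y))
print(best)  # "x**2": accurate fit at low complexity
```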

Clipper: A Low-Latency Online Prediction Serving System [article]

Daniel Crankshaw, Xin Wang, Giulio Zhou, Michael J. Franklin, Joseph E. Gonzalez, Ion Stoica
2017 arXiv   pre-print
Furthermore, by introducing caching, batching, and adaptive model selection techniques, Clipper reduces prediction latency and improves prediction throughput, accuracy, and robustness without modifying  ...  Finally, we compare Clipper to the TensorFlow Serving system and demonstrate that we are able to achieve comparable throughput and latency while enabling model composition and online learning to improve  ...  Acknowledgments We would like to thank Peter Bailis, Alexey Tumanov, Noah Fiedel, Chris Olston, our shepherd Mike Dahlin, and the anonymous reviewers for their feedback.  ... 
arXiv:1612.03079v2 fatcat:fe2w5dhxsnazhd62gxcx4ielpu
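
To make the adaptive model selection concrete, here is a simple epsilon-greedy selector over deployed models updated from delayed feedback; it stands in for, and is not, Clipper's actual bandit algorithm, and the model names and reward definition are placeholders.
```python
import random

class AdaptiveSelector:
    """Epsilon-greedy selection among deployed models based on observed feedback."""
    def __init__(self, model_names, epsilon=0.1):
        self.scores = {m: 0.0 for m in model_names}   # running reward estimates
        self.counts = {m: 0 for m in model_names}
        self.epsilon = epsilon

    def choose(self) -> str:
        if random.random() < self.epsilon:            # explore occasionally
            return random.choice(list(self.scores))
        return max(self.scores, key=self.scores.get)  # otherwise exploit the best

    def feedback(self, model: str, reward: float):
        """reward: e.g. 1.0 if the served prediction was later confirmed correct."""
        self.counts[model] += 1
        self.scores[model] += (reward - self.scores[model]) / self.counts[model]

selector = AdaptiveSelector(["sklearn_rf", "tf_cnn", "xgboost"])
chosen = selector.choose()
selector.feedback(chosen, reward=1.0)
```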

CacheNet: A Model Caching Framework for Deep Learning Inference on the Edge [article]

Yihao Fang, Shervin Manzuri Shalmani, Rong Zheng
2020 arXiv   pre-print
Inference of uncompressed large scale DNN models can only run in the cloud with extra communication latency back and forth between cloud and end devices, while compressed DNN models achieve real-time inference  ...  CacheNet caches low-complexity models on end devices and high-complexity (or full) models on edge or cloud servers.  ...  ACKNOWLEDGMENTS This work is in part supported by the Discovery Grant and Collaborative Research Development Grant from Natural Science and Engineering Council, Canada.  ... 
arXiv:2007.01793v1 fatcat:ygoeavc6d5cubbtgewylyaeb5y
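
A two-tier sketch of the device/server split the entry describes: consult the small cached model on the end device and pay the network round trip to the full model only when it is not confident. Both models here are stubs and the 0.8 threshold is an assumption.
```python
import numpy as np

def device_model(x):
    """Small cached model on the end device: returns (label, confidence). Stub."""
    probs = np.array([0.55, 0.45])        # pretend softmax output
    return int(probs.argmax()), float(probs.max())

def server_model(x):
    """Full model on the edge/cloud server (stands in for an RPC)."""
    return 1

def predict(x, threshold=0.8):
    label, conf = device_model(x)
    if conf >= threshold:
        return label                      # served locally, no network round trip
    return server_model(x)                # uncertain: pay the communication cost

print(predict(np.zeros(8)))               # confidence 0.55 < 0.8, so it falls back
```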

Accelerating Deep Learning Inference via Learned Caches [article]

Arjun Balasubramanian, Adarsh Kumar, Yuhan Liu, Han Cao, Shivaram Venkataraman, Aditya Akella
2021 arXiv   pre-print
However, traditional caching approaches incur high memory overheads and lookup latencies, leading us to design learned caches - caches that consist of simple ML models that are continuously updated.  ...  We present the design of GATI, an end-to-end prediction serving system that incorporates learned caches for low-latency DNN inference.  ...  Then, we use these as inputs to select a subset of these learned cache variants for inference.  ... 
arXiv:2101.07344v1 fatcat:cgpq66oh45g7zhi6ayhxhkspnq
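
A toy version of the "learned cache" idea: a small classifier predicts the final output from an intermediate activation, is consulted first, and is periodically refit on recent (activation, output) pairs so it stays up to date. The scikit-learn model, confidence threshold, and retrain cadence are illustrative assumptions, not GATI's design.
```python
from collections import deque
import numpy as np
from sklearn.linear_model import LogisticRegression

class LearnedCache:
    def __init__(self, threshold=0.85, buffer_size=1000, refit_every=200):
        self.model = None
        self.threshold = threshold
        self.refit_every = refit_every
        self.seen = 0
        self.buffer = deque(maxlen=buffer_size)   # recent (activation, label) pairs

    def lookup(self, activation):
        """Return a cached prediction if the cache is confident, else None."""
        if self.model is None:
            return None
        probs = self.model.predict_proba(activation.reshape(1, -1))[0]
        return int(probs.argmax()) if probs.max() >= self.threshold else None

    def record(self, activation, final_label):
        """Log the full model's answer and periodically refresh the cache."""
        self.buffer.append((activation, final_label))
        self.seen += 1
        if self.seen % self.refit_every == 0:
            X = np.stack([a for a, _ in self.buffer])
            y = np.array([label for _, label in self.buffer])
            self.model = LogisticRegression(max_iter=200).fit(X, y)
```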

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs [article]

Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao
2024 arXiv   pre-print
In this study, we introduce adaptive KV cache compression, a plug-and-play method that reduces the memory footprint of generative inference for Large Language Models (LLMs).  ...  heads centered on special tokens, and only employing the standard KV cache for attention heads that broadly attend to all tokens.  ...  These results indicate that it is suboptimal to apply the same KV cache to all layers without adaptation, and that it is beneficial to detect the structure of each attention head so as to select the optimal  ... 
arXiv:2310.01801v3 fatcat:3p7h6idxl5dqnbth73ytw2wq3i
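
An illustrative per-head pruning rule in the spirit of the adaptive compression the entry describes: if the attention mass on special plus recent tokens already covers a head, keep only those cache entries, otherwise keep the full cache. The profiling signal, thresholds, and single-head shapes are simplified assumptions.
```python
import torch

def compress_kv(keys, values, attn, special_idx, recent=32, coverage=0.9):
    """keys/values: (seq, d). attn: (seq,) average attention this head places on
    each cached position. Keep only special+recent entries if they cover the head."""
    seq = keys.shape[0]
    keep = torch.zeros(seq, dtype=torch.bool)
    keep[special_idx] = True
    keep[-recent:] = True
    if attn[keep].sum() >= coverage:          # special + recent explain this head
        return keys[keep], values[keep]       # compressed cache
    return keys, values                       # broadly-attending head: keep everything

seq, d = 512, 64
attn = torch.full((seq,), 1e-4)
attn[0] = 0.5                                 # mass concentrated on the special token
attn[-32:] = 0.45 / 32                        # mass on recent tokens
ck, cv = compress_kv(torch.randn(seq, d), torch.randn(seq, d), attn, torch.tensor([0]))
print(ck.shape)  # torch.Size([33, 64]): BOS plus the 32 most recent positions
```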

ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching [article]

Youpeng Zhao, Di Wu, Jun Wang
2024 arXiv   pre-print
The Transformer architecture has significantly advanced natural language processing (NLP) and has been foundational in developing large language models (LLMs) such as LLaMA and OPT, which have come to  ...  Thanks to the autoregressive characteristic of LLM inference, KV caching for the attention layers in Transformers can effectively accelerate LLM inference by substituting quadratic-complexity computation  ...  DeepSpeed-ZeRO is a deep learning optimization software developed to improve the computation and memory efficiency of training and inference for large models.  ... 
arXiv:2403.17312v1 fatcat:ctmbmyq7kfgdtp6bu3l3hkfjri
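
A minimal sketch of sparsity-aware KV pruning: keep the cached tokens that received the most attention plus the newest ones, bounding the cache that autoregressive decoding otherwise grows with sequence length. The budget, the always-keep window, and the single-head shapes are assumptions for illustration, not ALISA's algorithm.
```python
import torch

def sparsify_kv(keys, values, attn_weights, budget=128, keep_recent=16):
    """keys/values: (seq, d); attn_weights: (seq,) attention mass per cached token.
    Retain the newest tokens plus the highest-attention older tokens."""
    seq = keys.shape[0]
    if seq <= budget:
        return keys, values
    newest = torch.arange(seq - keep_recent, seq)
    topk = attn_weights[: seq - keep_recent].topk(budget - keep_recent).indices
    keep = torch.cat([topk, newest]).sort().values    # preserve positional order
    return keys[keep], values[keep]

seq, d = 1024, 64
ck, cv = sparsify_kv(torch.randn(seq, d), torch.randn(seq, d), torch.rand(seq))
print(ck.shape)  # torch.Size([128, 64]): the cache is capped at the budget
```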
Showing results 1 — 15 out of 18,744 results