ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models
[article]
2024
arXiv
pre-print
This paper presents ServerlessLLM, a locality-enhanced serverless inference system for Large Language Models (LLMs). ...
… and (iii) locality-aware server allocation, enabling ServerlessLLM to evaluate the status of each server in a cluster and effectively schedule model startup time to capitalize on local checkpoint placement ...
For selecting the optimal server for model migration, ServerlessLLM employs a dynamic programming approach to minimize migration time.
Practical Concerns: Selecting best servers. ...
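The locality-aware allocation idea translates naturally into a small scoring routine. The sketch below is a minimal illustration under stated assumptions, not ServerlessLLM's actual scheduler (which, per the snippet above, uses dynamic programming for migration decisions): the `Server` fields and the tier bandwidths are invented for illustration, and the sketch simply picks the server with the lowest estimated startup time given where the checkpoint currently lives.

```python
from dataclasses import dataclass

# Assumed bandwidths (GB/s) per checkpoint tier; illustrative numbers only.
TIER_BW = {"host_mem": 20.0, "local_ssd": 3.0, "remote": 0.5}

@dataclass
class Server:
    name: str
    tier: str             # fastest tier holding this model's checkpoint
    queue_delay_s: float  # time until a GPU frees up on this server

def startup_time(server: Server, model_gb: float) -> float:
    """Estimated startup time = wait for a free GPU + checkpoint load time."""
    load = 0.0 if server.tier == "gpu" else model_gb / TIER_BW[server.tier]
    return server.queue_delay_s + load

def pick_server(servers, model_gb):
    """Locality-aware allocation: prefer servers where the checkpoint is
    already resident in a fast tier, trading that off against queueing."""
    return min(servers, key=lambda s: startup_time(s, model_gb))

servers = [Server("a", "remote", 0.0), Server("b", "local_ssd", 2.0)]
print(pick_server(servers, 26.0).name)  # "b": 2s wait + ~8.7s load beats ~52s download
```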
arXiv:2401.14351v1
fatcat:amjnmrzjx5cnnh7lbnzw3i3f3a
Toward application-specific memory reconfiguration for energy efficiency
2013
Proceedings of the 1st International Workshop on Energy Efficient Supercomputing - E2SC '13
… and 3) Can we use application characterization to automatically select an energy-optimal memory hierarchy configuration? ...
Finally, as a first step towards automatic reconfiguration, we explore application characterization via reuse distance as a guide to select the best memory hierarchy configuration; we show that reuse distance ...
The contents do not necessarily reflect the position or the policy of the United States Government, and no official endorsement should be inferred. ...
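Reuse distance, the metric the snippet leans on, is the number of distinct addresses touched between two accesses to the same address; a histogram of these distances predicts hit rates across cache sizes, which is what makes it a guide for picking a memory hierarchy configuration. A naive O(n²) sketch (function and variable names are illustrative assumptions):

```python
def reuse_distances(trace):
    """For each access, count the distinct addresses referenced since the
    previous access to the same address; first-time accesses get infinity."""
    last_seen, out = {}, []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            out.append(len(set(trace[last_seen[addr] + 1 : i])))
        else:
            out.append(float("inf"))  # cold miss
        last_seen[addr] = i
    return out

# A fully associative LRU cache of size C hits every access whose reuse
# distance is below C, so the histogram doubles as a hit-rate model.
print(reuse_distances(["a", "b", "c", "a", "b"]))  # [inf, inf, inf, 2, 2]
```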
doi:10.1145/2536430.2536434
dblp:conf/sc/CicottiCC13
fatcat:ssw2vucenzdm7fk2452j5p4z3i
Improving the Performance of DNN-based Software Services using Automated Layer Caching
[article]
2022
arXiv
pre-print
The proposed solution is an automated online layer caching mechanism that allows early exiting of a large model during inference time if the cache model in one of the early exits is confident enough for ...
However, the trend in DNN design is toward larger models with many layers and parameters to achieve more accurate results. ...
The work of Pooyan Jamshidi has been partially supported by NSF (Awards 2007202, 2107463, and 2233873) and NASA (Award 80NSSC20K1720). ...
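The early-exit mechanism the abstract describes can be sketched in a few lines: run the backbone stage by stage, and after each stage let a small cached exit head classify; if its confidence clears a threshold, skip the remaining layers. Everything below (the names, the top-softmax confidence test, the threshold value) is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def infer_with_early_exit(blocks, exit_heads, x, threshold=0.9):
    """blocks: backbone stages; exit_heads: one small classifier per stage.
    Return the first prediction whose top-class probability clears the
    threshold, recording how deep inference had to go."""
    probs = None
    for depth, (block, head) in enumerate(zip(blocks, exit_heads)):
        x = block(x)
        probs = softmax(head(x))
        if probs.max() >= threshold:
            return int(probs.argmax()), depth        # early exit: cheap path
    return int(probs.argmax()), len(blocks) - 1      # fell through to full model
```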
arXiv:2209.08625v1
fatcat:qkeq5q5pnjhp7lryjbympig7cm
Towards Content-Centric Geometric Routing
2014
2014 IEEE 21st Symposium on Communications and Vehicular Technology in the Benelux (SCVT)
In this paper we propose the use of a routing system-inferred coordinate system to improve: i) content server selection upon receiving content requests, and ii) the mapping of content to servers/caches ...
The proposed approach can be further extended in order to include alternate geometric systems for example supporting hyperbolic geometries. ...
ACKNOWLEDGMENT This work is partly funded by the European Commission through the EULER project (Grant 258307), part of the Future Internet Research and Experimentation (FIRE) objective of the Seventh ...
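Server selection with routing-system-inferred coordinates reduces to a nearest-neighbor test in the chosen geometry. The sketch below uses the Poincaré disk distance to cover the hyperbolic extension the authors mention; the coordinate assignments and the `servers` structure are assumptions made for illustration.

```python
import math

def poincare_distance(u, v):
    """Hyperbolic distance between two points inside the unit disk
    (Poincaré disk model)."""
    diff = sum((a - b) ** 2 for a, b in zip(u, v))
    den = (1 - sum(a * a for a in u)) * (1 - sum(b * b for b in v))
    return math.acosh(1 + 2 * diff / den)

def select_server(request_coord, servers):
    """Route the content request to the cache whose inferred coordinate
    is geometrically closest to the requester."""
    return min(servers, key=lambda name: poincare_distance(request_coord, servers[name]))

servers = {"cache-A": (0.1, 0.2), "cache-B": (0.7, 0.1)}
print(select_server((0.6, 0.0), servers))  # cache-B
```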
doi:10.1109/scvt.2014.7046722
dblp:conf/scvt/TavernierSCPD14
fatcat:l2iggmmrpbdfdbxgbthaamakiu
Unified performance and power modeling of scientific workloads
2013
Proceedings of the 1st International Workshop on Energy Efficient Supercomputing - E2SC '13
doi:10.1145/2536430.2536435
dblp:conf/sc/SongBK13
fatcat:al4dkkcccrettiv3cmaacktety
Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems
[article]
2023
arXiv
pre-print
… system optimizations. ...
In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data. ...
A concurrent work [312] jointly optimizes model multiplexing and query caching and also analyzes the optimality of minimizing inference cost. ...
arXiv:2312.15234v1
fatcat:4g62xtm4e5futizreet7myp66a
A Survey of Serverless Machine Learning Model Inference
[article]
2023
arXiv
pre-print
Large machine learning models often demand GPU resources for efficient inference to meet SLOs. ...
This survey aims to summarize and categorize the emerging challenges and optimization opportunities for large-scale deep learning serving systems. ...
A similar approach implemented in [38] employs a caching mechanism for machine learning models in GPU memory, enhancing model inference efficiency and optimizing GPU memory management. ...
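The GPU-memory model cache mentioned in that last snippet is, at its core, an LRU keyed by model with eviction driven by a memory budget. A minimal sketch follows; the class, its API, and the byte accounting are assumptions, not the surveyed system's interface.

```python
from collections import OrderedDict

class GPUModelCache:
    """Keep recently used models resident in GPU memory, evicting the
    least recently used model when the memory budget would be exceeded."""
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.models = OrderedDict()  # model_id -> (model, size_bytes)

    def get(self, model_id, loader, size_bytes):
        if model_id in self.models:
            self.models.move_to_end(model_id)            # mark as recently used
            return self.models[model_id][0]
        while self.used + size_bytes > self.capacity and self.models:
            _, (_, freed) = self.models.popitem(last=False)  # evict LRU entry
            self.used -= freed
        model = loader(model_id)                          # e.g. copy weights to GPU
        self.models[model_id] = (model, size_bytes)
        self.used += size_bytes
        return model
```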
arXiv:2311.13587v1
fatcat:srlgwozhhrfbvnpcwaezwh465y
PACSET (Packed Serialized Trees): Reducing Inference Latency for Tree Ensemble Deployment
[article]
2020
arXiv
pre-print
We present methods to serialize and deserialize tree ensembles that optimize inference latency when models are not already loaded into memory. ...
The layout interleaves correlated nodes across multiple trees, uses leaf cardinality to collocate the nodes on the most popular paths and is optimized for the I/O blocksize. ...
This is a straightforward goal, but it diverges from existing systems that are optimized for large batches and load the entire model into memory. ...
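The layout idea (collocate the nodes on popular paths and align to the I/O block size) can be illustrated with a deliberately simplified packer. The real PACSET layout also interleaves correlated nodes across trees, which this sketch omits; the `freq` field stands in for observed path popularity and is an assumption.

```python
def pack_hot_first(nodes, block_bytes=4096, node_bytes=32):
    """Order nodes by how often inference paths visit them, then group
    them into I/O-block-sized runs so the hottest nodes are read with
    the fewest block fetches. A simplification of the paper's layout."""
    per_block = block_bytes // node_bytes
    hot_first = sorted(nodes, key=lambda n: n["freq"], reverse=True)
    return [hot_first[i:i + per_block] for i in range(0, len(hot_first), per_block)]
```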
arXiv:2011.05383v1
fatcat:suklohptlveffpklgeom5gte7e
Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference
[article]
2023
arXiv
pre-print
However, deploying such models for inference is difficult due to their large model size and complex communication pattern. ...
We show that dynamic gating improves execution time by 1.25-4× for LM, 2-5× for MT Encoder and 1.09-1.5× for MT Decoder. It also reduces memory usage by up to 1.36× for LM and up to 1.1× for MT. ...
Comparison of MoE and Dense models on inference latency. ...
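Dynamic gating, as measured above, amounts to routing each input only to its top-k experts so that unselected experts cost nothing at inference time. A toy sketch with numpy; the function names and the renormalization choice are assumptions.

```python
import numpy as np

def top_k_gate(gate_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    idx = np.argsort(gate_logits)[-k:]
    w = np.exp(gate_logits[idx] - gate_logits[idx].max())
    return idx, w / w.sum()

def moe_forward(x, experts, gate_logits, k=2):
    """Only the selected experts run, so compute scales with k rather
    than with the total number of experts."""
    idx, w = top_k_gate(gate_logits, k)
    return sum(wi * experts[i](x) for wi, i in zip(w, idx))
```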
arXiv:2303.06182v1
fatcat:r5bjcac6ijg43a3mx24tx2ujtu
Transformer-based Planning for Symbolic Regression
[article]
2023
arXiv
pre-print
Recent advancements in SR have demonstrated the effectiveness of pre-trained transformer-based models in generating equations as sequences, leveraging large-scale pre-training on synthetic datasets and ...
However, these models primarily rely on supervised pre-training goals borrowed from text generation and overlook equation discovery objectives like accuracy and complexity. ...
To achieve this, we utilize Monte Carlo Tree Search (MCTS) during inference time to guide the decoder towards optimal solutions for fitting and complexity objectives (as shown in Figure 2(c) ). ...
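Inference-time MCTS of the kind described there can be sketched as a standard UCT loop over partial token sequences. This is illustrative only, not the paper's planner: `actions`, `rollout`, and `reward` are assumed callbacks, with `reward` standing in for the fitting/complexity objectives.

```python
import math, random

def uct_decode(root, actions, rollout, reward, n_sims=500, c=1.4):
    """States are tuples of tokens; actions(state) lists legal next tokens
    (empty at a terminal); rollout(state) randomly completes the sequence;
    reward scores the completion. Returns the most-visited first action."""
    N, Q = {}, {}  # visit counts / mean rewards per (state, action) edge
    for _ in range(n_sims):
        state, path = root, []
        while actions(state):
            acts = actions(state)
            untried = [a for a in acts if (state, a) not in N]
            if untried:                  # expand one new edge, then roll out
                a = random.choice(untried)
                path.append((state, a))
                state = state + (a,)
                break
            logn = math.log(sum(N[(state, x)] for x in acts))
            a = max(acts, key=lambda x: Q[(state, x)]
                    + c * math.sqrt(logn / N[(state, x)]))
            path.append((state, a))
            state = state + (a,)
        r = reward(rollout(state))       # simulate a completion and score it
        for edge in path:                # back up the reward along the path
            n = N.get(edge, 0)
            Q[edge] = (Q.get(edge, 0.0) * n + r) / (n + 1)
            N[edge] = n + 1
    return max(actions(root), key=lambda a: N.get((root, a), 0))
```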
arXiv:2303.06833v5
fatcat:bkq2voqddnef3cmiqajk3ryfim
Clipper: A Low-Latency Online Prediction Serving System
[article]
2017
arXiv
pre-print
Furthermore, by introducing caching, batching, and adaptive model selection techniques, Clipper reduces prediction latency and improves prediction throughput, accuracy, and robustness without modifying ...
Finally, we compare Clipper to the TensorFlow Serving system and demonstrate that we are able to achieve comparable throughput and latency while enabling model composition and online learning to improve ...
Acknowledgments We would like to thank Peter Bailis, Alexey Tumanov, Noah Fiedel, Chris Olston, our shepherd Mike Dahlin, and the anonymous reviewers for their feedback. ...
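Clipper's adaptive model selection is bandit-style: route queries across competing models and learn from feedback which one to trust. The epsilon-greedy policy below is an illustrative stand-in under that assumption (the paper describes bandit algorithms such as Exp3; the class and names here are invented for the sketch).

```python
import random

class EpsilonGreedySelector:
    """Keep a running reward estimate per model; mostly exploit the best,
    occasionally explore, and update from observed feedback."""
    def __init__(self, model_ids, epsilon=0.1):
        self.epsilon = epsilon
        self.total = {m: 0.0 for m in model_ids}
        self.count = {m: 0 for m in model_ids}

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(list(self.total))   # explore
        return max(self.total,
                   key=lambda m: self.total[m] / max(self.count[m], 1))

    def feedback(self, model_id, reward):
        self.total[model_id] += reward
        self.count[model_id] += 1
```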
arXiv:1612.03079v2
fatcat:fe2w5dhxsnazhd62gxcx4ielpu
CacheNet: A Model Caching Framework for Deep Learning Inference on the Edge
[article]
2020
arXiv
pre-print
Inference of uncompressed large scale DNN models can only run in the cloud with extra communication latency back and forth between cloud and end devices, while compressed DNN models achieve real-time inference ...
CacheNet caches low-complexity models on end devices and high-complexity (or full) models on edge or cloud servers. ...
ACKNOWLEDGMENTS This work is in part supported by the Discovery Grant and Collaborative Research Development Grant from Natural Science and Engineering Council, Canada. ...
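The device/server split described above is a two-tier cascade: answer on the end device when the small cached model is confident, otherwise pay the round trip to the full model. A minimal sketch; the confidence test, threshold, and function names are assumptions.

```python
def cascade_infer(x, device_model, confidence, remote_infer, threshold=0.85):
    """Serve from the cached low-complexity model on the end device when
    it is confident enough; otherwise fall back to the high-complexity
    model on the edge/cloud server, incurring network latency."""
    probs = device_model(x)
    if confidence(probs) >= threshold:
        return probs, "device"           # fast path: no network round trip
    return remote_infer(x), "server"     # slow path: full model remotely
```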
arXiv:2007.01793v1
fatcat:ygoeavc6d5cubbtgewylyaeb5y
Accelerating Deep Learning Inference via Learned Caches
[article]
2021
arXiv
pre-print
However, traditional caching approaches incur high memory overheads and lookup latencies, leading us to design learned caches - caches that consist of simple ML models that are continuously updated. ...
We present the design of GATI, an end-to-end prediction serving system that incorporates learned caches for low-latency DNN inference. ...
Then, we use these as inputs to select a subset of these learned cache variants for inference. ...
arXiv:2101.07344v1
fatcat:cgpq66oh45g7zhi6ayhxhkspnq
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
[article]
2024
arXiv
pre-print
In this study, we introduce adaptive KV cache compression, a plug-and-play method that reduces the memory footprint of generative inference for Large Language Models (LLMs). ...
… heads centered on special tokens, and only employing the standard KV cache for attention heads that broadly attend to all tokens. ...
These results indicate that it is suboptimal to apply the same KV cache to all layers without adaptation, and that it is beneficial to detect the structure of each attention head so as to select the optimal ...
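The adaptive policy reads as: profile each attention head, keep the full KV cache only for heads that attend broadly, and keep a cheaper structure (a recent window, or the most-attended tokens) elsewhere. The sketch below illustrates that per-head dispatch; the policy labels, ratios, and array shapes are assumptions, not the paper's exact rules.

```python
import numpy as np

def compress_head_kv(K, V, attn_mass, policy, window=128, keep_ratio=0.3):
    """Per-head KV compression dispatch: 'full' keeps everything (broad
    heads), 'local' keeps a recent window, 'topk' keeps the tokens this
    head has attended to most (attn_mass: accumulated attention per token)."""
    if policy == "local":
        return K[-window:], V[-window:]
    if policy == "topk":
        k = max(1, int(len(K) * keep_ratio))
        idx = np.sort(np.argsort(attn_mass)[-k:])  # keep kept tokens in order
        return K[idx], V[idx]
    return K, V  # "full": heads that broadly attend to all tokens
```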
arXiv:2310.01801v3
fatcat:3p7h6idxl5dqnbth73ytw2wq3i
ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching
[article]
2024
arXiv
pre-print
The Transformer architecture has significantly advanced natural language processing (NLP) and has been foundational in developing large language models (LLMs) such as LLaMA and OPT, which have come to ...
Thanks to the autoregressive characteristic of LLM inference, KV caching for the attention layers in Transformers can effectively accelerate LLM inference by substituting quadratic-complexity computation ...
DeepSpeed-ZeRO is a deep learning optimization software developed to improve the computation and memory efficiency of training and inference for large models. ...
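The quadratic-to-linear substitution mentioned above is the standard decode-time KV cache: store keys and values for the prefix once, and let each new token attend with a single query. A bare-bones single-head sketch (shapes and names are illustrative):

```python
import numpy as np

class KVCache:
    """Append each step's key/value; a decode step then costs one query
    against the cached prefix (linear in prefix length) instead of
    recomputing full quadratic attention over all tokens."""
    def __init__(self, d):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def step(self, q, k, v):
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])
        w = np.exp(self.K @ q / np.sqrt(len(q)))  # scaled dot-product scores
        w /= w.sum()
        return w @ self.V
```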
arXiv:2403.17312v1
fatcat:ctmbmyq7kfgdtp6bu3l3hkfjri
Showing results 1 — 15 out of 18,744 results