Accelerating Distributed MoE Training and Inference with Lina
by
Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, Hong Xu
2024
Abstract
Scaling model parameters improves model quality at the price of high
computation overhead. Sparsely activated models, usually in the form of Mixture
of Experts (MoE) architecture, have sub-linear scaling of computation cost with
model size, thus providing opportunities to train and serve a larger model at
lower cost than their dense counterparts. However, distributed MoE training and
inference are inefficient, mainly due to the all-to-all communication
interleaved with model computation. This paper makes two main
contributions. First, we systematically analyze the all-to-all overhead in
distributed MoE and identify the main reasons it becomes the bottleneck in
training and inference, respectively. Second, we design and build Lina to
address the all-to-all bottleneck head-on. Lina opportunistically prioritizes
all-to-all over the concurrent allreduce whenever feasible using tensor
partitioning, so that both all-to-all latency and training step time are improved. Lina further
exploits the inherent pattern of expert selection to dynamically schedule
resources during inference, so that the transfer size and bandwidth of
all-to-all across devices are balanced amid the highly skewed expert popularity
in practice. Experiments on an A100 GPU testbed show that Lina reduces the
training step time by up to 1.73x and the 95th-percentile inference time by an
average of 1.63x over state-of-the-art systems.
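The inference-side idea in the abstract, balancing all-to-all transfer size across devices despite highly skewed expert popularity, can be pictured with a small scheduling sketch. The sketch below is only a hedged illustration in Python, not Lina's actual scheduler: the greedy longest-processing-time placement, the two-replica rule for the hottest experts, and names such as schedule_experts and popularity are assumptions made for illustration.

```python
# Hypothetical sketch: popularity-aware expert placement for distributed MoE
# inference. NOT Lina's actual algorithm; it only shows how balancing
# per-device token load can even out all-to-all transfer sizes when a few
# experts receive most of the tokens.
from heapq import heappush, heappop

def schedule_experts(popularity, num_devices, hot_replicas=2):
    """Place experts (replicating the hottest ones) on devices so that the
    tokens each device receives during all-to-all are roughly balanced."""
    hottest = sorted(popularity, key=popularity.get, reverse=True)[:hot_replicas]
    shards = []  # (token_load, expert_id, replica_index)
    for expert, load in popularity.items():
        replicas = 2 if expert in hottest else 1  # assumed replication rule
        shards.extend((load / replicas, expert, r) for r in range(replicas))

    # Greedy placement: always give the next-largest shard to the device
    # with the smallest accumulated token load so far.
    heap = [(0.0, d) for d in range(num_devices)]  # (device_load, device_id)
    placement = {d: [] for d in range(num_devices)}
    for load, expert, replica in sorted(shards, reverse=True):
        dev_load, dev = heappop(heap)
        placement[dev].append((expert, replica))
        heappush(heap, (dev_load + load, dev))
    return placement, {d: l for l, d in heap}

if __name__ == "__main__":
    # Highly skewed popularity: expert 0 receives most of the tokens.
    popularity = {0: 4000, 1: 600, 2: 500, 3: 450, 4: 300, 5: 150}
    placement, loads = schedule_experts(popularity, num_devices=4)
    for dev in sorted(placement):
        print(f"device {dev}: tokens={loads[dev]:.0f}, experts={placement[dev]}")
```

Under these assumptions, the hottest expert's tokens are split across two devices, so no single device dominates the all-to-all receive volume; the paper's actual mechanism additionally adapts placement dynamically as popularity shifts.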
arXiv: 2210.17223v2