Accelerating Distributed MoE Training and Inference with Lina
by
Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, Hong Xu
2024
Abstract
Scaling model parameters improves model quality at the price of high
computation overhead. Sparsely activated models, usually in the form of Mixture
of Experts (MoE) architecture, have sub-linear scaling of computation cost with
model size, thus providing opportunities to train and serve a larger model at
lower cost than their dense counterparts. However, distributed MoE training and
inference are inefficient, mainly due to the all-to-all communication
interleaved with model computation. This paper makes two main
contributions. First, we systematically analyze the all-to-all overhead in
distributed MoE and identify the main reasons it becomes the bottleneck in
training and inference, respectively. Second, we design and build Lina to
address the all-to-all bottleneck head-on. Lina opportunistically prioritizes
all-to-all over the concurrent allreduce whenever feasible using tensor
partitioning, so that both all-to-all latency and training step time are improved. Lina further
exploits the inherent pattern of expert selection to dynamically schedule
resources during inference, so that the transfer size and bandwidth of
all-to-all across devices are balanced amid the highly skewed expert popularity
in practice. Experiments on an A100 GPU testbed show that Lina reduces the
training step time by up to 1.73x and the 95th-percentile inference time by an
average of 1.63x over state-of-the-art systems.
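The inference-side idea in the abstract, balancing all-to-all transfer size across devices despite highly skewed expert popularity, can be pictured with a small scheduling sketch. The sketch below is only a hedged illustration in Python, not Lina's actual scheduler: the greedy longest-processing-time placement, the two-replica rule for the hottest experts, and names such as schedule_experts and popularity are assumptions made for illustration.

```python
# Hypothetical sketch: popularity-aware expert placement for distributed MoE
# inference. NOT Lina's actual algorithm; it only shows how balancing
# per-device token load can even out all-to-all transfer sizes when a few
# experts receive most of the tokens.
from heapq import heappush, heappop

def schedule_experts(popularity, num_devices, hot_replicas=2):
    """Place experts (replicating the hottest ones) on devices so that the
    tokens each device receives during all-to-all are roughly balanced."""
    hottest = sorted(popularity, key=popularity.get, reverse=True)[:hot_replicas]
    shards = []  # (token_load, expert_id, replica_index)
    for expert, load in popularity.items():
        replicas = 2 if expert in hottest else 1  # assumed replication rule
        shards.extend((load / replicas, expert, r) for r in range(replicas))

    # Greedy placement: always give the next-largest shard to the device
    # with the smallest accumulated token load so far.
    heap = [(0.0, d) for d in range(num_devices)]  # (device_load, device_id)
    placement = {d: [] for d in range(num_devices)}
    for load, expert, replica in sorted(shards, reverse=True):
        dev_load, dev = heappop(heap)
        placement[dev].append((expert, replica))
        heappush(heap, (dev_load + load, dev))
    return placement, {d: l for l, d in heap}

if __name__ == "__main__":
    # Highly skewed popularity: expert 0 receives most of the tokens.
    popularity = {0: 4000, 1: 600, 2: 500, 3: 450, 4: 300, 5: 150}
    placement, loads = schedule_experts(popularity, num_devices=4)
    for dev in sorted(placement):
        print(f"device {dev}: tokens={loads[dev]:.0f}, experts={placement[dev]}")
```

Under these assumptions, the hottest expert's tokens are split across two devices, so no single device dominates the all-to-all receive volume; the paper's actual mechanism additionally adapts placement dynamically as popularity shifts.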
arXiv: 2210.17223v2