Paper Title
SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing
Paper Authors
Paper Abstract
Mixture-of-Experts (MoE) parallelism is a recent advancement that scales up the model size with constant computational cost. MoE selects different sets of parameters (i.e., experts) for each incoming token, resulting in a sparsely-activated model. Despite several successful applications of MoE, its training efficiency degrades significantly as the number of experts increases. The routing stage in MoE relies on the efficiency of the All2All communication collective, which suffers from network congestion and has poor scalability. To mitigate these issues, we introduce SMILE, which exploits heterogeneous network bandwidth and splits single-step routing into bi-level routing. Our experimental results show that the proposed method obtains a 2.5x speedup over Switch Transformer in terms of pretraining throughput on the Colossal Clean Crawled Corpus without losing any convergence speed.
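To make the bi-level idea concrete, below is a minimal sketch (not the authors' implementation) contrasting flat top-1 routing, as in Switch Transformer, with a hypothetical two-stage scheme in the spirit of SMILE: each token is first routed to a node (a group of experts reachable over fast intra-node links) and then to one expert inside that node, so the expensive cross-node exchange scales with the number of nodes rather than the number of experts. All shapes, gate weights, and variable names here are illustrative assumptions.

```python
# Toy comparison of flat vs. bi-level (hierarchical) top-1 routing.
import numpy as np

rng = np.random.default_rng(0)

d_model = 8               # token embedding size (toy value)
num_nodes = 4             # level-1 routing targets (e.g., machines)
experts_per_node = 2      # level-2 routing targets inside each node
num_experts = num_nodes * experts_per_node

tokens = rng.standard_normal((16, d_model))        # a batch of 16 tokens

# Flat routing: one gate over all experts -> one All2All spanning every node.
w_flat = rng.standard_normal((d_model, num_experts))
flat_choice = (tokens @ w_flat).argmax(axis=1)      # chosen expert id per token

# Bi-level routing: a small gate picks the node, then a per-node gate picks
# the expert, keeping the cross-node communication fan-out at num_nodes.
w_node = rng.standard_normal((d_model, num_nodes))
w_expert = rng.standard_normal((num_nodes, d_model, experts_per_node))

node_choice = (tokens @ w_node).argmax(axis=1)      # level 1: which node
local_choice = np.array([
    (tok @ w_expert[node]).argmax()                 # level 2: expert within that node
    for tok, node in zip(tokens, node_choice)
])
bi_level_choice = node_choice * experts_per_node + local_choice

print("flat routing    :", flat_choice)
print("bi-level routing:", bi_level_choice)
```

In a real distributed setting the level-1 step would dispatch tokens across nodes over the slower inter-node links, while the level-2 step would stay on fast intra-node bandwidth, which is how heterogeneous network bandwidth can be exploited; the NumPy sketch only mirrors the routing logic, not the communication.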