GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (Paper Explained)

Google builds a 600 billion parameter transformer to do massively multilingual, massive machine translation. Interestingly, the larger model scale does not come from increasing the depth of the transformer, but from increasing the width of the feedforward layers, combined with hard routing to parallelize computation across up to 2048 TPUs. A very detailed engineering paper!

OUTLINE:
0:00 - Intro & Overview
4:10 - Main Results
5:10 - Mixture-of-Experts
16:00 - Difference to Scaling Classic Transformers
18:50 - Backp
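To make the "width via hard routing" idea concrete, here is a minimal NumPy sketch of a Mixture-of-Experts feedforward layer with top-k gating. It is illustrative only, not the paper's TensorFlow/GShard implementation; all names, shapes, and the simple softmax gate are assumptions, and the per-expert loop stands in for what GShard actually runs in parallel across devices.

```python
import numpy as np

def moe_feedforward(x, gate_w, expert_w1, expert_w2, top_k=2):
    """Simplified Mixture-of-Experts feedforward layer with hard top-k routing.

    x:         [num_tokens, d_model]        token representations
    gate_w:    [d_model, num_experts]       gating weights
    expert_w1: [num_experts, d_model, d_ff] first expert projection
    expert_w2: [num_experts, d_ff, d_model] second expert projection
    """
    # Gating: each token scores every expert, then keeps only the top-k.
    logits = x @ gate_w                                    # [tokens, experts]
    probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
    top_experts = np.argsort(-probs, axis=-1)[:, :top_k]   # hard routing decision

    out = np.zeros_like(x)
    for e in range(gate_w.shape[1]):
        # Tokens routed to expert e (in GShard each expert sits on its own
        # device, so these per-expert computations run in parallel).
        mask = (top_experts == e).any(axis=-1)
        if not mask.any():
            continue
        h = np.maximum(x[mask] @ expert_w1[e], 0.0)        # expert FFN with ReLU
        # Combine expert outputs weighted by the gate probability.
        out[mask] += probs[mask, e:e + 1] * (h @ expert_w2[e])
    return out

# Tiny usage example with random weights (shapes only, no trained model).
rng = np.random.default_rng(0)
tokens, d_model, d_ff, experts = 8, 16, 64, 4
x = rng.normal(size=(tokens, d_model))
y = moe_feedforward(
    x,
    gate_w=rng.normal(size=(d_model, experts)),
    expert_w1=rng.normal(size=(experts, d_model, d_ff)),
    expert_w2=rng.normal(size=(experts, d_ff, d_model)),
)
print(y.shape)  # (8, 16)
```

The key point the sketch tries to capture: every token only ever touches a couple of experts, so adding more experts widens the model's total parameter count without increasing the compute per token.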