Sparse is Enough in Scaling Transformers (aka Terraformer) | ML Research Paper Explained

#scalingtransformers #terraformer #sparsity

Transformers keep pushing the state of the art in language and other domains, mainly due to their ability to scale to ever more parameters. However, this scaling has made it prohibitively expensive to serve large numbers of inference requests with a Transformer, in terms of both compute and memory. Scaling Transformers are a new kind of architecture that leverages sparsity in the Transformer blocks to massively speed up inference. By incorporating additional ideas from other architectures, the authors create the Terraformer, which is fast and accurate while consuming very little memory.

OUTLINE:
0:00 - Intro & Overview
4:10 - Recap: Transformer stack
6:55 - Sparse Feedforward layer
19:20 - Sparse QKV Layer
43:55 - Terraformer architecture
55:05 - Experimental Results & Conclusion

Paper:
Code:

Abstract: Large Transformer models yield impressive r
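To make the sparsity idea from the video concrete, here is a minimal sketch of a sparse feedforward layer in the spirit of Scaling Transformers: a controller picks one active unit per block of the hidden layer, so only a small fraction of the FFN is computed per token. The class name, the simple argmax controller, and the masking trick are illustrative assumptions, not the paper's actual implementation (which gathers only the active weights to realize the speedup; the mask version below only shows the math).

```python
import torch
import torch.nn as nn

class SparseFFN(nn.Module):
    """Illustrative sketch: one active hidden unit per block of size `block_size`."""

    def __init__(self, d_model=512, d_ff=2048, block_size=32):
        super().__init__()
        self.block_size = block_size
        self.n_blocks = d_ff // block_size
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)
        # Controller scores every hidden unit; per block, only the top unit survives.
        self.controller = nn.Linear(d_model, d_ff)

    def forward(self, x):  # x: (batch, d_model)
        scores = self.controller(x).view(-1, self.n_blocks, self.block_size)
        active = torch.argmax(scores, dim=-1)  # (batch, n_blocks): index of winner per block
        mask = torch.zeros_like(scores).scatter_(-1, active.unsqueeze(-1), 1.0)
        mask = mask.view(x.shape[0], -1)       # (batch, d_ff), ~1/block_size of entries are 1
        h = torch.relu(self.w_in(x)) * mask    # zero out all non-selected units
        return self.w_out(h)
```

At inference time, an efficient version would use `active` to index only the selected columns of `w_in` and rows of `w_out` instead of computing the full matmul and masking, which is where the decoding speedup comes from.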