GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (Paper Explained)

Google builds a 600-billion-parameter transformer to do massively multilingual, massive machine translation. Interestingly, the larger model scale ...