Linear Transformers Are Secretly Fast Weight Memory Systems (Machine Learning Paper Explained)

#fastweights #deeplearning #transformers

Transformers dominate deep learning, but their memory and compute requirements grow quadratically with sequence length, which makes them expensive to train and hard to scale. Many papers have tried to linearize the core module, the attention mechanism, by replacing the softmax with a kernel feature map (for example, the Performer). However, such methods are either unsatisfactory in practice or carry other downsides, such as a reliance on random features. This paper establishes an intrinsic connection between linearized (kernel) attention and the much older Fast Weight Memory Systems, popularized in part by Jürgen Schmidhuber in the 1990s (a sketch of the correspondence follows the outline below). It exposes the fundamental memory-capacity limitations of these algorithms and proposes new update rules and new kernels to fix them. The resulting model compares favorably to the Performer on key synthetic experiments and real-world tasks.

OUTLINE:
0:00 - Intro & Overview
1:40 - Fast Weight Systems
7:00 - Distributed Storage of Symbolic Values
12:30 - Autoregressive Attention Mechanisms
18:50 - Connecting Fast W…
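To make the paper's central observation concrete, here is a minimal NumPy sketch (my illustration, not code from the paper) showing that unnormalized linear attention computes exactly the same outputs as an additive fast weight system. The `phi` feature map used here is the ELU+1 map from Katharopoulos et al. (2020), chosen only for illustration; the paper itself proposes a different deterministic kernel (DPFP).

```python
import numpy as np

def phi(x):
    # ELU(x) + 1: a positive feature map, so kernel scores stay non-negative.
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(0)
T, d_k, d_v = 5, 4, 3
Q = rng.normal(size=(T, d_k))  # queries
K = rng.normal(size=(T, d_k))  # keys
V = rng.normal(size=(T, d_v))  # values

# Fast weight view: a "weight matrix" W is written to at every step by an
# outer product and read from with the featurized query.
# (The attention normalizer, a running sum z_t, is omitted for brevity.)
W = np.zeros((d_v, d_k))
y_fast = []
for t in range(T):
    W += np.outer(V[t], phi(K[t]))   # write: W_t = W_{t-1} + v_t phi(k_t)^T
    y_fast.append(W @ phi(Q[t]))     # read:  y_t = W_t phi(q_t)
y_fast = np.stack(y_fast)

# Linear attention view: y_t = sum_{s<=t} (phi(q_t) . phi(k_s)) v_s
y_attn = np.stack([
    sum((phi(Q[t]) @ phi(K[s])) * V[s] for s in range(t + 1))
    for t in range(T)
])
assert np.allclose(y_fast, y_attn)   # identical outputs at every step
```

The purely additive write above is exactly what the paper identifies as the limitation: associations can only pile up, never be corrected. Its proposed fix is a delta-rule-style update, W_t = W_{t-1} + β_t (v_t − W_{t-1} φ(k_t)) φ(k_t)^T, which first retrieves what is currently stored under the key and writes only the difference, so old associations can be overwritten rather than accumulated.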