Accelerating Transformers via Kernel Density Estimation Insu Han

A Google TechTalk, presented by Insu Han, 2023/05/30. A Google Algorithms Seminar.

ABSTRACT: The dot-product attention mechanism plays a crucial role in modern deep architectures (e.g., the Transformer) for sequence modeling. However, naive exact computation of this model incurs quadratic time and memory complexity in the sequence length, hindering the training of long-sequence models. The critical bottlenecks are the computation of the partition functions in the denominator of the softmax function and the multiplication of the softmax matrix with the matrix of values. Our key observation is that the former can be reduced to a variant of the kernel density estimation (KDE) problem, and that an efficient KDE solver can be further utilized to accelerate the latter via subsampling-based fast matrix products. Our proposed KDEformer can approximate the attention in sub-quadratic time with provable spectral norm bounds, while all prior results merely provide entry-wise error bounds. Empirically, we verify that KDEformer outperforms other attention approximations in terms of accuracy, memory, and runtime on various pre-trained models.

Bio: Insu Han is a postdoctoral research fellow at Yale University, hosted by Amin Karbasi. He completed his Ph.D. in the School of Electrical Engineering at the Korea Advanced Institute of Science and Technology (KAIST) in 2021, under the supervision of Jinwoo Shin. Before that, he obtained his Bachelor’s degree in Electrical Engineering with a minor in Mathematics at KAIST. He has worked on developing and analyzing approximate algorithms for large-scale machine learning problems and their applications. His most recent work focuses on accelerating the attention mechanism in large language models via fast kernel density estimation methods. In 2019, he was the recipient of the Microsoft Research Asia Fellowship.
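To make the reduction mentioned in the abstract concrete, here is a minimal NumPy sketch. It is not the KDEformer algorithm: the function names (`exact_attention`, `kernel_sums`, `sampled_attention`) are hypothetical, and the sampler draws keys uniformly at random, which is a deliberately simple stand-in for the KDE-guided sampling the talk describes. It only illustrates two points: the softmax denominator for each query is a kernel sum over the keys with kernel exp(<q, k>), and both the denominator and the product with the values can be estimated from a subset of keys.

```python
import numpy as np

def exact_attention(Q, K, V):
    """Exact softmax attention: quadratic time and memory in sequence length n."""
    A = np.exp(Q @ K.T)                    # n x n unnormalized attention matrix
    denom = A.sum(axis=1, keepdims=True)   # one partition function per query
    return (A / denom) @ V, denom.ravel()

def kernel_sums(Q, K):
    """Same partition functions, written as sums of the kernel exp(<q_i, k_j>) over keys."""
    return np.array([np.exp(K @ q).sum() for q in Q])

def sampled_attention(Q, K, V, m, rng):
    """Estimate attention from m keys sampled uniformly without replacement.

    Assumption: uniform sampling is used only for illustration; it carries
    none of the guarantees claimed for KDEformer.
    """
    n = K.shape[0]
    idx = rng.choice(n, size=m, replace=False)
    A_s = np.exp(Q @ K[idx].T)             # n x m instead of n x n
    scale = n / m                          # rescale sampled sums to full-sum estimates
    denom_est = scale * A_s.sum(axis=1, keepdims=True)
    numer_est = scale * (A_s @ V[idx])
    return numer_est / denom_est, denom_est.ravel()

rng = np.random.default_rng(0)
n, d = 256, 16
Q = rng.normal(size=(n, d)) / d ** 0.25    # scaled so dot products have unit variance
K = rng.normal(size=(n, d)) / d ** 0.25
V = rng.normal(size=(n, d))

out_exact, denom_exact = exact_attention(Q, K, V)
out_approx, denom_approx = sampled_attention(Q, K, V, m=64, rng=rng)

# Spectral norm (largest singular value) of the error, relative to the exact output
rel_err = np.linalg.norm(out_exact - out_approx, 2) / np.linalg.norm(out_exact, 2)
print("relative spectral-norm error:", rel_err)
```

With `m` equal to `n` the sampled version reproduces the exact output up to floating-point rounding; the interesting regime is `m` much smaller than `n`, where the quality of the sampling distribution determines the error.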
