Chapter 4 of The Big Book of Large Language Models is Here!
Chapter 4 of the Big Book of Large Language Models is finally here!
That was a difficult chapter to write! Originally, I wanted to cram into that chapter all the improvements made to the Transformer architecture since the "Attention Is All You Need" paper, but I realized that it would be too long for a single chapter. I ended up focusing only on improvements to the attention layer and deferring topics like relative positional encoding and Mixture of Experts to the next chapter. In this chapter, I address the following improvements:
- Sparse Attention Mechanisms
  - The First Sparse Attention: Sparse Transformers
  - Choosing Sparsity Efficiently: Reformer
  - Local vs Global Attention: Longformer and BigBird
- Linear Attention Mechanisms
  - Low-Rank Projection of Attention Matrices: Linformer
  - Recurrent Attention Equivalence: The Linear Transformer
  - Kernel Approximation: Performers
- Memory Efficient Attention
  - Self-attention Does Not Need O(N²) Memory
  - FlashAttention
- Faster Decoding Attention Mechanisms
  - Multi-Query Attention
  - Grouped-Query Attention (see the sketch right after this list)
  - Multi-Head Latent Attention
- Long Sequence Attentions
  - Transformer-XL
  - Memorizing Transformers
  - Infini-Attention
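To give a taste of one of those routes, here is a minimal NumPy sketch of grouped-query attention. This is illustrative code of my own, not code from the book: query heads are split into groups that share a single key/value head, so using as many groups as heads recovers standard multi-head attention, while a single group recovers multi-query attention. All names and shapes below are assumptions chosen for the example.

```python
# Minimal sketch of grouped-query attention (GQA): query heads are split into
# groups that share one key/value head. n_groups == n_heads gives multi-head
# attention; n_groups == 1 gives multi-query attention. Illustrative only.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v, n_heads, n_groups):
    # q: (seq, n_heads * d_head); k, v: (seq, n_groups * d_head)
    seq = q.shape[0]
    d_head = q.shape[1] // n_heads
    heads_per_group = n_heads // n_groups

    q = q.reshape(seq, n_heads, d_head)
    k = k.reshape(seq, n_groups, d_head)
    v = v.reshape(seq, n_groups, d_head)

    outputs = []
    for h in range(n_heads):
        g = h // heads_per_group              # key/value head shared by this group
        scores = q[:, h, :] @ k[:, g, :].T    # (seq, seq) attention scores
        weights = softmax(scores / np.sqrt(d_head))
        outputs.append(weights @ v[:, g, :])  # (seq, d_head)
    return np.concatenate(outputs, axis=-1)   # (seq, n_heads * d_head)

# Example: 8 query heads sharing only 2 key/value heads.
seq, d_head, n_heads, n_groups = 16, 32, 8, 2
rng = np.random.default_rng(0)
q = rng.normal(size=(seq, n_heads * d_head))
k = rng.normal(size=(seq, n_groups * d_head))
v = rng.normal(size=(seq, n_groups * d_head))
print(grouped_query_attention(q, k, v, n_heads, n_groups).shape)  # (16, 256)
```

The point of the trick is that at decoding time only the n_groups key/value heads need to be cached, which shrinks the KV cache without giving up all the expressiveness of multiple query heads.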
Obviously, I could not include everything that was ever invented around the attention layer, but I believe these methods capture well the different research routes that have been explored since then. I believe it is a very important chapter, as most materials available online tend to focus on the vanilla self-attention, which is becoming an outdated baseline by today's standards. I also found that trying to understand how to improve self-attention is a very good way to understand what it is we are trying to improve in the first place! Self-attention may appear odd at first, but diving into the inner workings of the layer in order to improve it gives us a level of understanding that goes beyond anything we can learn just by looking at the original formulation. I hope you will enjoy it!
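For reference, the baseline all of these methods improve upon is the vanilla scaled dot-product self-attention, which materializes a full N×N attention matrix. Here is a minimal single-head sketch of it (again, my own illustrative code rather than code from the book, with made-up shapes and weight names):

```python
# Vanilla scaled dot-product self-attention for a single head: the (N, N)
# score matrix built here is exactly what the sparse, linear, and
# memory-efficient variants listed above avoid materializing in full.
import numpy as np

def vanilla_self_attention(x, w_q, w_k, w_v):
    # x: (N, d_model); w_q, w_k, w_v: (d_model, d_head)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])         # (N, N): O(N^2) time and memory
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                              # (N, d_head)

N, d_model, d_head = 16, 64, 32
rng = np.random.default_rng(0)
x = rng.normal(size=(N, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(vanilla_self_attention(x, w_q, w_k, w_v).shape)  # (16, 32)
```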