Transformers are taking every domain of ML by storm! Are you ready for the revolution? I think it is becoming more and more important to understand the basics, so pay Attention because Attention is here to stay! Today we look at:
The self-attention mechanism in Transformer explained
Recurrent network encoder-decoder VS Bahdanau attention VS self-attention
Github repositories, articles and Youtube videos about the Attention Mechanism
Transformers: Attention is all you need!
At the center of Transformers is the Attention mechanism, and once you get the intuition, it is not too difficult to understand. Let me try to break it down:
As inputs to a Transformer, we have a sequence of items, for example the words (or tokens) of a sentence. Once you think in terms of sequences, it is not hard to see why time series, images, or sound data could fit the bill as well.
We know that a word can be encoded as a vector in an embedding. We can also encode the position of that word in the input sentence into a vector, and add it to the word vector. This way, the same word at a different position in a sentence is encoded differently.
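To make this concrete, here is a minimal sketch in PyTorch. All the sizes, the toy token ids, and the variable names are illustrative, and it uses a learned positional embedding for simplicity (the original Transformer paper used fixed sinusoidal position encodings instead):

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 10_000, 512, 64  # toy sizes, chosen arbitrarily

# One embedding table for the tokens, one for the positions.
token_emb = nn.Embedding(vocab_size, d_model)
pos_emb = nn.Embedding(max_len, d_model)

token_ids = torch.tensor([[12, 845, 7, 845]])              # a 4-token "sentence" (batch of 1)
positions = torch.arange(token_ids.size(1)).unsqueeze(0)   # [[0, 1, 2, 3]]

# The same token id (845) appears at positions 1 and 3, but it ends up with two
# different input vectors because the position vector added to it differs.
x = token_emb(token_ids) + pos_emb(positions)              # shape: (1, 4, d_model)
```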
As part of the attention mechanism, we have 3 matrices Wq, Wk, and Wv that project each of the input embedding vectors into 3 different vectors: the Query, the Key, and the Value. This jargon comes from retrieval systems, but I don't find the names particularly intuitive!
For each word, we take its Query vector and compute its dot products with the Key vectors of all the words. This gives us a sense of how similar the Queries and the Keys are, and that is the basis behind the concept of "attention": how much attention should a word pay to each of the other words in the input to understand the meaning of the sentence? A softmax then normalizes the resulting scores and further accentuates the highest similarities.
This results in one vector of attention weights for each word. We then use those weights to take a weighted sum of the Value vectors of all the words: that weighted sum is the self-attention output for the word. We have now computed self-attention!
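Here is a minimal single-head sketch of that computation in PyTorch, with toy dimensions and illustrative variable names. Note the division by the square root of the Key dimension before the softmax, the scaling used in the original paper:

```python
import math
import torch
import torch.nn as nn

d_model, d_k = 64, 64          # toy dimensions
seq_len = 5                    # 5 tokens in the input

# The three projection matrices Wq, Wk, Wv from the text.
W_q = nn.Linear(d_model, d_k, bias=False)
W_k = nn.Linear(d_model, d_k, bias=False)
W_v = nn.Linear(d_model, d_k, bias=False)

x = torch.randn(seq_len, d_model)          # stand-in for the position-aware embeddings
Q, K, V = W_q(x), W_k(x), W_v(x)           # Query, Key, Value vectors for every token

# Dot products between each Query and every Key -> (seq_len, seq_len) score matrix,
# scaled by sqrt(d_k), then softmaxed row by row.
scores = Q @ K.T / math.sqrt(d_k)
weights = torch.softmax(scores, dim=-1)    # each row: how much one word attends to the others

# Weighted sum of the Value vectors -> one self-attention output vector per token.
out = weights @ V                          # shape: (seq_len, d_k)
```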
Repeat this process multiple times in parallel, each time with its own Wq, Wk, and Wv matrices, and concatenate the resulting attention outputs: this gives you a multi-head attention layer. It helps diversify the learning of the possible relationships between the words.
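A quick sketch of a multi-head layer, here simply reusing PyTorch's built-in module rather than writing the per-head projections and the concatenation by hand (sizes are again arbitrary):

```python
import torch
import torch.nn as nn

d_model, num_heads, seq_len = 64, 8, 5     # toy sizes; each head works on d_model // num_heads dims

# nn.MultiheadAttention bundles the per-head Wq/Wk/Wv projections and the final concatenation.
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)       # (batch, sequence, embedding)
# Self-attention: the same tensor is used as Query, Key, and Value.
out, attn_weights = mha(x, x, x)           # out: (1, seq_len, d_model)
```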
The original Transformer block is just an attention layer followed by a small feed-forward network, with a couple of residual (skip) connections as found in ResNet, and layer normalizations. A "Transformer" model is usually multiple Transformer blocks stacked one after the other.
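Putting it together, here is a minimal sketch of one encoder-style block and a small stack of them. It assumes the post-norm layout of the original paper; the class name, sizes, and the choice of ReLU in the feed-forward part are illustrative, not prescriptive:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One encoder-style block: multi-head self-attention + feed-forward network,
    each wrapped in a residual connection and a layer normalization."""

    def __init__(self, d_model=64, num_heads=8, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)      # self-attention
        x = self.norm1(x + attn_out)          # residual connection + layer norm
        x = self.norm2(x + self.ff(x))        # feed-forward + residual + layer norm
        return x

# A "Transformer" is several of these blocks stacked one after the other.
blocks = nn.Sequential(*[TransformerBlock() for _ in range(6)])
y = blocks(torch.randn(1, 5, 64))             # (batch, sequence, d_model)
```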
Most language models follow this basic architecture. I hope this explanation helps people trying to get into the field!
Recurrent Network VS Bahdanau Attention VS Self-attention
How did we arrive at the self-attention mechanism? Not too long ago, the state-of-the-art models for sequence-to-sequence learning tasks were Recurrent Neural Networks, but self-attention and Transformers completely changed the game! Attention mechanisms are very good at capturing long-range dependencies between words and are highly parallelizable, unlike recurrent networks such as LSTMs or GRUs.