The Self-Attention Mechanism
The multi-head attention
The encoder
The decoder
The position embedding
The encoder block
The self-attention layer
The layer-normalization
The position-wise feed-forward network
The decoder block
The cross-attention layer
The predicting head
The overall architecture
The Transformer architecture is composed of an encoder and a decoder.
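As a rough sketch of this composition (using PyTorch's built-in Transformer modules for illustration rather than the code from this post), the decoder consumes the encoder's output and attends to it through cross-attention:

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 512, 8, 6  # illustrative sizes

# Encoder: a stack of self-attention + feed-forward blocks.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True),
    num_layers=n_layers,
)
# Decoder: self-attention, cross-attention over the encoder output, feed-forward.
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads, batch_first=True),
    num_layers=n_layers,
)

src = torch.randn(1, 10, d_model)   # embedded input sequence
tgt = torch.randn(1, 7, d_model)    # embedded output sequence so far
memory = encoder(src)               # encoder hidden states
out = decoder(tgt, memory)          # decoder attends to the encoder output
print(out.shape)                    # torch.Size([1, 7, 512])
```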
The self-attention layer
Self-attention layers replace the recurrent units to capture the interactions between words within the input sequence and within the output sequence.
In the self-attention layer, we first compute the query, key, and value vectors.
We then compute the matrix multiplication between the queries and the keys.
After a softmax transformation, this matrix of interactions is called the attention matrix. The output hidden states of the attention layer are the matrix multiplication of the attention matrix and the value vectors.
Each of the resulting hidden states coming from the attention layer can be understood as a weighted average of the values, with the attention scores as the weights.
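As a minimal sketch of these steps (assuming PyTorch; the names SelfAttention, d_model, and d_head are illustrative, not taken from this post), a single-head self-attention layer can be written as:

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Minimal single-head self-attention: output = softmax(Q K^T / sqrt(d)) V."""
    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        # Linear projections that produce the query, key, and value vectors.
        self.w_q = nn.Linear(d_model, d_head)
        self.w_k = nn.Linear(d_model, d_head)
        self.w_v = nn.Linear(d_model, d_head)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        # Matrix multiplication between queries and keys -> (batch, seq_len, seq_len)
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
        # Softmax turns the interaction matrix into the attention matrix.
        attn = scores.softmax(dim=-1)
        # Each output hidden state is a weighted average of the value vectors.
        return attn @ v

x = torch.randn(2, 5, 64)                       # 2 sequences, 5 tokens, d_model=64
out = SelfAttention(d_model=64, d_head=64)(x)
print(out.shape)                                # torch.Size([2, 5, 64])
```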
The multi-head attention layer
Using multiple attention layers in parallel helps capture different interactions between the words of the input sequence. Each attention head produces hidden states whose dimension is the model dimension divided by the number of heads, and the hidden states of all the heads are concatenated together. The concatenated hidden states are then combined into the final hidden states using a linear layer.
To reduce the dimensionality of the hidden states within each head, we just need to change the shape of the internal matrices.
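A sketch of that shape change might look like the following (again assuming PyTorch; MultiHeadAttention, num_heads, and w_o are illustrative names): each head works in d_model // num_heads dimensions, the heads' outputs are concatenated, and a final linear layer mixes them.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Sketch: each head uses d_model // num_heads dimensions; heads are concatenated."""
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads      # reduced dimension per head
        # One projection per Q/K/V; the reshape below splits it into heads.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # combines the concatenated heads

    def _split(self, t: torch.Tensor) -> torch.Tensor:
        # (batch, seq, d_model) -> (batch, num_heads, seq, d_head)
        b, s, _ = t.shape
        return t.view(b, s, self.num_heads, self.d_head).transpose(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self._split(self.w_q(x))
        k = self._split(self.w_k(x))
        v = self._split(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        attn = scores.softmax(dim=-1)
        heads = attn @ v                                   # (batch, heads, seq, d_head)
        b, _, s, _ = heads.shape
        concat = heads.transpose(1, 2).reshape(b, s, -1)   # concatenate the heads
        return self.w_o(concat)                            # final linear combination

x = torch.randn(2, 5, 512)
print(MultiHeadAttention(d_model=512, num_heads=8)(x).shape)  # torch.Size([2, 5, 512])
```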
The position embedding
Self-attention by itself is order-agnostic, so the position embedding is used to add position information to the semantic information carried by the word embeddings.
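As a sketch, assuming the fixed sinusoidal encoding from the original Transformer paper (the post may use a different variant), the position encodings are simply added to the word embeddings:

```python
import math
import torch

def sinusoidal_position_embedding(seq_len: int, d_model: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = torch.arange(seq_len).unsqueeze(1).float()          # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)                        # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)                        # odd dimensions
    return pe

word_embeddings = torch.randn(5, 64)                          # 5 tokens, d_model=64
inputs = word_embeddings + sinusoidal_position_embedding(5, 64)  # add position info
```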