The Multi-head Attention Mechanism Explained!
Last week, we talked about the Attention Mechanism. Today, we dive into the Multi-head Attention Mechanism.
Using multiple attention heads in parallel helps capture different interactions between the words of the input sequence. Each head produces hidden states whose dimension is the model dimension divided by the number of heads; the per-head outputs are concatenated back together and then combined into the final hidden states by a linear layer.
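Here is a minimal PyTorch sketch of that idea, assuming typical values like d_model = 512 and 8 heads (the names and numbers are illustrative, not taken from the video):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads  # per-head hidden size

        # One linear layer per projection; internally this is equivalent to
        # num_heads smaller matrices of shape (d_model, d_head) stacked side by side.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # combines the concatenated heads

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, d_model = x.shape

        # Project, then split the last dimension into (num_heads, d_head)
        # and move the head axis forward: (batch, num_heads, seq_len, d_head).
        def split_heads(t):
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q = split_heads(self.w_q(x))
        k = split_heads(self.w_k(x))
        v = split_heads(self.w_v(x))

        # Scaled dot-product attention, computed independently for each head.
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)
        weights = F.softmax(scores, dim=-1)
        per_head = weights @ v  # (batch, num_heads, seq_len, d_head)

        # Concatenate the heads back into d_model, then mix them with w_o.
        concat = per_head.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.w_o(concat)
```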
To reduce the dimensionality of the per-head hidden states, we just need to change the shape of the internal projection matrices:
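For example, one head's query projection can be sketched as a small linear layer mapping d_model to d_model / num_heads (again with assumed values of 512 and 8):

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8
d_head = d_model // num_heads  # 64: each head works in a space 8x smaller

# One head's query projection: its weight matrix has shape (d_head, d_model),
# so the hidden states it produces have dimension d_head instead of d_model.
w_q_head = nn.Linear(d_model, d_head, bias=False)

x = torch.randn(1, 10, d_model)   # (batch, seq_len, d_model)
q_head = w_q_head(x)              # (1, 10, 64): one head's queries

print(w_q_head.weight.shape)      # torch.Size([64, 512])
print(q_head.shape)               # torch.Size([1, 10, 64])
```

Concatenating the 64-dimensional outputs of all 8 heads restores the original 512-dimensional hidden states before the final linear layer.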
Watch the video for more information!