We know that LLMs are trained to predict the next token. When we decode the output sequence, we use the tokens of the prompt and the previously generated tokens to predict the next one. With greedy decoding or multinomial sampling, we use those predictions to output one token at a time in an autoregressive manner. But is this the sequence we are looking for, given the prompt? Do we actually care about the probability of the next token taken in isolation? What we want is the whole sequence that maximizes the probability conditioned on the prompt, not each token separately.
So let's look at why the next-token probability is not the quantity we care about, and how we can do better than choosing each token by looking only at its own probability. Let's get into it!
With greedy search and multinomial sampling, we iteratively pick the next token conditioned on the prompt and the previous tokens: greedy search takes the most probable token, while multinomial sampling draws a token at random according to the predicted probabilities.
But those are not the probabilities we care about. We care about generating the best sequence of tokens conditioned on a specific prompt.
Fortunately, we can compute the probability we care about from the probabilities predicted by the model.
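To make that step concrete, here is the standard chain-rule factorization, written with notation that is assumed here rather than taken from the text: x is the prompt, y_t is the t-th generated token, y_{<t} stands for the tokens generated before it, and N is the number of tokens to decode.

```latex
P(y_1, \dots, y_N \mid x) = \prod_{t=1}^{N} P(y_t \mid x, y_{<t})
\qquad\Longleftrightarrow\qquad
\log P(y_1, \dots, y_N \mid x) = \sum_{t=1}^{N} \log P(y_t \mid x, y_{<t})
```

Each factor on the right is exactly what the model predicts at one decoding step, so scoring a full sequence only requires multiplying those predictions together (or summing their logarithms, which is numerically safer).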
That is important because there could exist a sequence of tokens with a higher probability than the one generated by the greedy search.
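A tiny made-up example can illustrate this. The sketch below assumes a hypothetical three-token vocabulary and hand-picked probability tables (`FIRST`, `SECOND`) standing in for a real model; none of these numbers come from an actual LLM. It compares the sequence found by greedy search with the best sequence found by scoring every possibility.

```python
from itertools import product

# Hypothetical toy "model": hand-picked next-token probabilities.
# FIRST gives the probabilities of the first token after the prompt,
# SECOND gives the probabilities of the next token given the previous one.
FIRST = {"A": 0.5, "B": 0.4, "C": 0.1}
SECOND = {
    "A": {"A": 0.4, "B": 0.3, "C": 0.3},
    "B": {"A": 0.9, "B": 0.05, "C": 0.05},
    "C": {"A": 0.3, "B": 0.3, "C": 0.4},
}


def next_token_probs(prefix):
    """Return the next-token distribution given the tokens decoded so far."""
    return FIRST if len(prefix) == 0 else SECOND[prefix[-1]]


def sequence_prob(seq):
    """Probability of a whole sequence: product of the per-step probabilities."""
    prob = 1.0
    for t, token in enumerate(seq):
        prob *= next_token_probs(seq[:t])[token]
    return prob


def greedy_decode(length):
    """Pick the most probable next token at every step."""
    seq = ()
    for _ in range(length):
        probs = next_token_probs(seq)
        seq += (max(probs, key=probs.get),)
    return seq


def exhaustive_decode(length):
    """Score every possible sequence and return the most probable one."""
    return max(product(FIRST, repeat=length), key=sequence_prob)


greedy = greedy_decode(2)
best = exhaustive_decode(2)
print("greedy:", greedy, sequence_prob(greedy))
print("best:  ", best, sequence_prob(best))
```

In this toy setup, greedy search commits to "A" because it is the most probable first token, and ends with a sequence probability of about 0.20. Starting with the slightly less likely "B" leads to a sequence with probability of about 0.36, which greedy search never considers.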
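Finding the best sequence exactly is intractable in practice: with a vocabulary of V tokens and N tokens to decode, there are V^N candidate sequences to score. We can simplify the problem with a heuristic approach called beam search. At each step of the decoding process, we extend every sequence we have kept so far with candidate next tokens, and we keep only the top-scoring sequences; the beam width hyperparameter sets how many we keep.
We iterate this process for the N tokens we need to decode, and at the end we return only the sequence with the highest probability.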
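The procedure described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the probability tables (`FIRST`, `SECOND`) and the helper `next_token_probs` are hypothetical stand-ins for a real model, scores are accumulated as log-probabilities, and end-of-sequence handling and length normalization are deliberately left out.

```python
import math

# Hypothetical toy "model": hand-picked next-token probabilities.
FIRST = {"A": 0.5, "B": 0.4, "C": 0.1}
SECOND = {
    "A": {"A": 0.4, "B": 0.3, "C": 0.3},
    "B": {"A": 0.9, "B": 0.05, "C": 0.05},
    "C": {"A": 0.3, "B": 0.3, "C": 0.4},
}


def next_token_probs(prefix):
    """Return the next-token distribution given the tokens decoded so far."""
    return FIRST if len(prefix) == 0 else SECOND[prefix[-1]]


def beam_search(next_probs, num_tokens, beam_width):
    """Return (best sequence, its log-probability) among the kept beams."""
    # Each beam is a (sequence, cumulative log-probability) pair.
    beams = [((), 0.0)]
    for _ in range(num_tokens):
        candidates = []
        # Extend every kept sequence with every possible next token.
        for seq, score in beams:
            for token, p in next_probs(seq).items():
                candidates.append((seq + (token,), score + math.log(p)))
        # Keep only the beam_width highest-scoring candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    # The beams are sorted, so the first one has the highest probability.
    return beams[0]


seq, log_prob = beam_search(next_token_probs, num_tokens=2, beam_width=2)
print(seq, math.exp(log_prob))
```

With `beam_width=1` this sketch reduces to greedy search and returns ("A", "A") on the toy tables. With `beam_width=2` it also keeps "B" alive after the first step and ends up with ("B", "A"), the more probable sequence. A larger beam width explores more candidates at a proportionally higher cost per step.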
Beam search has a few advantages:
Balance Between Quality and Efficiency
Flexibility through Beam Width
Good for Long Sequences
Useful in Structured Prediction Tasks
Can Be Combined with Multinomial Sampling
And a few problems:
Suboptimal Solutions
Computational Cost
Length Bias
Lack of Diversity
Heuristic Nature
End-of-Sequence Prediction
Watch the video for more information!