Description
Beam search is a heuristic search algorithm for finding the best sequence of choices among many options. It explores a graph by expanding only the most promising nodes within a limited set (the "beam"). Beam search is typically used when there is no way to store the entire search tree in memory.
Where is it used in DL?
Sequence-to-sequence models, such as neural machine translation (NMT).
How is it used in DL?
At each decoding step it is used to pick the best next words, keeping the beam_width most probable partial sequences; a chosen word can be EOS, which ends that sequence.
For example:
In NMT there are lots of possible combinations of words for a translation, but we want to pick the best one, aka the sequence with the highest probability under the model, and not one sampled at random.
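To make one step concrete, here is a minimal sketch of a single decoding step with beam width 2. The vocabulary borrows the "Jane visits Africa in September" example from Ng's course, and all probabilities are made up for illustration:

import numpy as np

vocab = ["in", "jane", "september", "<EOS>"]
# two partial hypotheses kept so far, with their log probabilities
hypotheses = [(["jane"], np.log(0.4)), (["in"], np.log(0.3))]
# made-up model distributions over the next word, given the last word
next_probs = {
    "jane": [0.10, 0.05, 0.70, 0.15],
    "in":   [0.05, 0.20, 0.60, 0.15],
}

candidates = []
for words, logp in hypotheses:
    for word, p in zip(vocab, next_probs[words[-1]]):
        candidates.append((words + [word], logp + np.log(p)))

# keep only the 2 best hypotheses (the "beam")
beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:2]
print([(words, round(score, 3)) for words, score in beam])
# [(['jane', 'september'], -1.273), (['in', 'september'], -1.715)]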
Interesting things to note:
When beam width = 1, beam search reduces to greedy search: it keeps only the single most probable word at each step.
Beam search sums the log probabilities of the words (equivalent to multiplying the raw probabilities) to find the most probable sequence. Each log probability is negative, so every extra token drags the score down, which makes beam search favor very short translations. Length normalization mitigates this by dividing the score by the number of tokens (see the sketch after this list).
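A minimal sketch of length normalization, assuming hypotheses are scored by summed log probabilities; the sequences, probabilities, and alpha value are all made up for illustration:

import numpy as np

# toy hypotheses: (sequence, summed log probability)
short_hyp = (["jane", "visits"], np.log(0.28))
long_hyp = (["jane", "visits", "africa", "in", "september"], np.log(0.10))

def normalized(log_prob, length, alpha=0.7):
    # dividing by length ** alpha (0 <= alpha <= 1) softens the bias toward
    # short sequences; alpha = 1 is a plain per-token average
    return log_prob / (length ** alpha)

for words, logp in (short_hyp, long_hyp):
    print(words, round(logp, 2), round(normalized(logp, len(words)), 2))
# raw scores favor the 2-word hypothesis (-1.27 vs -2.30), but after
# normalization the 5-word hypothesis wins (-0.75 vs -0.78)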
Code
import numpy as np

def beam_search_decoder(encoded_data, beam_width: int):
    # encoded_data[t][j] is the model's probability of token j at step t;
    # returns the beam_width best sequences with their scores
    # (lower is better: scores are summed negative log probabilities)
    sequences = [[list(), 0.0]]
    # walk over each step in sequence
    for row in encoded_data:
        all_candidates = list()
        # expand each current candidate with every possible next token
        for seq, score in sequences:
            for j in range(len(row)):
                # why do we sum the logs of the probabilities
                # instead of just multiplying the probabilities?
                # the probabilities are all numbers less than 1,
                # multiplying a lot of numbers less than 1 will underflow
                # into a very smol number
                candidate = [seq + [j], score - np.log(row[j])]
                all_candidates.append(candidate)
        # order all candidates by score (ascending: lowest -log prob = most probable)
        ordered = sorted(all_candidates, key=lambda tup: tup[1])
        # select the beam_width best and carry them into the next step
        sequences = ordered[:beam_width]
    return sequences
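A quick usage sketch on a made-up 3-step, 4-token probability matrix (the numbers are arbitrary):

import numpy as np

# each row is a probability distribution over 4 tokens at one decoding step
data = np.array([
    [0.1, 0.4, 0.3, 0.2],
    [0.3, 0.25, 0.25, 0.2],
    [0.1, 0.1, 0.1, 0.7],
])

for seq, score in beam_search_decoder(data, beam_width=3):
    print(seq, score)
# the best sequence is [1, 0, 3], the argmax token at each step,
# with score -log(0.4 * 0.3 * 0.7) ≈ 2.48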
Math
How the algorithm works, from Andrew Ng's Coursera course:
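Using the y^{\langle t \rangle} notation from the course, beam search approximately solves

\arg\max_{y} \prod_{t=1}^{T_y} P\left(y^{\langle t \rangle} \mid x, y^{\langle 1 \rangle}, \dots, y^{\langle t-1 \rangle}\right)

where x is the input sentence and T_y is the output length. With length normalization, the objective becomes

\frac{1}{T_y^{\alpha}} \sum_{t=1}^{T_y} \log P\left(y^{\langle t \rangle} \mid x, y^{\langle 1 \rangle}, \dots, y^{\langle t-1 \rangle}\right)

where \alpha \in [0, 1] (the course suggests around 0.7) controls how strongly the normalization counteracts the bias toward short outputs; \alpha = 0 means no normalization and \alpha = 1 is a plain per-token average.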
History
The term “beam search” was coined by Raj Reddy of Carnegie Mellon University in 1977.