Although Substack provides a transcription, here’s the original with my formatting.
Hello and welcome to another episode of Practicum AI. I’m Dan Maxwell. In this short presentation, I’m going to talk briefly about word prediction.
Consider this famous sequence of words, spoken by the American President Abraham Lincoln at the start of his Gettysburg Address. Given this sequence, what word comes next? Or given this one, what’s the missing word? Stated simply, this is all that AI language models do. They predict the next word, given a specific context. This simple task lies at the heart of all generative AI text systems, including the most advanced large language models – ChatGPT and Llama being but two examples. When you prompt ChatGPT, it uses your sequence of words to generate a new sequence, one word at a time.
Before transformers made their debut in 2017, the most popular word prediction tools were n-gram language models and recurrent neural networks or RNNs. But both had significant limitations. I will first talk about n-gram language models, followed by RNNs, and conclude with a brief introduction to transformers.
N-gram language models were the first and simplest approach to word prediction. Consider the sentence: “A cat sat in the hat.” So, how might an n-gram language model generate this? In this example, our initial sequence has just two words: “A” and “cat”. Here’s how an n-gram model predicts the next word:
1. First, since the model here uses a two-word context window, it takes the initial two words and searches the model’s document dataset for sentences where these two appear together.
2. In this example, our search retrieved sentences where “sat” was the next word and others where “napped” came next. As we can see, “sat” was more prevalent, appearing 11 times, or in roughly 69% of the retrieved sentences. The word “napped”, on the other hand, appeared in just 5 sentences, roughly 31% of the time.
3. Because “sat” has the highest probability of appearing after these two words, the model selects it as the next word in the sequence. It then slides the window to the two most recent words – “cat sat” – and repeats the same process until the entire sentence has been generated, as sketched in the code below.
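To make this sliding-window procedure concrete, here is a minimal Python sketch of a count-based model with a two-word context. The toy corpus, the counts, and the `predict_next` name are all illustrative; they are not the dataset or numbers from the slide.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for the model's document dataset (illustrative only).
corpus = [
    "a cat sat in the hat",
    "a cat sat on the mat",
    "a cat napped in the sun",
]

# Count which words follow each two-word context across the corpus.
follower_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i in range(len(words) - 2):
        context = (words[i], words[i + 1])
        follower_counts[context][words[i + 2]] += 1

def predict_next(word1, word2):
    """Return the most frequent word observed after the given two-word context."""
    followers = follower_counts[(word1, word2)]
    if not followers:
        return None  # no match found in the dataset (the sparsity problem)
    return followers.most_common(1)[0][0]

print(predict_next("a", "cat"))    # 'sat' -- seen twice, versus 'napped' once
print(predict_next("cat", "sat"))  # 'in' here ('in' and 'on' tie; ties fall back to insertion order)
```

Generating a whole sentence is just this prediction repeated in a loop, appending each predicted word and sliding the two-word window forward.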
N-gram language models are limited. They assume that the probability of the next word in a sequence depends only on a fixed-size window of previous words (Wikipedia). The problem with this assumption is that the model might not find any matches, especially for larger n-grams of 6, 7, or more words. Consider this five-word sequence – “cat pawed at the moving …”. Clearly, the chances of the model finding this exact sequence of words in a large document dataset are minimal. Sophisticated versions of n-gram models use a variety of statistical techniques, such as smoothing and back-off, to predict the next word when no matching sentences are found. Even so, this remains a limiting factor.
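To illustrate one of those techniques, here is a small sketch of the back-off idea: when the full context has no matches, the model falls back to a shorter one. The tables and counts below are made up for illustration, and the sketch ignores the probability weighting a real back-off scheme would apply.

```python
from collections import Counter

# Toy follower tables (illustrative counts, not drawn from a real corpus).
two_word_followers = {("cat", "pawed"): Counter()}               # full two-word context: no matches
one_word_followers = {("pawed",): Counter({"at": 3, "the": 1})}  # shorter context: matches exist

def predict_with_backoff(word1, word2):
    """Try the full two-word context first; fall back to the last word alone."""
    followers = two_word_followers.get((word1, word2), Counter())
    if not followers:
        followers = one_word_followers.get((word2,), Counter())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

print(predict_with_backoff("cat", "pawed"))  # 'at' -- recovered by backing off to 'pawed' alone
```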
Recurrent neural networks or RNNs made next-word prediction and translation more accurate. With RNNs, all the information about an input is represented in a single piece of state memory, or context vector. Thus, RNNs must compress everything they need to know about a word sequence into the available space. This limits the size of the input sequence. And no matter how large we make the state memory, some word sequence inevitably exceeds it, and vital information is lost.
A second problem is that RNNs must be trained and used one word at a time. This can be a slow way to work, especially with large datasets.
To gain a better understanding of the state memory limitation, let’s review the basics of RNN architecture.
RNN models contain a feedback loop that allows information to move from one step to another. As such, they’re ideal for modeling sequential data like text. As shown here, an RNN receives some input (a word or character), feeds it through the network, and outputs a vector called the hidden state. At the same time, the model feeds some information back to itself via the feedback loop, which it can then use in the next step.
To the right of the equal sign, the RNN process is unrolled. During each iteration, the RNN cell passes information about its state to the next operation in the sequence. This allows the cell to retain information from previous steps and use it for its output predictions.
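For readers who like to see the arithmetic, here is a minimal NumPy sketch of a vanilla RNN cell. The tanh update and weight shapes follow the standard textbook formulation; the sizes and random weights are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 8, 16   # illustrative sizes

# Weights of a single vanilla RNN cell (randomly initialised for the sketch).
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (the feedback loop)
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One unrolled step: combine the current input with the previous hidden state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Process a sequence of 5 (fake) word vectors one step at a time.
sequence = rng.normal(size=(5, input_size))
h = np.zeros(hidden_size)          # initial state memory
for x_t in sequence:
    h = rnn_step(x_t, h)           # the same fixed-size vector is reused at every step

print(h.shape)  # (16,) -- everything the model 'remembers' lives in this one vector
```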
The RNN architecture made early machine translation systems possible. RNNs usually translate text by linking an encoder to a decoder. This architecture works well for relatively short sequences. Here a short English sentence of 3 words plus an exclamation point is translated into German. The encoder ingests each sentence element sequentially while maintaining its state along the way. The encoder’s last hidden state – a numerical representation of the entire sentence – is then passed to the decoder. The decoder, in turn, generates the German equivalents from top to bottom.
This architecture is simple and elegant, but it has one big weakness. The encoder’s final hidden state is an information bottleneck. That is, it must represent the meaning of the entire input sequence in a compressed form. With long sequences, this creates a challenge. The information at the start of the sequence might be lost in the process of compressing everything into a single (fixed) representation.
When that happens, the decoder may not have enough information to do its job well.
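Here is a bare-bones NumPy sketch of that encoder-decoder handoff. The weights are random and no real translation happens; the point is only to show that everything the decoder receives is the encoder's single final hidden state.

```python
import numpy as np

rng = np.random.default_rng(1)
emb, hid = 8, 16  # illustrative embedding and hidden sizes

# Separate (randomly initialised) weights for the encoder and decoder cells.
enc_W_x, enc_W_h = rng.normal(scale=0.1, size=(hid, emb)), rng.normal(scale=0.1, size=(hid, hid))
dec_W_x, dec_W_h = rng.normal(scale=0.1, size=(hid, emb)), rng.normal(scale=0.1, size=(hid, hid))

def step(W_x, W_h, x, h):
    return np.tanh(W_x @ x + W_h @ h)

# --- Encoder: read the source sentence one token at a time ------------------
source = rng.normal(size=(4, emb))      # four fake token embeddings standing in for a short sentence
h = np.zeros(hid)
for x in source:
    h = step(enc_W_x, enc_W_h, x, h)

context = h  # the final hidden state: the ONLY thing the decoder ever sees (the bottleneck)

# --- Decoder: generate the target sentence, seeded by that single vector ----
h = context
prev_token = np.zeros(emb)              # stand-in for a start-of-sentence embedding
for _ in range(5):                      # generate 5 target tokens
    h = step(dec_W_x, dec_W_h, prev_token, h)
    prev_token = rng.normal(size=emb)   # in a real model: embed the token just predicted

print(context.shape)  # (16,) -- the whole source sentence squeezed into 16 numbers
```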
Alright, let’s simulate the bottleneck problem. This animation shows how an RNN translation model works. Each word is processed separately, with a single hidden state passed between words. The encoder’s final hidden state is then handed off to the decoder which generates the German equivalent.
The transformer architecture represented a significant advance over RNNs. Here we see a transformer executing another translation task, this time from English to French. Unlike directional models, which read the text input sequentially (left-to-right or right-to-left), a transformer encoder reads the entire sequence of words at once. It is therefore considered bidirectional. Or more precisely, we say that it’s non-directional. This property allows the model to learn the context of a word in relation to all the words around it. In other words, the transformer’s effective context window is larger than that of n-gram language models or RNNs. And this is a significant advantage. In future presentations, I’ll talk about transformers in depth. But this is enough for now.
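To hint at what “reading the entire sequence at once” looks like, here is a bare-bones scaled dot-product self-attention sketch in NumPy. Real transformers add learned query/key/value projections, multiple heads, and positional information, so treat this purely as an illustration of every token attending to every other token.

```python
import numpy as np

rng = np.random.default_rng(2)
seq_len, d = 6, 16                     # 6 tokens, 16-dimensional vectors (illustrative)
X = rng.normal(size=(seq_len, d))      # fake embeddings for the whole input sequence

# Scaled dot-product self-attention: every token attends to every other token at once.
scores = X @ X.T / np.sqrt(d)                  # (6, 6) pairwise similarity
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)  # softmax over each row
output = weights @ X                           # each output mixes in the full context

print(weights.shape)  # (6, 6): one attention weight for every pair of positions
print(output.shape)   # (6, 16): one context-aware vector per token
```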
And before I end this short presentation, here’s a family tree of the most prominent transformer models. Keep in mind that this list is not complete as development in this space is dynamic and ongoing.