This is the final part of my series reading “Attention is all you need”, the foundational paper that invented the Transformer model, used in large language models (LLMs). In the first part, we covered some background, and in the second part we reviewed the architecture of the Transformer model. In this part, we’ll discuss the authors’ arguments in favor of Transformer models.
Why Transformer models?
The authors argue in favor of Transformers in section 4 by comparing them to previously extant options, namely recurrent neural networks (RNNs) and convolutional neural networks (CNNs).
RNNs had been discussed in the introduction, which I covered in the first part of this series. The disadvantage of RNNs is that they’re slow to train: each word has to be processed in sequence, so the computation can’t be parallelized.
However, I haven’t yet touched on CNNs. CNNs are commonly used to process images. Each layer of neurons might look at a sliding window of 3×3 pixels in the image. This lets each successive layer shrink the image down to fewer and fewer pixels, while representing concepts of greater and greater complexity. CNNs can also be applied to language, as if each sentence of N words were an N×1-pixel image.
What makes language difficult to understand is that some words are closely related to each other even when they’re quite far apart in the sentence. RNN and CNN architectures have difficulty modeling the relationships between distant words, because they have so many layers of neurons between the related words. An RNN basically has a layer of neurons between each adjacent pair of words in the sentence, so that far-apart words have many layers between them. A CNN shrinks down the sentence so that far-away words can be related, but it still takes on the order of log(N) layers to model relationships between words that are N words apart.
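To put rough numbers on this argument, here’s a quick back-of-the-envelope sketch. These are my own illustrative numbers, loosely following the path-length comparison in Table 1 of the paper, not anything from the paper’s experiments:

```python
import math

def layers_between(distance, kernel_width=3):
    """Very rough count of layers separating two words `distance` positions apart.
    My own illustrative numbers, following the paper's path-length comparison."""
    return {
        "RNN": distance,  # roughly one layer per intervening word
        # assuming dilated convolutions, so the receptive field multiplies
        # by the kernel width at each layer:
        "CNN": math.ceil(math.log(distance, kernel_width)),
        "self-attention": 1,  # every pair of words is connected directly
    }

for distance in (10, 100, 1000):
    print(distance, layers_between(distance))
# 10 {'RNN': 10, 'CNN': 3, 'self-attention': 1}
# 100 {'RNN': 100, 'CNN': 5, 'self-attention': 1}
# 1000 {'RNN': 1000, 'CNN': 7, 'self-attention': 1}
```

The self-attention row is the punchline of the next paragraph: no matter how far apart two words are, the path between them stays short.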
In contrast, the Transformer model has an attention mechanism that allows words to be related through just a few layers of neurons, no matter how far apart the words are in the sentence.
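Here’s a minimal sketch of what that attention mechanism computes, in plain NumPy. This is my own toy version of the paper’s scaled dot-product attention, leaving out the learned projections, masking, and multiple heads. The key point is that the weight matrix has one entry for every pair of positions, so the first word can attend to the fiftieth just as directly as to the second.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Toy scaled dot-product attention. Q, K, V are (n_tokens, d) arrays.
    In the real model they are learned linear projections of the token
    embeddings; that step is skipped here for brevity."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, n): one score per pair of positions
    scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                               # each output mixes information from all positions

n_tokens, d = 6, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(n_tokens, d))                   # stand-in for token embeddings
out = scaled_dot_product_attention(x, x, x)          # self-attention: Q = K = V = x
print(out.shape)                                     # (6, 4): one new vector per token
```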
A bit of external context: LLMs like ChatGPT do have a limitation in just how far apart words can be related. These models have what’s called a “context window”, which is the maximum number of tokens that a model can consider. GPT-3 has a context window of 2049 tokens, while GPT-4 has several different versions with context windows from 4k to 128k tokens. Outside of this context window, I believe the model will “forget” what was previously said.
I believe the context window exists because the more tokens the model considers at once, the more computation and memory it needs. The authors explain that since the attention module encodes a relationship between every pair of tokens, the cost grows with the square of the number of tokens.
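To see why the square matters, here’s a quick count of the pairwise attention scores for a few context lengths (my own arithmetic, just to illustrate the scaling):

```python
# Doubling the context length quadruples the number of pairwise scores
# that each attention layer has to compute and store.
for n_tokens in (2048, 4096, 8192, 128_000):
    print(f"{n_tokens} tokens -> {n_tokens ** 2:,} pairwise scores per attention layer")
# 2048 tokens -> 4,194,304 pairwise scores per attention layer
# 4096 tokens -> 16,777,216 pairwise scores per attention layer
# 8192 tokens -> 67,108,864 pairwise scores per attention layer
# 128000 tokens -> 16,384,000,000 pairwise scores per attention layer
```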
In this section, the authors also argue that Transformer models could be more explainable than other neural networks, because in theory you could show which words are paying attention to which other words. However, I am unconvinced. I mean, in this paper, each “multi-headed attention” module is repeated 6 times, and has 8 “heads”, so if you ask which words are paying attention to which other words, you’re going to get 6 × 8 = 48 distinct answers. I found an article with some animations showing what this would look like, and I think it might be a bit complex for the average user.
TL;DR: Compared to the alternatives, Transformer models have two advantages: they use parallel (rather than sequential) computation, and they can relate distant words through just a few layers of neurons. However, the cost of each attention layer grows with the square of the sequence length, which might be why GPT models have a “context window” limiting the number of tokens they can consider.
Experiments
In sections 5 and 6, the authors describe how they applied their Transformer model to the task of translating English to German and English to French. They used the WMT 2014 English-German and English-French datasets. To my understanding, these are basically benchmark datasets so that different researchers can evaluate their models on a level playing field, where everyone is using the same training and test data.
This page shows how different models have performed on the WMT 2014 English-German data set. “Attention is all you need” is responsible for the “Transformer Big” model in 2017. You may observe that it was the best model for its time, but has since been surpassed by other models.
They also tested a number of variants of the model, adjusting things like the number of layers, the number of attention heads, and the size of each layer. This is standard stuff that I don’t think we need to cover.
TL;DR: The authors test the model on a benchmark translation problem, and outperform previous models.
Discussion
I’m a professional data scientist with a background in physics. I do not specialize in neural networks, and do not keep up with all the computer science research. I have read a few papers though, and my overall impression is that every author has something to sell to you. Any author can convincingly argue in favor of whatever algorithm they’re proposing, and yet many such proposals never seem to make it to practical applications.
But the Transformer model has evidently been incredibly successful. According to Google Scholar, it’s been cited over 100,000 times. And in a mere 6 years, there have already been many public-facing applications. And of course, research didn’t stop there. Many LLMs are not merely scaled-up versions of the Transformer model; they make use of further advancements that we do not have space to discuss.
At their core, Transformer models are an innovation in neural network architecture that permits relationships between far-apart words. They were originally used for translation tasks, but have been adapted for more general language tasks. Naturally, most people treat LLMs as a black box, and perhaps you don’t really need to understand their inner workings to use them or talk about them. But I hope that this series has shown why LLMs are a bit more than just a fancy autocomplete.
Ketil Tveiten says
Interesting stuff! One could argue that the last sentence should read «… shown how LLM’s are a fancy autocomplete», though.
It is probably useful for understanding these sorts of models to observe that they all do the same thing (encode data as points in a vector space; then via some linear algebra construct a partial linear function that measures the thing you care about; then optimize that function by fitting it to match the training data), and all the clever stuff people are doing is in the first step (finding a good encoding) and the second (setting up the initial state of that PL-function so the third step will converge in reasonable time). In this case, the word vectors and the attention-architecture respectively.
Siggy says
@Ketil Tveiten,
Most autocomplete functions are pretty basic, e.g. looking at the last N words, or with Google autocomplete just looking up common queries. Transformer models are a lot more involved than that. But also, Transformers aren’t just about predicting the next word in a sentence. Decoder-only models such as GPT do that, but the original Transformer model was a translation model. There are also encoder-only models such as BERT, which don’t generate text at all.
Ketil Tveiten says
That’s sort of what I was trying to say: calling this «fancy autocomplete» isn’t wrong; it is an autocomplete algorithm, but that understates just how fancy it is.
… but also it isn’t that fancy when taking the broader perspective; it is after all basically just the same high-dimensional curve fitting stuff as all the other machine learning algorithms. It really is impressive how broad the class of problems this approach can make tractable turns out to be, even if it shouldn’t really be surprising in hindsight. Thanks for these enlightening posts on the details here, it’s fun reading.