Transformer models have replaced RNNs, enabling state-of-the-art (SOTA) results in many NLP tasks. The Encoder-Decoder architecture, the Attention mechanism, and Transfer Learning form the foundation of modern NLP. Read on to understand how the training dataset travels through the network, how its shape changes for the Attention calculations, and how transfer learning dramatically improves model performance. Code included below!
In the previous post, I reviewed recurrent neural architectures applied to an everyday problem: translation. While the results were quite good (for a test scenario), there are several reasons why RNNs have lost popularity, and we now live in a world where the Transformer is king.
The main reasons for the disuse of RNNs are:
These reasons, plus the recent availability of specialized parallel computing hardware such as TPUs, have led to the decline of architectures such as LSTMs, GRUs, and other types of RNNs in favor of Transformer-based models.
Due to this popularity, I want to focus this article on the Transformer architecture, reviewing it through a simple model capable of predicting a person's birthplace. I also analyze the structure of the training dataset to understand how its dimensions change during the Attention calculations.
Finally, I explore the transfer learning paradigm by running several training passes and evaluating their results to see how much pretraining helps us achieve better results.
The Transformer Architecture. Source: Dive into Deep Learning
A Transformer is a seq2seq model composed of a stack of encoders that receives an input and produces an output, which in turn feeds a stack of decoders in charge of making predictions. Despite the excellent results Transformers have helped us achieve, they also come with some drawbacks. The biggest one is the quadratic computation cost of the Self-Attention mechanism. If you are new to this kind of architecture, I'd recommend starting with this excellent lecture by Professor Pascal Poupart of the University of Waterloo.
After reviewing this architecture, these are the most relevant points I noted:
What about the information received by the Encoder? How is the transformer fed?
I have always thought that we can learn a lot from a system if we pay attention to the data flowing across its components. Perhaps this kind of analysis is part of the interpretability efforts to describe what neural networks are doing, undoubtedly an interesting topic I hope to explore in the future.
In this case, I examined the pathway and transformations the dataset experiences within a character-level model. This model is a fork of Andrej Karpathy's minGPT, with modifications done by Stanford's CS224N TAs (thank you guys!). You can find the complete codebase here. The objective of the model is to predict the place where a person was born.
To improve the model's performance, we first executed a pretraining phase to build knowledge about people's names and their places of birth. The image below shows a simplified view of the training modules and how the data moves between them:
As the image shows, the train_dataset variable stores the training dataset. This variable is an object containing sentences with person names and the places where they were born, for example:
Eyolf Kleven. Born in Copenhagen, Kleven played as a midfielder for AB from 1927 to 1944 .
We'll later use these sentences to apply a span-corruption algorithm that will be the basis of the pretraining objective. The sentences are transformed into numeric values by using a simple mapping against the dataset vocabulary. As our sequences are made of characters, the above example, once corrupted and padded, looks like this:
tensor([33, 80, 70, 67, 61, 3, 39, 67, 60, 77, 60, 69, 14, 3, 30, 70, 73, 69, 3, 64, 69, 3, 31, 1, 60, 69, 3, 71, 67, 56, 80, 60, 59, 3, 56, 74, 3, 56, 3, 68, 64, 59, 61, 64, 60, 67, 59, 60, 1, 70, 71, 60, 69, 63, 56, 62, 60, 69, 12, 3, 39, 67, 60, 77, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
The number 0 represents the pad token we use to ensure all sentences are equal in length, and the number 1 represents the span-corruption character. The first number, 33, represents the letter E from "Eyolf".
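The character-to-index mapping described above can be sketched in a few lines of plain Python. This is only an illustration, not the project's actual dataset class: the names `PAD_CHAR`, `MASK_CHAR`, `build_vocab`, and `encode` are hypothetical, the two placeholder symbols are assumptions, and the indices it produces will differ from the ones in the tensor above because the vocabulary here is built from a single sentence.

```python
# Sketch of a character-level vocabulary with reserved pad and mask indices.
# All names below are hypothetical and chosen for illustration only.
PAD_CHAR = "\u25a1"   # assumed placeholder symbol for padding (index 0)
MASK_CHAR = "\u2047"  # assumed placeholder symbol for span corruption (index 1)


def build_vocab(corpus):
    """Map every character in the corpus to an integer index."""
    chars = sorted(set(corpus) - {PAD_CHAR, MASK_CHAR})
    itos = [PAD_CHAR, MASK_CHAR] + chars  # index -> character
    stoi = {ch: i for i, ch in enumerate(itos)}  # character -> index
    return stoi, itos


def encode(text, stoi, block_size):
    """Convert text to indices and right-pad with the pad index."""
    ids = [stoi[ch] for ch in text[:block_size]]
    return ids + [stoi[PAD_CHAR]] * (block_size - len(ids))


corpus = "Eyolf Kleven. Born in Copenhagen, Kleven played as a midfielder."
stoi, itos = build_vocab(corpus)
ids = encode("Eyolf Kleven.", stoi, block_size=20)
decoded = "".join(itos[i] for i in ids if i != stoi[PAD_CHAR])
print(ids)
print(decoded)
```

Reserving the first two indices for the special characters keeps them stable no matter which characters appear in the corpus, which is convenient when the loss has to ignore the pad index.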
At this point (up until GPT.forward), the batch of training examples we are feeding into our transformer looks like this:
As these fixed representations of characters don't allow learning the existing relationships between them, we need to add a new dimension. We do this in the forward pass of the Transformer blocks:
def forward(self, idx, targets=None):
    b, t = idx.size()
    assert t <= self.block_size, "Cannot forward, model block size is exhausted."

    # forward the GPT model
    token_embeddings = self.tok_emb(idx)  # each index maps to a (learnable) vector
    position_embeddings = self.pos_emb[:, :t, :]  # each position maps to a (learnable) vector
    x = self.drop(token_embeddings + position_embeddings)
    x = self.blocks(x)
    x = self.ln_f(x)
    logits = self.head(x)

    # if we are given some desired targets also calculate the loss
    loss = None
    if targets is not None:
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=0)

    return logits, loss
As you can see, we add the token and position embeddings to the input we received from the run_epoch loop. Our input data structure now looks like this:
Notice how each letter (integer number) is now represented with a d-dimensional learnable vector. From now on, this input representation (X in the following paragraphs) is used throughout the model calculations. For example, the Attention mechanism will use these vectors to derive the queries, keys, and values to compare against other parts of the input space.
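To make the added dimension concrete, here is a minimal sketch in plain Python of how a batch of indices becomes a batch of vectors. It assumes toy sizes and uses nested lists instead of tensors; the names `tok_emb`, `pos_emb`, and `embed` only mirror the idea of the forward pass above and are not the model's code. Dropout is left out.

```python
import random

# Toy sizes; every value here is an illustrative assumption.
vocab_size, block_size, d = 10, 6, 4
rng = random.Random(0)

# One vector per vocabulary index and one per position.
# In the real model these would be learnable parameters.
tok_emb = [[rng.gauss(0.0, 0.02) for _ in range(d)] for _ in range(vocab_size)]
pos_emb = [[rng.gauss(0.0, 0.02) for _ in range(d)] for _ in range(block_size)]


def embed(batch):
    """Turn a (b, t) batch of indices into a (b, t, d) nested list."""
    return [
        [
            # Element-wise sum of the token vector and the position vector.
            [tok + pos for tok, pos in zip(tok_emb[idx], pos_emb[position])]
            for position, idx in enumerate(sequence)
        ]
        for sequence in batch
    ]


batch = [[3, 5, 1, 0], [2, 2, 7, 0]]  # b = 2 sequences, t = 4 indices each
x = embed(batch)
print(len(x), len(x[0]), len(x[0][0]))  # b, t, d
```

Because the position vector is added to the token vector, the same character at two different positions ends up with two different representations.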
Now, how does the Multi-Head Attention calculation take place within the Encoder? To answer this question, we need to determine how the tensors and weight matrices interact through the network. If we remember, any Encoder layer contains two sub-modules, the Multi-Headed Self-Attention mechanism and an FFN:
Where is the output and is calculated like this:
Let's take a closer look at the Attention mechanism (in this case based on a scaled dot-product scoring function):
As we can see in the image, the input tensor acts as query, key, and value. We can obtain and learn these parameters by applying the three linear transformations shown earlier.
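For reference, the standard scaled dot-product formulation from the Transformer literature can be written as below. The notation is an assumption made for this sketch: X is the embedded input, W^Q, W^K, and W^V are the learnable weight matrices of the three linear transformations, and d_k is the dimension of each key vector. It may not match the symbols used in the figures.

```latex
\begin{aligned}
Q &= X W^{Q}, \qquad K = X W^{K}, \qquad V = X W^{V} \\
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\end{aligned}
```

Dividing by the square root of d_k keeps the dot products from growing with the vector size, which would otherwise push the Softmax into regions with very small gradients.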
Dimension-wise, as X's shape is (b, t, d), where b is the batch size, t the sequence length, and d the embedding dimension, the output of the linear transformations will keep the same size. As we want to learn the interactions between different parts of X and identify which elements are more relevant than others, we split the input tensor into h chunks, where h is the number of Attention heads to use. Each head will focus on a particular aspect of the input, meaning the Attention calculations happen head-wise. To allow this, we reshape the tensors into (b, h, t, d/h), where the second and third dimensions are swapped to allow matrix operations.
As we are using a dot-product scoring function, we need to interchange the last two dimensions of the K tensor into (b, h, d/h, t) before multiplying it with the Q tensor. This product will produce a tensor of shape (b, h, t, t).
To get the Attention output, we apply the Softmax function and multiply the results with the Values tensor, getting an Attention output tensor of shape (b, h, t, d/h).
This last tensor holds the Attention calculations for all heads. We concatenate them by reshaping the tensor into (b, t, d) using the view method. The Attention output can now be fed into the FFN, which will produce a tensor of the same shape. The resulting tensor is the Encoder output, which will serve as input to the next Encoder layer or, if it was the last Encoder, as input to the Encoder-Decoder Multi-Head Attention mechanism of the Decoder.
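The head-wise computation described above can be sketched in plain Python for a single sequence. This is a simplified illustration under stated assumptions, not the model's implementation: it uses nested lists instead of tensors, drops the batch dimension, skips the learnable linear transformations (the input is used directly as queries, keys, and values), and applies no masking. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def split_heads(x, h):
    """(t, d) -> (h, t, d // h): each head gets one chunk of every vector."""
    hs = len(x[0]) // h
    return [[row[i * hs:(i + 1) * hs] for row in x] for i in range(h)]


def head_attention(q, k, v):
    """Scaled dot-product attention for one head; inputs are (t, d // h)."""
    scale = 1.0 / math.sqrt(len(k[0]))
    out = []
    for q_row in q:
        # One row of the (t, t) score matrix: this query against every key.
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out


def multi_head_attention(q, k, v, h):
    """Run attention head-wise and concatenate the results back to (t, d)."""
    heads = [head_attention(qh, kh, vh)
             for qh, kh, vh in zip(split_heads(q, h), split_heads(k, h), split_heads(v, h))]
    return [[value for head in heads for value in head[pos]] for pos in range(len(q))]


# Toy input: t = 3 positions, d = 4 features, h = 2 heads.
x = [[1.0, 0.0, 0.5, 0.2],
     [0.0, 1.0, 0.1, 0.9],
     [0.3, 0.3, 0.7, 0.4]]
out = multi_head_attention(x, x, x, h=2)
print(len(out), len(out[0]))  # same (t, d) shape as the input
```

The output has the same shape as the input, which is what allows the result to be passed to the FFN and then to the next layer without any further reshaping.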
Initially, we ran a supervised training of our model with a small dataset to teach it to predict the birthplace of some people, but the results were not exactly good :)
How can we improve this? Of course, we can increase the training dataset size, but we can also initialize our model parameters differently. That is the idea behind pretraining. Pretraining trains our model in two steps:
The second step, also known as Finetuning, is supervised training for our downstream task. This technique is an excellent idea since the parameters learned in the first step include some general knowledge about the language, making the second training step quicker and more effective, and it requires less data.
This new paradigm comes in three flavors, all of them using the Transformer architecture:
A GPT-3 generated news article. Source: Language Models are Few-Shot Learners
Overall pre-training and fine-tuning procedures for BERT. Source: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A diagram of T5 text-to-text framework. Source: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
So it seems this is the way to go these days: pre-train your model and then use its parameters to initialize the model's parameters for the second round of training (finetuning) using a downstream objective and dataset.
Let's see this paradigm in action in the next section. Each training task (except the finetuning) was done in about 2 hours using an Azure GPU VM with the following characteristics:
We feed the model with a dataset containing a question about the place of birth of a person and then the ground truth. Something like this:
Where was Eyolf Kleven born? Copenhagen
Where was Marvano born? Belgium
Where was Naïma Azough born? Morocco
Where was Eugenia Popa born? Bucharest
The configuration was:
The performance after training was not good, compared with the baseline of predicting "London" for any input question (correct rate of 5%). This result is not surprising, considering that the training dataset size is small. Either way, it is fine since the purpose of the exercise is not to achieve top results but to see how pretraining boosts the model's performance.
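A constant-answer baseline like the "London" one is simple to compute: count how often the fixed guess matches the ground truth. The snippet below is a minimal sketch; the function name `constant_baseline_accuracy` is hypothetical and the answer list is made-up illustration data, not the evaluation set used in this post.

```python
def constant_baseline_accuracy(answers, guess="London"):
    """Fraction of examples for which a fixed guess is the right answer."""
    if not answers:
        return 0.0
    return sum(answer == guess for answer in answers) / len(answers)


# Hypothetical ground-truth birthplaces, for illustration only.
answers = ["Copenhagen", "London", "Morocco", "Bucharest"]
print(constant_baseline_accuracy(answers))
```

Any trained model should beat this number before we can claim it learned something beyond the most frequent answer.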
How can we improve the performance of our model? One evident option is to increase the size of our supervised training dataset. Sadly, that might not be viable, since the cost of creating a large custom dataset can be prohibitive. Training on more data also costs more compute, and every sequence already pays the quadratic cost of Attention in its length.
A solution for this case is pretraining. In this example, we are using a larger unsupervised dataset from Wikipedia. The dataset contains the name of a person and the corresponding birthplace. As for the training objective, we set it to predict a corrupted span of the input sequence.
The input entries look like this:
Greta Knutson. Born in Stockholm, Greta Knutson studied at the Kungliga Konsthögskolan, and settled in Paris, France during the early 1920s .
Paul Daniels. Paul Daniels (born 4 June 1981 in Burlington) is an American rower .
Tom Jones. Tom Jones (born April 26, 1943) is an American former racing driver, born in Dallas, Texas .
Elizabeth Bartlett. Elizabeth Bartlett (24 April 1924 Deal, Kent - 18 June 2008) was a British poet .
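Decoding the tensor shown earlier in this post suggests the corrupted layout: a truncated sentence is split into prefix, hidden span, and suffix, then rearranged as prefix, mask character, suffix, mask character, hidden span, and padding. The sketch below reproduces that layout in plain Python. It is an assumption-laden illustration rather than the project's code: `corrupt_span` is a hypothetical name, the two placeholder symbols, the block size, and the way the truncation length and span are sampled are all choices made for this example.

```python
import random

MASK_CHAR = "\u2047"  # assumed placeholder for the span-corruption character
PAD_CHAR = "\u25a1"   # assumed placeholder for the pad character


def corrupt_span(document, block_size, rng):
    """Return prefix + MASK + suffix + MASK + hidden span, padded to block_size."""
    # Truncate so that the two mask characters still fit in the block.
    max_len = min(len(document), block_size - 2)
    truncated = document[:rng.randint(4, max_len)]
    # Pick a random span to hide; the rest becomes prefix and suffix.
    span_len = rng.randint(1, len(truncated) // 2)
    start = rng.randint(0, len(truncated) - span_len)
    prefix = truncated[:start]
    masked = truncated[start:start + span_len]
    suffix = truncated[start + span_len:]
    corrupted = prefix + MASK_CHAR + suffix + MASK_CHAR + masked
    return corrupted + PAD_CHAR * (block_size - len(corrupted)), masked


rng = random.Random(0)
doc = "Eyolf Kleven. Born in Copenhagen, Kleven played as a midfielder for AB from 1927 to 1944 ."
sample, hidden = corrupt_span(doc, block_size=128, rng=rng)
print(sample.rstrip(PAD_CHAR))
print(hidden)
```

With this layout the model reads the visible context first and then has to generate the hidden span after the second mask character, which turns plain text into a prediction task without any manual labels.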
The training settings were these:
After training, we saved the model parameters on disk for later use.
Now it is time to check if pretraining helps. To do so, we loaded the parameters from the pretraining stage and retrained the model with the downstream task dataset (question/answer pairs). The training parameters were as follows:
The model's evaluation showed that the performance increased 14 times! Wow! That is amazing, considering we only ran the finetuning task for ten epochs.
These results confirm that the combination of pretraining, finetuning, and the Transformer architecture has been the holy grail of Machine Learning since 2017. It just works.
While it is true that the Transformer is crucial to achieving SOTA results in various NLP tasks, it is not perfect. Recent efforts in the AI community center on reducing or even eliminating the quadratic operations in the Attention mechanism.
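A quick back-of-the-envelope calculation shows why the quadratic term matters: every head compares each position with every other position, so the number of scores grows with the square of the sequence length. The helper below is a hypothetical illustration, and the sequence lengths and head count are arbitrary example values.

```python
def attention_score_count(t, h=1):
    """Number of query-key scores computed per sequence: h heads times t * t."""
    return h * t * t


for t in (128, 256, 512):
    print(t, attention_score_count(t, h=8))
```

Doubling the sequence length multiplies the number of scores by four, which is the cost that alternative Attention variants try to avoid.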
During this exercise, we also implemented an alternative Attention computation that eliminates the quadratic cost. It was a variant of the Synthesizer Attention, used for both the pretraining and finetuning tasks. Unfortunately, the results were not promising: they were similar to the baseline, which is better than the performance of the model trained without pretraining, but still modest.
What now? I am finally concluding my journey of learning modern NLP. It is satisfying and exciting to review more and more recent papers. I now feel much better prepared to face the upcoming challenges in my career. It is time to continue reading about the Attention mechanism and its improvements and how to apply this architecture to solving everyday problems for personal or business use cases!
As Dr. Károly Zsolnai-Fehér says: "What a time to be alive!"