An Introduction to the Transformer Model - The power behind recent advances in AI

In this non-technical article we describe the basics of how transfomer models work which is the underlying technology behind Chat-GPT and most of the recent advances in AI

Pranath Fernando


March 29, 2023

1 Introduction

AI and Deep Learning Models are behind recent applications such as Chat-GPT and GPT-4 which have amazed the world, and have created exciting possibilities for applications for business and society. But how do these models actually work? Most of the explanations online are deeply techincal which can make these models hard to understand for many people. Admitedly, most of my own previous articles on this topic have also gone more into the technical details of how these models work, yet I also believe the essence of these models can be explained without any technical details or code. The main technology behind these recent advances is something called the Transfomer Model which was first created in 2017.

In this article, I aim to give a high-level and non-technical overview of how transfomer models work, and the types of tasks they can peform.

2 Where did Transfomer Models come from

Transfomer models came from within a sub-discipline of AI called Natural Language Processing. Its the part of AI concerned with giving computers the ability to understand text and spoken words in much the same way human beings which has been an active area of research since the 1950’s.

In 2015 the team behind Google Translate started using Neural Networks for machine translation for human languages which did much better than previous methods. Yet even this method had some limitations, most notably something called the information bottleneck issue that basically meant as the text you wanted to translate got longer it became more difficult to translate the text well.

In 2017 the Google Brain team announced the creation of a new Transfomer architecture in the now famous research paper Attention Is All You Need. They developed this specically to solve the problem with Google Translate and the ‘information bottlneck’ that had issues translating longer texts. The new Transformer model was easily able to translate longer and longer texts with no problems, and its important to understand that the original intention of this research was to solve this problem.

Yet this radically new model in AI created great excitement in the field, and many other researchers started to try it out to solve different types of problems such as in computer vision, voice recognition and more with great success - including most recently Chat-GPT and GPT-4. In fact it has now been so successful in so many areas, some are starting to consider if Transfomers could even be a general purpose problem solving model. It’s certainly worth noting this is one of the greatest examples of the value of free, open and collaberative scientific research, which enables researchers to build on and experiment with the work of others, leading to unexpected benefits.

3 What can Transformer Models do

Transfomer models are being used for many tasks and problems currently including:

  • Text Classification
  • Sentiment Analysis
  • Machine translation
  • Named entity recognition (NER)
  • Text summarization
  • Text generation
  • Question & answering
  • Biological sequence analysis
  • Computer Vision
  • Time Series Analysis
  • Video understanding

4 What is a Transfomer Model

Recall that Transfomers were orginally created to help improve machine translation, so translating from one sequence of text to another sequence of text.

A Transfomer model is primarily composed of two blocks:

  • Encoder (left): The encoder receives an input and builds a representation of it (its features). This means that the model is optimized to acquire understanding from the input.
  • Decoder (right): The decoder uses the encoder’s representation (features) along with other inputs to generate a target sequence. This means that the model is optimized for generating outputs.

Each of these parts can be used independently or together, depending on the task:

  • Encoder-only models: Good for tasks that require understanding of the input, such as sentence classification and named entity recognition.
  • Decoder-only models: Good for generative tasks such as text generation.
  • Encoder-decoder models or sequence-to-sequence models: Good for generative tasks that require an input, such as translation or summarization.

The original use of this for machine translation - so was therefore an encoder-decoder type transformer model.

5 Attention Layers

A key feature of Transformer models is that they are built with special layers called attention layers. In fact, the title of the paper introducing the Transformer architecture was “Attention Is All You Need”. Here, all we need to know is that this layer will tell the model to pay specific attention to certain words in the sentence you passed it (and more or less ignore the others) when dealing with the representation of each word.

To put this into context, consider the task of translating text from English to French. Given the input “You like this course”, a translation model will need to also attend to the adjacent word “You” to get the proper translation for the word “like”, because in French the verb “like” is conjugated differently depending on the subject. The rest of the sentence, however, is not useful for the translation of that word. In the same vein, when translating “this” the model will also need to pay attention to the word “course”, because “this” translates differently depending on whether the associated noun is masculine or feminine. Again, the other words in the sentence will not matter for the translation of “this”. With more complex sentences (and more complex grammar rules), the model would need to pay special attention to words that might appear farther away in the sentence to properly translate each word.

The same concept applies to any task associated with natural language: a word by itself has a meaning, but that meaning is deeply affected by the context, which can be any other word (or words) before or after the word being studied.

Now that we have an idea of what attention layers are all about, let’s take a closer look at the Transformer architecture.

6 The Original Architecture

The Transformer architecture was originally designed for translation as we described previously. During training, the encoder receives inputs (sentences) in a certain language, while the decoder receives the same sentences in the desired target language. In the encoder, the attention layers can use all the words in a sentence (since, as we just saw, the translation of a given word can be dependent on what is after as well as before it in the sentence). The decoder, however, works sequentially and can only pay attention to the words in the sentence that it has already translated (so, only the words before the word currently being generated). For example, when we have predicted the first three words of the translated target, we give them to the decoder which then uses all the inputs of the encoder to try to predict the fourth word.

To speed things up during training (when the model has access to target sentences), the decoder is fed the whole target, but it is not allowed to use future words (if it had access to the word at position 2 when trying to predict the word at position 2, the problem would not be very hard!). For instance, when trying to predict the fourth word, the attention layer will only have access to the words in positions 1 to 3.

The original Transformer architecture looked like this, with the encoder on the left and the decoder on the right:

Note that the first attention layer in a decoder block pays attention to all (past) inputs to the decoder, but the second attention layer uses the output of the encoder. It can thus access the whole input sentence to best predict the current word, also known as Bi-directional Attention. This is very useful as different languages can have grammatical rules that put the words in different orders, or some context provided later in the sentence may be helpful to determine the best translation of a given word.

The attention mask can also be used in the encoder/decoder to prevent the model from paying attention to some special words — for instance, the special padding word used to make all the inputs the same length when batching together sentences.

7 Encoder Models

Encoder models use only the encoder of a Transformer model. At each stage, the attention layers can access all the words in the initial sentence. These models are often characterized as having “bi-directional” attention, and are often called auto-encoding models.

The pretraining of these models usually revolves around somehow corrupting a given sentence (for instance, by masking random words in it) and tasking the model with finding or reconstructing the initial sentence.

Encoder models are best suited for tasks requiring an understanding of the full sentence, such as sentence classification, named entity recognition (and more generally word classification), and extractive question answering.

Representatives of this family of models include:

  • BERT
  • DistilBERT
  • RoBERTa

8 Decoder Models

Decoder models use only the decoder of a Transformer model. At each stage, for a given word the attention layers can only access the words positioned before it in the sentence. These models are often called auto-regressive models.

The pretraining of decoder models usually revolves around predicting the next word in the sentence.

These models are best suited for tasks involving text generation.

Representatives of this family of models include:

  • CTRL
  • GPT
  • GPT-2
  • Transformer XL

9 Encoder-Decoder Models

Encoder-decoder models (also called sequence-to-sequence models) use both parts of the Transformer architecture. At each stage, the attention layers of the encoder can access all the words in the initial sentence, whereas the attention layers of the decoder can only access the words positioned before a given word in the input.

The pretraining of these models can be done using the objectives of encoder or decoder models, but usually involves something a bit more complex. For instance, T5 is pretrained by replacing random spans of text (that can contain several words) with a single mask special word, and the objective is then to predict the text that this mask word replaces.

Sequence-to-sequence models are best suited for tasks revolving around generating new sentences depending on a given input, such as summarization, translation, or generative question answering.

Representatives of this family of models include:

  • BART
  • mBART
  • Marian
  • T5

This completes our basic overview of the Transfomer model, I hope you found it insightful !

10 Acknowledgements

I’d like to express my thanks to the great Hugging Face Course which i completed, and acknowledge the use of some images, content and other materials from the course in this article.