A High Level Overview of the Transformer Model - The Magic Behind Recent Advances in AI

In this article we will take a high level non-technical view of key aspects of the Transformer Model - the technology behind recent advances in AI

Pranath Fernando


July 1, 2023

1 Introduction

Recent advances in AI such as ChatGPT have demonstrated impressive abilities for performing a wide range of tasks previously only done by humans. The key technology used in these models is called the Transformer Model. In previous articles I’ve looked at the detailed theoretical underpinnings of this model as well as practical use cases. In this article we will take a high level non-technical view of key aspects of the Transformer Model that have enabled it to make the huge advances that it has made.

2 Language Models before Transformers

I have covered language models in previous articles, but it is worth recalling that some of the earliest and simplest language models essentially just predicted the next word in a sequence.

Language models are not brand new. Earlier generations of language models used an architecture called recurrent neural networks, or RNNs. Although powerful for their time, RNNs were limited by the amount of compute and memory needed to perform well at generative tasks. Let’s look at an example of an RNN carrying out a simple next-word prediction task.

With only one previous word to go on, the model cannot make a good prediction. As you scale the RNN implementation so that it can see more of the preceding words in the text, the resources the model uses have to be scaled up significantly as well.

Even after you scale the model, it still has not seen enough of the input to make a reliable prediction. To predict the next word successfully, a model needs to see more than just the previous few words: it needs an understanding of the whole sentence, or even the whole document. The problem here is that language is complex. In many languages, one word can have multiple meanings; such words are called homonyms. Words can also be syntactically ambiguous, where the structure of a sentence allows more than one reading.

Take for example this sentence: “The teacher taught the students with the book.” Did the teacher teach using the book, did the students have the book, or was it both? How can an algorithm make sense of human language if sometimes we can’t?

Everything changed in 2017 with the publication of the paper ‘Attention is All You Need’ by researchers from Google and the University of Toronto. The transformer architecture had arrived. This new approach made possible the progress in generative AI that we see today. It can be scaled efficiently to use multi-core GPUs, it can process input data in parallel and so make use of much larger training datasets, and, most importantly, it can learn to pay attention to the meaning of the words it is processing. And attention is all you need, as the paper title says!

3 Language Models after Transformers

Large language models built on the transformer architecture performed dramatically better on natural language tasks than the earlier generation of RNNs, which led to a huge increase in generative capability. The power of the transformer architecture lies in its ability to learn the relevance and context of every word in a sentence: not just of each word to its neighbour, but of each word to every other word in the sentence. The model applies attention weights to those relationships, so it learns how relevant each word is to every other word, regardless of where they appear in the input.

Using attention weights, the algorithm can learn who has the book, who could have the book, and whether the book is even relevant to the wider context of the document. These attention weights are learned during LLM training. The attention weights between each word and every other word can be illustrated with a diagram known as an attention map. In this example, you can see that the word book is strongly connected with, or paying attention to, the words teacher and student. This process of learning the relationships between words across the entire input is known as self-attention, and it greatly improves the model’s ability to encode language.

Let’s look at how the transformer model works at a high level. The transformer architecture is split into two distinct parts: the encoder and the decoder. These components work in conjunction with each other and share a number of similarities. Also, note that the diagram you are looking at is adapted from the one in the original ‘Attention is All You Need’ paper. The model’s inputs are at the bottom and its outputs are at the top. Machine-learning models are essentially just large statistical calculators, and they work with numbers, not words.

3.1 Embedding Layers

So, before passing text into the model for processing, you must first tokenize the words. Simply put, this converts the words into numbers, with each number representing a position in a dictionary of all the possible words that the model can work with. There are several tokenization methods to choose from. For instance, token IDs can match whole words, or they can represent parts of words. What matters is that once you have selected a tokenizer to train the model, you use the same tokenizer when you generate text. Now that your input is represented as numbers, you can pass it to the embedding layer.
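To make the idea of token IDs concrete, here is a minimal sketch of a word-level tokenizer. The function names and the tiny vocabulary are hypothetical and built from a single sentence purely for illustration; real LLM tokenizers usually work on word fragments and have vocabularies of many thousands of entries.

```python
# Toy word-level tokenizer: a sketch, not a production tokenizer.
# The vocabulary is hypothetical and built from one example sentence.

def build_vocab(text):
    """Assign each distinct lowercase word an integer ID in first-seen order."""
    vocab = {}
    for word in text.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab

def encode(text, vocab):
    """Convert words into token IDs using the vocabulary."""
    return [vocab[word] for word in text.lower().split()]

def decode(token_ids, vocab):
    """Convert token IDs back into words using the same vocabulary."""
    id_to_word = {token_id: word for word, token_id in vocab.items()}
    return " ".join(id_to_word[token_id] for token_id in token_ids)

sentence = "the teacher taught the students with the book"
vocab = build_vocab(sentence)
token_ids = encode(sentence, vocab)
print(token_ids)
print(decode(token_ids, vocab))
```

Note that `decode` only recovers the text because it uses the same vocabulary as `encode`, which is the reason the same tokenizer has to be used for training and for generation.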

This layer is a trainable, high-dimensional vector embedding space in which each token is represented as a vector and occupies a unique location. Each token ID in the vocabulary is matched to a multi-dimensional vector, and the idea is that these vectors learn to encode the meaning and context of individual tokens in the input sequence. Embedding vector spaces have been used in natural language processing for some time; earlier language algorithms such as Word2vec use the same concept.

Looking at the example sequence again, you can see that in this simple scenario each word has been matched to a token ID, and each token is mapped to a vector. The vector size in the original transformer paper was 512, far larger than can be shown in a graphic. For simplicity, if you imagine a vector size of just three, you can plot the words in a three-dimensional space and see the relationships between them. Words that are related sit close to one another in the embedding space, and the distance between two words can be calculated as an angle, which gives the model the ability to understand language mathematically.
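The idea of measuring closeness as an angle can be sketched in a few lines. The three-dimensional vectors below are hypothetical values picked by hand for illustration; a real model learns its embeddings during training and uses far more dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
embeddings = {
    "teacher": [0.9, 0.8, 0.1],
    "student": [0.8, 0.9, 0.2],
    "book": [0.1, 0.3, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def angle_degrees(a, b):
    """Angle between two vectors in degrees (smaller means more similar)."""
    cosine = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.degrees(math.acos(cosine))

teacher_student = angle_degrees(embeddings["teacher"], embeddings["student"])
teacher_book = angle_degrees(embeddings["teacher"], embeddings["book"])
print(f"teacher vs student: {teacher_student:.1f} degrees")
print(f"teacher vs book:    {teacher_book:.1f} degrees")
```

With these made-up values, teacher and student point in nearly the same direction, while book sits at a much wider angle from both.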

As you add the token vectors into the base of the encoder or the decoder, you also add positional encoding. The model processes all of the input tokens in parallel, so on its own it has no sense of word order. Adding positional encoding preserves the information about word order, so the position of each word in the sentence is not lost.
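As an illustration, the sketch below uses the sinusoidal scheme from the original transformer paper, where even dimensions use a sine and odd dimensions use a cosine of the position. The function names are hypothetical, the vector size is assumed to be even, and other models use different positional schemes.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position; d_model is assumed to be even."""
    encoding = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))  # even dimension
        encoding.append(math.cos(angle))  # odd dimension
    return encoding

def add_position(token_vectors):
    """Add the positional encoding to each token vector, element by element."""
    d_model = len(token_vectors[0])
    return [
        [value + pos_value
         for value, pos_value in zip(vector, positional_encoding(position, d_model))]
        for position, vector in enumerate(token_vectors)
    ]

# Two identical token vectors end up different once position is added.
same_word_twice = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
with_position = add_position(same_word_twice)
print(with_position[0])
print(with_position[1])
```

The same word appearing at two positions now produces two different input vectors, which is how word order survives parallel processing.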

3.2 Self-Attention Layers

Once the input token vectors and the positional encodings have been summed, the resulting vectors are passed to the self-attention layer. Here, the model analyses the relationships between the tokens in your input sequence. As you saw earlier, this allows the model to attend to different parts of the input sequence to better capture the contextual dependencies between the words. The self-attention weights that are learned during training and stored in these layers reflect the importance of each word in the input sequence to every other word in the sequence.
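A stripped-down sketch of the calculation may help. It follows the scaled dot-product form of attention, but as a simplifying assumption it uses the input vectors directly as queries, keys and values; a real transformer first passes them through learned weight matrices. The function names and input values are hypothetical.

```python
import math

def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

def self_attention(vectors):
    """Return (attention_weights, outputs) for a list of equal-length vectors.

    Simplification: no learned projections, so each vector acts as its own
    query, key and value.
    """
    d = len(vectors[0])
    all_weights = []
    outputs = []
    for query in vectors:
        # Score this token against every token, scaled by sqrt(d).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in vectors
        ]
        weights = softmax(scores)
        all_weights.append(weights)
        # Output is the attention-weighted average of all token vectors.
        outputs.append([
            sum(w * vector[j] for w, vector in zip(weights, vectors))
            for j in range(d)
        ])
    return all_weights, outputs

vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attention_weights, outputs = self_attention(vectors)
for row in attention_weights:
    print([round(weight, 3) for weight in row])
```

Each printed row is one line of an attention map: how much one token attends to every token in the sequence, itself included.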

However, this doesn’t simply happen once; the transformer architecture actually includes multiple heads for self-attention.

This means that multiple sets of self-attention weights, or heads, are learned in parallel and independently of each other. The number of attention heads in the attention layer varies from model to model, but numbers in the range of 12 to 100 are common. The intuition is that each self-attention head will learn a different aspect of language. For example, one head may learn the relationships between the people in our sentence, another head may focus on the activity of the sentence, and yet another may focus on some other property, such as whether the words rhyme.

It’s important to note that you cannot decide in advance which aspects of language the attention heads will learn. The weights of each head are randomly initialised, and given sufficient training data and time, each head will learn different aspects of language. While some attention maps are easy to interpret, others may not be.
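The sketch below shows the mechanics of several heads working independently. Each head gets its own randomly initialised projection matrices, projects the input into a smaller space, runs attention there, and the head outputs are then concatenated for each token. This is an untrained illustration under stated assumptions: the helper names are hypothetical, the weights are random rather than learned, and the final output projection used in real transformers is omitted.

```python
import math
import random

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

def project(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng):
    """Randomly initialised weights, standing in for weights learned in training."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def attention_head(vectors, w_q, w_k, w_v):
    """One head: project to queries, keys and values, then attend."""
    queries = [project(w_q, v) for v in vectors]
    keys = [project(w_k, v) for v in vectors]
    values = [project(w_v, v) for v in vectors]
    d_head = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_head) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[j] for w, value in zip(weights, values))
            for j in range(d_head)
        ])
    return outputs

def multi_head_attention(vectors, num_heads, seed=0):
    """Run independent heads and concatenate their outputs per token."""
    rng = random.Random(seed)
    d_model = len(vectors[0])
    d_head = d_model // num_heads  # assumes d_model divides evenly
    heads = []
    for _ in range(num_heads):
        w_q = random_matrix(d_head, d_model, rng)
        w_k = random_matrix(d_head, d_model, rng)
        w_v = random_matrix(d_head, d_model, rng)
        heads.append(attention_head(vectors, w_q, w_k, w_v))
    return [[x for head in heads for x in head[t]] for t in range(len(vectors))]

tokens = [[1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.3, 0.7], [0.5, 0.5, 0.9, 0.1]]
combined = multi_head_attention(tokens, num_heads=2)
print(len(combined), len(combined[0]))
```

Because every head starts from different random weights, each one computes a different attention pattern over the same input, which is what lets training push the heads towards different aspects of language.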

3.3 Feed-Forward Layers & Final Output

Once all of the attention weights have been applied to the input data, the output is processed by a fully connected feed-forward network. The output of this layer is a vector of logits: unnormalised scores, one for each and every token in the tokenizer dictionary, where a higher logit corresponds to a higher probability.

These logits are then passed to a final softmax layer, which normalises them into a probability score for each word. Because this output includes a probability for every word in the vocabulary, there are likely to be thousands of scores.

One token will have a higher score than all the others, and this is the most likely next token. However, there are a variety of techniques you can use to vary the final selection from this vector of probabilities.
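The last two steps can be sketched as follows: softmax turns logits into probabilities, and a selection rule then picks the next token. The four-word vocabulary and the logit values are made up for illustration. Greedy selection and temperature-scaled random sampling are shown as two common examples of selection techniques; they are assumptions chosen for this sketch, not methods named in this article.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Normalise logits into probabilities; higher temperature flattens them."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

def greedy_choice(probabilities):
    """Always pick the index of the most probable token."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])

def sample_choice(probabilities, rng):
    """Pick an index at random, weighted by the probabilities."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

# Hypothetical vocabulary and logits for illustration only.
vocabulary = ["book", "students", "teacher", "taught"]
logits = [2.0, 0.5, 1.0, -1.0]

probabilities = softmax(logits)
next_token = vocabulary[greedy_choice(probabilities)]
sampled_token = vocabulary[sample_choice(softmax(logits, temperature=2.0), random.Random(0))]
print(next_token)
```

Greedy selection always returns the top-scoring token, while sampling sometimes picks a lower-scoring one, which is one way the final choice can be varied.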

4 Acknowledgements

I’d like to express my thanks to the wonderful Generative AI with Large Language Models Course by DeepLearning.ai and AWS, which I completed, and to acknowledge the use of some images and other materials from the course in this article.