Choosing a Pre-Trained Large Language Model

In this article we will look at different types of pre-trained models and see how they are suited to different tasks - this can help you choose the best model for your LLM use-case.

Pranath Fernando


July 6, 2023

1 Introduction

The project life cycle for generative AI was introduced in this previous article. As we saw there, there are a few tasks to complete before you can launch your generative AI app. Once you have defined your use case and decided how the LLM will operate within your application, the next step is to select a model to work with. Your first choice is between working with an existing model and training your own from scratch. In some situations, it may be advantageous to train your own model from scratch. In most cases, though, you’ll start building your application with an existing foundation model.

In this article we will look at different types of pre-trained models, and see how these are suited for different tasks. This can help you choose the best model for your LLM use-case.

2 Choosing Open Source Models

Many open-source models are available for members of the AI community to use in their applications. The developers of Hugging Face and PyTorch, two well-known frameworks for building generative AI applications, have created curated hubs where you can browse these models. A tremendously helpful feature of these hubs is the inclusion of model cards, which describe key information such as the best use cases for each model, how it was trained, and its known limitations.

The specific model you choose will depend on the details of the task you need to carry out. Because of differences in how the models are trained, different transformer architectures are better suited to different language tasks. To help you understand these differences and build intuition about which model to use for a particular task, let’s take a closer look at how large language models are trained. With this knowledge in hand, navigating the model hubs and selecting the best model for your use case will be simpler. Let’s start with a high-level look at the initial training process for LLMs.

3 The Training Process for Large Language Models

This initial training stage is usually called pre-training. LLMs encode a deep statistical representation of language. This understanding is developed during the pre-training phase, when the model learns from vast amounts of unstructured textual data, ranging from gigabytes to petabytes of text. The data is gathered from a variety of sources, including text scraped from the Internet and corpora assembled specifically for training language models. During this self-supervised learning stage, the model internalises the patterns and structures present in the language.

Depending on the model’s architecture, these patterns then allow the model to complete its training objective. During pre-training, the model weights are updated to minimise the loss of the training objective. The encoder generates an embedding, or vector representation, for each token. Pre-training also requires a large amount of compute and the use of GPUs. Note that training data gathered from public sources such as the Internet usually has to be processed to improve quality, address bias, and remove other harmful content. As a result of this data quality curation, often only 1% to 3% of the collected tokens are actually used for pre-training.
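As a rough illustration of what this filtering implies for data collection, here is a minimal sketch of the arithmetic. The function name `raw_tokens_needed` and the one-billion-token target are hypothetical, chosen only for the example; the 1% and 3% retention figures are the ones quoted above.

```python
import math
from fractions import Fraction


def raw_tokens_needed(target_tokens: int, keep_fraction: float) -> int:
    """Estimate how many raw tokens to collect so that `target_tokens`
    remain after quality filtering keeps only `keep_fraction` of them."""
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in the range (0, 1]")
    # Fraction(str(...)) avoids floating-point rounding, e.g. 0.01 -> 1/100.
    return math.ceil(Fraction(target_tokens) / Fraction(str(keep_fraction)))


# Hypothetical target: one billion tokens of curated training data.
target = 1_000_000_000
for keep in (0.01, 0.03):
    needed = raw_tokens_needed(target, keep)
    print(f"keep {keep:.0%}: collect about {needed:,} raw tokens")
```

In other words, at these retention rates the raw corpus has to be somewhere between roughly 33 and 100 times larger than the curated data set you want to train on.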

4 The Three Variants of the Transformer Model

If you choose to pre-train your own model, you should take this data filtering into account when estimating how much data you need to collect. The transformer model comes in three variants: encoder-only, encoder-decoder, and decoder-only. Each is trained with a different objective, and so learns to carry out different tasks.

4.1 Autoencoding Models

Autoencoding models, also referred to as encoder-only models, are pre-trained using masked language modelling. Tokens in the input sequence are randomly masked, and the training objective is to predict the masked tokens in order to reconstruct the original text. This is sometimes referred to as a denoising objective.
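The masking step can be sketched in a few lines of plain Python. This is a simplified illustration rather than the preprocessing of any particular model: whitespace-separated words stand in for real subword tokens, and the function name `mask_tokens`, the `[MASK]` string, and the 15% default masking rate are assumptions made for the example.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed placeholder string for a masked position


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly mask tokens and return (model_input, labels).

    labels[i] holds the original token where the input was masked,
    and None where the position is not scored by the training loss.
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)   # the model must reconstruct this token
        else:
            masked.append(tok)
            labels.append(None)  # unmasked positions are ignored
    return masked, labels


sentence = "the teacher teaches the student".split()
model_input, targets = mask_tokens(sentence, mask_prob=0.4, seed=3)
print(model_input)
print(targets)
```

The model receives the masked sequence as input and is trained to recover the original token at every masked position, which it can only do by using the context on both sides.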

Autoencoding models build bi-directional representations of the input sequence, meaning that the model sees the full context of a token and not just the tokens that come before it. Encoder-only models are best suited to tasks that benefit from this bi-directional context. They can be used for sentence-level tasks such as sentiment analysis, and for token-level tasks such as named entity recognition or word classification. BERT and RoBERTa are two well-known examples of autoencoding models.

4.2 Autoregressive Models

Decoder-only, or autoregressive, models are pre-trained using causal language modelling. Here, the training objective is to predict the next token based on the preceding sequence of tokens. Researchers sometimes refer to next-token prediction as full language modelling. Decoder-based autoregressive models mask the input sequence so that the model can only see the input tokens leading up to the token in question. The model has no knowledge of how the sentence ends. To predict the next token, the model iterates over the input sequence one token at a time. In contrast to the encoder architecture, this means that the context is unidirectional.
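A small sketch may help make the unidirectional context concrete. The two helper functions below are hypothetical names introduced for illustration: `next_token_examples` lists the (context, next token) pairs that a causal language model is trained on, and `causal_mask` builds the lower-triangular visibility pattern in which position i can only see positions up to and including i. Whitespace-separated words stand in for real tokens.

```python
def next_token_examples(tokens):
    """Return (context, target) pairs: each target token is predicted
    from the tokens that precede it, never from those that follow."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def causal_mask(n):
    """Return an n x n matrix where entry [i][j] is 1 if position i
    may attend to position j (j <= i), and 0 otherwise."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = "the teacher teaches the student".split()

for context, target in next_token_examples(tokens):
    print(f"{' '.join(context):<28} -> {target}")

for row in causal_mask(len(tokens)):
    print(row)
```

Every row of the mask has zeros to the right of the diagonal, which is what prevents the model from seeing the end of the sentence while it predicts earlier tokens.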

By learning to predict the next token from a vast number of examples, the model builds a statistical representation of language. Models of this type use the decoder component of the original architecture, without the encoder. Decoder-only models are often used for text generation, although larger decoder-only models also show strong zero-shot inference abilities and can often perform a range of tasks well. GPT and BLOOM are well-known examples of decoder-based autoregressive models.

4.3 Sequence to Sequence Models

The final variant of the transformer is the sequence-to-sequence model, which uses both the encoder and decoder components of the original transformer architecture. The exact details of the pre-training objective vary from model to model. T5, a well-known sequence-to-sequence model, pre-trains the encoder using span corruption, which masks random sequences of input tokens. Each masked sequence is then replaced with a unique sentinel token, denoted here as x. Sentinel tokens are special tokens added to the vocabulary that do not correspond to any actual word from the input text. The decoder is then tasked with reconstructing the masked token sequences autoregressively. The output is the sentinel token followed by the predicted tokens.
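Span corruption can be sketched as follows. This is a simplified illustration, not T5's actual preprocessing: the spans to mask are passed in explicitly rather than sampled at random, whitespace-separated words stand in for real tokens, and the function name `span_corrupt` is hypothetical. The `<extra_id_N>` sentinel naming is borrowed from the Hugging Face T5 tokenizer and is an assumption of this example, not something specified in this article.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a unique sentinel token.

    `spans` must be sorted, non-overlapping, half-open index ranges.
    Returns (encoder_input, decoder_target).
    """
    encoder_input, decoder_target = [], []
    pos = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"              # one sentinel per span
        encoder_input.extend(tokens[pos:start])   # keep the unmasked text
        encoder_input.append(sentinel)            # the span collapses to a sentinel
        decoder_target.append(sentinel)           # the target names the span...
        decoder_target.extend(tokens[start:end])  # ...then lists its tokens
        pos = end
    encoder_input.extend(tokens[pos:])            # text after the last span
    return encoder_input, decoder_target


tokens = "the teacher teaches the student".split()
enc, dec = span_corrupt(tokens, [(1, 3)])
print(enc)  # ['the', '<extra_id_0>', 'the', 'student']
print(dec)  # ['<extra_id_0>', 'teacher', 'teaches']
```

The encoder sees the corrupted sentence, and the decoder is trained to emit each sentinel followed by the tokens it replaced.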

Sequence-to-sequence models can be used for translation, summarization, and question-answering. In general, they are useful when both your input and your output are bodies of text. Besides T5, another well-known encoder-decoder model is BART.

5 Overview of the Three Transformer Models

Here is a quick summary of the transformer model architectures and how each relates to its pre-training objective. Autoencoding models are pre-trained using masked language modelling. They correspond to the encoder part of the original transformer architecture and are often used for sentence classification or token classification. Autoregressive models are pre-trained using causal language modelling. They use the decoder component of the original transformer architecture and are often used for text generation. Sequence-to-sequence models use both the encoder and decoder components of the original transformer architecture.

The exact details of the pre-training objective vary from model to model; T5, for example, is pre-trained using span corruption. Sequence-to-sequence models are often used for translation, summarization, and question-answering. Now that you have seen how the different model architectures are trained and the tasks they are best suited to, you can choose the type of model that best fits your use case. Another thing to keep in mind is that larger models of any architecture are typically more capable of carrying out their tasks well. Research has found that the larger a model is, the more likely it is to work as needed without additional in-context learning or further training.
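The summary above can also be written down as a small lookup table. This is only a sketch for building intuition: the `ARCHITECTURES` dictionary and the `suggest_architectures` helper are hypothetical names, and the task lists contain only the tasks and example models mentioned in this article, so they are not exhaustive.

```python
# Architecture -> pre-training objective, typical tasks, example models,
# as summarised in this article (not an exhaustive list).
ARCHITECTURES = {
    "encoder-only": {
        "objective": "masked language modelling",
        "tasks": {"sentiment analysis", "named entity recognition",
                  "word classification"},
        "examples": ["BERT", "RoBERTa"],
    },
    "decoder-only": {
        "objective": "causal language modelling",
        "tasks": {"text generation"},
        "examples": ["GPT", "BLOOM"],
    },
    "encoder-decoder": {
        "objective": "varies by model (span corruption for T5)",
        "tasks": {"translation", "summarization", "question answering"},
        "examples": ["T5", "BART"],
    },
}


def suggest_architectures(task):
    """Return the architectures whose listed tasks include `task`."""
    task = task.strip().lower()
    return [name for name, info in ARCHITECTURES.items()
            if task in info["tasks"]]


for task in ("translation", "sentiment analysis", "text generation"):
    print(f"{task}: {suggest_architectures(task)}")
```

A lookup like this is only a starting point, since larger models, decoder-only ones in particular, can often handle tasks beyond those listed for their architecture.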

6 The Future of Transformer Models

In recent years, the observed trend of model capability increasing with size has driven the development of larger and larger models. This growth has been fuelled by inflection points in research, such as the introduction of the highly scalable transformer architecture, access to enormous amounts of training data, and the development of more powerful compute resources. The steady increase in model size has led some researchers to suggest that a new Moore’s law for LLMs may have emerged.

You may be asking: can we just keep adding parameters to improve performance and make models smarter? Where might this growth in model size lead? While this may sound exciting, it turns out that training these enormous models is difficult and very expensive, so much so that continually training larger and larger models may become infeasible.

7 Acknowledgements

I’d like to express my thanks to the wonderful Generative AI with Large Language Models Course by DeepLearning.AI and AWS - which I completed, and acknowledge the use of some images and other materials from the course in this article.