Scaling Laws and Compute Optimal Large Language Models

In this article we look at research on the relationship between model size, training configuration, and performance, which aims to pinpoint the optimal size for large language models.

Pranath Fernando


July 8, 2023

1 Introduction

In this article we’ll look at research on the relationship between model size, training configuration, and performance, which aims to pinpoint the optimal size for large language models. Keep in mind that the goal of pre-training is to maximise the model’s performance on its learning objective, that is, to minimise the loss when predicting tokens.

2 Dataset vs Model Size

There are two ways to improve performance: increasing the size of the dataset you train on and increasing the number of parameters in your model. In theory, scaling either or both of these quantities improves performance. An additional constraint is your compute budget, which includes factors such as the number of GPUs you have access to and the time you have available for training. To make the discussion below easier to follow, let’s first define a unit of compute that quantifies the required resources.

A petaFLOP per second day is the number of floating point operations performed in one day at a rate of one petaFLOP per second. One petaFLOP per second is 1 quadrillion (10^15) floating point operations per second. For training transformers specifically, one petaFLOP per second day is roughly equivalent to eight NVIDIA V100 GPUs running at full efficiency for a full day. With a more powerful processor that can carry out more operations at once, you need fewer chips to reach the same total. For instance, two NVIDIA A100 GPUs deliver roughly the same compute as eight V100 chips.
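The definition above is simple arithmetic, and a minimal sketch makes the numbers concrete. The GPU counts are only the rough equivalences quoted in the text, not measured throughput, and the helper name `gpu_days` is made up for this illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds
PETAFLOP = 10**15                # one quadrillion floating point operations

# Total floating point operations in one petaFLOP per second day.
FLOPS_PER_PFS_DAY = PETAFLOP * SECONDS_PER_DAY

# Rough equivalences quoted in the text (assumed, not benchmarked):
# GPUs running flat out for one day to deliver one petaFLOP per second day.
GPUS_PER_PFS_DAY = {"V100": 8, "A100": 2}


def gpu_days(pfs_days: float, gpu: str) -> float:
    """GPU-days needed for a compute budget, using the rough ratios above."""
    return pfs_days * GPUS_PER_PFS_DAY[gpu]


print(f"One petaFLOP per second day = {FLOPS_PER_PFS_DAY:.3e} operations")
print(f"10 petaFLOP per second days on V100s: {gpu_days(10, 'V100')} GPU-days")
print(f"10 petaFLOP per second days on A100s: {gpu_days(10, 'A100')} GPU-days")
```

In practice GPUs rarely run at full efficiency, so real GPU-day counts would be higher than these idealised figures.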

To give you an idea of the scale of these compute budgets, this graph compares the petaFLOP per second days needed to pre-train different variants of BERT and RoBERTa, which are both encoder-only models, T5, an encoder-decoder model, and GPT-3, a decoder-only model. The models in each family differ in the number of parameters trained, which ranges from around a hundred million for BERT base to 175 billion for the largest GPT-3 variant. Note that the y-axis is logarithmic: each step up represents a factor of 10. We can see that T5 XL, with three billion parameters, took around 100 petaFLOP per second days to train.

By contrast, the larger GPT-3 model, with 175 billion parameters, needed about 3,700 petaFLOP per second days. This graph makes clear how much compute was needed to train the biggest models. As you can see, larger models require more compute to train and typically more data to perform well. It turns out that the relationships between these three scaling choices are fairly well defined. Researchers have studied the trade-offs between training dataset size, model size, and compute budget. Here is a figure from a paper by OpenAI researchers that examines how the compute budget affects model performance.
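As a sanity check on the GPT-3 figure, here is a rough sketch using a widely used rule of thumb that is not taken from this article: training compute is approximately 6 × N × D floating point operations, where N is the number of parameters and D the number of training tokens. The 300 billion token count for GPT-3 is also an assumption brought in from outside the article.

```python
# One petaFLOP per second sustained for a day, in floating point operations.
FLOPS_PER_PFS_DAY = 1e15 * 86_400


def training_pfs_days(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in petaFLOP per second days.

    Uses the rule of thumb C ~ 6 * N * D (an assumption, not from the article).
    """
    return 6 * n_params * n_tokens / FLOPS_PER_PFS_DAY


# GPT-3: 175 billion parameters (from the text), 300 billion tokens (assumed).
gpt3_estimate = training_pfs_days(175e9, 300e9)
print(f"Estimated GPT-3 training compute: {gpt3_estimate:.0f} petaFLOP per second days")
```

The estimate lands in the same range as the roughly 3,700 petaFLOP per second days read off the graph, which suggests the rule of thumb is a reasonable first approximation for this kind of budgeting.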

The y-axis shows the test loss, where smaller values indicate better model performance. The x-axis shows the compute budget in petaFLOP per second days. As you’ve just seen, larger compute budgets can be reached by using more compute power, training for longer, or both. Each thin blue line in this figure shows the model loss over a single training run. Looking at the point where the loss starts to decline more slowly in each run reveals a clear relationship between the compute budget and the model’s performance. This relationship can be approximated by a power law, as the pink line shows.

A power law is a mathematical relationship between two variables in which one is proportional to the other raised to some power. On a graph where both axes are logarithmic, a power-law relationship appears as a straight line. The relationship here holds as long as model size and training dataset size don’t limit the training process. Taken at face value, this would suggest that you can simply increase your compute budget to get better model performance. In practice, however, the compute resources available for training are usually a hard constraint, set by factors such as the hardware you have access to, the time available for training, and the project’s financial budget.
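To see why a power law shows up as a straight line on log-log axes, take logarithms of y = a · x^k to get log y = log a + k · log x, which is linear in log x. The sketch below fits a line in log space to recover a and k. The data is synthetic, and the constants 5.0 and -0.05 are made up for illustration; they are not values from the OpenAI paper.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**k by ordinary least squares on (log x, log y).

    Returns the pair (a, k).
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    k = sxy / sxx                        # slope of the line in log space
    a = math.exp(mean_y - k * mean_x)    # intercept, mapped back from log space
    return a, k


# Synthetic "loss versus compute" points that follow an exact power law.
compute = [10.0 ** e for e in range(-3, 4)]
loss = [5.0 * c ** -0.05 for c in compute]

a_fit, k_fit = fit_power_law(compute, loss)
print(f"a = {a_fit:.3f}, k = {k_fit:.3f}")
```

Real training curves are noisy and only follow the power law over a limited range, so a fit like this is a summary of the trend rather than an exact description.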

If you hold your compute budget fixed, the two levers you have to improve your model’s performance are the size of the training dataset and the number of parameters in your model. The OpenAI researchers found that each of these two quantities also shows a power-law relationship with the test loss when the other two variables are held constant. This figure, taken from the same paper, examines the effect of training dataset size on model performance. Here, the compute budget and model size are held constant while the training dataset size varies. The graph shows that the model’s performance keeps improving as the amount of training data grows.

In the second graph, the compute budget and the training dataset size are held constant, and models with varying numbers of parameters are trained. As the model grows in size, the test loss decreases, indicating better performance.

3 The Chinchilla paper

At this point you might be wondering what the ideal balance between these three quantities is. It turns out that a lot of people are interested in this question. Both research and industry groups have published a good deal of empirical data on pre-training compute-optimal models.

In a paper published in 2022, a group of researchers led by Jordan Hoffmann, Sebastian Borgeaud, and Arthur Mensch carried out a detailed study of the performance of language models of various sizes trained on varying amounts of data.

The goal was to find the optimal number of parameters and volume of training data for a given compute budget. The authors named the resulting compute-optimal model Chinchilla, and the paper is often referred to as the Chinchilla paper. Let’s look at some of its findings. The Chinchilla paper suggests that many of the 100 billion parameter large language models, such as GPT-3, may actually be over-parameterized, meaning they have more parameters than they need to achieve a good understanding of language, and undertrained, meaning they would benefit from seeing more training data. The authors hypothesised that smaller models may be able to match the performance of much larger ones if they are trained on larger datasets.

This table lists a selection of models along with their size and information about the dataset they were trained on. According to the Chinchilla paper, the optimal training dataset size for a given model, measured in tokens, is about 20 times the number of parameters in the model. Chinchilla itself was trained in this compute-optimal way. For a 70 billion parameter model, the ideal training dataset contains 1.4 trillion tokens, or 20 times the number of parameters. The last three models in the table were trained on datasets smaller than the Chinchilla-optimal size.
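The 20-tokens-per-parameter rule is easy to turn into a small calculator. The sketch below is only illustrative: the function names are made up, and the second function additionally assumes the common approximation that training compute is about 6 × N × D floating point operations (N parameters, D tokens), which does not come from this article. Combining that with D = 20 × N gives N = sqrt(C / 120) for a compute budget C.

```python
import math

TOKENS_PER_PARAM = 20  # approximate ratio reported in the text


def optimal_tokens(n_params: float) -> float:
    """Chinchilla-style training set size, in tokens, for a given model size."""
    return TOKENS_PER_PARAM * n_params


def optimal_split(compute_flops: float) -> tuple[float, float]:
    """Split a compute budget (in floating point operations) into (params, tokens).

    Assumes C ~ 6 * N * D (a rule of thumb, not from the article) together
    with D = 20 * N, so N = sqrt(C / (6 * 20)).
    """
    n_params = math.sqrt(compute_flops / (6 * TOKENS_PER_PARAM))
    return n_params, optimal_tokens(n_params)


# A 70 billion parameter model should see about 1.4 trillion tokens.
print(f"{optimal_tokens(70e9):.2e} tokens")

# Going the other way: the budget implied by 70B parameters and 1.4T tokens
# maps back to roughly the same model size and token count.
budget = 6 * 70e9 * 1.4e12
n, d = optimal_split(budget)
print(f"{n:.2e} parameters, {d:.2e} tokens")
```

The ratio of 20 is an approximation, so the outputs are best read as ballpark targets rather than exact prescriptions.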

These models may therefore be undertrained. In contrast, LLaMA was trained on 1.4 trillion tokens, which is close to the Chinchilla-recommended amount. Another important result from the paper is that the compute-optimal Chinchilla model outperforms non-compute-optimal models such as GPT-3 on a wide range of downstream evaluation tasks. Following the Chinchilla paper’s findings, teams have recently begun to develop smaller models that achieve results similar to, or even better than, those of larger models that were trained in a non-optimal way.
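One quick way to judge whether a model is close to the Chinchilla recommendation is to compute its tokens-per-parameter ratio and compare it with 20. The sketch below does this for three models. The Chinchilla figures and the LLaMA token count come from the text; the 65 billion parameter size for LLaMA and the 300 billion token count for GPT-3 are assumptions brought in from outside the article.

```python
CHINCHILLA_RATIO = 20  # approximate tokens per parameter from the text

# name: (parameters, training tokens)
MODELS = {
    "Chinchilla": (70e9, 1.4e12),   # both figures from the text
    "LLaMA": (65e9, 1.4e12),        # 65B parameter size is an assumption
    "GPT-3": (175e9, 300e9),        # 300B token count is an assumption
}


def tokens_per_param(n_params: float, n_tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


for name, (n_params, n_tokens) in MODELS.items():
    ratio = tokens_per_param(n_params, n_tokens)
    verdict = "near or above" if ratio >= 0.9 * CHINCHILLA_RATIO else "well below"
    print(f"{name}: {ratio:.1f} tokens per parameter ({verdict} the ~20 target)")
```

The 0.9 cut-off used for the verdict is arbitrary and only there to make the printout readable; the ratio itself is the useful number.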

As more teams and developers start to optimise their model design, you can expect to see a move away from the bigger-is-always-better trend of the last few years. The last model shown, BloombergGPT, is a particularly interesting one. It was trained in a compute-optimal way following the Chinchilla approach, and so achieves good performance with a size of 50 billion parameters. It’s also an interesting example of a situation where pre-training a model from scratch was necessary to achieve good task performance.

And of course it should go without saying, the entire AI community is extremely grateful to all Chinchillas for their significant inspiration and contributions to AI research!

4 Acknowledgements

I’d like to express my thanks to the wonderful Generative AI with Large Language Models Course by and AWS - which I completed, and acknowledge the use of some images and other materials from the course in this article.