## 1 Introduction

Running out of memory is one of the most common problems you will encounter when training large language models (LLMs). Compute Unified Device Architecture, or CUDA, is a collection of tools and libraries developed for Nvidia GPUs. Libraries such as PyTorch and TensorFlow take advantage of CUDA to speed up deep learning operations like matrix multiplication. You’ll run into these memory problems because most LLMs are large and need a lot of memory to store and train all of their parameters.

In this article we look at strategies used to help train these models more efficiently.

## 2 Estimating the Computational Costs

Let’s do some quick maths to get a sense of the scale of the problem. A single parameter is typically represented by a 32-bit float, which is one way computers represent real numbers. You’ll see more detail about how numbers are stored in this format shortly. A 32-bit float takes up four bytes of memory. So to store one billion parameters you need four bytes times one billion parameters, or four gigabytes of GPU RAM at 32-bit full precision. This is a lot of memory, and so far it only covers the model weights. If you want to train the model, you must also plan for additional components that use GPU RAM during training.

These include two Adam optimizer states, gradients, activations, and the temporary variables your functions need. This can easily add up to 20 extra bytes of memory per model parameter. In fact, to account for all of this overhead during training, you’ll need about 20 times as much GPU RAM as the model weights alone take up. To train a one billion parameter model at 32-bit full precision, you’ll need roughly 80 gigabytes of GPU RAM. This is clearly too large for consumer hardware, and it is challenging even for data centre hardware if you want to train on a single GPU.
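The arithmetic above can be sketched in a few lines of Python. This is a rough estimate only: the 20x training multiplier is the rule of thumb quoted in this article, a gigabyte is taken as 10^9 bytes, and the function names are illustrative rather than part of any library.

```python
BYTES_PER_FP32 = 4          # a 32-bit float occupies 4 bytes
TRAINING_MULTIPLIER = 20    # assumption: the article's rule of thumb for training overhead
GB = 10**9                  # decimal gigabyte, matching the article's arithmetic


def weights_gb(num_params, bytes_per_param=BYTES_PER_FP32):
    """Memory needed just to hold the model weights, in GB."""
    return num_params * bytes_per_param / GB


def training_gb(num_params, bytes_per_param=BYTES_PER_FP32,
                multiplier=TRAINING_MULTIPLIER):
    """Rough training footprint: weights scaled by the overhead multiplier."""
    return weights_gb(num_params, bytes_per_param) * multiplier


one_billion = 1_000_000_000
print(weights_gb(one_billion))   # 4.0
print(training_gb(one_billion))  # 80.0
```

Real training runs vary with batch size, sequence length, and optimizer choice, so treat the result as an order-of-magnitude guide.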

## 3 Quantization

A single Nvidia A100 GPU, a common processor for machine learning tasks in the cloud, has a memory capacity of 80 gigabytes. What options do you have to reduce the memory required for training? One technique is quantization. The main idea is that you reduce the memory needed to store your model’s weights by lowering their precision from 32-bit floating point numbers to 16-bit floating point numbers, or 8-bit integers. The corresponding data types used in deep learning frameworks and libraries are FP32 for 32-bit full precision, FP16 or BFLOAT16 for 16-bit half precision, and INT8 for 8-bit integers.

The range of magnitudes that FP32 can represent runs from approximately 3 × 10^-38 to 3 × 10^38. By default, model weights, activations, and other model variables are stored in FP32. Quantization statistically projects the original 32-bit floating point numbers into a lower precision space, using scaling factors calculated from the range of the original values. Let’s look at an example. Suppose you want to store Pi to six decimal places at different precisions. Floating point numbers are stored as a sequence of bits, zeros and ones. The 32 bits used to store a number in full precision FP32 start with one bit for the sign, where zero indicates a positive number and one indicates a negative number.

Next come eight bits for the exponent of the number and 23 bits for its fraction. The fraction is also called the mantissa or significand. It holds the precision bits of the number. If you convert the 32-bit floating point value back to a decimal value, you’ll notice a slight loss in precision. For reference, the first 19 decimal places of Pi are 3.1415926535897932384. Now let’s see what happens when you project this FP32 representation of Pi into the FP16, 16-bit lower precision space. As with FP32, one of the 16 bits is used for the sign, but FP16 uses only five bits for the exponent and ten bits for the fraction.
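To make the FP32 layout concrete, here is a small sketch using only Python’s standard `struct` module. It assumes the six-decimal value of Pi is written as 3.141592; the helper names `fp32_fields` and `fp32_round_trip` are hypothetical and exist only for this illustration.

```python
import struct


def fp32_fields(value):
    """Return the (sign, exponent, fraction) bit strings of value stored as FP32."""
    # Pack as a big-endian 32-bit float, then reinterpret the bytes as an integer.
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    bits = f"{raw:032b}"
    return bits[0], bits[1:9], bits[9:]   # 1 + 8 + 23 bits


def fp32_round_trip(value):
    """Store value as FP32 and read it back as a Python float."""
    return struct.unpack(">f", struct.pack(">f", value))[0]


sign, exponent, fraction = fp32_fields(3.141592)
print(sign, exponent, fraction)   # 0 10000000 10010010000111111011000
print(f"{fp32_round_trip(3.141592):.19f}")   # 3.1415920257568359375
```

The round-tripped value differs from 3.141592 after the seventh decimal place, which is the slight loss of precision mentioned above.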

As a result, the range of numbers you can represent with FP16 is much smaller, from -65,504 to +65,504. The original FP32 value is projected to 3.140625 in the 16-bit space. Notice that you lose some precision with this projection: there are now only six places after the decimal point. You’ll find that this loss in precision is acceptable in most cases because you’re trying to optimise for memory footprint. Storing a value in FP32 requires four bytes of memory. In contrast, storing a value in FP16 requires only two bytes, so quantization has halved the memory requirement.
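The FP16 projection can be reproduced with Python’s standard `struct` module, whose `e` format code is the IEEE 754 half-precision type. This is a minimal sketch, again assuming Pi is written as 3.141592; `to_fp16` is a hypothetical helper name.

```python
import struct


def to_fp16(value):
    """Round value to the nearest FP16 number and return it as a Python float."""
    return struct.unpack(">e", struct.pack(">e", value))[0]


print(to_fp16(3.141592))    # 3.140625
print(to_fp16(65504.0))     # 65504.0, the largest finite FP16 value

# Storage per value: four bytes for FP32, two bytes for FP16.
print(struct.calcsize(">f"), struct.calcsize(">e"))   # 4 2
```

Values far outside the FP16 range cannot be packed this way at all, which is the practical consequence of the five-bit exponent.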

The AI research community has explored ways to improve on 16-bit quantization. One data type in particular, BFLOAT16, has recently become a popular alternative to FP16. BFLOAT16, short for Brain Floating Point Format and also written BF16, was developed at Google Brain and is now widely used in deep learning. Many LLMs, including FLAN-T5, have been pre-trained with BFLOAT16. BFLOAT16 is a hybrid between half precision FP16 and full precision FP32. Newer GPUs such as NVIDIA’s A100 support BF16, which significantly helps training stability. BFLOAT16 is often described as a truncated 32-bit float because it captures the full dynamic range of a 32-bit float while using only 16 bits. It keeps all eight bits for the exponent but truncates the fraction to just seven bits.
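The "truncated 32-bit float" description can be illustrated by keeping only the top 16 bits of an FP32 bit pattern. This is a simplified sketch: it truncates rather than rounds, whereas real hardware and frameworks typically round to nearest, and `to_bf16_truncate` is a hypothetical helper, not a library function.

```python
import struct


def to_bf16_truncate(value):
    """Keep the top 16 bits of the FP32 pattern: sign, 8 exponent bits, 7 fraction bits."""
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    truncated = raw & 0xFFFF0000          # zero out the low 16 fraction bits
    return struct.unpack(">f", struct.pack(">I", truncated))[0]


print(to_bf16_truncate(3.141592))   # 3.140625
print(to_bf16_truncate(1e38))       # still finite and close to 1e38
```

The second call shows the benefit of the eight-bit exponent: a value near 10^38 survives, while it is far beyond the FP16 maximum of 65,504.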

This not only saves memory but also improves model performance by speeding up calculations. The downside is that BF16 is not well suited to integer calculations, but these are relatively rare in deep learning. For completeness, let’s look at what happens if you quantize Pi from the 32-bit space into the even lower precision 8-bit space. If you use one bit for the sign, INT8 values are represented by the remaining seven bits. This gives you a range of integers from -128 to 127, and unsurprisingly Pi gets projected to 2 or 3 in the 8-bit lower precision space.

This brings the memory requirement down from the original four bytes to just one byte, but obviously results in a dramatic loss of precision. Let’s summarise what you’ve learnt and highlight the key points to take away from this discussion. Quantization reduces the memory required to store and train models by reducing the precision of the model weights. It statistically projects the original 32-bit floating point numbers into lower precision spaces, using scaling factors calculated from the range of the original 32-bit floats.
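One simple way to realise "scaling factors calculated from the range of the original floats" is absolute-maximum (absmax) quantization. The sketch below is an assumption about the scheme, since this article does not name a specific method; it maps values onto the symmetric integer range -127 to 127, and the weights shown are made up for illustration.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one absmax scaling factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0   # avoid dividing by zero
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [3.141592, -1.5, 0.25, 6.0]   # hypothetical weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

for original, stored, approx in zip(weights, q, restored):
    print(f"{original:>9.6f} -> {stored:>4d} -> {approx:.6f}")
```

Each restored value is within half a scaling step of the original, so a wider range of weights means a coarser grid and a larger error per weight.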

Modern deep learning frameworks and libraries support quantization-aware training, which learns the quantization scaling factors during the training process. The details of this process are beyond the scope of this article. The key point is that you can use quantization to reduce the memory footprint of the model during training. BFLOAT16 has become a popular choice of precision in deep learning because it maintains the dynamic range of FP32 while halving the memory footprint. Many LLMs, including FLAN-T5, have been pre-trained with BFLOAT16.

## 4 Using Quantization to reduce memory use

Let’s now return to the challenge of fitting models into GPU memory and look at the impact quantization can have. By applying quantization, you can reduce the memory needed to store the model parameters from four gigabytes to just two gigabytes using 16-bit half precision, a saving of 50%. You can reduce the memory footprint by a further 50% by representing the model parameters as 8-bit integers, which requires only one gigabyte of GPU RAM. Note that in all of these cases you still have a model with one billion parameters. Quantization changes how precisely each parameter is stored, not how many parameters the model has.

Quantization gives you similar savings when it comes to training. However, as you saw earlier, you quickly hit the 80 GB limit of a single NVIDIA A100 GPU. To train a one billion parameter model on a single GPU, you should consider 16-bit or 8-bit quantization rather than 32-bit full precision. Also keep in mind that many models now have more than 50 billion or even 100 billion parameters. That means you would need up to 500 times more memory to train them, which is tens of thousands of gigabytes. These enormous models dwarf the one billion parameter model we’ve been considering.
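The storage and training figures for each precision can be tabulated with a short script. It rests on two assumptions taken from this article rather than measured: training needs about 20 times the memory of the weights alone, and that multiplier applies equally at lower precision. The model sizes in the loop are illustrative.

```python
GB = 10**9
BYTES_PER_PARAM = {"FP32": 4, "FP16/BF16": 2, "INT8": 1}
TRAINING_MULTIPLIER = 20   # assumption: the article's rule of thumb
A100_GB = 80               # single-GPU memory figure used in this article


def memory_gb(num_params, bytes_per_param, training=False):
    """Estimated memory in GB to store, or roughly to train, a model."""
    gb = num_params * bytes_per_param / GB
    return gb * TRAINING_MULTIPLIER if training else gb


for billions in (1, 50, 100, 500):
    params = billions * 10**9
    for dtype, nbytes in BYTES_PER_PARAM.items():
        store = memory_gb(params, nbytes)
        train = memory_gb(params, nbytes, training=True)
        fits = "fits" if train <= A100_GB else "exceeds"
        print(f"{billions:>4}B {dtype:<10} store {store:>7.0f} GB  "
              f"train {train:>8.0f} GB  ({fits} one 80 GB GPU)")
```

Under these assumptions a one billion parameter model needs about 80, 40, or 20 GB to train, while a 500 billion parameter model at full precision needs around 40,000 GB.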

As models scale beyond a few billion parameters, it becomes impossible to train them on a single GPU. Instead, you need to turn to distributed computing techniques and train your model across multiple GPUs. This can require access to hundreds of GPUs, which is very expensive, and it is another reason why most of the time you won’t pre-train your own model from scratch.

However, an additional training process called fine-tuning also requires storing all training parameters in memory. Since it’s very likely you’ll want to fine-tune a model at some point, these are important considerations to bear in mind.

## 5 Acknowledgements

I’d like to express my thanks to the wonderful Generative AI with Large Language Models course by DeepLearning.ai and AWS, which I completed, and to acknowledge the use of some images and other materials from the course in this article.