Pre-training Large Language Models for Domain Adaptation

Here we will examine particular use cases where it might make sense to train a large language model from scratch. These use cases are often characterised by domains that use language in highly specialised ways, such as legal or medical text.

Pranath Fernando


July 9, 2023

1 Introduction

Here we will examine particular use cases where it might make sense to train a large language model from scratch. These use cases are often characterised by domains that use language in highly specialised ways, such as legal or medical text.

2 When to Train Your Own Model from Scratch

Typically, when you create your application, you will start from an existing LLM. Doing so lets you build a working prototype much more quickly and saves a lot of time. There is one specific circumstance, however, in which you may need to pretrain your own model from scratch: when your target domain uses vocabulary and grammatical structures that are uncommon in everyday English. In that case, domain adaptation may be necessary to achieve good model performance.

Consider developing an app to assist paralegals and attorneys in summarising legal briefs. Legal writing is full of highly specific terms such as mens rea and res judicata. Because these terms are rarely used outside the legal field, it is unlikely they appeared frequently in the training data of existing pretrained LLMs, so the models may be unable to understand or correctly use them. Another problem is that legal language often uses common words in unusual ways, as with consideration: in a contract, it is not a reference to being kind, but the essential component that makes the deal enforceable.

For similar reasons, using an existing LLM in a medical application may create difficulties. The language of medicine uses many unusual words to describe illnesses and procedures, and these may not show up frequently in training datasets built from book text and web scrapes. Furthermore, certain fields employ language in highly peculiar ways. A line of medical shorthand may appear to be a collection of random characters, but it is actually a notation doctors use when composing prescriptions, for example an instruction meaning "take one pill orally four times a day, after meals and at night". To a medical professional, this writing is perfectly clear.
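One concrete way to see why rare domain vocabulary hurts an existing model is at the tokenizer level: a subword vocabulary learned from everyday English tends to shatter unfamiliar legal or medical terms into many small fragments, giving the model far less signal per term. The sketch below uses a toy, hypothetical vocabulary and a simple greedy longest-match tokenizer purely for illustration; it is not the tokenizer of any real LLM.

```python
# Toy greedy longest-match subword tokenizer. VOCAB is a hypothetical
# stand-in for the subword vocabulary a general-purpose LLM might learn
# from everyday English text.
VOCAB = {
    "take", "one", "pill", "after", "meals", "and", "at", "night",
    "ju", "di", "ca", "ta",  # small fragments a rare legal term falls back to
}

def tokenize(word: str) -> list[str]:
    """Split a word into the longest matching vocabulary pieces, left to
    right, falling back to single characters for unknown spans."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):        # try the longest piece first
            piece = word[i:j]
            if piece in VOCAB or j == i + 1:     # single-char fallback
                pieces.append(piece)
                i = j
                break
    return pieces

# An everyday word survives as one familiar token; the legal term
# "judicata" shatters into several fragments the model rarely sees together.
print(tokenize("meals"))      # one whole piece
print(tokenize("judicata"))   # many small fragments
```

Pretraining (or at least training a tokenizer) on domain text gives terms like these their own whole tokens, which is part of why from-scratch domain models can perform better.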

3 BloombergGPT

The initial pretraining task is where models gain their vocabulary and linguistic understanding, so pretraining your model from scratch will produce better models in highly specialised fields like law, medicine, finance, or science. Let's look at BloombergGPT, first introduced in a 2023 paper by Shijie Wu, Steven Lu, and other Bloomberg researchers: a large language model pretrained for a particular domain, in this case finance. The Bloomberg researchers combined finance data with general-purpose text data to pretrain a model that produces best-in-class results on financial benchmarks.

The model also maintains competitive performance on general-purpose LLM benchmarks. The researchers assembled a training dataset made up of 51% financial data and 49% public, general-purpose data. In their paper, the Bloomberg researchers describe the model architecture in greater detail. They also discuss how they used the Chinchilla scaling laws as a starting point and where they had to make compromises. Two graphs in the paper compare various LLMs, including BloombergGPT, against these scaling laws. The diagonal lines on the left trace the ideal model size in billions of parameters for a range of compute budgets; the lines on the right show the ideal training-dataset size in number of tokens.

The compute budget the Bloomberg team had available for training their new model is shown on each graph as a dashed pink line, and the compute-optimal scaling laws identified in the Chinchilla research are represented by the pink shaded regions. You can see that BloombergGPT generally follows the Chinchilla strategy for its compute budget of 1.3 million GPU hours, or about 230,000,000 petaFLOPs. The model sits just above the pink shaded region, indicating that its number of parameters is almost exactly optimal. However, the actual number of tokens used to pretrain BloombergGPT, 569 billion, is below the value the Chinchilla laws suggest for the available compute budget.
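The Chinchilla reasoning above can be sketched as a back-of-envelope calculation. The sketch uses two common rules of thumb, not figures from the BloombergGPT paper: training compute C is roughly 6 x N x D FLOPs for N parameters and D tokens, and the Chinchilla result of roughly 20 training tokens per parameter. The exact outputs are therefore rough estimates, assuming the ~2.3e23 FLOPs (230,000,000 petaFLOPs) budget quoted above.

```python
# Back-of-envelope Chinchilla-style estimate. Assumes the common
# approximations C ~= 6 * N * D (training FLOPs) and roughly 20 training
# tokens per parameter; both are rules of thumb, not values taken from
# the BloombergGPT paper.

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Return (n_params, n_tokens) that are compute-optimal for the budget."""
    # From C = 6 * N * D and D = r * N, solve N = sqrt(C / (6 * r)).
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# BloombergGPT's quoted budget: 1.3M GPU hours, roughly 2.3e23 FLOPs.
params, tokens = chinchilla_optimal(2.3e23)
print(f"compute-optimal parameters: ~{params / 1e9:.0f}B")
print(f"compute-optimal tokens:     ~{tokens / 1e9:.0f}B")
```

Under these assumptions the optimum comes out at tens of billions of parameters and several hundred billion tokens, which is why the 569 billion tokens actually used falls short of the Chinchilla-suggested amount for this budget.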

The lack of readily available financial-domain data is the cause of this smaller-than-ideal training dataset, demonstrating that real-world constraints may force you to make trade-offs when pretraining your own models.

4 Acknowledgements

I’d like to express my thanks to the wonderful Generative AI with Large Language Models course by DeepLearning.AI and AWS, which I completed, and to acknowledge the use of some images and other materials from the course in this article.