LLM Application Considerations - Part 1

In this post we look at several aspects to consider when deploying a Large Language Model (LLM) into an application, including model optimizations, a Generative AI project lifecycle cheat sheet, and how LLMs can be turned into useful applications using external data sources and services.

Pranath Fernando


July 19, 2023

1 Introduction

In this article we will look at several aspects to consider when deploying a Large Language Model (LLM) into an application. We will look at model optimizations, a Generative AI project lifecycle cheat sheet, and how LLMs can be turned into useful applications using external data sources and services.

2 Model optimizations for deployment

At this stage of creating an LLM-based application, there are a number of crucial questions to ask. The first group of questions concerns how your LLM will function in deployment: How quickly does your model need to produce completions? What compute budget do you have available? And how willing are you to trade off model performance for faster inference or less storage? The second set of questions relates to any additional resources your model might require. Will your model communicate with external data or other applications, and if so, how will you access those resources? The question of how your model will be consumed comes later.

What will the intended application or API interface through which your model is consumed look like? Let’s begin by going over several techniques that can be used to optimise your model before deploying it for inference. While we could devote several lessons to this topic, the purpose of this section is to introduce the most significant optimisation strategies. Large language models present inference challenges in terms of compute and storage requirements, as well as ensuring low latency for consuming applications. These challenges persist whether you deploy locally or to the cloud, and they become significantly worse when you deploy to edge devices.

One of the main approaches to improving application performance is to reduce the size of the LLM. A smaller model loads faster, which lowers inference latency. The difficulty, though, lies in shrinking the model without sacrificing its performance. For generative models, some strategies work better than others, and there are trade-offs between accuracy and performance. This section will teach you three techniques.

Distillation uses a larger model, the teacher model, to train a smaller model, the student model. You then use the smaller model for inference, saving on storage and compute costs.

Post-training quantization, like quantization-aware training, converts a model’s weights to a lower-precision representation, such as 16-bit floating point or 8-bit integer. This reduces your model’s memory footprint. The third strategy, model pruning, eliminates redundant model parameters that don’t contribute significantly to the model’s performance. Let’s go over each of these choices in greater depth.

2.1 Distillation

A method called “model distillation” uses a larger teacher model to train a smaller student model. The student model learns to statistically mimic the teacher model’s behaviour, either just in the final prediction layer or across all of the model’s hidden layers.

You start with your fine-tuned LLM as the teacher model and create a smaller LLM for your student model. The weights of the teacher model are frozen, and it is used to produce completions for your training data. Concurrently, you produce completions for the training data with your student model.

The knowledge distillation between the teacher and student models is accomplished by minimising a loss function known as the distillation loss. To calculate this loss, distillation uses the probability distribution over tokens produced by the teacher model’s softmax layer. Since the teacher model has already been fine-tuned on the training data…

…the probability distribution will likely closely match the ground-truth data and show little variation across tokens. That’s why distillation uses a clever trick: modifying the softmax function’s temperature parameter. As we have seen, the model produces more varied, inventive language at higher temperatures.

As the temperature parameter increases, the probability distribution broadens and its peak weakens. This softer distribution yields a set of tokens that are similar to the ground-truth tokens. In the context of distillation, the predictions made by the student model are frequently referred to as soft predictions and the output of the teacher model as soft labels. In parallel, you train the student model to produce the correct predictions based on your ground-truth training data.

Here, you use the conventional softmax function rather than changing the temperature setting. In distillation, these outputs are referred to as the “hard predictions” and “hard labels”, and the loss between them is the student loss. The weights of the student model are updated through backpropagation using the combined distillation and student losses. The main advantage of distillation approaches is that the smaller student model, rather than the teacher model, is employed for inference in deployment.
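To make this concrete, here is a minimal sketch in plain Python of how the combined loss might be computed for a single token position. The vocabulary size, logits, temperature, and weighting factor `alpha` are all illustrative assumptions, and real implementations typically also scale the soft loss by the square of the temperature:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer,
    broader distribution over tokens."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, true_index,
                      temperature=2.0, alpha=0.5):
    """Combine the soft (teacher-matching) loss and the hard
    (ground-truth) student loss for one token position."""
    # Soft labels (teacher) and soft predictions (student) use the
    # raised temperature
    soft_labels = softmax(teacher_logits, temperature)
    soft_preds = softmax(student_logits, temperature)
    soft_loss = -sum(t * math.log(s)
                     for t, s in zip(soft_labels, soft_preds))
    # Hard predictions use the conventional softmax (temperature = 1)
    hard_preds = softmax(student_logits, temperature=1.0)
    hard_loss = -math.log(hard_preds[true_index])
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy 4-token vocabulary; the teacher is confident in token 2,
# which is also the ground-truth token
teacher_logits = [1.0, 0.5, 4.0, 0.2]
student_logits = [0.8, 0.3, 2.5, 0.1]
loss = distillation_loss(teacher_logits, student_logits, true_index=2)
```

In practice both losses are averaged over all token positions in a batch, and the gradient of this combined loss is what updates the student model’s weights.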

In practice, distillation is less effective for generative decoder models. It usually works better for encoder-only models, such as BERT, which have a lot of representation redundancy. Be aware that with distillation you are training a second, smaller model to use for inference; you are in no way reducing the original LLM’s model size.

2.2 Quantization

Let’s look at the next model optimisation method, which actually makes your LLM smaller. In an earlier article, we learned about quantization, in particular Quantization Aware Training, or QAT. However, after a model has been trained, we can optimise it for deployment using post-training quantization, or PTQ for short. PTQ converts a model’s weights to a lower-precision representation, such as 16-bit floating point or 8-bit integer. Quantization can be applied to the model weights alone, or to both the weights and the activation layers, in order to reduce the model size, memory footprint, and compute resources required for model serving.

In general, quantization methods that include the activations can affect model performance more. Quantization also requires an additional calibration step to statistically capture the dynamic range of the original parameter values. As with other approaches, there are trade-offs: quantization occasionally causes a small percentage drop in model evaluation metrics. However, the cost savings and performance improvements can frequently outweigh that reduction.
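A minimal sketch of symmetric 8-bit weight quantization in plain Python may help illustrate the idea. The calibration step here simply measures the maximum absolute weight value to derive a scale factor; the example weights are invented, and real frameworks use more sophisticated calibration over activations as well:

```python
def quantize_int8(weights):
    """Post-training quantization: map float weights to 8-bit integers.

    Calibration measures the dynamic range of the weights (here, the
    maximum absolute value) and derives a single scale factor.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return [qi * scale for qi in q]

weights = [0.42, -1.73, 0.05, 0.91, -0.33]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
# Rounding means each recovered weight can differ from the original
# by at most half a quantization step
max_error = max(abs(w - r) for w, r in zip(weights, recovered))
```

The memory saving comes from storing each weight as a single byte plus one shared scale factor, rather than as a 32-bit float; the small rounding error is the source of the possible drop in evaluation metrics mentioned above.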

2.3 Pruning

Pruning is the final model optimisation method. At a high level, the objective is to reduce the size of the model for inference by removing weights that don’t contribute significantly to the performance of the model as a whole. These are weights with values very close or equal to zero. Be aware that some pruning techniques require a complete retraining of the model, while others fall under the category of parameter-efficient fine-tuning, such as LoRA. Additionally, there are techniques that focus on post-training pruning.

In theory, this reduces the size of the model and improves performance. In practice, however, if only a tiny portion of the model weights are close to zero, there might not be much of an impact on size and performance.
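The simplest form of this idea, magnitude-based pruning, can be sketched in a few lines. The threshold and example weights below are arbitrary choices for illustration:

```python
def prune_by_magnitude(weights, threshold=0.01):
    """Zero out weights whose magnitude is at or below the threshold.

    Returns the pruned weights and the fraction of weights removed
    (the sparsity introduced by pruning).
    """
    pruned = [0.0 if abs(w) <= threshold else w for w in weights]
    sparsity = sum(1 for w in pruned if w == 0.0) / len(weights)
    return pruned, sparsity

weights = [0.003, -0.8, 0.0, 1.2, -0.004, 0.25]
pruned, sparsity = prune_by_magnitude(weights)
# Three of the six weights fall below the threshold, giving 50% sparsity
```

Note that zeroing weights only translates into real storage and latency savings when the deployment format or hardware can exploit the resulting sparsity, which is part of why the practical impact can be limited.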

All three techniques — Quantization, Distillation, and Pruning — seek to shrink the size of the model while improving inference performance. By making your model deployment-ready, you can make sure that your application runs smoothly and gives users the greatest possible experience.

3 Generative AI Project Lifecycle Cheat Sheet

This cheat sheet gives you some idea of how much time and effort will be needed for each stage of the generative AI project life cycle.

Pre-training a large language model can take an enormous amount of work, as we saw in earlier articles. The decisions you’ll need to make about model architecture, the quantity of training data required, and the expertise necessary make this stage the most complex. But keep in mind that you will often begin your development work with an existing foundation model, so you can most likely skip this stage.

If you work with a foundation model, you’ll typically start by assessing the model’s performance through prompt engineering, which requires less technical expertise and no additional model training.

Next, if your model isn’t performing as you require, you’ll consider prompt tuning and fine-tuning. Depending on your use case, performance requirements, and compute budget, the strategies you test could range from full fine-tuning to parameter-efficient fine-tuning (PEFT) techniques like LoRA or prompt tuning. Some level of technical proficiency is necessary here, but since fine-tuning can be quite effective with a modest training dataset, this stage might be finished in a single day.

Aligning your model using reinforcement learning from human feedback can be done quickly once you have a trained reward model. You’ll probably first find out whether you can use an existing reward model for this task, because training a reward model from scratch can take a long time given the time and effort required to gather human feedback.

Last but not least, the optimisation techniques discussed above usually lie somewhere in the middle in terms of complexity and effort, but they can move along quite fast if the changes to the model don’t significantly affect performance. After completing all of these stages, you should have a great LLM that has been trained, tuned, and is ready for deployment for your particular use case.

4 Using the LLM in applications

Even though the training, fine-tuning, and alignment methods we’ve looked at can all help you create a great model for your application, there are some broader issues with large language models that training alone cannot address. Let’s look at a few examples.

One problem is that a model’s internal knowledge is frozen at the point of pretraining. For instance, a model trained in early 2022 would likely respond with Boris Johnson if you asked it who the British Prime Minister is. To be fair to the model, it’s easy to get out of date with UK Prime Ministers given they seem to change every 3 days. This information is clearly outdated: Johnson left office in September 2022, but the model is unaware of this because it was trained before that time.

Complex maths can be challenging for models as well. If you ask a model to act like a calculator, it might not come up with the correct answer, depending on how challenging the task is. For example, asked to carry out a division, a model may provide a result that is close to the true value but still wrong. Keep in mind that LLMs don’t actually perform mathematical calculations. They are still only predicting the next most likely token based on their training, which makes it easy for them to guess incorrectly. Last but not least, one of the most well-known issues with LLMs is their propensity to produce text even when they are unsure of the answer to a problem.

For example, a model might confidently invent a description of a non-existent plant, the “Martian Dunetree”. This is often referred to as hallucination: the model will happily tell you that there is life on Mars even though there is still no proof of it. In this section you’ll learn about several approaches you can use to help your LLM resolve these problems by connecting it to other data sources and applications. Connecting your LLM to these external components and fully integrating everything for deployment within your application will require a little more effort.

Your application must manage how user input is passed to the large language model and how completions are returned. Usually, some sort of orchestration library is used for this. This layer can enable some powerful technologies that supplement and improve the LLM’s runtime performance, for example by allowing connections to existing APIs of other programmes or providing access to additional data sources. LangChain is one application framework that can greatly help with this, which I have written many posts about previously. Let’s begin by thinking about how to link LLMs to other data sources.

4.1 Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation, also known as RAG, is a framework for creating LLM-powered systems that utilise external data sources and applications to get around some of the drawbacks of these models. RAG helps solve the knowledge-cutoff problem and keeps the model’s understanding of the world up to date. While you could retrain the model on fresh data, doing so would quickly become highly expensive and would necessitate repeated retraining to keep the model current. Giving your model access to additional data at inference time is a more adaptable and affordable technique for getting around knowledge cutoffs. RAG is helpful in any situation where you need to provide the language model access to information that it might not otherwise have.

This might be fresh information or documents that weren’t part of the original training data, or confidential information kept in your company’s private databases. External data can help your model produce completions that are more accurate and relevant. Let’s examine this process in more detail. Retrieval Augmented Generation is a framework for giving LLMs access to data they did not see during training, rather than a specific collection of technologies. There are numerous implementations available, and which one you select will depend on the specifics of your task and the format of the data you must use. Here, we’ll follow the implementation described in a 2020 paper on RAG by researchers at Facebook.

The Retriever component, which consists of a query encoder and an external data source, is the brains behind this system. The encoder takes the user’s input prompt and converts it into a form that can be used to query the data source. In the Facebook paper, the external data is a vector store, which we’ll go into more detail about in a moment, but it might also be a CSV file, SQL database, or another type of data storage format. These two components are trained together to locate the external documents that are most relevant to the input query.

The Retriever returns the best single document, or group of documents, from the data source, and combines this fresh data with the initial user query. The language model is then given the expanded prompt and creates a completion using the data.
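The prompt-augmentation step can be sketched very simply. The template wording, case number, and document text below are all invented for illustration; real systems tune this template to the task:

```python
def augment_prompt(query, retrieved_docs):
    """Combine retrieved passages with the original user query.

    The expanded prompt gives the language model the external
    context it needs to ground its completion.
    """
    context = "\n\n".join(retrieved_docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# Hypothetical case number and retrieved passage
prompt = augment_prompt(
    "Who is the plaintiff in case 1234?",
    ["Case 1234: A. Plaintiff v. B. Defendant, filed 2021..."],
)
```

The resulting string is what actually gets sent to the LLM in place of the user’s raw question.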

Let’s look at a more specific illustration. Imagine you are a lawyer using a large language model to aid in case discovery. You can use a RAG architecture to query a corpus of documents, such as earlier court filings. Suppose you enquire about the plaintiff in a case by giving the model the case number.

The query encoder receives the prompt, encodes it in the same format as the external documents, and then looks for a relevant entry in the document corpus. After locating a passage of text that includes the needed information, the Retriever combines the new text with the original query. The expanded prompt, which now includes details on the particular case of interest, is then given to the LLM.

The model uses the data in the context of the prompt to create a completion that provides the right response. The use case you have seen so far is very straightforward and only provides access to one piece of information that is readily available elsewhere. But consider RAG’s capability to produce summaries of filings or to locate particular individuals, places, or businesses within the entire corpus of legal records. Giving the model access to the data in this external data source considerably increases its usefulness for this particular use case.

In addition to overcoming knowledge cutoffs, RAG helps you avoid the problem of the model hallucinating when it doesn’t know the answer. RAG designs can integrate several different kinds of external information sources. Access to local documents, such as private wikis and expert systems, can be used to supplement large language models. RAG can also make it possible to access the Internet and retrieve data from websites like Wikipedia. By encoding the user input prompt as a SQL query, RAG may communicate with databases. Another significant data storage method is a vector store, which contains vector representations of text.

Since language models internally generate text using vector representations of language, this data format is especially helpful for them. Vector stores enable a quick and effective search based on similarity. It should be noted that implementing RAG entails a bit more work than merely feeding text into the large language model. Starting with the size of the context window, there are a few important factors to be mindful of. The majority of text sources are too lengthy to fit into the model’s constrained context window, which is still only a few thousand tokens at most. Instead, the external data sources are divided into numerous chunks that can each fit within the context window. You can let packages like LangChain take care of this for you.
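A minimal sketch of this chunking step is shown below. It splits on character counts for simplicity (real implementations usually count tokens), and the chunk size and overlap are arbitrary illustrative values:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split a long document into overlapping chunks that each fit
    within the model's context window.

    Overlapping the chunks reduces the chance that a relevant fact
    is cut in half at a chunk boundary.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - overlap
    return chunks

# A 450-character stand-in document
doc = "".join(str(i % 10) for i in range(450))
chunks = chunk_text(doc)
```

Each of these chunks is then embedded and indexed individually, which is what makes the similarity search in the next step possible.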

Second, the data must be accessible in a way that makes it simple to find the most relevant content. Remember that large language models don’t work directly with text; instead, they produce vector representations of each token in an embedding space. These embedding vectors enable the LLM to find words that are semantically related to one another, using metrics like cosine similarity, which you learned about previously. RAG techniques process the small chunks of external data through a large language model to create an embedding vector for each. These new data representations can be stored in vector store structures, which facilitate quick dataset searches and accurate text identification based on semantic similarity.
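The core similarity search can be sketched in a few lines of plain Python. The three-dimensional vectors below are toy stand-ins for the embeddings a real embedding model would produce, and the store format is an assumption for illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical
    direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, store, top_k=1):
    """Return the text of the top_k chunks whose embeddings are most
    similar to the query embedding."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item["embedding"]),
        reverse=True,
    )
    return [item["text"] for item in ranked[:top_k]]

# Toy vector store: each chunk of text paired with its embedding
store = [
    {"text": "Filing deadline is 30 days.", "embedding": [0.9, 0.1, 0.0]},
    {"text": "The plaintiff is named in section 1.", "embedding": [0.1, 0.9, 0.2]},
]
results = retrieve([0.2, 0.8, 0.1], store)
```

Real vector stores replace the linear scan in `retrieve` with approximate nearest-neighbour indexes so that the search stays fast over millions of chunks.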

A vector database is an implementation of a vector store where each vector is also given a key. This can allow, for instance, the text produced by RAG to also contain a citation to the source document. You’ve now seen how a model can overcome its internal knowledge constraints with access to external data sources. By giving users current, relevant information and avoiding hallucinations, you can significantly enhance their experience with your application.

5 Interacting with external applications

You saw how LLMs can interact with external datasets in the previous section. Let’s now examine how they can communicate with external applications. To do this, we’ll walk through one customer’s interaction with ShopBot and look at the connections required to enable the app to fully handle a return request. In this conversation, the customer has stated that they wish to return some jeans they purchased. ShopBot prompts them for the order number, which the customer provides.

ShopBot then searches for the order number in the transaction database. One way it might accomplish this is with a RAG implementation similar to the one you saw earlier in this article. In this instance, instead of retrieving data from a corpus of documents, you would probably be retrieving data through a SQL query to a back-end order database. Once ShopBot has retrieved the customer’s order, the next step is to confirm the items that will be returned. The bot asks the customer whether they would like to return anything other than the jeans.

After the user responds, the bot sends a request for a return label to the company’s shipping partner, using the shipper’s Python API. ShopBot will send the shipping label to the customer by email, so it also asks them to confirm their email address. When the customer responds, the bot includes their email address in the API request to the shipper. Once the API request is completed, the customer receives an email confirming the label is on its way, and the conversation ends. This small example demonstrates just one conceivable set of interactions that an LLM needs to handle to be effective.

In general, linking LLMs to external applications and enabling the model to communicate with the outside world extends their utility beyond language tasks. As the ShopBot example demonstrated, LLMs can be used to trigger actions when given the ability to interact with APIs. LLMs can also connect to other programming resources; for example, connecting a model to a Python interpreter can enable it to incorporate accurate calculations into its outputs. It’s crucial to keep in mind that prompts and completions are at the core of these workflows. The LLM, which serves as the application’s reasoning engine, chooses the actions that the app will perform in response to user requests.

To trigger actions, the completions produced by the LLM must contain certain crucial information. First, the model must be able to generate a set of instructions so that the application knows what steps to take. These instructions must be clear and correspond to permitted actions. In the ShopBot example, for instance, the crucial steps included validating the user’s email address, requesting a shipping label, and emailing the label to the user. Second, the completion must be formatted so that the larger programme can understand it. This could be as straightforward as a particular sentence structure or as complicated as a Python script or a SQL statement.

For example, the completion might contain a SQL query that checks whether an order is present in the database of all orders. Finally, the model may need to gather information that enables an action to be validated. In the ShopBot conversation, for instance, the application had to confirm the email address the customer used to place the original order. Any information required for validation must be collected from the user and included in the completion so it can be passed through to the application. For each of these tasks, structuring the prompts correctly is crucial, since it can have a significant impact on the quality of the plan that is generated and the adherence to a specified output format.
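As a hedged sketch of that order-lookup step, the snippet below uses an in-memory SQLite database as a stand-in for the back-end order database; the table and column names are assumptions for illustration only:

```python
import sqlite3

# In-memory stand-in for the back-end order database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO orders VALUES ('21104', 'customer@example.com')")

def order_exists(conn, order_id):
    """Run the kind of SQL check the LLM's completion might specify.

    Using a parameterized query keeps user-supplied order numbers
    from being interpreted as SQL.
    """
    row = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE order_id = ?", (order_id,)
    ).fetchone()
    return row[0] > 0

found = order_exists(conn, "21104")    # order present
missing = order_exists(conn, "99999")  # order absent
```

In a deployed application, the orchestration layer would run this check on the LLM’s behalf and feed the result back into the conversation.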

6 Acknowledgements

I’d like to express my thanks to the wonderful Generative AI with Large Language Models Course by DeepLearning.ai and AWS, which I completed, and acknowledge the use of some images and other materials from the course in this article.