Generative Configuration for Large Language Models

In this article we will take a high level non-technical view of what generative configuration options for Large language models allow you to do

Pranath Fernando


July 3, 2023

1 Introduction

Recent advances in AI such as ChatGPT have demonstrated impressive abilities for performing a wide range of tasks previously only done by humans, these are known as Large Language Models (LLM’s). Once you have chosen a model, you often have some configuration options that let you control the outputs. In this article we will take a high level non-technical view of what these generative configuration options allow you to do.

2 Common LLM Generative Configuration Options

You might have seen controls to modify how the LLM acts if you’ve used LLMs in playgrounds like the Hugging Face website or an AWS. Every model makes a set of configuration options available, and these settings might affect the model’s output during inference. Keep in mind that these are not the training parameters that are discovered during training. Instead, you may regulate things like the maximum number of tokens in the completion and the level of creativity in the output by using these configuration parameters, which are invoked at the time of inference.

The parameter max new tokens is arguably the simplest one of these, and you can use it to restrict how many tokens the model will produce. This can be compared to setting a limit on how many times the model will go through the selection procedure. Here are some instances when the maximum number of new tokens is set to 100, 150, or 200. However, take note of how much shorter the completion is in the example for 200. This is due to the fact that another stop condition, such as the model prediction or end of sequence token, was met. Keep in mind that there is a cap on fresh token generation, not a fixed number.

A probability distribution for the full model’s word dictionary is the output of the softmax layer of the transformer model behind LLMs. Here, you can view a list of terms with their probability score next to them. Even though we are only displaying four words at a time, consider that this is a list that extends throughout the entire dictionary. The majority of complex language models will by default use so-called greedy decoding. The model will always select the word with the highest probability in this, the most straightforward type of next-word prediction. Short text generation can benefit greatly from this strategy, but it is vulnerable to repeated words or word sequences.

You can use other controls to produce writing that sounds more natural, is more inventive, and doesn’t repeat words. The simplest technique to add some diversity is by random sampling. The model chooses an output word at random using the probability distribution to weight the selection rather than always choosing the most likely word as is done with random sampling. The likelihood score for the word “banana” in the illustration, for instance, is 0.02. This translates to a 2% probability that this word will be chosen via random sampling. By employing this sampling strategy, we lessen the possibility of word repetition.

However, depending on the environment, it’s possible that the output will be too creative, leading to words that lead the generation to go off into irrelevant themes or words that are just unintelligible. Be aware that in some models, you might need to explicitly turn off greedy and turn on random sampling. For instance, setting sample to equal true is necessary for the lab’s Hugging Face transformers implementation.

In order to reduce the random sampling and raise the likelihood that the result would be better, let’s investigate top k and top p strategies. We can utilise top p and top k settings, two sampling approaches, to assist restrict random sampling and raise the likelihood that the output will be good. You can enter a top k number that tells the model to select only the k tokens with the highest probability in order to restrict the options while still allowing for some flexibility. Since k in this example is set to 3, the model can only select one of these alternatives. The model then chooses the next word using the probability weighting from among these choices, and in this instance, it selects doughnut.

By using this technique, it is possible to provide the model some unpredictability while avoiding the selection of extremely unlikely completion words. Your text production is hence more likely to sound fair and make sense. As an alternative, you can restrict the random sampling to forecasts whose total probability do not exceed p by using the top p parameter. For instance, cake and donut are the options if you set p to be equal to 0.3 because their probability of 0.2 and 0.1 sum up to 0.3. The model then selects one of these tokens using the random probability weighting technique. With top k, you specify the number of tokens to randomly choose from, and with top p, you specify the total probability that you want the model to choose from.

The temperature parameter is another one that you can use to regulate the randomness of the model output. The probability distribution that the model determines for the following token takes into account this parameter. In general, unpredictability increases with temperature and decreases with temperature. The shape of the probability distribution for the next token is affected by the temperature value, a scaling factor that is applied within the model’s final softmax layer.

Changing the temperature really affects the predictions that the model will produce, unlike changing the top k and top p parameters. If you select a low temperature, let’s say less than one, the probability distribution that results from the softmax layer is more steeply peaked and the probability is concentrated in fewer words. This is demonstrated in the blue bars next to the table, which display a probability bar chart that has been turned on its side. The word cake is where the majority of the likelihood lies in this sentence.

When the model uses random sampling to choose from this distribution, the final text will be less random and more closely adhere to the most probable word sequences that the network discovered during training. Instead, if you increase the temperature to a higher number—let’s say larger than one—the model will determine a flatter, broader probability distribution for the subsequent token. You’ll notice that the likelihood is distributed more evenly throughout the tokens than it is across the blue bars. In comparison to a cool temperature setting, this causes the model to produce text with a higher level of randomness and variability.

This can aid in producing text that sounds more imaginative. The softmax function will be utilised by default and the unchanged probability distribution will be used if you leave the temperature value at 1.

3 Acknowledgements

I’d like to express my thanks to the wonderful Generative AI with Large Language Models Course by and AWS - which i completed, and acknowledge the use of some images and other materials from the course in this article.