An overview Language Models, the Chat format and Tokens

Here we give a brief overview of how LLM’s work, how they are trained, what is a tokeniser and how a choice of different tokenisers can effect the output of the LLM. We will also look at what the ‘chat format’ for LLM’s is all about

Pranath Fernando


June 18, 2023

1 Introduction

Large language models such as ChatGPT can generate text responses based on a given prompt or input. Writing prompts allow users to guide the language model’s output by providing a specific context or topic for the response. This feature has many practical applications, such as generating creative writing prompts, assisting in content creation, and even aiding in customer service chatbots.

In earlier articles i’ve looked at how you can use ChatGPT to solve some of these tasks with simple prompts. But in many use cases, what is required is not just one prompt but a sequence of prompts where we need to also consider the outputs at each stage, before providing a final output - for example with a customer service chatbot.

In this article, we will give a brief overview of how LLM’s work. We will look at how they are trained, as well as other details like what is a tokeniser and how a choice of different tokenisers can effect the output of the LLM. We will also look at what the ‘chat format’ for LLM’s is all about, distinguishing what are system vs user messages are as well as understanding the different things they do.

2 Setup

First we need to load certain python libs and connect the OpenAi api.

The OpenAi api library needs to be configured with an account’s secret key, which is available on the website.

You can either set it as the OPENAI_API_KEY environment variable before using the library: !export OPENAI_API_KEY='sk-...'

Or, set openai.api_key to its value:

import openai
openai.api_key = "sk-..."
import os
import openai
import tiktoken
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']
# Define helper function
def get_completion(prompt, model="gpt-3.5-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
    return response.choices[0].message["content"]

3 Different ways of using a Language Model

3.1 How does an LLM get created?

How does a large language model get created? You might be familiar with the text generation process where you can provide a prompt, such as “I love eating,” and ask an LLM to fill in what the items are likely completions given this prompt. Additionally, it might mention “Bagels with cream cheese, my mom’s meatloaf or going out with friends.” But how did the model acquire this skill?

Supervised learning is the main training method for an LLM. A computer learns an input-output or X or Y mapping using labelled training data in supervised learning. For instance, if you’re using supervised learning to figure out how to categorise the sentiment of restaurant reviews, you might gather a training set like this, where a review like “The pastrami sandwich is great!” is labelled as a positive sentiment review, and so on so “The earl grey tea was fantastic” has a positive label and “Service was slow, the food was so-so.”

In order to perform supervised learning, labelled data must first be obtained before an AI model can be trained using the data. After training, you may deploy the model, give it a call, and tell it what restaurant has the best pizza you’ve ever had. So the fundamental building component for training large language models is supervised learning. In particular, a Large Language Model can be created by repeatedly predicting the next word using supervised learning.

3.2 Types of LLM

So today there are broadly two major types of Large Language Models. The first is a “Base LLM” and the second, which is what is increasingly used, is the “Instruction Tuned LLM”.

Using text training data, the base LLM continuously predicts the subsequent word. In other words, if I give it the prompt, “Once upon a time there was a unicorn,” it might, by repeatedly predicting one word at a time, produce a completion that describes a unicorn living in a lovely woodland with all of her unicorn pals. An issue with this is that if you were to ask it, “What is the capital of France?” it’s entirely conceivable that a list of quiz questions about France would appear online. Therefore, it may finish this by asking questions like “What is the largest city in France? How many people lives there? In contrast, you presumably want to know what the capital of France is rather than having all these questions listed.

A LLM that is tuned to instructions will therefore attempt to do so and, perhaps, respond with the statement, “The capital of France is Paris.” How does a Base LLM become an Instruction Tuned LLM?

To make an Instruction Tuned LLM, a Base LLM must first be trained on a large amount of data, or perhaps even more than a hundred billion words. On a big supercomputing system, this operation could take months. After training the Base LLM, you would refine the model on a smaller collection of cases in which the output complies with an input directive. In order to write a number of examples of instructions and then a solid answer to instructions, for instance, you might hire contractors. So a training set for performing this additional fine-tuning is created. Therefore, if it is trying to follow instructions, it learns to anticipate what word will come next.

The next step is to gather human evaluations of the output quality of numerous distinct LLMs on a variety of criteria, such as whether the output is beneficial, truthful, and safe, in order to improve the quality of the LLM’s output. To enhance the likelihood of the LLM producing outputs with higher ratings, you can then further fine-tune it. RLHF, or Reinforcement Learning from Human Feedback, is the method used most frequently to do this. And whereas training the Base LLM can take months, the process of moving from the Base LLM to the Instruction Tuned LLM can be completed in a matter of days using significantly smaller data sets and computational resources.

4 Prompt the model and get a completion

Let’s try using the LLM for one of the basic tasks we discussed which is to complete a bit of text.

response = get_completion("What is the capital of France?")
The capital of France is Paris.

5 Tokens

Let’s try something different. If you were to tell the LLM to reverse the letters in the word “lollipop,” it would appear like a simple assignment that even a four-year-old could complete. However, if you ask ChatGPT to do this, it really produces something that is a little jumbled. This isn’t Lollipop, and the letters aren’t backwards either. So why can’t ChatGPT do what looks to be a very straightforward task?

response = get_completion("Take the letters in lollipop \
and reverse them")

“lollipop” in reverse should be “popillol”

response = get_completion("""Take the letters in \
l-o-l-l-i-p-o-p and reverse them""")

It turns out that there is one more crucial aspect of how a large language model functions, namely that it repeatedly predicts the next token rather than the next word.

In fact, an LLM groups characters together to create tokens, which are collections of frequently occurring character sequences, by taking a series of characters, such as “Learning new things is fun!” . Learning new things is enjoyable in this situation since each token stands for one word, one word with a space between them, or an exclamation point.

The word prompting is still not that widespread in the English language, although it is undoubtedly rising in popularity. However, if you were to input other rather less commonly used words, such as “Prompting as powerful developer tool.” Because those three letter combinations are frequently found, prompting is actually divided into three tokens: “prom”, “pt”, and “ing”. Additionally, if you were to give it the word “lollipop,” the tokenizer actually splits it into the letters “l,” “oll,” and “ipop.” Furthermore, since ChatGPT only sees these three tokens rather than the individual characters, it has a harder time printing out these letters in the proper reverse order.

So, here’s a hack you can employ to resolve this.

It actually does a much better job, this L-O-L-L-I-P-O-P, if I were to add dashes to the word dashes, between these letters; spaces would also work, or other things would work; and tell it to take the letters and lollipop and reverse them. Because it tokenizes each of these characters into an own token when you provide it a lollipops with dashes between the letters, it can more easily discern the individual letters and print them out in reverse order. Therefore, this clever tip makes it easier for ChatGPT to distinguish between the different letters in words if you ever wish to use it to play a word game like Word or Scrabble.

One token often corresponds to four letters, or roughly three-quarters of a word, in the English language. As a result, the maximum amount of input + output tokens that a given large language model can handle will frequently vary. The result is frequently referred to as completion, while the input is frequently referred to as the context. And the most widely used conversation GPT model, the GPT 3.5 Turbo, has a cap of about 4,000 tokens in the input plus output. As a result, if you attempt to provide it with an input context that is significantly longer than this, it will throw an exception or produce an error.

6 Using system-user-assistant format messages

Let’s take a look at yet another effective LLM API use case that includes specifying distinct system, user, and assistant messages. Let’s look at an illustration before going into greater detail about what it does. We are going to prompt this LLM using a new helper function named “get_completion_from_messages” and a number of messages. Here is an illustration of what you can do.

# Define helper function
def get_completion_from_messages(messages, 
    response = openai.ChatCompletion.create(
        temperature=temperature, # this is the degree of randomness of the model's output
        max_tokens=max_tokens, # the maximum number of tokens the model can ouptut 
    return response.choices[0].message["content"]

I’m going to start off by defining what a system message looks like. Since this is a system message, its content is “You are an assistant who responds in the style of Dr. Seuss.” The second message’s role is “role: user” and its content is “write me a very short poem about a happy carrot.” I’ll then define a user message. Let’s run it, and with “temperature = 1”, which has the most unpredictability/creativity in its output.

messages =  [  
 'content':"""You are an assistant who\
 responds in the style of Dr Seuss."""},    
 'content':"""write me a very short poem\
 about a happy carrot"""},  
response = get_completion_from_messages(messages, temperature=1)
Oh, this carrot is a happy chap
Bright orange and full of sap
With a smile so wide and free
He's the happiest veggie I do see

Well done ChatGPT! Not a bad result. Thus, in this illustration, the system message describes the general tone of what you want the Large Language Model to accomplish, whereas the user message is a specific directive you wished to implement in light of the higher level behaviour that was stated in the system message.

So this is how the chat format works.

The system message establishes the Large Language Model’s or the assistant’s general behaviour, and when you give it a user message—for example, “Tell me a joke” or “Write me a poem”—it responds appropriately while maintaining the general behaviour established in the system message. In addition, even though we are not illustrating it here, if you want to use this in a multi-term conversation, you can input assistant messages in this messages format to let ChatGPT know what it had previously said if you wanted to continue the conversation based on things that it had also said previously.

But let’s look at a few more instances. I can specify in the system message that all of your responses must be one sentence long if you want to establish the tone and instruct it to produce only one sentence of text.

And it only produces one sentence when I run this. It is now only one statement and no longer a poetry in the Dr. Seuss vein. The joyful carrot is the subject of a tale. I can utilise the system message to say, “You are an assistant who responds in the style of Dr. Seuss,” if we want to specify both the style and the length. Your sentences must all be composed of one sentence. Finally, this results in a lovely one-sentence poetry.

# length
messages =  [  
 'content':'All your responses must be \
one sentence long.'},    
 'content':'write me a story about a happy carrot'},  
response = get_completion_from_messages(messages, temperature =1)
There once was a happy carrot named Carl, who lived in a lush vegetable garden and enjoyed the warm sunshine and rain showers, until one day he was picked by a kind farmer who complimented him on his bright orange color and brought him home to be the star ingredient in a delicious carrot cake that his family loved.
# combined
messages =  [  
 'content':"""You are an assistant who \
responds in the style of Dr Seuss. \
All your responses must be one sentence long."""},    
 'content':"""write me a story about a happy carrot"""},
response = get_completion_from_messages(messages, 
                                        temperature =1)
Once there was a carrot so round and so bright, who lived in a garden soaking up sunlight; he worked with his friends and grew every day, and when he was picked, he was overjoyed to say, "I'm happy and grateful, thank you for the care, I'll nourish and delight, don't you dare despair!"

And finally, just for fun, here is a helper function that is a little bit more sophisticated that will tell you how many prompt tokens, completion tokens, and total tokens were used in your API call if you are using an LLM and want to know how many tokens you are using. It does this by getting a response from the OpenAI API endpoint and using other values in the response.

def get_completion_and_token_count(messages, 
    response = openai.ChatCompletion.create(
    content = response.choices[0].message["content"]
    token_dict = {

    return content, token_dict
messages = [
 'content':"""You are an assistant who responds\
 in the style of Dr Seuss."""},    
 'content':"""write me a very short poem \ 
 about a happy carrot"""},  
response, token_dict = get_completion_and_token_count(messages)
Oh, the happy carrot, so bright and so bold,
With a smile on its face, and a story untold.
It grew in the garden, with sun and with rain,
And now it's so happy, it can't help but exclaim!
{'prompt_tokens': 39, 'completion_tokens': 52, 'total_tokens': 91}

And this is a list of the tokens we use. In contrast to the prompt input’s 37 tokens, its output had 55. Therefore, 92 tokens were consumed in total. To be honest, I don’t really give the quantity of tokens I use while utilising LL Models much thought.

If you are concerned that the user may have provided you with an input that is too long and goes beyond ChatGPT’s 4,000 or so token limits, it may be worthwhile to check the number of tokens. In this case, you should double check the number of tokens and truncate the input to ensure that you are staying within the large language model’s input token limits.

7 How Prompting is revolutionisng AI application development

In the conventional supervised machine learning workflow, such as the example of classifying the positive and negative sentiments in restaurant reviews, if you want to build a classifier to categorise restaurant review positive and negative sentiments, you first get a tonne of label data, possibly hundreds of examples. I have no idea how long this will take—possibly a month. The next step would be to tune the model after it had been trained on data, obtain an adequate open source model, and evaluate it. That can require a few days, a few weeks, or even a few months. And after that, you might need to locate a cloud service to deploy it, upload your model there, execute it, and then you can finally call your model. And it’s again not uncommon for this to take a team a few months to get working.

In contrast with prompting-based machine learning, when you have a text application, you can specify a prompt. This can take minutes, maybe hours, if you need to iterate a few times to get an effective prompt. And then in hours, maybe at most days, but frankly more often hours, you can have this running using API calls and start making calls to the model. And once you’ve done that, in just again, maybe minutes or hours, you can start calling the model and start making inferences. And so there are applications that used to take me maybe six months or a year to build, that you can now build in minutes or hours, maybe very small numbers of days using prompting.

This is revolutionising the types of AI applications that can be developed quickly. One key caveat: although vision technology is now far less developed, it’s kind of getting there. This applies to many unstructured data applications, including text applications in particular and perhaps increasingly vision applications. This formula doesn’t really work for machine learning applications on tabular data with plenty of numerical values in Excel spreadsheets. The speed with which AI components can be developed, however, is altering the process by which a whole system might be developed for the applications to which this does apply. Even though the full system might still take days, weeks, or something, at least this component can be built considerably more quickly.

8 Newline Characters

Here, we are using a backslash \ to make the text fit on the screen without inserting newline ‘’ characters. GPT-3 isn’t really affected whether you insert newline characters or not. But when working with LLMs in general, you may consider whether newline characters in your prompt may affect the model’s performance.

9 Acknowledgements

I’d like to express my thanks to the wonderful Building Systems with the ChatGPT API Course by and OpenAI - which i completed, and acknowledge the use of some images and other materials from the course in this article.