LLaMa-2 70B Chatbot in Hugging Face and LangChain

In this article we will look at how we can use the open source Llama-2-70b-chat model in both Hugging Face transformers and LangChain.
natural-language-processing
deep-learning
langchain
hugging-face
Author

Pranath Fernando

Published

July 26, 2023

1 Introduction

In this article we’ll explore how we can use the open source Llama-2-70b-chat model in both Hugging Face transformers and LangChain.

Full credit for this article’s content goes to James Briggs; it is based on his YouTube video Llama 2 in LangChain - FIRST Open Source Conversational Agent!, with some relatively minor additions from me. There are also some formatting issues at the end of this article that I have been unable to resolve!

🚨 Note that running this on CPU is practically impossible, as it would take a very long time. If running on Google Colab, go to Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > A100. Using this notebook requires ~38GB of GPU RAM.

However, I have also discovered that running this on the free version of Google Colab is infeasible, as you are usually allocated the less powerful T4 GPU and will run out of disk space while trying to download the ~80 GB of model weights! So if using Colab I would recommend the Pro version.
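Before downloading anything, it’s worth checking which GPU Colab has actually allocated. A quick sanity check (this assumes torch is already available in the Colab runtime, which it normally is):

import torch

# confirm we have a CUDA device and see how much GPU RAM it has
print(torch.cuda.get_device_name(0))
print(f"{torch.cuda.get_device_properties(0).total_memory / 1e9:.0f} GB of GPU RAM")

If this reports a T4 rather than an A100, the 70B model is unlikely to fit even when quantized to 4-bit.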

We start by doing a pip install of all required libraries.

!pip install -qU transformers accelerate einops langchain xformers bitsandbytes

2 Initializing the Hugging Face Pipeline

The first thing we need to do is initialize a text-generation pipeline with Hugging Face transformers. The pipeline requires three things that we must initialize first:

  • An LLM, in this case it will be meta-llama/Llama-2-70b-chat-hf.

  • The respective tokenizer for the model.

  • A stopping criteria object.

We’ll explain these as we get to them; let’s begin with our model.

We initialize the model and move it to our CUDA-enabled GPU. On Colab this can take 5-10 minutes to download and initialize the model.

from torch import cuda, bfloat16
import transformers

model_id = 'meta-llama/Llama-2-70b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# begin initializing HF items, need auth token for these
hf_auth = '<YOUR_API_KEY>'
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)
model.eval()
print(f"Model loaded on {device}")
/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py:2193: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
Model loaded on cuda:0
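As a quick check that the 4-bit quantization has done its job, we can ask transformers for the model’s in-memory footprint. It should come out at roughly 35GB, rather than the ~140GB an fp16 copy of the 70B weights would need:

# report the in-memory size of the quantized model
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.1f} GB")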

The pipeline requires a tokenizer, which handles the translation of human-readable plaintext into the token IDs the LLM reads. The Llama 2 70B models were trained using the Llama 2 70B tokenizer, which we initialize like so:

tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1714: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
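As a quick illustration of what the tokenizer does, we can encode a short string into token IDs and decode it back again (the exact IDs it produces are specific to the Llama 2 tokenizer):

ids = tokenizer("Explain nuclear fusion")['input_ids']
print(ids)                    # a list of token IDs, starting with the BOS token
print(tokenizer.decode(ids))  # decodes back to the original text (plus the BOS marker)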

Finally, we need to define the stopping criteria of the model. The stopping criteria lets us specify when the model should stop generating text; if we don’t provide one, the model tends to go off on a tangent after answering the initial question. In particular, when we later use the model as a conversational agent we want it to stop as soon as it has produced a complete JSON action block like the one below (which is why the '\n```\n' string appears in the stop list):

```json
{
    "action": "Calculator",
    "action_input": "2+2"
}
```
stop_list = ['\nHuman:', '\n```\n']

stop_token_ids = [tokenizer(x)['input_ids'] for x in stop_list]
stop_token_ids

We need to convert these into LongTensor objects:

import torch

stop_token_ids = [torch.LongTensor(x).to(device) for x in stop_token_ids]
stop_token_ids

We can do a quick spot check that no <unk> token IDs (0) appear in stop_token_ids. There are none, so we can move on to building the stopping criteria object, which will check whether the stopping criteria has been satisfied, i.e. whether any of these token ID sequences have just been generated.
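The spot check itself is a one-liner over the tensors we just built:

# the <unk> token has ID 0 in the Llama 2 tokenizer; make sure none of our
# stop sequences contain it, otherwise the stopping check would be unreliable
assert not any((ids == 0).any() for ids in stop_token_ids)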

from transformers import StoppingCriteria, StoppingCriteriaList

# define custom stopping criteria object
class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        for stop_ids in stop_token_ids:
            if torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
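To see the logic in action, we can feed the criteria a made-up sequence of token IDs that ends with one of our stop sequences (a quick illustrative sketch):

# append the first stop sequence to some arbitrary token IDs
fake_ids = torch.cat(
    [torch.LongTensor([[1, 2, 3]]).to(device), stop_token_ids[0].unsqueeze(0)],
    dim=1
)
print(StopOnTokens()(fake_ids, scores=None))  # True: generation would stop here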

Now we’re ready to initialize the HF pipeline. There are a few additional parameters that we must define here. Comments explaining these have been included in the code.

generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    #stopping_criteria=stopping_criteria,  # without this model rambles during chat
    temperature=0.0,  # 'randomness' of outputs, 0.0 is the min and 1.0 the max
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)

Confirm this is working:

res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])
/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1270: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation )
  warnings.warn(
Explain to me the difference between nuclear fission and fusion.
Nuclear fission is a process in which an atomic nucleus splits into two or more smaller nuclei, releasing a large amount of energy in the process. This occurs when an atom's nucleus is bombarded with a high-energy particle, such as a neutron. The resulting nuclei are typically smaller and lighter than the original nucleus, and the excess energy is released as radiation. Fission is the process used in nuclear power plants to generate electricity.
Nuclear fusion, on the other hand, is the process by which two or more atomic nuclei combine to form a single, heavier nucleus. This process also releases a large amount of energy, but it requires the nuclei to be brought together at extremely high temperatures and pressures, typically found in the core of stars. Fusion is the process that powers the sun and other stars.
The key difference between fission and fusion is the direction of the energy release. In fission, the energy is released outward from the nucleus, while in fusion, the energy is released inward, binding the nuclei together. Additionally, fission typically involves the splitting of heavy elements, while fusion involves the combining of light elements.

Now let’s implement this in LangChain:

from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)
llm(prompt="Explain to me the difference between nuclear fission and fusion.")
"\nNuclear fission is a process in which an atomic nucleus splits into two or more smaller nuclei, releasing a large amount of energy in the process. This occurs when an atom's nucleus is bombarded with a high-energy particle, such as a neutron. The resulting nuclei are typically smaller and lighter than the original nucleus, and the excess energy is released as radiation. Fission is the process used in nuclear power plants to generate electricity.\nNuclear fusion, on the other hand, is the process by which two or more atomic nuclei combine to form a single, heavier nucleus. This process also releases a large amount of energy, but it requires the nuclei to be brought together at extremely high temperatures and pressures, typically found in the core of stars. Fusion is the process that powers the sun and other stars.\nThe key difference between fission and fusion is the direction of the energy release. In fission, the energy is released outward from the nucleus, while in fusion, the energy is released inward, binding the nuclei together. Additionally, fission typically involves the splitting of heavy elements, while fusion involves the combining of light elements."

We still get the same output, as we’re not really doing anything differently here, but we have now wrapped Llama 2 70B Chat for use within LangChain. Using this we can begin working with LangChain’s advanced agent tooling, chains, etc., with Llama 2.
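For example, the wrapped model drops straight into a standard LLMChain. A minimal sketch (the prompt text here is just illustrative):

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# a simple chain: prompt template -> Llama 2 70B pipeline
template = PromptTemplate(
    input_variables=["topic"],
    template="Explain {topic} in one short paragraph."
)
chain = LLMChain(llm=llm, prompt=template)
print(chain.run(topic="nuclear fusion"))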

3 Initializing a Conversational Agent

Getting a conversational agent to work with open source models is incredibly hard. However, with Llama 2 70B it is now possible. Let’s see how we can get it running!

We first need to initialize the agent. Conversational agents require several things, such as conversational memory, access to tools, and an LLM (which we have already initialized).

from langchain.memory import ConversationBufferWindowMemory
from langchain.agents import load_tools

memory = ConversationBufferWindowMemory(
    memory_key="chat_history", k=5, return_messages=True, output_key="output"
)
tools = load_tools(["llm-math"], llm=llm)
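The llm-math tool wraps a math chain behind a single Calculator tool. We can peek at how each tool will be described to the agent:

# the agent sees each tool as a name plus a natural language description
for tool in tools:
    print(f"{tool.name}: {tool.description}")

LangChain’s conversational chat agent expects the model to reply with a JSON blob containing an "action" and an "action_input". Open source models don’t always produce valid JSON, so we also define a custom output parser that falls back to returning the raw text as the final answer when parsing fails.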
from langchain.agents import AgentOutputParser
from langchain.agents.conversational_chat.prompt import FORMAT_INSTRUCTIONS
from langchain.output_parsers.json import parse_json_markdown
from langchain.schema import AgentAction, AgentFinish

class OutputParser(AgentOutputParser):
    def get_format_instructions(self) -> str:
        return FORMAT_INSTRUCTIONS

    def parse(self, text: str) -> AgentAction | AgentFinish:
        try:
            # this will work IF the text is a valid JSON with action and action_input
            response = parse_json_markdown(text)
            action, action_input = response["action"], response["action_input"]
            if action == "Final Answer":
                # this means the agent is finished so we call AgentFinish
                return AgentFinish({"output": action_input}, text)
            else:
                # otherwise the agent wants to use an action, so we call AgentAction
                return AgentAction(action, action_input, text)
        except Exception:
            # sometimes the agent will return a string that is not a valid JSON
            # often this happens when the agent is finished
            # so we just return the text as the output
            return AgentFinish({"output": text}, text)

    @property
    def _type(self) -> str:
        return "conversational_chat"

# initialize output parser for agent
parser = OutputParser()
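We can sanity check the parser on a couple of hand-written responses (an illustrative sketch; the exact reprs come from the AgentAction and AgentFinish schema imported above):

# a well-formed "Final Answer" response parses to AgentFinish
final = '```json\n{"action": "Final Answer", "action_input": "Hello!"}\n```'
print(parser.parse(final))

# a tool request parses to AgentAction
tool_call = '```json\n{"action": "Calculator", "action_input": "2+2"}\n```'
print(parser.parse(tool_call))

# anything that isn't valid JSON falls back to AgentFinish with the raw text
print(parser.parse("I am not JSON"))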
from langchain.agents import initialize_agent

# initialize agent
agent = initialize_agent(
    agent="chat-conversational-react-description",
    tools=tools,
    llm=llm,
    verbose=True,
    early_stopping_method="generate",
    memory=memory,
    #agent_kwargs={"output_parser": parser}
)
agent.agent.llm_chain.prompt
ChatPromptTemplate(input_variables=['input', 'chat_history', 'agent_scratchpad'], output_parser=None, partial_variables={}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], output_parser=None, partial_variables={}, template='Assistant is a large language model trained by OpenAI.\n\nAssistant is designed to be able to assist with a wide range of tasks, from answering simple questions to providing in-depth explanations and discussions on a wide range of topics. As a language model, Assistant is able to generate human-like text based on the input it receives, allowing it to engage in natural-sounding conversations and provide responses that are coherent and relevant to the topic at hand.\n\nAssistant is constantly learning and improving, and its capabilities are constantly evolving. It is able to process and understand large amounts of text, and can use this knowledge to provide accurate and informative responses to a wide range of questions. Additionally, Assistant is able to generate its own text based on the input it receives, allowing it to engage in discussions and provide explanations and descriptions on a wide range of topics.\n\nOverall, Assistant is a powerful system that can help with a wide range of tasks and provide valuable insights and information on a wide range of topics. Whether you need help with a specific question or just want to have a conversation about a particular topic, Assistant is here to assist.', template_format='f-string', validate_template=True), additional_kwargs={}), MessagesPlaceholder(variable_name='chat_history'), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['input'], output_parser=None, partial_variables={}, template='TOOLS\n------\nAssistant can ask the user to use tools to look up information that may be helpful in answering the users original question. The tools the human can use are:\n\n> Calculator: Useful for when you need to answer questions about math.\n\nRESPONSE FORMAT INSTRUCTIONS\n----------------------------\n\nWhen responding to me, please output a response in one of two formats:\n\n**Option 1:**\nUse this if you want the human to use a tool.\nMarkdown code snippet formatted in the following schema:\n\n```json\n{{\n    "action": string, \\ The action to take. Must be one of Calculator\n    "action_input": string \\ The input to the action\n}}\n```\n\n**Option #2:**\nUse this if you want to respond directly to the human. Markdown code snippet formatted in the following schema:\n\n```json\n{{\n    "action": "Final Answer",\n    "action_input": string \\ You should put what you want to return to use here\n}}\n```\n\nUSER\'S INPUT\n--------------------\nHere is the user\'s input (remember to respond with a markdown code snippet of a json blob with a single action, and NOTHING else):\n\n{input}', template_format='f-string', validate_template=True), additional_kwargs={}), MessagesPlaceholder(variable_name='agent_scratchpad')])

We need to add special tokens used to signify the beginning and end of instructions, and the beginning and end of system messages. These are described in the Llama-2 model cards on Hugging Face.

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
sys_msg = B_SYS + """Assistant is a expert JSON builder designed to assist with a wide range of tasks.

Assistant is able to respond to the User and use tools using JSON strings that contain "action" and "action_input" parameters.

All of Assistant's communication is performed using this JSON format.

Assistant can also use tools by responding to the user with tool use instructions in the same "action" and "action_input" JSON format. Tools available to Assistant are:

- "Calculator": Useful for when you need to answer questions about math.
  - To use the calculator tool, Assistant should write like so:
    ```json
    {{"action": "Calculator",
      "action_input": "sqrt(4)"}}
    ```

Here are some previous conversations between the Assistant and User:

User: Hey how are you today?
Assistant: ```json
{{"action": "Final Answer",
 "action_input": "I'm good thanks, how are you?"}}
```
User: I'm great, what is the square root of 4?
Assistant: ```json
{{"action": "Calculator",
 "action_input": "sqrt(4)"}}
```
User: 2.0
Assistant: ```json
{{"action": "Final Answer",
 "action_input": "It looks like the answer is 2!"}}
```
User: Thanks could you tell me what 4 to the power of 2 is?
Assistant: ```json
{{"action": "Calculator",
 "action_input": "4**2"}}
```
User: 16.0
Assistant: ```json
{{"action": "Final Answer",
 "action_input": "It looks like the answer is 16!"}}
```

Here is the latest conversation between Assistant and User.""" + E_SYS
new_prompt = agent.agent.create_prompt(
    system_message=sys_msg,
    tools=tools
)
agent.agent.llm_chain.prompt = new_prompt

In the Llama 2 paper they mention that it was difficult to keep Llama 2 chat models following instructions over multiple interactions. One way they found to improve this is to insert a reminder of the instructions into each user query. We do that here:

instruction = B_INST + " Respond to the following in JSON with 'action' and 'action_input' values " + E_INST
human_msg = instruction + "\nUser: {input}"

agent.agent.llm_chain.prompt.messages[2].prompt.template = human_msg
agent.agent.llm_chain.prompt
ChatPromptTemplate(input_variables=['input', 'chat_history', 'agent_scratchpad'], output_parser=None, partial_variables={}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], output_parser=None, partial_variables={}, template='<<SYS>>\nAssistant is a expert JSON builder designed to assist with a wide range of tasks.\n\nAssistant is able to respond to the User and use tools using JSON strings that contain "action" and "action_input" parameters.\n\nAll of Assistant\'s communication is performed using this JSON format.\n\nAssistant can also use tools by responding to the user with tool use instructions in the same "action" and "action_input" JSON format. Tools available to Assistant are:\n\n- "Calculator": Useful for when you need to answer questions about math.\n  - To use the calculator tool, Assistant should write like so:\n    ```json\n    {{"action": "Calculator",\n      "action_input": "sqrt(4)"}}\n    ```\n\nHere are some previous conversations between the Assistant and User:\n\nUser: Hey how are you today?\nAssistant: ```json\n{{"action": "Final Answer",\n "action_input": "I\'m good thanks, how are you?"}}\n```\nUser: I\'m great, what is the square root of 4?\nAssistant: ```json\n{{"action": "Calculator",\n "action_input": "sqrt(4)"}}\n```\nUser: 2.0\nAssistant: ```json\n{{"action": "Final Answer",\n "action_input": "It looks like the answer is 2!"}}\n```\nUser: Thanks could you tell me what 4 to the power of 2 is?\nAssistant: ```json\n{{"action": "Calculator",\n "action_input": "4**2"}}\n```\nUser: 16.0\nAssistant: ```json\n{{"action": "Final Answer",\n "action_input": "It looks like the answer is 16!"}}\n```\n\nHere is the latest conversation between Assistant and User.\n<</SYS>>\n\n', template_format='f-string', validate_template=True), additional_kwargs={}), MessagesPlaceholder(variable_name='chat_history'), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['input'], output_parser=None, partial_variables={}, template="[INST] Respond to the following in JSON with 'action' and 'action_input' values [/INST]\nUser: {input}", template_format='f-string', validate_template=True), additional_kwargs={}), MessagesPlaceholder(variable_name='agent_scratchpad')])

Now we can begin asking questions…

agent("hey how are you today?")


> Entering new AgentExecutor chain...

Assistant: ```json
{"action": "Final Answer",
 "action_input": "I'm good thanks, how are you?"}
```

> Finished chain.
{'input': 'hey how are you today?',
 'chat_history': [],
 'output': "I'm good thanks, how are you?"}
agent("what is 4 to the power of 2.1?")


> Entering new AgentExecutor chain...

Assistant: ```json
{"action": "Calculator",
 "action_input": "4**2.1"}
```
Observation: Answer: 18.37917367995256
Thought:

AI: 
Assistant: ```json
{"action": "Final Answer",
 "action_input": "It looks like the answer is 18.37917367995256!"}
```

> Finished chain.
{'input': 'what is 4 to the power of 2.1?',
 'chat_history': [HumanMessage(content='hey how are you today?', additional_kwargs={}, example=False),
  AIMessage(content="I'm good thanks, how are you?", additional_kwargs={}, example=False)],
 'output': 'It looks like the answer is 18.37917367995256!'}
agent("can you multiply that by 3?")


> Entering new AgentExecutor chain...

Assistant: ```json
{"action": "Calculator",
 "action_input": "18.37917367995256 * 3"}
```
Observation: Answer: 55.13752103985769
Thought:

AI: 
Assistant: ```json
{"action": "Final Answer",
 "action_input": "It looks like the answer is 55.13752103985769!"}
```

> Finished chain.
/usr/local/lib/python3.10/dist-packages/transformers/pipelines/base.py:1083: UserWarning: You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
  warnings.warn(
{'input': 'can you multiply that by 3?',
 'chat_history': [HumanMessage(content='hey how are you today?', additional_kwargs={}, example=False),
  AIMessage(content="I'm good thanks, how are you?", additional_kwargs={}, example=False),
  HumanMessage(content='what is 4 to the power of 2.1?', additional_kwargs={}, example=False),
  AIMessage(content='It looks like the answer is 18.37917367995256!', additional_kwargs={}, example=False)],
 'output': 'It looks like the answer is 55.13752103985769!'}

With that we have our open source conversational agent running on Colab with ~38GB of GPU RAM.

4 Acknowledgements

I’d like to give full credit for this article’s content to James Briggs, his YouTube video Llama 2 in LangChain - FIRST Open Source Conversational Agent!, and his accompanying notebook, on which this article is based.
