Using the Open Source GPT4All Large Language Model

There are many Large Language Models, but most are not fully accessible: the weights and architecture are restricted, and even with access it takes substantial resources to do anything with them, while building on top of the hosted APIs is not free. Open-source models like GPT4All overcome these limitations and widen everyone's access to LLMs.
natural-language-processing
deep-learning
langchain
activeloop
Author

Pranath Fernando

Published

July 30, 2023

1 Introduction

The OpenAI GPT-family models that we have discussed in previous articles are without a doubt effective. However, access to the weights and architecture of these models is limited, and even with access, running them requires a large amount of resources. It is also worth noting that, according to a number of benchmarks, the most recent generation of Intel® Xeon® CPUs (the 4th generation) can run language models efficiently.

Furthermore, building on top of the hosted APIs is not free. These restrictions can constrain ongoing research into large language models (LLMs). Open-source alternatives like GPT4All aim to overcome these limitations and make LLMs accessible to everyone.

2 How GPT4All works

GPT4All is trained from the weights of Facebook's LLaMA model, which was released for non-commercial use. However, with 7 billion parameters, a model of this size is difficult to run on your own computer. The authors used two techniques to make fine-tuning and inference efficient. Since the fine-tuning procedure is outside the scope of this article, we will concentrate on inference.

The main appeal of GPT4All models is that they run on a CPU. Since modern PCs already have capable Central Processing Units, testing these models is essentially free. The underlying technique that makes this possible is quantization: in essence, the pre-trained weights are converted to 4-bit precision and stored in the GGML format, so the model represents its numbers with fewer bits. This approach has two key benefits:

  1. Reducing Memory Usage: It makes deploying the models more efficient on low-resource devices.
  2. Faster Inference: The models will be faster during the generation process since there will be fewer computations.

It is true that quantization slightly compromises quality, but the trade-off is between a slightly less capable model and no model at all. A minimal sketch of the idea follows.
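
To make the idea concrete, here is a minimal, illustrative sketch of group-wise 4-bit quantization in NumPy. It is not the actual GGML q4_0 routine, only an approximation of the concept: each group of 32 weights is stored as 4-bit integers plus a single float scale, and dequantization recovers the weights approximately.

# Illustrative sketch only -- not the actual GGML q4_0 routine.
import numpy as np

def quantize_q4_sketch(weights, group_size=32):
    """Quantise a 1-D float array to 4-bit ints with one scale per group."""
    weights = weights.reshape(-1, group_size)
    # One scale per group, chosen so the largest magnitude maps to roughly +/-7
    scales = np.abs(weights).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero groups
    q = np.clip(np.round(weights / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_q4_sketch(q, scales):
    """Recover approximate float weights from the quantised representation."""
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_q4_sketch(w)
w_hat = dequantize_q4_sketch(q, s)
print("mean absolute error:", np.abs(w - w_hat).mean())  # small, but non-zero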

By incorporating the models into its infrastructure and using libraries such as Intel® Extension for PyTorch and Intel® Neural Compressor, it is possible to optimise the models further and unlock the capabilities of Intel® CPUs. These CPUs offer several accelerations, including the oneAPI Math Kernel Library (oneMKL), which provides highly efficient, parallelised math routines, and Intel® Advanced Matrix Extensions (Intel® AMX), which optimises matrix operations.

They also provide Intel® Streaming SIMD Extensions (Intel® SSE) for parallel data processing and Intel® AVX-512, which expands the CPU's register width to speed up calculations. Thanks to these developments, 4th generation Intel® Xeon® processors are capable hardware for fine-tuning and running inference on deep learning models, according to the cited benchmarks.

3 Import Libs & Setup

# LangChain components for prompting and chaining, plus the GPT4All wrapper
from langchain import PromptTemplate, LLMChain
from langchain.llms import GPT4All
# Callback utilities used to stream tokens to stdout as they are generated
from langchain.callbacks.base import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
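
The code in this article also assumes that LangChain, plus requests and tqdm for the download step, are already installed. A notebook install cell might look like the following (the unpinned versions here are an assumption; use whichever versions work in your environment):

!pip install langchain requests tqdm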

4 Using GPT4All

4.1 Convert the Model

The first step is to get the weights and convert them from the old format to the new one using the convert.py script from the llama.cpp repository. This step is necessary for the LangChain library to recognise the checkpoint file.

First, the weights file needs to be downloaded. You can download it manually from the URL in the snippet below (be sure to get the file ending in *.ggml.bin), or run the Python code to stream the file down in chunks. The destination is set by the local_path variable.

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])
import requests
from pathlib import Path
from tqdm import tqdm

local_path = './models/gpt4all-lora-quantized-ggml.bin'  # replace with your desired local file path
Path(local_path).parent.mkdir(parents=True, exist_ok=True)

# Example model. Check https://github.com/nomic-ai/pyllamacpp for the latest models.
url = 'https://the-eye.eu/public/AI/models/nomic-ai/gpt4all/gpt4all-lora-quantized-ggml.bin'
# url = 'http://gpt4all.io/models/ggml-gpt4all-l13b-snoozy.bin'

# send a GET request to the URL to download the file. Stream since it's large
response = requests.get(url, stream=True)

# open the file in binary mode and write the contents of the response to it in chunks
# This is a large file, so be prepared to wait.
with open(local_path, 'wb') as f:
    for chunk in tqdm(response.iter_content(chunk_size=8192)):
        if chunk:
            f.write(chunk)
514266it [01:37, 5285.36it/s]

Given that the file is around 4GB in size, this operation can take some time. The downloaded file then needs to be converted to the newest format. To do that, clone (or fork) the llama.cpp repository to obtain its source code (git must be installed on your computer), then run the convert.py script with a Python interpreter, passing the downloaded file as its argument.

https://github.com/ggerganov/llama.cpp#using-gpt4all

!git clone https://github.com/ggerganov/llama.cpp.git
!cd llama.cpp && git checkout 2b26469
Cloning into 'llama.cpp'...
remote: Enumerating objects: 4187, done.
remote: Counting objects: 100% (1748/1748), done.
remote: Compressing objects: 100% (195/195), done.
remote: Total 4187 (delta 1654), reused 1573 (delta 1553), pack-reused 2439
Receiving objects: 100% (4187/4187), 3.64 MiB | 15.72 MiB/s, done.
Resolving deltas: 100% (2836/2836), done.
Note: switching to '2b26469'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 2b26469 convert.py: Support models which are stored in a single pytorch_model.bin (#1469)

It takes a few seconds to finish. The script produces a new file named ggml-model-q4_0.bin in the same directory as the original, which we use in the section that follows.

!python3 llama.cpp/convert.py ./models/gpt4all-lora-quantized-ggml.bin
Loading model file models/gpt4all-lora-quantized-ggml.bin
Writing vocab...
[  1/291] Writing tensor tok_embeddings.weight                  | size  32001 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[  2/291] Writing tensor norm.weight                            | size   4096           | type UnquantizedDataType(name='F32')
[  3/291] Writing tensor output.weight                          | size  32001 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[  4/291] Writing tensor layers.0.attention.wq.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[  5/291] Writing tensor layers.0.attention.wk.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[  6/291] Writing tensor layers.0.attention.wv.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[  7/291] Writing tensor layers.0.attention.wo.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[  8/291] Writing tensor layers.0.attention_norm.weight         | size   4096           | type UnquantizedDataType(name='F32')
[  9/291] Writing tensor layers.0.feed_forward.w1.weight        | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 10/291] Writing tensor layers.0.feed_forward.w2.weight        | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 11/291] Writing tensor layers.0.feed_forward.w3.weight        | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 12/291] Writing tensor layers.0.ffn_norm.weight               | size   4096           | type UnquantizedDataType(name='F32')
[ 13/291] Writing tensor layers.1.attention.wq.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 14/291] Writing tensor layers.1.attention.wk.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 15/291] Writing tensor layers.1.attention.wv.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 16/291] Writing tensor layers.1.attention.wo.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 17/291] Writing tensor layers.1.attention_norm.weight         | size   4096           | type UnquantizedDataType(name='F32')
[ 18/291] Writing tensor layers.1.feed_forward.w1.weight        | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 19/291] Writing tensor layers.1.feed_forward.w2.weight        | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 20/291] Writing tensor layers.1.feed_forward.w3.weight        | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 21/291] Writing tensor layers.1.ffn_norm.weight               | size   4096           | type UnquantizedDataType(name='F32')
[ 22/291] Writing tensor layers.2.attention.wq.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 23/291] Writing tensor layers.2.attention.wk.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 24/291] Writing tensor layers.2.attention.wv.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 25/291] Writing tensor layers.2.attention.wo.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 26/291] Writing tensor layers.2.attention_norm.weight         | size   4096           | type UnquantizedDataType(name='F32')
[ 27/291] Writing tensor layers.2.feed_forward.w1.weight        | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 28/291] Writing tensor layers.2.feed_forward.w2.weight        | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 29/291] Writing tensor layers.2.feed_forward.w3.weight        | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 30/291] Writing tensor layers.2.ffn_norm.weight               | size   4096           | type UnquantizedDataType(name='F32')
[ 31/291] Writing tensor layers.3.attention.wq.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 32/291] Writing tensor layers.3.attention.wk.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 33/291] Writing tensor layers.3.attention.wv.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 34/291] Writing tensor layers.3.attention.wo.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 35/291] Writing tensor layers.3.attention_norm.weight         | size   4096           | type UnquantizedDataType(name='F32')
[ 36/291] Writing tensor layers.3.feed_forward.w1.weight        | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 37/291] Writing tensor layers.3.feed_forward.w2.weight        | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 38/291] Writing tensor layers.3.feed_forward.w3.weight        | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 39/291] Writing tensor layers.3.ffn_norm.weight               | size   4096           | type UnquantizedDataType(name='F32')
[ 40/291] Writing tensor layers.4.attention.wq.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 41/291] Writing tensor layers.4.attention.wk.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 42/291] Writing tensor layers.4.attention.wv.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 43/291] Writing tensor layers.4.attention.wo.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 44/291] Writing tensor layers.4.attention_norm.weight         | size   4096           | type UnquantizedDataType(name='F32')
[ 45/291] Writing tensor layers.4.feed_forward.w1.weight        | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 46/291] Writing tensor layers.4.feed_forward.w2.weight        | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 47/291] Writing tensor layers.4.feed_forward.w3.weight        | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 48/291] Writing tensor layers.4.ffn_norm.weight               | size   4096           | type UnquantizedDataType(name='F32')
[ 49/291] Writing tensor layers.5.attention.wq.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 50/291] Writing tensor layers.5.attention.wk.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 51/291] Writing tensor layers.5.attention.wv.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 52/291] Writing tensor layers.5.attention.wo.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 53/291] Writing tensor layers.5.attention_norm.weight         | size   4096           | type UnquantizedDataType(name='F32')
[ 54/291] Writing tensor layers.5.feed_forward.w1.weight        | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 55/291] Writing tensor layers.5.feed_forward.w2.weight        | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 56/291] Writing tensor layers.5.feed_forward.w3.weight        | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 57/291] Writing tensor layers.5.ffn_norm.weight               | size   4096           | type UnquantizedDataType(name='F32')
[ 58/291] Writing tensor layers.6.attention.wq.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 59/291] Writing tensor layers.6.attention.wk.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 60/291] Writing tensor layers.6.attention.wv.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 61/291] Writing tensor layers.6.attention.wo.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 62/291] Writing tensor layers.6.attention_norm.weight         | size   4096           | type UnquantizedDataType(name='F32')
[ 63/291] Writing tensor layers.6.feed_forward.w1.weight        | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 64/291] Writing tensor layers.6.feed_forward.w2.weight        | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 65/291] Writing tensor layers.6.feed_forward.w3.weight        | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 66/291] Writing tensor layers.6.ffn_norm.weight               | size   4096           | type UnquantizedDataType(name='F32')
[ 67/291] Writing tensor layers.7.attention.wq.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 68/291] Writing tensor layers.7.attention.wk.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 69/291] Writing tensor layers.7.attention.wv.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 70/291] Writing tensor layers.7.attention.wo.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 71/291] Writing tensor layers.7.attention_norm.weight         | size   4096           | type UnquantizedDataType(name='F32')
[ 72/291] Writing tensor layers.7.feed_forward.w1.weight        | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 73/291] Writing tensor layers.7.feed_forward.w2.weight        | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 74/291] Writing tensor layers.7.feed_forward.w3.weight        | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 75/291] Writing tensor layers.7.ffn_norm.weight               | size   4096           | type UnquantizedDataType(name='F32')
[ 76/291] Writing tensor layers.8.attention.wq.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 77/291] Writing tensor layers.8.attention.wk.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 78/291] Writing tensor layers.8.attention.wv.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 79/291] Writing tensor layers.8.attention.wo.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 80/291] Writing tensor layers.8.attention_norm.weight         | size   4096           | type UnquantizedDataType(name='F32')
[ 81/291] Writing tensor layers.8.feed_forward.w1.weight        | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 82/291] Writing tensor layers.8.feed_forward.w2.weight        | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 83/291] Writing tensor layers.8.feed_forward.w3.weight        | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 84/291] Writing tensor layers.8.ffn_norm.weight               | size   4096           | type UnquantizedDataType(name='F32')
[ 85/291] Writing tensor layers.9.attention.wq.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 86/291] Writing tensor layers.9.attention.wk.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 87/291] Writing tensor layers.9.attention.wv.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 88/291] Writing tensor layers.9.attention.wo.weight           | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 89/291] Writing tensor layers.9.attention_norm.weight         | size   4096           | type UnquantizedDataType(name='F32')
[ 90/291] Writing tensor layers.9.feed_forward.w1.weight        | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 91/291] Writing tensor layers.9.feed_forward.w2.weight        | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 92/291] Writing tensor layers.9.feed_forward.w3.weight        | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 93/291] Writing tensor layers.9.ffn_norm.weight               | size   4096           | type UnquantizedDataType(name='F32')
[ 94/291] Writing tensor layers.10.attention.wq.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 95/291] Writing tensor layers.10.attention.wk.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 96/291] Writing tensor layers.10.attention.wv.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 97/291] Writing tensor layers.10.attention.wo.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[ 98/291] Writing tensor layers.10.attention_norm.weight        | size   4096           | type UnquantizedDataType(name='F32')
[ 99/291] Writing tensor layers.10.feed_forward.w1.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[100/291] Writing tensor layers.10.feed_forward.w2.weight       | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[101/291] Writing tensor layers.10.feed_forward.w3.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[102/291] Writing tensor layers.10.ffn_norm.weight              | size   4096           | type UnquantizedDataType(name='F32')
[103/291] Writing tensor layers.11.attention.wq.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[104/291] Writing tensor layers.11.attention.wk.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[105/291] Writing tensor layers.11.attention.wv.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[106/291] Writing tensor layers.11.attention.wo.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[107/291] Writing tensor layers.11.attention_norm.weight        | size   4096           | type UnquantizedDataType(name='F32')
[108/291] Writing tensor layers.11.feed_forward.w1.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[109/291] Writing tensor layers.11.feed_forward.w2.weight       | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[110/291] Writing tensor layers.11.feed_forward.w3.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[111/291] Writing tensor layers.11.ffn_norm.weight              | size   4096           | type UnquantizedDataType(name='F32')
[112/291] Writing tensor layers.12.attention.wq.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[113/291] Writing tensor layers.12.attention.wk.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[114/291] Writing tensor layers.12.attention.wv.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[115/291] Writing tensor layers.12.attention.wo.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[116/291] Writing tensor layers.12.attention_norm.weight        | size   4096           | type UnquantizedDataType(name='F32')
[117/291] Writing tensor layers.12.feed_forward.w1.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[118/291] Writing tensor layers.12.feed_forward.w2.weight       | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[119/291] Writing tensor layers.12.feed_forward.w3.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[120/291] Writing tensor layers.12.ffn_norm.weight              | size   4096           | type UnquantizedDataType(name='F32')
[121/291] Writing tensor layers.13.attention.wq.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[122/291] Writing tensor layers.13.attention.wk.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[123/291] Writing tensor layers.13.attention.wv.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[124/291] Writing tensor layers.13.attention.wo.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[125/291] Writing tensor layers.13.attention_norm.weight        | size   4096           | type UnquantizedDataType(name='F32')
[126/291] Writing tensor layers.13.feed_forward.w1.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[127/291] Writing tensor layers.13.feed_forward.w2.weight       | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[128/291] Writing tensor layers.13.feed_forward.w3.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[129/291] Writing tensor layers.13.ffn_norm.weight              | size   4096           | type UnquantizedDataType(name='F32')
[130/291] Writing tensor layers.14.attention.wq.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[131/291] Writing tensor layers.14.attention.wk.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[132/291] Writing tensor layers.14.attention.wv.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[133/291] Writing tensor layers.14.attention.wo.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[134/291] Writing tensor layers.14.attention_norm.weight        | size   4096           | type UnquantizedDataType(name='F32')
[135/291] Writing tensor layers.14.feed_forward.w1.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[136/291] Writing tensor layers.14.feed_forward.w2.weight       | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[137/291] Writing tensor layers.14.feed_forward.w3.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[138/291] Writing tensor layers.14.ffn_norm.weight              | size   4096           | type UnquantizedDataType(name='F32')
[139/291] Writing tensor layers.15.attention.wq.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[140/291] Writing tensor layers.15.attention.wk.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[141/291] Writing tensor layers.15.attention.wv.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[142/291] Writing tensor layers.15.attention.wo.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[143/291] Writing tensor layers.15.attention_norm.weight        | size   4096           | type UnquantizedDataType(name='F32')
[144/291] Writing tensor layers.15.feed_forward.w1.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[145/291] Writing tensor layers.15.feed_forward.w2.weight       | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[146/291] Writing tensor layers.15.feed_forward.w3.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[147/291] Writing tensor layers.15.ffn_norm.weight              | size   4096           | type UnquantizedDataType(name='F32')
[148/291] Writing tensor layers.16.attention.wq.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[149/291] Writing tensor layers.16.attention.wk.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[150/291] Writing tensor layers.16.attention.wv.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[151/291] Writing tensor layers.16.attention.wo.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[152/291] Writing tensor layers.16.attention_norm.weight        | size   4096           | type UnquantizedDataType(name='F32')
[153/291] Writing tensor layers.16.feed_forward.w1.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[154/291] Writing tensor layers.16.feed_forward.w2.weight       | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[155/291] Writing tensor layers.16.feed_forward.w3.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[156/291] Writing tensor layers.16.ffn_norm.weight              | size   4096           | type UnquantizedDataType(name='F32')
[157/291] Writing tensor layers.17.attention.wq.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[158/291] Writing tensor layers.17.attention.wk.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[159/291] Writing tensor layers.17.attention.wv.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[160/291] Writing tensor layers.17.attention.wo.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[161/291] Writing tensor layers.17.attention_norm.weight        | size   4096           | type UnquantizedDataType(name='F32')
[162/291] Writing tensor layers.17.feed_forward.w1.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[163/291] Writing tensor layers.17.feed_forward.w2.weight       | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[164/291] Writing tensor layers.17.feed_forward.w3.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[165/291] Writing tensor layers.17.ffn_norm.weight              | size   4096           | type UnquantizedDataType(name='F32')
[166/291] Writing tensor layers.18.attention.wq.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[167/291] Writing tensor layers.18.attention.wk.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[168/291] Writing tensor layers.18.attention.wv.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[169/291] Writing tensor layers.18.attention.wo.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[170/291] Writing tensor layers.18.attention_norm.weight        | size   4096           | type UnquantizedDataType(name='F32')
[171/291] Writing tensor layers.18.feed_forward.w1.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[172/291] Writing tensor layers.18.feed_forward.w2.weight       | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[173/291] Writing tensor layers.18.feed_forward.w3.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[174/291] Writing tensor layers.18.ffn_norm.weight              | size   4096           | type UnquantizedDataType(name='F32')
[175/291] Writing tensor layers.19.attention.wq.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[176/291] Writing tensor layers.19.attention.wk.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[177/291] Writing tensor layers.19.attention.wv.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[178/291] Writing tensor layers.19.attention.wo.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[179/291] Writing tensor layers.19.attention_norm.weight        | size   4096           | type UnquantizedDataType(name='F32')
[180/291] Writing tensor layers.19.feed_forward.w1.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[181/291] Writing tensor layers.19.feed_forward.w2.weight       | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[182/291] Writing tensor layers.19.feed_forward.w3.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[183/291] Writing tensor layers.19.ffn_norm.weight              | size   4096           | type UnquantizedDataType(name='F32')
[184/291] Writing tensor layers.20.attention.wq.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[185/291] Writing tensor layers.20.attention.wk.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[186/291] Writing tensor layers.20.attention.wv.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[187/291] Writing tensor layers.20.attention.wo.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[188/291] Writing tensor layers.20.attention_norm.weight        | size   4096           | type UnquantizedDataType(name='F32')
[189/291] Writing tensor layers.20.feed_forward.w1.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[190/291] Writing tensor layers.20.feed_forward.w2.weight       | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[191/291] Writing tensor layers.20.feed_forward.w3.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[192/291] Writing tensor layers.20.ffn_norm.weight              | size   4096           | type UnquantizedDataType(name='F32')
[193/291] Writing tensor layers.21.attention.wq.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[194/291] Writing tensor layers.21.attention.wk.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[195/291] Writing tensor layers.21.attention.wv.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[196/291] Writing tensor layers.21.attention.wo.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[197/291] Writing tensor layers.21.attention_norm.weight        | size   4096           | type UnquantizedDataType(name='F32')
[198/291] Writing tensor layers.21.feed_forward.w1.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[199/291] Writing tensor layers.21.feed_forward.w2.weight       | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[200/291] Writing tensor layers.21.feed_forward.w3.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[201/291] Writing tensor layers.21.ffn_norm.weight              | size   4096           | type UnquantizedDataType(name='F32')
[202/291] Writing tensor layers.22.attention.wq.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[203/291] Writing tensor layers.22.attention.wk.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[204/291] Writing tensor layers.22.attention.wv.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[205/291] Writing tensor layers.22.attention.wo.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[206/291] Writing tensor layers.22.attention_norm.weight        | size   4096           | type UnquantizedDataType(name='F32')
[207/291] Writing tensor layers.22.feed_forward.w1.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[208/291] Writing tensor layers.22.feed_forward.w2.weight       | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[209/291] Writing tensor layers.22.feed_forward.w3.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[210/291] Writing tensor layers.22.ffn_norm.weight              | size   4096           | type UnquantizedDataType(name='F32')
[211/291] Writing tensor layers.23.attention.wq.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[212/291] Writing tensor layers.23.attention.wk.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[213/291] Writing tensor layers.23.attention.wv.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[214/291] Writing tensor layers.23.attention.wo.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[215/291] Writing tensor layers.23.attention_norm.weight        | size   4096           | type UnquantizedDataType(name='F32')
[216/291] Writing tensor layers.23.feed_forward.w1.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[217/291] Writing tensor layers.23.feed_forward.w2.weight       | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[218/291] Writing tensor layers.23.feed_forward.w3.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[219/291] Writing tensor layers.23.ffn_norm.weight              | size   4096           | type UnquantizedDataType(name='F32')
[220/291] Writing tensor layers.24.attention.wq.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[221/291] Writing tensor layers.24.attention.wk.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[222/291] Writing tensor layers.24.attention.wv.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[223/291] Writing tensor layers.24.attention.wo.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[224/291] Writing tensor layers.24.attention_norm.weight        | size   4096           | type UnquantizedDataType(name='F32')
[225/291] Writing tensor layers.24.feed_forward.w1.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[226/291] Writing tensor layers.24.feed_forward.w2.weight       | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[227/291] Writing tensor layers.24.feed_forward.w3.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[228/291] Writing tensor layers.24.ffn_norm.weight              | size   4096           | type UnquantizedDataType(name='F32')
[229/291] Writing tensor layers.25.attention.wq.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[230/291] Writing tensor layers.25.attention.wk.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[231/291] Writing tensor layers.25.attention.wv.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[232/291] Writing tensor layers.25.attention.wo.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[233/291] Writing tensor layers.25.attention_norm.weight        | size   4096           | type UnquantizedDataType(name='F32')
[234/291] Writing tensor layers.25.feed_forward.w1.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[235/291] Writing tensor layers.25.feed_forward.w2.weight       | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[236/291] Writing tensor layers.25.feed_forward.w3.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[237/291] Writing tensor layers.25.ffn_norm.weight              | size   4096           | type UnquantizedDataType(name='F32')
[238/291] Writing tensor layers.26.attention.wq.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[239/291] Writing tensor layers.26.attention.wk.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[240/291] Writing tensor layers.26.attention.wv.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[241/291] Writing tensor layers.26.attention.wo.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[242/291] Writing tensor layers.26.attention_norm.weight        | size   4096           | type UnquantizedDataType(name='F32')
[243/291] Writing tensor layers.26.feed_forward.w1.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[244/291] Writing tensor layers.26.feed_forward.w2.weight       | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[245/291] Writing tensor layers.26.feed_forward.w3.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[246/291] Writing tensor layers.26.ffn_norm.weight              | size   4096           | type UnquantizedDataType(name='F32')
[247/291] Writing tensor layers.27.attention.wq.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[248/291] Writing tensor layers.27.attention.wk.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[249/291] Writing tensor layers.27.attention.wv.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[250/291] Writing tensor layers.27.attention.wo.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[251/291] Writing tensor layers.27.attention_norm.weight        | size   4096           | type UnquantizedDataType(name='F32')
[252/291] Writing tensor layers.27.feed_forward.w1.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[253/291] Writing tensor layers.27.feed_forward.w2.weight       | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[254/291] Writing tensor layers.27.feed_forward.w3.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[255/291] Writing tensor layers.27.ffn_norm.weight              | size   4096           | type UnquantizedDataType(name='F32')
[256/291] Writing tensor layers.28.attention.wq.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[257/291] Writing tensor layers.28.attention.wk.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[258/291] Writing tensor layers.28.attention.wv.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[259/291] Writing tensor layers.28.attention.wo.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[260/291] Writing tensor layers.28.attention_norm.weight        | size   4096           | type UnquantizedDataType(name='F32')
[261/291] Writing tensor layers.28.feed_forward.w1.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[262/291] Writing tensor layers.28.feed_forward.w2.weight       | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[263/291] Writing tensor layers.28.feed_forward.w3.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[264/291] Writing tensor layers.28.ffn_norm.weight              | size   4096           | type UnquantizedDataType(name='F32')
[265/291] Writing tensor layers.29.attention.wq.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[266/291] Writing tensor layers.29.attention.wk.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[267/291] Writing tensor layers.29.attention.wv.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[268/291] Writing tensor layers.29.attention.wo.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[269/291] Writing tensor layers.29.attention_norm.weight        | size   4096           | type UnquantizedDataType(name='F32')
[270/291] Writing tensor layers.29.feed_forward.w1.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[271/291] Writing tensor layers.29.feed_forward.w2.weight       | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[272/291] Writing tensor layers.29.feed_forward.w3.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[273/291] Writing tensor layers.29.ffn_norm.weight              | size   4096           | type UnquantizedDataType(name='F32')
[274/291] Writing tensor layers.30.attention.wq.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[275/291] Writing tensor layers.30.attention.wk.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[276/291] Writing tensor layers.30.attention.wv.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[277/291] Writing tensor layers.30.attention.wo.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[278/291] Writing tensor layers.30.attention_norm.weight        | size   4096           | type UnquantizedDataType(name='F32')
[279/291] Writing tensor layers.30.feed_forward.w1.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[280/291] Writing tensor layers.30.feed_forward.w2.weight       | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[281/291] Writing tensor layers.30.feed_forward.w3.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[282/291] Writing tensor layers.30.ffn_norm.weight              | size   4096           | type UnquantizedDataType(name='F32')
[283/291] Writing tensor layers.31.attention.wq.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[284/291] Writing tensor layers.31.attention.wk.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[285/291] Writing tensor layers.31.attention.wv.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[286/291] Writing tensor layers.31.attention.wo.weight          | size   4096 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[287/291] Writing tensor layers.31.attention_norm.weight        | size   4096           | type UnquantizedDataType(name='F32')
[288/291] Writing tensor layers.31.feed_forward.w1.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[289/291] Writing tensor layers.31.feed_forward.w2.weight       | size   4096 x  11008  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[290/291] Writing tensor layers.31.feed_forward.w3.weight       | size  11008 x   4096  | type QuantizedDataType(groupsize=32, have_addends=False, have_g_idx=False)
[291/291] Writing tensor layers.31.ffn_norm.weight              | size   4096           | type UnquantizedDataType(name='F32')
Wrote models/ggml-model-q4_0.bin

4.2 Load the Model and Generate

The converted GPT4All weights are loaded into LangChain via the PyLLaMAcpp module. Install the package with the command pip install pyllamacpp==1.0.7 (shown below); the required functions were imported earlier and will be explained as we use them.
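
If you are working in a notebook, the install command from the paragraph above can be run directly as a cell:

!pip install pyllamacpp==1.0.7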

After setting up the desired callback behaviour, it's time to load the model from the converted file.

# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
# Verbose is required to pass to the callback manager
llm = GPT4All(model="./models/ggml-model-q4_0.bin", callback_manager=callback_manager, verbose=True)
llm_chain = LLMChain(prompt=prompt, llm=llm)

The default behaviour is to wait until the model's inference phase is complete before printing its results. However, due to the vast number of parameters in the model, it can take more than an hour (depending on your hardware) to respond to a single query. The StreamingStdOutCallbackHandler() callback lets us display each token as soon as it is generated. This way, we can be certain that the generation process is running and that the model is behaving as intended; if not, we can stop the inference and change the prompt.

The GPT4All class reads the weights file, initialises the model, and registers the necessary callbacks. The LLMChain class then connects the language model and the prompt, and the run() method lets us query the model.

5 Documentation Example

Let's start by saying that defining the prompt is probably the most important step in communicating with LLMs. LangChain uses a PromptTemplate object to establish ground rules for the model during generation. For example, you can demonstrate how you would like the model to write, a technique known as few-shot learning.

question = "What NFL team won the Super Bowl in the year Justin Bieber was born?"
llm_chain.run(question)
 Question: What NFL team won the Super Bowl in the year Justin Bieber was born?

Answer: Let's think step by step. First, we need to identify when JB (Justin Beiber) was born - May 1st, 1988 according to his Wikipedia profile page. Then, since NFL team refers to a Football or American football team in professional National Football League league that plays during the season and is considered as one of the major sports leagues from America which has won Super Bowl on Justin Beiber's birth year; we need to look for teams winning Super Bowls before his birth.
As per Wikipedia, "The 2019 NFL Draft will be held April 25–37 in Nashville". Since JB was born in May of the same year as draft (the month and day are both zero), we can confidently conclude that he must have been alive for at least one Super Bowl win by an NFL team. However, since his birth occurred after a couple of teams winning their first ever Super Bowls such as Denver Broncos' victory in 1978 & Dallas Cowboys’ triumph over Miami Dolphins in the same year; we can only guess that JB might have been alive for any one among those mentioned NFL Teams who won Super Bowl after his birth.
' Question: What NFL team won the Super Bowl in the year Justin Bieber was born?\n\nAnswer: Let\'s think step by step. First, we need to identify when JB (Justin Beiber) was born - May 1st, 1988 according to his Wikipedia profile page. Then, since NFL team refers to a Football or American football team in professional National Football League league that plays during the season and is considered as one of the major sports leagues from America which has won Super Bowl on Justin Beiber\'s birth year; we need to look for teams winning Super Bowls before his birth.\nAs per Wikipedia, "The 2019 NFL Draft will be held April 25–37 in Nashville". Since JB was born in May of the same year as draft (the month and day are both zero), we can confidently conclude that he must have been alive for at least one Super Bowl win by an NFL team. However, since his birth occurred after a couple of teams winning their first ever Super Bowls such as Denver Broncos\' victory in 1978 & Dallas Cowboys’ triumph over Miami Dolphins in the same year; we can only guess that JB might have been alive for any one among those mentioned NFL Teams who won Super Bowl after his birth.\n'

The template string defines the overall framework of the interaction. In our case, it is a question-and-answer format in which the model responds to a user's query. There are two crucial components:

1. We declare the placeholder "question" and pass it to the template object via input_variables so that it can later be filled in by the user.

2. It establishes a behaviour or style for the generation process based on our preferences. In the sample code above, for instance, we ask the model to lay out its reasoning step by step. The possibilities are endless: we could ask the model a question without giving any hints, have it answer in a single word, or have it tell a joke (a sketch of one such variation follows below).
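As an illustration only (not from the original notebook), a hypothetical one-word-answer variation of the same pattern could be sketched like this:

# Hypothetical variation: constrain the model to a single-word answer
one_word_template = """Question: {question}

Answer (in a single word):"""

one_word_prompt = PromptTemplate(template=one_word_template, input_variables=["question"])
one_word_chain = LLMChain(prompt=one_word_prompt, llm=llm)
one_word_chain.run("Is Mount Everest the tallest mountain on Earth?")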

6 Poetic Example

question = "Write a poem about friendship that rhymes."
llm_chain.run(question)
 Question: Write a poem about friendship that rhymes.

Answer: Let's think step by step. First, let me tell you what it means to be friends with someone who is kind and caring like the sunshine on my face when we meet again after our long separation!  This could be considered as one of those rare gifts that makes life worthwhile in this world full of selfishness and dangers. Our friendship will shatter all kinds of challenges or barriers, for I believe there is always hope till the last breath has been breathed out. And we'll share tears together through good times as well as hardships like brave soldiers who are willing to take risks because they value our precious lives dearly! So friends this poem that rhymed may not be perfect but it surely says everything I have in mind for you all my life long, so please enjoy and let's cherish each other always.
" Question: Write a poem about friendship that rhymes.\n\nAnswer: Let's think step by step. First, let me tell you what it means to be friends with someone who is kind and caring like the sunshine on my face when we meet again after our long separation!  This could be considered as one of those rare gifts that makes life worthwhile in this world full of selfishness and dangers. Our friendship will shatter all kinds of challenges or barriers, for I believe there is always hope till the last breath has been breathed out. And we'll share tears together through good times as well as hardships like brave soldiers who are willing to take risks because they value our precious lives dearly! So friends this poem that rhymed may not be perfect but it surely says everything I have in mind for you all my life long, so please enjoy and let's cherish each other always."

7 Mother's Day Example

question = "Write a social media post to celebrate mother's day."
llm_chain.run(question)
 Question: Write a social media post to celebrate mother's day.

Answer: Let's think step by step. First, we need an attractive and catchy headline that will make your mom feel appreciated on this special occasion! Here are some examples of what you could write as the heading:"Mommy, You Are The Best Thing That Has Ever Happened To Me" or "I Am So Lucky to Have a Mother Like YOU".
Now it's time for an introduction. Write about your mom’s accomplishments and achievements that make her special:“My mother has the most contagious laugh I have ever heard, she can turn any ordinary day into a happy one.” or “Mommy, You are always there to listen when we need you”
Then comes the body of your post. This is where you'll share all those adorable childhood memories with her and show how much they mean to both of you:“My mom has taught me so many things that I would never have learned otherwise.” or “I can’t imagine a day without my Mommy”
In conclusion, write something heartfelt about your mother. Express all the love in our hearts for her and say thank you as well because she is always there to support us:“My mom has taught me what uncond

8 Explain Rain Example

question = "What happens when it rains somewhere?"
llm_chain.run(question)
 Question: What happens when it rains somewhere?

Answer: Let's think step by step. When rain falls, first of all the water vaporizes from clouds and travel to lower altitude where air is denser. Then these drops hit surfaces like land or trees etc., which are considered as a target for this falling particles known as rainfall. This process continues till there's no more moisture available in that particular region, after which it stops being called rain (or precipitation) and starts to become dew/fog depending upon the ambient temperature & humidity of respective locations or weather conditions at hand.
" Question: What happens when it rains somewhere?\n\nAnswer: Let's think step by step. When rain falls, first of all the water vaporizes from clouds and travel to lower altitude where air is denser. Then these drops hit surfaces like land or trees etc., which are considered as a target for this falling particles known as rainfall. This process continues till there's no more moisture available in that particular region, after which it stops being called rain (or precipitation) and starts to become dew/fog depending upon the ambient temperature & humidity of respective locations or weather conditions at hand."

9 Rain Example, but in two sentences and funny

It is advisable to experiment with different prompt templates to find the one that best suits your needs. The next example poses the same question, but the model is limited to two sentences and is expected to be humorous.

template = """Question: {question}

Answer: Let's answer in two sentence while being funny."""

prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(prompt=prompt, llm=llm)
llm_chain.run(question)
 Question: What happens when it rains somewhere?

Answer: Let's answer in two sentence while being funny. 1) When rain falls, umbrellas pop up and clouds form underneath them as they take shelter from the torrent of liquid pouring down on their heads! And...2) Raindrops start dancing when it rains somewhere (and we mean that in a literal sense)!
" Question: What happens when it rains somewhere?\n\nAnswer: Let's answer in two sentence while being funny. 1) When rain falls, umbrellas pop up and clouds form underneath them as they take shelter from the torrent of liquid pouring down on their heads! And...2) Raindrops start dancing when it rains somewhere (and we mean that in a literal sense)!"

10 Conclusion

We learnt about open-source large language models, how to run one on a personal computer with an Intel® CPU, and how to use a question prompt template. We also discussed the quantization technique that makes this possible.

11 Acknowledgements

I'd like to express my thanks to the wonderful LangChain & Vector Databases in Production Course by Activeloop, which I completed, and acknowledge the use of some images and other materials from the course in this article.
