Llama CPP
Only available on Node.js.
This module is based on the node-llama-cpp Node.js bindings for llama.cpp, allowing you to work with a locally running LLM. You can use a much smaller quantized model capable of running on a laptop, which is ideal for testing and sketching out ideas without running up a bill!
Setup
You'll need to install the node-llama-cpp module to communicate with your local model.
- npm: npm install -S node-llama-cpp
- Yarn: yarn add node-llama-cpp
- pnpm: pnpm add node-llama-cpp
You will also need a local Llama 2 model (or another model supported by node-llama-cpp). You pass the path to this model to the LlamaCpp module as one of its parameters (see the Usage example).
Out of the box, node-llama-cpp is tuned for running on macOS, with support for the Metal GPU of Apple M-series processors. If you need to turn this off or need support for the CUDA architecture, refer to the node-llama-cpp documentation.
A note to LangChain.js contributors: if you want to run the tests associated with this module, you will need to put the path to your local model in the environment variable LLAMA_PATH.
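As a minimal sketch of how the model path could be picked up from the LLAMA_PATH environment variable and checked before use, consider the helper below. The function name resolveModelPath is hypothetical and is not part of LangChain.js or node-llama-cpp; it only uses the Node.js standard library.

```typescript
import { existsSync } from "node:fs";

// Hypothetical helper (not part of LangChain.js or node-llama-cpp):
// returns the model path from LLAMA_PATH, or throws a descriptive error.
function resolveModelPath(env: Record<string, string | undefined>): string {
  const modelPath = env.LLAMA_PATH;
  if (!modelPath) {
    throw new Error("LLAMA_PATH is not set; point it at your local model file.");
  }
  if (!existsSync(modelPath)) {
    throw new Error(`No model file found at ${modelPath}`);
  }
  return modelPath;
}
```

The result of resolveModelPath(process.env) could then be passed as the modelPath parameter when constructing LlamaCpp, so a missing or mistyped path fails early with a clear message.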
Guide to installing Llama2
Getting a local Llama 2 model running on your machine is a prerequisite, so this is a quick guide to getting and building Llama 2 7B (the smallest) and then quantizing it so that it will run comfortably on a laptop. To do this you will need python3 on your machine (3.11 is recommended), as well as gcc and make so that llama.cpp can be built.
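Before starting, it can help to confirm that the required tools are on your PATH. The following is a small sketch using only the Node.js standard library; the helper name isAvailable is hypothetical, and it assumes each tool exits with status 0 when run with --version.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical helper: true if `<command> --version` can be spawned
// and exits with status 0.
function isAvailable(command: string): boolean {
  const result = spawnSync(command, ["--version"], { stdio: "ignore" });
  return result.error === undefined && result.status === 0;
}

// Tools this guide relies on; adjust the list for your own setup.
const requiredTools = ["python3", "gcc", "make"];
for (const tool of requiredTools) {
  console.log(`${tool}: ${isAvailable(tool) ? "found" : "missing"}`);
}
```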
Getting the Llama2 models
To get a copy of Llama 2 you need to visit Meta AI and request access to their models. Once Meta AI grants you access, you will receive an email containing a unique URL to access the files; this will be needed in the next steps. Now create a directory to work in, for example:
mkdir llama2
cd llama2
Now we need to get the Meta AI llama repo in place so we can download the model.
git clone https://github.com/facebookresearch/llama.git
Once we have this in place, we can change into this directory and run the downloader script to get the model we will be working with. Note: from here on it's assumed that the model in use is llama-2-7b; if you select a different model, don't forget to change the references to the model accordingly.
cd llama
/bin/bash ./download.sh
This script will ask you for the URL that Meta AI sent to you (see above). You will also select the model to download; in this case we used llama-2-7b. Once this step has completed successfully (this can take some time, as the llama-2-7b model is around 13.5 GB), there should be a new llama-2-7b directory containing the model and other files.
Converting and quantizing the model
In this step we need to use llama.cpp, so we need to download that repo.
cd ..
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
Now we need to build the llama.cpp tools and set up our Python environment. In these steps it's assumed that your install of Python can be run using python3 and that the virtual environment can be called llama2; adjust accordingly for your own situation.
make
python3 -m venv llama2
source llama2/bin/activate
After activating your llama2 environment you should see (llama2) prefixing your command prompt to let you know this is the active environment. Note: if you need to come back to build another model or re-quantize the model, don't forget to activate the environment again. Also, if you update llama.cpp, you will need to rebuild the tools and possibly install new or updated dependencies! Now that we have an active Python environment, we need to install the Python dependencies.
python3 -m pip install -r requirements.txt
Having done this, we can start converting and quantizing the Llama 2 model ready for use locally via llama.cpp.
First, we need to convert the model. Prior to the conversion, let's create a directory to store it in.
mkdir models/7B
python3 convert.py --outfile models/7B/gguf-llama2-f16.bin --outtype f16 ../../llama2/llama/llama-2-7b --vocab-dir ../../llama2/llama/llama-2-7b
This should create a converted model called gguf-llama2-f16.bin in the directory we just created. Note that this is just a converted model, so it is also around 13.5 GB in size. In the next step we will quantize it down to around 4 GB.
./quantize ./models/7B/gguf-llama2-f16.bin ./models/7B/gguf-llama2-q4_0.bin q4_0
Running this should result in a new model being created in the models/7B directory, this one called gguf-llama2-q4_0.bin. This is the model we can use with LangChain. You can validate that this model is working by testing it using the llama.cpp tools.
./main -m ./models/7B/gguf-llama2-q4_0.bin -n 1024 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt
Running this command fires up the model for a chat session. If you are running out of disk space, note that this small model is the only one we need, so you can back up and/or delete the original and converted 13.5 GB models.
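The sizes quoted in this section (around 13.5 GB before quantization and around 4 GB after) can be roughly reproduced with some simple arithmetic. The sketch below rests on two assumptions that are not stated in this guide: that llama-2-7b has roughly 6.74 billion parameters, and that the q4_0 format packs each block of 32 weights into 18 bytes, which works out to 4.5 bits per weight. Real files differ slightly because of metadata and because not every tensor is necessarily stored in the same format.

```typescript
// Estimated file size in GB (10^9 bytes) for a given parameter count
// and average number of bits stored per weight.
function estimateSizeGB(parameterCount: number, bitsPerWeight: number): number {
  return (parameterCount * bitsPerWeight) / 8 / 1e9;
}

// Assumption: llama-2-7b has roughly 6.74 billion parameters.
const parameterCount = 6.74e9;

// f16 stores 16 bits per weight.
const f16SizeGB = estimateSizeGB(parameterCount, 16);

// Assumption: q4_0 packs 32 weights into 18 bytes (16 bytes of 4-bit
// values plus a 2-byte scale), which is 4.5 bits per weight.
const q4SizeGB = estimateSizeGB(parameterCount, (18 * 8) / 32);

console.log(`f16:  ~${f16SizeGB.toFixed(1)} GB`);
console.log(`q4_0: ~${q4SizeGB.toFixed(1)} GB`);
```

Under these assumptions the estimates come out close to the 13.5 GB and 4 GB figures above, which is a useful sanity check if you convert a different model size.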
Usage
import { LlamaCpp } from "langchain/llms/llama_cpp";
const llamaPath = "/Replace/with/path/to/your/model/gguf-llama2-q4_0.bin";
const question = "Where do Llamas come from?";
const model = new LlamaCpp({ modelPath: llamaPath });
console.log(`You: ${question}`);
const response = await model.invoke(question);
console.log(`AI : ${response}`);
API Reference:
- LlamaCpp from langchain/llms/llama_cpp
Streaming
import { LlamaCpp } from "langchain/llms/llama_cpp";
const llamaPath = "/Replace/with/path/to/your/model/gguf-llama2-q4_0.bin";
const model = new LlamaCpp({ modelPath: llamaPath, temperature: 0.7 });
const prompt = "Tell me a short story about a happy Llama.";
const stream = await model.stream(prompt);
for await (const chunk of stream) {
  console.log(chunk);
}
/*
Once
upon
a
time
,
in
the
rolling
hills
of
Peru
...
*/
API Reference:
- LlamaCpp from langchain/llms/llama_cpp
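If you want the complete text rather than individual chunks, one option is to accumulate the stream as it arrives. The sketch below assumes the stream yields string chunks, as in the example above. The names collectStream and fakeStream are hypothetical; fakeStream merely stands in for the result of model.stream(prompt) so the sketch runs without a local model.

```typescript
// Hypothetical helper: joins every chunk of an async stream into one string.
async function collectStream(stream: AsyncIterable<string>): Promise<string> {
  let text = "";
  for await (const chunk of stream) {
    text += chunk;
  }
  return text;
}

// Stand-in for `await model.stream(prompt)` so this runs without a model.
async function* fakeStream(): AsyncGenerator<string> {
  yield "Once";
  yield " upon";
  yield " a";
  yield " time";
}

collectStream(fakeStream()).then((text) => console.log(text));
```

With a real model, you would pass the stream returned by model.stream(prompt) to collectStream in place of fakeStream().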