Instructions to use rozek/LLaMA-2-7B-32K-Instruct_GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use rozek/LLaMA-2-7B-32K-Instruct_GGUF with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="rozek/LLaMA-2-7B-32K-Instruct_GGUF",
    filename="LLaMA-2-7B-32K-Instruct-Q2_K.gguf",
)
output = llm(
    "Once upon a time,",
    max_tokens=512,
    echo=True
)
print(output)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use rozek/LLaMA-2-7B-32K-Instruct_GGUF with llama.cpp:
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf rozek/LLaMA-2-7B-32K-Instruct_GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf rozek/LLaMA-2-7B-32K-Instruct_GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf rozek/LLaMA-2-7B-32K-Instruct_GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf rozek/LLaMA-2-7B-32K-Instruct_GGUF:Q4_K_M
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf rozek/LLaMA-2-7B-32K-Instruct_GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf rozek/LLaMA-2-7B-32K-Instruct_GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf rozek/LLaMA-2-7B-32K-Instruct_GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf rozek/LLaMA-2-7B-32K-Instruct_GGUF:Q4_K_M
Use Docker
docker model run hf.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF:Q4_K_M
- LM Studio
- Jan
- vLLM
How to use rozek/LLaMA-2-7B-32K-Instruct_GGUF with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "rozek/LLaMA-2-7B-32K-Instruct_GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "rozek/LLaMA-2-7B-32K-Instruct_GGUF",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
Use Docker
docker model run hf.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF:Q4_K_M
- Ollama
How to use rozek/LLaMA-2-7B-32K-Instruct_GGUF with Ollama:
ollama run hf.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF:Q4_K_M
- Unsloth Studio
How to use rozek/LLaMA-2-7B-32K-Instruct_GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for rozek/LLaMA-2-7B-32K-Instruct_GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for rozek/LLaMA-2-7B-32K-Instruct_GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for rozek/LLaMA-2-7B-32K-Instruct_GGUF to start chatting
- Docker Model Runner
How to use rozek/LLaMA-2-7B-32K-Instruct_GGUF with Docker Model Runner:
docker model run hf.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF:Q4_K_M
- Lemonade
How to use rozek/LLaMA-2-7B-32K-Instruct_GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull rozek/LLaMA-2-7B-32K-Instruct_GGUF:Q4_K_M
Run and chat with the model
lemonade run user.LLaMA-2-7B-32K-Instruct_GGUF-Q4_K_M
List all available models
lemonade list
LLaMA-2-7B-32K-Instruct_GGUF
Together Computer, Inc. has released Llama-2-7B-32K-Instruct, a model based on Meta AI's LLaMA-2-7B, but fine-tuned for context lengths up to 32K using "Position Interpolation" and "Rotary Position Embeddings" (RoPE).
While the current version of llama.cpp already supports such large context lengths, it requires quantized files in the new GGUF format - and that's where this repo comes in: it contains the following quantizations of the original weights from Together's fine-tuned model:
- Q2_K
- Q3_K_S, Q3_K_M (aka Q3_K) and Q3_K_L
- Q4_0, Q4_1, Q4_K_S and Q4_K_M (aka Q4_K)
- Q5_0, Q5_1, Q5_K_S and Q5_K_M (aka Q5_K)
- Q6_K,
- Q8_0 and
- F16 (unquantized)
Nota bene: while RoPE makes inference with large contexts possible, you still need an awful lot of RAM when doing so. And since "32K" does not mean that you always have to use a context size of 32768 (only that the model was fine-tuned for that size), it is recommended that you keep your context as small as possible.
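To get a feeling for why long contexts are so memory-hungry, here is a rough back-of-the-envelope sketch. It only estimates the key/value cache and assumes the LLaMA-2-7B architecture (32 layers, hidden size 4096) with 16-bit cache entries. The function name and defaults are made up for this illustration; model weights and compute buffers come on top, so treat the numbers as a lower bound rather than a measurement.

```python
# Rough KV-cache size estimate for a LLaMA-2-7B-style model.
# Assumptions (not measured): 32 layers, hidden size 4096, 2 bytes per cached value.

def kv_cache_bytes(n_ctx: int, n_layers: int = 32, n_embd: int = 4096,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for n_ctx tokens."""
    # one key vector and one value vector per layer and per token
    return 2 * n_layers * n_embd * bytes_per_value * n_ctx

for n_ctx in (2048, 4096, 8192, 16384, 32768):
    gib = kv_cache_bytes(n_ctx) / 2**30
    print(f"n_ctx={n_ctx:>5}: about {gib:.1f} GiB for the KV cache")
```

Under these assumptions every token costs half a MiB of cache, so halving the context size halves this part of the memory footprint.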
If you need quantizations for Together Computer's Llama-2-7B-32K model, then look for LLaMA-2-7B-32K_GGUF
How Quantization was done
Since the author does not want arbitrary Python stuff to loiter on his computer, the quantization was done using Docker.
Assuming that you have Docker Desktop installed on your system and also have a basic knowledge of how to use it, you may just follow the instructions shown below in order to generate your own quantizations:
Nota bene: you will need 30+x GB of free disk space, at least - depending on your quantization
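The "30+x" figure can be reproduced with a little arithmetic. The sketch below assumes roughly 6.7 billion parameters, a single 16-bit copy of the downloaded weights, and the block layouts of Q4_0 and Q8_0 (32 weights plus one 16-bit scale per block, i.e. 4.5 and 8.5 bits per weight). All names are invented for this example, and the K-quants are left out because they mix several block types.

```python
# Back-of-the-envelope disk space estimate for the quantization workflow.
# All figures are approximations based on the assumptions stated above.

N_PARAMS = 6.7e9  # approximate parameter count of a 7B LLaMA-2 model (assumption)

# bits per weight: Q4_0 and Q8_0 store 32 weights plus one 16-bit scale per block
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q4_0": 4.5}

def size_gb(bits_per_weight: float, n_params: float = N_PARAMS) -> float:
    """Approximate file size in GB (10^9 bytes), ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9

def disk_needed_gb(quant: str) -> float:
    """Downloaded 16-bit weights + converted F16 GGUF + one quantized file."""
    downloaded = size_gb(16.0)
    converted = size_gb(BITS_PER_WEIGHT["F16"])
    return downloaded + converted + size_gb(BITS_PER_WEIGHT[quant])

for quant in ("Q4_0", "Q8_0"):
    print(f"{quant}: file ~{size_gb(BITS_PER_WEIGHT[quant]):.1f} GB, "
          f"total ~{disk_needed_gb(quant):.1f} GB")
```

The two 16-bit copies alone account for about 27 GB; the "x" is whatever your chosen quantizations add on top.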
- create a new folder called llama.cpp_in_Docker - this folder will later be mounted into the Docker container and store the quantization results
- download the weights for the fine-tuned LLaMA-2 model from Hugging Face into a subfolder of llama.cpp_in_Docker (let's call the new folder LLaMA-2-7B-32K-Instruct)
- within the Docker Desktop, search for and download a basic-python image - just use one of the most popular ones
- from a terminal session on your host computer (i.e., not a Docker container!), start a new container for the downloaded image which mounts the folder we created before:
docker run --rm \
-v ./llama.cpp_in_Docker:/llama.cpp \
-t basic-python /bin/bash
(you may have to adjust the path to your local folder)
- back in the Docker Desktop, open the "Terminal" tab of the started container and enter the following commands (one after the other - copying the complete list and pasting it into the terminal as a whole does not always seem to work properly):
apt update
apt-get install software-properties-common -y
apt-get update
apt-get install g++ git make -y
cd /llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
- now open the "Files" tab and navigate to the file /llama.cpp/llama.cpp/Makefile, right-click on it and choose "Edit file"
- search for aarch64, and - in the line found (which looks like ifneq ($(filter aarch64%,$(UNAME_M)),)) - change ifneq to ifeq
- save your change using the disk icon in the upper right corner of the editor pane and open the "Terminal" tab again
- now enter the following commands:
make
python3 -m pip install -r requirements.txt
python3 convert.py ../LLaMA-2-7B-32K-Instruct
- you are now ready to run the actual quantization, e.g., using
./quantize ../LLaMA-2-7B-32K-Instruct/ggml-model-f16.gguf \
../LLaMA-2-7B-32K-Instruct/LLaMA-2-7B-32K-Instruct-Q4_0.gguf Q4_0
- run any quantizations you need and stop the container when finished (the container will automatically be deleted but the generated files will remain available on your host computer)
- the basic-python image may also be deleted (manually) unless you plan to use it again in the near future
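If you need several quantizations, typing each quantize call by hand gets tedious. The following sketch only builds the command lines for the quantize step shown in the list above. It assumes the folder layout and file naming used there, and the helper name quantize_command is invented for this example. Nothing is executed; paste the printed lines into the container's terminal.

```python
import shlex
from pathlib import PurePosixPath

# Quantization types offered in this repo (F16 is the unquantized source).
QUANT_TYPES = ["Q2_K", "Q3_K_S", "Q3_K_M", "Q3_K_L", "Q4_0", "Q4_1", "Q4_K_S",
               "Q4_K_M", "Q5_0", "Q5_1", "Q5_K_S", "Q5_K_M", "Q6_K", "Q8_0"]

def quantize_command(model_dir: str, quant_type: str,
                     name: str = "LLaMA-2-7B-32K-Instruct") -> list[str]:
    """Build the argument list for one ./quantize call (hypothetical helper)."""
    folder = PurePosixPath(model_dir)
    source = folder / "ggml-model-f16.gguf"         # output of convert.py
    target = folder / f"{name}-{quant_type}.gguf"   # same naming as in the steps above
    return ["./quantize", str(source), str(target), quant_type]

# Print one command line per desired quantization.
for quant_type in ("Q4_0", "Q4_K_M", "Q8_0"):
    print(shlex.join(quantize_command("../LLaMA-2-7B-32K-Instruct", quant_type)))
```

Pick the types you need from QUANT_TYPES; every additional file adds to the disk space required.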
You are now free to move the quantization results to where you need them and run inferences with context lengths up to 32K (depending on the amount of memory you have available - long contexts need a lot of RAM).
License
Concerning the license(s):
- the original model (from Meta AI) was released under a rather permissive license
- the fine-tuned model from Together Computer uses the same license
- as a consequence, this repo does so as well