How to use from
llama.cpp
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf ISTA-DASLab/Llama-3.1-8B-Instruct-GGUF:Q2_K
# Run inference directly in the terminal:
llama-cli -hf ISTA-DASLab/Llama-3.1-8B-Instruct-GGUF:Q2_K
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf ISTA-DASLab/Llama-3.1-8B-Instruct-GGUF:Q2_K
# Run inference directly in the terminal:
llama-cli -hf ISTA-DASLab/Llama-3.1-8B-Instruct-GGUF:Q2_K
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf ISTA-DASLab/Llama-3.1-8B-Instruct-GGUF:Q2_K
# Run inference directly in the terminal:
./llama-cli -hf ISTA-DASLab/Llama-3.1-8B-Instruct-GGUF:Q2_K
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf ISTA-DASLab/Llama-3.1-8B-Instruct-GGUF:Q2_K
# Run inference directly in the terminal:
./build/bin/llama-cli -hf ISTA-DASLab/Llama-3.1-8B-Instruct-GGUF:Q2_K
Use Docker
docker model run hf.co/ISTA-DASLab/Llama-3.1-8B-Instruct-GGUF:Q2_K
Quick Links

Llama-3.1-8B-Instruct GGUF DASLab Quantization

This repository contains advanced quantized versions of Llama 3.1 8B Instruct using GPTQ quantization and GPTQ+EvoPress optimization from the DASLab GGUF Toolkit.

Models

  • GPTQ Uniform: High-quality GPTQ quantization at 2-6 bit precision
  • GPTQ+EvoPress: Non-uniform per-layer quantization discovered via evolutionary search

Performance

Our GPTQ-based quantization methods achieve superior quality-compression tradeoffs compared to standard quantization:

  • Better perplexity at equivalent bitwidths vs. naive quantization approaches
  • Error-correcting updates during calibration for improved accuracy
  • Optimized configurations that allocate bits based on layer sensitivity (EvoPress)

Usage

Compatible with llama.cpp and all GGUF-supporting inference engines. No special setup required.

Full documentation, evaluation results, and toolkit source: https://github.com/IST-DASLab/gguf-toolkit


Downloads last month
90
GGUF
Model size
8B params
Architecture
llama
Hardware compatibility
Log In to add your hardware

2-bit

6-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ISTA-DASLab/Llama-3.1-8B-Instruct-GGUF

Quantized
(636)
this model

Collection including ISTA-DASLab/Llama-3.1-8B-Instruct-GGUF