z-lab/Llama-3.1-8B-Instruct-PARO

Pairwise Rotation Quantization for Efficient Reasoning LLM Inference

Paper Blog Models PyPI

ParoQuant is the state-of-the-art INT4 quantization for LLMs. It closes the accuracy gap with FP16 while running at near-AWQ speed. Supports NVIDIA GPUs (vLLM, Transformers) and Apple Silicon (MLX). For more information, see https://github.com/z-lab/paroquant.

z-lab/Llama-3.1-8B-Instruct-PARO is a 4-bit meta-llama/Llama-3.1-8B-Instruct quantized with ParoQuant. Check out other ParoQuant models from the Hugging Face collection.

Quick Start

Installation

# NVIDIA GPU (CUDA 12.9)
pip install "paroquant[vllm]"

# NVIDIA GPU (CUDA 13.0)
pip install "paroquant[vllm]" "vllm==0.19.1" \
  --extra-index-url https://wheels.vllm.ai/0.19.1/cu130 \
  --extra-index-url https://download.pytorch.org/whl/cu130

# Apple Silicon
pip install "paroquant[mlx]"

Interactive Chat

python -m paroquant.cli.chat --model z-lab/Llama-3.1-8B-Instruct-PARO

OpenAI-Compatible API Server

For vLLM, you can directly use vllm serve to serve ParoQuant models:

vllm serve z-lab/Llama-3.1-8B-Instruct-PARO --port 8000

For other frameworks:

python -m paroquant.cli.serve --model z-lab/Llama-3.1-8B-Instruct-PARO --port 8000

Docker (NVIDIA GPU)

The following commands map the local cache directory to the container in order to persist kernel cache across runs. Remove -v ... to disable this behavior.

# Interactive chat
docker run --pull=always --rm -it --gpus all --ipc=host \
  -v $HOME/.cache/paroquant:/root/.cache/paroquant \
  ghcr.io/z-lab/paroquant:chat --model z-lab/Llama-3.1-8B-Instruct-PARO

# API server (port 8000)
docker run --pull=always --rm -it --gpus all --ipc=host -p 8000:8000 \
  -v $HOME/.cache/paroquant:/root/.cache/paroquant \
  ghcr.io/z-lab/paroquant:serve --model z-lab/Llama-3.1-8B-Instruct-PARO

Citation

@inproceedings{liang2026paroquant,
  title     = {{ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference}},
  author    = {Liang, Yesheng and Chen, Haisheng and Zhang, Zihan and Han, Song and Liu, Zhijian},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}
Downloads last month
1,164
Safetensors
Model size
1B params
Tensor type
F16
·
I32
·
I16
·
MLX
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for z-lab/Llama-3.1-8B-Instruct-PARO

Quantized
(635)
this model

Collection including z-lab/Llama-3.1-8B-Instruct-PARO

Paper for z-lab/Llama-3.1-8B-Instruct-PARO