Instructions for using ChenMnZ/Llama-2-13b-BlockAP-w3g128 with libraries and local apps.

Libraries

How to use ChenMnZ/Llama-2-13b-BlockAP-w3g128 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ChenMnZ/Llama-2-13b-BlockAP-w3g128")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ChenMnZ/Llama-2-13b-BlockAP-w3g128")
model = AutoModelForCausalLM.from_pretrained("ChenMnZ/Llama-2-13b-BlockAP-w3g128")

Local Apps

vLLM

How to use ChenMnZ/Llama-2-13b-BlockAP-w3g128 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ChenMnZ/Llama-2-13b-BlockAP-w3g128"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ChenMnZ/Llama-2-13b-BlockAP-w3g128",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
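
The same request can be sent from Python. The sketch below mirrors the curl call using only the standard library; the helper names `build_completion_request` and `complete` are hypothetical, and the code assumes a server is already listening at the given base URL and returns the usual OpenAI-style completions body.

```python
import json
import urllib.request

MODEL_ID = "ChenMnZ/Llama-2-13b-BlockAP-w3g128"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=512, temperature=0.5):
    """Build the POST request for the /v1/completions endpoint (hypothetical helper)."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    # Assumes the OpenAI-style response layout: choices[0].text holds the completion.
    return body["choices"][0]["text"]
```

With a server running, `complete("Once upon a time,")` would return the generated continuation. Passing `base_url="http://localhost:30000"` targets a server on port 30000 instead.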

Use Docker

docker model run hf.co/ChenMnZ/Llama-2-13b-BlockAP-w3g128

SGLang

How to use ChenMnZ/Llama-2-13b-BlockAP-w3g128 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ChenMnZ/Llama-2-13b-BlockAP-w3g128" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ChenMnZ/Llama-2-13b-BlockAP-w3g128",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ChenMnZ/Llama-2-13b-BlockAP-w3g128" \
        --host 0.0.0.0 \
        --port 30000



Block-AP (EfficientQAT w/o E2E-QP)

EfficientQAT involves two consecutive training phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP).

In this repo, we provide the quantized checkpoints produced by Block-AP. Anyone can use them to reproduce our results or to build on them in follow-up research.
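
As a rough illustration of what a name like w3g128 describes, the sketch below applies plain uniform asymmetric quantization to a weight row, one group at a time. It assumes that w3g128 means 3-bit weights with a group size of 128, and it shows only the generic scale and zero-point rounding step, not the Block-AP training procedure itself. The function names are hypothetical.

```python
def quantize_group(weights, bits=3):
    """Uniform asymmetric quantization of one weight group (illustrative only)."""
    qmax = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    # Step size; fall back to 1.0 for a constant group to avoid dividing by zero.
    scale = (w_max - w_min) / qmax or 1.0
    zero_point = round(-w_min / scale)
    # Integer codes are clamped to the representable range [0, qmax].
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    # Dequantized values are what the model effectively computes with.
    dequantized = [(c - zero_point) * scale for c in codes]
    return codes, scale, zero_point, dequantized


def quantize_row(row, bits=3, group_size=128):
    """Split a weight row into consecutive groups and quantize each separately."""
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]
```

Each group carries its own scale and zero point, so a smaller group size (for example 64 instead of 128) tracks the weights more closely at the cost of storing more quantization parameters.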

Performance

| Model | Quantization | WikiText2 PPL | Avg. Accuracy | Model Size (GB) | Hub link |
|---|---|---|---|---|---|
| Llama-2-7B | fp16 | 5.47 | 64.86 | 13.2 | - |
| Llama-2-7B | w4g128 | 5.56 | 64.07 | 3.7 | Link |
| Llama-2-7B | w3g128 | 5.89 | 63.96 | 3.1 | Link |
| Llama-2-7B | w2g64 | 7.65 | 59.54 | 2.3 | Link |
| Llama-2-7B | w2g128 | 7.94 | 58.72 | 2.2 | Link |
| Llama-2-13B | fp16 | 4.88 | 67.81 | 25.4 | - |
| Llama-2-13B | w4g128 | 4.96 | 67.27 | 6.8 | Link |
| Llama-2-13B | w3g128 | 5.20 | 67.30 | 5.6 | Link |
| Llama-2-13B | w2g64 | 6.55 | 63.10 | 4.0 | Link |
| Llama-2-13B | w2g128 | 6.68 | 63.49 | 3.8 | Link |
| Llama-2-70B | fp16 | 3.32 | 72.41 | 131.6 | - |
| Llama-2-70B | w4g128 | 3.41 | 72.54 | 35.8 | Link |
| Llama-2-70B | w3g128 | 3.65 | 71.88 | 29.1 | Link |
| Llama-2-70B | w2g64 | 4.96 | 69.44 | 20.1 | Link |
| Llama-2-70B | w2g128 | 5.26 | 68.73 | 18.9 | Link |
| Llama-3-8B | fp16 | 6.14 | 68.58 | 13.0 | - |
| Llama-3-8B | w4g128 | 6.50 | 68.43 | 5.4 | Link |
| Llama-3-8B | w3g128 | 7.34 | 66.72 | 4.7 | Link |
| Llama-3-8B | w2g64 | 12.47 | 58.65 | 3.9 | Link |
| Llama-3-8B | w2g128 | 13.25 | 58.23 | 3.8 | Link |
| Llama-3-70B | fp16 | 2.85 | 75.33 | 137.8 | - |
| Llama-3-70B | w4g128 | 3.18 | 74.50 | 38.9 | Link |
| Llama-3-70B | w3g128 | 4.88 | 71.90 | 32.2 | Link |
| Llama-3-70B | w2g64 | 13.75 | 66.70 | 23.2 | Link |
| Llama-3-70B | w2g128 | 16.79 | 65.06 | 22.0 | Link |
| Llama-3-8B-Instruct | fp16 | 8.29 | 68.43 | 13.0 | - |
| Llama-3-8B-Instruct | w4g128 | 8.76 | 67.80 | 5.4 | Link |
| Llama-3-8B-Instruct | w3g128 | 9.83 | 66.54 | 4.7 | Link |
| Llama-3-8B-Instruct | w2g64 | 16.77 | 58.62 | 3.9 | Link |
| Llama-3-8B-Instruct | w2g128 | 18.02 | 57.19 | 3.8 | Link |
| Llama-3-70B-Instruct | fp16 | 5.33 | 73.78 | 137.8 | - |
| Llama-3-70B-Instruct | w4g128 | 5.77 | 73.52 | 38.9 | Link |
| Llama-3-70B-Instruct | w3g128 | 7.25 | 69.80 | 32.2 | Link |
| Llama-3-70B-Instruct | w2g64 | 12.48 | 65.60 | 23.2 | Link |
| Llama-3-70B-Instruct | w2g128 | 13.48 | 61.75 | 22.0 | Link |
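
To read the table for this checkpoint, it can help to compare each quantized row against the fp16 baseline. The short sketch below does that arithmetic for the Llama-2-13B rows; the numbers are copied from the table, and the helper name `compare_to_fp16` is hypothetical.

```python
# (WikiText2 PPL, avg. accuracy, model size in GB) copied from the Llama-2-13B rows.
LLAMA_2_13B = {
    "fp16": (4.88, 67.81, 25.4),
    "w4g128": (4.96, 67.27, 6.8),
    "w3g128": (5.20, 67.30, 5.6),
    "w2g64": (6.55, 63.10, 4.0),
    "w2g128": (6.68, 63.49, 3.8),
}


def compare_to_fp16(rows, setting):
    """Return size reduction and quality change of one setting relative to fp16."""
    base_ppl, base_acc, base_size = rows["fp16"]
    ppl, acc, size = rows[setting]
    return {
        "compression": round(base_size / size, 2),   # how many times smaller
        "ppl_increase": round(ppl - base_ppl, 2),    # higher perplexity is worse
        "accuracy_drop": round(base_acc - acc, 2),   # in accuracy points
    }


print(compare_to_fp16(LLAMA_2_13B, "w3g128"))
```

For w3g128 this gives roughly a 4.54x smaller model, with a perplexity increase of 0.32 and an average accuracy drop of 0.51 points.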

Usage

Please refer to https://github.com/OpenGVLab/EfficientQAT for details. These checkpoints can be used as the starting point for the subsequent E2E-QP phase, or for inference directly.


Paper

EfficientQAT: Efficient Quantization-Aware Training for Large Language Models (paper 2407.11062, published Jul 10, 2024)