Block-AP (EfficientQAT w/o E2E-QP)
EfficientQAT involves two consecutive training phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP).
In this repo, we provide the quantized checkpoints produced by Block-AP. You can use them to reproduce our results or to carry out follow-up research.
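The checkpoint names follow a `w<bits>g<group size>` pattern, so `w3g128` denotes 3-bit weights that share one set of quantization parameters per group of 128 values. The snippet below is a generic sketch of group-wise uniform quantization in plain Python, not the EfficientQAT training code. The function names and the min/max calibration are assumptions made for illustration; EfficientQAT obtains its parameters by training instead.

```python
import random


def quantize_group(values, bits=3):
    """Quantize one group with a shared scale and integer zero point."""
    qmax = 2 ** bits - 1                      # 7 for 3-bit codes
    lo, hi = min(values), max(values)
    scale = max(hi - lo, 1e-8) / qmax         # guard against a constant group
    zero_point = round(-lo / scale)
    codes = [min(max(round(v / scale) + zero_point, 0), qmax) for v in values]
    return codes, scale, zero_point


def quantize_row(row, bits=3, group_size=128):
    """Split a weight row into groups and quantize each one independently."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]


def dequantize_row(groups):
    """Map integer codes back to approximate floating-point weights."""
    return [(c - zp) * scale for codes, scale, zp in groups for c in codes]


# Toy example: one row of 256 random weights becomes two groups of 128.
random.seed(0)
row = [random.gauss(0.0, 1.0) for _ in range(256)]
groups = quantize_row(row, bits=3, group_size=128)
restored = dequantize_row(groups)
max_error = max(abs(a - b) for a, b in zip(row, restored))
max_scale = max(scale for _, scale, _ in groups)
```

A smaller group size (for example `g64` instead of `g128`) stores more scales and zero points, which costs some extra space but lets each group track its own value range more closely.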
Performance
| Model | Quantization | WikiText2 PPL | Avg. Accuracy | Model Size (GB) | Hub link |
|---|---|---|---|---|---|
| Llama-2-7B | fp16 | 5.47 | 64.86 | 13.2 | - |
| Llama-2-7B | w4g128 | 5.56 | 64.07 | 3.7 | Link |
| Llama-2-7B | w3g128 | 5.89 | 63.96 | 3.1 | Link |
| Llama-2-7B | w2g64 | 7.65 | 59.54 | 2.3 | Link |
| Llama-2-7B | w2g128 | 7.94 | 58.72 | 2.2 | Link |
| Llama-2-13B | fp16 | 4.88 | 67.81 | 25.4 | - |
| Llama-2-13B | w4g128 | 4.96 | 67.27 | 6.8 | Link |
| Llama-2-13B | w3g128 | 5.20 | 67.30 | 5.6 | Link |
| Llama-2-13B | w2g64 | 6.55 | 63.10 | 4.0 | Link |
| Llama-2-13B | w2g128 | 6.68 | 63.49 | 3.8 | Link |
| Llama-2-70B | fp16 | 3.32 | 72.41 | 131.6 | - |
| Llama-2-70B | w4g128 | 3.41 | 72.54 | 35.8 | Link |
| Llama-2-70B | w3g128 | 3.65 | 71.88 | 29.1 | Link |
| Llama-2-70B | w2g64 | 4.96 | 69.44 | 20.1 | Link |
| Llama-2-70B | w2g128 | 5.26 | 68.73 | 18.9 | Link |
| Llama-3-8B | fp16 | 6.14 | 68.58 | 13.0 | - |
| Llama-3-8B | w4g128 | 6.50 | 68.43 | 5.4 | Link |
| Llama-3-8B | w3g128 | 7.34 | 66.72 | 4.7 | Link |
| Llama-3-8B | w2g64 | 12.47 | 58.65 | 3.9 | Link |
| Llama-3-8B | w2g128 | 13.25 | 58.23 | 3.8 | Link |
| Llama-3-70B | fp16 | 2.85 | 75.33 | 137.8 | - |
| Llama-3-70B | w4g128 | 3.18 | 74.50 | 38.9 | Link |
| Llama-3-70B | w3g128 | 4.88 | 71.90 | 32.2 | Link |
| Llama-3-70B | w2g64 | 13.75 | 66.70 | 23.2 | Link |
| Llama-3-70B | w2g128 | 16.79 | 65.06 | 22.0 | Link |
| Llama-3-8B-Instruct | fp16 | 8.29 | 68.43 | 13.0 | - |
| Llama-3-8B-Instruct | w4g128 | 8.76 | 67.80 | 5.4 | Link |
| Llama-3-8B-Instruct | w3g128 | 9.83 | 66.54 | 4.7 | Link |
| Llama-3-8B-Instruct | w2g64 | 16.77 | 58.62 | 3.9 | Link |
| Llama-3-8B-Instruct | w2g128 | 18.02 | 57.19 | 3.8 | Link |
| Llama-3-70B-Instruct | fp16 | 5.33 | 73.78 | 137.8 | - |
| Llama-3-70B-Instruct | w4g128 | 5.77 | 73.52 | 38.9 | Link |
| Llama-3-70B-Instruct | w3g128 | 7.25 | 69.80 | 32.2 | Link |
| Llama-3-70B-Instruct | w2g64 | 12.48 | 65.60 | 23.2 | Link |
| Llama-3-70B-Instruct | w2g128 | 13.48 | 61.75 | 22.0 | Link |
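To read the trade-off off the table, it can help to compare each quantized row with its fp16 baseline. The sketch below does this for the Llama-2-13B rows, using only numbers copied from the table above; the variable and key names are illustrative.

```python
# (WikiText2 PPL, Avg. Accuracy, Model Size in GB), copied from the table.
baseline = (4.88, 67.81, 25.4)  # Llama-2-13B fp16
quantized = {
    "w4g128": (4.96, 67.27, 6.8),
    "w3g128": (5.20, 67.30, 5.6),
    "w2g64": (6.55, 63.10, 4.0),
    "w2g128": (6.68, 63.49, 3.8),
}

summary = {}
for name, (ppl, acc, size_gb) in quantized.items():
    summary[name] = {
        "compression": round(baseline[2] / size_gb, 2),   # fp16 size / quantized size
        "ppl_increase": round(ppl - baseline[0], 2),      # lower is better
        "accuracy_drop": round(baseline[1] - acc, 2),     # lower is better
    }

for name, stats in summary.items():
    print(name, stats)
```

For the `w3g128` checkpoint this works out to roughly a 4.5x smaller model with a perplexity increase of 0.32 and an average accuracy drop of 0.51 points.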
Usage
Please refer to https://github.com/OpenGVLab/EfficientQAT for details. These checkpoints can serve as the starting point for the subsequent E2E-QP phase, or they can be used for inference directly.
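As a starting point, the checkpoint files can be fetched from the Hub before running the EfficientQAT code on them. The sketch below assumes the `huggingface_hub` package is installed (`pip install huggingface_hub`); the helper name and the local directory are arbitrary choices, and loading the weights afterwards is covered by the repository linked above, not by this snippet.

```python
REPO_ID = "ChenMnZ/Llama-2-13b-BlockAP-w3g128"


def download_checkpoint(repo_id: str = REPO_ID,
                        local_dir: str = "./Llama-2-13b-BlockAP-w3g128") -> str:
    """Download every file of the checkpoint repo and return the local path."""
    # Imported lazily so that defining this helper does not require the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


# Example (requires network access and about 5.6 GB of free disk space):
# path = download_checkpoint()
# print("Checkpoint stored in", path)
```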