LLM-Pruner: On the Structural Pruning of Large Language Models

Paper: arXiv:2305.11627
This model was structurally pruned with the Taylor importance criterion from LLM-Pruner, removing roughly 30% of the parameters (per the model name).
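For intuition, here is a minimal sketch of a first-order Taylor importance score of the kind LLM-Pruner describes: the loss change from removing a weight is approximated by |∂L/∂W · W|, aggregated over each structural group. The helper names (`taylor_importance`, `score_groups`) and the calibration setup are illustrative, not LLM-Pruner's actual API.

```python
import torch

def taylor_importance(weight: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    # First-order Taylor estimate of the loss change when a weight is
    # removed: |dL/dW * W|, summed over each output channel (row).
    return (grad * weight).abs().sum(dim=1)

def score_groups(model, calibration_batch):
    # Accumulate gradients on a small calibration batch; the batch must
    # include labels so the forward pass returns a loss.
    loss = model(**calibration_batch).loss
    loss.backward()
    scores = {}
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            scores[name] = taylor_importance(module.weight, module.weight.grad)
    return scores  # the lowest-scoring channels are pruned first
```

To load and run the pruned model with Transformers: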
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the pruned model in half precision across available devices
model = AutoModelForCausalLM.from_pretrained(
    "naveedashfaq/llama-3-8b-pruned-30-percent-taylor",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("naveedashfaq/llama-3-8b-pruned-30-percent-taylor")

# Generate text
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
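As a quick sanity check (illustrative, not part of the original card), you can count the loaded model's parameters to confirm the reduction relative to the 8B base:

```python
# Count parameters of the loaded model; a ~30% structural prune of an
# 8B model should land well below 8 billion.
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {n_params / 1e9:.2f}B")
```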
This model is compatible with llama.cpp. Convert to GGUF format:
```bash
# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# Install the conversion script's Python dependencies
pip install -r requirements.txt

# Download the model and convert it to GGUF
huggingface-cli download naveedashfaq/llama-3-8b-pruned-30-percent-taylor --local-dir ./llama-3-8b-pruned
python convert_hf_to_gguf.py ./llama-3-8b-pruned/ --outfile ./llama-3-8b-pruned/ggml-model-f16.gguf --outtype f16

# Optional: quantize to reduce size further
./llama-quantize ./llama-3-8b-pruned/ggml-model-f16.gguf ./llama-3-8b-pruned/ggml-model-q4_0.gguf q4_0
```
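Once quantized, the GGUF file can also be run from Python via the llama-cpp-python bindings (an assumption, not part of the original card; install with `pip install llama-cpp-python`, and note the model path matches the quantize step above):

```python
from llama_cpp import Llama

# Load the quantized GGUF produced by the quantize step
llm = Llama(model_path="./llama-3-8b-pruned/ggml-model-q4_0.gguf", n_ctx=2048)

# Run a short completion
out = llm("Hello, how are you?", max_tokens=50)
print(out["choices"][0]["text"])
```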
If you use this model, please cite the original LLM-Pruner work:
```bibtex
@article{ma2023llmpruner,
  title={LLM-Pruner: On the Structural Pruning of Large Language Models},
  author={Ma, Xinyin and Fang, Gongfan and Wang, Xinchao},
  journal={arXiv preprint arXiv:2305.11627},
  year={2023}
}
```
This model inherits the license of the base Llama 3.1 model. Please refer to Meta's Llama 3.1 Community License for usage terms.
Base model: meta-llama/Llama-3.1-8B