Llama-3.1 8B Instruct (Pruned 30%)

This model was structurally pruned with the Taylor importance method from LLM-Pruner, which removes the structures whose estimated contribution to the loss is smallest.
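For intuition, the Taylor criterion ranks prunable structures by a first-order Taylor expansion of the loss: removing a group of weights w changes the loss by roughly |w · dL/dw|, accumulated over a small calibration set. Below is a minimal illustrative sketch of that scoring, not LLM-Pruner's actual API; the function name and per-channel reduction are assumptions:

import torch

def taylor_channel_importance(weight: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    # First-order Taylor importance per output channel of an MLP
    # projection: score_i = sum_j |w_ij * dL/dw_ij|. Channels with the
    # lowest scores are the cheapest to remove at the target ratio.
    return (weight * grad).abs().sum(dim=1)

In LLM-Pruner, scores for coupled structures (e.g., the matching gate/up-projection columns and down-projection rows) are aggregated so that entire intermediate channels are removed consistently.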

Model Details

  • Base Model: Meta Llama 3.1 8B Instruct
  • Pruning Ratio: 30% (MLP layers only)
  • Attention Layers: Intact (Protected to preserve GQA)
  • Pruning Method: taylor
  • Parameters: ~6.36B (reduced from 8.03B)
  • Model Size: ~12.68 GB (FP16)

Architecture Changes

  • Hidden Size: 4096 (unchanged)
  • Intermediate Size: ~10035 (reduced from 14336)
  • Attention Heads: 32 (unchanged)
  • Key-Value Heads: 8 (unchanged, GQA preserved)
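You can confirm the pruned dimensions directly from the published config, without downloading the weights, using the standard transformers API:

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("naveedashfaq/llama-3-8b-pruned-30-percent-taylor")
print(cfg.hidden_size)          # 4096 (unchanged)
print(cfg.intermediate_size)    # ~10035 (down from 14336)
print(cfg.num_attention_heads)  # 32 (unchanged)
print(cfg.num_key_value_heads)  # 8 (GQA preserved)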

How to Load

Standard Transformers (Recommended)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load in FP16 and let accelerate place layers across available devices
model = AutoModelForCausalLM.from_pretrained(
    "naveedashfaq/llama-3-8b-pruned-30-percent-taylor",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("naveedashfaq/llama-3-8b-pruned-30-percent-taylor")

# Generate text
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
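Since this is an Instruct model, prompts generally behave better when routed through the chat template (standard transformers API; the prompt is just an example):

# Chat-formatted generation
messages = [{"role": "user", "content": "Hello, how are you?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))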

llama.cpp / GGUF Conversion

This model keeps the standard Llama architecture (only the MLP width changes), so it is compatible with llama.cpp. To convert it to GGUF:

# Clone and build llama.cpp (recent versions build with CMake;
# older checkouts used `make`)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# Download the model
huggingface-cli download naveedashfaq/llama-3-8b-pruned-30-percent-taylor --local-dir ./llama-3-8b-pruned

# Convert to GGUF (the script was named convert.py in older checkouts)
python convert_hf_to_gguf.py ./llama-3-8b-pruned/ --outfile ./llama-3-8b-pruned/ggml-model-f16.gguf --outtype f16

# Optional: quantize to reduce size further (binary was previously ./quantize)
./build/bin/llama-quantize ./llama-3-8b-pruned/ggml-model-f16.gguf ./llama-3-8b-pruned/ggml-model-q4_0.gguf q4_0
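Once converted, the quantized file can be run directly with llama.cpp's CLI (named llama-cli in recent builds; older builds shipped ./main). The prompt and token count here are just examples:

./build/bin/llama-cli -m ./llama-3-8b-pruned/ggml-model-q4_0.gguf -p "Hello, how are you?" -n 50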

Performance

  • Size Reduction: ~21% fewer total parameters (6.36B vs. 8.03B); the 30% ratio applies to the MLP intermediate dimensions, not the whole network
  • Memory Savings: roughly 3 GB at FP16, since (8.03B - 6.36B) parameters × 2 bytes ≈ 3.3 GB (verifiable with the snippet below)
  • Speed: modestly faster inference, as the MLP matrix multiplications shrink in proportion to the pruning
  • Quality: degradation is expected to be small on most tasks, but see Limitations for evaluation caveats
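To sanity-check the memory figure on your own hardware, assuming the model is already loaded on a CUDA device as in "How to Load" (multi-GPU placements report only the current device):

import torch

# Bytes currently allocated by tensors on the active CUDA device
print(f"{torch.cuda.memory_allocated() / 1e9:.2f} GB allocated")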

Limitations

  • This is a structurally pruned model with modified dimensions
  • Performance may vary depending on the task
  • Further evaluation needed for domain-specific applications

Citation

If you use this model, please cite the original LLM-Pruner work:

@article{ma2023llmpruner,
  title={LLM-Pruner: On the Structural Pruning of Large Language Models},
  author={Ma, Xinyin and Fang, Gongfan and Wang, Xinchao},
  journal={arXiv preprint arXiv:2305.11627},
  year={2023}
}

License

This model inherits the license from the base Llama 3.1 model. Please refer to Meta's Llama 3.1 Community License for usage terms.
