Llama-3.1 8B Instruct (Pruned 30%)

This model was structurally pruned with the Taylor importance method from LLM-Pruner, which removes the structures whose estimated contribution to the loss is smallest.
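For intuition, the Taylor criterion ranks prunable structures by a first-order Taylor expansion of the loss: removing a group of weights w changes the loss by roughly |w · dL/dw|, accumulated over a small calibration set. Below is a minimal illustrative sketch of that scoring, not LLM-Pruner's actual API; the function name and per-channel reduction are assumptions:

import torch

def taylor_channel_importance(weight: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    # First-order Taylor importance per output channel of an MLP
    # projection: score_i = sum_j |w_ij * dL/dw_ij|. Channels with the
    # lowest scores are the cheapest to remove at the target ratio.
    return (weight * grad).abs().sum(dim=1)

In LLM-Pruner, scores for coupled structures (e.g., the matching gate/up-projection columns and down-projection rows) are aggregated so that entire intermediate channels are removed consistently.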

Model Details

  • Base Model: Meta Llama 3.1 8B Instruct
  • Pruning Ratio: 30% (MLP layers only)
  • Attention Layers: Intact (Protected to preserve GQA)
  • Pruning Method: taylor
  • Parameters: ~6.36B (reduced from 8.03B)
  • Model Size: ~12.68 GB (FP16)

Architecture Changes

  • Hidden Size: 4096 (unchanged)
  • Intermediate Size: ~10035 (reduced from 14336)
  • Attention Heads: 32 (unchanged)
  • Key-Value Heads: 8 (unchanged, GQA preserved)
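You can confirm the pruned dimensions directly from the published config, without downloading the weights, using the standard transformers API:

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("naveedashfaq/llama-3-8b-pruned-30-percent-taylor")
print(cfg.hidden_size)          # 4096 (unchanged)
print(cfg.intermediate_size)    # ~10035 (down from 14336)
print(cfg.num_attention_heads)  # 32 (unchanged)
print(cfg.num_key_value_heads)  # 8 (GQA preserved)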

How to Load

Standard Transformers (Recommended)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load in FP16 and let accelerate place layers across available devices
model = AutoModelForCausalLM.from_pretrained(
    "naveedashfaq/llama-3-8b-pruned-30-percent-taylor",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("naveedashfaq/llama-3-8b-pruned-30-percent-taylor")

# Generate text
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
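Since this is an Instruct model, prompts generally behave better when routed through the chat template (standard transformers API; the prompt is just an example):

# Chat-formatted generation
messages = [{"role": "user", "content": "Hello, how are you?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))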

llama.cpp / GGUF Conversion

This model keeps the standard Llama architecture (only the MLP width changes), so it is compatible with llama.cpp. To convert it to GGUF:

# Clone and build llama.cpp (recent versions build with CMake;
# older checkouts used `make`)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# Download the model
huggingface-cli download naveedashfaq/llama-3-8b-pruned-30-percent-taylor --local-dir ./llama-3-8b-pruned

# Convert to GGUF (the script was named convert.py in older checkouts)
python convert_hf_to_gguf.py ./llama-3-8b-pruned/ --outfile ./llama-3-8b-pruned/ggml-model-f16.gguf --outtype f16

# Optional: quantize to reduce size further (binary was previously ./quantize)
./build/bin/llama-quantize ./llama-3-8b-pruned/ggml-model-f16.gguf ./llama-3-8b-pruned/ggml-model-q4_0.gguf q4_0
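Once converted, the quantized file can be run directly with llama.cpp's CLI (named llama-cli in recent builds; older builds shipped ./main). The prompt and token count here are just examples:

./build/bin/llama-cli -m ./llama-3-8b-pruned/ggml-model-q4_0.gguf -p "Hello, how are you?" -n 50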

Performance

  • Size Reduction: ~21% fewer total parameters (6.36B vs. 8.03B); the 30% ratio applies to the MLP intermediate dimensions, not the whole network
  • Memory Savings: roughly 3 GB at FP16, since (8.03B - 6.36B) parameters × 2 bytes ≈ 3.3 GB (verifiable with the snippet below)
  • Speed: modestly faster inference, as the MLP matrix multiplications shrink in proportion to the pruning
  • Quality: degradation is expected to be small on most tasks, but see Limitations for evaluation caveats
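To sanity-check the memory figure on your own hardware, assuming the model is already loaded on a CUDA device as in "How to Load" (multi-GPU placements report only the current device):

import torch

# Bytes currently allocated by tensors on the active CUDA device
print(f"{torch.cuda.memory_allocated() / 1e9:.2f} GB allocated")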

Limitations

  • This is a structurally pruned model with modified dimensions
  • Performance may vary depending on the task
  • Further evaluation needed for domain-specific applications

Citation

If you use this model, please cite the original LLM-Pruner work:

@article{ma2023llmpruner,
  title={LLM-Pruner: On the Structural Pruning of Large Language Models},
  author={Ma, Xinyin and Fang, Gongfan and Wang, Xinchao},
  journal={arXiv preprint arXiv:2305.11627},
  year={2023}
}

License

This model inherits the license from the base Llama 3.1 model. Please refer to Meta's Llama 3.1 Community License for usage terms.
