[ 🤖 GitHub | 📄 Paper | 🌐 Website ]

ACIP applied to Qwen/Qwen2.5-14B

This model repository is part of the ACIP Project and provides a compressible version of Qwen/Qwen2.5-14B. For more details, please visit our code repo.

Quick Start

Just load the ACIP model via from_pretrained:

from transformers import AutoModel

model = AutoModel.from_pretrained("MerantixMomentum/acip_qwen25_14b", trust_remote_code=True)

This will download and create a fully parameterized ACIP model that can be pruned to any compression rate you wish. For example,

model.prune_model_by_score(size_ratio=0.4)

will prune the model to 40% of its original size, measured in number of parameters, i.e., a 60% compression rate. A unique feature of ACIP is that this operation is reversible: you can rerun model.prune_model_by_score as often as you like to evaluate your model at different sizes. Finally, you can "commit" to a certain ratio and run

model.compress()

which will discard all pruned mask values of compressible linear layers. Now the model is actually compressed and you should observe a significant decrease in memory usage (this step cannot be reverted without reloading the ACIP model). If you like, you can also run
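The difference between masking and compressing can be illustrated with a toy example. The sketch below is not ACIP's implementation: `ToyMaskedLayer`, its per-unit scores, and the rule "keep the highest-scoring units" are hypothetical stand-ins chosen only to show why pruning can be rerun while compression cannot.

```python
class ToyMaskedLayer:
    """Hypothetical stand-in for a prunable layer: one value and one score per unit."""

    def __init__(self, values, scores):
        self.values = list(values)
        self.scores = list(scores)
        self.mask = [True] * len(self.values)
        self.compressed = False

    def prune_by_score(self, size_ratio):
        """Recompute the mask from the stored scores; nothing is deleted."""
        if self.compressed:
            raise RuntimeError("already compressed; reload the layer to prune again")
        keep = round(size_ratio * len(self.values))
        # Rank units by score (highest first) and keep the top `keep` of them.
        order = sorted(range(len(self.scores)), key=lambda i: self.scores[i], reverse=True)
        kept = set(order[:keep])
        self.mask = [i in kept for i in range(len(self.values))]

    def compress(self):
        """Physically drop masked units; this cannot be undone."""
        self.values = [v for v, m in zip(self.values, self.mask) if m]
        self.scores = [s for s, m in zip(self.scores, self.mask) if m]
        self.mask = [True] * len(self.values)
        self.compressed = True

    def num_active(self):
        return sum(self.mask)


layer = ToyMaskedLayer(values=[0.9, 0.1, 0.5, 0.3, 0.7], scores=[5, 1, 3, 2, 4])
layer.prune_by_score(size_ratio=0.4)  # mask keeps the 2 highest-scoring units
active_at_40 = layer.num_active()
layer.prune_by_score(size_ratio=0.8)  # rerun: previously masked units come back
active_at_80 = layer.num_active()
layer.prune_by_score(size_ratio=0.4)
layer.compress()                      # masked units are now dropped for good
```

In this toy, rerunning `prune_by_score` only rewrites the mask, whereas `compress` shortens the stored lists, which mirrors why the real model has to be reloaded after compressing.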

model.quantize()

to save even more memory (we have only tested 4-bit quantization with bitsandbytes, but you could also customize this).

🚀 That's it! You can now use your compressed model for inference or fine-tuning as any other Causal Language Model from 🤗 transformers.
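Putting the steps together, an end-to-end run might look like the sketch below. It rests on several assumptions not confirmed here: that the tokenizer of the base model Qwen/Qwen2.5-14B is the right one to use, that the loaded model exposes the usual `generate` method and `device` attribute of 🤗 transformers models, and that enough memory is available for a 14B model. The function name `run_acip_demo` and the prompt are made up for illustration.

```python
def run_acip_demo(size_ratio=0.4, prompt="The capital of France is"):
    """Load, prune, compress, and generate; a sketch, not a tested recipe."""
    # Imports are local so that defining this function has no heavy side effects.
    import torch
    from transformers import AutoModel, AutoTokenizer

    # Assumption: the base model's tokenizer matches the ACIP model.
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B")
    model = AutoModel.from_pretrained(
        "MerantixMomentum/acip_qwen25_14b", trust_remote_code=True
    )

    model.prune_model_by_score(size_ratio=size_ratio)  # can still be rerun
    model.compress()                                   # cannot be undone without reloading
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # Assumption: generate() behaves as for other causal language models.
        output_ids = model.generate(**inputs, max_new_tokens=20)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `run_acip_demo()` downloads the full checkpoint, so it is best tried on a machine with sufficient disk space and memory.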

Note: The parameter size_ratio ranges from 1.0 to 0.0, indicating the model size after compression. For example, 0.4 means that the model has only 40% of the original number of parameters and 1.0 means no compression at all. Alternatively, you can also set compression_rate in prune_model_by_score, which is equivalent to size_ratio = 1.0 - compression_rate.
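The relation between the two parameters is simple arithmetic, sketched below. The helper names `size_ratio_from_compression_rate` and `remaining_parameters` are hypothetical and not part of the ACIP API; the parameter count in the usage note is a made-up round number.

```python
def size_ratio_from_compression_rate(compression_rate: float) -> float:
    """Convert a compression rate in [0, 1] to the equivalent size ratio."""
    if not 0.0 <= compression_rate <= 1.0:
        raise ValueError("compression_rate must lie between 0.0 and 1.0")
    return 1.0 - compression_rate


def remaining_parameters(num_parameters: int, size_ratio: float) -> int:
    """Approximate number of parameters left at a given size ratio."""
    if not 0.0 <= size_ratio <= 1.0:
        raise ValueError("size_ratio must lie between 0.0 and 1.0")
    return round(num_parameters * size_ratio)
```

For instance, a compression rate of 0.6 corresponds to a size ratio of 0.4, and a hypothetical model with 1,000 parameters would keep about 400 of them.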

Dependencies

To run an ACIP model from our hub, you only need minimal dependencies, namely torch, transformers, peft, and, optionally, bitsandbytes if you want to quantize your model. See requirements.txt for pip-installable dependencies with exact version pins (newer versions should work as well).
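A quick way to see which of these packages are available in the current environment is sketched below. It only checks that each package can be located and does not compare versions against requirements.txt; the helper name `missing_packages` is hypothetical.

```python
from importlib.util import find_spec

# Package names as listed above; bitsandbytes is only needed for quantization.
REQUIRED = ("torch", "transformers", "peft")
OPTIONAL = ("bitsandbytes",)


def missing_packages(names):
    """Return the top-level packages from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]
```

Running `missing_packages(REQUIRED)` returns an empty list when all required packages are installed, and `missing_packages(OPTIONAL)` tells you whether quantization support is present.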

License

This model is released under the Apache-2.0 license.

Citation

When using or referring to this model, please cite our paper:

@article{mxm2025acip,
  title={Choose Your Model Size: Any Compression by a Single Gradient Descent},
  author={M. Genzel and P. Putzky and P. Zhao and S. Schulze and M. Mollenhauer and R. Seidel and S. Dietzel and T. Wollmann},
  year={2025},
  journal={Preprint arXiv:2502.01717}
}