
๐Ÿ™ Github   |   ๐Ÿ“„ Paper

A-SINQ 4-bit Quantized Apertus-8B-2509 model

This repository contains the official 4-bit quantized version of the Apertus-8B-2509 model, produced with A-SINQ, the calibrated variant of the SINQ (Sinkhorn-Normalized Quantization) method.
SINQ is a novel, fast, and high-quality quantization method designed to make any Large Language Model smaller while keeping its accuracy almost intact.
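
For a rough sense of what "smaller" means here, the back-of-envelope sketch below (an illustrative estimate, not an official figure) compares raw weight storage for an ~8B-parameter model at 16-bit versus 4-bit precision; per-group quantization metadata and any layers kept in higher precision add some overhead on top of the 4-bit number.

# Illustrative back-of-envelope estimate of weight storage (ignores per-group
# scales/zero-points and layers kept in higher precision).
n_params = 8e9                      # ~8B parameters
bf16_gb = n_params * 16 / 8 / 1e9   # GB at 16 bits per weight
int4_gb = n_params * 4 / 8 / 1e9    # GB at 4 bits per weight
print(f"16-bit weights: ~{bf16_gb:.0f} GB, INT4 weights: ~{int4_gb:.0f} GB")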

To support the project, please star ⭐ the official SINQ GitHub repository.

Model Details

  • Model Name: Apertus-8B-2509-4bit-ASINQ
  • Base Model: swiss-ai/Apertus-8B-2509
  • Task: Text Generation
  • Framework: PyTorch / Transformers
  • License: Apache-2.0
  • Quantized By: Huawei - Computing Systems Lab

Quantization Details

  • Quantization Method: A-SINQ (calibrated Sinkhorn-Normalized Quantization)
  • Precision: INT4
  • Group Size: 64
  • Framework: PyTorch
  • Quantization Library: sinq

🚀 Usage

Prerequisite

  • Before running the quantization script, make sure the SINQ library is installed. Installation instructions and setup details are available in the official SINQ GitHub repository.

  • For optimal inference speed, ensure that the GemLite library is installed. A quick way to check that both packages are importable is sketched right after this list.
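
The minimal sketch below checks that both packages can be found in the current environment. It assumes the import names are sinq (as used in the examples below) and gemlite, which are not confirmed by this card and may differ from the package names used at install time.

import importlib.util

# Check that the SINQ and GemLite packages are importable (module names assumed).
for pkg in ("sinq", "gemlite"):
    status = "available" if importlib.util.find_spec(pkg) is not None else "missing"
    print(f"{pkg}: {status}")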

Usage example

You can load and use the model with our wrapper based on the 🤗 Transformers library:

import torch
from transformers import AutoTokenizer
from sinq.patch_model import AutoSINQHFModel

model_name = "huawei-csl/Apertus-8B-2509-4bit-ASINQ"
tokenizer = AutoTokenizer.from_pretrained(model_name)
sinq_model = AutoSINQHFModel.from_quantized_safetensors(
    model_name,
    device="cuda:0",
    compute_dtype=torch.bfloat16
)

# OPTIONAL: compile the forward pass to further increase inference speed
sinq_model.forward(torch.tensor([[0]], device="cuda:0"))  # dummy warm-up call
sinq_model.forward = torch.compile(sinq_model.forward, dynamic=True, fullgraph=False, backend='inductor', mode='reduce-overhead')

template = """{% for m in messages -%}
{{ m['role'] }}: {{ m['content'] }}
{% endfor -%}
{% if add_generation_prompt %}assistant: {% endif %}"""

tokenizer.chat_template = template  # set once per tokenizer

# prepare the model input
prompt = "Give me a brief explanation of gravity in simple terms."
messages_think = [
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages_think,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(sinq_model.device)

# Generate the output
generated_ids = sinq_model.generate(**model_inputs, max_new_tokens=100)

# Get and decode the output
output_ids = generated_ids[0][len(model_inputs.input_ids[0]) :]
print(tokenizer.decode(output_ids, skip_special_tokens=True))

You can optionally compile the model's forward pass using torch.compile, which can provide a significant speed boost (especially after the first run). Note that the first run will take longer because PyTorch compiles optimized kernels, but subsequent runs will be much faster.
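
To see the effect, the illustrative sketch below (continuing from the example above, so it assumes sinq_model and model_inputs are already defined and the forward pass has been wrapped with torch.compile) times a first, compile-heavy generation against a subsequent steady-state one.

import time

# Time a first (compile-heavy) generation and a subsequent (steady-state) one.
for label in ("first call", "second call"):
    start = time.perf_counter()
    sinq_model.generate(**model_inputs, max_new_tokens=20)
    print(f"{label}: {time.perf_counter() - start:.2f} s")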

🧩 Quantization Process

The quantized model was obtained using the SINQ quantization library, following the steps below:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sinq.patch_model import AutoSINQHFModel
from sinq.sinqlinear import BaseQuantizeConfig

# Load base model
base_model_name = "swiss-ai/Apertus-8B-2509"
model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype="float16")
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# Apply 4-bit SINQ quantization
quant_cfg = BaseQuantizeConfig(
    nbits=4,           # quantization bit-width
    group_size=64,     # group size
    tiling_mode="1D",  # tiling strategy
    method="asinq"     # quantization method ("asinq" for the calibrated version)
)

qmodel = AutoSINQHFModel.quantize_model(
    model,
    tokenizer=tokenizer,
    quant_config=quant_cfg,
    compute_dtype=torch.bfloat16,
    device="cuda:0"
)
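
As an optional sanity check (not part of the official procedure), the sketch below runs a short generation with the freshly quantized model; it assumes qmodel still exposes the standard Transformers generate API, as the quantized checkpoint does in the usage example above.

# Quick sanity check on the freshly quantized model (assumes the standard generate API).
inputs = tokenizer("Gravity is", return_tensors="pt").to("cuda:0")
out = qmodel.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(out[0], skip_special_tokens=True))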

Reproducibility Note: This model was quantized using the SINQ implementation from commit bbbc657 of the SINQ repository.



🧾 How to Cite This Work

If you find SINQ useful in your research or applications, please:

  • Put a star ⭐ on the official SINQ GitHub repository.
  • Cite our paper:
@misc{muller2025sinq,
      title={SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights}, 
      author={Lorenz K. Muller and Philippe Bich and Jiawei Zhuang and Ahmet Celik and Luca Benfenati and Lukas Cavigelli},
      year={2025},
      eprint={2509.22944},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={http://arxiv.org/abs/2509.22944}
}