
๐Ÿ™ Github   |   ๐Ÿ“„ Paper

A-SINQ 4-bit Quantized Apertus-8B-2509 model

This repository contains the official 4-bit quantized version of the Apertus-8B-2509 model, produced with A-SINQ, the calibrated variant of the SINQ (Sinkhorn-Normalized Quantization) method.
SINQ is a novel, fast, and high-quality quantization method designed to make any Large Language Model smaller while keeping its accuracy almost intact.
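
For a rough sense of what "smaller" means here, the back-of-envelope sketch below (an illustrative estimate, not an official figure) compares raw weight storage for an ~8B-parameter model at 16-bit versus 4-bit precision; per-group quantization metadata and any layers kept in higher precision add some overhead on top of the 4-bit number.

# Illustrative back-of-envelope estimate of weight storage (ignores per-group
# scales/zero-points and layers kept in higher precision).
n_params = 8e9                      # ~8B parameters
bf16_gb = n_params * 16 / 8 / 1e9   # GB at 16 bits per weight
int4_gb = n_params * 4 / 8 / 1e9    # GB at 4 bits per weight
print(f"16-bit weights: ~{bf16_gb:.0f} GB, INT4 weights: ~{int4_gb:.0f} GB")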

To support the project, please star ⭐ the official SINQ GitHub repository.

Model Details

  • Model Name: Apertus-8B-2509-4bit-ASINQ
  • Base Model: swiss-ai/Apertus-8B-2509
  • Task: Text Generation
  • Framework: PyTorch / Transformers
  • License: Apache-2.0
  • Quantized By: Huawei - Computing Systems Lab

Quantization Details

  • Quantization Method: A-SINQ (calibrated Sinkhorn-Normalized Quantization)
  • Precision: INT4
  • Group Size: 64
  • Framework: PyTorch
  • Quantization Library: sinq

🚀 Usage

Prerequisite

  • Before running the quantization script, make sure the SINQ library is installed. Installation instructions and setup details are available in the official SINQ GitHub repository.

  • For optimal inference speed, ensure that the GemLite library is installed. A quick way to check that both packages are importable is sketched right after this list.
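
The minimal sketch below checks that both packages can be found in the current environment. It assumes the import names are sinq (as used in the examples below) and gemlite, which are not confirmed by this card and may differ from the package names used at install time.

import importlib.util

# Check that the SINQ and GemLite packages are importable (module names assumed).
for pkg in ("sinq", "gemlite"):
    status = "available" if importlib.util.find_spec(pkg) is not None else "missing"
    print(f"{pkg}: {status}")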

Usage example

You can load and use the model with our wrapper based on the 🤗 Transformers library:

import torch
from transformers import AutoTokenizer
from sinq.patch_model import AutoSINQHFModel

model_name = "huawei-csl/Apertus-8B-2509-4bit-ASINQ"
tokenizer = AutoTokenizer.from_pretrained(model_name)
sinq_model = AutoSINQHFModel.from_quantized_safetensors(
    model_name,
    device="cuda:0",
    compute_dtype=torch.bfloat16
)

# OPTIONAL: compile the forward pass to further increase inference speed
sinq_model.forward(torch.tensor([[0]], device="cuda:0"))  # dummy warm-up call
sinq_model.forward = torch.compile(sinq_model.forward, dynamic=True, fullgraph=False, backend='inductor', mode='reduce-overhead')

template = """{% for m in messages -%}
{{ m['role'] }}: {{ m['content'] }}
{% endfor -%}
{% if add_generation_prompt %}assistant: {% endif %}"""

tokenizer.chat_template = template  # set once per tokenizer

# prepare the model input
prompt = "Give me a brief explanation of gravity in simple terms."
messages_think = [
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages_think,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(sinq_model.device)

# Generate the output
generated_ids = sinq_model.generate(**model_inputs, max_new_tokens=100)

# Get and decode the output
output_ids = generated_ids[0][len(model_inputs.input_ids[0]) :]
print(tokenizer.decode(output_ids, skip_special_tokens=True))

You can optionally compile the model's forward pass using torch.compile, which can provide a significant speed boost (especially after the first run). Note that the first run will take longer because PyTorch compiles optimized kernels, but subsequent runs will be much faster.
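
To see the effect, the illustrative sketch below (continuing from the example above, so it assumes sinq_model and model_inputs are already defined and the forward pass has been wrapped with torch.compile) times a first, compile-heavy generation against a subsequent steady-state one.

import time

# Time a first (compile-heavy) generation and a subsequent (steady-state) one.
for label in ("first call", "second call"):
    start = time.perf_counter()
    sinq_model.generate(**model_inputs, max_new_tokens=20)
    print(f"{label}: {time.perf_counter() - start:.2f} s")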

🧩 Quantization Process

The quantized model was obtained using the SINQ quantization library, following the steps below:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sinq.patch_model import AutoSINQHFModel
from sinq.sinqlinear import BaseQuantizeConfig

# Load base model
base_model_name = "swiss-ai/Apertus-8B-2509"
model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype="float16")
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# Apply 4-bit SINQ quantization
quant_cfg = BaseQuantizeConfig(
    nbits=4,           # quantization bit-width
    group_size=64,     # group size
    tiling_mode="1D",  # tiling strategy
    method="asinq"     # quantization method ("asinq" for the calibrated version)
)

qmodel = AutoSINQHFModel.quantize_model(
    model,
    tokenizer=tokenizer,
    quant_config=quant_cfg,
    compute_dtype=torch.bfloat16,
    device="cuda:0"
)
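
As an optional sanity check (not part of the official procedure), the sketch below runs a short generation with the freshly quantized model; it assumes qmodel still exposes the standard Transformers generate API, as the quantized checkpoint does in the usage example above.

# Quick sanity check on the freshly quantized model (assumes the standard generate API).
inputs = tokenizer("Gravity is", return_tensors="pt").to("cuda:0")
out = qmodel.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(out[0], skip_special_tokens=True))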

Reproducibility Note: This model was quantized using the SINQ implementation from commit bbbc657 of the SINQ repository.



🧾 How to Cite This Work

If you find SINQ useful in your research or applications, please:

  • Put a star ⭐ on the official SINQ GitHub repository.
  • Cite our paper:
@misc{muller2025sinq,
      title={SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights}, 
      author={Lorenz K. Muller and Philippe Bich and Jiawei Zhuang and Ahmet Celik and Luca Benfenati and Lukas Cavigelli},
      year={2025},
      eprint={2509.22944},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={http://arxiv.org/abs/2509.22944}
}