Instructions to use btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ with libraries, inference providers, notebooks, and local apps.

Libraries

How to use btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ")
model = AutoModelForImageTextToText.from_pretrained("btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
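
The same request can be issued from Python. The following is a minimal sketch using only the standard library; it assumes the vLLM server started above is listening on localhost:8000, and the helper names `build_chat_request` and `chat` are illustrative, not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ"


def build_chat_request(prompt, image_url=None, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    # The content list mirrors the curl payload: one text part, optional image part.
    content = [{"type": "text", "text": prompt}]
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": content}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, image_url=None, base_url="http://localhost:8000"):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, image_url, base_url)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-compatible responses put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("Describe this image in one sentence.", image_url=...)` should return the model's reply as a string. Pass `base_url="http://localhost:30000"` to target an SGLang server on its port instead.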

Use Docker

docker model run hf.co/btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ

SGLang

How to use btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ" \
        --host 0.0.0.0 \
        --port 30000
# Call the server with the same curl request shown above for the pip install (OpenAI-compatible API, port 30000).

Docker Model Runner
How to use btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ with Docker Model Runner:
```
docker model run hf.co/btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ
```

Devstral-Small-2-24B-Instruct - Mixed Precision GPTQ (INT4/INT8)

Mixed-precision GPTQ quantization of mistralai/Devstral-Small-2-24B-Instruct-2512.

Quantization Details

The original model uses FP8 weights, which perform best on hardware with native FP8 support. This GPTQ quantization provides compatibility with a wider range of GPUs. Mixed precision is used to reduce model size without significant accuracy degradation. Compared with a pure 4-bit quantization, it trades some speed and memory for better accuracy (see the perplexity comparison below).

Quantization scheme:

Attention layers (q_proj, k_proj, v_proj, o_proj): INT4, group_size=128
MLP layers (gate_proj, up_proj, down_proj): INT8, group_size=128
Vision layers: unmodified
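
The scheme above can be read as a mapping from module name to precision. The sketch below is only an illustration of that mapping, not the contents of quantize.py. The example module paths, the rule that vision modules contain "vision" in their name, and the 16-bit scale per group used for the storage estimate are all assumptions.

```python
ATTENTION_PROJECTIONS = ("q_proj", "k_proj", "v_proj", "o_proj")
MLP_PROJECTIONS = ("gate_proj", "up_proj", "down_proj")
GROUP_SIZE = 128


def scheme_for(module_name):
    """Return (bits, group_size) for a module, or None if it stays unquantized."""
    # Assumption: vision modules can be recognised by "vision" in their path.
    if "vision" in module_name:
        return None
    leaf = module_name.rsplit(".", 1)[-1]
    if leaf in ATTENTION_PROJECTIONS:
        return (4, GROUP_SIZE)
    if leaf in MLP_PROJECTIONS:
        return (8, GROUP_SIZE)
    # Anything else (embeddings, norms, output head) is left untouched here.
    return None


def bits_per_weight(bits, group_size, scale_bits=16):
    """Approximate storage cost per weight, including the per-group scale."""
    # Assumption: one scale of `scale_bits` per group and no stored zero point.
    return bits + scale_bits / group_size


# Hypothetical module paths, used only to exercise the mapping.
for name in (
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.mlp.down_proj",
    "model.vision_tower.transformer.layers.0.attention.q_proj",
):
    print(name, "->", scheme_for(name))
```

Under these assumptions the attention projections cost about 4.125 bits per weight and the MLP projections about 8.125.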

All layers use group quantization (not channelwise) for ROCm compatibility.
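
To make the distinction concrete, here is a small pure-Python sketch of group quantization versus a single scale per row. It uses plain symmetric round-to-nearest rather than GPTQ (which additionally compensates for rounding error), and the tiny group size and weight values are made up for illustration.

```python
def quantize_groupwise(weights, bits, group_size):
    """Symmetric round-to-nearest quantization with one scale per group."""
    qmax = 2 ** (bits - 1) - 1
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # One scale per group; fall back to 1.0 for an all-zero group.
        scale = max(abs(w) for w in group) / qmax or 1.0
        scales.append(scale)
        codes.extend(max(-qmax - 1, min(qmax, round(w / scale))) for w in group)
    return codes, scales


def dequantize_groupwise(codes, scales, group_size):
    """Map integer codes back to floats using each group's scale."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


def max_error(weights, bits, group_size):
    """Largest absolute reconstruction error after a quantize/dequantize round trip."""
    codes, scales = quantize_groupwise(weights, bits, group_size)
    restored = dequantize_groupwise(codes, scales, group_size)
    return max(abs(w - r) for w, r in zip(weights, restored))


# Four small weights followed by four large ones (illustrative values).
row = [0.01, -0.02, 0.03, -0.04, 1.0, -2.0, 3.0, -4.0]

small = row[:4]
codes_g, scales_g = quantize_groupwise(row, bits=4, group_size=4)
codes_c, scales_c = quantize_groupwise(row, bits=4, group_size=len(row))
print("per-group scales:  ", scales_g)
print("single-scale codes:", codes_c[:4])  # the small weights collapse to zero
```

With one scale for the whole row, the large weights set the scale and the small ones all round to zero. With a scale per group, each group keeps its own resolution. The model itself uses group_size=128, so each run of 128 weights shares one scale.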

Quantized using llmcompressor with GPTQ. See quantize.py for the full quantization script.

Calibration

Dataset: theblackcat102/evol-codealpaca-v1
Samples: 256
Sequence Length: 8192 tokens

Model Size

Version	Size
Original (FP8)	~25 GB
Quantized (INT4/INT8)	~24 GB

Perplexity

Evaluated on wikitext-2-raw-v1 (test set):

Model	Perplexity	Degradation
Original (FP8)	4.5408	-
Quantized (INT4/INT8)	4.6044	+1.4%
cyankiwi/Devstral-Small-2-24B-Instruct-2512-AWQ-4bit	5.0161	+10.5%

The 4-bit AWQ quantization runs significantly faster in my testing, but it loses some accuracy relative to the original model. That trade-off may be worth it depending on your hardware.
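
The Degradation column in the table above is the relative increase in perplexity over the FP8 original. A short check, using only the perplexity values reported in the table:

```python
def degradation_percent(baseline_ppl, quantized_ppl):
    """Relative perplexity increase over the baseline, in percent."""
    return (quantized_ppl - baseline_ppl) / baseline_ppl * 100.0


BASELINE_PPL = 4.5408  # Original (FP8)
RESULTS = {
    "Quantized (INT4/INT8)": 4.6044,
    "cyankiwi AWQ-4bit": 5.0161,
}

for name, ppl in RESULTS.items():
    print(f"{name}: +{degradation_percent(BASELINE_PPL, ppl):.1f}%")
# Quantized (INT4/INT8): +1.4%
# cyankiwi AWQ-4bit: +10.5%
```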

Usage

You will need transformers v5+ to run ministral models. Install via:

pip install "transformers>=5.0.0"

vLLM (Recommended)

vllm serve btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ \
  --tensor-parallel-size 4 \
  --quantization compressed-tensors

On ROCm, consider setting VLLM_DISABLED_KERNELS=ConchLinearKernel. On MI100s, performance was degraded when the Conch kernels were used.
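
If you launch the server from a script, the environment variable can be set for the child process only. The following is a minimal sketch, assuming `vllm` is on the PATH; the helper names `build_vllm_launch` and `launch` are illustrative, and the flags are the ones from the command above.

```python
import os
import subprocess

MODEL_ID = "btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ"


def build_vllm_launch(tensor_parallel_size=4, disable_conch=True):
    """Return (command, environment) for starting the vLLM server."""
    cmd = [
        "vllm", "serve", MODEL_ID,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--quantization", "compressed-tensors",
    ]
    # Copy the current environment so the parent process is not modified.
    env = dict(os.environ)
    if disable_conch:
        env["VLLM_DISABLED_KERNELS"] = "ConchLinearKernel"
    return cmd, env


def launch(**kwargs):
    """Start the server and block until it exits."""
    cmd, env = build_vllm_launch(**kwargs)
    subprocess.run(cmd, env=env, check=True)
```

Calling `launch()` is equivalent to running the command above with the variable set. Pass `disable_conch=False` on hardware where the Conch kernels perform well, and adjust `tensor_parallel_size` to your GPU count.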

HuggingFace Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ"
)

Hardware Compatibility

AMD GPUs (ROCm): Tested on MI100. The exllama kernels were used with group quantization by disabling ConchLinearKernel. The Conch kernels work, but may degrade performance.
NVIDIA GPUs (CUDA): Untested, but should work with marlin or exllama kernels.

Credits

Base Model: Mistral AI - Devstral-Small-2-24B-Instruct
Quantization: GPTQ via llmcompressor
Quantized by: btbtyler09

License

This model inherits the license from the base model. See LICENSE for details.
