Devstral-Small-2-24B-Instruct - Mixed Precision GPTQ (INT4/INT8)
Mixed-precision GPTQ quantization of mistralai/Devstral-Small-2-24B-Instruct-2512.
Quantization Details
The original model uses FP8 weights, which perform best on hardware with native FP8 support. This GPTQ quantization provides compatibility with a wider range of GPUs. Mixed precision is used to reduce model size without significant accuracy degradation. Compared with a uniform 4-bit quantization, this scheme gives up some speed and memory in exchange for better accuracy.
Quantization scheme:
- Attention layers (q_proj, k_proj, v_proj, o_proj): INT4, group_size=128
- MLP layers (gate_proj, up_proj, down_proj): INT8, group_size=128
- Vision layers: unmodified
All layers use group quantization (not channelwise) for ROCm compatibility.
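To make "group quantization" concrete, here is a minimal pure-Python sketch of symmetric round-to-nearest quantization with one scale per group of 128 weights. It is an illustration only: GPTQ chooses the integer codes with an error-compensating procedure rather than plain rounding, and the function names, the synthetic weight row, and the 16-bit scale assumed in `effective_bits` are all hypothetical and not taken from quantize.py.

```python
import random


def quantize_groupwise(weights, bits, group_size=128):
    """Symmetric round-to-nearest quantization, one scale per group.

    Returns (codes, scales). Codes lie in [-(2**(bits-1)), 2**(bits-1) - 1].
    """
    qmax = 2 ** (bits - 1) - 1
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        # Clamp after rounding so every code fits in the signed range.
        codes.extend(max(-qmax - 1, min(qmax, round(w / scale))) for w in group)
    return codes, scales


def dequantize_groupwise(codes, scales, group_size=128):
    """Map integer codes back to floats using the scale of their group."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


def mean_abs_error(weights, bits, group_size=128):
    """Average absolute round-trip error for the given bit width."""
    codes, scales = quantize_groupwise(weights, bits, group_size)
    restored = dequantize_groupwise(codes, scales, group_size)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)


def effective_bits(bits, group_size=128, scale_bits=16):
    """Storage per weight, ASSUMING one 16-bit scale per group and no zero point."""
    return bits + scale_bits / group_size


rng = random.Random(0)
row = [rng.gauss(0.0, 0.02) for _ in range(1024)]  # synthetic weights, not model data

err_int4 = mean_abs_error(row, bits=4)
err_int8 = mean_abs_error(row, bits=8)
print(f"INT4 mean abs error: {err_int4:.6f}")
print(f"INT8 mean abs error: {err_int8:.6f}")
print(f"bits/weight: INT4={effective_bits(4)}, INT8={effective_bits(8)}")
```

In a channel-wise scheme the whole row would share a single scale; here each run of 128 consecutive weights gets its own.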
Quantized using llmcompressor with GPTQ. See quantize.py for the full quantization script.
Calibration
- Dataset: theblackcat102/evol-codealpaca-v1
- Samples: 256
- Sequence Length: 8192 tokens
Model Size
| Version | Size |
|---|---|
| Original (FP8) | ~25 GB |
| Quantized (INT4/INT8) | ~24 GB |
Perplexity
Evaluated on wikitext-2-raw-v1 (test set):
| Model | Perplexity | Degradation |
|---|---|---|
| Original (FP8) | 4.5408 | - |
| Quantized (INT4/INT8) | 4.6044 | +1.4% |
| cyankiwi/Devstral-Small-2-24B-Instruct-2512-AWQ-4bit | 5.0161 | +10.5% |
The 4-bit AWQ quantization runs significantly faster in my testing, but it loses more accuracy relative to the original model. That trade-off may be worth it depending on your hardware.
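The Degradation column in the table above is consistent with the relative increase in perplexity over the FP8 baseline. A short sketch of that arithmetic, using the perplexity values from the table (the function name is illustrative):

```python
def degradation_pct(baseline_ppl, model_ppl):
    """Relative perplexity increase over the baseline, in percent."""
    return (model_ppl / baseline_ppl - 1.0) * 100.0


baseline = 4.5408  # Original (FP8), from the table above
results = {
    "Quantized (INT4/INT8)": 4.6044,
    "cyankiwi AWQ-4bit": 5.0161,
}

for name, ppl in results.items():
    print(f"{name}: {degradation_pct(baseline, ppl):+.1f}%")
# Quantized (INT4/INT8): +1.4%
# cyankiwi AWQ-4bit: +10.5%
```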
Usage
You will need transformers v5+ to run ministral models. Install via:
pip install "transformers>=5.0.0"
vLLM (Recommended)
vllm serve btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ \
--tensor-parallel-size 4 \
--quantization compressed-tensors
On ROCm, consider setting VLLM_DISABLED_KERNELS=ConchLinearKernel. On MI100s, performance was degraded when the Conch kernels were used.
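Once the server is running, it can be queried through vLLM's OpenAI-compatible chat completions endpoint. The following is a minimal standard-library sketch; it assumes the server listens on vLLM's default address (localhost, port 8000), and the helper names `build_chat_request` and `ask` are illustrative.

```python
import json
import urllib.request

MODEL_ID = "btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ"
BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM default host and port


def build_chat_request(prompt, max_tokens=256):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, timeout=120):
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(ask("Write a Python function that reverses a string."))` should print the model's reply.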
HuggingFace Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "btbtyler09/Devstral-Small-2-24B-Instruct-INT4-INT8-Mixed-GPTQ"
)
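After loading, a text-only reply can be generated along the following lines. This is a sketch, not a tested recipe for this checkpoint: it assumes the tokenizer ships a chat template usable through `apply_chat_template`, and `generate_reply` is a hypothetical helper name.

```python
def generate_reply(model, tokenizer, prompt, max_new_tokens=256):
    """Generate a reply to a single user message.

    `model` and `tokenizer` are the objects loaded above. Assumes the
    tokenizer provides a chat template (not verified for this checkpoint).
    """
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated text is decoded.
    prompt_len = inputs["input_ids"].shape[-1]
    return tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True)
```

For example, `print(generate_reply(model, tokenizer, "Explain what a Python decorator is."))`.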
Hardware Compatibility
- AMD GPUs (ROCm): Tested on MI100, using exllama kernels with group quantization and ConchLinearKernel disabled. The Conch kernels work but may degrade performance.
- NVIDIA GPUs (CUDA): Untested, but should work with marlin or exllama kernels.
Credits
- Base Model: Mistral AI - Devstral-Small-2-24B-Instruct
- Quantization: GPTQ via llmcompressor
- Quantized by: btbtyler09
License
This model inherits the license from the base model. See LICENSE for details.