Instructions for using adamjen/Devstral-Small-2-24B-Opus-Reasoning with libraries, inference providers, notebooks, and local apps. Follow the sections below to get started.
- Libraries
- llama-cpp-python
How to use adamjen/Devstral-Small-2-24B-Opus-Reasoning with llama-cpp-python:
```python
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="adamjen/Devstral-Small-2-24B-Opus-Reasoning",
    filename="Devstral-Small-2-24B-Opus-Reasoning.Q4_K_M.gguf",
)

llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the capital of France?"}]
)
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use adamjen/Devstral-Small-2-24B-Opus-Reasoning with llama.cpp:
Install from brew
```sh
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M
```
Install from WinGet (Windows)
```sh
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M
```
Use pre-built binary
```sh
# Download a pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M

# Run inference directly in the terminal:
./llama-cli -hf adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M
```
Build from source code
```sh
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M

# Run inference directly in the terminal:
./build/bin/llama-cli -hf adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M
```
Use Docker
```sh
docker model run hf.co/adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M
```
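Whichever install path you choose, `llama-server` exposes an OpenAI-compatible API (port 8080 by default). As a sketch of the request it expects, here is a small helper that builds the endpoint URL and JSON body; the function name is illustrative, and the payload shape simply follows the OpenAI chat-completions convention:

```python
import json

def build_chat_request(model: str, prompt: str, base_url: str = "http://localhost:8080"):
    """Build the URL and JSON body for an OpenAI-compatible chat completion call."""
    url = f"{base_url}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, json.dumps(payload)

url, body = build_chat_request(
    "adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M",
    "What is the capital of France?",
)
# POST `body` to `url` with Content-Type: application/json
# (e.g. via curl or any HTTP client).
```

Any OpenAI-style client library can be pointed at the same base URL instead of constructing requests by hand.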
- LM Studio
- Jan
- vLLM
How to use adamjen/Devstral-Small-2-24B-Opus-Reasoning with vLLM:
Install from pip and serve model
```sh
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "adamjen/Devstral-Small-2-24B-Opus-Reasoning"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "adamjen/Devstral-Small-2-24B-Opus-Reasoning",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```
Use Docker
```sh
docker model run hf.co/adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M
```
- Ollama
How to use adamjen/Devstral-Small-2-24B-Opus-Reasoning with Ollama:
```sh
ollama run hf.co/adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M
```
- Unsloth Studio
How to use adamjen/Devstral-Small-2-24B-Opus-Reasoning with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```sh
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio:
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# and search for adamjen/Devstral-Small-2-24B-Opus-Reasoning to start chatting.
```
Install Unsloth Studio (Windows)
```powershell
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio:
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# and search for adamjen/Devstral-Small-2-24B-Opus-Reasoning to start chatting.
```
Using HuggingFace Spaces for Unsloth
```sh
# No setup required.
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# and search for adamjen/Devstral-Small-2-24B-Opus-Reasoning to start chatting.
```
- Pi
How to use adamjen/Devstral-Small-2-24B-Opus-Reasoning with Pi:
Start the llama.cpp server
```sh
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M
```
Configure the model in Pi
```sh
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```
Add to `~/.pi/agent/models.json`:
```json
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M" }
      ]
    }
  }
}
```
Run Pi
```sh
# Start Pi in your project directory:
pi
```
- Hermes Agent
How to use adamjen/Devstral-Small-2-24B-Opus-Reasoning with Hermes Agent:
Start the llama.cpp server
```sh
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M
```
Configure Hermes
```sh
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M
```
Run Hermes
```sh
hermes
```
- Docker Model Runner
How to use adamjen/Devstral-Small-2-24B-Opus-Reasoning with Docker Model Runner:
```sh
docker model run hf.co/adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M
```
- Lemonade
How to use adamjen/Devstral-Small-2-24B-Opus-Reasoning with Lemonade:
Pull the model
```sh
# Download Lemonade from https://lemonade-server.ai/
lemonade pull adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M
```
Run and chat with the model
```sh
lemonade run user.Devstral-Small-2-24B-Opus-Reasoning-Q4_K_M
```
List all available models
```sh
lemonade list
```
Devstral-Small-2-24B Opus Reasoning
A LoRA fine-tune of Devstral-Small-2-24B distilled on Claude 4.6 Opus <think>...</think> reasoning traces. The goal: give Devstral's strong coding foundation explicit chain-of-thought reasoning before it writes code.
Model Details
| Detail | Value |
|---|---|
| Base model | mistralai/Devstral-Small-2-24B-Instruct-2512 |
| Fine-tune type | QLoRA (4-bit NF4 base + BF16 LoRA adapters) |
| LoRA rank | r=16, alpha=16 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Training data | nohurry/Opus-4.6-Reasoning-3000x-filtered (2,322 samples) |
| Checkpoint used | checkpoint-1200 (end of epoch 2 — best generalisation) |
| Hardware | RTX 3090 24GB VRAM |
| Framework | Unsloth 2026.3.10 + TRL SFTTrainer |
| Sequence length | 2048 |
Files
| File | Description |
|---|---|
| `adapter_model.safetensors` | LoRA adapter weights (~400MB) |
| `adapter_config.json` | LoRA config (rank, target modules, base model path) |
| `Devstral-Small-2-24B-Opus-Reasoning.Q4_K_M.gguf` | Quantised GGUF, ready for llama.cpp / Ollama / llama-swap |
| `Devstral-Small-2-24B-Opus-Reasoning.Q5_K_M.gguf` | Higher-quality GGUF, recommended for local use |
Training Data
nohurry/Opus-4.6-Reasoning-3000x-filtered — 2,324 problems with Claude 4.6 Opus <think> reasoning traces and solutions, filtered to < 20,000 characters combined length.
Each sample was formatted as:
```
[INST] {problem} [/INST]<think>
{thinking}
</think>
{solution}
```
Loss was computed on the assistant turn only (`train_on_responses_only`).
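Conceptually, that masking sets the label of every prompt token, up to and including the closing `[/INST]`, to the ignore index so cross-entropy only covers the `<think>` trace and the solution. A minimal sketch (an illustrative helper, not Unsloth's implementation; whole-word strings stand in for real tokenizer IDs):

```python
IGNORE_INDEX = -100  # PyTorch cross-entropy skips targets with this value

def mask_prompt_labels(tokens: list[str], labels: list[int]) -> list[int]:
    """Mask every label up to and including '[/INST]' so loss covers only the response."""
    masked = list(labels)
    for i, tok in enumerate(tokens):
        masked[i] = IGNORE_INDEX
        if tok == "[/INST]":
            break
    return masked

tokens = ["[INST]", "problem", "[/INST]", "<think>", "steps", "</think>", "answer"]
labels = [10, 11, 12, 13, 14, 15, 16]
print(mask_prompt_labels(tokens, labels))
# → [-100, -100, -100, 13, 14, 15, 16]
```

Only the reasoning trace and solution contribute gradient signal; the problem statement is context, not a target.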
Training Loss
| Step | Epoch | Loss |
|---|---|---|
| 5 | 0.01 | 0.7949 |
| 100 | 0.17 | 0.5708 |
| 300 | 0.52 | 0.5800 |
| 600 | 1.03 | 0.3559 |
| 900 | 1.55 | 0.3858 |
| 1100 | 1.89 | 0.3469 |
| 1160 | 2.00 | 0.3752 |
| 1200 | 2.07 | 0.1493 |
Checkpoint 1200 (end of epoch 2) was selected over the full epoch 3 run — for reasoning distillation tasks, epoch 3 typically overfits to the trace style while epoch 2 gives the best generalisation.
Usage
GGUF (llama.cpp / Ollama / llama-swap)
Download Devstral-Small-2-24B-Opus-Reasoning.Q5_K_M.gguf for best quality, or Devstral-Small-2-24B-Opus-Reasoning.Q4_K_M.gguf if VRAM is tight.
```sh
# llama.cpp
./llama-cli -m Devstral-Small-2-24B-Opus-Reasoning.Q5_K_M.gguf \
  --chat-template mistral \
  -p "[INST] Write a Python function to find all prime numbers up to n using a sieve. [/INST]"
```
LoRA Adapter (Python)
Requires the base model. Because Devstral is a VLM (Pixtral vision encoder), the easiest path is the text-only extracted weights — see the technical notes below.
```python
import torch
from unsloth import FastLanguageModel
from peft import PeftModel

base_model_path = "path/to/Devstral-Small-2-24B-textonly"  # see notes
adapter_path = "adamjen/Devstral-Small-2-24B-Opus-Reasoning"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=base_model_path,
    max_seq_length=2048,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)
model = PeftModel.from_pretrained(model, adapter_path)

messages = [{"role": "user", "content": "Write a binary search in Python."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=2048, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Chat Template
This model uses Mistral's `[INST]...[/INST]` format. The model will produce a `<think>...</think>` block before its response.
```
[INST] Your question here [/INST]<think>
... reasoning ...
</think>
... answer ...
```
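Downstream code usually wants the reasoning and the answer separately. A small sketch for splitting the generated text (this helper is illustrative, not part of the model's tooling; it assumes at most one `<think>...</think>` block, matching the training format):

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no <think> block is present."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not match:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

reasoning, answer = split_reasoning(
    "<think>\nConsider edge cases...\n</think>\nHere is the code."
)
# → reasoning == "Consider edge cases...", answer == "Here is the code."
```

`re.DOTALL` is needed because the reasoning block spans multiple lines.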
Technical Notes: The Devstral Extraction Problem
Devstral-Small-2-24B ships as a Mistral3ForConditionalGeneration (VLM) with a Pixtral vision encoder. Training it as a text-only model on a single 24GB GPU hits several problems:
- FP8 weights: The official instruct release uses FP8 quantisation, which requires compute capability ≥ 8.9. The RTX 3090 is 8.6, so the weights must first be dequantised to BF16.
- Vision encoder VRAM: The Pixtral encoder consumes ~4GB of VRAM, leaving insufficient headroom for 4-bit QLoRA weights plus gradients.
- Device map splitting: With the VLM loaded via `device_map="auto"`, accelerate splits layers across GPU and CPU, breaking distributed training mode.
- transformers 5.x concurrent loader: The async tensor loader materialises all BF16 tensors simultaneously before quantisation, causing OOM. Fix: set `HF_DEACTIVATE_ASYNC_LOAD=1`.

Solution: Extract the Ministral3ForCausalLM language layers into a standalone text-only model directory (stripping `vision_tower.*` and `multi_modal_projector.*`, renaming `language_model.model.*` → `model.*`). This produces a clean 23B causal LM loadable by FastLanguageModel.
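The key surgery is plain state-dict key filtering and renaming. A minimal sketch of that mapping (an illustrative helper, not the exact script from the write-up; the sample keys below are shortened stand-ins for real tensor names):

```python
def extract_text_only(state_dict: dict) -> dict:
    """Drop vision weights and rename language-model keys for a standalone causal LM."""
    out = {}
    prefix = "language_model.model."
    for key, tensor in state_dict.items():
        # Strip the Pixtral vision encoder and the multimodal projector entirely.
        if key.startswith(("vision_tower.", "multi_modal_projector.")):
            continue
        # language_model.model.* -> model.*
        if key.startswith(prefix):
            key = "model." + key[len(prefix):]
        out[key] = tensor
    return out

sample = {
    "vision_tower.patch_conv.weight": "...",
    "multi_modal_projector.linear_1.weight": "...",
    "language_model.model.layers.0.self_attn.q_proj.weight": "...",
    "lm_head.weight": "...",
}
print(sorted(extract_text_only(sample)))
# → ['lm_head.weight', 'model.layers.0.self_attn.q_proj.weight']
```

The same pass over the real safetensors shards, followed by a config edit to the causal-LM architecture, yields the text-only directory the LoRA code above loads.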
Full write-up with all fixes: Fine-tuning Devstral on an RTX 3090
Hardware Requirements
| Format | Min VRAM |
|---|---|
| Q4_K_M GGUF | ~16GB |
| Q5_K_M GGUF | ~18GB |
| LoRA inference (4-bit) | ~20GB |
| LoRA training (QLoRA) | 24GB |
Limitations
- Trained on 2,322 samples — a small dataset. Performance gains on reasoning are real but limited in breadth.
- Max sequence length 2048 tokens (training constraint). Longer contexts may degrade quality.
- The `<think>` block reasoning style is inherited from Claude Opus traces; the model may produce verbose reasoning.
- Not evaluated on formal benchmarks.
Author
Adam Jenner — adamjenner.com.au