Instructions to use samwell/NV-Reason-CXR-3B-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use samwell/NV-Reason-CXR-3B-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

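llm = Llama.from_pretrained(
	repo_id="samwell/NV-Reason-CXR-3B-GGUF",
	# Load the language model weights. The mmproj-*.gguf file is the vision
	# encoder and cannot be loaded as the main model; image input additionally
	# needs a vision chat handler that loads the mmproj file.
	filename="nv-reason-cxr-3b-Q4_K_M.gguf",
)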

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "Describe this image in one sentence."
				},
				{
					"type": "image_url",
					"image_url": {
						"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
					}
				}
			]
		}
	]
)

Local Apps

llama.cpp

How to use samwell/NV-Reason-CXR-3B-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf samwell/NV-Reason-CXR-3B-GGUF:F16
# Run inference directly in the terminal:
llama-cli -hf samwell/NV-Reason-CXR-3B-GGUF:F16

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf samwell/NV-Reason-CXR-3B-GGUF:F16
# Run inference directly in the terminal:
llama-cli -hf samwell/NV-Reason-CXR-3B-GGUF:F16

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf samwell/NV-Reason-CXR-3B-GGUF:F16
# Run inference directly in the terminal:
./llama-cli -hf samwell/NV-Reason-CXR-3B-GGUF:F16

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf samwell/NV-Reason-CXR-3B-GGUF:F16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf samwell/NV-Reason-CXR-3B-GGUF:F16

Use Docker

docker model run hf.co/samwell/NV-Reason-CXR-3B-GGUF:F16

vLLM

How to use samwell/NV-Reason-CXR-3B-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "samwell/NV-Reason-CXR-3B-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "samwell/NV-Reason-CXR-3B-GGUF",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
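
The same request can be issued from Python. The following is a minimal sketch using only the standard library; it assumes the server address and model name from the curl example above, and the helper names `build_chat_request` and `send` are illustrative, not part of any library.

```python
import json
import urllib.request

# Assumed defaults, copied from the curl example.
BASE_URL = "http://localhost:8000/v1"
MODEL = "samwell/NV-Reason-CXR-3B-GGUF"


def build_chat_request(prompt: str, image_url: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `send(build_chat_request(prompt, url))` requires a running server; building the request on its own has no network side effects, which makes the payload easy to inspect.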

Ollama
How to use samwell/NV-Reason-CXR-3B-GGUF with Ollama:
```
ollama run hf.co/samwell/NV-Reason-CXR-3B-GGUF:F16
```

Unsloth Studio

How to use samwell/NV-Reason-CXR-3B-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for samwell/NV-Reason-CXR-3B-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for samwell/NV-Reason-CXR-3B-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for samwell/NV-Reason-CXR-3B-GGUF to start chatting

Docker Model Runner
How to use samwell/NV-Reason-CXR-3B-GGUF with Docker Model Runner:
```
docker model run hf.co/samwell/NV-Reason-CXR-3B-GGUF:F16
```

Lemonade

How to use samwell/NV-Reason-CXR-3B-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull samwell/NV-Reason-CXR-3B-GGUF:F16

Run and chat with the model

lemonade run user.NV-Reason-CXR-3B-GGUF-F16

List all available models

lemonade list

NV-Reason-CXR-3B GGUF (Quantized for Edge)

Quantized GGUF versions of NVIDIA's NV-Reason-CXR-3B vision-language model, optimized for edge deployment with Cactus Compute and llama.cpp.

Model Description

This repository contains quantized versions of NV-Reason-CXR-3B, a 3B parameter vision-language model specialized in chest X-ray analysis. The model has been converted to GGUF format and quantized for efficient deployment on edge devices (mobile, desktop, embedded systems).

Original Model: nvidia/NV-Reason-CXR-3B
Base Architecture: Qwen2.5-VL 3B Instruct
Conversion: llama.cpp
Quantization: llama-cpp-python

Available Models

Filename	Format	Size	Use Case	Quality	Speed
`nv-reason-cxr-3b-fp16.gguf`	FP16	6.3 GB	Desktop with GPU (quality reference)	100%	Baseline
`nv-reason-cxr-3b-Q4_K_M.gguf`	Q4_K_M	1.96 GB	Recommended for edge devices	90-95%	Fast
`mmproj-nv-reason-cxr-3b-f16.gguf`	FP16 mmproj	1.25 GB	Vision encoder (required for image analysis)	100%	-

Model Details

Q4_K_M (Recommended):

Size: 1.96 GB (69% reduction from FP16)
Compression: 3.23x from original
Quality: 90-95% retention
Speed: 8-20 tokens/sec on mobile (device-dependent)
RAM Required: 3-4 GB
Best for: Mid-range to high-end mobile devices
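
The size figures above can be checked with a little arithmetic. This sketch uses the rounded sizes from the table; the function names are illustrative. With these rounded inputs the ratio comes out near 3.21x rather than the 3.23x listed, which presumably reflects unrounded file sizes.

```python
FP16_GB = 6.3     # nv-reason-cxr-3b-fp16.gguf, rounded size from the table
Q4_K_M_GB = 1.96  # nv-reason-cxr-3b-Q4_K_M.gguf, rounded size from the table


def size_reduction_percent(original_gb: float, quantized_gb: float) -> float:
    """Percentage of the original size saved by quantization."""
    return (1 - quantized_gb / original_gb) * 100


def compression_ratio(original_gb: float, quantized_gb: float) -> float:
    """How many times smaller the quantized file is."""
    return original_gb / quantized_gb


print(f"reduction: {size_reduction_percent(FP16_GB, Q4_K_M_GB):.0f}%")
print(f"compression: {compression_ratio(FP16_GB, Q4_K_M_GB):.2f}x")
```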

FP16 (Reference):

Size: 6.3 GB
Quality: Original precision
Speed: Slower than quantized
RAM Required: 8+ GB
Best for: Desktop inference, quality comparison

Performance Benchmarks

Desktop (Apple M3 Mac)

Q4_K_M Performance:

Configuration	Load Time	Inference Speed	Memory Usage
CPU-only	1.87s	29.61 tok/s	~2 GB RAM
M3 GPU (Metal)	0.34s	33.24 tok/s	~2 GB RAM
Speedup	5.46x faster ⚡	1.12x faster	Same

Key Insights:

🚀 GPU provides 5.46x faster model loading - Huge benefit for app cold starts!
⚡ Modest 1.12x inference speedup - Q4_K_M is already highly CPU-optimized
✅ Excellent CPU performance - GPU acceleration is optional, not required
💪 Mobile devices are expected to run well even without a dedicated GPU (see the projections below)

Test Hardware: Apple M3 MacBook Pro (Metal GPU support)
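
To put the two speedups in perspective, the sketch below recomputes them from the table and estimates end-to-end time for a 512-token response. The variable and function names are illustrative, and the rounded timings give a load speedup of about 5.5x, slightly above the 5.46x reported, which presumably comes from unrounded measurements.

```python
# Measurements copied from the Q4_K_M table above.
cpu = {"load_s": 1.87, "tok_per_s": 29.61}
gpu = {"load_s": 0.34, "tok_per_s": 33.24}

load_speedup = cpu["load_s"] / gpu["load_s"]
inference_speedup = gpu["tok_per_s"] / cpu["tok_per_s"]


def total_seconds(config: dict, n_tokens: int) -> float:
    """Cold-start latency: model load time plus generation time."""
    return config["load_s"] + n_tokens / config["tok_per_s"]


print(f"load speedup: {load_speedup:.2f}x")
print(f"inference speedup: {inference_speedup:.2f}x")
for name, config in (("CPU", cpu), ("GPU", gpu)):
    print(f"{name}: {total_seconds(config, 512):.1f} s for 512 tokens")
```

For long responses, generation time dominates, so the load-time advantage matters most for short interactions and cold starts.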

Mobile Projections

Device	RAM	Expected Speed	Load Time	Rating
Budget Android	3GB	3-5 tok/s	30-45s	Poor
Mid-range Android	4GB	8-12 tok/s	20-30s	Good
High-end Android	6GB	15-20 tok/s	15-25s	Excellent
iPhone 12+	4-6GB	12-18 tok/s	15-20s	Excellent
iPhone 14+	6GB+	18-25 tok/s	10-15s	Optimal

Minimum Requirements:

4GB RAM
3GB free storage
iOS 14+ or Android 8+

Usage

With llama.cpp

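# Clone and build llama.cpp (CMake; the Makefile build is deprecated in recent versions)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build -j

# Download the model and the vision encoder (both are needed for images)
huggingface-cli download samwell/NV-Reason-CXR-3B-GGUF \
  nv-reason-cxr-3b-Q4_K_M.gguf mmproj-nv-reason-cxr-3b-f16.gguf \
  --local-dir ./models

# Run inference on an image
# (llama-mtmd-cli is the multimodal CLI in recent llama.cpp builds;
# the binary name may differ in older versions)
./build/bin/llama-mtmd-cli \
  -m models/nv-reason-cxr-3b-Q4_K_M.gguf \
  --mmproj models/mmproj-nv-reason-cxr-3b-f16.gguf \
  -p "Analyze this chest X-ray image." \
  --image xray.jpg \
  -n 512 \
  --temp 0.3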

With llama-cpp-python

from llama_cpp import Llama

# Choose one value for n_gpu_layers (loading the model twice is unnecessary):
#   0   CPU-only (works great, 29.61 tok/s on M3)
#   -1  Offload all layers to the GPU (Metal on Mac, CUDA on Linux/Windows;
#       5.46x faster loading)
llm = Llama(
    model_path="nv-reason-cxr-3b-Q4_K_M.gguf",
    n_ctx=4096,
    n_threads=4,
    n_gpu_layers=-1,
)

# NOTE: this call is text-only; no image is attached. Image analysis
# additionally requires the mmproj file, loaded through a vision chat handler.
response = llm(
    "Analyze this chest X-ray image and identify key findings.",
    max_tokens=512,
    temperature=0.3,  # Lower for medical = more deterministic
    top_p=0.9,
)

# The completion API returns a dict; the generated text is in choices[0]
print(response['choices'][0]['text'])
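
The completion call above passes only a text prompt. To send a local X-ray through a chat-style API, the image is commonly embedded as a base64 data URI in an OpenAI-style message. The sketch below shows that packaging step with the standard library only; the helper names are illustrative, and whether a given runtime accepts the resulting messages depends on it being configured with the mmproj file, which is an assumption here.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_uri(path: str) -> str:
    """Encode a local image file as a base64 data URI."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Not a recognized image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_xray_messages(image_path: str, prompt: str) -> list:
    """Build OpenAI-style chat messages pairing a prompt with one image."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": image_to_data_uri(image_path)},
                },
            ],
        }
    ]
```

The returned list has the same shape as the `messages` argument used with `create_chat_completion` earlier in this document.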

With Cactus Compute (Flutter/Mobile)

Note: You need BOTH the model file AND the mmproj file for image analysis.

import 'package:cactus/cactus.dart';

// Initialize VLM with both model and mmproj files
final vlm = CactusVLM();
await vlm.init(
  modelFilename: 'nv-reason-cxr-3b-Q4_K_M.gguf',      // Model file
  mmprojFilename: 'mmproj-nv-reason-cxr-3b-f16.gguf', // Vision encoder
  contextSize: 2048,    // Context window (2K-4K for mobile)
  threads: 4,           // CPU threads
  gpuLayers: 0,         // CPU-only (GPU may cause issues on some devices)
);

// Create prompt
final messages = [
  ChatMessage(
    role: 'system',
    content: 'You are a helpful radiologist assistant.',
  ),
  ChatMessage(
    role: 'user',
    content: 'Describe what you see in this chest X-ray image.',
  ),
];

// Analyze X-ray
final response = await vlm.completion(
  messages,
  imagePaths: ['path/to/xray.jpg'],
  maxTokens: 150,
  temperature: 0.1,     // Lower for medical analysis (0.1-0.5)
);

print(response.text);

Mobile GPU Benefits:

🚀 5.46x faster model loading (critical for app startup)
📱 Better user experience on iOS (Metal) and Android (Vulkan/OpenCL)
🔋 Minimal battery impact during loading phase
✅ Falls back gracefully to CPU if GPU unavailable

Inference Parameters

Recommended settings for medical analysis:

{
    "temperature": 0.3,      # Lower = more deterministic (range: 0.1-0.5)
    "top_p": 0.9,            # Nucleus sampling
    "top_k": 40,             # Top-k sampling
    "repeat_penalty": 1.1,   # Avoid repetition
    "max_tokens": 512,       # Response length
    "n_ctx": 4096,          # Context window (2048-4096 for mobile)
}
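
These settings do not all go to the same place: in llama-cpp-python, `n_ctx` is set when the model is loaded, while the rest are per-request sampling options. The sketch below splits the dictionary accordingly and clamps the temperature to the suggested range; `split_params` and `clamp_temperature` are illustrative helpers, not library functions.

```python
RECOMMENDED = {
    "temperature": 0.3,
    "top_p": 0.9,
    "top_k": 40,
    "repeat_penalty": 1.1,
    "max_tokens": 512,
    "n_ctx": 4096,
}

# Keys consumed at model load time rather than per request.
LOAD_KEYS = {"n_ctx"}


def split_params(params: dict) -> tuple:
    """Separate load-time options from per-request sampling options."""
    load = {k: v for k, v in params.items() if k in LOAD_KEYS}
    sampling = {k: v for k, v in params.items() if k not in LOAD_KEYS}
    return load, sampling


def clamp_temperature(value: float, low: float = 0.1, high: float = 0.5) -> float:
    """Keep temperature inside the range suggested for medical analysis."""
    return min(max(value, low), high)


load_options, sampling_options = split_params(RECOMMENDED)
sampling_options["temperature"] = clamp_temperature(sampling_options["temperature"])

# Intended use (requires llama-cpp-python and the model file):
#   llm = Llama(model_path="nv-reason-cxr-3b-Q4_K_M.gguf", **load_options)
#   llm.create_chat_completion(messages=messages, **sampling_options)
```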

Files Included

.
├── README.md                               # Model card and usage guide
├── LICENSE                                 # NSCLV1 license
├── CONVERSION_PROCESS.md                   # Technical conversion details
├── nv-reason-cxr-3b-Q4_K_M.gguf           # Q4_K_M quantized (1.96 GB) - Recommended
├── nv-reason-cxr-3b-fp16.gguf             # FP16 reference (6.3 GB)
└── mmproj-nv-reason-cxr-3b-f16.gguf       # Vision encoder (1.25 GB) - Required

Model Card

Model Details

Developed by: NVIDIA (original), quantized by samwell
Model type: Vision-Language Model (VLM)
Architecture: Qwen2.5-VL
Parameters: 3 billion
Language: English
License: NSCLV1 (see LICENSE)
Fine-tuned from: Qwen2.5-VL-3B-Instruct
Specialty: Chest X-ray analysis

Intended Use

Primary Use Cases:

Research in medical image analysis
Educational purposes for radiology students
Prototyping mobile medical AI applications
Edge deployment of medical VLMs

Out-of-Scope:

❌ Clinical diagnosis or treatment decisions
❌ Production medical applications without proper validation
❌ Replacing trained radiologists
❌ Any FDA-regulated medical use

Limitations

Not for Clinical Use: This model is for research and educational purposes only
Quality Trade-off: Quantization reduces model size but may affect accuracy
Domain Specific: Trained primarily on chest X-rays, may not generalize to other imaging
Requires Validation: All outputs should be verified by medical professionals
Mobile Performance: Speed varies significantly by device capabilities

Ethical Considerations

Model outputs should not be used for medical diagnosis
Always consult qualified healthcare professionals
Be aware of potential biases in training data
Ensure patient privacy when using with real medical images
Comply with local healthcare regulations (HIPAA, GDPR, etc.)

Bias and Fairness

The original model may have inherited biases from training data. Users should:

Test across diverse patient populations
Validate performance on their specific use cases
Monitor for unexpected outputs or biases
Not rely solely on model outputs

Technical Details

Quantization Method

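Q4_K_M is a 4-bit k-quant format from llama.cpp (the "K" denotes the k-quant family, not K-means clustering):

Weights stored in about 4 bits instead of 16 (FP16)
Blocks grouped into super-blocks whose scales are themselves quantized
Medium (M) variant keeps selected tensors at higher precision to balance size and quality
Per-block scales for better accuracy preservation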
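
As an illustration of per-block scales, the sketch below quantizes one block of weights to 4-bit codes with a shared scale and minimum, then reconstructs it. This is a deliberately simplified model of block-wise quantization, not llama.cpp's Q4_K implementation; the block size and function names are assumptions chosen for the example.

```python
BLOCK_SIZE = 32  # weights per block (illustrative)
LEVELS = 15      # 4 bits give integer codes 0..15


def quantize_block(weights: list) -> tuple:
    """Map floats to 4-bit codes using one scale and minimum per block."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / LEVELS or 1.0  # fall back to 1.0 for constant blocks
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_block(codes: list, scale: float, minimum: float) -> list:
    """Reconstruct approximate floats from the codes."""
    return [c * scale + minimum for c in codes]


weights = [(-1) ** i * i / 31 for i in range(BLOCK_SIZE)]
codes, scale, lo = quantize_block(weights)
restored = dequantize_block(codes, scale, lo)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.4f}, worst-case error={worst:.4f}")
```

Because each block carries its own scale, the rounding error is bounded by half of that block's scale, so blocks with a narrow value range lose less precision than they would under a single global scale.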

Vision Encoder

The vision encoder has been extracted into a separate mmproj file for compatibility:

File: mmproj-nv-reason-cxr-3b-f16.gguf (1.25 GB)
Required: Both the model file AND mmproj file are needed for image analysis
Format: FP16 (unquantized vision encoder)
Extracted from: NVIDIA's NV-Reason-CXR-3B original model
Contains: Vision transformer blocks and multimodal projection layers (519 tensors)

Why separate mmproj?

Mobile frameworks (Cactus Compute) require separate mmproj architecture
Allows independent caching and loading strategies
Enables mixing different model quantizations with same vision encoder
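
Because the two files ship separately, an application may want to verify that both are present before initializing. The following is a small sketch of such a check; the filenames come from this repository, while `resolve_model_files` is a hypothetical helper. Passing a different `model_file` shows how one mmproj file can be paired with either quantization.

```python
from pathlib import Path

MODEL_FILE = "nv-reason-cxr-3b-Q4_K_M.gguf"
MMPROJ_FILE = "mmproj-nv-reason-cxr-3b-f16.gguf"


def resolve_model_files(model_dir: str, model_file: str = MODEL_FILE) -> tuple:
    """Return (model_path, mmproj_path), raising if either file is missing."""
    directory = Path(model_dir)
    model_path = directory / model_file
    mmproj_path = directory / MMPROJ_FILE
    missing = [p.name for p in (model_path, mmproj_path) if not p.is_file()]
    if missing:
        raise FileNotFoundError(
            f"Missing in {directory}: {', '.join(missing)} "
            "(image analysis needs both the model and the mmproj file)"
        )
    return model_path, mmproj_path
```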

Context Window

Training: 128,000 tokens
Recommended for mobile: 2,048-4,096 tokens
Desktop: Up to 128,000 tokens (RAM-dependent)
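
The "RAM-dependent" caveat is largely about the key-value cache, which grows linearly with context length. The sketch below estimates its size. The layer count, KV-head count, and head dimension are assumptions based on the Qwen2.5 3B configuration and should be verified against the model's own config; the cache is assumed to be stored in FP16.

```python
def kv_cache_bytes(
    n_ctx: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # FP16 cache (assumption)
) -> int:
    """Estimate KV cache size: keys and values for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_ctx


# ASSUMED hyperparameters, not taken from this model card.
ASSUMED = {"n_layers": 36, "n_kv_heads": 2, "head_dim": 128}

for n_ctx in (2048, 4096, 128000):
    mib = kv_cache_bytes(n_ctx, **ASSUMED) / 2**20
    print(f"n_ctx={n_ctx}: about {mib:.0f} MiB of KV cache")
```

Under these assumptions the cache stays small at mobile context sizes but reaches several gigabytes at the full training context, in addition to the model weights.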

Citation

If you use this model, please cite the original work:

@misc{nvidia2024nvreasoncrx,
  title={NV-Reason-CXR-3B: A Specialized Vision-Language Model for Chest X-ray Analysis},
  author={NVIDIA},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/nvidia/NV-Reason-CXR-3B}
}

And optionally cite the quantization:

@misc{nvreasoncrx3b-gguf,
  title={NV-Reason-CXR-3B GGUF: Quantized for Edge Deployment},
  author={samwell},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/samwell/NV-Reason-CXR-3B-GGUF}
}

Acknowledgments

NVIDIA for the original NV-Reason-CXR-3B model
Qwen Team for the Qwen2.5-VL architecture
llama.cpp contributors for the GGUF format and conversion tools
Cactus Compute for mobile VLM deployment framework

License

This model inherits the NSCLV1 license from the original NV-Reason-CXR-3B model. See LICENSE for details.

Key points:

Research and educational use permitted
Commercial use may require additional permissions
Not for clinical/diagnostic use
See original model card for complete license terms

Disclaimer

⚠️ IMPORTANT MEDICAL DISCLAIMER

This model is provided for RESEARCH AND EDUCATIONAL PURPOSES ONLY. It is:

NOT intended for clinical diagnosis or treatment
NOT FDA approved or clinically validated
NOT a substitute for professional medical advice
NOT validated for production medical use

Always consult qualified healthcare professionals for medical decisions. The creators and distributors of this model assume no liability for any use of this software.

Contact & Support

Issues: Report issues on GitHub
Questions: See documentation in this repository
Original Model: nvidia/NV-Reason-CXR-3B
Cactus Compute: GitHub

Version History

v1.0 (2025-11-05): Initial release
- FP16 GGUF conversion
- Q4_K_M quantization
- Tested on macOS (mobile figures are projections)
- Complete documentation and scripts

For research and educational purposes only. Not for clinical use.
