NV-Reason-CXR-3B GGUF (Quantized for Edge)
Quantized GGUF versions of NVIDIA's NV-Reason-CXR-3B vision-language model, optimized for edge deployment with Cactus Compute and llama.cpp.
Model Description
This repository contains quantized versions of NV-Reason-CXR-3B, a 3B parameter vision-language model specialized in chest X-ray analysis. The model has been converted to GGUF format and quantized for efficient deployment on edge devices (mobile, desktop, embedded systems).
- Original Model: nvidia/NV-Reason-CXR-3B
- Base Architecture: Qwen2.5-VL 3B Instruct
- Conversion: llama.cpp
- Quantization: llama-cpp-python
Available Models
| Filename | Format | Size | Use Case | Quality | Speed |
|---|---|---|---|---|---|
| nv-reason-cxr-3b-fp16.gguf | FP16 | 6.3 GB | Desktop with GPU (quality reference) | 100% | Baseline |
| nv-reason-cxr-3b-Q4_K_M.gguf | Q4_K_M | 1.96 GB | Recommended for edge devices | 90-95% | Fast |
| mmproj-nv-reason-cxr-3b-f16.gguf | FP16 mmproj | 1.25 GB | Vision encoder (required for image analysis) | 100% | - |
Model Details
Q4_K_M (Recommended):
- Size: 1.96 GB (69% reduction from FP16)
- Compression: 3.23x from original
- Quality: 90-95% retention
- Speed: 8-20 tokens/sec on mobile (device-dependent)
- RAM Required: 3-4 GB
- Best for: Mid-range to high-end mobile devices
FP16 (Reference):
- Size: 6.3 GB
- Quality: Original precision
- Speed: Slower than quantized
- RAM Required: 8+ GB
- Best for: Desktop inference, quality comparison
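The size figures above can be checked with a few lines of arithmetic. This is a minimal sketch using the rounded file sizes from the table (6.3 GB and 1.96 GB); `size_reduction` is a hypothetical helper name. With these rounded inputs the ratio comes out near 3.21x, so the 3.23x listed above presumably reflects exact byte counts.

```python
def size_reduction(original_gb: float, quantized_gb: float) -> tuple[float, float]:
    """Return (percent reduction, compression ratio) for two file sizes."""
    reduction_pct = (1 - quantized_gb / original_gb) * 100
    ratio = original_gb / quantized_gb
    return reduction_pct, ratio


# Rounded sizes taken from the "Available Models" table.
reduction, ratio = size_reduction(6.3, 1.96)
print(f"{reduction:.0f}% smaller, {ratio:.2f}x compression")
```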
Performance Benchmarks
Desktop (Apple M3 Mac)
Q4_K_M Performance:
| Configuration | Load Time | Inference Speed | Memory Usage |
|---|---|---|---|
| CPU-only | 1.87s | 29.61 tok/s | ~2 GB RAM |
| M3 GPU (Metal) | 0.34s | 33.24 tok/s | ~2 GB RAM |
| Speedup | 5.46x faster ⚡ | 1.12x faster | Same |
Key Insights:
- 🚀 GPU provides 5.46x faster model loading - Huge benefit for app cold starts!
- ⚡ Modest 1.12x inference speedup - Q4_K_M is already highly CPU-optimized
- ✅ Excellent CPU performance - GPU acceleration is optional, not required
- 💪 Mobile devices are expected to run well even without a dedicated GPU (projection, not measured)
Test Hardware: Apple M3 MacBook Pro (Metal GPU support)
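To see how load time and generation speed combine, the sketch below estimates end-to-end latency for a 512-token response from the benchmark figures above. It is a rough model only: `total_latency_s` is a hypothetical helper, and prompt processing and image encoding time are ignored.

```python
def total_latency_s(load_time_s: float, tokens_per_s: float, n_tokens: int) -> float:
    """Cold-start latency: model load time plus time to generate n_tokens."""
    return load_time_s + n_tokens / tokens_per_s


# Figures from the Q4_K_M benchmark table (Apple M3).
cpu = total_latency_s(1.87, 29.61, 512)
gpu = total_latency_s(0.34, 33.24, 512)
print(f"CPU: {cpu:.1f}s, GPU: {gpu:.1f}s, overall speedup: {cpu / gpu:.2f}x")
```

Under these assumptions the overall gain for a cold start is much closer to the inference speedup than to the loading speedup, because generation time dominates for long responses.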
Mobile Projections
| Device | RAM | Expected Speed | Load Time | Rating |
|---|---|---|---|---|
| Budget Android | 3GB | 3-5 tok/s | 30-45s | Poor |
| Mid-range Android | 4GB | 8-12 tok/s | 20-30s | Good |
| High-end Android | 6GB | 15-20 tok/s | 15-25s | Excellent |
| iPhone 12+ | 4-6GB | 12-18 tok/s | 15-20s | Excellent |
| iPhone 14+ | 6GB+ | 18-25 tok/s | 10-15s | Optimal |
Minimum Requirements:
- 4GB RAM
- 3GB free storage
- iOS 14+ or Android 8+
Usage
With llama.cpp
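# Clone and build llama.cpp (CMake build)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build -j
# Download the model and the vision encoder (both are required for image input)
huggingface-cli download samwell/NV-Reason-CXR-3B-GGUF \
    nv-reason-cxr-3b-Q4_K_M.gguf mmproj-nv-reason-cxr-3b-f16.gguf --local-dir ./models
# Run multimodal inference with the multimodal CLI
./build/bin/llama-mtmd-cli \
    -m models/nv-reason-cxr-3b-Q4_K_M.gguf \
    --mmproj models/mmproj-nv-reason-cxr-3b-f16.gguf \
    --image xray.jpg \
    -p "Analyze this chest X-ray image." \
    -n 512 \
    --temp 0.3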
With llama-cpp-python
from llama_cpp import Llama
# Pick ONE of the two options; each Llama(...) call loads the model.
# Option 1: CPU-only (29.61 tok/s on M3)
llm = Llama(
    model_path="nv-reason-cxr-3b-Q4_K_M.gguf",
    n_ctx=4096,
    n_threads=4,
    n_gpu_layers=0,  # CPU-only
)
# Option 2: GPU acceleration (5.46x faster loading on M3)
# llm = Llama(
#     model_path="nv-reason-cxr-3b-Q4_K_M.gguf",
#     n_ctx=4096,
#     n_threads=4,
#     n_gpu_layers=-1,  # Offload all layers (Metal on Mac, CUDA on Linux/Windows)
# )
# Text-only smoke test: no image is passed in this call.
# Image input also requires the mmproj file, loaded through a multimodal chat handler.
response = llm(
    "List the key findings a radiologist looks for in a chest X-ray.",
    max_tokens=512,
    temperature=0.3,  # Lower for medical = more deterministic
    top_p=0.9,
)
print(response['choices'][0]['text'])
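The call above sends text only. For image input, llama-cpp-python's chat API accepts `image_url` content parts, including base64 data URIs. The following is a minimal sketch of building such a request. The helper names (`image_to_data_uri`, `build_messages`, `analyze_xray`) are hypothetical, and the handler class `Qwen25VLChatHandler` is an assumption about what your installed llama-cpp-python version provides; check its multimodal documentation before relying on it.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_uri(path: str) -> str:
    """Encode a local image file as a base64 data URI."""
    mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
    data = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{data}"


def build_messages(prompt: str, image_uri: str) -> list:
    """Build an OpenAI-style chat message carrying one text part and one image part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_uri}},
            ],
        }
    ]


def analyze_xray(model_path: str, mmproj_path: str, image_path: str, prompt: str) -> str:
    """Run one image + text chat completion (requires llama-cpp-python and both GGUF files)."""
    # Imported lazily so the helpers above work without llama-cpp-python installed.
    from llama_cpp import Llama
    # ASSUMPTION: this handler class exists in your llama-cpp-python release.
    from llama_cpp.llama_chat_format import Qwen25VLChatHandler

    handler = Qwen25VLChatHandler(clip_model_path=mmproj_path)
    llm = Llama(model_path=model_path, chat_handler=handler, n_ctx=4096)
    out = llm.create_chat_completion(
        messages=build_messages(prompt, image_to_data_uri(image_path)),
        max_tokens=512,
        temperature=0.3,
    )
    return out["choices"][0]["message"]["content"]
```

A call would then look like `analyze_xray("nv-reason-cxr-3b-Q4_K_M.gguf", "mmproj-nv-reason-cxr-3b-f16.gguf", "xray.jpg", "Describe the key findings.")`.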
With Cactus Compute (Flutter/Mobile)
Note: You need BOTH the model file AND the mmproj file for image analysis.
import 'package:cactus/cactus.dart';
// Initialize VLM with both model and mmproj files
final vlm = CactusVLM();
await vlm.init(
modelFilename: 'nv-reason-cxr-3b-Q4_K_M.gguf', // Model file
mmprojFilename: 'mmproj-nv-reason-cxr-3b-f16.gguf', // Vision encoder
contextSize: 2048, // Context window (2K-4K for mobile)
threads: 4, // CPU threads
gpuLayers: 0, // CPU-only (GPU may cause issues on some devices)
);
// Create prompt
final messages = [
ChatMessage(
role: 'system',
content: 'You are a helpful radiologist assistant.',
),
ChatMessage(
role: 'user',
content: 'Describe what you see in this chest X-ray image.',
),
];
// Analyze X-ray
final response = await vlm.completion(
messages,
imagePaths: ['path/to/xray.jpg'],
maxTokens: 150,
temperature: 0.1, // Lower for medical analysis (0.1-0.5)
);
print(response.text);
Mobile GPU Benefits:
- 🚀 Faster model loading, critical for app startup (5.46x on the M3 desktop benchmark above; not yet measured on mobile)
- 📱 Better user experience on iOS (Metal) and Android (Vulkan/OpenCL)
- 🔋 Minimal battery impact during loading phase
- ✅ Falls back gracefully to CPU if GPU unavailable
Inference Parameters
Recommended settings for medical analysis:
{
"temperature": 0.3, # Lower = more deterministic (range: 0.1-0.5)
"top_p": 0.9, # Nucleus sampling
"top_k": 40, # Top-k sampling
"repeat_penalty": 1.1, # Avoid repetition
"max_tokens": 512, # Response length
"n_ctx": 4096, # Context window (2048-4096 for mobile)
}
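In llama-cpp-python these settings go to two different places: `n_ctx` is fixed when the model is loaded, while the others are passed per request. A minimal sketch of splitting the dictionary accordingly (`split_params` and `LOAD_TIME_KEYS` are hypothetical names):

```python
RECOMMENDED = {
    "temperature": 0.3,
    "top_p": 0.9,
    "top_k": 40,
    "repeat_penalty": 1.1,
    "max_tokens": 512,
    "n_ctx": 4096,
}

# Settings consumed by the Llama(...) constructor rather than per request.
LOAD_TIME_KEYS = {"n_ctx"}


def split_params(params: dict) -> tuple[dict, dict]:
    """Separate load-time settings from per-request sampling settings."""
    load = {k: v for k, v in params.items() if k in LOAD_TIME_KEYS}
    sampling = {k: v for k, v in params.items() if k not in LOAD_TIME_KEYS}
    return load, sampling


load_kwargs, sampling_kwargs = split_params(RECOMMENDED)
print(load_kwargs)
print(sampling_kwargs)
```

The two dictionaries can then be unpacked as `Llama(model_path="nv-reason-cxr-3b-Q4_K_M.gguf", **load_kwargs)` and `llm.create_chat_completion(messages=messages, **sampling_kwargs)`.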
Files Included
.
├── README.md # Model card and usage guide
├── LICENSE # NSCLV1 license
├── CONVERSION_PROCESS.md # Technical conversion details
├── nv-reason-cxr-3b-Q4_K_M.gguf # Q4_K_M quantized (1.96 GB) - Recommended
├── nv-reason-cxr-3b-fp16.gguf # FP16 reference (6.3 GB)
└── mmproj-nv-reason-cxr-3b-f16.gguf # Vision encoder (1.25 GB) - Required
Model Card
Model Details
- Developed by: NVIDIA (original), quantized by samwell
- Model type: Vision-Language Model (VLM)
- Architecture: Qwen2.5-VL
- Parameters: 3 billion
- Language: English
- License: NSCLV1 (see LICENSE)
- Fine-tuned from: Qwen2.5-VL-3B-Instruct
- Specialty: Chest X-ray analysis
Intended Use
Primary Use Cases:
- Research in medical image analysis
- Educational purposes for radiology students
- Prototyping mobile medical AI applications
- Edge deployment of medical VLMs
Out-of-Scope:
- ❌ Clinical diagnosis or treatment decisions
- ❌ Production medical applications without proper validation
- ❌ Replacing trained radiologists
- ❌ Any FDA-regulated medical use
Limitations
- Not for Clinical Use: This model is for research and educational purposes only
- Quality Trade-off: Quantization reduces model size but may affect accuracy
- Domain Specific: Trained primarily on chest X-rays, may not generalize to other imaging
- Requires Validation: All outputs should be verified by medical professionals
- Mobile Performance: Speed varies significantly by device capabilities
Ethical Considerations
- Model outputs should not be used for medical diagnosis
- Always consult qualified healthcare professionals
- Be aware of potential biases in training data
- Ensure patient privacy when using with real medical images
- Comply with local healthcare regulations (HIPAA, GDPR, etc.)
Bias and Fairness
The original model may have inherited biases from training data. Users should:
- Test across diverse patient populations
- Validate performance on their specific use cases
- Monitor for unexpected outputs or biases
- Not rely solely on model outputs
Technical Details
Quantization Method
Q4_K_M uses llama.cpp's 4-bit k-quant scheme (the "K" refers to k-quants, not K-means clustering):
- Most weights stored in 4 bits instead of 16 (FP16)
- Weights grouped into blocks, each with its own scale and minimum
- Medium ("M") variant keeps some sensitive tensors at higher precision to balance size and quality
- Per-block scales for better accuracy preservation
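The idea behind per-block scales can be illustrated with a simplified sketch. This is not llama.cpp's actual Q4_K layout (which packs codes into bytes and also quantizes the scales inside larger super-blocks); the function names and the block size of 32 are assumptions chosen for illustration.

```python
def quantize_block(values: list, bits: int = 4) -> tuple:
    """Map one block of floats to unsigned integer codes plus a scale and minimum."""
    levels = (1 << bits) - 1  # 15 distinct steps for 4-bit codes
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def quantize(weights: list, block_size: int = 32) -> list:
    """Quantize a flat weight list block by block, each with its own scale."""
    return [
        quantize_block(weights[i:i + block_size])
        for i in range(0, len(weights), block_size)
    ]


def dequantize(blocks: list) -> list:
    """Reconstruct approximate floats from (codes, scale, minimum) blocks."""
    return [lo + c * scale for codes, scale, lo in blocks for c in codes]


weights = [((i * 37) % 101) / 50.0 - 1.0 for i in range(64)]
blocks = quantize(weights)
restored = dequantize(blocks)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{len(blocks)} blocks, max round-trip error {max_err:.4f}")
```

Because each block has its own scale, a block with a narrow value range is reconstructed more precisely than it would be under a single global scale.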
Vision Encoder
The vision encoder has been extracted into a separate mmproj file for compatibility:
- File: mmproj-nv-reason-cxr-3b-f16.gguf (1.25 GB)
- Required: Both the model file AND mmproj file are needed for image analysis
- Format: FP16 (vision encoder left unquantized)
- Extracted from: NVIDIA's NV-Reason-CXR-3B original model
- Contains: Vision transformer blocks and multimodal projection layers (519 tensors)
Why separate mmproj?
- Mobile frameworks (Cactus Compute) require separate mmproj architecture
- Allows independent caching and loading strategies
- Enables mixing different model quantizations with same vision encoder
Context Window
- Training: 128,000 tokens
- Recommended for mobile: 2,048-4,096 tokens
- Desktop: Up to 128,000 tokens (RAM-dependent)
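Context length is RAM-dependent mainly because the key/value cache grows linearly with it. The sketch below gives a rough estimate. The architecture values (36 layers, 2 KV heads, head dimension 128) are assumptions about the Qwen2.5-VL 3B base model and are not stated in this repository, as is the FP16 cache; `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(
    n_ctx: int,
    n_layers: int = 36,        # ASSUMPTION: Qwen2.5-VL 3B layer count
    n_kv_heads: int = 2,       # ASSUMPTION: grouped-query attention KV heads
    head_dim: int = 128,       # ASSUMPTION: per-head dimension
    bytes_per_value: int = 2,  # ASSUMPTION: FP16 cache entries
) -> int:
    """Approximate KV cache size; the leading 2 accounts for keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_ctx


for n_ctx in (2048, 4096, 128000):
    print(f"{n_ctx:>7} tokens: {kv_cache_bytes(n_ctx) / 2**20:,.0f} MiB")
```

Under these assumptions the cache stays in the low hundreds of MiB at mobile context sizes and reaches several GiB at the full training length, on top of the model weights.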
Citation
If you use this model, please cite the original work:
@misc{nvidia2024nvreasoncrx,
title={NV-Reason-CXR-3B: A Specialized Vision-Language Model for Chest X-ray Analysis},
author={NVIDIA},
year={2024},
publisher={Hugging Face},
url={https://huggingface.co/nvidia/NV-Reason-CXR-3B}
}
And optionally cite the quantization:
@misc{nvreasoncrx3b-gguf,
title={NV-Reason-CXR-3B GGUF: Quantized for Edge Deployment},
author={samwell},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/samwell/NV-Reason-CXR-3B-GGUF}
}
Acknowledgments
- NVIDIA for the original NV-Reason-CXR-3B model
- Qwen Team for the Qwen2.5-VL architecture
- llama.cpp contributors for the GGUF format and conversion tools
- Cactus Compute for mobile VLM deployment framework
License
This model inherits the NSCLV1 license from the original NV-Reason-CXR-3B model. See LICENSE for details.
Key points:
- Research and educational use permitted
- Commercial use may require additional permissions
- Not for clinical/diagnostic use
- See original model card for complete license terms
Disclaimer
⚠️ IMPORTANT MEDICAL DISCLAIMER
This model is provided for RESEARCH AND EDUCATIONAL PURPOSES ONLY. It is:
- NOT intended for clinical diagnosis or treatment
- NOT FDA approved or clinically validated
- NOT a substitute for professional medical advice
- NOT validated for production medical use
Always consult qualified healthcare professionals for medical decisions. The creators and distributors of this model assume no liability for any use of this software.
Contact & Support
- Issues: Report issues on GitHub (link to your repo)
- Questions: See documentation in this repository
- Original Model: nvidia/NV-Reason-CXR-3B
- Cactus Compute: GitHub
Version History
- v1.0 (2025-11-05): Initial release
  - FP16 GGUF conversion
  - Q4_K_M quantization
  - Benchmarked on macOS; mobile figures are projections
  - Complete documentation and scripts
For research and educational purposes only. Not for clinical use.