# Mixtral-8x7B-Instruct-v0.1-NVFP4
NVIDIA NVFP4 quantized version of Mixtral 8x7B-Instruct for Blackwell architecture GPUs.
## Model Description
This is a 4-bit floating-point (NVFP4) quantized version of mistralai/Mixtral-8x7B-Instruct-v0.1, created using NVIDIA TensorRT Model Optimizer (modelopt).
| Metric | Value |
|---|---|
| Original Size | 86.99 GB |
| Quantized Size | 24.82 GB |
| Compression Ratio | 3.50x |
| Size Reduction | 71.5% |
| Quantization Method | NVFP4 (calibration-based) |
| Calibration Samples | 512 (C4 dataset) |
| Quantization Time | ~5.6 hours |
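The ratio lines up with a back-of-envelope estimate from the storage format (4-bit values plus one FP8 scale per 16-element block, detailed under Quantization Details below):

```math
\text{bits per weight} \approx 4 + \frac{8}{16} = 4.5,
\qquad \frac{16}{4.5} \approx 3.56\times
```

The measured 3.50x sits slightly below this because the output head and router layers are left unquantized (see below) and the per-tensor scales add a little metadata.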
## What is NVFP4?
NVFP4 is NVIDIA's native 4-bit floating-point format, introduced with the Blackwell architecture. Unlike integer quantization (INT4), NVFP4 stores each value in a compact E2M1 floating-point layout:
```
┌───────┬──────────┬──────────┐
│ Sign  │ Exponent │ Mantissa │
│ 1 bit │ 2 bits   │ 1 bit    │
└───────┴──────────┴──────────┘
```
This provides better dynamic range for neural network weights compared to uniform integer quantization.
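For intuition, all sixteen E2M1 code points can be enumerated in a few lines of Python. This is a sketch based on the bit layout above; the exponent bias of 1 and the subnormal handling are the standard E2M1 conventions:

```python
# Enumerate all 16 E2M1 (NVFP4) code points to show the dynamic range.
for code in range(16):
    sign = -1.0 if code & 0b1000 else 1.0
    exp = (code >> 1) & 0b11   # 2-bit exponent, bias = 1
    man = code & 0b1           # 1-bit mantissa
    if exp == 0:               # exponent 0 encodes subnormals: 0 or 0.5
        val = sign * man * 0.5
    else:                      # normal values: (1 + m/2) * 2^(exp - 1)
        val = sign * (1 + man / 2) * 2.0 ** (exp - 1)
    print(f"{code:04b} -> {val:+.1f}")
```

Only eight magnitudes (0, 0.5, 1, 1.5, 2, 3, 4, 6) are representable per code, which is why the per-block FP8 scales described under Weight Format below carry so much of the dynamic range.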
## Hardware Requirements
> ⚠️ **This model requires Blackwell-architecture GPUs (GB10, GB100, GB200) and TensorRT-LLM for inference.** Standard Hugging Face `transformers` cannot load this model directly because the weights are stored in the packed FP4 format.
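The packed layout is visible without loading the model through `transformers`: the safetensors shards can be inspected directly. A minimal sketch, assuming one shard is downloaded locally (the filename and tensor names here are illustrative):

```python
from safetensors import safe_open

# Shard name is illustrative; use an actual file from this repo.
with safe_open("model-00001-of-00006.safetensors", framework="pt") as f:
    for name in f.keys():
        if name.startswith("model.layers.0.self_attn.q_proj"):
            t = f.get_tensor(name)
            print(f"{name}: {t.dtype}, {tuple(t.shape)}")

# Expected pattern per quantized linear (see Weight Format below):
#   ...q_proj.weight          torch.uint8          (out, in // 2)
#   ...q_proj.weight_scale    torch.float8_e4m3fn  (out, in // 16)
#   ...q_proj.weight_scale_2  torch.float32        ()
#   ...q_proj.input_scale     torch.float32        ()
```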
### Current Compatibility Status (December 2025)

| Framework | Status |
|---|---|
| TensorRT-LLM | ⚠️ Partial support (GB10 not fully supported in v1.0.0) |
| vLLM | ❌ Not yet supported |
| Transformers | ❌ Cannot load packed FP4 weights |
This model is ready for when TensorRT-LLM and vLLM add full Blackwell support.
## Intended Use
Once framework support is available:
```bash
# Build a TensorRT engine from the quantized checkpoint
trtllm-build --checkpoint_dir ./Mixtral-8x7B-Instruct-v0.1-NVFP4 \
    --output_dir ./engine \
    --gemm_plugin nvfp4

# Serve with TensorRT-LLM
python -m tensorrt_llm.commands.serve --model_dir ./engine
```
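Once the server is running, it should accept requests over an OpenAI-compatible API. A sketch of a client call; the host, port, and model name are assumptions based on TensorRT-LLM serve defaults:

```python
import requests

# Assumes the serve command above exposes an OpenAI-compatible
# endpoint on localhost:8000; adjust host/port/model name as needed.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "Mixtral-8x7B-Instruct-v0.1-NVFP4",
        "prompt": "[INST] Explain NVFP4 in one sentence. [/INST]",
        "max_tokens": 100,
    },
)
print(resp.json()["choices"][0]["text"])
```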
## Quantization Details
### Weight Format
The model uses packed FP4 weights with block-wise scaling:
```
weight:         uint8          (packed FP4, half the original dimension)
weight_scale:   float8_e4m3fn  (per-block scales, group_size=16)
weight_scale_2: float32        (global scale)
input_scale:    float32        (activation scale)
```
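To make the layout concrete, here is a rough dequantization sketch in PyTorch. The nibble order and broadcasting are illustrative assumptions; on Blackwell, production kernels decode FP4 natively and never materialize the float weights like this:

```python
import torch

# E2M1 magnitudes for codes 0-7; codes 8-15 are the negated values.
E2M1 = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
LUT = torch.cat([E2M1, -E2M1])

def dequant_nvfp4(packed: torch.Tensor, weight_scale: torch.Tensor,
                  weight_scale_2: torch.Tensor) -> torch.Tensor:
    """Expand packed FP4 weights back to float for inspection.

    packed:         uint8, shape (out, in // 2), two FP4 codes per byte
    weight_scale:   float8_e4m3fn, shape (out, in // 16)
    weight_scale_2: float32 scalar (global scale)
    """
    lo = (packed & 0x0F).long()   # low nibble first is an assumption
    hi = (packed >> 4).long()
    codes = torch.stack([lo, hi], dim=-1).flatten(start_dim=-2)  # (out, in)
    vals = LUT[codes]

    # Apply one FP8 scale per 16-element block, then the global scale.
    out_dim, in_dim = vals.shape
    blocks = vals.reshape(out_dim, in_dim // 16, 16)
    scales = weight_scale.float().reshape(out_dim, in_dim // 16, 1)
    return (blocks * scales).reshape(out_dim, in_dim) * weight_scale_2
```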
### Layers Excluded from Quantization
- `lm_head` (output layer)
- All `block_sparse_moe.gate` layers (router networks)
### Calibration
Quantization was performed with 512 calibration samples from the C4 dataset, running forward passes to collect weight and activation statistics for optimal scale factor determination.
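In outline, this corresponds to modelopt's standard post-training quantization flow. A condensed sketch, not the exact script used; the dataset streaming, sequence length, and device placement are assumptions:

```python
import modelopt.torch.quantization as mtq
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.bfloat16, device_map="auto")

cfg = mtq.NVFP4_DEFAULT_CFG
# Match the exclusions above (lm_head may already be disabled by default).
cfg["quant_cfg"]["*lm_head*"] = {"enable": False}
cfg["quant_cfg"]["*block_sparse_moe.gate*"] = {"enable": False}

calib = load_dataset("allenai/c4", "en", split="train", streaming=True)

def forward_loop(m):
    # Forward passes over 512 C4 samples to collect calibration statistics.
    for i, row in enumerate(calib):
        if i >= 512:
            break
        ids = tok(row["text"], return_tensors="pt",
                  truncation=True, max_length=512).input_ids.to(m.device)
        m(ids)

model = mtq.quantize(model, cfg, forward_loop)
```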
## Baseline Performance (BF16)
For reference, the original BF16 model on DGX Spark (GB10):
| Metric | Value |
|---|---|
| Tokens/Second | 4.05 tok/s |
| Latency (100 tokens) | 24.74 s |
| Perplexity (WikiText-2) | 3.70 |
NVFP4 inference benchmarks are pending framework support.
## Training/Quantization Environment
- Hardware: NVIDIA DGX Spark (GB10 Blackwell GPU)
- Memory: 128GB Unified Memory
- Software:
  - Python 3.12
  - PyTorch 2.11
  - NVIDIA TensorRT Model Optimizer (modelopt) 0.40.0
  - Transformers 4.57.3
## Citation
If you use this model, please cite:
```bibtex
@misc{mixtral-nvfp4-2025,
  title={Mixtral-8x7B-Instruct-v0.1-NVFP4},
  author={Joseph Dowling},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/josephdowling10/Mixtral-8x7B-Instruct-v0.1-NVFP4}}
}
```
## License
This model inherits the Apache 2.0 license from the base Mixtral model.
## Acknowledgments
- Mistral AI for the base Mixtral 8x7B model
- NVIDIA for TensorRT Model Optimizer and NVFP4 format