hipfire DeepSeek V4 Flash (mq2lloyd)
A mixed-precision packaging of deepseek-ai/DeepSeek-V4-Flash (284 B total / 13 B active per upstream) for the hipfire Rust-native inference engine on AMD RDNA GPUs.
Upstream is shipped in FP8 (e4m3) with FP4 routed-expert weights. This packaging rewrites the dominant weight class β the 256 routed MoE experts per layer β as 2-bit MagnumQuant-Lloyd (MQ2-Lloyd), and keeps everything else as Q8F16 or F16. The container suffix .mq2lloyd names that dominant class, but the file is not a uniform 2-bit dump.
The file format is hipfire's HFQ container and is not GGUF / safetensors / AWQ compatible β it only loads in hipfire.
What's inside (verified by enumerating the file's tensor table)
The 86.2 GB file contains 34 223 tensors. By hipfire QuantType enum:
33024 qt=19 (MQ2G256Lloyd) β 256 routed experts Γ 3 (w1/w2/w3) Γ 43 layers
389 qt=3 (Q8F16) β shared experts, main attn, embed, lm_head, router gates
807 qt=1 (F16) β norms, compressor, indexer, hc_*, attn_sink
3 qt=22 (TidI32) β hash-router fast-path (3 layers only)
| Tensor class | Storage | Weights (this file) | Storage bytes |
|---|---|---|---|
| Routed MoE experts | MQ2G256Lloyd, 2.25 bpw (G=256 β 64 B 2-bit indices + 8 B fp16 Lloyd codebook = 72 B/group) | 277.025 B | 77.913 GB |
| Shared expert + main attention (wq_a/b, wkv, wo_a/b) + embed + lm_head + MoE router gates | Q8F16, 8.5 bpw (GGML Q8_0 block: 2 B F16 scale + 32 B Q8 data per 32 weights = 34 B / 32 w) | 6.785 B | 7.209 GB |
Compressor / indexer matrices, RMSNorm scales, attn_sink, HC gating (hc_attn_*, hc_ffn_*, hc_head_*) |
F16, 16 bpw | 0.522 B | 1.043 GB |
tid2eid hash-router table (layers 0β2 only β num_hash_layers = 3) |
TidI32, 32 bpw | 2.3 M | 9.3 MB |
| Total | mixed β 2.425 bpw avg | 284.335 B | 86.175 GB weights + 9.2 MB header/metadata = 86.184 GB on disk (estimate matches the actual file to within 0.01 %) |
The runtime SWA K and V state (window = 128) is F32 and is allocated live at session start β it is not persisted in the file.
Files
| File | Size | Purpose |
|---|---|---|
deepseek-v4-flash.mq2lloyd |
86,184,307,283 B (β86.2 GB) | Main model β 43 layers, 256 routed + 1 shared expert per layer, attention, embed, lm_head |
deepseek-v4-flash-mtp.mq2lloyd |
1,998,047,355 B (β2.0 GB) | Single MTP layer (num_nextn_predict_layers = 1), opened automatically when present alongside the main file. Used for optional speculative decode. |
Loading
This model is registered in hipfire's CLI registry under deepseek-v4-flash (aliases: deepseek4, deepseek-v4). The registry entry pulls both the main file and the MTP companion together.
# 1. Install hipfire (one-time):
curl -L https://raw.githubusercontent.com/Kaden-Schutt/hipfire/master/scripts/install.sh | bash
# 2. Pull the model (downloads both .mq2lloyd files into ~/.hipfire/models/):
hipfire pull deepseek-v4-flash
# 3. Run it:
hipfire run deepseek-v4-flash "Write a fibonacci function in C"
# or interactive chat:
hipfire run deepseek-v4-flash
# or expose an OpenAI-compatible HTTP server on :11435:
hipfire serve
hipfire run auto-pulls if the files aren't present, so step 2 is optional. Both HIPFIRE_DEEPSEEK4_MOE and HIPFIRE_DEEPSEEK4_UPLOAD_EXPERTS default to on at the engine level β no env vars are required.
From source (development path)
git clone https://github.com/Kaden-Schutt/hipfire
cd hipfire
cargo build --release -p hipfire-arch-deepseek4 --example deepseek4_chat
# Pull the files (uses `hf` CLI or hipfire's registry-driven pull above):
hf download nwoolmer/hipfire-deepseek-v4-flash \
deepseek-v4-flash.mq2lloyd deepseek-v4-flash-mtp.mq2lloyd \
--local-dir ~/.hipfire/models/
# Direct chat binary (DSML chat template, EOS stop, multi-turn KV):
HIPFIRE_DEEPSEEK4_MODEL=~/.hipfire/models/deepseek-v4-flash.mq2lloyd \
./target/release/examples/deepseek4_chat
For programmatic access, crates/hipfire-runtime/examples/daemon.rs exposes the engine as a JSON-lines IPC service over stdin/stdout (spawned by the Bun CLI front-end); it dispatches DeepSeek V4 Flash via arch_id = 9.
Architecture
The packaging follows the V4F architecture as described in the DeepSeek V4 paper. Every value below is read from the config JSON embedded in the HFQ file's metadata blob:
| Field | Value | Notes |
|---|---|---|
architectures |
["DeepseekV4ForCausalLM"] |
upstream class |
num_hidden_layers |
43 | |
hidden_size |
4096 | |
vocab_size |
129 280 | |
num_attention_heads / num_key_value_heads / head_dim |
64 / 1 / 512 | 64 query heads; KV is a single latent stream of dim 512 (MLA, joint K+V via wkv [512, 4096]) |
q_lora_rank / o_lora_rank / o_groups |
1024 / 1024 / 8 | Q low-rank factorisation wq_a [1024,4096] β wq_b [32768,1024]; grouped O projection wo_a [8192,4096] β wo_b [4096,8192] with intermediate = o_groups Β· o_lora_rank = 8192 |
qk_rope_head_dim |
64 | tail-split RoPE: only the last 64 channels per head get rotated |
n_routed_experts / n_shared_experts / num_experts_per_tok |
256 / 1 / 6 | top-6 routing per token |
moe_intermediate_size |
2048 | per-expert width |
hc_mult / hc_sinkhorn_iters / hc_eps |
4 / 20 / 1e-6 | 4-stream Hyper-Connections, 20-iter Sinkhorn normalisation |
index_n_heads / index_head_dim / index_topk |
64 / 128 / 512 | indexer-gated attention: top-512 over compressed-K |
compress_ratios |
[0, 0, 4, 128, 4, 128, β¦, 4, 0] (len 43) |
layers 0β1 are dense; compressed-KV attention from layer 2 onward |
num_hash_layers |
3 | first three layers carry the tid2eid router fast-path |
num_nextn_predict_layers |
1 | one MTP head (in the companion file) |
sliding_window |
128 | SWA window for the main attention path |
rope_theta / compress_rope_theta |
10 000 / 160 000 | |
rope_scaling |
YaRN, factor 16, original_max 65 536 β max_position 1 048 576 | |
expert_dtype (upstream) / quantization_config |
fp4 / fp8 e4m3 [128,128] |
upstream's quant; superseded here by MQ2-Lloyd + Q8F16 + F16 |
Tensor-presence cross-check against the file (also verified above):
- Layers 0 and 1: dense attention (
wq_a/b,wkv,wo_a/bonly β no compressor / no indexer). - Layers 2β42: compressed-KV attention (every layer has
attn.compressor.{wkv,wgate,ape,norm}). - Even layers 2, 4, β¦, 42 (21 layers): also carry the indexer block (
attn.indexer.{wq_b, weights_proj, compressor.*}). - Layers 0, 1, 2: also ship the
ffn.gate.tid2eidhash-router fast-path table. - The MTP companion file contains a single layer (
mtp.0.*) with its own attention block, 256 routed experts, a shared expert, a full mHC stack (hc_attn_*,hc_ffn_*,hc_head_*), and input-side projection/norm tensors (e_proj,h_proj,enorm,hnorm) used by the MTP head's input mixing.
Performance
Measured 2026-05-28 on AMD Radeon 8060S (gfx1151, Strix Halo APU, 128 GB UMA), hipfire at tag v0.2.0 (commit 3d456e5c), ROCm 7.2.1, SWA attention path, temp=0.7 top_k=40, prompt_normalize on.
| Mode | Throughput | How measured |
|---|---|---|
| Plain decode (TG) | ~13.9 tok/s | 13 warm-turn measurements across two chat processes (turn β₯ 2 of each, both 256-tok and 16-tok generations): median 13.86, range 13.61 β 14.00, Ο β 0.7 %. |
| Batched prefill (PP) | ~55 tok/s | 1 235-token fresh-KV prefill of a single-chunk system prompt (no internal blank lines). HIPFIRE_DEEPSEEK4_PP_BATCH default 1024. Sub-100-token chat-turn prompts are overhead-dominated (~40 tok/s) and are not the right point to cite. |
| Spec decode (MTP, K=3) | 14β19 tok/s | Median 16.4 tok/s, +19 % over plain. Draft accept rate ranged 41β62 % across the three turns of the same chat (highest on direct code generation, lowest on conversational follow-ups). Enable with HIPFIRE_DEEPSEEK4_SPEC_DECODE=1 HIPFIRE_DEEPSEEK4_SPEC_K=3. |
Cold-process load β weight upload from both HFQ files β is ~44 s on the 8060S (measured by an inline timestamping wrapper from process start until the engine prints DeepSeek V4 ready.). Plain decode is DRAM-bandwidth-bound; faster memory (RDNA3 desktop GDDR6, future RDNA4) scales decode roughly proportionally.
Compatibility
- GPU: AMD RDNA3 / RDNA3.5 with HIP + WMMA. Validated on gfx1151 (Radeon 8060S, Strix Halo) for this build of the V4F weights. The engine has gfx1100 (RX 7900-class) kernels in tree but they were not exercised against this specific file; RDNA1/2 (gfx1010 / gfx103x) and gfx12 are tracked targets of the broader hipfire project but not recommended for running this model.
- OS: Linux with
amdgpukernel driver. Built and measured against ROCm 7.2.1 / HIP 7.2. - Memory: 86.2 GB for the main file + ~2 GB for the MTP companion + several GB working set during decode. Strix Halo class systems with 128 GB UMA are the comfortable target; discrete-GPU configs need ~96 GB+ of VRAM.
- Tested context length: end-to-end inference was exercised at prompts up to ~1.2 k tokens and generations up to 256 tokens. Upstream's stated max position (1 048 576 via YaRN) was not exercised here.
License
The upstream model (deepseek-ai/DeepSeek-V4-Flash) is MIT-licensed; the weights in this packaging inherit those terms.
The hipfire engine that produced and consumes this format is dual-licensed MIT / Apache-2.0 at the user's option (see LICENSE-MIT, LICENSE-APACHE, and NOTICE in the engine repo).
Acknowledgements
- DeepSeek AI β original DeepSeek V4 Flash weights and architecture.
- Salvatore Sanfilippo (antirez) β DwarfStar, a focused C/Metal/CUDA/ROCm reference engine for DeepSeek V4 Flash. The hipfire impl cross-validates against it for MTP wiring, HC reduction, and KV layout.
Citation
If you use this packaging, please cite the upstream model, the hipfire engine (per its CITATION.cff), and this HF release:
@misc{deepseekai2026deepseekv4,
title = {DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence},
author = {DeepSeek-AI},
year = {2026},
url = {https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash}
}
@software{hipfire,
title = {hipfire β Rust-native LLM inference for AMD RDNA / CDNA},
author = {Schutt, Kaden},
year = {2026},
version = {0.2.0},
url = {https://github.com/Kaden-Schutt/hipfire}
}
@misc{hipfire-deepseek-v4-flash,
title = {hipfire DeepSeek V4 Flash (MQ2-Lloyd)},
author = {Woolmer, Nick},
year = {2026},
url = {https://huggingface.co/nwoolmer/hipfire-deepseek-v4-flash}
}
Model tree for nwoolmer/hipfire-deepseek-v4-flash
Base model
deepseek-ai/DeepSeek-V4-Flash