Qwen3.6-27B · GGUF F16

Converted by PBH Applied Systems, LLC — Applied AI/ML Consulting · LLM Optimization & Deployment · Quantized AI Infrastructure

📌 Provenance repository — no behavioral evaluation performed. This repository contains the full-precision F16 GGUF of Qwen3.6-27B. At 53.8 GB, the F16 artifact exceeds the VRAM capacity of the evaluation hardware (NVIDIA RTX 4090, 24 GB). All behavioral evaluation data for this model is in the Q4_K_M companion repository: pbhappliedsystems/qwen3.6-27B-gguf-Q4-K-M.

🆕 First Qwen3-series model in the PBH Applied Systems evaluated series. Qwen3 introduces hybrid (adaptive) thinking mode — the model generates extended chain-of-thought reasoning on harder tasks. See the Q4_K_M card for a full analysis of how this behavior interacts with structured output evaluation.


Why No Evaluation

In the PBH Applied Systems evaluation pipeline, F16 GGUFs serve as cache-generation baselines for Q4_K_M comparison runs. For this model, the F16 GGUF is 53.8 GB, so it cannot be loaded into the RTX 4090's 24 GB of VRAM for a valid baseline run. The Q4_K_M run (20260426_163540) was therefore performed as a standalone evaluation, without an F16 cache baseline.

For all behavioral results, cross-series comparisons, thinking mode analysis, and deployment guidance, see the Q4_K_M card.


Model Description

This repository contains the full-precision F16 GGUF of Qwen/Qwen3.6-27B, a 27-billion-parameter model from Alibaba Cloud's Qwen3 generation featuring hybrid (adaptive) thinking mode.

Key Characteristics

  • Parameters: 27B
  • Architecture: Qwen3 · Hybrid thinking / non-thinking mode
  • Format: GGUF F16 (full precision)
  • File size: 53.8 GB
  • SHA256: 79ec580010d1a6690476a37436196e99b5c8fae7da75dfe2f6f3836663bf54cb
  • Minimum VRAM (full GPU offload): ~70 GB
  • Recommended hardware: 2× A100 40 GB · A100 80 GB · 3× A10G 24 GB
  • Context window: 32,768 tokens (check model config)
  • License: Apache 2.0

On thinking mode and F16 inference: No F16 runs were performed, so response length at F16 is unmeasured. Qwen3's adaptive thinking mode may generate longer reasoning chains on harder tasks at F16 than at Q4_K_M. Independently of response length, the F16 file is about 3.3× larger than the Q4_K_M file (53.8 GB vs. 16.5 GB), so expect higher per-request latency for complex structured tasks than the Q4_K_M evaluation times documented in the companion card.


Artifact Provenance

| Artifact | Format | Size | SHA256 | Evaluated |
|---|---|---|---|---|
| qwen3.6-27B-gguf-F16.gguf | GGUF F16 | 53.8 GB | 79ec580010d1a6690476a37436196e99b5c8fae7da75dfe2f6f3836663bf54cb | ❌ VRAM constraint |
| Q4_K_M (companion repo) | GGUF Q4_K_M | 16.5 GB | c863357b1b532a02c47ca363ab666dd623470a152a291dac6619ed7ce751d8c8 | ✅ Run 20260426_163540 |

The F16 GGUF was converted from Qwen/Qwen3.6-27B using a custom-built llama.cpp conversion pipeline developed by PBH Applied Systems, without modification to model weights.
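
The published SHA256 can be checked against a local download before loading the file. The following is a minimal sketch using only the Python standard library; the function names `sha256_of_file` and `verify_artifact` are illustrative and not part of any PBH Applied Systems tooling.

```python
import hashlib
from pathlib import Path

# SHA256 published in this card for qwen3.6-27B-gguf-F16.gguf
EXPECTED_SHA256 = "79ec580010d1a6690476a37436196e99b5c8fae7da75dfe2f6f3836663bf54cb"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 of a file, reading it in chunks.

    Chunked reads keep memory use flat, which matters for a 53.8 GB file.
    """
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected=EXPECTED_SHA256):
    """Return True if the file's digest matches `expected` (case-insensitive)."""
    return sha256_of_file(path) == expected.lower()
```

Calling `verify_artifact(model_path)` on the path returned by `hf_hub_download` would return `False` for a truncated or altered download.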


Hardware Requirements

| Configuration | VRAM Required | Notes |
|---|---|---|
| F16 (this repo) · full GPU | ~70 GB | 53.8 GB model + KV cache |
| F16 · multi-GPU split | ~18 GB per GPU (4-way) or ~35 GB per GPU (2-way) | 4× A10G 24 GB or 2× A100 40 GB |
| F16 · partial CPU offload | ~40 GB VRAM + 32 GB RAM | Reduced context; slower inference |
| Q4_K_M (companion repo) | ~22 GB | 16.5 GB file; single RTX 4090 or A10G |
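
The arithmetic behind these figures can be sketched as a rough pre-flight check. The overhead values below (about 16.2 GB for F16, about 5.5 GB for Q4_K_M) are assumptions inferred from the gap between this card's file sizes and its VRAM figures; they are not measured values. The helper names are hypothetical, and the check looks only at aggregate VRAM, ignoring per-GPU imbalance and fragmentation.

```python
def required_vram_gb(model_file_gb, runtime_overhead_gb):
    """Total VRAM estimate: weights plus KV cache and runtime buffers."""
    return model_file_gb + runtime_overhead_gb


def fits_on_gpus(model_file_gb, runtime_overhead_gb, gpu_vram_gb):
    """Return True if the summed VRAM of all GPUs covers the estimate.

    `gpu_vram_gb` is a list with one entry per GPU, in GB.
    """
    return sum(gpu_vram_gb) >= required_vram_gb(model_file_gb, runtime_overhead_gb)


if __name__ == "__main__":
    # Overheads are assumptions derived from this card's ~70 GB and ~22 GB rows.
    print("F16 on 1x RTX 4090:", fits_on_gpus(53.8, 16.2, [24.0]))
    print("F16 on 3x A10G:", fits_on_gpus(53.8, 16.2, [24.0, 24.0, 24.0]))
    print("F16 on 2x A100 40 GB:", fits_on_gpus(53.8, 16.2, [40.0, 40.0]))
    print("Q4_K_M on 1x RTX 4090:", fits_on_gpus(16.5, 5.5, [24.0]))
```

A configuration that passes with little margin, such as 3× 24 GB against a ~70 GB estimate, may still need a smaller context size in practice.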

Usage

Installation

pip install llama-cpp-python huggingface_hub

For multi-GPU CUDA deployment:

CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

Python — llama-cpp-python (multi-GPU) with Think-Block Stripping

from huggingface_hub import hf_hub_download
from llama_cpp import Llama
import re

# Note: 53.8 GB download — requires ~70 GB total VRAM for full GPU offload
model_path = hf_hub_download(
    repo_id="pbhappliedsystems/qwen3.6-27B-gguf-F16",
    filename="qwen3.6-27B-gguf-F16.gguf"
)

# Multi-GPU: adjust tensor_split to match your GPU configuration
llm = Llama(
    model_path=model_path,
    n_ctx=8192,
    n_gpu_layers=-1,
    tensor_split=[1, 1, 1],  # Example: 3× A10G 24 GB
    verbose=True,
)

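def strip_thinking(raw: str) -> str:
    """Strip <think> blocks and EOS tokens from Qwen3 output."""
    # Remove closed <think>...</think> blocks; also drop an unclosed block
    # left behind when generation is cut off at max_tokens mid-thought.
    clean = re.sub(r'<think>.*?(?:</think>|\Z)', '', raw, flags=re.DOTALL)
    # Remove any literal ChatML end-of-turn marker that leaked into the text.
    return clean.replace('<|im_end|>', '').strip()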

# Use /no_think to suppress thinking mode for structured output tasks
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a precise assistant."},
        {"role": "user", "content": "Return a JSON object with keys: summary, risk_level. /no_think"}
    ],
    temperature=0.15,
    max_tokens=2048,  # Headroom in case thinking tokens are still emitted
)
print(strip_thinking(response["choices"][0]["message"]["content"]))
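
When the GPUs in a node do not all have the same memory, the `tensor_split` values can be derived from per-GPU VRAM rather than written by hand. The sketch below assumes that llama.cpp treats `tensor_split` entries as relative proportions; `proportional_tensor_split` is a hypothetical helper, not part of llama-cpp-python.

```python
def proportional_tensor_split(gpu_vram_gb):
    """Return one fraction per GPU, proportional to that GPU's VRAM.

    The fractions sum to 1.0. Raises ValueError for an empty list or
    any non-positive entry.
    """
    if not gpu_vram_gb or any(v <= 0 for v in gpu_vram_gb):
        raise ValueError("gpu_vram_gb must contain only positive values")
    total = sum(gpu_vram_gb)
    return [v / total for v in gpu_vram_gb]


if __name__ == "__main__":
    # Three identical 24 GB cards give an even split, equivalent to [1, 1, 1].
    print(proportional_tensor_split([24.0, 24.0, 24.0]))
    # A mixed pair shifts more of the model onto the larger card.
    print(proportional_tensor_split([40.0, 24.0]))
```

The result can be passed as `tensor_split=proportional_tensor_split([...])` in the `Llama(...)` call above. A split based on total VRAM is only a starting point, since the first GPU often carries extra buffers.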

CLI — llama-cli (multi-GPU)

llama-cli \
  --model qwen3.6-27B-gguf-F16.gguf \
  --chat-template qwen3 \
  --system-prompt "You are a precise assistant." \
  --prompt "Return a JSON object with keys: summary, risk_level. /no_think" \
  --n-predict 2048 \
  --ctx-size 8192 \
  --n-gpu-layers -1 \
  --tensor-split 1,1,1 \
  --temp 0.15

🔬 About quant_eval & This Evaluation Series

quant_eval is a proprietary behavioral evaluation harness developed by PBH Applied Systems, LLC. It measures real agent-adjacent task performance across structured output, tool dispatch, multi-turn state retention, and multi-step planning — not perplexity or leaderboard proxies. Every model published under pbhappliedsystems has been independently evaluated using quant_eval before being recommended for any production role.

See it in action: Live AI Agent Demo → The demo runs production-style agent workflows powered by open-weight models selected through the quant_eval evaluation pipeline.

Need a deployment recommendation? Not sure which quantization level is right for your hardware, latency target, or agent type? → pbhappliedsystems.com


Converted and published by PBH Applied Systems, LLC · patrick@pbhappliedsystems.com


About PBH Applied Systems

PBH Applied Systems, LLC is an Oklahoma City–based applied machine learning and AI systems company specializing in production-grade model evaluation, quantization pipelines, agentic AI infrastructure, and scalable AI-driven application development.

Patrick Hill, M.S. — Founder · Data Scientist · AI/ML Engineer · Author of Applied Machine Learning: Concepts, Tools, and Case Studies (required reading, UAT CSC 373)


📞 Work With PBH Applied Systems

👉 Book a Scoping Call · 👉 Request an Evaluation Report — from $2,500


License

This GGUF repository inherits the license of the base model, Qwen/Qwen3.6-27B: Apache 2.0.


GGUF conversion performed by PBH Applied Systems, LLC · No behavioral evaluation — see companion Q4_K_M repository for all evaluation data
