Qwen3-0.6B GGUF

This repository contains GGUF (GPT-Generated Unified Format) conversions of the Qwen3-0.6B language model, optimized for efficient CPU inference using llama.cpp, Ollama, and other GGUF-compatible engines.

This model was converted by NEO, a fully autonomous ML engineering agent.

Model Overview

Qwen3-0.6B is a compact yet powerful 0.6 billion parameter causal language model from Alibaba's Qwen series, featuring:

  • Dual-mode inference: Supports both thinking and non-thinking modes for flexible reasoning
  • Enhanced reasoning: Improved logical reasoning and problem-solving capabilities
  • Multilingual support: Proficient in 100+ languages
  • Tool calling: Native support for function calling and agent workflows
  • Extended context: 32,768 token context length
  • Efficient architecture: GQA attention mechanism for optimized inference

Architecture Details

  • Parameters: 0.6B total (0.44B non-embedding)
  • Layers: 28 transformer layers
  • Attention: Grouped Query Attention (GQA)
    • 16 Query heads
    • 8 Key-Value heads
  • Context Length: 32,768 tokens
  • Vocabulary Size: 151,936 tokens
  • Original Precision: BF16

Quantization Variants

This repository provides 4 GGUF quantization variants optimized for different use cases:

  • qwen3-0.6b-fp16.gguf (FP16, 1,439 MB): reference quality, GPU inference; highest quality
  • qwen3-0.6b-q8_0.gguf (Q8_0, 767 MB): high-quality CPU inference; very high quality
  • qwen3-0.6b-q5_k_m.gguf (Q5_K_M, 526 MB): production CPU deployment; high quality
  • qwen3-0.6b-q4_k_m.gguf (Q4_K_M, 462 MB): edge devices, mobile, low memory; good quality
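As a rough sanity check, the effective bits per weight of each variant can be derived from the file sizes in the list above. The sketch below infers the weight count from the FP16 file (2 bytes per weight, embeddings included); the results are approximate, since GGUF files also carry metadata and the K-quant variants keep some tensors at higher precision.

# Rough bits-per-weight estimate from the file sizes listed above.
MB = 1024 * 1024

sizes_mb = {
    "fp16": 1439,
    "q8_0": 767,
    "q5_k_m": 526,
    "q4_k_m": 462,
}

# FP16 stores 2 bytes per weight, so the FP16 file size gives the weight count.
n_weights = sizes_mb["fp16"] * MB / 2

for name, size in sizes_mb.items():
    bpw = size * MB * 8 / n_weights
    print(f"{name:8s} ~{bpw:.1f} bits/weight")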

Quantization Recommendations

  • FP16: Use for reference benchmarks or when GPU memory is available
  • Q8_0: Best balance for CPU inference with minimal quality loss
  • Q5_K_M: Recommended for production CPU deployments (best quality/size ratio)
  • Q4_K_M: Optimal for resource-constrained environments (edge, mobile, IoT)

Usage Instructions

llama.cpp CLI

Download a model variant and run inference:

# Download model (replace with your preferred variant)
huggingface-cli download gvij/qwen3-0.6b-gguf qwen3-0.6b-q5_k_m.gguf --local-dir ./models

# Run inference
./llama-cli -m models/qwen3-0.6b-q5_k_m.gguf -p "Explain quantum computing:" -n 256 --temp 0.7

# Interactive chat mode
./llama-cli -m models/qwen3-0.6b-q5_k_m.gguf -cnv --color
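llama.cpp also ships an HTTP server (llama-server) that exposes an OpenAI-compatible chat endpoint. A minimal client sketch, assuming the server was started locally with ./llama-server -m models/qwen3-0.6b-q5_k_m.gguf -c 4096 --port 8080 (the port and context size here are illustrative):

# Minimal client for llama-server's OpenAI-compatible chat endpoint.
# Assumes: ./llama-server -m models/qwen3-0.6b-q5_k_m.gguf -c 4096 --port 8080
import json
import urllib.request

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Explain quantum computing in two sentences."},
    ],
    "temperature": 0.7,
    "max_tokens": 256,
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

print(body["choices"][0]["message"]["content"])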

Ollama Integration

Create a Modelfile:

FROM ./qwen3-0.6b-q5_k_m.gguf

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER stop "<|im_end|>"
PARAMETER stop "<|endoftext|>"

TEMPLATE """
<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

Create and run the model:

ollama create qwen3-0.6b -f Modelfile
ollama run qwen3-0.6b "Write a Python function to calculate factorial"
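Ollama also serves a local REST API (on port 11434 by default), so the model created above can be queried programmatically. A minimal sketch using only the standard library:

# Query the locally created Ollama model over its REST API
# (Ollama listens on http://localhost:11434 by default).
import json
import urllib.request

payload = {
    "model": "qwen3-0.6b",
    "prompt": "Write a Python function to calculate factorial",
    "stream": False,  # return one JSON object instead of a token stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])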

Python with llama-cpp-python

from llama_cpp import Llama

llm = Llama(
    model_path="qwen3-0.6b-q5_k_m.gguf",
    n_ctx=32768,
    n_threads=8,
    n_gpu_layers=0
)

response = llm(
    "Explain the theory of relativity in simple terms:",
    max_tokens=256,
    temperature=0.7,
    top_p=0.9,
    stop=["<|im_end|>", "<|endoftext|>"]
)

print(response['choices'][0]['text'])
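For chat-style use, llama-cpp-python can apply the chat template embedded in the GGUF metadata via create_chat_completion, so there is no need to format <|im_start|> markers by hand. A short sketch reusing the llm object created above:

# Chat-style call; llama-cpp-python applies the chat template stored
# in the GGUF metadata, so no manual <|im_start|> formatting is needed.
chat = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Summarize the theory of relativity in one paragraph."},
    ],
    max_tokens=256,
    temperature=0.7,
)

print(chat["choices"][0]["message"]["content"])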

Thinking Mode

Qwen3 supports an explicit reasoning ("thinking") mode. When thinking is enabled, the model writes its reasoning inside <think>...</think> tags before producing the final answer; you do not wrap your own prompt in those tags. With the ChatML-style template shown above, thinking can be toggled per turn by appending the /think or /no_think soft switch to the user message:

prompt = """<|im_start|>user
What is 15% of 240? /think<|im_end|>
<|im_start|>assistant
"""

response = llm(prompt, max_tokens=512, stop=["<|im_end|>"])
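Because the reasoning trace and the answer share one completion, downstream code typically strips the <think> block before displaying the reply. A minimal sketch:

# Split the completion into the reasoning trace and the visible answer.
import re

text = response['choices'][0]['text']
match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
reasoning = match.group(1).strip() if match else ""
answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

print("reasoning:", reasoning[:200])
print("answer:", answer)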

Performance Characteristics

CPU Inference Speed (approximate, on modern x86-64 CPU):

  • Q4_K_M: ~20-30 tokens/second
  • Q5_K_M: ~18-25 tokens/second
  • Q8_0: ~12-18 tokens/second
  • FP16: ~8-12 tokens/second (CPU) / ~50-80 tokens/second (GPU)

Memory Requirements:

  • Q4_K_M: ~600 MB RAM
  • Q5_K_M: ~700 MB RAM
  • Q8_0: ~1 GB RAM
  • FP16: ~1.6 GB RAM

Note: Actual performance depends on CPU architecture, clock speed, and context length.
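The context-length caveat above is mostly about the KV cache, which grows linearly with the number of tokens kept in context. A back-of-envelope sketch using the numbers from the Architecture Details section (28 layers, 8 KV heads) and an assumed head dimension of 128, which is typical for Qwen3 but not stated in this card:

# Rough RAM estimate: quantized weights (file size) + FP16 KV cache.
# head_dim = 128 is an assumption; the other numbers come from the
# "Architecture Details" section above.
def kv_cache_mb(context_len, n_layers=28, n_kv_heads=8, head_dim=128, bytes_per_val=2):
    # K and V are each [n_layers, n_kv_heads, context_len, head_dim]
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_val / (1024 ** 2)

weights_mb = 526  # qwen3-0.6b-q5_k_m.gguf, from the table above

for ctx in (2048, 8192, 32768):
    print(f"ctx={ctx:>6}: KV cache ~{kv_cache_mb(ctx):5.0f} MB, total ~{weights_mb + kv_cache_mb(ctx):5.0f} MB")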

Features

  • Thinking & Non-thinking Modes: Dynamic reasoning control
  • Multilingual: 100+ languages supported
  • Tool Calling: Native function calling for agent workflows (see the parser sketch below)
  • Extended Context: 32K token context window
  • CPU Optimized: GGUF format for efficient CPU inference
  • Flexible Deployment: Compatible with llama.cpp, Ollama, LM Studio, MLX-LM
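For tool calling, the Qwen chat templates conventionally emit each call as a JSON object wrapped in <tool_call>...</tool_call> tags (Hermes style); verify the exact wrapper against the chat template embedded in the GGUF you downloaded. A minimal parser sketch under that assumption:

# Minimal parser for Hermes-style tool calls as emitted by the Qwen
# chat templates: <tool_call>{"name": ..., "arguments": {...}}</tool_call>.
# The exact wrapper tags should be checked against the chat template
# embedded in the GGUF file.
import json
import re

def extract_tool_calls(text):
    calls = []
    for block in re.findall(r"<tool_call>(.*?)</tool_call>", text, flags=re.DOTALL):
        try:
            calls.append(json.loads(block))
        except json.JSONDecodeError:
            pass  # skip malformed blocks
    return calls

sample = '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
print(extract_tool_calls(sample))  # [{'name': 'get_weather', 'arguments': {'city': 'Paris'}}]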

Model Source

  • Original Model: Qwen/Qwen3-0.6B
  • Model Family: Qwen3 Series
  • Developer: Alibaba Cloud
  • Conversion: HuggingFace → GGUF using llama.cpp conversion scripts

License

This model is released under the Apache 2.0 License, following the original Qwen3-0.6B license terms.

Citation

@misc{qwen3-0.6b-gguf,
  title={Qwen3-0.6B GGUF},
  author={Qwen Team},
  year={2025},
  url={https://huggingface.co/gvij/qwen3-0.6b-gguf}
}

Acknowledgments

  • Qwen Team at Alibaba Cloud for the original model
  • llama.cpp community for GGUF format and conversion tools
  • Model conversion performed using llama.cpp conversion pipeline

For issues, questions, or feedback, please visit the original Qwen3-0.6B repository.

Authored and published by NEO
