Instructions to use tiiuae/Falcon-H1-3B-Instruct-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use tiiuae/Falcon-H1-3B-Instruct-GGUF with Transformers:

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("tiiuae/Falcon-H1-3B-Instruct-GGUF", dtype="auto")

llama-cpp-python

How to use tiiuae/Falcon-H1-3B-Instruct-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="tiiuae/Falcon-H1-3B-Instruct-GGUF",
	filename="Falcon-H1-3B-Instruct-BF16.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use tiiuae/Falcon-H1-3B-Instruct-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf tiiuae/Falcon-H1-3B-Instruct-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf tiiuae/Falcon-H1-3B-Instruct-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf tiiuae/Falcon-H1-3B-Instruct-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf tiiuae/Falcon-H1-3B-Instruct-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf tiiuae/Falcon-H1-3B-Instruct-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf tiiuae/Falcon-H1-3B-Instruct-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf tiiuae/Falcon-H1-3B-Instruct-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf tiiuae/Falcon-H1-3B-Instruct-GGUF:Q4_K_M

Use Docker

docker model run hf.co/tiiuae/Falcon-H1-3B-Instruct-GGUF:Q4_K_M

LM Studio
Jan
Ollama
How to use tiiuae/Falcon-H1-3B-Instruct-GGUF with Ollama:
```
ollama run hf.co/tiiuae/Falcon-H1-3B-Instruct-GGUF:Q4_K_M
```

Unsloth Studio new

How to use tiiuae/Falcon-H1-3B-Instruct-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for tiiuae/Falcon-H1-3B-Instruct-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for tiiuae/Falcon-H1-3B-Instruct-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for tiiuae/Falcon-H1-3B-Instruct-GGUF to start chatting

Pi new

How to use tiiuae/Falcon-H1-3B-Instruct-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf tiiuae/Falcon-H1-3B-Instruct-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "tiiuae/Falcon-H1-3B-Instruct-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use tiiuae/Falcon-H1-3B-Instruct-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf tiiuae/Falcon-H1-3B-Instruct-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default tiiuae/Falcon-H1-3B-Instruct-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use tiiuae/Falcon-H1-3B-Instruct-GGUF with Docker Model Runner:
```
docker model run hf.co/tiiuae/Falcon-H1-3B-Instruct-GGUF:Q4_K_M
```

Lemonade

How to use tiiuae/Falcon-H1-3B-Instruct-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull tiiuae/Falcon-H1-3B-Instruct-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Falcon-H1-3B-Instruct-GGUF-Q4_K_M

List all available models

lemonade list

TL;DR
Model Details
Training Details
Usage
Evaluation
Citation

TL;DR

Model Details

Model Description

Developed by: https://www.tii.ae
Model type: Causal decoder-only
Architecture: Hybrid Transformers + Mamba architecture
Language(s) (NLP): English, Multilingual
License: Falcon-LLM License

Training details

For more details about the training protocol of this model, please refer to the Falcon-H1 technical blogpost and Technical Report.

Usage

Currently to use this model you can either rely on Hugging Face transformers, vLLM or our custom fork of llama.cpp library.

Inference

Make sure to install the latest version of transformers or vllm, eventually install these packages from source:

pip install git+https://github.com/huggingface/transformers.git

Refer to the official vLLM documentation for more details on building vLLM from source.

🤗 transformers

Refer to the snippet below to run H1 models using 🤗 transformers:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Falcon-H1-1B-Base"

model = AutoModelForCausalLM.from_pretrained(
  model_id,
  torch_dtype=torch.bfloat16,
  device_map="auto"
)

# Perform text generation

vLLM

For vLLM, simply start a server by executing the command below:

# pip install vllm
vllm serve tiiuae/Falcon-H1-1B-Instruct --tensor-parallel-size 2 --data-parallel-size 1

🦙 llama.cpp

Our architecture is integrated into the latest versions of llama.cpp: https://github.com/ggml-org/llama.cpp - you can use the our official GGUF files directly with llama.cpp

Evaluation

Falcon-H1 series perform very well on a variety of tasks, including reasoning tasks.

Tasks	Falcon-H1-3B	Qwen3-4B	Qwen2.5-3B	Gemma3-4B	Llama3.2-3B	Falcon3-3B
General
BBH	53.69	51.07	46.55	50.01	41.47	45.02
ARC-C	49.57	37.71	43.77	44.88	44.88	48.21
TruthfulQA	53.19	51.75	58.11	51.68	50.27	50.06
HellaSwag	69.85	55.31	64.21	47.68	63.74	64.24
MMLU	68.3	67.01	65.09	59.53	61.74	56.76
Math
GSM8k	84.76	80.44	57.54	77.41	77.26	74.68
MATH-500	74.2	85.0	64.2	76.4	41.2	54.2
AMC-23	55.63	66.88	39.84	48.12	22.66	29.69
AIME-24	11.88	22.29	6.25	6.67	11.67	3.96
AIME-25	13.33	18.96	3.96	13.33	0.21	2.29
Science
GPQA	33.89	28.02	28.69	29.19	28.94	28.69
GPQA_Diamond	38.72	40.74	35.69	28.62	29.97	29.29
MMLU-Pro	43.69	29.75	32.76	29.71	27.44	29.71
MMLU-stem	69.93	67.46	59.78	52.17	51.92	56.11
Code
HumanEval	76.83	84.15	73.78	67.07	54.27	52.44
HumanEval+	70.73	76.83	68.29	61.59	50.0	45.73
MBPP	79.63	68.78	72.75	77.78	62.17	61.9
MBPP+	67.46	59.79	60.85	66.93	50.53	55.29
LiveCodeBench	26.81	39.92	11.74	21.14	2.74	3.13
CRUXEval	56.25	69.63	43.26	52.13	17.75	44.38
Instruction Following
IFEval	85.05	84.01	64.26	77.01	74.0	69.1
Alpaca-Eval	31.09	36.51	17.37	39.64	19.69	14.82
MTBench	8.72	8.45	7.79	8.24	7.96	7.79
LiveBench	36.86	51.34	27.32	36.7	26.37	26.01

You can check more in detail on our our release blogpost, detailed benchmarks.

Useful links

View our release blogpost.
View our technical report.
Feel free to join our discord server if you have any questions or to interact with our researchers and developers.

Citation

If the Falcon-H1 family of models were helpful to your work, feel free to give us a cite.

@article{falconh1,
    title={Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance},
    author={Jingwei Zuo and Maksim Velikanov and Ilyas Chahed and Younes Belkada and Dhia Eddine Rhayem and Guillaume Kunsch and Hakim Hacid and Hamza Yous and Brahim Farhat and Ibrahim Khadraoui and Mugariya Farooq and Giulia Campesan and Ruxandra Cojocaru and Yasser Djilali and Shi Hu and Iheb Chaabane and Puneesh Khanna and Mohamed El Amine Seddik and Ngoc Dung Huynh and Phuc Le Khac and Leen AlQadi and Billel Mokeddem and Mohamed Chami and Abdalgader Abubaker and Mikhail Lubinets and Kacper Piskorski and Slim Frikha},
    journal = {arXiv preprint arXiv:2507.22448},
    year={2025}
}

Downloads last month: 589

GGUF

Model size

3B params

Architecture

falcon-h1

Hardware compatibility

1-bit

2-bit

3-bit

4-bit

5-bit

6-bit

8-bit

16-bit

32-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for tiiuae/Falcon-H1-3B-Instruct-GGUF

Base model

tiiuae/Falcon-H1-3B-Base

Quantized

(6)

this model

Collection including tiiuae/Falcon-H1-3B-Instruct-GGUF

Falcon-H1

Collection

Falcon-H1 Family of Hybrid-Head Language Models (Transformer-SSM), including 0.5B, 1.5B, 1.5B-Deep, 3B, 7B, and 34B (pretrained & instruction-tuned). • 33 items • Updated Mar 2 • 59

Paper for tiiuae/Falcon-H1-3B-Instruct-GGUF

Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance

Paper • 2507.22448 • Published Jul 30, 2025 • 71

Article mentioning tiiuae/Falcon-H1-3B-Instruct-GGUF

Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance

tiiuae
/

Falcon-H1-3B-Instruct-GGUF

Table of Contents

TL;DR

Model Details

Model Description

Training details

Usage

Inference

🤗 transformers

vLLM

🦙 llama.cpp

Evaluation

Useful links

Citation

Model tree for tiiuae/Falcon-H1-3B-Instruct-GGUF

Collection including tiiuae/Falcon-H1-3B-Instruct-GGUF

Falcon-H1

Paper for tiiuae/Falcon-H1-3B-Instruct-GGUF

Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance

Article mentioning tiiuae/Falcon-H1-3B-Instruct-GGUF

Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance