exl2-6.0bpw

by tatianapoliakova - opened May 21, 2025

Discussion

tatianapoliakova

May 21, 2025

It seems that there's no official tabbyAPI support for exl3 yet. Would it be possible to get an exl2 6.0bpw version, since it's more precise? And one for GLM-Z1-Rumination-32B-0414 as well? It seems that GLM-Z1-Rumination-32B-0414 is the best model.

MetaphoricalCode

Owner May 21, 2025 (edited May 21, 2025)

Exllamav3 support was merged into tabbyAPI's main branch on May 9. The quantization process is fairly straightforward, and it's covered in the docs: https://github.com/turboderp-org/exllamav3/blob/master/doc/convert.md
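For illustration only, here is a minimal sketch of how a conversion command might be assembled. It assumes that exllamav3's `convert.py` takes `-i` (input model directory), `-o` (output directory), `-w` (working directory), `-b` (bits per weight) and `-hb` (head bits); those flag names are an assumption here and should be checked against the linked convert.md. The paths and the `build_convert_command` helper are hypothetical.

```python
import shlex


def build_convert_command(input_dir, output_dir, work_dir, bits_per_weight, head_bits=None):
    """Build an argv list for exllamav3's convert.py.

    The flag names (-i, -o, -w, -b, -hb) are assumptions; verify them
    against doc/convert.md in the exllamav3 repository before running.
    """
    cmd = [
        "python", "convert.py",
        "-i", input_dir,             # unquantized source model directory
        "-o", output_dir,            # where the exl3 quant is written
        "-w", work_dir,              # scratch space for intermediate state
        "-b", str(bits_per_weight),  # target bits per weight
    ]
    if head_bits is not None:
        cmd += ["-hb", str(head_bits)]  # bits for the output head (assumed flag)
    return cmd


# Hypothetical paths; 5.0 bpw with 8 head bits mirrors the "5bpw-hb8" naming.
cmd = build_convert_command(
    "/models/GLM-4-32B-0414",
    "/models/GLM-4-32B-0414-exl3-5bpw-hb8",
    "/tmp/exl3-work",
    5.0,
    head_bits=8,
)
print(shlex.join(cmd))
```

The helper only builds and prints the command. Running it would have to happen from a checkout of the exllamav3 repository, for example via `subprocess.run(cmd, check=True)`.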

GLM-4-32B-0414 exl3 6bpw (not exl2) is already quantized: https://huggingface.co/owentruong/GLM-4-32B-0414-EXL3/tree/6.0

GLM-4-32B models are not supported in Exllamav2. Turboderp, the creator of Exllama, is now mostly focused on v3, so support most likely won't be added to v2. It's worth switching to v3 if your GPU is Ada- or Blackwell-based (RTX 40xx / 50xx). Performance is still an issue on Ampere (RTX 30xx), though it's being worked on. It's not horrible, and some people do run exl3 on an RTX 3090, but performance is expected to improve in the future.
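If you're not sure which generation your cards belong to, a rough check is to look at the CUDA compute capability. The sketch below assumes the usual mapping (8.0/8.6 for Ampere, 8.9 for Ada, 9.0 for Hopper, 10.x/12.x for Blackwell) and uses PyTorch only if it happens to be installed. The `gpu_architecture` helper is hypothetical and not part of exllamav3 or tabbyAPI.

```python
def gpu_architecture(major, minor):
    """Map a CUDA compute capability to an NVIDIA architecture name.

    Assumed mapping: 8.9 = Ada, other 8.x = Ampere, 9.x = Hopper,
    10.x / 12.x = Blackwell. Anything else is reported as "other".
    """
    if (major, minor) == (8, 9):
        return "Ada"
    if major == 8:
        return "Ampere"
    if major == 9:
        return "Hopper"
    if major in (10, 12):
        return "Blackwell"
    return "other"


# Optional: report the local GPUs when PyTorch with CUDA is available.
try:
    import torch
except ImportError:
    torch = None

if torch is not None and torch.cuda.is_available():
    for index in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(index)
        print(torch.cuda.get_device_name(index), gpu_architecture(major, minor))
```

An RTX 3090 reports capability 8.6 and lands in the Ampere branch, which is the case where exl3 performance is described above as still being worked on.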

tatianapoliakova

May 21, 2025

OK, I didn't know that these models aren't supported in exllamav2.

I'm interested in testing GLM-Z1-Rumination-32B-0414, as it uses a deeper thinking process, if I understood correctly. If I use several RTX 3xxx cards over PCIe x1 connections, will that dramatically decrease the conversion speed, or is it better to use 2 GPUs on PCIe x16 (I don't have more x16 slots)? Do you know?

MetaphoricalCode

Owner May 21, 2025

I have no information on that, sadly. However, you can ask for help in Exllama's Discord server: https://discord.gg/NSFwVuCjRq
Someone there will surely know. Turboderp is there, too.
