exl2-6.0bpw
It seems there's no official tabbyAPI support for exl3 yet. Would it be possible to get an exl2 6.0bpw version, since it would be more precise? And one for GLM-Z1-Rumination-32B-0414 as well? It seems that GLM-Z1-Rumination-32B-0414 is the best model.
Exllamav3 support was merged into tabbyAPI's main branch on May 9. The quantization process is fairly straightforward, and it's covered in the docs: https://github.com/turboderp-org/exllamav3/blob/master/doc/convert.md
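For anyone who wants to script the conversion, here is a minimal sketch that only assembles the command line and prints it. The flag names (`-i`, `-o`, `-w`, `-b`, `-hb`) are how I recall them from convert.md, so treat them as assumptions and check the doc before running anything. The paths and the `build_convert_command` helper are hypothetical placeholders.

```python
import shlex


def build_convert_command(in_dir, out_dir, work_dir, bpw, head_bits=None):
    """Assemble an argument list for exllamav3's convert.py.

    Assumption: flag names are as recalled from doc/convert.md;
    verify them against the linked doc before use.
    """
    cmd = [
        "python", "convert.py",
        "-i", in_dir,     # unquantized input model directory
        "-o", out_dir,    # output directory for the exl3 model
        "-w", work_dir,   # scratch directory for intermediate state
        "-b", str(bpw),   # target bits per weight
    ]
    if head_bits is not None:
        cmd += ["-hb", str(head_bits)]  # bits for the output head layer
    return cmd


# Hypothetical paths; 5.0 bpw with an 8-bit head mirrors the "5bpw-hb8" naming.
cmd = build_convert_command(
    "/models/GLM-4-32B-0414",
    "/models/GLM-4-32B-0414-exl3",
    "/tmp/exl3-work",
    5.0,
    head_bits=8,
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell from the exllamav3 checkout, or the list can be passed to `subprocess.run` once the flags are confirmed.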
GLM-4-32B-0414 exl3 6bpw (not exl2) is already quantized: https://huggingface.co/owentruong/GLM-4-32B-0414-EXL3/tree/6.0
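A rough sketch of pulling that specific branch with `huggingface_hub`, assuming the `6.0` in the URL is a branch name that can be passed as `revision`. The helper names and the local directory are made up for illustration.

```python
from urllib.parse import urlparse


def parse_hf_tree_url(url):
    """Split a huggingface.co '<owner>/<repo>/tree/<revision>' URL
    into (repo_id, revision)."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) < 4 or parts[2] != "tree":
        raise ValueError(f"not a Hugging Face tree URL: {url}")
    return "/".join(parts[:2]), "/".join(parts[3:])


def download_quant(url, local_dir):
    """Download one branch of a repo. Requires `pip install huggingface_hub`."""
    # Imported here so the URL parsing above works without the package installed.
    from huggingface_hub import snapshot_download

    repo_id, revision = parse_hf_tree_url(url)
    return snapshot_download(repo_id=repo_id, revision=revision, local_dir=local_dir)


repo_id, revision = parse_hf_tree_url(
    "https://huggingface.co/owentruong/GLM-4-32B-0414-EXL3/tree/6.0"
)
print(repo_id, revision)
# To actually fetch the files (hypothetical target directory):
# download_quant("https://huggingface.co/owentruong/GLM-4-32B-0414-EXL3/tree/6.0",
#                "./GLM-4-32B-0414-EXL3-6.0")
```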
GLM-4-32B models are not supported in Exllamav2. Turboderp, the creator of Exllama, is now mostly focused on v3, so support most likely won't be added to v2. It's worth switching to v3 if your GPU is Ada or Blackwell based (RTX 40xx / 50xx). Performance is still an issue on Ampere (RTX 30xx), but it's being worked on. It's not horrible, and some people run exl3 on an RTX 3090; performance is expected to improve in the future.
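If you're not sure which generation your cards belong to, a small sketch like this maps CUDA compute capability to the names used above (RTX 30xx reports 8.6, RTX 40xx reports 8.9, RTX 50xx reports 12.0). The function names are my own, and the local check assumes PyTorch with CUDA is installed.

```python
def gpu_generation(major, minor):
    """Map a CUDA compute capability to a rough architecture name."""
    if (major, minor) == (8, 9):
        return "Ada"                  # RTX 40xx
    if major == 8:
        return "Ampere"               # RTX 30xx (8.6), A100 (8.0)
    if major == 9:
        return "Hopper"
    if major >= 10:
        return "Blackwell or newer"   # RTX 50xx reports 12.0
    return "pre-Ampere"


def describe_local_gpus():
    """List local GPUs with their generation. Requires PyTorch with CUDA."""
    import torch

    lines = []
    for i in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(i)
        name = torch.cuda.get_device_name(i)
        lines.append(f"{name}: sm_{major}{minor} ({gpu_generation(major, minor)})")
    return lines


print(gpu_generation(8, 6))
```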
OK, I didn't know that these models aren't supported in exllamav2.
I'm interested in testing GLM-Z1-Rumination-32B-0414, since it uses a deeper thinking process, if I understood correctly. If I use several RTX 3xxx cards on PCIe x1 connections, will that dramatically slow down the conversion, or is it better to use 2 GPUs on PCIe x16 (I don't have more x16 slots)? Do you know?
I have no information on that matter, sadly. However, you may reach out for help in Exllama's Discord server: https://discord.gg/NSFwVuCjRq
Surely there will be those who know. Turboderp is there, too.