Instructions to use ArliAI/GLM-4.6-Derestricted-v3 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use ArliAI/GLM-4.6-Derestricted-v3 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ArliAI/GLM-4.6-Derestricted-v3")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Print the result; a bare call discards the output when run as a script
print(pipe(messages))

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ArliAI/GLM-4.6-Derestricted-v3")
model = AutoModelForCausalLM.from_pretrained("ArliAI/GLM-4.6-Derestricted-v3")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use ArliAI/GLM-4.6-Derestricted-v3 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ArliAI/GLM-4.6-Derestricted-v3"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ArliAI/GLM-4.6-Derestricted-v3",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
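
The curl call above can also be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the helper names `build_chat_request` and `send_chat_request` are hypothetical, and the base URL and model name are simply copied from the curl example, so adjust them for your deployment.

```python
import json
import urllib.request

# Assumed defaults, copied from the curl example above.
BASE_URL = "http://localhost:8000/v1"
MODEL = "ArliAI/GLM-4.6-Derestricted-v3"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the assistant's reply text.

    Assumes the server answers in the OpenAI chat completions format.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `send_chat_request(build_chat_request("What is the capital of France?"))` should return the model's answer as a string. Keeping request construction separate from sending makes the payload easy to inspect without a live server.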

Use Docker

docker model run hf.co/ArliAI/GLM-4.6-Derestricted-v3

SGLang

How to use ArliAI/GLM-4.6-Derestricted-v3 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ArliAI/GLM-4.6-Derestricted-v3" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ArliAI/GLM-4.6-Derestricted-v3",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ArliAI/GLM-4.6-Derestricted-v3" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ArliAI/GLM-4.6-Derestricted-v3",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use ArliAI/GLM-4.6-Derestricted-v3 with Docker Model Runner:
```
docker model run hf.co/ArliAI/GLM-4.6-Derestricted-v3
```

Please derestrict this model

by drmcbride - opened Dec 5, 2025

Discussion

drmcbride

Dec 5, 2025

mistralai/Ministral-3-14B-Instruct-2512
everyone would appreciate it a lot :)

Infiper

Dec 10, 2025

Hello Arli AI,

(~25B to 40B models are currently a good range for enthusiasts doing GPU inference. For pure GPU inference, I suppose the dense releases mentioned below are interesting for 48 GB to 64 GB of VRAM, without any offloading to RAM.)

Thank you for your efforts! I wanted to ask whether you could derestrict "Qwen/Qwen3-VL-32B-Thinking". If I am not mistaken, "Norm-Preserving Biprojected Abliteration" (NPBA) should remove censorship from thinking models as well, or does it only work on Instruct (non-thinking) models? As far as I know, the only abliterated version of that Qwen model is "prithivMLmods/Qwen3-VL-32B-Thinking-abliterated-v1", which perhaps uses a less optimal abliteration technique than the NPBA approach.
The model "vprilepskii/Seed-OSS-36B-Instruct-biprojected-norm-preserving-abliterated" seems to use the new NPBA method, though I am not sure whether vprilepskii implemented it correctly (I haven't tested it yet). I first had to do some research to find out that he apparently used the same method as you but wrote out the technical term instead of your "marketing" name, "Derestricted", which led to a bit of confusion. I think formalizing the different "official" abliteration techniques would help the community tell the approaches apart and make it easier to measure and benchmark their effects against each other.
Sampler settings (as in SillyTavern: top_p, top_k, temperature, DRY, etc.) are always appreciated when they are included on the model page. There are myriads of settings, and I think much of the community's confusion comes from the lack of a main go-to "hub" for LLM sampler settings (in general, or for SillyTavern roleplay). Wrong settings change the LLM's behavior completely and cause problems if you don't know what you are doing. I hope there will be a community effort to centralize sampler settings for LLMs, or that this gets implemented in Hugging Face or pushed by the community in some way. Perhaps you could advocate for that, since you are "famous" on LocalLLaMA on Reddit, so that sampler settings become clearer for anyone who downloads these models from Hugging Face.
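
To illustrate what publishing sampler settings alongside a model could look like in practice, here is a minimal sketch that merges a preset into a chat completions payload. Everything here is an assumption for illustration: the helper `with_sampler_settings` is hypothetical, and the numbers in `EXAMPLE_PRESET` are placeholders, not recommended settings for this or any other model. `temperature` and `top_p` are standard OpenAI-style fields; `top_k` is, to my knowledge, accepted as an extra field by servers such as vLLM and SGLang, but check your server's documentation.

```python
# Placeholder values for illustration only, NOT recommended settings.
EXAMPLE_PRESET = {
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 40,
}

# Sampler fields this sketch knows how to pass through.
ALLOWED_SAMPLER_KEYS = {"temperature", "top_p", "top_k"}


def with_sampler_settings(payload, preset):
    """Return a copy of a chat completions payload with sampler fields merged in.

    Raises ValueError for keys outside ALLOWED_SAMPLER_KEYS, so a typo in a
    preset fails loudly instead of being silently sent to the server.
    """
    unknown = set(preset) - ALLOWED_SAMPLER_KEYS
    if unknown:
        raise ValueError(f"unsupported sampler settings: {sorted(unknown)}")
    # Build a new dict so the caller's payload is left unchanged.
    return {**payload, **preset}
```

A preset like this could be shipped as a small JSON file on the model page and merged into each request body before it is sent to the server.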
