Instructions to use ArliAI/GLM-4.6-Derestricted-v3 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use ArliAI/GLM-4.6-Derestricted-v3 with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ArliAI/GLM-4.6-Derestricted-v3")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ArliAI/GLM-4.6-Derestricted-v3")
model = AutoModelForCausalLM.from_pretrained("ArliAI/GLM-4.6-Derestricted-v3")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use ArliAI/GLM-4.6-Derestricted-v3 with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "ArliAI/GLM-4.6-Derestricted-v3"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "ArliAI/GLM-4.6-Derestricted-v3",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
Use Docker
docker model run hf.co/ArliAI/GLM-4.6-Derestricted-v3
- SGLang
How to use ArliAI/GLM-4.6-Derestricted-v3 with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ArliAI/GLM-4.6-Derestricted-v3" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "ArliAI/GLM-4.6-Derestricted-v3",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ArliAI/GLM-4.6-Derestricted-v3" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "ArliAI/GLM-4.6-Derestricted-v3",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
- Docker Model Runner
How to use ArliAI/GLM-4.6-Derestricted-v3 with Docker Model Runner:
docker model run hf.co/ArliAI/GLM-4.6-Derestricted-v3
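The curl calls in the vLLM and SGLang sections can also be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `ask` are hypothetical and not part of either server, and it assumes a server is already listening on the port shown in the examples above (8000 for vLLM, 30000 for SGLang).

```python
import json
import urllib.request

MODEL_ID = "ArliAI/GLM-4.6-Derestricted-v3"


def build_chat_request(base_url, prompt, model=MODEL_ID):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url, prompt, timeout=120.0):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes the OpenAI-style response layout: choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With a server running, `ask("http://localhost:8000", "What is the capital of France?")` would target the vLLM example, and `http://localhost:30000` the SGLang one.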
Please derestrict this model
mistralai/Ministral-3-14B-Instruct-2512
everyone would appreciate it a lot :)
Hello Arli AI,
(Models in the ~25B-40B range are currently a good fit for "enthusiast" GPU inference. For pure GPU inference, I suppose the dense releases below are interesting for 48-64 GB of VRAM without any offloading to RAM.)
Thank you for your efforts! I wanted to ask whether you could derestrict "Qwen/Qwen3-VL-32B-Thinking". If I am not mistaken, "Norm-Preserving Biprojected Abliteration" should remove censorship from thinking models as well, or does it only work on instruct (non-thinking) models? As far as I know, the only abliterated version of that Qwen model is "prithivMLmods/Qwen3-VL-32B-Thinking-abliterated-v1", which perhaps uses a less optimal abliteration technique than the NPBA approach.
The model "vprilepskii/Seed-OSS-36B-Instruct-biprojected-norm-preserving-abliterated" seems to use the new NPBA method, though I am not sure whether vprilepskii implemented it correctly (I haven't tested it yet). I first had to do some research to find out that he apparently used the same method as you but wrote the technical term instead of your "marketing" name, "Derestricted", which led to a bit of confusion. I think formalizing the different "official" abliteration techniques would help the community tell the approaches apart and make it easier to measure and benchmark their effects against each other.
Sampler settings (as in SillyTavern: top_p, top_k, temperature, DRY, etc.) are always appreciated when included somewhere on the model page. There are myriad settings, and I think much of the community's confusion comes from there being no main go-to "hub" for LLM sampler settings (in general, or for SillyTavern roleplaying). Wrong settings change the LLM's behavior completely and cause problems if you don't know what you are doing. I hope there will be a community effort to centralize sampler settings information for LLMs, or that it gets implemented on Hugging Face or pushed by the community somehow. Perhaps you could advocate for that, since you are "famous" on LocalLLaMA on Reddit, so that sampler settings become clearer for anyone who downloads these models from Hugging Face.
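To illustrate what a shareable sampler preset could look like, here is a rough sketch that validates a few settings and merges them into an OpenAI-style chat request body. The `SamplerPreset` class is hypothetical, and the numbers in the usage note are placeholders, not recommended settings for this model. Note that `top_k` is not part of the OpenAI chat-completions schema; vLLM's server documents it as an extra sampling parameter, and support on other backends is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplerPreset:
    """Hypothetical container for sampler settings (illustrative only)."""

    temperature: float = 1.0
    top_p: float = 1.0
    top_k: int = -1  # -1 means "disabled" in this sketch

    def __post_init__(self):
        # Reject values that would silently change model behavior.
        if self.temperature < 0:
            raise ValueError("temperature must be >= 0")
        if not 0 < self.top_p <= 1:
            raise ValueError("top_p must be in (0, 1]")
        if self.top_k != -1 and self.top_k < 1:
            raise ValueError("top_k must be -1 (disabled) or >= 1")

    def apply_to(self, payload):
        """Return a copy of a chat request body with the sampler fields set."""
        merged = dict(payload)
        merged["temperature"] = self.temperature
        merged["top_p"] = self.top_p
        if self.top_k != -1:
            # Backend-specific extra field; omitted when disabled.
            merged["top_k"] = self.top_k
        return merged
```

For example, `SamplerPreset(temperature=0.7, top_p=0.9, top_k=40).apply_to({"model": "ArliAI/GLM-4.6-Derestricted-v3", "messages": [...]})` yields a request body carrying those three fields, which a model page could publish alongside the weights.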