Instructions for using allenai/Olmo-3-32B-Think with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use allenai/Olmo-3-32B-Think with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="allenai/Olmo-3-32B-Think")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("allenai/Olmo-3-32B-Think")
model = AutoModelForCausalLM.from_pretrained("allenai/Olmo-3-32B-Think")

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use allenai/Olmo-3-32B-Think with vLLM:
Install from pip and serve the model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "allenai/Olmo-3-32B-Think"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "allenai/Olmo-3-32B-Think",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```
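Since the server speaks the OpenAI API, you can also call it from Python with the official `openai` client. A minimal sketch, assuming `pip install openai` (the placeholder API key is arbitrary for a local server):

```python
# Minimal sketch: call the local vLLM server with the OpenAI Python client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="allenai/Olmo-3-32B-Think",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```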
Use Docker

```shell
docker model run hf.co/allenai/Olmo-3-32B-Think
```
- SGLang
How to use allenai/Olmo-3-32B-Think with SGLang:
Install from pip and serve the model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "allenai/Olmo-3-32B-Think" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "allenai/Olmo-3-32B-Think",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```
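The same endpoint can be called from Python over plain HTTP; a minimal sketch using `requests` (assumes `pip install requests` and the server above running on port 30000):

```python
# Minimal sketch: query the local SGLang server's OpenAI-compatible API.
import requests

response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "allenai/Olmo-3-32B-Think",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
    },
)
print(response.json()["choices"][0]["message"]["content"])
```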
Use Docker images

```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "allenai/Olmo-3-32B-Think" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "allenai/Olmo-3-32B-Think",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

- Docker Model Runner
How to use allenai/Olmo-3-32B-Think with Docker Model Runner:
```shell
docker model run hf.co/allenai/Olmo-3-32B-Think
```
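Docker Model Runner also exposes an OpenAI-compatible endpoint when host TCP access is enabled. The port and path below follow Docker's documented defaults, but treat them as assumptions to verify against your own setup:

```python
# Hedged sketch: query Docker Model Runner's OpenAI-compatible API.
# Port 12434 and the /engines/v1 path are Docker's documented defaults
# when host TCP access is enabled -- verify them for your installation.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:12434/engines/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="hf.co/allenai/Olmo-3-32B-Think",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```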
So far so good, but the CoT rambling is way too much (+ a question)
Congrats on the release! I like the writing style; it feels a lot less artificial than most other recent models. The model feels okay for its size, though obviously I haven't had a chance to really test it much yet. But so far I like it.
My only pet peeve is the CoT. Either I'm using the wrong sampling settings (I tested a few) or you haven't tuned against endless / looping CoT. It too often runs into the thousands (even plural, on occasion) of tokens for simple tasks. It's really wasteful in that regard. I thought Qwen was bad, but this is a whole level above it. Was the model trained with a native way to disable CoT, like /nothink in Qwen? Normally I'd prefill the response with the start/end think tags and nothing in between, but sadly that doesn't work most of the time with your model (blank generation). Or is there maybe a reasoning "effort" setting?
Another thing: looking at the Jinja template, you have an "environment" role set up. Was it used during training, and if so, what's its purpose? Is it a form of system message?
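For reference, a quick way to see how the template handles that role is to render it without tokenizing; this is just an inspection sketch, and whether the "environment" content below is a meaningful input is exactly the open question:

```python
# Inspection sketch: render the chat template as text to see how (or
# whether) an "environment" turn is serialized. The content is hypothetical.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/Olmo-3-32B-Think")

messages = [
    {"role": "environment", "content": "You have access to a Python interpreter."},
    {"role": "user", "content": "What is 2 + 2?"},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # inspect the serialized "environment" turn
```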
Edit: prefilling responses with the snippet below seems to help modulate the reasoning effort. No idea how damaging it is to the CoT quality, but it's worth sharing.
```
<think>\nOkay, I'll keep my thinking short.
```
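A minimal sketch of applying that prefill with Transformers. `continue_final_message` is a standard `apply_chat_template` argument; how the Olmo 3 template treats a raw `<think>` tag inside assistant content is an assumption here, so verify the rendered prompt on your end:

```python
# Hedged sketch: prefill the assistant turn so generation continues after
# the short-thinking hint instead of starting a fresh CoT.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("allenai/Olmo-3-32B-Think")
model = AutoModelForCausalLM.from_pretrained("allenai/Olmo-3-32B-Think")

messages = [
    {"role": "user", "content": "What is the capital of France?"},
    # The prefill from above; generation resumes right after this text.
    {"role": "assistant", "content": "<think>\nOkay, I'll keep my thinking short."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    continue_final_message=True,  # leave the assistant turn open
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```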
We're working on this for future versions, @SerialKicked! I agree it is a big-time yapper, especially for easy queries. Its length is more typical for math + coding + reasoning queries.
For what it's worth, the longer the total context gets, the worse it seems to be. It's not as bad in one-shot interactions.
Still, very nice; it's quite an achievement. A fully open 32B CoT model wasn't on my bingo card this year. Good luck to you all.