KingNish/reasoning-base-20k
Viewer • Updated • 19.9k • 490 • 231
How to use lunahr/thea-3b-25r with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="lunahr/thea-3b-25r")
messages = [
{"role": "user", "content": "Who are you?"},
]
pipe(messages) # Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("lunahr/thea-3b-25r")
model = AutoModelForCausalLM.from_pretrained("lunahr/thea-3b-25r")
messages = [
{"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))How to use lunahr/thea-3b-25r with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "lunahr/thea-3b-25r"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "lunahr/thea-3b-25r",
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
]
}'docker model run hf.co/lunahr/thea-3b-25r
How to use lunahr/thea-3b-25r with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "lunahr/thea-3b-25r" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "lunahr/thea-3b-25r",
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
]
}'docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "lunahr/thea-3b-25r" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "lunahr/thea-3b-25r",
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
]
}'How to use lunahr/thea-3b-25r with Docker Model Runner:
docker model run hf.co/lunahr/thea-3b-25r
An uncensored reasoning Llama 3.2 3B model trained on reasoning data.
It has been trained using improved training code, and gives an improved performance. Here is what inference code you should use:
from transformers import AutoModelForCausalLM, AutoTokenizer
MAX_REASONING_TOKENS = 1024
MAX_RESPONSE_TOKENS = 512
model_name = "lunahr/thea-3b-25r"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = "Which is greater 9.9 or 9.11 ??"
messages = [
{"role": "user", "content": prompt}
]
# Generate reasoning
reasoning_template = tokenizer.apply_chat_template(messages, tokenize=False, add_reasoning_prompt=True)
reasoning_inputs = tokenizer(reasoning_template, return_tensors="pt").to(model.device)
reasoning_ids = model.generate(**reasoning_inputs, max_new_tokens=MAX_REASONING_TOKENS)
reasoning_output = tokenizer.decode(reasoning_ids[0, reasoning_inputs.input_ids.shape[1]:], skip_special_tokens=True)
print("REASONING: " + reasoning_output)
# Generate answer
messages.append({"role": "reasoning", "content": reasoning_output})
response_template = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
response_inputs = tokenizer(response_template, return_tensors="pt").to(model.device)
response_ids = model.generate(**response_inputs, max_new_tokens=MAX_RESPONSE_TOKENS)
response_output = tokenizer.decode(response_ids[0, response_inputs.input_ids.shape[1]:], skip_special_tokens=True)
print("ANSWER: " + response_output)
This Llama model was trained faster than Unsloth using custom training code.
Visit https://www.kaggle.com/code/piotr25691/distributed-llama-training-with-2xt4 to find out how you can finetune your models using BOTH of the Kaggle provided GPUs.
Base model
chuanli11/Llama-3.2-3B-Instruct-uncensored