The snippets below show how to use Lakshan2003/SmolLM3-3B-instruct-customerservice with common libraries and local apps.

Libraries

PEFT

How to use Lakshan2003/SmolLM3-3B-instruct-customerservice with PEFT:

from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM3-3B")
model = PeftModel.from_pretrained(base_model, "Lakshan2003/SmolLM3-3B-instruct-customerservice")

Transformers

How to use Lakshan2003/SmolLM3-3B-instruct-customerservice with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Lakshan2003/SmolLM3-3B-instruct-customerservice")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

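# Load model directly
# AutoModelForCausalLM keeps the language-modeling head that text generation needs;
# plain AutoModel would return only the base transformer without it.
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Lakshan2003/SmolLM3-3B-instruct-customerservice", dtype="auto")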

Local Apps

vLLM

How to use Lakshan2003/SmolLM3-3B-instruct-customerservice with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Lakshan2003/SmolLM3-3B-instruct-customerservice"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Lakshan2003/SmolLM3-3B-instruct-customerservice",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
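
The same request can be issued from Python. The following is a minimal sketch using only the standard library, assuming the server started above is listening on `localhost:8000`; the helper names `build_chat_request` and `chat` are hypothetical and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "Lakshan2003/SmolLM3-3B-instruct-customerservice"


def build_chat_request(user_message: str,
                       base_url: str = "http://localhost:8000",
                       model: str = MODEL_ID) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_message: str) -> str:
    """Send the request and return the assistant text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(user_message)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("What is the capital of France?")` mirrors the curl command; pass a different `base_url` to `build_chat_request` if the server runs on another port.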

Use Docker

docker model run hf.co/Lakshan2003/SmolLM3-3B-instruct-customerservice

SGLang

How to use Lakshan2003/SmolLM3-3B-instruct-customerservice with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Lakshan2003/SmolLM3-3B-instruct-customerservice" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Lakshan2003/SmolLM3-3B-instruct-customerservice",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Lakshan2003/SmolLM3-3B-instruct-customerservice" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Lakshan2003/SmolLM3-3B-instruct-customerservice",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Unsloth Studio

How to use Lakshan2003/SmolLM3-3B-instruct-customerservice with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Lakshan2003/SmolLM3-3B-instruct-customerservice to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Lakshan2003/SmolLM3-3B-instruct-customerservice to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Lakshan2003/SmolLM3-3B-instruct-customerservice to start chatting

Load model with FastModel

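# Install Unsloth first (shell command, run outside Python):
#   pip install unsloth
from unsloth import FastModel

# Load the model and its tokenizer with a 2048-token context window
model, tokenizer = FastModel.from_pretrained(
    model_name="Lakshan2003/SmolLM3-3B-instruct-customerservice",
    max_seq_length=2048,
)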

Docker Model Runner
How to use Lakshan2003/SmolLM3-3B-instruct-customerservice with Docker Model Runner:
```
docker model run hf.co/Lakshan2003/SmolLM3-3B-instruct-customerservice
```

SmolLM3-3B-instruct-customerservice

This model is a QLoRA fine-tuned version of HuggingFaceTB/SmolLM3-3B-Instruct on a context-summarized multi-turn customer-service QA dataset for banking domain conversations.

Model Description

This is a QLoRA (Quantized Low-Rank Adaptation) fine-tuned version of SmolLM3-3B-Instruct optimized for multi-turn customer-service question answering with context summarization. The model was trained on synthetic banking customer-service conversations with history summarization to preserve essential conversational context while maintaining dialogue continuity.

Base Model: HuggingFaceTB/SmolLM3-3B-Instruct
Parameters: ~3 billion
Fine-tuning Method: QLoRA (4-bit quantization + LoRA)
Domain: Customer Service (Banking)
Task: Context-Summarized Multi-Turn Question Answering
Note: Reasoning mode was disabled during training and inference, so the model does not emit thinking tags
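
To make the LoRA part of QLoRA concrete, here is a small arithmetic sketch. It uses the rank (16) and alpha (32) listed under Training Hyperparameters; the 2048 x 2048 projection size is a hypothetical value chosen for illustration and is not taken from this card.

```python
# LoRA keeps the frozen weight W and learns a low-rank update:
#   W' = W + (alpha / r) * B @ A,  with A of shape (r, d_in) and B of shape (d_out, r)
R = 16       # LoRA rank from this card
ALPHA = 32   # LoRA alpha from this card


def lora_trainable_params(d_in: int, d_out: int, r: int = R) -> int:
    """Trainable parameters added to one projection: A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


scaling = ALPHA / R  # factor applied to the low-rank update

# Hypothetical square projection, for illustration only
D = 2048
full = D * D
lora = lora_trainable_params(D, D)

print(scaling, lora, lora / full)  # 2.0 65536 0.015625
```

Under these assumptions each adapted projection trains about 1.6% of the parameters a full update of that matrix would touch.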

Intended Uses & Limitations

Intended Uses

Multi-turn customer service conversations in banking domain
Context-aware response generation with dialogue continuity
Real-time customer support automation
Efficient deployment on resource-constrained hardware
Privacy-preserving on-premise deployment

Limitations

Primarily trained on banking domain data; may require adaptation for other sectors
Performance figures are based on synthetic data; real-world results may differ
Requires context summarization for optimal performance
Maximum sequence length: 512 tokens
Lower performance compared to other 3B models (LLaMA, Qwen, Phi)
Struggles with dialogue continuity and contextual alignment

Training Data

Dataset: Synthetic context-summarized multi-turn customer-service QA dataset
Source: Derived from TalkMap Banking Conversation Corpus
Size: 128,335 training instances, 18,333 validation instances
Conversation Turns: 2-53 turns per conversation (avg: 10.06)
Context Strategy: History summarization using GPT-4o-mini
Response Refinement: GPT-4.1-based response quality enhancement

Training Procedure

Training Configuration

Framework: Unsloth + Hugging Face Transformers
Fine-tuning Method: QLoRA (4-bit quantization)
Hardware: NVIDIA A100 40GB GPU
Training Time: 5-14 hours

Training Hyperparameters

Max Sequence Length: 512 tokens
Quantization: 4-bit precision
LoRA Rank (r): 16
LoRA Alpha: 32
LoRA Dropout: 0.1
LoRA Target Modules: All attention and feed-forward projection layers
Epochs: 3
Optimizer: AdamW 8-bit
Learning Rate: 2e-5
Weight Decay: 0.01
Warmup Ratio: 0.05
LR Scheduler: Cosine
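
The learning-rate settings above (peak 2e-5, warmup ratio 0.05, cosine scheduler) can be sketched as a simple function of the training step. This is an illustration of the usual warmup-plus-cosine shape, not the trainer's exact implementation. The effective batch size is not stated in this card, so `ASSUMED_BATCH_SIZE` is a hypothetical value.

```python
import math

PEAK_LR = 2e-5             # Learning Rate from this card
WARMUP_RATIO = 0.05        # Warmup Ratio from this card
TRAIN_INSTANCES = 128_335  # training set size from this card
EPOCHS = 3                 # Epochs from this card
ASSUMED_BATCH_SIZE = 16    # hypothetical: not stated in the card

total_steps = math.ceil(TRAIN_INSTANCES / ASSUMED_BATCH_SIZE) * EPOCHS
warmup_steps = int(total_steps * WARMUP_RATIO)


def lr_at(step: int, total: int = total_steps, warmup: int = warmup_steps) -> float:
    """Linear warmup from 0 to PEAK_LR, then cosine decay back to 0."""
    if step < warmup:
        return PEAK_LR * step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, lr_at(step))
```

With the assumed batch size, training runs for 24,063 optimizer steps, of which the first 1,203 are warmup.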

Inference Parameters

generation_config = {
    "max_new_tokens": 128,
    "temperature": 0.6,
    "do_sample": True,
    "top_p": 0.95,
    "top_k": 50,
}
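
To show what these sampling parameters do, the sketch below applies temperature, top-k, and top-p filtering to a toy list of logits. It is a simplified illustration in plain Python, not the Transformers implementation, and `filter_probs` is a hypothetical helper name.

```python
import math


def filter_probs(logits, temperature=0.6, top_k=50, top_p=0.95):
    """Return {token_index: probability} for the tokens that survive filtering."""
    # Temperature below 1 sharpens the distribution before any filtering.
    scaled = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring tokens.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the surviving tokens (shifted by the max for numerical stability).
    peak = scaled[order[0]]
    weights = {i: math.exp(scaled[i] - peak) for i in order}
    total = sum(weights.values())
    probs = {i: w / total for i, w in weights.items()}
    # Top-p: keep the smallest high-probability prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1 before sampling.
    return {i: probs[i] / mass for i in kept}


print(filter_probs([2.0, 1.0, 0.0, -1.0]))
```

For this toy input only the two most likely tokens survive: together they already cover more than 95% of the probability mass, so the tail is cut before sampling.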

Usage Example

Installation

pip install unsloth transformers peft torch

Loading the Model

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B-Instruct",
    device_map="auto",
    torch_dtype=torch.float16,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B-Instruct")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "Lakshan2003/SmolLM3-3B-instruct-customerservice")

# Merge adapter (optional, for deployment)
model = model.merge_and_unload()
model.eval()

Inference

# Prompt template (adjust for SmolLM format)
prompt_template = """<|im_start|>system
{instruction}<|im_end|>
<|im_start|>user
Conversation History:
{history}

Client Question:
{client_question}<|im_end|>
<|im_start|>assistant
"""

# Example conversation
instruction = "You are a professional call-center customer service agent working at Optimal Financial Partners. Review the conversation history and any provided context (if available). Make sure your response is consistent with the conversation history (names, issues, and actions already taken). If no history is given, treat the client’s message as the start of the conversation. Continue the dialogue as the agent by giving a clear, helpful, and professional response. Responses should sound natural and human-like, like a real phone call, and usually be few short sentences. Provide more detail when the client’s request clearly requires it."
history = "Kathrine has contacted Almira from Optimal Financial Partners regarding unexpected charges on her statement and her rights as a consumer. Almira confirmed that as a customer, Kathrine has the right to dispute any unauthorized or incorrect charges. Almira offered to investigate any charges Kathrine believes are incorrect. No specific charges, amounts, or account identifiers have been mentioned, and no verification steps have been completed or are pending at this time. The conversation is currently focused on explaining consumer rights and the process for disputing charges."
client_question = "That's great to know. What if I'm not satisfied with the outcome of the investigation?"

# Format input
input_text = prompt_template.format(
    instruction=instruction,
    history=history,
    client_question=client_question
)

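# Tokenize
# Note: truncation cuts from the right by default, so a prompt longer than 512 tokens
# would lose the trailing "<|im_start|>assistant" tag; keep the history summary short.
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512).to(model.device)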

# Generate
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        temperature=0.6,
        do_sample=True,
        top_p=0.95,
        top_k=50,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode response
input_length = inputs.input_ids.shape[1]
response = tokenizer.decode(outputs[0][input_length:], skip_special_tokens=True).strip()
print(response)
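
For repeated calls, the prompt formatting can be wrapped in a small helper. The sketch below reuses the template shown above; `build_prompt` is a hypothetical name. The card does not say how an empty history was formatted in training, so leaving the history section blank is an assumption.

```python
PROMPT_TEMPLATE = (
    "<|im_start|>system\n"
    "{instruction}<|im_end|>\n"
    "<|im_start|>user\n"
    "Conversation History:\n"
    "{history}\n"
    "\n"
    "Client Question:\n"
    "{client_question}<|im_end|>\n"
    "<|im_start|>assistant\n"
)


def build_prompt(instruction: str, client_question: str, history: str = "") -> str:
    """Fill the chat template; an empty history marks the start of a conversation."""
    return PROMPT_TEMPLATE.format(
        instruction=instruction.strip(),
        history=history.strip(),  # assumption: blank section when there is no history
        client_question=client_question.strip(),
    )


# First turn of a conversation: no summarized history yet
first_turn = build_prompt(
    instruction="You are a professional call-center customer service agent.",
    client_question="Hi, I noticed a charge I don't recognize on my statement.",
)
print(first_turn)
```

On later turns, pass the running summary of the conversation as `history`, as in the example above.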

Framework Versions

PEFT: 0.14.0
Transformers: 4.47.0
PyTorch: 2.5.1+cu121
Unsloth: Latest (training framework)

Citation

If you use this model, please cite:

@article{cooray2026small,
  title={Can Small Language Models Handle Context-Summarized Multi-Turn Customer-Service QA? A Synthetic Data-Driven Comparative Evaluation},
  author={Cooray, Lakshan and Sumanathilaka, Deshan and Raju, Pattigadapa Venkatesh},
  journal={arXiv preprint arXiv:2602.00665},
  year={2026}
}