Llama 3.2 1B E-commerce Intent (GPTQ 4-bit)

This model is a fine-tuned version of meta-llama/Llama-3.2-1B trained for e-commerce intent detection. Given a product catalog and a user's request, it outputs a structured JSON object with the user's intent (add or remove), the product name, and the quantity.
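As a rough illustration of that output shape, the sketch below models the intent as a small Python dataclass and checks the three fields. The field names `action`, `product`, and `quantity` come from the expected output shown in this card; the `Intent` class and the `parse_intent` helper are hypothetical names introduced here and are not part of the model or of any library.

```python
import json
from dataclasses import dataclass

ALLOWED_ACTIONS = {"add", "remove"}  # the two intents this card describes


@dataclass(frozen=True)
class Intent:
    action: str
    product: str
    quantity: int


def parse_intent(raw: str) -> Intent:
    """Parse a JSON string into an Intent, raising ValueError on bad fields."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    action = data.get("action")
    product = data.get("product")
    quantity = data.get("quantity")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unexpected action: {action!r}")
    if not isinstance(product, str) or not product:
        raise ValueError(f"unexpected product: {product!r}")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(quantity, bool) or not isinstance(quantity, int) or quantity < 1:
        raise ValueError(f"unexpected quantity: {quantity!r}")
    return Intent(action, product, quantity)


intent = parse_intent('{"action": "remove", "product": "Headphones", "quantity": 4}')
print(intent.action, intent.product, intent.quantity)
```

Requiring a positive integer quantity is an assumption of this sketch; the card does not say how the model handles requests without an explicit quantity.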

This version is quantized to 4-bit with GPTQ, which reduces the memory needed for the weights and can speed up inference. The QLoRA adapter is already merged into the GPTQ model, so no separate adapter needs to be loaded.
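A back-of-envelope calculation gives a feel for the memory saving. This is only a sketch: it assumes a nominal 1 billion parameters, and it ignores GPTQ's per-group scales and zero points, any layers kept in 16-bit, and runtime memory such as activations and the KV cache. The `weight_bytes` helper is a hypothetical name used for illustration.

```python
def weight_bytes(n_params: int, bits_per_weight: int) -> float:
    """Approximate storage for the weights alone, in bytes."""
    return n_params * bits_per_weight / 8


n_params = 1_000_000_000  # nominal parameter count (assumption)

fp16_gb = weight_bytes(n_params, 16) / 1e9
int4_gb = weight_bytes(n_params, 4) / 1e9

print(f"16-bit weights: {fp16_gb:.1f} GB")  # 2.0 GB
print(f"4-bit weights:  {int4_gb:.1f} GB")  # 0.5 GB
```

The real checkpoint will be somewhat larger than the 4-bit figure because of the overheads listed above.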

Model Description

The base model, Llama 3.2 1B, was fine-tuned with QLoRA on a synthetic dataset of 3000 examples. The training objective was to teach the model to skip conversational pleasantries and output only a JSON object that a backend system can parse directly to manage a shopping cart.
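To make the backend side concrete, here is a minimal sketch of how a parsed intent might be applied to a cart. The cart representation (a `Counter` keyed by product name) and the `apply_intent` function are assumptions for illustration; the card does not describe the actual backend.

```python
from collections import Counter


def apply_intent(cart: Counter, intent: dict) -> Counter:
    """Return a new cart with the intent applied; quantities never go below zero."""
    updated = Counter(cart)
    product = intent["product"]
    quantity = intent["quantity"]
    if intent["action"] == "add":
        updated[product] += quantity
    elif intent["action"] == "remove":
        updated[product] = max(0, updated[product] - quantity)
        if updated[product] == 0:
            del updated[product]  # drop products that are no longer in the cart
    else:
        raise ValueError(f"unsupported action: {intent['action']!r}")
    return updated


cart = Counter({"Headphones": 5})
cart = apply_intent(cart, {"action": "remove", "product": "Headphones", "quantity": 4})
print(dict(cart))  # {'Headphones': 1}
```

Clamping at zero instead of raising an error when the user removes more than the cart holds is a design choice of this sketch, not something the card specifies.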

Dataset

The model was fine-tuned on a custom synthetic dataset of 3000 examples.

You can access the dataset here: jtlicardo/ecommerce-intent-3k

Intended Use & Limitations

This model is designed for a specific task: parsing user requests in an e-commerce context. It should not be used as a general-purpose chatbot.

Primary Use: Backend service for intent detection from user text.
Out-of-Scope: General conversation, answering questions, or any task not related to adding/removing items from a list.

How to Use

The model expects a prompt in a specific format that follows the TinyLlama-Chat template, with <|user|> and <|assistant|> markers. The prompt must contain both the Catalog and the User request.
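The prompt layout can be wrapped in a small helper. The sketch below reproduces the string used in the inference example in this card; the function name `build_prompt` and its signature are hypothetical and chosen only for illustration.

```python
def build_prompt(products: list[str], user_query: str) -> str:
    """Assemble the prompt: a catalog block, the user request, then the assistant marker."""
    catalog = "Catalog:\n" + "\n".join(products)
    return f"<|user|>\n{catalog}\n\nUser:\n{user_query}\n<|assistant|>\n"


prompt = build_prompt(
    ["Shampoo (400ml bottle)", "Headphones"],
    "Please add 2 headphones to my cart",
)
print(prompt)
```

Keeping the format in one function helps avoid small deviations, such as a missing newline, from the structure the model saw during training.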

Important: You need to install optimum and auto-gptq to run this 4-bit GPTQ model.

pip install -q optimum auto-gptq transformers

Here's how to run inference in Python:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Model repository on the Hugging Face Hub
model_id = "jtlicardo/llama_3.2-1b-ecommerce-intent-gptq-4bit"

# Load the tokenizer and the 4-bit quantized model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16
)

# --- Define the prompt ---
catalog = """Catalog:
Shampoo (400ml bottle)
Hand Soap (250ml dispenser)
Peanut Butter (340g jar)
Headphones
Green Tea (25 tea bags)"""

user_query = "Could you please take off 4 pairs of headphons from my cart?"

# --- Format the prompt by hand in the structure the model was trained on ---
# The <|user|> and <|assistant|> markers are written out directly.
prompt = f"<|user|>\n{catalog}\n\nUser:\n{user_query}\n<|assistant|>\n"

# --- Generate the output ---
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
outputs = pipe(
    prompt,
    max_new_tokens=50,       # Max length of the JSON output
    do_sample=False,         # Use deterministic output
    temperature=None,        # Not needed for do_sample=False
    top_p=None,              # Not needed for do_sample=False
    return_full_text=False   # Only return the generated part
)

# The output should be a JSON string with no surrounding text
generated_json = outputs[0]['generated_text'].strip()
print(generated_json)
# Expected output:
# {"action": "remove", "product": "Headphones", "quantity": 4}
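Because the output comes from a language model, it may be worth checking it before acting on it. The following sketch decodes the generated string and confirms that the product appears in the catalog. The helper names `catalog_names` and `decode_intent` are hypothetical, and accepting a product name either with or without its parenthesised size is an assumption, since the card only shows an output for a product without a size suffix.

```python
import json
from typing import Optional


def catalog_names(catalog: str) -> set[str]:
    """Collect product names from the catalog text, with and without the size suffix."""
    names = set()
    for line in catalog.splitlines():
        line = line.strip()
        if not line or line == "Catalog:":
            continue
        names.add(line)                 # e.g. "Shampoo (400ml bottle)"
        names.add(line.split(" (")[0])  # e.g. "Shampoo"
    return names


def decode_intent(generated: str, catalog: str) -> Optional[dict]:
    """Return the intent as a dict, or None if the output is not usable."""
    try:
        intent = json.loads(generated)
    except json.JSONDecodeError:
        return None
    if not isinstance(intent, dict):
        return None
    product = intent.get("product")
    if not isinstance(product, str) or product not in catalog_names(catalog):
        return None
    return intent


example_catalog = "Catalog:\nShampoo (400ml bottle)\nHeadphones"
example_output = '{"action": "remove", "product": "Headphones", "quantity": 4}'
print(decode_intent(example_output, example_catalog))
```

Returning `None` lets the caller decide whether to retry, ask the user to rephrase, or log the failure.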

Training Procedure

This model was fine-tuned using the trl library's SFTTrainer.

Method: QLoRA (4-bit quantization with LoRA adapters)
Dataset: A custom JSONL file with 3000 prompt/completion pairs.
Configuration: completion_only_loss=True was used to ensure the model only learned to generate the assistant's JSON response.
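The card does not show the schema of the training file. As a sketch, assuming each JSONL line uses the `prompt` and `completion` field names from trl's prompt-completion dataset format, one record could be built as below. The `make_record` helper is hypothetical, and whether the real data appended an end-of-sequence token to the completion is not stated in the card.

```python
import json


def make_record(products: list[str], user_query: str,
                action: str, product: str, quantity: int) -> str:
    """Serialize one training example as a single JSONL line."""
    catalog = "Catalog:\n" + "\n".join(products)
    prompt = f"<|user|>\n{catalog}\n\nUser:\n{user_query}\n<|assistant|>\n"
    # The completion is the JSON object the model should learn to generate.
    completion = json.dumps({"action": action, "product": product, "quantity": quantity})
    return json.dumps({"prompt": prompt, "completion": completion})


line = make_record(
    ["Headphones", "Green Tea (25 tea bags)"],
    "Could you please take off 4 pairs of headphones from my cart?",
    "remove", "Headphones", 4,
)
print(line)
```

With `completion_only_loss=True`, only the tokens of the `completion` field contribute to the loss, so the model is not trained to reproduce the catalog or the user request.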
