Llama 3.2 1B E-commerce Intent (GPTQ 4-bit)
This is a fine-tuned version of meta-llama/Llama-3.2-1B trained specifically for e-commerce intent detection. Given a catalog of products and a user's request, it outputs a structured JSON object with three fields: the action (add or remove), the product name, and the quantity.
This version of the model is quantized to 4-bit using GPTQ, which reduces memory usage and can speed up inference. The QLoRA adapter was merged into the final GPTQ model, so no separate adapter loading is required.
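As a rough illustration of the memory side, the arithmetic below compares weight storage at 16 bits and 4 bits per parameter. The parameter count is an assumed round figure rather than a number from this card, and the helper name is hypothetical.

```python
def weight_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate storage for n_params weights at the given bit width."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 1.2e9  # assumption: round figure for a "1B"-class model

fp16 = weight_size_gib(N_PARAMS, 16)
int4 = weight_size_gib(N_PARAMS, 4)
print(f"fp16: {fp16:.2f} GiB, 4-bit: {int4:.2f} GiB")
```

This is an upper bound on the saving. Any layers left unquantized, plus the scales and zero-points that quantized layers store, make the real checkpoint larger than the 4-bit figure.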
Model Description
The base model, Llama 3.2 1B, was fine-tuned using the QLoRA method on a synthetic dataset of 3000 examples. The training objective was to teach the model to ignore conversational pleasantries and strictly output a JSON object that can be directly parsed by a backend system for managing a shopping cart.
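A backend that consumes this output would probably want to validate it before acting on it. The following is a minimal sketch, assuming the three fields shown in this card (`action`, `product`, `quantity`); the name `parse_intent` and the exact validation rules are illustrative and not part of the model's release.

```python
import json

ALLOWED_ACTIONS = {"add", "remove"}


def parse_intent(raw: str) -> dict:
    """Parse the model's JSON output and check the three expected fields.

    Raises ValueError if the text is not valid JSON or a field looks wrong.
    """
    try:
        intent = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc

    if not isinstance(intent, dict):
        raise ValueError("expected a JSON object")

    action = intent.get("action")
    product = intent.get("product")
    quantity = intent.get("quantity")

    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unexpected action: {action!r}")
    if not isinstance(product, str) or not product:
        raise ValueError("product must be a non-empty string")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(quantity, bool) or not isinstance(quantity, int) or quantity < 1:
        raise ValueError("quantity must be a positive integer")

    return {"action": action, "product": product, "quantity": quantity}
```

Rejecting malformed output at this boundary keeps conversational or truncated generations from reaching the cart logic.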
Dataset
The model was fine-tuned on a custom synthetic dataset of 3000 examples, available on the Hugging Face Hub as jtlicardo/ecommerce-intent-3k.
Intended Use & Limitations
This model is designed for a specific task: parsing user requests in an e-commerce context. It should not be used as a general-purpose chatbot.
- Primary Use: Backend service for intent detection from user text.
- Out-of-Scope: General conversation, answering questions, or any task not related to adding/removing items from a list.
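To make the primary use concrete, here is a minimal sketch of how a backend might apply a detected intent to a shopping cart. The cart representation (a `Counter` keyed by product name) and the function `apply_intent` are assumptions for illustration; the model itself only produces the intent.

```python
from collections import Counter


def apply_intent(cart: Counter, intent: dict) -> Counter:
    """Return a new cart with the add/remove intent applied."""
    updated = Counter(cart)
    product = intent["product"]
    quantity = intent["quantity"]

    if intent["action"] == "add":
        updated[product] += quantity
    elif intent["action"] == "remove":
        # Never go below zero; drop the entry once nothing is left.
        remaining = updated[product] - quantity
        if remaining > 0:
            updated[product] = remaining
        else:
            updated.pop(product, None)
    else:
        raise ValueError(f"unsupported action: {intent['action']!r}")
    return updated


cart = Counter({"Headphones": 5, "Green Tea": 1})
cart = apply_intent(cart, {"action": "remove", "product": "Headphones", "quantity": 4})
print(dict(cart))  # {'Headphones': 1, 'Green Tea': 1}
```

Returning a new `Counter` rather than mutating the input is a design choice of this sketch; it makes the function easier to test.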
How to Use
The model expects a prompt formatted in a specific way, following the TinyLlama-Chat template. You must provide the Catalog and the User request.
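The prompt layout can be wrapped in a small helper. The sketch below reproduces the string format used in this card's inference example; the function name `build_prompt` and the choice to pass the catalog as a list are hypothetical.

```python
def build_prompt(catalog_items: list[str], user_query: str) -> str:
    """Assemble the <|user|> / <|assistant|> prompt from a product list."""
    catalog = "Catalog:\n" + "\n".join(catalog_items)
    return f"<|user|>\n{catalog}\n\nUser:\n{user_query}\n<|assistant|>\n"


prompt = build_prompt(
    ["Shampoo (400ml bottle)", "Headphones"],
    "Add two shampoos to my cart",
)
print(prompt)
```

Building the string in one place reduces the chance of drifting from the format the model saw during training.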
Important: You need to install optimum and auto-gptq to run this 4-bit GPTQ model.
pip install -q optimum auto-gptq transformers
Here's how to run inference in Python:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Model repository on the Hugging Face Hub
model_id = "jtlicardo/llama_3.2-1b-ecommerce-intent-gptq-4bit"

# Load the tokenizer and the 4-bit quantized model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
)

# --- Define the prompt ---
catalog = """Catalog:
Shampoo (400ml bottle)
Hand Soap (250ml dispenser)
Peanut Butter (340g jar)
Headphones
Green Tea (25 tea bags)"""
user_query = "Could you please take off 4 pairs of headphons from my cart?"

# --- Format the prompt using the model's chat template ---
# The model was trained to see this structure.
prompt = f"<|user|>\n{catalog}\n\nUser:\n{user_query}\n<|assistant|>\n"

# --- Generate the output ---
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
outputs = pipe(
    prompt,
    max_new_tokens=50,      # Max length of the JSON output
    do_sample=False,        # Use deterministic output
    temperature=None,       # Not needed for do_sample=False
    top_p=None,             # Not needed for do_sample=False
    return_full_text=False, # Only return the generated part
)

# The output will be a clean JSON string
generated_json = outputs[0]["generated_text"].strip()
print(generated_json)

# Expected output:
# {"action": "remove", "product": "Headphones", "quantity": 4}
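If generation ever runs past the closing brace (for example, when `max_new_tokens` is reached before an end-of-sequence token), the raw text will not parse directly. The sketch below, using only the standard library, pulls out the first JSON object and ignores whatever follows; the helper name `first_json_object` is hypothetical, and whether this safeguard is needed for this model is an assumption.

```python
import json


def first_json_object(text: str) -> dict:
    """Return the first JSON object found in text, ignoring anything after it."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")
    # raw_decode parses one JSON value and reports where it ended,
    # so trailing tokens after the closing brace are tolerated.
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


noisy = '{"action": "remove", "product": "Headphones", "quantity": 4}\n<|user|>'
print(first_json_object(noisy))
```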
Training Procedure
This model was fine-tuned using the trl library's SFTTrainer.
- Method: QLoRA (4-bit quantization with LoRA adapters)
- Dataset: A custom JSONL file with 3000 prompt/completion pairs.
- Configuration: completion_only_loss=True was used to ensure the model only learned to generate the assistant's JSON response.
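Conceptually, completion-only loss means the prompt tokens are excluded from the training loss. One common way to express this is to set their labels to -100, the default ignore index of PyTorch's cross-entropy loss. The sketch below shows that idea with toy token ids; it is an illustration of the concept, not a reproduction of trl's internal implementation, and the function name is hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def completion_only_labels(prompt_ids: list[int], completion_ids: list[int]) -> list[int]:
    """Build labels that hide the prompt tokens from the loss."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)


# Toy token ids, purely illustrative.
prompt_ids = [11, 12, 13, 14]
completion_ids = [21, 22, 23]

input_ids = prompt_ids + completion_ids
labels = completion_only_labels(prompt_ids, completion_ids)
print(labels)  # [-100, -100, -100, -100, 21, 22, 23]
```

With labels built this way, only the completion (the JSON response) contributes to the gradient, while the catalog and user request serve purely as context.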