# SmolVLA Demo: "Put [object] into a bowl."
## Model Summary
This model is a fine-tuned version of smolvla_base trained on the cueng/so101_demo_bowl dataset. It acts as a visuomotor policy for the SO101 robotic embodiment, specifically trained to perform the standardized household manipulation task: "Put [object] into a bowl." The model takes in multi-view RGB camera streams (top and wrist cameras) alongside the robot's current proprioceptive state, and outputs joint action commands at 30 Hz to successfully grasp and place objects.
## Model Details
- Base Model: smolvla_base
- Embodiment: SO101 Follower
- Action Frequency: 30 Hz
- Inputs:
  - `observation.images.top`: fixed global overhead view
  - `observation.images.wrist`: egocentric view from the end-effector
  - `observation.state`: current joint state of the robot
- Outputs: commanded actions for the SO101 robot (`List[float32]`)
- Language Instruction: "Put [object] into a bowl."
## Supported Objects
The model has been explicitly trained on 100 demonstrations for each of the following 10 objects:
- Plastic spoon
- Metal spoon
- Scissors
- Roll of tissue paper
- Blue marker
- Black marker
- Nipper
- Glue
- White tape
- Screwdriver
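Because the task template has a single object slot, the per-object instruction strings can be generated mechanically. A minimal sketch (the object list is from this card; the helper name and the exact phrasing, including article handling, are illustrative and may differ from the dataset's annotations):

```python
# The 10 objects this policy was trained on, per the model card.
OBJECTS = [
    "plastic spoon", "metal spoon", "scissors", "roll of tissue paper",
    "blue marker", "black marker", "nipper", "glue", "white tape", "screwdriver",
]

TEMPLATE = "Put {obj} into a bowl."

def make_instruction(obj: str) -> str:
    # Hypothetical helper: drop the object name into the template slot.
    return TEMPLATE.format(obj=obj)

instructions = [make_instruction(o) for o in OBJECTS]
print(instructions[4])  # "Put blue marker into a bowl."
```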
## Limitations & Bias
- Environmental Overfitting: The dataset was recorded against a fixed background setup. The model may struggle to generalize to novel backgrounds, lighting conditions, or different camera angles.
- Embodiment Specificity: The action space is strictly mapped to the SO101 Follower. It cannot be directly deployed on other robotic arms (e.g., UR5, Franka, Aloha) without further fine-tuning or cross-embodiment translation.
- Task Specificity: The model is optimized for a single semantic task ("Put into bowl") and may not generalize to other verbs or semantic actions (e.g., "push," "stack," or "hand over") without further instruction tuning.
## Training Data
The model was fine-tuned using the cueng/so101_demo_bowl dataset, which contains 1,000 successful demonstration episodes (221,896 frames total). The data was recorded and validated by Sorrawit Poomseetong.
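At the 30 Hz action frequency, the dataset figures above imply an average episode length of roughly 222 frames, or about 7.4 seconds per demonstration. A quick back-of-the-envelope check:

```python
# Dataset statistics taken from this card.
EPISODES = 1_000
FRAMES = 221_896
FPS = 30  # action frequency in Hz

frames_per_episode = FRAMES / EPISODES          # ~221.9 frames
seconds_per_episode = frames_per_episode / FPS  # ~7.4 s
total_hours = FRAMES / FPS / 3600               # ~2.1 h of demonstrations

print(f"{frames_per_episode:.1f} frames ≈ {seconds_per_episode:.1f} s per episode")
```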
## Usage with LeRobot
You can load and evaluate this policy using the LeRobot framework.
```python
import torch

from lerobot.common.policies.pretrained import PreTrainedPolicy

# Load the fine-tuned policy
policy = PreTrainedPolicy.from_pretrained("cueng/smolvla_demo_bowl")
policy.eval()

# Example observation with dummy data (replace with actual camera/state feeds)
state_dim = 6  # SO101 joint-state dimensionality; adjust for your setup
observation = {
    "observation.images.top": torch.rand(1, 3, 224, 224),    # adjust dims to match preprocessing
    "observation.images.wrist": torch.rand(1, 3, 224, 224),  # adjust dims to match preprocessing
    "observation.state": torch.rand(1, state_dim),
}

# Language instruction (if supported via text conditioning in your VLA wrapper)
task_instruction = "Put a blue marker into a bowl."

# Predict the next action
with torch.no_grad():
    action = policy.select_action(observation, instruction=task_instruction)

print("Predicted action:", action)
```
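For real deployment, the single-step prediction above typically runs inside a fixed-rate loop matching the 30 Hz action frequency. A minimal sketch, assuming hypothetical `get_observation()` and `send_action()` robot-driver helpers (these are not part of LeRobot's public API):

```python
import time

CONTROL_HZ = 30
PERIOD = 1.0 / CONTROL_HZ  # ~33.3 ms per control step

def run_episode(policy, get_observation, send_action, max_steps=300):
    """Run the policy closed-loop at a fixed rate.

    get_observation / send_action are placeholders for your robot driver.
    """
    for _ in range(max_steps):
        t0 = time.perf_counter()
        obs = get_observation()             # camera frames + joint state
        action = policy.select_action(obs)  # one policy step
        send_action(action)                 # command the SO101 joints
        # Sleep off the remainder of the control-period budget.
        elapsed = time.perf_counter() - t0
        if elapsed < PERIOD:
            time.sleep(PERIOD - elapsed)
```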
## Model Tree for cueng/smolvla_demo_bowl

- Base model: lerobot/smolvla_base