# SmolVLA Demo: "Put [object] into a bowl."
## Model Summary
This model is a fine-tuned version of smolvla_base trained on the cueng/so101_demo_bowl dataset. It acts as a visuomotor policy for the SO101 robotic embodiment, specifically trained to perform the standardized household manipulation task: "Put [object] into a bowl." The model takes in multi-view RGB camera streams (top and wrist cameras) alongside the robot's current proprioceptive state, and outputs joint action commands at 30 Hz to successfully grasp and place objects.
## Model Details
- Base Model: smolvla_base
- Embodiment: SO101 Follower
- Action Frequency: 30 Hz
- Inputs:
  - `observation.images.top`: fixed global overhead view
  - `observation.images.wrist`: egocentric view from the end-effector
  - `observation.state`: current joint state of the robot
- Outputs: commanded actions for the SO101 robot (`List[float32]`)
- Language Instruction: "Put [object] into a bowl."
## Supported Objects
The model has been explicitly trained on 100 demonstrations for each of the following 10 objects:
- Plastic spoon
- Metal spoon
- Scissors
- Roll of tissue paper
- Blue marker
- Black marker
- Nipper
- Glue
- White tape
- Screwdriver
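Because the task template has a single object slot, the per-object instruction strings can be generated mechanically. A minimal sketch (the object list is from this card; the helper name and the exact phrasing, including article handling, are illustrative and may differ from the dataset's annotations):

```python
# The 10 objects this policy was trained on, per the model card.
OBJECTS = [
    "plastic spoon", "metal spoon", "scissors", "roll of tissue paper",
    "blue marker", "black marker", "nipper", "glue", "white tape", "screwdriver",
]

TEMPLATE = "Put {obj} into a bowl."

def make_instruction(obj: str) -> str:
    # Hypothetical helper: drop the object name into the template slot.
    return TEMPLATE.format(obj=obj)

instructions = [make_instruction(o) for o in OBJECTS]
print(instructions[4])  # "Put blue marker into a bowl."
```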
## Limitations & Bias
- Environmental Overfitting: The dataset was recorded against a fixed background setup. The model may struggle to generalize to novel backgrounds, lighting conditions, or different camera angles.
- Embodiment Specificity: The action space is strictly mapped to the SO101 Follower. It cannot be directly deployed on other robotic arms (e.g., UR5, Franka, Aloha) without further fine-tuning or cross-embodiment translation.
- Task Specificity: The model is optimized for a single semantic task ("Put into bowl") and may not generalize to other verbs or semantic actions (e.g., "push," "stack," or "hand over") without further instruction tuning.
## Training Data
The model was fine-tuned using the cueng/so101_demo_bowl dataset, which contains 1,000 successful demonstration episodes (221,896 frames total). The data was recorded and validated by Sorrawit Poomseetong.
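At the 30 Hz action frequency, the dataset figures above imply an average episode length of roughly 222 frames, or about 7.4 seconds per demonstration. A quick back-of-the-envelope check:

```python
# Dataset statistics taken from this card.
EPISODES = 1_000
FRAMES = 221_896
FPS = 30  # action frequency in Hz

frames_per_episode = FRAMES / EPISODES          # ~221.9 frames
seconds_per_episode = frames_per_episode / FPS  # ~7.4 s
total_hours = FRAMES / FPS / 3600               # ~2.1 h of demonstrations

print(f"{frames_per_episode:.1f} frames ≈ {seconds_per_episode:.1f} s per episode")
```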
## Usage with LeRobot
You can load and evaluate this policy using the LeRobot framework.
```python
import torch

from lerobot.common.policies.pretrained import PreTrainedPolicy

# Load the fine-tuned policy
policy = PreTrainedPolicy.from_pretrained("cueng/smolvla_demo_bowl")
policy.eval()

# Example observation with dummy data (replace with actual camera/state feeds)
state_dim = 6  # SO101 joint-state dimensionality; adjust for your setup
observation = {
    "observation.images.top": torch.rand(1, 3, 224, 224),    # adjust dims to match preprocessing
    "observation.images.wrist": torch.rand(1, 3, 224, 224),  # adjust dims to match preprocessing
    "observation.state": torch.rand(1, state_dim),
}

# Language instruction (if supported via text conditioning in your VLA wrapper)
task_instruction = "Put a blue marker into a bowl."

# Predict the next action
with torch.no_grad():
    action = policy.select_action(observation, instruction=task_instruction)

print("Predicted action:", action)
```
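For real deployment, the single-step prediction above typically runs inside a fixed-rate loop matching the 30 Hz action frequency. A minimal sketch, assuming hypothetical `get_observation()` and `send_action()` robot-driver helpers (these are not part of LeRobot's public API):

```python
import time

CONTROL_HZ = 30
PERIOD = 1.0 / CONTROL_HZ  # ~33.3 ms per control step

def run_episode(policy, get_observation, send_action, max_steps=300):
    """Run the policy closed-loop at a fixed rate.

    get_observation / send_action are placeholders for your robot driver.
    """
    for _ in range(max_steps):
        t0 = time.perf_counter()
        obs = get_observation()             # camera frames + joint state
        action = policy.select_action(obs)  # one policy step
        send_action(action)                 # command the SO101 joints
        # Sleep off the remainder of the control-period budget.
        elapsed = time.perf_counter() - t0
        if elapsed < PERIOD:
            time.sleep(PERIOD - elapsed)
```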
## Model Tree for cueng/smolvla_demo_bowl

- Base model: lerobot/smolvla_base