Product Introduction
- SmolVLA is a 450M-parameter open-source Vision-Language-Action (VLA) model designed for robotics applications, combining visual perception, language understanding, and action prediction in a unified architecture. It takes RGB images, sensorimotor states, and natural language instructions as input and generates continuous robot control commands (a minimal input/output sketch follows this list).
- The core value lies in democratizing robotics AI by offering an efficient, affordable solution that runs on consumer hardware (e.g., single GPUs or MacBooks) while outperforming larger proprietary models in real-world and simulated tasks.
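To make the input/output contract above concrete, here is a minimal, hedged sketch in PyTorch. The `StubPolicy` class, tensor shapes, and the chunk size of 50 are illustrative assumptions, not the actual SmolVLA or LeRobot API.

```python
# Minimal sketch of the VLA input/output contract described above.
# StubPolicy stands in for SmolVLA; names, shapes, and the chunk size of 50
# are illustrative assumptions, not the exact LeRobot API.
import torch
import torch.nn as nn

class StubPolicy(nn.Module):
    """Maps an RGB frame + proprioceptive state + instruction to an action chunk."""

    def __init__(self, state_dim=6, action_dim=6, chunk_size=50):
        super().__init__()
        self.vision = nn.Sequential(nn.Flatten(), nn.LazyLinear(256))
        self.state_proj = nn.Linear(state_dim, 256)
        self.head = nn.Linear(512, chunk_size * action_dim)
        self.chunk_size, self.action_dim = chunk_size, action_dim

    def forward(self, image, state, instruction):
        # A real VLA conditions on the instruction through its language backbone;
        # here the string is only carried along to show the interface.
        feats = torch.cat([self.vision(image), self.state_proj(state)], dim=-1)
        return self.head(feats).view(-1, self.chunk_size, self.action_dim)

policy = StubPolicy()
actions = policy(torch.rand(1, 3, 224, 224), torch.rand(1, 6), "pick up the red block")
print(actions.shape)  # torch.Size([1, 50, 6]) -> a chunk of continuous control commands
```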
Main Features
- Compact Architecture: Uses layer skipping (running only the lower half of the VLM's layers), visual token reduction (64 tokens per frame via PixelShuffle), and interleaved cross- and self-attention blocks to reduce computational overhead while maintaining accuracy (a token-reduction sketch follows this list).
- Asynchronous Inference: Delivers roughly 30% faster task completion and about 2× task throughput by decoupling action execution from policy computation, using non-blocking policy requests and chunk fusion for smooth transitions between consecutive action chunks.
- Community-Driven Training: Trained exclusively on 30k episodes from LeRobot Community Datasets (10M frames), standardized via automated task annotation refinement and camera view normalization for diverse real-world generalization.
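As a rough illustration of the visual token reduction mentioned above, the sketch below folds each r×r neighbourhood of patch tokens into a single token with a larger feature dimension. The grid size and hidden dimension are placeholder values, not SmolVLA's exact configuration.

```python
# Pixel-shuffle-style token reduction: fold each r x r block of patch tokens
# into one token with an r^2-times-larger feature vector, shrinking the token
# count (e.g. a 16x16 patch grid -> 64 tokens at r=2). Dimensions are illustrative.
import torch

def pixel_shuffle_tokens(patches: torch.Tensor, r: int = 2) -> torch.Tensor:
    """patches: (batch, grid_h, grid_w, dim) -> (batch, grid_h*grid_w / r^2, dim*r^2)."""
    b, h, w, d = patches.shape
    x = patches.view(b, h // r, r, w // r, r, d)       # split the grid into r x r blocks
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()       # gather each block's tokens together
    return x.view(b, (h // r) * (w // r), d * r * r)   # one token per block

tokens = pixel_shuffle_tokens(torch.rand(1, 16, 16, 768))  # 256 patch tokens in
print(tokens.shape)  # torch.Size([1, 64, 3072]) -> 64 visual tokens per frame
```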
Problems Solved
- High Cost of Robotics AI: Addresses the inaccessibility of proprietary VLAs with a lightweight model that removes the need for datacenter-grade hardware; it runs on a single consumer GPU such as an NVIDIA RTX 3090 or on Apple Silicon such as an M2 Ultra.
- Fragmented Robotics Data: Resolves data scarcity and inconsistency by aggregating and standardizing community-shared datasets under the LeRobot tag, enabling robust pretraining without proprietary data.
- Real-Time Control Limitations: Mitigates latency issues in dynamic environments through asynchronous inference, allowing continuous action execution while processing new observations.
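To illustrate the asynchronous execution pattern just described, here is a minimal threading sketch, assuming a toy policy with fixed latency and interpreting the early trigger as requesting a new chunk once roughly 70% of the current chunk has been consumed; the real client/server implementation also fuses overlapping chunks rather than simply appending them.

```python
# Toy sketch of asynchronous inference: the control loop keeps executing
# queued actions while a background thread fetches the next chunk. Chunk size,
# trigger-threshold interpretation, and the fake policy latency are assumptions.
import threading
import time
from collections import deque

CHUNK_SIZE = 10
TRIGGER_LEFT = CHUNK_SIZE - int(0.7 * CHUNK_SIZE)   # ask for more once ~70% is consumed

def predict_chunk(step):
    """Stands in for a policy-server call with some latency."""
    time.sleep(0.05)
    return [float(step + i) for i in range(CHUNK_SIZE)]

queue = deque(predict_chunk(0))
next_chunk, pending = [], None

def request(step):
    global next_chunk
    next_chunk = predict_chunk(step)

for step in range(1, 40):
    action = queue.popleft() if queue else None      # execute one action per tick, never block
    if len(queue) <= TRIGGER_LEFT and pending is None:
        pending = threading.Thread(target=request, args=(step,))
        pending.start()                               # non-blocking policy request
    if pending is not None and not pending.is_alive():
        queue.extend(next_chunk)                      # naive stand-in for chunk fusion
        pending = None
    time.sleep(0.01)                                  # control-loop period

print("finished with", len(queue), "actions still queued")
```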
Unique Advantages
- Open-Source Efficiency: Outperforms established baselines such as ACT and Diffusion Policy on benchmarks (78.3% success on SO100 tasks) despite using roughly 10× fewer training episodes, with results validated across LIBERO, Meta-World, and real-world SO101 evaluations.
- Flow Matching Action Expert: Employs a ~100M-parameter transformer trained with flow matching, generating entire action chunks in a few integration steps rather than token by token, which keeps control latency low and avoids autoregressive decoding bottlenecks (a toy sketch follows this list).
- Hardware Agnosticism: Compatible with low-cost robots (e.g., SO-100 arm, LeKiwi) and edge devices, with quantized weights for CPU deployment and native PyTorch integration for easy prototyping.
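As a rough illustration of how a flow-matching action expert produces a whole action chunk in one shot, the sketch below integrates a learned velocity field from Gaussian noise over a few Euler steps. The stub velocity network, the 10 integration steps, and the dimensions are illustrative assumptions, not SmolVLA's trained expert.

```python
# Toy flow-matching sampler: integrate a learned velocity field from noise to
# an action chunk in a fixed number of Euler steps (no token-by-token decoding).
import torch
import torch.nn as nn

ACTION_DIM, CHUNK_SIZE, STEPS = 6, 50, 10

class VelocityField(nn.Module):
    """Predicts d(actions)/dt from the current noisy chunk and the time t."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(CHUNK_SIZE * ACTION_DIM + 1, 256),
            nn.GELU(),
            nn.Linear(256, CHUNK_SIZE * ACTION_DIM),
        )

    def forward(self, x, t):
        inp = torch.cat([x.flatten(1), t.expand(x.shape[0], 1)], dim=-1)
        return self.net(inp).view_as(x)

@torch.no_grad()
def sample_actions(field, batch=1):
    x = torch.randn(batch, CHUNK_SIZE, ACTION_DIM)   # start from pure noise
    for i in range(STEPS):                           # fixed-step Euler integration
        t = torch.tensor([i / STEPS])
        x = x + field(x, t) / STEPS
    return x                                         # a full chunk of continuous actions

actions = sample_actions(VelocityField())
print(actions.shape)  # torch.Size([1, 50, 6])
```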
Frequently Asked Questions (FAQ)
How does asynchronous inference handle rapidly changing environments?
The policy server processes observations in parallel with action execution, using chunk fusion to merge overlapping action sequences and early triggering (70% queue threshold) to minimize latency between perception and response.
What hardware is required to run SmolVLA?
The base model operates on consumer GPUs (8 GB VRAM), CPUs, or Apple Silicon MacBooks, with pretrained weights available in FP16 and INT8 formats for deployment on embedded systems like the Raspberry Pi 5.
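As a rough sketch of how such a deployment choice might look in plain PyTorch, the helper below picks FP16 on GPU or Apple Silicon and dynamic INT8 quantization on CPU; the stand-in module and the helper function are illustrative, not part of the SmolVLA or LeRobot API.

```python
# Hypothetical deployment helper: FP16 on CUDA / Apple Silicon, dynamic INT8 on CPU.
# The stand-in nn.Sequential only demonstrates the precision choices; swap in the
# real SmolVLA policy loaded from its pretrained weights in practice.
import torch
import torch.nn as nn

def prepare_for_inference(policy):
    if torch.cuda.is_available():
        device = torch.device("cuda")
        policy = policy.half().to(device)            # FP16 on an 8 GB-class GPU
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
        policy = policy.half().to(device)            # FP16 on Apple Silicon
    else:
        device = torch.device("cpu")
        policy = torch.ao.quantization.quantize_dynamic(   # INT8 linear layers on CPU
            policy, {nn.Linear}, dtype=torch.qint8
        )
    return policy.eval(), device

stub_policy = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 7))
policy, device = prepare_for_inference(stub_policy)
print("running on", device)
```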
Can SmolVLA be trained on custom datasets?
Yes, the training pipeline supports finetuning with the LeRobot framework on user-collected data, requiring only PyTorch and standard RGB-action pairs formatted via Hugging Face Datasets.
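Below is a minimal, hedged sketch of packaging RGB-action pairs with Hugging Face Datasets. The column names (`image`, `state`, `action`, `task`) and shapes are illustrative placeholders; check the schema your LeRobot version expects before finetuning.

```python
# Toy episode packaged as a Hugging Face Dataset; column names and shapes are
# illustrative placeholders, not the exact schema LeRobot finetuning expects.
import numpy as np
from datasets import Dataset

def make_episode(num_steps=50, state_dim=6, action_dim=6):
    rng = np.random.default_rng(0)
    records = {
        "image": [rng.integers(0, 256, (224, 224, 3), dtype=np.uint8) for _ in range(num_steps)],
        "state": [rng.standard_normal(state_dim).astype(np.float32) for _ in range(num_steps)],
        "action": [rng.standard_normal(action_dim).astype(np.float32) for _ in range(num_steps)],
        "task": ["pick up the cube and place it in the bin"] * num_steps,
    }
    return Dataset.from_dict(records)

episode = make_episode()
print(episode)                                   # 50 rows: image / state / action / task
episode.save_to_disk("./my_smolvla_episode")     # or push_to_hub(...) to share it
```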
