SmolVLA: Hugging Face’s New Robotics AI

SmolVLA was announced by Hugging Face in June 2025 as an open-source robotic Vision-Language-Action (VLA) model with roughly 450 million parameters. The model is small enough to run on consumer-grade hardware such as a MacBook Pro, yet performs comparably to, or better than, much larger models. This aims to significantly reduce the cost of entry and hardware requirements in the robotics field.

The model architecture pairs a compact pretrained vision-language backbone with a Transformer-based action expert trained via flow matching. It includes four main optimizations: skipping the upper layers of the vision-language model, alternating self- and cross-attention modules in the action expert, reducing the number of visual tokens, and building on the lightweight SmolVLM-2 vision-language model. Together these increase both speed and efficiency.
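Two of these optimizations can be illustrated with a toy sketch (this is an illustration under stated assumptions, not the actual SmolVLA code): running only the lower fraction of the encoder's layers, and alternating cross-attention to visual features with self-attention inside the action expert.

```python
# Toy sketch of layer skipping and interleaved self-/cross-attention.
# All names and dimensions here are illustrative, not SmolVLA's real ones.
import numpy as np

def attention(q, kv):
    """Single-head scaled dot-product attention (toy, no learned projections)."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ kv

def encode(tokens, n_layers=8, keep_ratio=0.5):
    """Layer skipping: run only the lower fraction of the encoder's layers."""
    used = int(n_layers * keep_ratio)      # e.g. 4 of 8 layers
    h = tokens
    for _ in range(used):
        h = h + attention(h, h)            # self-attention residual block
    return h, used

def action_expert(actions, visual, n_layers=4):
    """Alternate cross-attention (to visual features) with self-attention."""
    h = actions
    for i in range(n_layers):
        kv = visual if i % 2 == 0 else h   # even layers: cross, odd: self
        h = h + attention(h, kv)
    return h

rng = np.random.default_rng(0)
vis_tokens = rng.normal(size=(16, 32))     # 16 visual tokens, dim 32
act_tokens = rng.normal(size=(8, 32))      # 8 action tokens
features, layers_used = encode(vis_tokens)
out = action_expert(act_tokens, features)
print(layers_used, out.shape)              # 4 (8, 32)
```

The point of the sketch is the control flow: half the encoder depth is simply never executed, and the action expert only looks at the (reduced) visual token set on every other layer, which cuts compute per step.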

SmolVLA outperforms competing models such as Octo and OpenVLA in both simulation and real-world environments on general-purpose robotic tasks (e.g. object pick-and-place, stacking, and sorting). In addition, its asynchronous inference architecture decouples action execution from action prediction, allowing the robot to respond quickly to environmental changes.
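The idea behind asynchronous inference can be sketched as follows (a minimal stand-in, not the LeRobot API: `predict_chunk` and the timings are invented for illustration): the control loop keeps executing actions from the current chunk while the next chunk is predicted on a background thread, so model latency does not stall the robot.

```python
# Hedged sketch of asynchronous inference: execute the current action chunk
# while prefetching the next one in the background. Names are illustrative.
import threading
import queue
import time

def predict_chunk(obs):
    """Stand-in for the policy: returns a chunk of future actions."""
    time.sleep(0.05)                  # simulated model latency
    return [obs + i for i in range(4)]

action_queue = queue.Queue()

def prefetch(obs):
    """Run prediction off the control loop and enqueue the resulting chunk."""
    for a in predict_chunk(obs):
        action_queue.put(a)

executed = []
threading.Thread(target=prefetch, args=(0,)).start()  # warm-up chunk
for step in range(8):                 # fixed-rate control loop
    a = action_queue.get()            # blocks only if the queue ran dry
    executed.append(a)
    if action_queue.qsize() <= 1:     # chunk nearly exhausted: request the next
        threading.Thread(target=prefetch, args=(a,)).start()
    time.sleep(0.01)
print(executed)
```

The design choice is the early trigger: the next prediction is requested before the current chunk is fully consumed, so in the common case the queue never empties and execution proceeds at the loop rate rather than the model's latency.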

Hugging Face aims to democratize access to VLA models and accelerate general-purpose robotic agent research by open-sourcing the model, codebase, training datasets, and robotic hardware guides.

SmolVLA was trained on community-shared datasets and is seen as a significant step forward for low-cost robotics development. Real-world use cases include running the model on a MacBook and deploying it on affordable robotic platforms such as the Koch arm.

With its launch in June 2025, SmolVLA stands as an accessible, open-source, high-performance VLA model for robotics, and is considered a significant milestone in robotics research and development.