Embodied AI
Vision foundation model
By deeply integrating large-scale Vision-Language Models (VLMs), the Embodied Intelligence System Based on Vision-Language Models achieves semantic perception, reasoning, and embodied decision-making. It can comprehend high-level semantic instructions, such as "organize the kitchen" or "take care of the elderly," and autonomously decompose them into concrete subtasks while planning the complete execution route.
Through the fusion of multimodal perception and intelligent modeling, it attains human-like task understanding and environmental awareness, enabling stable and efficient task execution in scenarios such as household services, logistics, and advanced manufacturing.
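The instruction-decomposition step described above can be illustrated with a minimal sketch. The snippet below is not the system's actual implementation: `query_vlm` is a hypothetical stand-in for whatever VLM backend is used, and the prompt wording, JSON output schema, and `run_skill` hook are illustrative assumptions only.

```python
# Sketch: turning a high-level instruction into an ordered subtask plan with a VLM.
# All names here (query_vlm, run_skill) are hypothetical placeholders.
import json
from typing import List, Optional


def query_vlm(prompt: str, scene_image: Optional[bytes] = None) -> str:
    """Hypothetical VLM call; returns a canned plan here so the sketch runs end to end."""
    return json.dumps([
        "clear dishes from the countertop",
        "load the dishwasher",
        "wipe the countertop",
        "return utensils to the drawer",
    ])


def decompose_instruction(instruction: str, scene_image: Optional[bytes] = None) -> List[str]:
    """Ask the VLM to break a high-level instruction into ordered, executable subtasks."""
    prompt = (
        "You are a household robot planner. Given the current scene and the instruction, "
        "return a JSON list of short, executable subtasks in order.\n"
        f"Instruction: {instruction}"
    )
    return json.loads(query_vlm(prompt, scene_image))


def run_skill(subtask: str) -> None:
    """Placeholder for dispatching a subtask to the robot's low-level skill controller."""
    print(f"executing: {subtask}")


if __name__ == "__main__":
    plan = decompose_instruction("organize the kitchen")
    for step, subtask in enumerate(plan, start=1):
        print(f"[{step}/{len(plan)}]", end=" ")
        run_skill(subtask)
```

In a real system, the canned response would be replaced by an actual VLM inference call conditioned on camera input, and each subtask would be grounded in the robot's perception before a low-level skill is executed.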

