What is a vision language model?
Time: 9:30 AM to 9:55 AM
From text to vision
On Day 3, your robot could hear and speak. Today, it learns to see. A vision language model (VLM) processes images alongside text so you can ask questions about what the robot sees.
How vision encoding works
Vision encoder
The image is processed by a vision encoder that converts it into a sequence of visual tokens.
The model does not see like a human. It receives a compressed representation of the image and predicts tokens based on patterns learned during training.
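To make "a sequence of visual tokens" more concrete, here is a minimal sketch of one common approach: cutting the image into small square patches and flattening each patch into a list of numbers. This assumes a patch-based encoder, which this lesson does not specify, and it leaves out the learned layers a real encoder applies to each patch afterwards. The function name `patchify` and the tiny 4x4 image are made up for illustration.

```python
def patchify(image, patch_size):
    """Split a 2D grayscale image (a list of rows) into flattened square patches."""
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")

    patches = []
    # Walk the image patch by patch: left to right, then top to bottom.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A toy 4x4 "image" whose pixel values are simply 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]

tokens = patchify(image, patch_size=2)
print(len(tokens))  # 4 patches
print(tokens[0])    # [0, 1, 4, 5] (the top-left 2x2 patch)
```

Each inner list stands in for one visual token. The model never receives the picture as a whole, only this ordered sequence, which is why it can be treated much like a sequence of text tokens.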
What this enables
With a VLM, you can ask your robot:
- “Describe what you see”
- “Is there an obstacle in front of you?”
- “What color is the object on your left?”
- “Should I move forward or stop?”
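Each of the questions above is sent to the model together with a camera frame. The sketch below shows one way you might bundle the two before sending them. The function `build_vlm_request` and the field names `image_base64` and `prompt` are hypothetical, not the format of any particular VLM service; check the documentation of the model you actually use.

```python
import base64
import json


def build_vlm_request(image_bytes, question):
    """Bundle one image and one question into a JSON-friendly dict (hypothetical schema)."""
    if not question.strip():
        raise ValueError("question must not be empty")
    return {
        # Raw image bytes are base64-encoded so they can travel inside JSON.
        "image_base64": base64.b64encode(image_bytes).decode("ascii"),
        "prompt": question,
    }


# Placeholder bytes standing in for a real camera frame.
fake_frame = b"placeholder image bytes"

request = build_vlm_request(fake_frame, "Is there an obstacle in front of you?")
payload = json.dumps(request)
print(payload)
```

The image and the question travel in the same request, so the model can answer the question about that specific frame.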

