Vision-triggered tool calls: the robot acts on what it sees
Time: 11:05 AM to 12:15 PM
Combine vision with tool calls
Add image capture to the AI Voice Assistant Car loop so the robot can optionally look at its surroundings before responding.Update the system prompt
Give the robot context about its camera:Test with spoken commands
Say “What do you see?” and the robot will:- Capture a frame from the camera
- Send it to the VLM
- Speak the description
- Take physical action if appropriate
Add a look_around() tool
Extend the robot’s capabilities by adding alook_around() function that the LLM can call whenever it needs visual information before making a decision.
The big picture
This is the foundation of embodied AI: a model that perceives its environment and acts within it. Your robot now has all five layers:
- Movement - it can drive and steer
- Sensing - it reads distance, brightness, and audio
- Speech - it can speak and listen
- Language intelligence - it reasons with an LLM
- Vision - it sees and understands through a VLM

