Skip to main content

Vision-triggered tool calls: the robot acts on what it sees

Time: 11:05 AM to 12:15 PM
This is the culmination of the entire week’s work. You are combining Day 3’s tool calls with Day 4’s vision so the robot can see, think, and act in a single loop.

Combine vision with tool calls

Add image capture to the AI Voice Assistant Car loop so the robot can optionally look at its surroundings before responding.

Update the system prompt

Give the robot context about its camera:
You are a robot car with a camera. When asked about your surroundings,
capture an image and describe it. If you see an obstacle, say so and
reverse. If you see a specific color, tell the user.

Test with spoken commands

Say “What do you see?” and the robot will:
  1. Capture a frame from the camera
  2. Send it to the VLM
  3. Speak the description
  4. Take physical action if appropriate

Add a look_around() tool

Extend the robot’s capabilities by adding a look_around() function that the LLM can call whenever it needs visual information before making a decision.

The big picture

This is the foundation of embodied AI: a model that perceives its environment and acts within it. Your robot now has all five layers:
  1. Movement - it can drive and steer
  2. Sensing - it reads distance, brightness, and audio
  3. Speech - it can speak and listen
  4. Language intelligence - it reasons with an LLM
  5. Vision - it sees and understands through a VLM
What would you need to add to make this fully autonomous?

Day 4 wrap-up

Recap: The robot now sees, hears, thinks, speaks, and moves. All five layers are connected. Tomorrow is different. No more building. You will zoom out and discuss what all of this means for society, security, and where to go from here. Preview for Day 5: AI ethics, risks, data privacy, open-source models, running AI locally, and resources to keep learning.