Vision-triggered tool calls

Vision-triggered tool calls: the robot acts on what it sees

Time: 11:05 AM to 12:15 PM

This is the culmination of the entire week’s work. You are combining Day 3’s tool calls with Day 4’s vision so the robot can see, think, and act in a single loop.

The full stack

The new piece: describe_surroundings is a tool the LLM can call. When it needs visual information, it triggers a camera capture and VLM analysis, then uses the result to decide its next action.

Building the vision robot

This program extends Day 3’s llm_robot.py with camera support.

Step 1 — Add camera initialization

Set up the Pi camera alongside the robot hardware:

from picamera2 import Picamera2
import io, base64

cam = Picamera2()
cam.configure(cam.create_still_configuration(main={"size": (640, 480)}))
cam.start()

def capture_frame():
    """Capture one JPEG frame and return it as a base64 string."""
    stream = io.BytesIO()
    cam.capture_file(stream, format="jpeg")  # write the JPEG into memory, not to disk
    return base64.b64encode(stream.getvalue()).decode("utf-8")
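
Before wiring the camera into a tool, it can help to confirm that capture_frame returns what the vision API expects: a base64 string wrapping JPEG bytes. The sketch below is a hypothetical helper (looks_like_jpeg is not part of the course code) that decodes the string and checks for the JPEG start marker. It needs no camera, so you can try it with any base64 string.

```python
import base64
import binascii

JPEG_MAGIC = b"\xff\xd8\xff"  # JPEG data begins with these three bytes

def looks_like_jpeg(image_b64: str) -> bool:
    """Return True if image_b64 decodes to bytes that start like a JPEG."""
    try:
        raw = base64.b64decode(image_b64, validate=True)
    except (binascii.Error, ValueError):
        return False  # not valid base64 at all
    return raw.startswith(JPEG_MAGIC)

# Stand-in for a real frame: a JPEG header followed by filler bytes.
sample = base64.b64encode(JPEG_MAGIC + b"\xe0" + b"\x00" * 16).decode("utf-8")
print(looks_like_jpeg(sample))          # True
print(looks_like_jpeg("not base64!!"))  # False
```

On the robot, looks_like_jpeg(capture_frame()) should return True. If it does not, check the camera setup before looking for problems in the API call.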

Step 2 — Add the describe_surroundings tool

This is the new tool. When GPT calls it, your code captures a photo and sends it to the vision API:

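def action_describe_surroundings(question="Describe what you see.", **_):
    """Capture a photo, ask the vision model about it, and return the answer."""
    image_b64 = capture_frame()
    # OpenAI and OPENAI_KEY come from the imports at the top of the program.
    vision_client = OpenAI(api_key=OPENAI_KEY)

    response = vision_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                # The photo travels inline as a base64 data URL.
                {"type": "image_url", "image_url": {
                    "url": f"data:image/jpeg;base64,{image_b64}",
                    "detail": "low",  # low detail keeps the request small and fast
                }},
            ],
        }],
        max_tokens=200,
    )
    # content can be None, so fall back to an empty string before stripping.
    description = (response.choices[0].message.content or "").strip()
    return f"Camera sees: {description}"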
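
A camera fault or network error inside this tool would otherwise raise an exception and stop the program. One possible safeguard, sketched below, is to turn any failure into a text result so GPT can tell the user what went wrong. The report_errors decorator and the fake_describe stand-in are hypothetical names for illustration, not part of the course code.

```python
import functools

def report_errors(action):
    """Wrap a tool function so exceptions become a text result for the model."""
    @functools.wraps(action)
    def wrapper(*args, **kwargs):
        try:
            return action(*args, **kwargs)
        except Exception as exc:  # broad on purpose: any failure becomes a message
            return f"Tool {action.__name__} failed: {exc}"
    return wrapper

@report_errors
def fake_describe(question="Describe what you see.", **_):
    # Stand-in for action_describe_surroundings with a simulated camera fault.
    raise RuntimeError("camera not responding")

print(fake_describe())
# Tool fake_describe failed: camera not responding
```

The same decorator could be applied to action_describe_surroundings itself, so that a failed capture comes back as a sentence the robot can speak.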

The tool result goes back to GPT, which then decides what to do next: maybe move, maybe speak, maybe call another tool.

Step 3 — Register the new tool

Register the new tool in the TOOLS list so GPT knows it exists:

{
    "type": "function",
    "function": {
        "name": "describe_surroundings",
        "description": "Capture a photo and describe what is visible. Use when the user asks about surroundings.",
        "parameters": {
            "type": "object",
            "properties": {
                "question": {
                    "type": "string",
                    "description": "What to look for. Default: 'Describe what you see.'"
                }
            }
        }
    }
}

And add it to the ACTIONS dictionary:

ACTIONS = {
    "move_forward": action_move_forward,
    # ... all Day 3 tools ...
    "describe_surroundings": action_describe_surroundings,  # NEW
}
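
To see why both registrations matter, here is a minimal sketch of how a tool call might be routed through ACTIONS. The model sends a tool name plus its arguments as a JSON string. The code looks up the function and calls it with those arguments. The dispatch helper, the stub action bodies, and the seconds parameter are assumptions for illustration. Your Day 3 process_command may do this differently.

```python
import json

def action_move_forward(seconds=1, **_):
    # Stub: the real function would drive the motors.
    return f"Moved forward for {seconds} s"

def action_describe_surroundings(question="Describe what you see.", **_):
    # Stub: the real function captures a photo and calls the vision API.
    return f"Camera sees: (stub answer to '{question}')"

ACTIONS = {
    "move_forward": action_move_forward,
    "describe_surroundings": action_describe_surroundings,
}

def dispatch(name: str, arguments_json: str) -> str:
    """Look up a tool by name and call it with JSON-decoded keyword arguments."""
    action = ACTIONS.get(name)
    if action is None:
        return f"Unknown tool: {name}"
    kwargs = json.loads(arguments_json or "{}")
    return action(**kwargs)

print(dispatch("describe_surroundings", '{"question": "Is the path clear?"}'))
print(dispatch("move_forward", "{}"))
print(dispatch("fly", "{}"))
```

The `**_` in each action signature quietly absorbs any extra arguments the model sends, so an unexpected key does not crash the call.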

Step 4 — Update the system prompt

Tell GPT it has a camera:

SYSTEM_PROMPT = """You are an AI assistant controlling a physical robot car with a camera.
You can see through your camera, move, turn, look around, and speak.
When the user asks about your surroundings, use describe_surroundings first.
When asked to do something physical, use the movement tools.
You can call multiple tools in sequence.
Be enthusiastic, specific about what you see, and keep spoken responses under 2 sentences."""

The prompt now explicitly tells GPT to use describe_surroundings when asked about the environment.

How it flows

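When you say “what do you see?”, here is what happens: your question goes to GPT along with the TOOLS list, GPT replies with a call to describe_surroundings, your code captures a photo and sends it to the vision API, and the description returns to GPT as the tool result. GPT then decides what to do next, such as speaking an answer or moving.

The key insight: GPT decides when to use the camera. It might respond to “tell me a joke” without looking, but respond to “what’s in front of you?” by calling describe_surroundings first.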
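
The see, think, act loop can be sketched without any hardware or API key. In the sketch below, scripted_model is a hypothetical stand-in for the chat API that always looks first, then speaks, then stops. The simplified message dictionaries are also an assumption and do not match the real OpenAI message format field for field. What the sketch shows is the loop shape: ask the model, run the tool it requests, append the result, and ask again.

```python
import json

def describe_surroundings(question="Describe what you see.", **_):
    return "Camera sees: a red cup on a table"   # stub for the real camera tool

def speak(text="", **_):
    return f"Said: {text}"                        # stub for text-to-speech

ACTIONS = {"describe_surroundings": describe_surroundings, "speak": speak}

def scripted_model(messages):
    """Hypothetical stand-in for the chat API: look first, then speak, then stop."""
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        return {"tool": "describe_surroundings", "args": "{}"}
    if len(tool_results) == 1:
        seen = tool_results[0]["content"]
        return {"tool": "speak", "args": json.dumps({"text": seen})}
    return {"tool": None, "content": "Done."}

def run_turn(user_text, model=scripted_model, max_steps=5):
    messages = [{"role": "user", "content": user_text}]
    for _ in range(max_steps):            # cap the loop so it always ends
        reply = model(messages)
        if reply["tool"] is None:         # no tool requested: the turn is over
            messages.append({"role": "assistant", "content": reply["content"]})
            break
        result = ACTIONS[reply["tool"]](**json.loads(reply["args"]))
        messages.append({"role": "tool", "name": reply["tool"], "content": result})
    return messages

for m in run_turn("What do you see?"):
    print(m["role"], "->", m["content"])
```

Each tool result is appended to the message history before the model is asked again, which is how the camera description ends up shaping the next action.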

The complete vision_robot.py program

#!/usr/bin/env python3
"""
Vision-Enabled LLM Robot — sees, hears, thinks, speaks, and acts.

11 tools including describe_surroundings for camera vision.

Usage:
    sudo python3 vision_robot.py          # voice mode
    sudo python3 vision_robot.py --type   # keyboard mode
"""

import sys, os, io, json, time, base64, ctypes

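# Silence ALSA's noisy error messages. Keep a module-level reference to the
# callback so it is not garbage-collected while ALSA still holds a pointer to it.
try:
    asound = ctypes.cdll.LoadLibrary("libasound.so.2")
    c_handler = ctypes.CFUNCTYPE(None, ctypes.c_char_p, ctypes.c_int,
                                  ctypes.c_char_p, ctypes.c_int, ctypes.c_char_p)
    _alsa_error_handler = c_handler(lambda *_: None)
    asound.snd_lib_error_set_handler(_alsa_error_handler)
except (OSError, AttributeError):
    pass  # ALSA is unavailable; carry on without silencing it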

sys.path.insert(0, os.path.expanduser("~/camp"))

from secret import OPENAI_KEY
from openai import OpenAI
from picamera2 import Picamera2
from picarx import Picarx
try:
    import vosk, pyaudio
except ImportError:
    vosk = pyaudio = None

MODEL_PATH = os.path.expanduser("~/camp/vosk-model")
WAKE_WORD = "hey buddy"

SYSTEM_PROMPT = """You are an AI assistant controlling a physical robot car with a camera.
You can see through your camera, move, turn, look around, and speak.
When the user asks about your surroundings, use describe_surroundings first.
When asked to do something physical, use the movement tools.
Always respond with a short spoken message — call speak() so the user hears you.
Be enthusiastic, specific about what you see, and keep responses under 2 sentences."""

# (Full TOOLS list with 11 tools, ACTIONS dict, all action functions,
#  find_mic, listen_for_wake, listen_command, process_command, and main
#  are the same as the version in robot-content/vision_robot.py)

The complete program is at ~/camp/day4/vision_robot.py — it extends llm_robot.py with camera support and the describe_surroundings tool.

Run the program

sudo python3 ~/camp/day4/vision_robot.py --type

Or with voice:

sudo python3 ~/camp/day4/vision_robot.py

Try these commands

Vision commands:

“What do you see?”
“Is there anything in front of you?”
“What color is that object?”
“Describe your surroundings”
“Can you read that sign?”

Vision + action combinations:

“Look around and tell me what you see”
“If the path is clear, move forward”
“What do you see? If it’s interesting, celebrate!”

Still works from Day 3:

“Make a square”
“Move forward for 3 seconds”
“Celebrate”
All 10 original tools still work

Available tools (updated)

The robot now has 11 tools — everything from Day 3 plus vision:

Tool	What it does
`move_forward`	Drive forward
`move_backward`	Drive backward
`turn_left`	Turn left by N degrees
`turn_right`	Turn right by N degrees
`stop_car`	Stop all movement
`speak`	Say something aloud
`nod`	Nod yes
`shake_head`	Shake head no
`celebrate`	Happy dance
`look_around`	Pan camera left/right
`describe_surroundings`	Capture photo and describe what is visible

The big picture

This is the foundation of embodied AI: a model that perceives its environment and acts within it. Your robot now has all five layers:

Movement — it can drive and steer
Sensing — it reads distance, brightness, and audio
Speech — it can speak and listen
Language intelligence — it reasons with an LLM
Vision — it sees and understands through a VLM

What would you need to add to make this fully autonomous?

Make it yours

Change what the robot looks for

Modify the describe_surroundings tool’s question parameter. Instead of general description, make it look for specific things: “Count how many people you see” or “Tell me if there’s food in this image.”
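
One way to try this without editing the tool itself is to create a specialized copy with a different default question. The sketch below uses functools.partial on a stub version of the tool. The count_people name and the stub body are assumptions for illustration.

```python
from functools import partial

def action_describe_surroundings(question="Describe what you see.", **_):
    # Stub: the real function would capture a photo and call the vision API.
    return f"(stub) asked the camera: {question}"

# Same tool, different default question.
count_people = partial(action_describe_surroundings,
                       question="Count how many people you see.")

print(count_people())
# (stub) asked the camera: Count how many people you see.
```

To let GPT call the specialized version, it would also need its own entry in TOOLS and ACTIONS, just like describe_surroundings.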

Add obstacle avoidance

Combine vision with the ultrasonic sensor. If the distance sensor reads less than 20 cm, automatically call describe_surroundings to see what the obstacle is, then decide whether to go around it.
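
A minimal sketch of that check is shown below. To avoid guessing at the sensor API, the distance reader and the describe function are passed in as arguments. On the robot you would pass your own distance-reading function and action_describe_surroundings. The check_obstacle name and the handling of negative readings are assumptions.

```python
def check_obstacle(read_distance_cm, describe, threshold_cm=20):
    """Return a description of a nearby obstacle, or None if nothing is close."""
    distance = read_distance_cm()
    if distance is None or distance < 0:
        return None  # assumed: a missing or negative reading means "no data"
    if distance < threshold_cm:
        # Something is close: use the camera to find out what it is.
        return describe(question="What obstacle is directly in front of you?")
    return None

# Try it with fake sensors instead of real hardware.
fake_camera = lambda question: f"Camera sees: a cardboard box ({question})"
print(check_obstacle(lambda: 12.5, fake_camera))
print(check_obstacle(lambda: 80.0, fake_camera))   # None: path is clear
```

Reading the ultrasonic sensor is fast and free, while each camera call costs time and API usage, so the sensor makes a sensible trigger for the slower vision step.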

Vision-guided navigation

Create a loop where the robot drives forward, checks what it sees every 3 seconds, and stops if it sees a specific object (like a red cup or a person).
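
The loop might look something like the sketch below. The drive, stop, and describe functions are passed in, so the same code runs with fakes on a laptop or with the real tools on the robot. The patrol name, its parameters, and the simple substring match are all assumptions for illustration.

```python
import time

def patrol(drive, stop, describe, target="red cup",
           interval_s=3.0, max_checks=10, sleep=time.sleep):
    """Drive forward, look every interval_s seconds, stop when target is described.

    Returns the number of the check that found the target, or None.
    """
    try:
        for check in range(1, max_checks + 1):
            drive()
            sleep(interval_s)
            seen = describe(question=f"Do you see a {target}? Describe the scene briefly.")
            if target.lower() in seen.lower():
                return check
        return None
    finally:
        stop()  # always stop the motors, even after an error

# Dry run with fakes: the target shows up on the third look.
answers = iter(["an empty hallway", "a chair", "a red cup on the floor"])
log = []
found = patrol(drive=lambda: log.append("drive"),
               stop=lambda: log.append("stop"),
               describe=lambda question: next(answers),
               sleep=lambda s: None)
print(found, log)
# 3 ['drive', 'drive', 'drive', 'stop']
```

The substring match is crude: a reply such as “I do not see a red cup” would also count as a hit. Asking the vision model for a plain yes or no answer and checking for that would be more reliable.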

Day 4 wrap-up

Recap: The robot now sees, hears, thinks, speaks, and moves. All five layers are connected in a single program.

Tomorrow is different. You will zoom out to discuss what all of this means for society: bias, surveillance, data privacy, and the dual-use problem. Then you will bring everything together in one final project: a robot reporter that tours the room and narrates what it sees.
