Skip to main content

Camera feed into a VLM: text and vision talk

Time: 9:55 AM to 10:55 AM
Now you will connect the robot’s camera directly to a vision language model. The robot captures what it sees, sends the image to GPT-4o, and speaks the description aloud.

Capture a single frame

Use the camera library to capture one frame:
from vilib import Vilib
Vilib.take_photo()
This saves a JPEG image locally on the robot.

Send to the vision API

Send the captured image along with a text question to the GPT-4o vision API:
Describe what you see in this image.
The model returns a text description of the image contents.

Run the live loop

Run the Text Vision Talk example:
sudo python3 17.text_vision_talk.py
This runs in a continuous loop: capture frame, send to VLM, speak the description.

Test it

Hold different objects in front of the camera and ask questions:
  • “What is in front of you?”
  • “What color is it?”
  • “Is the path clear?”
  • “How many objects do you see?”

How the API call works

The image gets encoded as base64 and included in the messages array alongside the text prompt. The model receives both the visual data and your question in a single API call.
GPT-4o vision calls use more tokens than text-only calls. Each image costs roughly 300 to 800 tokens depending on resolution. The facilitator will set a usage reminder.
After this section, take a 10-minute break (10:55 AM to 11:05 AM).