Camera feed and VLM

Camera feed into a VLM

Time: 9:55 AM to 10:55 AM

Now you will connect the robot’s camera directly to a vision language model (VLM). The robot captures a photo, sends it to GPT-4o-mini, and reports the model’s description, either on screen or aloud.

The architecture

Program 1: Test the vision API

Start simple — capture one photo, send it to GPT, and print what it sees.

Step 1 — Capture a frame as base64

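The picamera2 library can capture frames as JPEG. Encode each frame as base64 so it can be embedded as text in the API request:

from picamera2 import Picamera2
import io, base64, time

cam = Picamera2()
cam.configure(cam.create_still_configuration(main={"size": (640, 480)}))
cam.start()
time.sleep(2)  # let the camera warm up, as the complete program does

def capture_frame(cam):
    # Write one JPEG into an in-memory buffer instead of a file on disk.
    stream = io.BytesIO()
    cam.capture_file(stream, format="jpeg")
    stream.seek(0)
    # base64 turns the binary JPEG into plain text for the JSON request.
    return base64.b64encode(stream.read()).decode("utf-8")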

The image becomes a long text string (about 50-80 KB) that gets embedded directly in the API request.
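
The size of that string follows from how base64 works: every 3 bytes of input become 4 characters of text, so the encoded image is about a third larger than the JPEG itself. The sketch below checks that arithmetic. It uses 45 KB of random bytes as a stand-in for a camera frame (an assumption, since real frame sizes vary with the scene), and `base64_length` is a hypothetical helper name, not part of any library.

```python
import base64
import os

def base64_length(raw_bytes: int) -> int:
    """Length of the padded base64 text for raw_bytes bytes of input."""
    return 4 * ((raw_bytes + 2) // 3)

# Stand-in for a JPEG: 45 KB of arbitrary bytes (a real frame's size varies).
fake_jpeg = os.urandom(45 * 1024)
encoded = base64.b64encode(fake_jpeg).decode("utf-8")

print(len(fake_jpeg) // 1024, "KB raw ->", len(encoded) // 1024, "KB base64")
```

With this stand-in, 45 KB of image data becomes 60 KB of text, which sits inside the range quoted above.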

Step 2 — Send image + text to the vision API

The key difference from Day 3: the content field is now a list containing both text and image data:

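import os, sys
sys.path.insert(0, os.path.expanduser("~/camp"))
from secret import OPENAI_KEY  # API key stored in ~/camp/secret.py
from openai import OpenAI

client = OpenAI(api_key=OPENAI_KEY)

image_b64 = capture_frame(cam)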

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what you see."},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{image_b64}",
                    "detail": "low",
                },
            },
        ],
    }],
    max_tokens=300,
)

print(response.choices[0].message.content)

Notice that the content field changed from a simple string to a list with two items: the text prompt and the image. The "detail": "low" setting asks the API to work from a low-resolution version of the image, which keeps token costs down.
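
To make the change in shape concrete, here is a small sketch that builds both kinds of message without calling the API. `build_vision_content` is a hypothetical helper name, the Day 3 prompt is made up for illustration, and `"AAAA"` stands in for a real base64 image string.

```python
def build_vision_content(prompt, image_b64, detail="low"):
    """Return the two-item content list: the text prompt, then the inline JPEG."""
    return [
        {"type": "text", "text": prompt},
        {
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{image_b64}",
                "detail": detail,
            },
        },
    ]

# Text-only message: content is a plain string.
text_message = {"role": "user", "content": "Tell me a joke."}

# Vision message: content is a list of typed parts.
vision_message = {
    "role": "user",
    "content": build_vision_content("Describe what you see.", "AAAA"),
}

print(type(text_message["content"]).__name__)
print(type(vision_message["content"]).__name__)
print(vision_message["content"][1]["image_url"]["url"])
```

Keeping the list construction in one function means the prompt and the image can change on every call while the structure stays the same.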

Step 3 — Interactive loop

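After the first description, let the user ask follow-up questions about the same image or snap a new one. The loop calls ask_vision(client, image_b64, prompt), a small helper that wraps the Step 2 request and returns the reply text; it is defined in the complete program below:

while True:
    user_input = input("  You: ").strip()

    if not user_input:
        continue  # ignore empty lines

    if user_input.lower() in ("quit", "exit", "q"):
        break

    if user_input.lower() in ("snap", "photo", "new"):
        # Take a new photo and describe it.
        image_b64 = capture_frame(cam)
        description = ask_vision(client, image_b64, "Describe what you see.")
        print(f"  GPT sees: {description}")
        continue

    # Any other input is a question about the current photo.
    answer = ask_vision(client, image_b64, user_input)
    print(f"  GPT: {answer}")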
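
The branching in this loop can also be pulled into a small function, which makes it easy to check away from the robot. The sketch below is one possible arrangement, not the program's actual structure: `classify_input` and the two word tuples are hypothetical names, and the quit words are taken from the complete program.

```python
SNAP_WORDS = ("snap", "photo", "new")
QUIT_WORDS = ("quit", "exit", "q")

def classify_input(user_input):
    """Map one line of typed input to 'empty', 'quit', 'snap', or 'ask'."""
    word = user_input.strip().lower()
    if not word:
        return "empty"   # nothing typed: do not call the API
    if word in QUIT_WORDS:
        return "quit"    # leave the loop
    if word in SNAP_WORDS:
        return "snap"    # take a new photo and describe it
    return "ask"         # question about the current photo

for line in ["snap", "  QUIT ", "what color is the mug?", ""]:
    print(repr(line), "->", classify_input(line))
```

Only the "snap" and "ask" outcomes lead to an API call, so empty lines and quit commands cost nothing.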

Run it

python3 ~/camp/day4/test_vision.py

Point the camera at different objects and see what GPT describes:

==================================================
  Test Vision (GPT-4o-mini)
==================================================

  Starting camera...
  Camera ready.
  OpenAI client ready.

  Capturing photo...
  Photo captured (62 KB base64)
  Sending to GPT-4o-mini vision...

  GPT sees: The image shows a desk with a laptop computer,
            a coffee mug, and some papers scattered around.
            There appears to be a window in the background
            with natural light coming in.

  Interactive mode — ask about what the robot sees.
  Type 'snap' to take a new photo.
  Type 'quit' to exit.

  You: what color is the mug?
  GPT: The mug appears to be white with a blue logo on it.

  You: snap
  Capturing new photo...
  GPT sees: Now I can see a hand holding a red book in front
            of the camera.

Try these experiments:

Hold up different colored objects and ask “what color is this?”
Show it text on a piece of paper and ask “what does this say?”
Point it at a person and ask “describe who you see”
Cover the lens and ask “what do you see?”; the model will typically report a dark or black image

The complete test_vision.py program:

#!/usr/bin/env python3
"""
Test Vision — captures a photo, sends to GPT-4o-mini vision,
prints the description, then allows interactive follow-up.
"""

import sys, os, io, base64, time

sys.path.insert(0, os.path.expanduser("~/camp"))
from secret import OPENAI_KEY
from openai import OpenAI
from picamera2 import Picamera2

def capture_frame(cam):
    stream = io.BytesIO()
    cam.capture_file(stream, format="jpeg")
    stream.seek(0)
    return base64.b64encode(stream.read()).decode("utf-8")

def ask_vision(client, image_b64, prompt):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/jpeg;base64,{image_b64}",
                    "detail": "low"}},
            ],
        }],
        max_tokens=300,
    )
    return response.choices[0].message.content.strip()

def main():
    print("=" * 50)
    print("  Test Vision (GPT-4o-mini)")
    print("=" * 50)

    cam = Picamera2()
    cam.configure(cam.create_still_configuration(main={"size": (640, 480)}))
    cam.start(); time.sleep(2)

    client = OpenAI(api_key=OPENAI_KEY)

    print("\n  Capturing photo...")
    image_b64 = capture_frame(cam)
    print(f"  Photo captured ({len(image_b64) // 1024} KB base64)")

    print("  Sending to GPT-4o-mini vision...\n")
    desc = ask_vision(client, image_b64, "Describe what you see in 2-3 sentences.")
    print(f"  GPT sees: {desc}\n")

    print("  Type 'snap' for new photo, 'quit' to exit.\n")
    while True:
        try:
            inp = input("  You: ").strip()
        except (EOFError, KeyboardInterrupt):
            break
        if inp.lower() in ("quit", "exit", "q"): break
        if inp.lower() in ("snap", "photo", "new"):
            image_b64 = capture_frame(cam)
            desc = ask_vision(client, image_b64, "Describe what you see in 2-3 sentences.")
            print(f"  GPT sees: {desc}\n"); continue
        if inp:
            ans = ask_vision(client, image_b64, inp)
            print(f"  GPT: {ans}\n")

    cam.stop(); cam.close()
    print("\n  Done!\n")

if __name__ == "__main__":
    main()

Program 2: Vision chatbot

Now add voice input and text-to-speech (TTS) output. The robot captures a photo with every question, sends both the image and your spoken question to GPT, and speaks the answer.

Step 1 — Combine camera + voice input

The key difference from Day 3’s voice chatbot: every API call now includes a fresh camera frame alongside the text:

def ask_vision(client, image_b64, text, history):
    messages = history.copy()
    messages.append({
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{image_b64}",
                    "detail": "low",
                },
            },
        ],
    })

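    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        max_tokens=200,
    )
    answer = response.choices[0].message.content.strip()
    # Remember the exchange as text only; old photos are not sent again.
    history.append({"role": "user", "content": text})
    history.append({"role": "assistant", "content": answer})
    return answer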
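
The complete program records each exchange in history as plain text, so only the newest message carries a photo. The sketch below separates those two jobs to show how they interact. `build_messages` and `record_turn` are hypothetical helper names, `"AAAA"` stands in for a real base64 image, and no API call is made.

```python
def build_messages(history, text, image_b64):
    """Copy the history and append the current question with its photo."""
    messages = history.copy()  # shallow copy: appending leaves history unchanged
    messages.append({
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {
                "url": f"data:image/jpeg;base64,{image_b64}",
                "detail": "low"}},
        ],
    })
    return messages

def record_turn(history, text, answer):
    """Store the finished exchange as plain text, without the photo."""
    history.append({"role": "user", "content": text})
    history.append({"role": "assistant", "content": answer})

history = [{"role": "system", "content": "You are a friendly robot with a camera."}]

messages = build_messages(history, "what do you see", "AAAA")
print(len(history), "message in history,", len(messages), "sent to the API")

record_turn(history, "what do you see", "A desk with a laptop.")
print(len(history), "messages in history after the turn")
```

Because earlier photos are dropped, each request stays small, at the cost that the model only has its own earlier descriptions to go on when asked about a previous scene.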

Step 2 — The conversation loop with camera

Every iteration: listen → capture photo → send both to GPT → speak response:

while True:
    text = listen_once(recognizer, stream, chunk, timeout=10)
    if not text:
        continue

    # Capture fresh photo with every question
    image_b64 = capture_frame(cam)

    answer = ask_vision(client, image_b64, text, history)
    print(f"  Robot: {answer}")
    speak(answer)

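The robot always sends a current photo, so if you move an object between questions, the model sees the change. Only the text of earlier turns is kept in the history, so the model cannot look back at previous photos.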
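
One way to try this flow without a microphone, camera, or API key is to pass each stage in as a function. The sketch below is an illustration under that assumption: `run_turn` is a hypothetical name, and the four stand-in functions only imitate listen_once, capture_frame, ask_vision, and speak.

```python
def run_turn(listen, capture, ask, speak):
    """One pass of listen -> capture -> ask -> speak. Returns the answer or None."""
    text = listen()
    if not text:
        return None          # nothing heard: skip the photo and the API call
    image_b64 = capture()    # fresh frame for this question
    answer = ask(image_b64, text)
    speak(answer)
    return answer

# Stand-ins for the microphone, camera, model, and speaker.
frames = iter(["frame-1", "frame-2"])
spoken = []

def fake_ask(image_b64, text):
    return f"{text!r} answered using {image_b64}"

run_turn(lambda: "what do you see", lambda: next(frames), fake_ask, spoken.append)
run_turn(lambda: "", lambda: next(frames), fake_ask, spoken.append)
run_turn(lambda: "and now", lambda: next(frames), fake_ask, spoken.append)

for line in spoken:
    print(line)
```

The silent second turn consumes no frame, and each answered question is paired with its own newly captured frame.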

Run it

python3 ~/camp/day4/vision_chatbot.py

Or keyboard mode:

python3 ~/camp/day4/vision_chatbot.py --type
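
The complete program also falls back to keyboard mode on its own when the speech libraries or the Vosk model folder are missing. A minimal sketch of that decision follows; `choose_mode` and `voice_available` are hypothetical names used only for illustration.

```python
def choose_mode(argv, voice_available):
    """Return 'keyboard' or 'voice' following the chatbot's fallback rule."""
    wants_keyboard = "--type" in argv or "-t" in argv
    if wants_keyboard or not voice_available:
        return "keyboard"
    return "voice"

print(choose_mode(["vision_chatbot.py"], voice_available=True))
print(choose_mode(["vision_chatbot.py", "--type"], voice_available=True))
print(choose_mode(["vision_chatbot.py"], voice_available=False))
```

So if the microphone setup is not working, the chatbot can still be used from the keyboard with or without the flag.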

Try these interactions:

  Robot: Hello! I can see through my camera. Ask me what I see!

  You: what do you see
  Robot: I can see a desk with a laptop and a blue water bottle.
         There's a whiteboard in the background.

  You: what color is the water bottle
  Robot: The water bottle appears to be blue with a silver cap.

  You: hold a book up — I'm going to show you something
  Robot: I can see you're holding up a book! It looks like it has
         a red cover with white text.

The complete vision_chatbot.py program:

#!/usr/bin/env python3
"""
Vision Chatbot — voice/keyboard chatbot that can SEE.
Captures a photo with every question and sends it to GPT-4o-mini vision.

Usage:
    python3 vision_chatbot.py           # voice mode
    python3 vision_chatbot.py --type    # keyboard mode
"""

import sys, os, io, json, time, base64, ctypes

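try:
    # Silence ALSA's noisy error messages. Keep a module-level reference to
    # the callback so it is not garbage-collected while ALSA still holds it.
    asound = ctypes.cdll.LoadLibrary("libasound.so.2")
    c_handler = ctypes.CFUNCTYPE(None, ctypes.c_char_p, ctypes.c_int,
                                  ctypes.c_char_p, ctypes.c_int, ctypes.c_char_p)
    _alsa_error_handler = c_handler(lambda *_: None)
    asound.snd_lib_error_set_handler(_alsa_error_handler)
except (OSError, AttributeError):
    pass  # ALSA not available; nothing to silence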

sys.path.insert(0, os.path.expanduser("~/camp"))
from secret import OPENAI_KEY
from openai import OpenAI
from picamera2 import Picamera2

try:
    import vosk, pyaudio
except ImportError:
    vosk = pyaudio = None

MODEL_PATH = os.path.expanduser("~/camp/vosk-model")

SYSTEM_PROMPT = """You are a friendly robot with a camera. Describe what you see
clearly in 1-3 sentences. Be specific about colors, shapes, and objects."""

def speak(text, speed=160):
    # Pass the arguments as a list so no shell quoting or escaping is needed.
    import subprocess
    subprocess.run(["espeak", "-s", str(speed), text], stderr=subprocess.DEVNULL)

def capture_frame(cam):
    stream = io.BytesIO()
    cam.capture_file(stream, format="jpeg")
    stream.seek(0)
    return base64.b64encode(stream.read()).decode("utf-8")

def find_mic():
    pa = pyaudio.PyAudio(); usb_idx = any_idx = None
    for i in range(pa.get_device_count()):
        info = pa.get_device_info_by_index(i)
        if info["maxInputChannels"] < 1: continue
        if any_idx is None: any_idx = i
        if "usb" in info["name"].lower(): usb_idx = i
    idx = usb_idx if usb_idx is not None else any_idx
    rate = int(pa.get_device_info_by_index(idx)["defaultSampleRate"])
    pa.terminate(); return idx, rate

def listen_once(recognizer, stream, chunk, timeout=10):
    start = time.time()
    while time.time() - start < timeout:
        data = stream.read(chunk, exception_on_overflow=False)
        if recognizer.AcceptWaveform(data):
            text = json.loads(recognizer.Result()).get("text", "").strip()
            if text: return text
    return json.loads(recognizer.FinalResult()).get("text", "").strip()

def ask_vision(client, image_b64, text, history):
    messages = history.copy()
    messages.append({"role": "user", "content": [
        {"type": "text", "text": text},
        {"type": "image_url", "image_url": {
            "url": f"data:image/jpeg;base64,{image_b64}", "detail": "low"}}]})
    response = client.chat.completions.create(
        model="gpt-4o-mini", messages=messages, max_tokens=200)
    answer = response.choices[0].message.content.strip()
    history.append({"role": "user", "content": text})
    history.append({"role": "assistant", "content": answer})
    return answer

def main():
    type_mode = "--type" in sys.argv or "-t" in sys.argv

    cam = Picamera2()
    cam.configure(cam.create_still_configuration(main={"size": (640, 480)}))
    cam.start(); time.sleep(2)

    audio_stream = pa = recognizer = None; chunk = 0
    if not type_mode and vosk and pyaudio and os.path.isdir(MODEL_PATH):
        mic_idx, mic_rate = find_mic(); chunk = mic_rate // 4
        vosk.SetLogLevel(-1)
        recognizer = vosk.KaldiRecognizer(vosk.Model(MODEL_PATH), mic_rate)
        pa = pyaudio.PyAudio()
        audio_stream = pa.open(format=pyaudio.paInt16, channels=1, rate=mic_rate,
                               input=True, input_device_index=mic_idx, frames_per_buffer=chunk)
    else:
        type_mode = True

    client = OpenAI(api_key=OPENAI_KEY)
    history = [{"role": "system", "content": SYSTEM_PROMPT}]

    speak("Hello! I can see through my camera. Ask me what I see!")
    try:
        while True:
            if type_mode:
                text = input("\n  You: ").strip()
            else:
                sys.stdout.write("  🎤  Listening... "); sys.stdout.flush()
                text = listen_once(recognizer, audio_stream, chunk)
                if not text: sys.stdout.write("(nothing heard)\n"); continue
                print(f"\n  You: {text}")
            if not text: continue
            if text.lower() in ("goodbye", "bye", "quit", "exit", "q"):
                speak("Goodbye!"); break
            image_b64 = capture_frame(cam)
            answer = ask_vision(client, image_b64, text, history)
            print(f"  Robot: {answer}")
            speak(answer)
    except KeyboardInterrupt: print("\n  Interrupted.")
    finally:
        cam.stop(); cam.close()
        if audio_stream: audio_stream.close()
        if pa: pa.terminate()

if __name__ == "__main__":
    main()

After this section, take a 10-minute break (10:55 AM to 11:05 AM).
