What are VLMs?

What is a vision language model?

Time: 9:30 AM to 9:55 AM

Yesterday, your LLM took text in and generated text out. Today, it learns to see. A vision language model (VLM) takes both images and text as input, so you can ask questions about what the robot’s camera captures.

From text to vision

On Day 3, your robot could hear and speak, but it was blind. Today you give it eyes. A VLM is not a separate model — it is the same LLM (GPT-4o-mini) with a vision encoder bolted on. The vision encoder converts images into tokens the language model can understand.

How vision encoding works

When you send an image to the API, it goes through a multi-step pipeline before the language model ever sees it:

Image capture

The robot’s camera captures a frame as a JPEG image (640x480 pixels).

Base64 encoding

The JPEG is converted to a base64 text string so it can be included in the API request.

Vision encoder

On OpenAI’s servers, a vision encoder (like a CLIP model) converts the image into a sequence of visual tokens — a compressed numerical representation of the image.

Combined input

These visual tokens are concatenated with your text prompt tokens. The language model now sees both.

Response generation

The model generates text tokens conditioned on both the image and your question, producing a response that references what it “sees.”

The model does not see like a human. It receives a compressed mathematical representation of the image and predicts tokens based on patterns learned during training. It can be fooled by unusual angles, low lighting, or objects it has never seen in training data.

What this enables

With a VLM, you can ask your robot:

Question	What the robot does
”What do you see?”	Captures a photo and describes the scene
”Is there an obstacle ahead?”	Looks at the image and reports objects in the path
”What color is that object?”	Identifies colors from the camera feed
”How many people are in front of you?”	Counts visible people
”Read that sign for me”	Attempts to read text visible in the image
”Should I move forward or stop?”	Makes a judgment call based on what it sees

What VLMs struggle with

VLMs are impressive but have clear limits:

Limitation	Example
Poor lighting	Dark rooms produce blurry, underexposed images the model cannot interpret
Small text	Tiny text at a distance is unreadable from a low-resolution camera
Unusual angles	A bird’s-eye view of common objects can confuse the model
Hallucination	The model may describe objects that are not in the image
Speed	Each vision API call takes 2-5 seconds — not real-time

VLMs hallucinate just like text-only LLMs. The model may confidently describe an object that does not exist in the image. Always verify important observations.

Model and cost

Today’s exercises use GPT-4o-mini via the OpenAI API. Each vision API call costs roughly 300-800 tokens depending on image resolution. We use detail: "low" to keep costs down while still getting useful descriptions.

Setting	Token cost	Quality
`detail: "low"`	~300 tokens per image	Good enough for object identification and scene description
`detail: "high"`	~800 tokens per image	Better for reading text and fine details

Day	Capability	Status
Day 1-2	Movement, sensors, TTS	Done
Day 3	Hearing, speaking, thinking, acting	Done
Day 4	Seeing and understanding	Today
Day 5	Ethics + final project	Tomorrow

By the end of today, your robot will have all five layers: movement, sensing, speech, language intelligence, and vision. That is the full stack of embodied AI.

Welcome

Class Recordings

Day 1: Setup and Calibration

Day 2: Code & Computer Vision

Day 3: GenAI and Cloud LLMs

Day 4: Vision AI

Day 5: AI Ethics & Final Project

What is a vision language model?

From text to vision

How vision encoding works

What this enables

What VLMs struggle with

Model and cost

From blind to sighted: the week so far

​What is a vision language model?

​From text to vision

​How vision encoding works

​What this enables

​What VLMs struggle with

​Model and cost

​From blind to sighted: the week so far

What is a vision language model?

From text to vision

How vision encoding works

What this enables

What VLMs struggle with

Model and cost

From blind to sighted: the week so far