Skip to main content

What is a vision language model?

Time: 9:30 AM to 9:55 AM
Yesterday, the LLM took text in and generated text out. A vision language model (VLM) takes both images and text as input, opening up an entirely new set of capabilities.

From text to vision

On Day 3, your robot could hear and speak. Today, it learns to see. A VLM processes images alongside text so you can ask questions about what the robot sees.

How vision encoding works

1

Image capture

The robot’s camera captures a frame as a JPEG image.
2

Vision encoder

The image is processed by a vision encoder that converts it into a sequence of visual tokens.
3

Combined input

These visual tokens are fed into the language model alongside your text prompt.
4

Response generation

The model predicts tokens conditioned on both the image and the text, generating a response that references what it “sees.”
The model does not see like a human. It receives a compressed representation of the image and predicts tokens based on patterns learned during training.

What this enables

With a VLM, you can ask your robot:
  • “Describe what you see”
  • “Is there an obstacle in front of you?”
  • “What color is the object on your left?”
  • “Should I move forward or stop?”

Model used

Today’s exercises use GPT-4o via the OpenAI API. API keys will be distributed before the session begins.