LLM-controlled robot via tool calls

Time: 11:25 AM to 12:15 PM

This is the moment everything comes together. Your robot now understands speech, thinks with a large language model, speaks back to you, and executes physical actions based on the conversation.

The full loop

Building the program

This is the most complex program of the week. Here is how each piece works.

Step 1 — Define the tools

You give GPT a list of tools as JSON-style dictionaries. Each tool has a name, a description, and parameters. Here is one tool from the list:

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "move_forward",
            "description": "Drive the robot forward.",
            "parameters": {
                "type": "object",
                "properties": {
                    "speed": {
                        "type": "integer",
                        "description": "Speed percentage (10-60). Default 30.",
                    },
                    "duration": {
                        "type": "number",
                        "description": "How many seconds to drive. Default 2.",
                    },
                },
            },
        },
    },
    # ... 9 more tools (turn_left, celebrate, speak, etc.)
]

The model reads these descriptions to decide which tool to call and what arguments to pass. It never sees your Python code — only this JSON menu.
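Writing ten of these nested dictionaries by hand is repetitive. As a minimal sketch, a small helper can build each entry. The name make_tool and its arguments are hypothetical: they are not part of llm_robot.py or of the OpenAI library. The dictionaries it returns have the same shape as the entry shown above.

```python
def make_tool(name, description, properties=None, required=()):
    """Build one entry for the TOOLS list (hypothetical helper, not in llm_robot.py)."""
    parameters = {"type": "object", "properties": dict(properties or {})}
    if required:
        # Only add "required" when at least one argument is mandatory
        parameters["required"] = list(required)
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": parameters,
        },
    }


move_forward_tool = make_tool(
    "move_forward",
    "Drive the robot forward.",
    {
        "speed": {"type": "integer", "description": "Speed percentage (10-60). Default 30."},
        "duration": {"type": "number", "description": "How many seconds to drive. Default 2."},
    },
)

speak_tool = make_tool(
    "speak",
    "Say something out loud through the robot's speaker.",
    {"text": {"type": "string", "description": "The text to speak aloud."}},
    required=["text"],
)

print(move_forward_tool["function"]["name"])
print(speak_tool["function"]["parameters"]["required"])
```

Whether a helper like this is worth it is a style choice. The explicit dictionaries are longer, but they show exactly what the model receives.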

Step 2 — Write the action functions

Each tool maps to a Python function that controls the physical robot:

import time

from picarx import Picarx

px = Picarx()  # one shared robot object used by every action function

def action_move_forward(speed=30, duration=2, **_):
    speed = max(10, min(int(speed), 60))
    duration = max(0.5, min(float(duration), 10))
    px.forward(speed)
    time.sleep(duration)
    px.stop()
    return f"Drove forward at {speed}% for {duration}s."

def action_celebrate(**_):
    _set_pan(-30)
    time.sleep(0.15)
    _set_pan(30)
    _set_tilt(20)
    time.sleep(0.15)
    _set_pan(0)
    _set_tilt(0)
    px.set_dir_servo_angle(-25)
    px.forward(25)
    time.sleep(0.5)
    px.set_dir_servo_angle(25)
    time.sleep(0.5)
    px.stop()
    px.set_dir_servo_angle(0)
    return "Celebrated!"

Notice two patterns in these functions:

Clamping — max(10, min(speed, 60)) prevents the LLM from requesting dangerous speeds
Return value — each function returns a string describing what happened, which gets sent back to GPT
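The clamping pattern can be tried on its own, without a robot. The sketch below assumes two hypothetical helpers, clamp and safe_speed, which are not in llm_robot.py. In addition to clamping, safe_speed falls back to a default when the model sends something that is not a number.

```python
def clamp(value, low, high):
    """Force value into the range low..high."""
    return max(low, min(value, high))


def safe_speed(raw, default=30):
    """Convert whatever the model sent into a speed between 10 and 60."""
    try:
        speed = int(raw)
    except (TypeError, ValueError):
        speed = default  # e.g. raw was None or the word "fast"
    return clamp(speed, 10, 60)


for requested in (5, 30, 200, "fast"):
    print(requested, "->", safe_speed(requested))
```

The action functions in this lesson call int() directly, so a non-numeric argument would raise an error there. The fallback here is one possible way to make that more forgiving.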

Step 3 — Connect tools to functions

A dictionary maps tool names to Python functions. When GPT says "call move_forward", your code looks it up here:

ACTIONS = {
    "move_forward":  action_move_forward,
    "move_backward": action_move_backward,
    "turn_left":     action_turn_left,
    "turn_right":    action_turn_right,
    "stop_car":      action_stop_car,
    "speak":         action_speak,
    "nod":           action_nod,
    "shake_head":    action_shake_head,
    "celebrate":     action_celebrate,
    "look_around":   action_look_around,
}
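The lookup itself is one dictionary access. As a sketch, the two stub actions below stand in for the real robot functions so the example runs without hardware. The helpers run_tool and missing_actions are hypothetical names, not part of llm_robot.py. The first reports an unknown tool name instead of crashing, and the second checks that every tool offered to the model has a matching Python function.

```python
def action_nod(**_):
    return "Nodded."  # stub: the real function moves the camera


def action_stop_car(**_):
    return "Stopped."  # stub: the real function stops the motors


ACTIONS = {"nod": action_nod, "stop_car": action_stop_car}


def run_tool(name, args):
    """Look up a tool by name and call it with the model's arguments."""
    action = ACTIONS.get(name)
    if action is None:
        return f"Unknown tool: {name}"
    return action(**args)


def missing_actions(tools, actions):
    """Return tool names offered to the model that have no Python function."""
    names = [tool["function"]["name"] for tool in tools]
    return [name for name in names if name not in actions]


# A trimmed-down tool list, just for this check
tools = [
    {"type": "function", "function": {"name": "nod"}},
    {"type": "function", "function": {"name": "fly"}},
]

print(run_tool("nod", {}))
print(run_tool("fly", {}))
print(missing_actions(tools, ACTIONS))
```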

Step 4 — Process a command with GPT

This is the core function. It sends the user’s command to GPT along with the tool menu, then executes any tool calls GPT requests:

def process_command(client, messages, command):
    messages.append({"role": "user", "content": command})

    # First API call: GPT decides what to do
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        tools=TOOLS,
        tool_choice="auto",
    )
    msg = response.choices[0].message

    if msg.tool_calls:
        messages.append(msg)

        # Execute each tool call
        for tool_call in msg.tool_calls:
            fn_name = tool_call.function.name
            args = json.loads(tool_call.function.arguments)

            print(f"    → {fn_name}({args})")

            result = ACTIONS[fn_name](**args)

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result,
            })

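        # Second API call: GPT generates spoken response
        followup = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
        )
        final = followup.choices[0].message
        # Keep the reply in the history so later commands have context
        messages.append({"role": "assistant", "content": final.content or ""})
        return final.content or ""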

    else:
        # No tools needed — just a text response
        messages.append({"role": "assistant", "content": msg.content})
        return msg.content

There are two API calls per command:

First call — GPT reads the command and the tool menu, then returns structured JSON saying which functions to call with which arguments
Second call — after your code executes the tools and sends results back, GPT generates the natural language response that gets spoken aloud
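In the first call's response, the arguments for each tool call arrive as a JSON string, which is why the code runs json.loads on them. The sketch below shows one defensive way to do that parsing. The helper name parse_tool_args is hypothetical, and treating bad input as "no arguments" is an assumption about what is safest for this robot, since every action function has defaults.

```python
import json


def parse_tool_args(raw):
    """Turn the JSON string from a tool call into keyword arguments."""
    if not raw:
        return {}  # tools without parameters may send an empty string
    try:
        args = json.loads(raw)
    except json.JSONDecodeError:
        return {}  # malformed JSON: fall back to the function's defaults
    return args if isinstance(args, dict) else {}


print(parse_tool_args('{"speed": 40, "duration": 3}'))
print(parse_tool_args(""))
print(parse_tool_args("{oops"))
```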

Step 5 — The main loop

Tie it all together with wake word detection and a command loop:

while True:
    # Wait for wake word
    listen_for_wake(recognizer, stream, chunk)
    tts("Yes?")

    # Listen for the command
    command = listen_command(recognizer, stream, chunk, timeout=10)
    if not command:
        tts("I didn't catch that. Try again.")
        continue

    # Process with GPT + execute tools
    response_text = process_command(client, messages, command)

    # Speak the response
    if response_text:
        tts(response_text)

In type mode (--type flag), the wake word and voice input are replaced with input() — same tool-calling logic, just typed instead of spoken.

Complete program: llm_robot.py

#!/usr/bin/env python3
"""
LLM-Controlled Robot — voice-activated robot controlled by GPT via tool calls.

The full loop:
  1. Say "Hey buddy" to wake the robot
  2. Speak your command
  3. Vosk transcribes your speech
  4. GPT decides which physical actions to take
  5. The robot executes movements AND speaks a response

Usage:
    sudo python3 llm_robot.py          # voice mode
    sudo python3 llm_robot.py --type   # keyboard mode
"""

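import sys, os, json, time, ctypes

# Silence ALSA's noisy error messages. The handler object is kept in a
# module-level variable so it is not garbage-collected while ALSA still uses it.
try:
    asound = ctypes.cdll.LoadLibrary("libasound.so.2")
    c_handler = ctypes.CFUNCTYPE(None, ctypes.c_char_p, ctypes.c_int,
                                  ctypes.c_char_p, ctypes.c_int, ctypes.c_char_p)
    _alsa_error_handler = c_handler(lambda *_: None)
    asound.snd_lib_error_set_handler(_alsa_error_handler)
except Exception:
    pass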

_real_user = os.environ.get("SUDO_USER", "")
_HOME = os.path.expanduser(f"~{_real_user}") if _real_user else os.path.expanduser("~")
sys.path.insert(0, os.path.join(_HOME, "camp"))

from secret import OPENAI_KEY
from openai import OpenAI
import vosk, pyaudio
from picarx import Picarx

MODEL_PATH = os.path.join(_HOME, "camp", "vosk-model")
WAKE_WORD = "hey buddy"

SYSTEM_PROMPT = """You are an AI assistant controlling a physical robot car (PiCar-X).
You can move, turn, look around, and speak. You have a camera on a pan/tilt mount.
When the user asks you to do something physical, use the available tools.
You can call multiple tools in sequence for complex actions.
Always respond with a short spoken message too — call speak() so the user hears you.
Be enthusiastic, friendly, and keep spoken responses under 2 sentences."""

TOOLS = [
    {"type": "function", "function": {"name": "move_forward",
        "description": "Drive the robot forward.",
        "parameters": {"type": "object", "properties": {
            "speed": {"type": "integer", "description": "Speed percentage (10-60). Default 30."},
            "duration": {"type": "number", "description": "Seconds to drive. Default 2."}}}}},
    {"type": "function", "function": {"name": "move_backward",
        "description": "Drive the robot backward.",
        "parameters": {"type": "object", "properties": {
            "speed": {"type": "integer", "description": "Speed percentage (10-60). Default 30."},
            "duration": {"type": "number", "description": "Seconds to drive. Default 2."}}}}},
    {"type": "function", "function": {"name": "turn_left",
        "description": "Turn the robot left by a given number of degrees.",
        "parameters": {"type": "object", "properties": {
            "degrees": {"type": "integer", "description": "Degrees to turn (10-180). Default 90."}}}}},
    {"type": "function", "function": {"name": "turn_right",
        "description": "Turn the robot right by a given number of degrees.",
        "parameters": {"type": "object", "properties": {
            "degrees": {"type": "integer", "description": "Degrees to turn (10-180). Default 90."}}}}},
    {"type": "function", "function": {"name": "stop_car",
        "description": "Stop all movement and straighten the wheels.",
        "parameters": {"type": "object", "properties": {}}}},
    {"type": "function", "function": {"name": "speak",
        "description": "Say something out loud through the robot's speaker.",
        "parameters": {"type": "object", "properties": {
            "text": {"type": "string", "description": "The text to speak aloud."}},
            "required": ["text"]}}},
    {"type": "function", "function": {"name": "nod",
        "description": "Nod the camera up and down (like saying yes).",
        "parameters": {"type": "object", "properties": {}}}},
    {"type": "function", "function": {"name": "shake_head",
        "description": "Shake the camera side to side (like saying no).",
        "parameters": {"type": "object", "properties": {}}}},
    {"type": "function", "function": {"name": "celebrate",
        "description": "Do a happy celebration dance with the camera and wheels.",
        "parameters": {"type": "object", "properties": {}}}},
    {"type": "function", "function": {"name": "look_around",
        "description": "Pan the camera left and right to look around.",
        "parameters": {"type": "object", "properties": {}}}},
]

px = None

def _set_pan(a):
    if hasattr(px, "set_camera_servo1_angle"): px.set_camera_servo1_angle(a)
    elif hasattr(px, "set_cam_pan_angle"): px.set_cam_pan_angle(a)

def _set_tilt(a):
    if hasattr(px, "set_camera_servo2_angle"): px.set_camera_servo2_angle(a)
    elif hasattr(px, "set_cam_tilt_angle"): px.set_cam_tilt_angle(a)

def tts(text, speed=160):
    # Inside single quotes the shell only needs ' escaped; double quotes are literal
    safe = text.replace("'", "'\\''")
    os.system(f"espeak -s {speed} '{safe}' 2>/dev/null")

def action_move_forward(speed=30, duration=2, **_):
    speed = max(10, min(int(speed), 60)); duration = max(0.5, min(float(duration), 10))
    px.forward(speed); time.sleep(duration); px.stop()
    return f"Drove forward at {speed}% for {duration}s."

def action_move_backward(speed=30, duration=2, **_):
    speed = max(10, min(int(speed), 60)); duration = max(0.5, min(float(duration), 10))
    px.backward(speed); time.sleep(duration); px.stop()
    return f"Drove backward at {speed}% for {duration}s."

def action_turn_left(degrees=90, **_):
    degrees = max(10, min(int(degrees), 180)); duration = degrees / 30.0
    px.set_dir_servo_angle(-35); px.forward(25); time.sleep(duration)
    px.stop(); px.set_dir_servo_angle(0)
    return f"Turned left {degrees} degrees."

def action_turn_right(degrees=90, **_):
    degrees = max(10, min(int(degrees), 180)); duration = degrees / 30.0
    px.set_dir_servo_angle(35); px.forward(25); time.sleep(duration)
    px.stop(); px.set_dir_servo_angle(0)
    return f"Turned right {degrees} degrees."

def action_stop_car(**_):
    px.stop(); px.set_dir_servo_angle(0); return "Stopped."

def action_speak(text="Hello!", **_):
    tts(text); return f"Said: {text}"

def action_nod(**_):
    for _ in range(2): _set_tilt(20); time.sleep(0.3); _set_tilt(-10); time.sleep(0.3)
    _set_tilt(0); return "Nodded."

def action_shake_head(**_):
    for _ in range(2): _set_pan(30); time.sleep(0.3); _set_pan(-30); time.sleep(0.3)
    _set_pan(0); return "Shook head."

def action_celebrate(**_):
    _set_pan(-30); time.sleep(0.15); _set_pan(30); _set_tilt(20); time.sleep(0.15)
    _set_pan(-30); _set_tilt(-10); time.sleep(0.15); _set_pan(0); _set_tilt(0)
    px.set_dir_servo_angle(-25); px.forward(25); time.sleep(0.5)
    px.set_dir_servo_angle(25); time.sleep(0.5); px.stop(); px.set_dir_servo_angle(0)
    return "Celebrated!"

def action_look_around(**_):
    _set_pan(-40); time.sleep(0.6); _set_pan(40); time.sleep(1.2)
    _set_pan(-40); time.sleep(1.2); _set_pan(0)
    return "Looked around."

ACTIONS = {
    "move_forward": action_move_forward, "move_backward": action_move_backward,
    "turn_left": action_turn_left, "turn_right": action_turn_right,
    "stop_car": action_stop_car, "speak": action_speak, "nod": action_nod,
    "shake_head": action_shake_head, "celebrate": action_celebrate,
    "look_around": action_look_around,
}

def find_mic():
    pa = pyaudio.PyAudio(); usb_idx = any_idx = None
    for i in range(pa.get_device_count()):
        info = pa.get_device_info_by_index(i)
        if info["maxInputChannels"] < 1: continue
        if any_idx is None: any_idx = i
        if "usb" in info["name"].lower(): usb_idx = i
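    idx = usb_idx if usb_idx is not None else any_idx
    if idx is None:
        pa.terminate(); raise RuntimeError("No microphone found. Plug one in or use --type.")
    rate = int(pa.get_device_info_by_index(idx)["defaultSampleRate"])
    pa.terminate(); return idx, rate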

def listen_for_wake(recognizer, stream, chunk):
    while True:
        data = stream.read(chunk, exception_on_overflow=False)
        if recognizer.AcceptWaveform(data):
            text = json.loads(recognizer.Result()).get("text", "").lower()
            if WAKE_WORD in text: return
        else:
            partial = json.loads(recognizer.PartialResult()).get("partial", "").lower()
            if WAKE_WORD in partial:
                time.sleep(0.5); recognizer.Reset(); return

def listen_command(recognizer, stream, chunk, timeout=10):
    start = time.time()
    while time.time() - start < timeout:
        data = stream.read(chunk, exception_on_overflow=False)
        if recognizer.AcceptWaveform(data):
            text = json.loads(recognizer.Result()).get("text", "").strip()
            if text: return text
    return json.loads(recognizer.FinalResult()).get("text", "").strip()

def process_command(client, messages, command):
    messages.append({"role": "user", "content": command})
    response = client.chat.completions.create(
        model="gpt-4o-mini", messages=messages, tools=TOOLS,
        tool_choice="auto", max_tokens=300)
    msg = response.choices[0].message
    if msg.tool_calls:
        messages.append(msg)
        for tc in msg.tool_calls:
            fn = tc.function.name
            args = json.loads(tc.function.arguments) if tc.function.arguments else {}
            print(f"    → {fn}({args})")
            result = ACTIONS[fn](**args) if fn in ACTIONS else f"Unknown: {fn}"
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": result})
        followup = client.chat.completions.create(
            model="gpt-4o-mini", messages=messages, max_tokens=200)
        final = followup.choices[0].message
        messages.append({"role": "assistant", "content": final.content or ""})
        return final.content or ""
    else:
        messages.append({"role": "assistant", "content": msg.content or ""})
        return msg.content or ""

def main():
    global px
    type_mode = "--type" in sys.argv or "-t" in sys.argv

    px = Picarx(); px.stop(); px.set_dir_servo_angle(0)
    stream = pa = recognizer = None; chunk = 0

    if not type_mode:
        mic_idx, mic_rate = find_mic(); chunk = mic_rate // 4
        vosk.SetLogLevel(-1)
        recognizer = vosk.KaldiRecognizer(vosk.Model(MODEL_PATH), mic_rate)
        pa = pyaudio.PyAudio()
        stream = pa.open(format=pyaudio.paInt16, channels=1, rate=mic_rate,
                         input=True, input_device_index=mic_idx, frames_per_buffer=chunk)

    client = OpenAI(api_key=OPENAI_KEY)
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]

    if type_mode:
        print("  Type a command. Type 'quit' to exit.")
        tts("Ready for commands.")
    else:
        print(f"  Say \"{WAKE_WORD}\" to activate. Ctrl-C to quit.")
        tts(f"Ready! Say {WAKE_WORD} to talk to me.")

    try:
        while True:
            if type_mode:
                command = input("  You: ").strip()
                if not command: continue
            else:
                listen_for_wake(recognizer, stream, chunk)
                tts("Yes?"); recognizer.Reset()
                command = listen_command(recognizer, stream, chunk, timeout=10)
                if not command: tts("I didn't catch that."); continue
                print(f"  You: {command}")

            if command.lower() in ("goodbye", "bye", "quit", "exit", "q"):
                tts("Goodbye!"); break

            msg_before = len(messages)
            response_text = process_command(client, messages, command)
            if response_text:
                print(f"  Robot: {response_text}\n")
                new_msgs = messages[msg_before:]
                spoke = any(isinstance(m, dict) and m.get("role") == "tool"
                           and "Said:" in m.get("content", "") for m in new_msgs)
                if not spoke: tts(response_text)
    except KeyboardInterrupt:
        print("\n  Shutting down...")
    finally:
        px.stop(); px.set_dir_servo_angle(0)
        try: _set_pan(0); _set_tilt(0)
        except Exception: pass
        if stream: stream.close()
        if pa: pa.terminate()

if __name__ == "__main__":
    main()

Run the program

Voice mode (with USB microphone):

sudo python3 ~/camp/day3/llm_robot.py

Keyboard mode (type commands instead of speaking):

sudo python3 ~/camp/day3/llm_robot.py --type

This program needs sudo because it controls the motors and servos, which require root access to the I2C bus.

How to interact

Voice mode: Say the wake word “Hey buddy” to activate the robot. It will respond with “Yes?” and listen for your command. Speak your request clearly, and the robot will execute it.

Type mode: Just type your command and press Enter. No wake word needed.

Available actions

The LLM has access to these 10 tools. It decides which ones to call based on your natural language command:

Tool	What it does	Example command
`move_forward`	Drive forward (speed + duration)	“Move forward for 3 seconds”
`move_backward`	Drive backward (speed + duration)	“Back up slowly”
`turn_left`	Turn left by N degrees	“Turn left 90 degrees”
`turn_right`	Turn right by N degrees	“Make a right turn”
`stop_car`	Stop all movement	“Stop”
`speak`	Say something through the speaker	“Say hello to everyone”
`nod`	Nod the camera up and down	“Nod yes”
`shake_head`	Shake the camera side to side	“Shake your head no”
`celebrate`	Happy dance (camera + wheels)	“Celebrate!”
`look_around`	Pan the camera to scan surroundings	“Look around”

The model can call multiple tools in one response. Try compound commands like “make a square”: the model will typically call move_forward and turn_right four times each, alternating between them.
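To see what a multi-tool response amounts to, the sketch below replays one plausible plan for “make a square” against stub actions that only record what they were asked to do. The plan is an illustration: the real model picks its own speeds, durations, and ordering, and calls_made and square_plan are hypothetical names.

```python
calls_made = []  # a log of what the stubs were asked to do


def action_move_forward(speed=30, duration=2, **_):
    calls_made.append(("move_forward", speed, duration))
    return f"Drove forward at {speed}% for {duration}s."


def action_turn_right(degrees=90, **_):
    calls_made.append(("turn_right", degrees))
    return f"Turned right {degrees} degrees."


ACTIONS = {"move_forward": action_move_forward, "turn_right": action_turn_right}

# One side and one corner, repeated four times
square_plan = [
    ("move_forward", {"duration": 2}),
    ("turn_right", {"degrees": 90}),
] * 4

# The same lookup-and-call step that process_command performs per tool call
results = [ACTIONS[name](**args) for name, args in square_plan]

for line in results:
    print(line)
```

On the real robot the square will rarely close exactly, because turns are timed rather than measured.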

Try these commands

Start with simple commands, then try more complex ones.

Simple:

“Move forward”
“Turn left”
“Celebrate”
“Look around”

With parameters:

“Drive forward for 5 seconds”
“Back up slowly for 3 seconds”
“Turn right 90 degrees”

Multi-step:

“Make a square”
“Drive forward, then turn around and come back”
“Nod yes, then say ‘I agree!’”

Creative:

“Do a victory dance”
“Explore the room”
“Act excited about something”

Questions (no tools needed):

“What can you do?”
“How do your motors work?”
“Tell me a joke about robots”

How tool calls flow in the code

Here is what happens inside process_command() when you say “move forward for 3 seconds”:
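The flow can be traced by watching the messages list grow. The sketch below builds that list by hand with plain dictionaries, so it runs without an API key or a robot. The shortened system prompt, the id "call_1", the chosen arguments, and the final reply text are all illustrative assumptions. In the real program, the assistant entry in step 2 is the message object returned by the OpenAI client rather than a dictionary.

```python
import json

messages = [{"role": "system", "content": "You are an AI assistant controlling a physical robot car."}]

# 1. The user's command is appended.
messages.append({"role": "user", "content": "move forward for 3 seconds"})

# 2. The first API call returns an assistant message that requests a tool call.
messages.append({
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "move_forward", "arguments": json.dumps({"duration": 3})},
    }],
})

# 3. Your code parses the arguments, runs the function, and appends its
#    return string as a tool message linked to the call by its id.
tool_call = messages[-1]["tool_calls"][0]
args = json.loads(tool_call["function"]["arguments"])
result = f"Drove forward at 30% for {float(args['duration'])}s."
messages.append({"role": "tool", "tool_call_id": tool_call["id"], "content": result})

# 4. The second API call sees everything above and returns the spoken reply.
messages.append({"role": "assistant", "content": "Zooming ahead!"})

for m in messages:
    print(m["role"])
```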

Inspect the code

Read through llm_robot.py and find the answers to these questions:

Find the system prompt

Look for the SYSTEM_PROMPT variable. This is the hidden instruction that tells GPT it is controlling a physical robot. How does it instruct the model to behave?

Find the tool definitions

Look for the TOOLS list. Each tool is a JSON dictionary with a name, description, and parameters. Count how many tools are defined. Do the descriptions match the table above?

Find where tools get executed

Look for the ACTIONS dictionary and the process_command function. When GPT returns a tool call, how does the code know which Python function to run?

Find where the result goes back

In process_command, after executing a tool, the result is appended to messages with role: "tool". Then a second API call is made so GPT can generate its final spoken response. Why does the model need to see the result?

Make it yours

Change the wake word

Open llm_robot.py and find the WAKE_WORD variable near the top. Change "hey buddy" to whatever you want — your name, a code word, or a silly phrase.

sudo nano ~/camp/day3/llm_robot.py

Change the personality

Find the SYSTEM_PROMPT variable and rewrite it. Make the robot a pirate, a drill sergeant, a surfer dude, or a Shakespearean actor. The personality affects how it responds and which tools it chooses to call.
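When rewriting the prompt, it helps to keep the instructions the robot depends on, such as using tools and calling speak(), and change only the personality. One possible way to organize that is sketched below. BASE_RULES and build_system_prompt are hypothetical names, and the wording is a shortened paraphrase of the original prompt, not a tested replacement.

```python
BASE_RULES = """You are an AI assistant controlling a physical robot car (PiCar-X).
When the user asks you to do something physical, use the available tools.
Always respond with a short spoken message too: call speak() so the user hears you.
Keep spoken responses under 2 sentences."""


def build_system_prompt(persona):
    """Combine the fixed robot rules with a swappable personality line."""
    return f"{BASE_RULES}\nPersonality: {persona}"


SYSTEM_PROMPT = build_system_prompt("Talk like a cheerful pirate captain.")
print(SYSTEM_PROMPT)
```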

Add a new tool

This is the advanced challenge. To add a new tool:

Add a new entry to the TOOLS list with a name, description, and parameters
Write a Python function that performs the physical action
Add it to the ACTIONS dictionary

Ideas for new tools:

spin_in_circle — turn 360 degrees
patrol — drive forward, look around, drive back
shy — slowly back away while looking down
read_distance — read the ultrasonic sensor and report what it sees
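The three steps can be sketched for the spin_in_circle idea. So that the example runs without hardware, FakeCar stands in for Picarx and only records the calls it receives; on the robot you would use the real px object instead. The turns parameter, the 12 seconds per circle (extrapolated from the degrees / 30 timing in the turn functions), and the seconds_per_turn argument are assumptions that would need calibrating on your own robot.

```python
import time


class FakeCar:
    """Stand-in for Picarx so the sketch runs without a robot (assumption)."""

    def __init__(self):
        self.log = []

    def set_dir_servo_angle(self, angle):
        self.log.append(("steer", angle))

    def forward(self, speed):
        self.log.append(("forward", speed))

    def stop(self):
        self.log.append(("stop",))


px = FakeCar()  # on the robot: px = Picarx()

# Step 1: describe the tool so the model knows it exists
SPIN_TOOL = {
    "type": "function",
    "function": {
        "name": "spin_in_circle",
        "description": "Drive in one or more full circles.",
        "parameters": {
            "type": "object",
            "properties": {
                "turns": {"type": "integer", "description": "Number of full circles (1-3). Default 1."},
            },
        },
    },
}


# Step 2: write the function that performs the action
def action_spin_in_circle(turns=1, seconds_per_turn=12.0, **_):
    turns = max(1, min(int(turns), 3))  # clamp, like the other actions
    px.set_dir_servo_angle(35)
    px.forward(25)
    time.sleep(turns * seconds_per_turn)
    px.stop()
    px.set_dir_servo_angle(0)
    return f"Spun in {turns} circle(s)."


# Step 3: register it (in llm_robot.py, add to the existing list and dictionary)
TOOLS = [SPIN_TOOL]
ACTIONS = {"spin_in_circle": action_spin_in_circle}

# seconds_per_turn=0 skips the wait for this dry run
print(ACTIONS["spin_in_circle"](turns=2, seconds_per_turn=0))
print(px.log)
```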

When testing with motors, always have the robot on a clear surface with space to move. Keep your hands away from the wheels.

Day 3 wrap-up

What you built today: The full loop is now complete. You started the day understanding what an LLM is, learned how tool calls connect language to action, built a voice chatbot, and then gave the LLM control of a physical robot. Your robot can now hear, think, speak, and act.

But it is still blind. It can hear you and respond, but it cannot see what is in front of it.

Preview for tomorrow: On Day 4, the robot’s camera feed goes directly into a vision language model. It will finally be able to see and understand its surroundings: reading signs, identifying objects, and describing what it sees. The brain gets eyes.
