Subagent Prompting — Research & Improvement Plan

Background research for PMF-3101 · The interface agent (V.I.) dispatches tasks to a subagent. Today users often have to literally say "use the subagent." We want it to just know, and we want the subagent to make fewer mistakes once dispatched.

1 · Where prompts live in the subagent system

Five distinct prompt surfaces. Each one shapes either when the main agent dispatches, or how well the subagent executes.

Main Agent (V.I.) · voice_interface/
  ① Dispatch Preamble · dispatch_runner.py:102
      ↓ create_task

Subagent Service
  ② Subagent System Prompt · env: SUBAGENT_SYSTEM_PROMPT · fallback: "You are a helpful assistant."
  ③ Systems (7) · each has a description()
      gen_response, tasks, apps, agent_peers, agent_core, diagnostics, debug
      subagent_service/systems/*.py: name() + summary() shown to LLM
      Today most summaries are 1 sentence. Inconsistent depth.
  ④ Apps / Skills (14) · owned by individual app developers
      notion, research, spotify, apple_music, g_calendar, g_docs, cooking, briefing, entertainment, reminders, shared_tools, agent_files, workspace, utils
      claude_agents/<app>/prompts/skill.txt
      Out of scope for this ticket. Per-app prompt quality is each developer's responsibility.

2 · The subagent system prompt today

What's actually in production (dev cluster) right now.

What the fallback gives us

A generic, role-less assistant with no context that it is:

  • a subagent dispatched by a parent agent (not the user-facing one)
  • operating inside a voice product where latency & brevity matter
  • expected to narrate to the parent, not address the end user
  • limited to a curated set of systems (apps, tasks, peers…)
  • handling installed apps = integrations / 3rd-party services (not iPhone apps)

Concrete failure modes this causes

  • Subagent addresses the user directly instead of narrating to V.I.
  • Subagent refuses when asked to "install Notion" because "I can't install apps on your phone" — it doesn't know app means subagent-app
  • Subagent over-explains instead of returning a short result the parent can speak
  • No anti-patterns → the subagent attempts web search, extended-thinking search, and reminders itself instead of declining (these are handled by direct voice tools in V.I.)
  • No structural guidance for tool selection, batching, clarification

Note: A separate file at claude_agents/prompts/subagent/subagent.txt contains the old Claude-Agent-SDK subagent's preamble (narration protocol, batching, GetTime-first, etc.). That file is loaded by claude_agents/subagent/subagent.py:71 — not by the new subagent_service. Some or all of its content should be ported into the new SUBAGENT_SYSTEM_PROMPT.
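A minimal sketch of how the fallback could be resolved, and what a role-aware default might say in place of the generic string. Only the env var name SUBAGENT_SYSTEM_PROMPT and the current fallback text come from the notes above; the helper name load_subagent_system_prompt and the draft wording are assumptions for illustration, not the shipped code or prompt.

```python
import os

# Current production fallback, as described above.
GENERIC_FALLBACK = "You are a helpful assistant."

# Hypothetical draft default. The wording is illustrative only; it restates
# the five pieces of missing context listed in this section.
DRAFT_DEFAULT = "\n".join([
    "You are a subagent dispatched by a parent voice agent (V.I.), not the user-facing agent.",
    "Narrate results to the parent agent; do not address the end user directly.",
    "Latency and brevity matter: return a short result the parent can speak.",
    "You may only use the systems listed to you (apps, tasks, peers, ...).",
    "'Apps' are integrations with third-party services, not iPhone apps.",
])


def load_subagent_system_prompt(env=None, default=DRAFT_DEFAULT):
    """Return the env-provided prompt, else `default` (hypothetical helper).

    A blank or whitespace-only env value is treated as unset.
    """
    env = os.environ if env is None else env
    value = env.get("SUBAGENT_SYSTEM_PROMPT", "").strip()
    return value or default
```

Passing `env` explicitly keeps the helper easy to test without touching the real process environment.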

3 · "Never use the subagent for these" — the must-add guidance

These flows already have direct voice tools in the main agent. Dispatching them to the subagent adds latency, breaks UI cards, and routes the user through the wrong code path.

  • never · search: handled by web_gemini / web_xai / web_exa directly in V.I. (voice_interface/tools/impls/search/)
  • never · extended / deep search: handled by extended_thinking_search directly in V.I. (voice_interface/tools/impls/search/extended_thinking_search.py)
  • never · reminders: handled by the dedicated reminders direct tool, not the subagent app.

Where this guidance has to land

① Dispatch preamble (main agent side)

dispatch_runner.py:102–113

Stops V.I. from calling create_task for these requests in the first place. Highest leverage.

② Subagent system prompt (subagent side)

SUBAGENT_SYSTEM_PROMPT env

Belt-and-suspenders. If V.I. still dispatches, the subagent itself politely declines and tells V.I. to use the direct tool. Prevents the subagent from "trying its best" with the wrong tools.
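Because the same guidance has to land on both sides, one option is to keep the NEVER-list in a single structure and render it twice, so the two prompts cannot drift apart. The sketch below assumes that approach; the flows and tool names come from the list above, while NeverRule and both render functions are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NeverRule:
    flow: str         # what the user asked for
    direct_tool: str  # where V.I. handles it directly


# Flows and tool names are taken from the list above.
NEVER_RULES = (
    NeverRule("web search", "web_gemini / web_xai / web_exa"),
    NeverRule("extended / deep search", "extended_thinking_search"),
    NeverRule("reminders", "the dedicated reminders direct tool"),
)


def render_for_dispatch_preamble(rules=NEVER_RULES):
    """Main-agent side: tell V.I. not to call create_task for these flows."""
    lines = ["NEVER dispatch for:"]
    lines += [f"  - {r.flow}: use {r.direct_tool} directly." for r in rules]
    return "\n".join(lines)


def render_for_subagent_prompt(rules=NEVER_RULES):
    """Subagent side: decline and point the parent back to the direct tool."""
    lines = ["If the parent dispatches any of these, decline and name the direct tool:"]
    lines += [f"  - {r.flow}: handled in V.I. by {r.direct_tool}." for r in rules]
    return "\n".join(lines)
```

Adding a fourth flow later would then be a one-line change to NEVER_RULES rather than two separate prompt edits.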

4 · The "install Notion" confusion

Today: user says "install Notion" → main agent responds "I can't install apps on your iPhone." The agent has no idea that app = integration in the subagent's app registry.

Where the confusion comes from

  • The word "app" in voice-product context overwhelmingly means mobile app in the LLM's training data.
  • The dispatch preamble never disambiguates the term.
  • install_app's description (subagent side) does explain itself well, but V.I. never gets to see it — V.I. only sees create_task's description + the assembled capabilities text.

Where to fix it (two places, surgical)

  • Main agent prompt OR dispatch preamble: add a glossary line — "When the user says 'install', 'add', 'connect' Notion / Spotify / Google Calendar / etc., they mean a Sesame integration. Dispatch to create_task — do NOT say you can't install things on their phone."
  • Capabilities text assembled in get_tool_description() at dispatch_runner.py:677: prefix with one line — "The agent can install / connect the following third-party integrations on the user's behalf:".
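The second fix can be sketched as a small wrapper around the assembled capabilities text. How get_tool_description() builds that text is not shown in these notes, so the helper below takes the text as a plain string; frame_capabilities is a hypothetical name, and the framing line is the one proposed above.

```python
FRAMING_LINE = (
    "The agent can install / connect the following third-party "
    "integrations on the user's behalf:"
)


def frame_capabilities(capabilities_text: str) -> str:
    """Prefix the capabilities text with the framing line (hypothetical helper).

    Idempotent: text that already starts with the framing line is returned
    unchanged. Empty input returns an empty string, so the prompt never
    advertises integrations that are not listed.
    """
    body = capabilities_text.strip()
    if not body:
        return ""
    if body.startswith(FRAMING_LINE):
        return body
    return f"{FRAMING_LINE}\n{body}"
```

For example, `frame_capabilities("- notion\n- spotify")` returns the framing line followed by the two list entries.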

5 · System descriptions — owner notes & questions to send

Each system has a 1-sentence summary() shown to the subagent LLM. The system's author is the right person to enrich it. Below: David's take + the questions to send each owner.

David's take per system (TL;DR before the messages below)

apps ~Jake / Ankit / Andres

systems/apps.py:291 — "Install, uninstall, and list apps…"

Reinforce that apps are integrations / 3rd-party tools. Not all are (utils, workspace, shared_tools), but many are. Fixes the "install Notion → can't install on your phone" failure.

tasks ~Jake

systems/tasks.py:161 — "Spawn forked tasks…"

"Fork vs. inline" probably doesn't matter much. The unique value of tasks is that they are reusable and can be started and stopped. The current one-sentence summary roughly covers this but is ambiguous.

agent_peers ~Ankit / Andres / David

systems/agent_peers.py:143 — "…request UI surfaces on them."

"Request UI surfaces on them" is overly technical — the LLM has no purchase on it. Needs plain English plus a concrete example.

agent_core ~Jake / Ankit

systems/agent_core.py:146 — "Execute bash, read/write files on a persistent VM."

Not clear when the subagent would request this. Unclear what data can flow between the VM and the subagent. Also: commented out in subagent_instance.py:298–301 — may not even be live.

diagnostics ~Ankit / Jake

systems/diagnostics.py:31 — "system health…"

low priority · An example would help but is not critical. Defer.

gen_response / debug

systems/gen_response.py:254 + systems/debug.py:270

out of scope · Internal plumbing / monitoring only. No questions needed.

Copy-paste blocks — one per recipient

Each block has every question that recipient (and only that recipient) needs to answer. No duplication across blocks.

Qs for Jake

scope: tasks system
Hey Jake — for the subagent prompting cleanup (PMF-3101), can you help me
sharpen the description of the `tasks` system so the subagent LLM knows
when to reach for it? Right now the one-line summary is:

  "Spawn forked tasks to handle parallel or long-running work."

A few questions:

1. What is the defining property of a task — long-running? Resumable?
   Cancellable? Re-attachable across sessions?
2. When should the subagent reach for spawn_task vs. just doing the work
   inline in its current turn?
3. Are tasks shared across users / agents, or strictly per-agent?
4. What happens to a running task if the subagent finishes its turn —
   does the parent agent see the final result?
5. What's a concrete example of a task the LLM should spawn?
   (e.g. "watch this Apollo article for updates over the next week" —
   yes? no?)

Goal is a 3–5 sentence description with one concrete example.

Qs for Jake & Ankit

scope: agent_core system
Hey Jake / Ankit — for the subagent prompting cleanup (PMF-3101), I need
help defining the `agent_core` system clearly. Right now the summary is:

  "Execute bash commands, read/write/edit files on a persistent VM."

I'm not sure when the subagent would actually reach for this, and the
data contract between the VM and the subagent isn't obvious. Questions:

1. What is the persistent VM actually for? Per-user? Per-agent?
   Per-task? How long does it live?
2. What's the data contract between subagent ↔ VM? What can the
   subagent send in (text? files? structured payloads?), and what
   comes back (stdout? file contents? exit codes?)
3. What's a realistic user request that should route through the VM?
   (e.g. "convert this PDF the user shared to text"? "run this Python
   script"?)
4. Is this currently enabled in production? subagent_instance.py:298–301
   shows it commented out — is it even live?
5. If it's not live, should we remove the system summary entirely so
   the LLM doesn't think it's an option?

Goal is a clear 3–5 sentence description (or a decision to remove it
from the surface entirely until it ships).

Qs for Jake, Ankit & Andres

scope: apps system
Hey all — for the subagent prompting cleanup (PMF-3101), I want to
sharpen the `apps` system description. Today it says:

  "Install, uninstall, and list apps — each app provides its own tools
   once installed."

My take is we should reinforce that apps are integrations / 3rd-party
tools — most of them are (Notion, Spotify, Apple Music, Google Calendar,
Google Docs). This would fix the "install Notion → can't install on your
phone" failure we keep seeing. Questions:

1. Is "integration / 3rd-party tool" the right framing for the majority
   of apps? Which ones explicitly are NOT (utils, workspace,
   shared_tools)?
2. Should we split the description into "integration apps" vs.
   "built-in capability apps"?
3. When should the subagent proactively install vs. ask the user first?
4. What's the cost (latency / OAuth handshake) of an install_app call
   so we can teach the LLM when an install is "free"?

Goal is a 3–5 sentence description with the right mental model so the
LLM stops treating "install" as a phone-app action.

Qs for Ankit & Andres

scope: agent_peers system (David is co-owner, no need to ask himself)
Hey Ankit / Andres — we co-own `agent_peers` so I'd like a quick sync
before rewriting its description. Today the summary is:

  "Get information about connected Agent instances and request UI
   surfaces on them."

"Request UI surfaces on them" is overly technical — the LLM has no
purchase on what that means. Questions:

1. What is an "agent peer" in user terms? (The V.I. instance that
   dispatched me? Sibling subagents? Another user's agent?)
2. What does "request a UI surface" actually mean in practice — open
   a card on the user's phone? Push a status pill? Send a text message
   back to V.I.?
3. What's a concrete example: "when the user says X, the subagent
   calls Y on the peer to produce Z"?
4. When should the subagent prefer talking to a peer vs. answering
   directly?

Goal is a plain-English description + one concrete example. Happy to
draft once you confirm the model.

6 · The dispatch preamble — the single biggest lever

Today's preamble describes what the subagent does but gives no trigger heuristics, no anti-patterns, and no app-vocabulary disambiguation.

Before — current (dispatch_runner.py:102)
You have a powerful agent that you can delegate
complex tasks to. Call this tool with a detailed
prompt describing what you want done — the agent
will reason, use tools, and return the result.
Be specific in your prompt: include all relevant
context, names, dates, and constraints. The agent
works independently and may take a few moments
for complex tasks. Some of the agent's tools may
also be available directly in your own tool list;
prefer calling one of those when it fits the
request. Use create_task when the request needs
multi-step reasoning, when no direct tool fits,
or when the work may span multiple operations.

  • issue · No examples of triggering phrases
  • issue · "Multi-step reasoning" is abstract
  • issue · No anti-patterns (search, deep search, reminders)
  • issue · No "app = integration" glossary

After — proposed shape
You can delegate to a background research/action
agent via create_task.

DISPATCH AUTONOMOUSLY when:
  ✓ User wants to act on a 3rd-party integration
    — Notion, Spotify, Apple Music, Google
    Calendar, Google Docs, etc.
  ✓ User says "install/add/connect" any of those
    → those are SESAME INTEGRATIONS, NOT iPhone
    apps. NEVER tell the user you can't install
    apps on their phone — dispatch instead.
  ✓ User asks about something they previously
    shared, saved, or briefed with Sesame.
  ✓ Work spans >1 tool call or external lookup.

NEVER dispatch for:
  ✗ Web search of any kind — use your direct
    search tools (lower latency, card UI).
  ✗ Extended / deep search — same reason.
  ✗ Reminders — handled by a direct voice tool.
  ✗ Pure conversation / chit-chat.
  ✗ Asking a clarifying question — do that
    yourself before dispatching.

Be specific in your prompt: include names, dates,
the user's exact phrasing, and what "done" means.

The agent can install / connect the following
third-party integrations and act on them:

  • win · Explicit trigger heuristics
  • win · NEVER-list resolves search/reminders
  • win · "App = integration" disambiguation
  • win · Capabilities prefixed with framing line

7 · Improvement roadmap, scoped to what we own

App-prompt cleanup belongs to individual app developers — explicitly out of scope here. We own the dispatch preamble + the subagent system prompt + the system descriptions we authored.

1. Rewrite the dispatch preamble

Add (a) trigger heuristics, (b) NEVER-list for search / deep-search / reminders, (c) "app = integration" disambiguation, (d) capabilities-section framing line.

file: dispatch_runner.py:102–113 + :677

2. Write a real SUBAGENT_SYSTEM_PROMPT

Replace "You are a helpful assistant." Port relevant bits from the old claude_agents/prompts/subagent/subagent.txt (narration to parent, batch ops, brevity). Add the same NEVER-list as belt-and-suspenders.

file: subagent_instance.py:288 (env source TBD)

3. Send the system-author question list

One thread per system owner (apps, tasks, agent_peers, agent_core). Use the questions in section 5. Skip gen_response, debug, and (for now) diagnostics.

async — gates rewrites of summaries

4. Rewrite the system summaries with answers in hand

Add "when to use / when not to use / one concrete example" for each. agent_peers is the most urgent (plain-English rewrite).

files: subagent_service/systems/*.py
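One way to keep the rewritten summaries consistent is a small template that forces "when to use / when not to use / one concrete example" for every system. The sketch below assumes such a template; SystemSummary and its field names are hypothetical and do not reflect how summary() is implemented today. The values are placeholders, since the real content depends on the owners' answers.

```python
from dataclasses import dataclass, fields

PENDING = "<pending owner answers>"


@dataclass(frozen=True)
class SystemSummary:
    """Hypothetical template for one system's LLM-facing description."""
    name: str
    what: str        # plain-English description of the system
    use_when: str    # when the subagent should reach for it
    avoid_when: str  # when it should not
    example: str     # one concrete example

    def render(self) -> str:
        return "\n".join([
            f"{self.name}: {self.what}",
            f"Use when: {self.use_when}",
            f"Do not use when: {self.avoid_when}",
            f"Example: {self.example}",
        ])

    def missing_fields(self) -> list:
        """Names of fields that are still empty or marked as pending."""
        return [f.name for f in fields(self)
                if getattr(self, f.name).strip() in ("", PENDING)]


# Placeholder draft only; no real description is implied here.
draft = SystemSummary(
    name="agent_peers",
    what=PENDING,
    use_when=PENDING,
    avoid_when=PENDING,
    example=PENDING,
)
```

A check such as `draft.missing_fields()` could gate the rewrite until every owner has replied.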

8 · How we'd measure improvement

Dispatch recall

% of dispatch-worthy user turns where V.I. called create_task without the user explicitly asking for the subagent.

eval: replay call corpus, label dispatch-worthy turns.

Mis-routes avoided

Count of "install on iPhone" refusals + count of subagent-handled search / deep-search / reminders. Both should drop to ~0.

datadog: log scan on V.I. + subagent turns.

Subagent task success

% of dispatched tasks that produce a correct, user-acceptable result on the first try.

eval: unit-test-ml on a curated set of mis-handled calls.
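The first and third metrics above could be computed from a labeled replay corpus roughly as follows. This is a sketch under assumptions: the LabeledTurn record and its field names are hypothetical, and the labels themselves would come from the human-labeled corpus described above, not from this code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LabeledTurn:
    should_dispatch: bool                  # human label: turn was dispatch-worthy
    dispatched: bool                       # V.I. called create_task
    user_forced: bool = False              # user had to say "use the subagent"
    task_succeeded: Optional[bool] = None  # graded only for dispatched turns


def unprompted_dispatch_rate(turns):
    """Share of dispatch-worthy turns that were dispatched without the user asking."""
    worthy = [t for t in turns if t.should_dispatch]
    if not worthy:
        return None  # nothing to measure
    hits = sum(1 for t in worthy if t.dispatched and not t.user_forced)
    return hits / len(worthy)


def first_try_success_rate(turns):
    """Share of graded, dispatched tasks that produced an acceptable result."""
    graded = [t for t in turns if t.dispatched and t.task_succeeded is not None]
    if not graded:
        return None
    return sum(1 for t in graded if t.task_succeeded) / len(graded)
```

Returning None for an empty denominator keeps "no data" distinct from a genuine 0% score.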