Making the assistant speak

assistantSpeak, stream mode, TTS events, and the pending tts-cancel work.

session.assistantSpeak(text) sends text straight to TTS, bypassing the LLM. It’s the key primitive for deterministic utterances — compliance disclosures, scripted IVR prompts, agent handoff announcements, audio pipeline smoke tests.

Signature

session.assistantSpeak(text: string): Promise<void>

Usage

// Sequence via await
await session.assistantSpeak('Hello, how can I help you today?');
startListening();

await session.assistantSpeak('Please hold while I transfer you.');
transferCall();

Behavior

  • Returns a Promise<void> that resolves on the next 'tts-end' after sending.
  • Rejects with SessionError if the session isn’t connected or closes before completion.
  • Rejects with InvalidRequestError if text is empty or whitespace-only. (Both rejection cases are handled in the sketch after this list.)
  • Text over 2000 chars is truncated (with a console warning).
  • In 'stream' execution mode, the call goes directly to TTS — text is spoken verbatim.
  • In 'production' / 'test' modes, behavior depends on the backend pipeline — the LLM may intervene. Prefer 'stream' for deterministic output.
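
The two rejection cases can be caught per call. A minimal sketch, assuming the SDK exports the SessionError and InvalidRequestError classes (the import path here is an assumption):

import { SessionError, InvalidRequestError } from '@juspay/breeze-buddy-client-sdk';

try {
  await session.assistantSpeak('Please hold while I transfer you.');
} catch (err) {
  if (err instanceof InvalidRequestError) {
    // Empty or whitespace-only text; nothing was sent to TTS.
  } else if (err instanceof SessionError) {
    // Session not connected, or it closed before 'tts-end'.
  } else {
    throw err; // unrelated failure, re-throw
  }
}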

Why no per-utterance callback?

The Promise resolves on “the next tts-end we observe” — not “tts-end for exactly this utterance.” Pipecat’s TTS events carry no correlation ID, and the pipeline can produce TTS for reasons other than your call (server-initiated idle prompts, VAD-driven barge-in, template-baked audio). A callback that claimed a per-utterance lifecycle would promise a precision the underlying system doesn’t provide.

The honest primitive is: await for sequencing, subscribe to global tts-* events for live observation, handle interruption via VAD events.

Barge-in caveat

If the user starts speaking mid-utterance, Pipecat's VAD automatically cancels the TTS. The 'tts-end' event still fires — but the user didn't hear the full utterance. For flows that require full playback (legal disclosures, consent), observe 'user-speech-start' during your utterance and re-play if interrupted.
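
A minimal sketch of that re-play pattern, with a replay cap so a talkative user cannot loop it forever. The unsubscribe style is an assumption: onAssistantSpeaking returns a disposer, but session.off may differ in your SDK version.

// Re-play an utterance until it finishes without barge-in (at most maxReplays times).
async function speakToCompletion(text: string, maxReplays = 2): Promise<boolean> {
  let interrupted = false;
  const onBargeIn = () => { interrupted = true; };
  session.on('user-speech-start', onBargeIn);
  try {
    for (let attempt = 0; attempt <= maxReplays; attempt++) {
      interrupted = false;
      // Resolves on the next 'tts-end', which fires even after a VAD cancel.
      await session.assistantSpeak(text);
      if (!interrupted) return true; // played through, as far as we can observe
    }
    return false; // user kept talking over it
  } finally {
    session.off?.('user-speech-start', onBargeIn); // assumed off(); see note above
  }
}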

Observing TTS lifecycle

Two equally valid ways to subscribe — pick whichever reads better for your code.

Low-level — three global events:

session.on('tts-start', () => showSpeakingIndicator());
session.on('tts-chunk', (text) => appendWord(text));
session.on('tts-end',   () => hideSpeakingIndicator());

Aggregate helper — one handler for all three:

const off = session.onAssistantSpeaking((event) => {
  switch (event.type) {
    case 'start': showSpeakingIndicator(); break;
    case 'chunk': appendWord(event.text); break;
    case 'end':   hideSpeakingIndicator(); break;
  }
});

// later
off();

Both do the same thing. onAssistantSpeaking just packs the three events into one subscription.

onAssistantSpeaking vs onAssistantTranscript

They sound similar — they’re not the same:

                             onAssistantTranscript       onAssistantSpeaking
Source                       LLM tokens                  TTS pipeline output
Stream mode                  Doesn’t fire (no LLM)       Fires — this is your primary text stream
Fires when                   Model is generating text    Audio is being synthesized / played
Reflects post-processing?    No                          Yes — what’s actually heard

Use onAssistantTranscript for what the model thought, onAssistantSpeaking for what the user is hearing. See Transcripts for the full comparison.
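
To see the difference live, subscribe to both. The onAssistantTranscript callback shape below is an assumption; see Transcripts for its real signature:

// What the model emits vs what the user actually hears.
session.onAssistantTranscript((text) => console.log('[LLM]', text)); // shape assumed
session.onAssistantSpeaking((event) => {
  if (event.type === 'chunk') console.log('[TTS]', event.text);
});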

Detecting barge-in

// Track whether the bot is mid-utterance so user speech can be classified as barge-in.
let isSpeaking = false;
session.on('tts-start', () => { isSpeaking = true; });
session.on('tts-end',   () => { isSpeaking = false; });
session.on('user-speech-start', () => {
  if (isSpeaking) handleBargeIn();
});

Stream mode example — end-to-end

import { joinRoom } from '@juspay/breeze-buddy-client-sdk';

const session = await joinRoom({ roomUrl, token });

// Listen to user
session.on('transcript', (entry) => {
  if (entry.role !== 'user' || !entry.isComplete) return;
  const text = entry.text.toLowerCase();
  if (text.includes('one') || text.includes('1')) menuSales();
  else if (text.includes('two') || text.includes('2')) menuSupport();
});

// Speak a menu prompt
await session.assistantSpeak(
  'For sales, say one. For support, say two.'
);

function menuSales() {
  session.assistantSpeak('Connecting you to sales.');
}

function menuSupport() {
  session.assistantSpeak('Connecting you to support.');
}

Custom RTVI messages — session.sendMessage

If the clairvoyance pipeline registers additional on_client_message handlers, you can invoke them from the SDK:

session.sendMessage('my-custom-handler', { some: 'data' });

sendMessage is fire-and-forget. For request/response semantics, coordinate a custom RTVI event path with the backend team.
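
One possible shape for that coordination, purely a sketch: the backend handler echoes a requestId back over a hypothetical 'server-message' event, and the client correlates on it. Both the event name and the payload contract are assumptions to agree on with the backend team.

// Hypothetical request/response wrapper over the fire-and-forget sendMessage.
function rtviRequest(handler: string, payload: Record<string, unknown>, timeoutMs = 5000): Promise<unknown> {
  const requestId = crypto.randomUUID();
  return new Promise((resolve, reject) => {
    const onReply = (msg: any) => {
      if (msg?.requestId !== requestId) return; // someone else's reply
      clearTimeout(timer);
      session.off?.('server-message', onReply); // assumed off(); mirror your SDK's API
      resolve(msg);
    };
    const timer = setTimeout(() => {
      session.off?.('server-message', onReply);
      reject(new Error(`RTVI request '${handler}' timed out`));
    }, timeoutMs);
    session.on('server-message', onReply); // hypothetical event name
    session.sendMessage(handler, { ...payload, requestId });
  });
}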

Flushing / cancelling TTS — TODO (cross-team)

Not currently supported. There’s no client-triggerable way to stop the assistant mid-utterance. The backend pipeline only cancels TTS automatically via VAD-driven barge-in (user starts speaking → pipeline cancels bot).

To enable programmatic flush:

  1. Backend (clairvoyance) — register a new on_client_message handler for tts-cancel (or tts-stop) that pushes a cancellation frame into the pipeline. Lives near app/ai/voice/agents/breeze_buddy/agent/__init__.py:650-665 where tts-speak is registered.
  2. SDK — add session.cancelSpeech() that wraps sendClientMessage('tts-cancel'). Source TODO marker lives in packages/client-sdk/src/lib/session/session.ts.

Neither half ships in isolation. Coordinate with the clairvoyance team.
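
For concreteness, what the finished feature might reduce to, using only the names from the TODO items above; nothing here ships today:

// Item 2, sketched: a thin SDK wrapper around the internal RTVI send.
//   cancelSpeech(): void { this.sendClientMessage('tts-cancel'); }
// Until then, the equivalent ad-hoc call (a silent no-op today, since
// clairvoyance registers no 'tts-cancel' handler yet):
session.sendMessage('tts-cancel', {});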
