# Making the assistant speak

`assistantSpeak`, stream mode, TTS events, and the pending `tts-cancel` work.
`session.assistantSpeak(text)` sends text straight to TTS, bypassing the LLM. It’s the key primitive for deterministic utterances — compliance disclosures, scripted IVR prompts, agent handoff announcements, audio pipeline smoke tests.
## Signature

```ts
session.assistantSpeak(text: string): Promise<void>
```

## Usage
```ts
// Sequence via await
await session.assistantSpeak('Hello, how can I help you today?');
startListening();

await session.assistantSpeak('Please hold while I transfer you.');
transferCall();
```

## Behavior
- Returns a `Promise<void>` that resolves on the next `'tts-end'` after sending.
- Rejects with `SessionError` if the session isn’t connected or closes before completion.
- Rejects with `InvalidRequestError` if `text` is empty or whitespace-only.
- Text over 2000 characters is truncated (with a console warning).
- In `'stream'` execution mode, the call goes directly to TTS — text is spoken verbatim.
- In `'production'` / `'test'` modes, behavior depends on the backend pipeline — the LLM may intervene. Prefer `'stream'` for deterministic output.
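If you need to tell these failure modes apart, here is a minimal sketch. It assumes `SessionError` and `InvalidRequestError` are exported from the SDK package; check your version's exports before copying.

```ts
import {
  joinRoom,
  SessionError,        // assumed export, verify in your SDK version
  InvalidRequestError, // assumed export, verify in your SDK version
} from '@juspay/breeze-buddy-client-sdk';

type Session = Awaited<ReturnType<typeof joinRoom>>;

async function speakSafely(session: Session, text: string): Promise<void> {
  if (text.length > 2000) {
    console.warn('utterance exceeds 2000 chars and will be truncated');
  }
  try {
    await session.assistantSpeak(text);
  } catch (err) {
    if (err instanceof InvalidRequestError) {
      console.error('empty or whitespace-only utterance', err); // nothing to retry
    } else if (err instanceof SessionError) {
      console.error('session dropped mid-utterance', err); // reconnect, then retry
    } else {
      throw err;
    }
  }
}
```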
## Why no per-utterance callback?

The Promise resolves on "the next `tts-end` we observe" — not "`tts-end` for exactly this utterance." Pipecat’s TTS events carry no correlation ID, and the pipeline can produce TTS for reasons other than your call (server-initiated idle prompts, VAD-driven barge-in, template-baked audio). A callback that claimed per-utterance lifecycle would lie about a precision the underlying system doesn’t provide.

The honest primitive is: `await` for sequencing, subscribe to global `tts-*` events for live observation, handle interruption via VAD events.
## Barge-in caveat

When the user barges in, playback is cancelled and the `'tts-end'` event still fires — but the user didn't hear the full utterance. For flows that require full playback (legal disclosures, consent), observe `'user-speech-start'` during your utterance and replay if interrupted, as sketched below.
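A sketch of that replay pattern, using only the events documented here (`speakUntilComplete` is a hypothetical helper name and the attempt cap is arbitrary):

```ts
import { joinRoom } from '@juspay/breeze-buddy-client-sdk';

type Session = Awaited<ReturnType<typeof joinRoom>>;

// Replay a disclosure until it plays through without barge-in.
async function speakUntilComplete(
  session: Session,
  text: string,
  maxAttempts = 3,
): Promise<boolean> {
  let interrupted = false;
  session.on('user-speech-start', () => { interrupted = true; });

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    interrupted = false;
    await session.assistantSpeak(text); // resolves on the next 'tts-end'
    if (!interrupted) return true;      // full playback, we're done
  }
  return false; // couldn't deliver uninterrupted; escalate in your flow
}
```

The listener is left subscribed for brevity; in real code, unsubscribe once delivery succeeds.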
## Observing TTS lifecycle

Two equally valid ways to subscribe — pick whichever reads better for your code.
Low-level — three global events:

```ts
session.on('tts-start', () => showSpeakingIndicator());
session.on('tts-chunk', (text) => appendWord(text));
session.on('tts-end', () => hideSpeakingIndicator());
```

Aggregate helper — one handler for all three:
```ts
const off = session.onAssistantSpeaking((event) => {
  switch (event.type) {
    case 'start': showSpeakingIndicator(); break;
    case 'chunk': appendWord(event.text); break;
    case 'end': hideSpeakingIndicator(); break;
  }
});

// later
off();
```

Both do the same thing. `onAssistantSpeaking` just packs the three events into one subscription.
## onAssistantSpeaking vs onAssistantTranscript

They sound similar — they’re not the same:
| | `onAssistantTranscript` | `onAssistantSpeaking` |
|---|---|---|
| Source | LLM tokens | TTS pipeline output |
| Stream mode | Doesn’t fire (no LLM) | Fires — this is your primary text stream |
| Fires when | Model is generating text | Audio is being synthesized / played |
| Reflects post-processing? | No | Yes — what’s actually heard |
Use `onAssistantTranscript` for what the model thought, `onAssistantSpeaking` for what the user is hearing. See Transcripts for the full comparison.
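For debugging post-processing differences, you can wire both up side by side. The `onAssistantTranscript` callback shape below is an assumption; see Transcripts for the real signature:

```ts
// Log what the model generated next to what was actually spoken.
session.onAssistantTranscript((text: string) => {
  console.log('[model]', text); // LLM tokens; silent in stream mode
});

session.onAssistantSpeaking((event) => {
  if (event.type === 'chunk') {
    console.log('[spoken]', event.text); // post-processed text the user hears
  }
});
```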
## Detecting barge-in

```ts
let isSpeaking = false;
session.on('tts-start', () => { isSpeaking = true; });
session.on('tts-end', () => { isSpeaking = false; });

session.on('user-speech-start', () => {
  if (isSpeaking) handleBargeIn();
});
```

## Stream mode example — end-to-end
```ts
import { joinRoom } from '@juspay/breeze-buddy-client-sdk';

const session = await joinRoom({ roomUrl, token });

// Listen to user
session.on('transcript', (entry) => {
  if (entry.role !== 'user' || !entry.isComplete) return;
  const text = entry.text.toLowerCase();
  if (text.includes('one') || text.includes('1')) menuSales();
  else if (text.includes('two') || text.includes('2')) menuSupport();
});

// Speak a menu prompt
await session.assistantSpeak(
  'For sales, say one. For support, say two.'
);

function menuSales() {
  session.assistantSpeak('Connecting you to sales.');
}

function menuSupport() {
  session.assistantSpeak('Connecting you to support.');
}
```

## Custom RTVI messages — session.sendMessage
If the clairvoyance pipeline registers additional `on_client_message` handlers, you can invoke them from the SDK:

```ts
session.sendMessage('my-custom-handler', { some: 'data' });
```

`sendMessage` is fire-and-forget. For request/response semantics, coordinate a custom RTVI event path with the backend team.
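If you do coordinate such a path, one common shape is a correlation ID the backend echoes back. Everything past the `sendMessage` call below is hypothetical: the `'server-message'` event and the `{ requestId, payload }` envelope exist only if the backend team implements them.

```ts
// Hypothetical request/response layered on fire-and-forget sendMessage.
// `session` is untyped because 'server-message' is not a real SDK event (yet).
function request(
  session: any,
  handler: string,
  data: object,
  timeoutMs = 5000,
): Promise<unknown> {
  const requestId = crypto.randomUUID();
  return new Promise((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${handler} timed out`)),
      timeoutMs,
    );
    session.on('server-message', (msg: any) => {
      if (msg?.requestId !== requestId) return; // someone else's reply
      clearTimeout(timer);
      resolve(msg.payload);
    });
    session.sendMessage(handler, { requestId, ...data });
  });
}
```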
## Flushing / cancelling TTS — TODO (cross-team)

Not currently supported. There’s no client-triggerable way to stop the assistant mid-utterance. The backend pipeline only cancels TTS automatically via VAD-driven barge-in (user starts speaking → pipeline cancels bot).

To enable programmatic flush:

- Backend (clairvoyance) — register a new `on_client_message` handler for `tts-cancel` (or `tts-stop`) that pushes a cancellation frame into the pipeline. Lives near `app/ai/voice/agents/breeze_buddy/agent/__init__.py:650-665` where `tts-speak` is registered.
- SDK — add `session.cancelSpeech()` that wraps `sendClientMessage('tts-cancel')`. Source TODO marker lives in `packages/client-sdk/src/lib/session/session.ts`.
Neither half ships in isolation. Coordinate with the clairvoyance team.
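For scale, the SDK half is tiny. The sketch below shows the proposed shape from the TODO; it is not shipped API, and sending `tts-cancel` today is a no-op because no backend handler exists yet.

```ts
// Proposed, NOT shipped: client half of the tts-cancel path.
// Harmless but ineffective until clairvoyance registers the matching
// on_client_message handler for 'tts-cancel'.
function cancelSpeech(session: { sendMessage(name: string, data: object): void }): void {
  session.sendMessage('tts-cancel', {});
}
```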