Transcripts

TranscriptEntry shape, streaming semantics, and how to consume transcripts from the session.

Transcripts stream in real time from both the user (via STT) and the assistant (via LLM / TTS). The SDK delivers them through one event — 'transcript' — and you branch on the entry’s role.

The TranscriptEntry type

A discriminated union on role:

type TranscriptEntry =
  | { id: string; role: 'user';      text: string; isComplete: boolean }
  | { id: string; role: 'assistant'; text: string; isComplete: boolean }
  | { id: string; role: 'tool_call'; functionName: string; isComplete: boolean };
Field          Meaning
id             Stable UUID for this entry — same value across partial updates
role           Discriminator: 'user', 'assistant', or 'tool_call'
text           Accumulated text so far (NOT a delta — full current string)
functionName   Only on 'tool_call' — the tool / function name being invoked
isComplete     false while still streaming, true on the finalized version

Streaming semantics

The same id is emitted multiple times as partial text arrives. Use it as a stable key (React / DOM) so partials update in place instead of creating new bubbles.

Example trace — user says “Hey hi, how are you?”:

{ id: 'abc', role: 'user', text: 'hey',              isComplete: false }
{ id: 'abc', role: 'user', text: 'hey hi',           isComplete: false }
{ id: 'abc', role: 'user', text: 'hey hi how',       isComplete: false }
{ id: 'abc', role: 'user', text: 'hey hi how are',   isComplete: false }
{ id: 'abc', role: 'user', text: 'hey hi how are you', isComplete: true }

The same pattern applies to assistant transcripts (LLM tokens streamed in real time).
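
In a keyed UI this upsert behavior falls out naturally. A minimal React sketch — the upsert helper, component, and class names are illustrative, not part of the SDK — assuming you accumulate entries in state and replace by id:

// Upsert by id: partial updates replace the existing entry in place,
// new ids append to the end.
function upsert(list: TranscriptEntry[], entry: TranscriptEntry): TranscriptEntry[] {
  const i = list.findIndex((e) => e.id === entry.id);
  return i === -1 ? [...list, entry] : list.map((e, j) => (j === i ? entry : e));
}

function TranscriptList({ entries }: { entries: TranscriptEntry[] }) {
  return (
    <ul>
      {entries.map((e) => (
        // Keying by the stable id re-renders the same <li> as partials
        // arrive, instead of appending a new bubble per update.
        <li key={e.id} className={e.isComplete ? 'final' : 'partial'}>
          {e.role === 'tool_call' ? e.functionName : e.text}
        </li>
      ))}
    </ul>
  );
}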

Consuming transcripts — the main way

Subscribe to the 'transcript' event and branch on role. This is the recommended pattern for all transcript handling:

session.on('transcript', (entry) => {
  switch (entry.role) {
    case 'user':
      renderUserBubble(entry.id, entry.text, entry.isComplete);
      break;
    case 'assistant':
      renderAssistantBubble(entry.id, entry.text, entry.isComplete);
      break;
    case 'tool_call':
      renderToolBadge(entry.id, entry.functionName, entry.isComplete);
      break;
  }
});

TypeScript narrows entry inside each case — no casts needed, full type safety on text vs functionName.

Common patterns

Live chat UI

const bubbles = new Map<string, HTMLElement>();

session.on('transcript', (entry) => {
  if (entry.role === 'tool_call') return; // handle tool calls separately

  let bubble = bubbles.get(entry.id);
  if (!bubble) {
    bubble = createBubble(entry.role);
    bubbles.set(entry.id, bubble);
    container.append(bubble);
  }
  bubble.textContent = entry.text;
  bubble.classList.toggle('partial', !entry.isComplete);
});

Commit-on-final (command parsing)

session.on('transcript', (entry) => {
  if (entry.role !== 'user' || !entry.isComplete) return;
  const text = entry.text.toLowerCase();
  if (text.includes('transfer')) transferCall();
  if (text.includes('hang up')) session.close();
});

Tool-call indicator

session.on('transcript', (entry) => {
  if (entry.role !== 'tool_call') return;
  if (entry.isComplete) {
    toast(`Finished: ${entry.functionName}`);
  } else {
    toast(`Calling: ${entry.functionName}`);
  }
});

Accessing full history

getState().transcripts returns a cloned snapshot of all entries so far. Useful for exporting transcripts, debugging, or re-rendering after hot-reload:

const { transcripts } = session.getState();
exportAsJson(transcripts);

The array is a clone — mutating it won’t affect session state.
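
For instance, a small sketch that flattens the snapshot into a plain-text log (the filtering and formatting choices here are illustrative):

const { transcripts } = session.getState();

const log = transcripts
  .filter((e) => e.isComplete) // keep only finalized entries
  .map((e) =>
    e.role === 'tool_call'
      ? `[tool] ${e.functionName}`
      : `${e.role}: ${e.text}`
  )
  .join('\n');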


Transcript vs speaking: pick the right one

Two closely related but distinct streams for “what the assistant said”:

                             onAssistantTranscript                 onAssistantSpeaking
Source                       LLM token stream                      TTS pipeline
Fires when                   The model is generating text          Audio is being synthesized
Stream mode (no LLM)         ❌ never fires                        ✅ fires — the only option
Production / test mode       ✅ fires (earlier in the pipeline)    ✅ fires (after TTS begins)
Handler receives             Incremental AssistantTranscript       Discriminated { start | chunk | end }
Use for                      Render the model’s response as text   Sync UI with actual audio (speaking indicator, karaoke)
Reflects post-processing?    No — raw LLM output                   Yes — what’s actually being heard
(PII redaction, profanity filter)

Rule of thumb:

  • Want what the model said → onAssistantTranscript
  • Want what the user is hearing right now → onAssistantSpeaking
  • In stream mode → onAssistantSpeaking is your primary text stream (transcript doesn’t fire without an LLM)
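
A sketch of the split in code — renderAssistantBubble and the indicator helpers are illustrative, and the assumption that the speaking payload carries its discriminant in a type field is ours (Speaking documents the exact shape):

// What the model said — raw LLM output, arrives as text is generated.
session.onAssistantTranscript((e) => {
  renderAssistantBubble(e.id, e.text, e.isComplete);
});

// What the user is hearing right now — drive a speaking indicator.
// NOTE: the `type` field name is an assumption; see Speaking for the
// exact { start | chunk | end } payload.
session.onAssistantSpeaking((evt) => {
  if (evt.type === 'start') showSpeakingIndicator();
  if (evt.type === 'end') hideSpeakingIndicator();
});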

The helper for onAssistantSpeaking lives in Speaking. The transcript helpers are below.

Single-role helpers

Handy when you only care about one role

The methods below wrap session.on('transcript', ...) with a role filter and a type-narrowed handler signature. Use whichever feels cleaner for your code — the main pattern above is a solid default, and these are a neat shortcut when you only need one role.

type Unsubscribe = () => void;

session.onUserTranscript(
  (entry: UserTranscript) => void
): Unsubscribe;

session.onAssistantTranscript(
  (entry: AssistantTranscript) => void
): Unsubscribe;

session.onToolCall(
  (entry: ToolCallTranscript) => void
): Unsubscribe;
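
Conceptually, each helper is the main 'transcript' pattern with a built-in role filter. A rough sketch of the user variant — not the actual implementation, and it assumes session.on also returns an unsubscribe function, which is not documented here:

// Roughly what session.onUserTranscript does under the hood.
function onUserTranscriptSketch(fn: (e: UserTranscript) => void): Unsubscribe {
  return session.on('transcript', (entry) => {
    if (entry.role === 'user') fn(entry); // narrowed to the 'user' variant
  });
}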

Example:

const off = session.onUserTranscript((e) => {
  if (e.isComplete) handle(e.text);
});

// later
off();