Voice pipeline

End-to-end audio processing pipeline — from microphone input to speaker output. Built on Pipecat.

Pipeline overview

Every voice call flows through a sequential pipeline of processors. Audio arrives from the transport layer (Daily WebRTC or Telephony WebSocket), gets transcribed, processed by the LLM, synthesized to speech, and sent back to the caller. See Architecture for where the pipeline fits in the broader system.

transport.input() → STT → TranscriptionGate → UserIdle → UserAggregator → LLM → TTS → transport.output() → AssistantAggregator
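
To make the ordering concrete, here is a minimal sketch of the same stage list as a Pipecat `Pipeline`. It is a wiring sketch, not Breeze Buddy's actual code: each name (`transport`, `stt`, `transcription_gate`, `user_idle`, `context_aggregator`, `llm`, `tts`) is a processor constructed as sketched in the sections that follow.

```python
from pipecat.pipeline.pipeline import Pipeline

# Stage order mirrors the diagram above. Each processor here is
# constructed as sketched in the sections below.
pipeline = Pipeline([
    transport.input(),               # audio in from Daily or telephony
    stt,                             # speech-to-text service
    transcription_gate,              # keyword filtering + hard mute
    user_idle,                       # silence monitoring
    context_aggregator.user(),       # UserAggregator: merges partials into turns
    llm,                             # Azure OpenAI, streaming tokens
    tts,                             # text-to-speech service
    transport.output(),              # audio out to the caller
    context_aggregator.assistant(),  # AssistantAggregator: records the full turn
])
```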

STT (Speech-to-Text) providers

The STT stage converts raw audio into text transcriptions. Multiple providers are supported with different strengths.

| Provider | Key | Notes |
| --- | --- | --- |
| Soniox | soniox | Default. Low-latency streaming with strong multilingual support. |
| Deepgram | deepgram | High accuracy, extensive language coverage. |
| Sarvam | sarvam | Optimized for Indian languages. |
| OpenAI | openai | Whisper-based, high accuracy for English. |
| Google | google | Google Cloud Speech-to-Text. |

Provider Configuration

Configure the STT provider in your template’s STT config. See STT Configuration for full options.
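
As an illustration, here is a hedged sketch of mapping a provider key from the STT config onto a Pipecat service. Only the Deepgram class is shown concretely; the exact module paths and the classes available for the other providers depend on your Pipecat version, and `build_stt` is a hypothetical helper, not part of Breeze Buddy or Pipecat.

```python
import os

from pipecat.services.deepgram import DeepgramSTTService  # module path varies by Pipecat version

def build_stt(provider: str):
    """Map a template STT provider key to a Pipecat STT service (sketch)."""
    if provider == "deepgram":
        return DeepgramSTTService(api_key=os.environ["DEEPGRAM_API_KEY"])
    # "soniox", "sarvam", "openai", and "google" would map to their
    # respective Pipecat service classes in the same way.
    raise ValueError(f"unsupported STT provider: {provider}")

stt = build_stt("deepgram")
```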

LLM (Large Language Model)

The LLM processes the transcribed user input, together with the conversation history and system prompts, to generate the assistant's response. Breeze Buddy uses Azure OpenAI as the LLM provider. Prompts are defined in each flow node's task_messages.

| Aspect | Detail |
| --- | --- |
| Provider | Azure OpenAI |
| Streaming | Yes. Token-by-token streaming feeds TTS with low latency. |
| Function Calling | Supported. Tools/functions are defined in template flow nodes. |
| Context Window | Managed automatically with conversation-history truncation. |

LLM Configuration

Adjust model, temperature, max tokens, and other parameters in LLM Configuration.
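
A hedged sketch of the Azure OpenAI stage with a function-call handler registered. The deployment name, tool name, and handler body are placeholders; Pipecat's function-calling handler signature differs across versions, and the real tool definitions live in your template's flow nodes.

```python
import os

from pipecat.services.azure import AzureLLMService  # module path varies by Pipecat version

llm = AzureLLMService(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    model="gpt-4o",  # your Azure deployment name (placeholder)
)

# Function calling: handlers are registered against the tool names declared
# in the flow nodes. This handler is illustrative; its signature follows
# older Pipecat releases and may differ in yours.
async def transfer_call(function_name, tool_call_id, args, llm, context, result_callback):
    await result_callback({"status": "transferred"})

llm.register_function("transfer_call", transfer_call)
```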

TTS (Text-to-Speech) providers

The TTS stage converts the LLM’s text response into audio that is streamed back to the caller.

| Provider | Key | Notes |
| --- | --- | --- |
| ElevenLabs | elevenlabs | Default. Natural-sounding voices with low latency. |
| Cartesia | cartesia | Ultra-low-latency streaming TTS. |
| Sarvam | sarvam | Indian language voices. |
| Google | google | Google Cloud Text-to-Speech. |
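
For example, a sketch of the default ElevenLabs stage; swapping in Cartesia, Sarvam, or Google means instantiating the corresponding service class instead. The voice id is a placeholder, and import paths depend on your Pipecat version.

```python
import os

from pipecat.services.elevenlabs import ElevenLabsTTSService  # module path varies by version

tts = ElevenLabsTTSService(
    api_key=os.environ["ELEVENLABS_API_KEY"],
    voice_id="your-voice-id",  # placeholder: the voice configured for your agent
)
```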

TranscriptionGate

The TranscriptionGate sits between STT and the user aggregator. It provides two key capabilities:

  • Keyword filtering — filters out known noise phrases or false positives from the STT output.
  • Hard mute control — when the gate is closed, all user transcriptions are blocked from reaching the LLM. This is used during assistant speech to prevent the agent from responding to its own audio.
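
A minimal sketch of what such a gate can look like as a custom Pipecat FrameProcessor; the actual Breeze Buddy implementation may differ in naming and detail, and the blocked phrase shown is a placeholder.

```python
from pipecat.frames.frames import Frame, InterimTranscriptionFrame, TranscriptionFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor

class TranscriptionGate(FrameProcessor):
    """Drops noise phrases and blocks transcriptions while muted (sketch)."""

    def __init__(self, blocked_phrases: set[str]):
        super().__init__()
        self._blocked_phrases = blocked_phrases
        self._muted = False  # gate is closed while the assistant speaks

    def set_muted(self, muted: bool) -> None:
        self._muted = muted

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, (TranscriptionFrame, InterimTranscriptionFrame)):
            if self._muted:
                return  # hard mute: nothing reaches the LLM
            if frame.text.strip().lower() in self._blocked_phrases:
                return  # keyword filter: drop known false positives
        await self.push_frame(frame, direction)

transcription_gate = TranscriptionGate(blocked_phrases={"thank you."})
```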

UserIdle handling

The UserIdle processor monitors for silence from the user. If no speech is detected for a configured duration, it:

  1. Sends a configurable idle prompt to re-engage the user.
  2. Repeats up to max_retries times.
  3. Ends the call if the user remains silent after all retries are exhausted.
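
A sketch of this behavior using Pipecat's UserIdleProcessor. The timeout, prompt text, and retry count are placeholders for the values in your configuration, and the callback API varies somewhat across Pipecat versions.

```python
from pipecat.frames.frames import EndTaskFrame, TTSSpeakFrame
from pipecat.processors.frame_processor import FrameDirection
from pipecat.processors.user_idle_processor import UserIdleProcessor

MAX_RETRIES = 2  # placeholder for the configured max_retries
_retries = 0

async def handle_idle(processor: UserIdleProcessor):
    global _retries
    _retries += 1
    if _retries <= MAX_RETRIES:
        # Re-engage the user with the configured idle prompt.
        await processor.push_frame(TTSSpeakFrame("Are you still there?"))
    else:
        # Retries exhausted: ask the pipeline task to end the call.
        await processor.push_frame(EndTaskFrame(), FrameDirection.UPSTREAM)

user_idle = UserIdleProcessor(callback=handle_idle, timeout=8.0)  # seconds of silence
```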

Configuration

See User Idle Handling for timeout durations, prompt customization, and retry settings.

Aggregators

UserAggregator

Collects and merges partial user transcriptions into complete user turns before sending them to the LLM. Works with the turn detection strategy to determine when the user has finished speaking.

AssistantAggregator

Collects the assistant’s streamed response tokens and the corresponding TTS audio, assembling the complete assistant turn for transcription logging and analytics.
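
Both aggregators typically come from a single LLM context object, roughly as sketched below (given the `llm` service from the LLM section); the system prompt is a placeholder.

```python
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext

context = OpenAILLMContext(
    messages=[{"role": "system", "content": "You are a helpful voice agent."}]
)
context_aggregator = llm.create_context_aggregator(context)

# context_aggregator.user()      -> the UserAggregator stage in the pipeline
# context_aggregator.assistant() -> the AssistantAggregator stage in the pipeline
```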

User turn strategies

Turn detection determines when the user has finished speaking and the system should process their input. The strategy is configured via VAD (Voice Activity Detection) settings.

| Strategy | Behavior |
| --- | --- |
| endpointing | Uses silence duration after speech to detect turn end. Configurable via stop_secs. |
| push-to-talk | User explicitly signals turn boundaries (used in Daily/WebRTC sessions). |

VAD Settings

Fine-tune turn detection in VAD & Turn Detection configuration.
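
For instance, the endpointing strategy corresponds to a VAD analyzer configured with a stop_secs silence threshold. The 0.8 s value below is a placeholder, and the Silero module path varies by Pipecat version.

```python
from pipecat.audio.vad.silero import SileroVADAnalyzer  # module path varies by version
from pipecat.audio.vad.vad_analyzer import VADParams

# Endpointing: treat 0.8 s of post-speech silence as the end of a user turn.
vad_analyzer = SileroVADAnalyzer(params=VADParams(stop_secs=0.8))
```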

Transport layer

The transport layer manages the audio I/O connection between the voice pipeline and the caller. Breeze Buddy supports two transport modes:

| Transport | Use Case |
| --- | --- |
| Telephony (Twilio / Plivo / Exotel) | PSTN calls: outbound dialing and inbound call handling. |
| Daily WebRTC | Browser-based sessions, playground testing, and Daily SDK integrations. |
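
As an example, a hedged sketch of constructing the Daily WebRTC transport that provides transport.input() and transport.output() in the pipeline above. `room_url` and `token` come from your session setup, and DailyParams fields differ slightly across Pipecat versions.

```python
from pipecat.transports.services.daily import DailyParams, DailyTransport

transport = DailyTransport(
    room_url,        # Daily room URL for this session (supplied at call setup)
    token,           # Daily meeting token (supplied at call setup)
    "Breeze Buddy",  # bot display name in the room
    DailyParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        vad_analyzer=vad_analyzer,  # from the VAD sketch above
    ),
)
```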