Concept

Voice pipeline

End-to-end audio processing pipeline — from microphone input to speaker output. Built on Pipecat.

Pipeline overview

Every voice call flows through a sequential pipeline of processors. Audio arrives from the transport layer (Daily WebRTC or Telephony WebSocket), gets transcribed, processed by the LLM, synthesized to speech, and sent back to the caller. See Architecture for where the pipeline fits in the broader system.

transport.input()

STT

TranscriptionGate

UserIdle

UserAggregator

LLM

TTS

transport.output()

AssistantAggregator

STT (speech-to-Text) providers

The STT stage converts raw audio into text transcriptions. Multiple providers are supported with different strengths.

Provider	Key	Notes
Soniox	`soniox`	Default. Low-latency streaming with strong multilingual support.
Deepgram	`deepgram`	High accuracy, extensive language coverage.
Sarvam	`sarvam`	Optimized for Indian languages.
OpenAI	`openai`	Whisper-based, high accuracy for English.
Google	`google`	Google Cloud Speech-to-Text.

Provider Configuration

Configure the STT provider in your template’s STT config. See STT Configuration for full options.

LLM (large language model)

The LLM processes the transcribed user input along with the conversation history and system prompts to generate the assistant’s response. Breeze Buddy uses Azure OpenAI as the LLM provider. Prompts are defined in your flow node task_messages.

Aspect	Detail
Provider	Azure OpenAI
Streaming	Yes — token-by-token streaming for low-latency TTS feeding
Function Calling	Supported — tools/functions defined in template flow nodes
Context Window	Managed automatically with conversation history truncation

LLM Configuration

Adjust model, temperature, max tokens, and other parameters in LLM Configuration.

TTS (text-to-Speech) providers

The TTS stage converts the LLM’s text response into audio that is streamed back to the caller.

Provider	Key	Notes
ElevenLabs	`elevenlabs`	Default. Natural-sounding voices with low latency.
Cartesia	`cartesia`	Ultra-low latency streaming TTS.
Sarvam	`sarvam`	Indian language voices.
Google	`google`	Google Cloud Text-to-Speech.

TranscriptionGate

The TranscriptionGate sits between STT and the user aggregator. It provides two key capabilities:

Keyword filtering — filters out known noise phrases or false positives from the STT output.
Hard mute control — when the gate is closed, all user transcriptions are blocked from reaching the LLM. This is used during assistant speech to prevent the agent from responding to its own audio.

UserIdle handling

The UserIdle processor monitors for silence from the user. If no speech is detected for a configured duration, it:

Sends a configurable idle prompt to re-engage the user.
Repeats up to max_retries times.
Ends the call if the user remains silent after all retries are exhausted.

Configuration

See User Idle Handling for timeout durations, prompt customization, and retry settings.

Aggregators

UserAggregator

Collects and merges partial user transcriptions into complete user turns before sending them to the LLM. Works with the turn detection strategy to determine when the user has finished speaking.

AssistantAggregator

Collects the assistant’s streamed response tokens and the corresponding TTS audio, assembling the complete assistant turn for transcription logging and analytics.

User turn strategies

Turn detection determines when the user has finished speaking and the system should process their input. The strategy is configured via VAD (Voice Activity Detection) settings.

Strategy	Behavior
`endpointing`	Uses silence duration after speech to detect turn end. Configurable via `stop_secs`.
`push-to-talk`	User explicitly signals turn boundaries (used in Daily/WebRTC sessions).

VAD Settings

Fine-tune turn detection in VAD & Turn Detection configuration.

Transport layer

The transport layer manages the audio I/O connection between the voice pipeline and the caller. Breeze Buddy supports two transport modes:

Transport	Use Case
Telephony (Twilio / Plivo / Exotel)	PSTN calls — outbound dialing and inbound call handling.
Daily WebRTC	Browser-based sessions, playground testing, and Daily SDK integrations.

Was this helpful?

Edit on GitHub