Voice pipeline
End-to-end audio processing pipeline — from microphone input to speaker output. Built on Pipecat.
Pipeline overview
Every voice call flows through a sequential pipeline of processors. Audio arrives from the transport layer (Daily WebRTC or Telephony WebSocket), gets transcribed, processed by the LLM, synthesized to speech, and sent back to the caller. See Architecture for where the pipeline fits in the broader system.
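The stage ordering described above can be pictured as a chain of functions, each consuming the previous stage's output. The following is a minimal, library-free sketch and not Pipecat's actual API; `fake_stt`, `fake_llm`, `fake_tts`, and `run_pipeline` are hypothetical stand-ins for the real processors.

```python
from typing import Any, Callable, List


def fake_stt(audio: bytes) -> str:
    # Hypothetical STT stage: treat the audio bytes as UTF-8 text
    # so the sketch stays runnable without a real provider.
    return audio.decode("utf-8")


def fake_llm(transcript: str) -> str:
    # Hypothetical LLM stage: echo the transcript back as a reply.
    return f"You said: {transcript}"


def fake_tts(text: str) -> bytes:
    # Hypothetical TTS stage: encode the reply as "audio" bytes.
    return text.encode("utf-8")


def run_pipeline(stages: List[Callable[[Any], Any]], frame: Any) -> Any:
    """Pass a frame through each stage in order."""
    for stage in stages:
        frame = stage(frame)
    return frame


audio_out = run_pipeline([fake_stt, fake_llm, fake_tts], b"hello")
```

In the real pipeline the stages stream frames concurrently rather than running once per call; the sketch only shows the ordering.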
STT (speech-to-text) providers
The STT stage converts raw audio into text transcriptions. Multiple providers are supported with different strengths.
| Provider | Key | Notes |
|---|---|---|
| Soniox | soniox | Default. Low-latency streaming with strong multilingual support. |
| Deepgram | deepgram | High accuracy, extensive language coverage. |
| Sarvam | sarvam | Optimized for Indian languages. |
| OpenAI | openai | Whisper-based, high accuracy for English. |
| Google | google | Google Cloud Speech-to-Text. |
Provider Configuration
Configure the STT provider in your template’s STT config. See STT Configuration for full options.
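As a rough illustration of how the provider keys from the table might be resolved, the sketch below validates a key and falls back to the default. The keys and the default come from the table above; the `resolve_stt_provider` helper and the `{"provider": ...}` config shape are assumptions made for this example, not the documented template schema.

```python
from typing import Mapping, Optional

# Provider keys listed in the table above.
STT_PROVIDER_KEYS = {"soniox", "deepgram", "sarvam", "openai", "google"}
DEFAULT_STT_PROVIDER = "soniox"


def resolve_stt_provider(stt_config: Optional[Mapping[str, str]]) -> str:
    """Return the configured provider key, or the default if none is set.

    The "provider" field name is hypothetical.
    """
    key = (stt_config or {}).get("provider", DEFAULT_STT_PROVIDER)
    if key not in STT_PROVIDER_KEYS:
        raise ValueError(f"Unsupported STT provider: {key!r}")
    return key
```

Failing fast on an unknown key surfaces a typo at configuration time rather than mid-call.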
LLM (large language model)
The LLM processes the transcribed user input along with the conversation history and system prompts to generate the assistant’s response. Breeze Buddy uses Azure OpenAI as the LLM provider. Prompts are defined in your flow node task_messages.
| Aspect | Detail |
|---|---|
| Provider | Azure OpenAI |
| Streaming | Yes — token-by-token streaming for low-latency TTS feeding |
| Function Calling | Supported — tools/functions defined in template flow nodes |
| Context Window | Managed automatically with conversation history truncation |
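The truncation strategy itself is not described here. One common approach, shown purely as an assumption, keeps every system message and only the most recent conversation messages. `truncate_history` and `max_recent` are hypothetical names.

```python
from typing import Dict, List

Message = Dict[str, str]


def truncate_history(messages: List[Message], max_recent: int) -> List[Message]:
    """Keep all system messages plus the last `max_recent` other messages."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]
    # Guard against max_recent == 0, since others[-0:] would return everything.
    recent = others[-max_recent:] if max_recent > 0 else []
    return system + recent
```

Production systems often budget by token count rather than message count; message count is used here only to keep the example short.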
LLM Configuration
Adjust model, temperature, max tokens, and other parameters in LLM Configuration.
TTS (text-to-speech) providers
The TTS stage converts the LLM’s text response into audio that is streamed back to the caller.
| Provider | Key | Notes |
|---|---|---|
| ElevenLabs | elevenlabs | Default. Natural-sounding voices with low latency. |
| Cartesia | cartesia | Ultra-low latency streaming TTS. |
| Sarvam | sarvam | Indian language voices. |
| Google | google | Google Cloud Text-to-Speech. |
TranscriptionGate
The TranscriptionGate sits between STT and the user aggregator. It provides two key capabilities:
- Keyword filtering — filters out known noise phrases or false positives from the STT output.
- Hard mute control — when the gate is closed, all user transcriptions are blocked from reaching the LLM. This is used during assistant speech to prevent the agent from responding to its own audio.
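The two capabilities above can be sketched as a small filter object. This is an illustrative model only, not the actual TranscriptionGate implementation; the class name, the `closed` flag, and exact-match phrase filtering are assumptions.

```python
from typing import Iterable, Optional


class TranscriptionGateSketch:
    """Hypothetical model of a gate between STT and the user aggregator."""

    def __init__(self, blocked_phrases: Iterable[str]):
        # Normalize phrases once so matching is case-insensitive.
        self._blocked = {p.strip().lower() for p in blocked_phrases}
        self.closed = False  # True means hard mute is active

    def process(self, transcript: str) -> Optional[str]:
        """Return the transcript if it may pass, otherwise None."""
        if self.closed:
            # Hard mute: nothing reaches the LLM.
            return None
        if transcript.strip().lower() in self._blocked:
            # Keyword filter: drop known noise phrases.
            return None
        return transcript
```

A caller would set `closed = True` when the assistant starts speaking and reset it when the assistant finishes.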
UserIdle handling
The UserIdle processor monitors for silence from the user. If no speech is detected for a configured duration, it:
- Sends a configurable idle prompt to re-engage the user.
- Repeats up to max_retries times.
- Ends the call if the user remains silent after all retries are exhausted.
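The retry behavior in the list above can be modeled as a small counter. The sketch is an assumption-laden illustration: `IdleTrackerSketch` and its method names are hypothetical, and resetting the counter when the user speaks is assumed rather than documented.

```python
class IdleTrackerSketch:
    """Hypothetical model of idle-prompt retries."""

    def __init__(self, max_retries: int):
        self.max_retries = max_retries
        self.retries = 0

    def on_user_speech(self) -> None:
        # Assumption: any user speech resets the retry counter.
        self.retries = 0

    def on_idle_timeout(self) -> str:
        """Return the action to take when the idle timer fires."""
        if self.retries < self.max_retries:
            self.retries += 1
            return "prompt"
        return "end_call"
```

With `max_retries=2`, two consecutive timeouts produce idle prompts and the third ends the call.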
Configuration
See User Idle Handling for timeout durations, prompt customization, and retry settings.
Aggregators
UserAggregator
Collects and merges partial user transcriptions into complete user turns before sending them to the LLM. Works with the turn detection strategy to determine when the user has finished speaking.
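A minimal sketch of the merging step, assuming transcription fragments arrive as plain strings and an external turn detector signals the end of the turn. `UserTurnBuffer` is a hypothetical name, not the real aggregator class.

```python
from typing import List


class UserTurnBuffer:
    """Hypothetical buffer that merges transcript fragments into one turn."""

    def __init__(self) -> None:
        self._parts: List[str] = []

    def add(self, fragment: str) -> None:
        # Skip empty fragments so the joined turn has clean spacing.
        fragment = fragment.strip()
        if fragment:
            self._parts.append(fragment)

    def end_turn(self) -> str:
        """Return the merged turn and reset the buffer for the next one."""
        turn = " ".join(self._parts)
        self._parts.clear()
        return turn
```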
AssistantAggregator
Collects the assistant’s streamed response tokens and the corresponding TTS audio, assembling the complete assistant turn for transcription logging and analytics.
User turn strategies
Turn detection determines when the user has finished speaking and the system should process their input. The strategy is configured via VAD (Voice Activity Detection) settings.
| Strategy | Behavior |
|---|---|
| endpointing | Uses silence duration after speech to detect turn end. Configurable via stop_secs. |
| push-to-talk | User explicitly signals turn boundaries (used in Daily/WebRTC sessions). |
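To make the endpointing row concrete, the sketch below scans timestamped VAD decisions and reports when the silence after the last speech frame reaches `stop_secs`. The `(timestamp, is_speech)` frame format and the `find_turn_end` helper are assumptions for illustration; the real detector works on streaming audio.

```python
from typing import Iterable, Optional, Tuple


def find_turn_end(
    frames: Iterable[Tuple[float, bool]], stop_secs: float
) -> Optional[float]:
    """Return the timestamp at which the turn ends, or None if it has not ended.

    Each frame is (timestamp_in_seconds, is_speech).
    """
    last_speech: Optional[float] = None
    for ts, is_speech in frames:
        if is_speech:
            last_speech = ts
        elif last_speech is not None and ts - last_speech >= stop_secs:
            # Enough silence has followed the last speech frame.
            return ts
    return None
```

A smaller `stop_secs` makes the agent respond sooner but raises the chance of cutting the user off mid-pause.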
VAD Settings
Fine-tune turn detection in VAD & Turn Detection configuration.
Transport layer
The transport layer manages the audio I/O connection between the voice pipeline and the caller. Breeze Buddy supports two transport modes:
| Transport | Use Case |
|---|---|
| Telephony (Twilio / Plivo / Exotel) | PSTN calls — outbound dialing and inbound call handling. |
| Daily WebRTC | Browser-based sessions, playground testing, and Daily SDK integrations. |