Turn detection
Pick how the agent decides when a user turn has ended — native STT endpoints, smart-turn ML, or a silence timeout.
Turn detection is the upstream decision behind everything else in the pipeline. It governs when the LLM is prompted, when input-collection windows close, and when interruption opportunities open. Breeze Buddy supports three strategies; the right one depends on your STT provider and how forgiving the conversation needs to be.
The three strategies
| Mode | Who decides the turn ended | When to use |
|---|---|---|
| `stt_native` | The STT provider (Soniox semantic endpoint detection, Deepgram endpointing) | Default in production. Best for conversational back-and-forth. |
| `smart_turn` | A dedicated ML model (SmartTurn) that classifies turn boundaries | When STT endpointing misfires on pauses. Requires the turn analyzer to be wired up. |
| `timeout` | N seconds of silence after the last transcription | Fallback when STT-native endpointing is unreliable; also the basis for multi-segment input collection. |
`stt_native` is the default. You rarely need to change it.
stt_native
The STT service signals end-of-turn on its own. Each provider has a different tuning knob:
| Provider | Knob | Notes |
|---|---|---|
| Soniox | `max_endpoint_delay_ms` (default 500 ms) | Set at WebSocket connect time; not runtime-tunable. |
| Deepgram | `endpointing` parameter on the stream | Enable for endpoint-aware streams. |
With stt_native, turns end as soon as the STT decides the user has finished. No further config is needed on the Breeze Buddy side — just pick an STT that supports it.
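Because `stt_native` is the default, you normally omit it, but it can be set explicitly. A sketch assuming the same `stt_configuration` shape as the timeout example below; `"soniox"` is a hypothetical provider value used here for illustration:

```json
{
  "configurations": {
    "stt_configuration": {
      "provider": "soniox",
      "language": "en",
      "turn_detection": "stt_native"
    }
  }
}
```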
timeout
The turn ends when `user_speech_timeout` seconds of silence follow the last transcription. Useful when STT endpointing is missing or when you want predictable accumulation of multi-segment input.
```json
{
  "configurations": {
    "stt_configuration": {
      "provider": "deepgram",
      "language": "en",
      "turn_detection": "timeout",
      "user_speech_timeout": 1.5
    }
  }
}
```

Both `turn_detection` and `user_speech_timeout` live inside `stt_configuration` — they are not top-level keys. Note that `user_speech_timeout` is silently reset to 0.0 unless `turn_detection` is `"timeout"`.
A 1.5-second timeout is a good starting point for phone calls. Raise it to 3 seconds when collecting phone numbers, addresses, or anything the user might say in chunks (see Input collection). The timer resets on every new transcript, so segments accumulate naturally.
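For chunked input such as digit strings, the same configuration with the longer window looks like this (sketch; only `user_speech_timeout` changes from the example above):

```json
{
  "configurations": {
    "stt_configuration": {
      "provider": "deepgram",
      "language": "en",
      "turn_detection": "timeout",
      "user_speech_timeout": 3.0
    }
  }
}
```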
smart_turn
A learned classifier (SmartTurn v3 in Pipecat) decides turn boundaries based on linguistic cues rather than silence or endpoints. It is most helpful when:
- Users pause mid-sentence (“my order number is… uh… 4-8-2-7-3”).
- STT endpointing closes the turn too eagerly.
- You need a single detector across STT providers.
Availability
smart_turn requires the turn-analyzer component to be configured at deployment time. If your deployment does not include it, only stt_native and timeout are selectable.
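If your deployment includes the turn analyzer, selecting this mode is just a different `turn_detection` value. A sketch assuming the same `stt_configuration` shape as the timeout example:

```json
{
  "configurations": {
    "stt_configuration": {
      "provider": "deepgram",
      "language": "en",
      "turn_detection": "smart_turn"
    }
  }
}
```

Note that `user_speech_timeout` is irrelevant here; it is reset to 0.0 for any mode other than `"timeout"`.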
Edge cases
- Barge-in while TTS — Turn start is detected regardless of mode. Whether that start actually interrupts depends on Interruption control.
- Short answers on timeout mode — A 3-second timeout means a "yes" answer takes 3 seconds to propagate. Use `stt_native` or `smart_turn` for fast back-and-forth.
- Timeout with noisy mics — VAD may generate spurious partial transcripts, each resetting the timer. Tune VAD `stop_secs` alongside the timeout.