Turn detection

Pick how the agent decides when a user turn has ended — native STT endpoints, smart-turn ML, or a silence timeout.

Turn detection is the upstream decision behind everything else in the pipeline. It governs when the LLM is prompted, when input-collection windows close, and when interruption opportunities open. Breeze Buddy supports three strategies; the right one depends on your STT provider and how forgiving the conversation needs to be.

Prerequisites

  • Working STT configuration — see STT.
  • A rough sense of your conversation’s cadence (short confirmations vs. long open-ended answers).
  • VAD basics — see VAD.

The three strategies

| Mode | Who decides the turn ended | When to use |
| --- | --- | --- |
| `stt_native` | The STT provider (Soniox semantic endpoint detection, Deepgram endpointing) | Default in production. Best for conversational back-and-forth. |
| `smart_turn` | A dedicated ML model (SmartTurn) that classifies turn boundaries | When STT endpointing misfires on pauses. Requires the turn analyzer to be wired up. |
| `timeout` | N seconds of silence after the last transcription | Fallback when STT-native is unreliable; also the basis for multi-segment input collection. |

stt_native is the default. You rarely need to change it.

stt_native

The STT service signals end-of-turn on its own. Each provider has a different tuning knob:

| Provider | Knob | Notes |
| --- | --- | --- |
| Soniox | `max_endpoint_delay_ms` (default 500 ms) | Set at WebSocket connect time; not runtime-tunable. |
| Deepgram | `endpointing` parameter on the stream | Enable for endpoint-aware streams. |

With stt_native, turns end as soon as the STT decides the user has finished. No further config is needed on the Breeze Buddy side — just pick an STT that supports it.
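As a sketch, an `stt_native` setup with Soniox might look like the following. Only `max_endpoint_delay_ms` is documented above; its placement inside `stt_configuration` (mirroring the timeout template) is an assumption:

```json
{
  "configurations": {
    "stt_configuration": {
      "provider": "soniox",
      "language": "en",
      "turn_detection": "stt_native",
      "max_endpoint_delay_ms": 500
    }
  }
}
```

Remember that `max_endpoint_delay_ms` is applied at WebSocket connect time, so changing it requires reconnecting the stream.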

timeout

The turn ends when user_speech_timeout seconds of silence follow the last transcription. Useful when STT endpointing is missing or when you want predictable accumulation of multi-segment input.

template-timeout.json

```json
{
  "configurations": {
    "stt_configuration": {
      "provider": "deepgram",
      "language": "en",
      "turn_detection": "timeout",
      "user_speech_timeout": 1.5
    }
  }
}
```

Both turn_detection and user_speech_timeout live inside stt_configuration — they are not top-level keys. And user_speech_timeout is silently reset to 0.0 unless turn_detection is "timeout".

A 1.5-second timeout is a good starting point for phone calls. Raise it to 3 seconds when collecting phone numbers, addresses, or anything the user might say in chunks (see Input collection). The timer resets on every new transcript, so segments accumulate naturally.
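Applying that guidance, a chunked-input variant of the template above would only change the timeout value:

```json
{
  "configurations": {
    "stt_configuration": {
      "provider": "deepgram",
      "language": "en",
      "turn_detection": "timeout",
      "user_speech_timeout": 3.0
    }
  }
}
```

The trade-off is latency: every turn, including a one-word answer, now waits the full 3 seconds before closing.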

smart_turn

A learned classifier (SmartTurn v3 in Pipecat) decides turn boundaries based on linguistic cues rather than silence or endpoints. It is most helpful when:

  • Users pause mid-sentence (“my order number is… uh… 4-8-2-7-3”).
  • STT endpointing closes the turn too eagerly.
  • You need a single detector across STT providers.

Availability

smart_turn requires the turn-analyzer component to be configured at deployment time. If your deployment does not include it, only stt_native and timeout are selectable.
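Assuming the turn analyzer is deployed, selecting `smart_turn` would plausibly reuse the same key as the timeout template; everything beyond `turn_detection` here is illustrative:

```json
{
  "configurations": {
    "stt_configuration": {
      "provider": "deepgram",
      "language": "en",
      "turn_detection": "smart_turn"
    }
  }
}
```

Note that `user_speech_timeout` is irrelevant in this mode; it is silently reset to 0.0 whenever `turn_detection` is not `"timeout"`.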

Edge cases

  • Barge-in while TTS — Turn start is detected regardless of mode. Whether that start actually interrupts depends on Interruption control.
  • Short answers on timeout mode — A 3-second timeout means a “yes” answer takes 3 seconds to propagate. Use stt_native or smart_turn for fast back-and-forth.
  • Timeout with noisy mics — VAD may generate spurious partial transcripts, each resetting the timer. Tune VAD stop_secs alongside the timeout.
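For the noisy-mic case, a sketch of tuning VAD alongside the timeout follows. The `vad_configuration` block and its `stop_secs` key placement are assumptions (only the `stop_secs` knob itself is referenced above); the idea is to make VAD slower to report speech end than the spurious-noise bursts that reset the timer:

```json
{
  "configurations": {
    "stt_configuration": {
      "provider": "deepgram",
      "turn_detection": "timeout",
      "user_speech_timeout": 1.5
    },
    "vad_configuration": {
      "stop_secs": 0.8
    }
  }
}
```

Keep `stop_secs` well below `user_speech_timeout`; otherwise VAD can delay the final transcript long enough to stretch every turn.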
