Turn detection

Pick how the agent decides when a user turn has ended — native STT endpoints, smart-turn ML, or a silence timeout.

Turn detection is the upstream decision behind everything else in the pipeline. It governs when the LLM is prompted, when input-collection windows close, and when interruption opportunities open. Breeze Buddy supports three strategies; the right one depends on your STT provider and how forgiving the conversation needs to be.

Prerequisites

  • Working STT configuration — see STT.
  • A rough sense of your conversation’s cadence (short confirmations vs. long open-ended answers).
  • VAD basics — see VAD.

The three strategies

| Mode | Who decides the turn ended | When to use |
| --- | --- | --- |
| `stt_native` | The STT provider (Soniox semantic endpoint detection, Deepgram endpointing) | Default in production. Best for conversational back-and-forth. |
| `smart_turn` | A dedicated ML model (SmartTurn) that classifies turn boundaries | When STT endpointing misfires on pauses. Requires the turn analyzer to be wired up. |
| `timeout` | N seconds of silence after the last transcription | Fallback when STT-native is unreliable; also the basis for multi-segment input collection. |

stt_native is the default. You rarely need to change it.

stt_native

The STT service signals end-of-turn on its own. Each provider has a different tuning knob:

| Provider | Knob | Notes |
| --- | --- | --- |
| Soniox | `max_endpoint_delay_ms` (default 500 ms) | Set at WebSocket connect time; not runtime-tunable. |
| Deepgram | `endpointing` parameter on the stream | Enable for endpoint-aware streams. |

With stt_native, turns end as soon as the STT decides the user has finished. No further config is needed on the Breeze Buddy side — just pick an STT that supports it.
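As a sketch, an `stt_native` setup with Soniox might look like the following. Only `max_endpoint_delay_ms` is documented above; its placement inside `stt_configuration` (mirroring the timeout template) is an assumption:

```json
{
  "configurations": {
    "stt_configuration": {
      "provider": "soniox",
      "language": "en",
      "turn_detection": "stt_native",
      "max_endpoint_delay_ms": 500
    }
  }
}
```

Remember that `max_endpoint_delay_ms` is applied at WebSocket connect time, so changing it requires reconnecting the stream.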

timeout

The turn ends when user_speech_timeout seconds of silence follow the last transcription. Useful when STT endpointing is missing or when you want predictable accumulation of multi-segment input.

template-timeout.json

```json
{
  "configurations": {
    "stt_configuration": {
      "provider": "deepgram",
      "language": "en",
      "turn_detection": "timeout",
      "user_speech_timeout": 1.5
    }
  }
}
```

Both turn_detection and user_speech_timeout live inside stt_configuration — they are not top-level keys. And user_speech_timeout is silently reset to 0.0 unless turn_detection is "timeout".

A 1.5-second timeout is a good starting point for phone calls. Raise it to 3 seconds when collecting phone numbers, addresses, or anything the user might say in chunks (see Input collection). The timer resets on every new transcript, so segments accumulate naturally.
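Applying that guidance, a chunked-input variant of the template above would only change the timeout value:

```json
{
  "configurations": {
    "stt_configuration": {
      "provider": "deepgram",
      "language": "en",
      "turn_detection": "timeout",
      "user_speech_timeout": 3.0
    }
  }
}
```

The trade-off is latency: every turn, including a one-word answer, now waits the full 3 seconds before closing.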

smart_turn

A learned classifier (SmartTurn v3 in Pipecat) decides turn boundaries based on linguistic cues rather than silence or endpoints. It is most helpful when:

  • Users pause mid-sentence (“my order number is… uh… 4-8-2-7-3”).
  • STT endpointing closes the turn too eagerly.
  • You need a single detector across STT providers.

Availability

smart_turn requires the turn-analyzer component to be configured at deployment time. If your deployment does not include it, only stt_native and timeout are selectable.
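Assuming the turn analyzer is deployed, selecting `smart_turn` would plausibly reuse the same key as the timeout template; everything beyond `turn_detection` here is illustrative:

```json
{
  "configurations": {
    "stt_configuration": {
      "provider": "deepgram",
      "language": "en",
      "turn_detection": "smart_turn"
    }
  }
}
```

Note that `user_speech_timeout` is irrelevant in this mode; it is silently reset to 0.0 whenever `turn_detection` is not `"timeout"`.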

Edge cases

  • Barge-in while TTS — Turn start is detected regardless of mode. Whether that start actually interrupts depends on Interruption control.
  • Short answers on timeout mode — A 3-second timeout means a “yes” answer takes 3 seconds to propagate. Use stt_native or smart_turn for fast back-and-forth.
  • Timeout with noisy mics — VAD may generate spurious partial transcripts, each resetting the timer. Tune VAD stop_secs alongside the timeout.
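For the noisy-mic case, a sketch of tuning VAD alongside the timeout follows. The `vad_configuration` block and its `stop_secs` key placement are assumptions (only the `stop_secs` knob itself is referenced above); the idea is to make VAD slower to report speech end than the spurious-noise bursts that reset the timer:

```json
{
  "configurations": {
    "stt_configuration": {
      "provider": "deepgram",
      "turn_detection": "timeout",
      "user_speech_timeout": 1.5
    },
    "vad_configuration": {
      "stop_secs": 0.8
    }
  }
}
```

Keep `stop_secs` well below `user_speech_timeout`; otherwise VAD can delay the final transcript long enough to stretch every turn.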
