
VAD and turn detection

Silero VAD parameters, turn-detection strategies, and per-node overrides.

Overview

Voice Activity Detection (VAD) determines when the user is speaking. Turn detection decides when the user has finished speaking so the bot can respond. Breeze Buddy uses the Silero VAD model combined with one of three turn-detection strategies.

VAD parameters

| Field | Type | Range | Description |
| --- | --- | --- | --- |
| confidence | float | 0.0–1.0 | Minimum confidence to classify audio as speech. |
| start_secs | float | ≥ 0 | Consecutive speech seconds before marking onset. |
| stop_secs | float | ≥ 0 | Consecutive silence seconds before marking offset. |
| min_volume | float | ≥ 0 | Audio below this volume is treated as silence. |

VAD Configuration
json
{
  "vad_config": {
    "confidence": 0.5,
    "start_secs": 0.2,
    "stop_secs": 0.8,
    "min_volume": 0.6
  }
}
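
The ranges in the parameter table can be checked before a config is deployed. The following is a minimal sketch, not part of Breeze Buddy: the function name `validate_vad_config` is hypothetical, and it assumes omitted fields are allowed and simply fall back to whatever defaults apply.

```python
def validate_vad_config(cfg):
    """Return a list of problems; an empty list means every field is in range.

    Hypothetical helper: the ranges come from the VAD parameters table,
    everything else is an assumption made for illustration.
    """
    problems = []
    for field in ("confidence", "start_secs", "stop_secs", "min_volume"):
        value = cfg.get(field)
        if value is None:
            continue  # omitted field: left to whatever defaults apply
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{field}: expected a number, got {value!r}")
        elif field == "confidence" and not 0.0 <= value <= 1.0:
            problems.append(f"{field}: {value} is outside 0.0-1.0")
        elif value < 0:
            problems.append(f"{field}: {value} must be >= 0")
    return problems
```

Passing the `vad_config` object above returns an empty list; a `confidence` of 1.5 or a negative `stop_secs` each produce one entry.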

Tuning tips

Lowering confidence catches softer speech but may trigger falsely on background noise. Raising stop_secs prevents premature cutoffs but adds response latency. Start with the defaults and tune from test calls.
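
To see how the four parameters interact while tuning, it can help to replay per-frame scores through a simple onset/offset gate. The sketch below is illustrative only and is not the Silero or Breeze Buddy implementation: the `(confidence, volume)` frame shape, the 100 ms frame size, and the names `VadConfig`, `frames_needed`, and `speech_segments` are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class VadConfig:
    # Values copied from the example config above; not confirmed system defaults.
    confidence: float = 0.5
    start_secs: float = 0.2
    stop_secs: float = 0.8
    min_volume: float = 0.6


def frames_needed(secs, frame_ms):
    """Whole frames required to cover `secs` (at least one)."""
    total_ms = round(secs * 1000)
    return max(1, -(-total_ms // frame_ms))  # integer ceiling division


def speech_segments(frames, cfg, frame_ms=100):
    """Return (onset_index, offset_index) frame pairs.

    frames: iterable of (confidence, volume) tuples, one per frame
            (hypothetical input shape).
    """
    start_frames = frames_needed(cfg.start_secs, frame_ms)
    stop_frames = frames_needed(cfg.stop_secs, frame_ms)
    segments = []
    speaking = False
    run = 0  # consecutive speech frames while idle, silence frames while speaking
    onset = None
    for i, (conf, volume) in enumerate(frames):
        # A frame counts as speech only if it clears both thresholds.
        is_speech = conf >= cfg.confidence and volume >= cfg.min_volume
        if not speaking:
            run = run + 1 if is_speech else 0
            if run >= start_frames:
                speaking, onset, run = True, i, 0
        else:
            run = 0 if is_speech else run + 1
            if run >= stop_frames:
                segments.append((onset, i))
                speaking, run = False, 0
    return segments
```

With the values above, one silent frame, three speech frames, and eight silent frames yield `[(2, 11)]`: onset is marked on the second consecutive speech frame, and offset only after 0.8 s of silence. A three-frame pause inside an utterance does not end the segment, which is the trade-off that a higher stop_secs buys.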

Turn detection strategies

Set via stt_configuration.turn_detection:

| Strategy | Mechanism | Latency | Best With | Extra Config |
| --- | --- | --- | --- | --- |
| stt_native | Provider endpoint token | Lowest | Soniox | None |
| smart_turn | Whisper ONNX prosody analysis | Medium | Deepgram | smart_turn |
| timeout | Silent timer after last transcript | Configurable | Any | user_speech_timeout |
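
The "Extra Config" column can be turned into a quick consistency check. This is a sketch under assumptions, not a Breeze Buddy API: `turn_detection_notes` is a hypothetical name, and it assumes each extra config key sits directly inside `stt_configuration` (the full example shows this for `smart_turn`; for `user_speech_timeout` it is a guess).

```python
# Strategy -> extra config key, as listed in the strategy table.
# None means the table lists no extra config for that strategy.
EXTRA_CONFIG_KEY = {
    "stt_native": None,
    "smart_turn": "smart_turn",
    "timeout": "user_speech_timeout",
}


def turn_detection_notes(stt_configuration):
    """Return human-readable notes about strategy/extra-config mismatches.

    Hypothetical helper; it only reports, it does not claim how the
    platform treats a missing or unused block.
    """
    strategy = stt_configuration.get("turn_detection")
    if strategy not in EXTRA_CONFIG_KEY:
        raise ValueError(f"unknown turn_detection strategy: {strategy!r}")

    notes = []
    own_key = EXTRA_CONFIG_KEY[strategy]
    if own_key is not None and own_key not in stt_configuration:
        notes.append(f"'{strategy}' selected but no '{own_key}' entry is present")
    for name, key in EXTRA_CONFIG_KEY.items():
        if key is not None and name != strategy and key in stt_configuration:
            notes.append(f"'{key}' is present but turn_detection is '{strategy}'")
    return notes
```

A config that selects `smart_turn` and includes a `smart_turn` block returns no notes; selecting `timeout` without `user_speech_timeout` returns one.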

Smart Turn Config

| Field | Default | Description |
| --- | --- | --- |
| stop_secs | 3.0 | Max silence seconds before forcing turn stop. |
| pre_speech_ms | 500.0 | Audio context (ms) before speech onset for analysis. |
| max_duration_secs | 8.0 | Maximum turn duration. |
| cpu_count | 1 | CPU threads for ONNX inference. |

Per-node overrides

VAD and turn-detection settings follow a reset-then-apply cascade:

  1. Template-level vad_config applies as baseline.
  2. When entering a node with its own config, template settings are reset.
  3. Node-level settings apply in full — no merging.

No merging

Node-level vad_config replaces the template config entirely. Include all desired fields — missing ones revert to system defaults, not the template’s values.
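
The replace-not-merge rule is easy to get wrong, so here is a small sketch of the resolution logic as described above. It is not the actual Breeze Buddy code: `effective_vad_config` is a hypothetical name, the `SYSTEM_DEFAULTS` values are placeholders (the real system defaults are not listed here), and treating a template with missing fields the same way is an assumption.

```python
# Placeholder values for illustration only; the real system defaults may differ.
SYSTEM_DEFAULTS = {
    "confidence": 0.7,
    "start_secs": 0.2,
    "stop_secs": 0.8,
    "min_volume": 0.6,
}


def effective_vad_config(template_cfg, node_cfg, system_defaults=SYSTEM_DEFAULTS):
    """Resolve the VAD config in effect for a node.

    node_cfg is None when the node has no vad_config of its own.
    """
    if node_cfg is not None:
        # Reset-then-apply: the template is discarded entirely, so fields
        # the node omits come from system defaults, not from the template.
        return {**system_defaults, **node_cfg}
    # No node-level config: the template baseline stays in effect.
    return {**system_defaults, **(template_cfg or {})}


template = {"confidence": 0.4, "stop_secs": 1.2}
node = {"stop_secs": 0.5}

resolved = effective_vad_config(template, node)
# confidence comes from SYSTEM_DEFAULTS (0.7), not the template's 0.4
```

In this example the node only sets stop_secs, so its confidence silently changes from the template's 0.4 to the placeholder default of 0.7. Repeating every desired field in the node-level vad_config avoids that surprise.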

Full example

VAD + Smart Turn (Production)
json
{
  "configurations": {
    "vad_config": {
      "confidence": 0.5,
      "start_secs": 0.2,
      "stop_secs": 0.8,
      "min_volume": 0.6
    },
    "stt_configuration": {
      "provider": "deepgram",
      "language": "en",
      "turn_detection": "smart_turn",
      "deepgram": {
        "model": "nova-3-general",
        "endpointing_ms": 25,
        "no_delay": true
      },
      "smart_turn": {
        "stop_secs": 3.0,
        "pre_speech_ms": 500.0,
        "max_duration_secs": 8.0,
        "cpu_count": 1
      }
    }
  }
}