VAD and turn detection

Silero VAD parameters, turn-detection strategies, and per-node overrides.

Overview

Voice Activity Detection (VAD) determines when the user is speaking. Turn detection decides when the user has finished speaking so the bot can respond. Breeze Buddy uses the Silero VAD model combined with one of three turn-detection strategies.

VAD parameters

| Field | Type | Range | Description |
| --- | --- | --- | --- |
| `confidence` | `float` | 0.0–1.0 | Minimum confidence to classify audio as speech. |
| `start_secs` | `float` | ≥ 0 | Consecutive speech seconds before marking onset. |
| `stop_secs` | `float` | ≥ 0 | Consecutive silence seconds before marking offset. |
| `min_volume` | `float` | ≥ 0 | Audio below this volume is treated as silence. |

VAD Configuration

```json
{
  "vad_config": {
    "confidence": 0.5,
    "start_secs": 0.2,
    "stop_secs": 0.8,
    "min_volume": 0.6
  }
}
```
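Before deploying a config, it can help to check it against the ranges in the parameter table. The following is a minimal sketch, not part of Breeze Buddy: `validate_vad_config` and `VAD_RANGES` are hypothetical names, and rejecting unknown fields is an assumption about how strict the check should be.

```python
from typing import Any

# Ranges copied from the parameter table: (min, max), where None means unbounded.
VAD_RANGES = {
    "confidence": (0.0, 1.0),
    "start_secs": (0.0, None),
    "stop_secs": (0.0, None),
    "min_volume": (0.0, None),
}


def validate_vad_config(vad_config: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field, value in vad_config.items():
        if field not in VAD_RANGES:
            # Assumption: unknown fields are treated as errors (typo protection).
            problems.append(f"unknown field: {field}")
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{field}: expected a number, got {type(value).__name__}")
            continue
        low, high = VAD_RANGES[field]
        if value < low or (high is not None and value > high):
            bound = f"{low} to {high}" if high is not None else f">= {low}"
            problems.append(f"{field}: {value} is outside {bound}")
    return problems
```

Calling `validate_vad_config` on the `vad_config` object shown above returns an empty list, while a value such as `"confidence": 1.5` produces one problem entry.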

Tuning tips

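Lowering `confidence` catches softer speech but makes false triggers on background noise more likely. Raising `stop_secs` prevents premature cutoffs but adds latency, because the bot waits longer before treating the user as finished. Start with the defaults and tune based on test calls.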
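To see how the four parameters interact, the sketch below simulates onset and offset detection over a list of per-frame `(confidence, volume)` readings. It is a simplified illustration of the parameter descriptions above, not the Silero or Breeze Buddy implementation: `VadParams`, `detect_events`, and the 100 ms frame size are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class VadParams:
    # Values taken from the example config above.
    confidence: float = 0.5
    start_secs: float = 0.2
    stop_secs: float = 0.8
    min_volume: float = 0.6


def detect_events(frames, params, frame_ms=100):
    """Return [(time_ms, "onset" | "offset")] for (confidence, volume) frames."""
    speaking = False
    run_ms = 0  # consecutive ms of audio that contradicts the current state
    events = []
    for i, (conf, vol) in enumerate(frames):
        # A frame counts as speech only if it clears both thresholds.
        is_speech = conf >= params.confidence and vol >= params.min_volume
        run_ms = run_ms + frame_ms if is_speech != speaking else 0
        needed_secs = params.stop_secs if speaking else params.start_secs
        needed_ms = round(needed_secs * 1000)
        if run_ms > 0 and run_ms >= needed_ms:
            speaking = not speaking
            run_ms = 0
            events.append(((i + 1) * frame_ms, "onset" if speaking else "offset"))
    return events


# Three frames of clear speech followed by one second of silence.
frames = [(0.9, 0.8)] * 3 + [(0.1, 0.0)] * 10
default_events = detect_events(frames, VadParams())
short_stop_events = detect_events(frames, VadParams(stop_secs=0.3))
```

In this simulation the user stops speaking at 300 ms. With `stop_secs` at 0.8 the offset is reported at 1100 ms; with 0.3 it is reported at 600 ms. That 500 ms gap is the latency cost of the longer silence window.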

Turn detection strategies

Set the strategy with `stt_configuration.turn_detection`:

| Strategy | Mechanism | Latency | Best with | Extra config |
| --- | --- | --- | --- | --- |
| `stt_native` | Provider endpoint token | Lowest | Soniox | None |
| `smart_turn` | Whisper ONNX prosody analysis | Medium | Deepgram | `smart_turn` |
| `timeout` | Silence timer after the last transcript | Configurable | Any | `user_speech_timeout` |
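The mapping from strategy to extra config in the table can be expressed as a small lookup. This is an illustrative sketch only: `EXTRA_CONFIG_KEY` and `extra_config_for` are hypothetical names, and it assumes `user_speech_timeout` sits next to `turn_detection` inside `stt_configuration`, which this guide does not confirm.

```python
# Strategy name -> extra config key, as listed in the table (None = no extra config).
EXTRA_CONFIG_KEY = {
    "stt_native": None,
    "smart_turn": "smart_turn",
    "timeout": "user_speech_timeout",
}


def extra_config_for(stt_configuration: dict):
    """Return (key, value) for the extra config used by the selected strategy."""
    strategy = stt_configuration.get("turn_detection")
    if strategy not in EXTRA_CONFIG_KEY:
        raise ValueError(f"unknown turn_detection strategy: {strategy!r}")
    key = EXTRA_CONFIG_KEY[strategy]
    if key is None:
        return None, None
    # Assumption: the extra config is a sibling of "turn_detection".
    return key, stt_configuration.get(key)
```

A value of `None` for a non-`None` key signals that the strategy was selected without its extra config block, which is worth flagging in a pre-deployment check.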

Smart Turn Config

| Field | Default | Description |
| --- | --- | --- |
| `stop_secs` | `3.0` | Max silence seconds before forcing turn stop. |
| `pre_speech_ms` | `500.0` | Audio context (ms) before speech onset for analysis. |
| `max_duration_secs` | `8.0` | Maximum turn duration. |
| `cpu_count` | `1` | CPU threads for ONNX inference. |
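As a sketch of how the defaults in the table combine with a partial `smart_turn` block, the helper below fills in any field that is left out. `with_smart_turn_defaults` is a hypothetical name, and the assumption that omitted fields fall back to the listed defaults is inferred from the table rather than stated by it.

```python
# Defaults copied from the Smart Turn Config table.
SMART_TURN_DEFAULTS = {
    "stop_secs": 3.0,
    "pre_speech_ms": 500.0,
    "max_duration_secs": 8.0,
    "cpu_count": 1,
}


def with_smart_turn_defaults(overrides=None):
    """Return a complete smart_turn config, filling omitted fields with defaults."""
    overrides = overrides or {}
    unknown = set(overrides) - set(SMART_TURN_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown smart_turn fields: {sorted(unknown)}")
    # Later keys win, so explicit overrides replace the defaults.
    return {**SMART_TURN_DEFAULTS, **overrides}
```

Note that `smart_turn.stop_secs` is a separate field from `vad_config.stop_secs`: the first caps how long silence can last before the turn is forced to end, while the second controls when VAD marks the speech offset.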

Per-node overrides

VAD and turn-detection settings follow a reset-then-apply cascade:

1. The template-level `vad_config` applies as the baseline.
2. On entering a node that has its own config, the template settings are reset.
3. The node-level settings then apply in full, with no merging.

No merging

Node-level `vad_config` replaces the template config entirely. Include every field you need: any field you omit reverts to the system default, not to the template's value.
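The reset-then-apply rule can be sketched as a small resolver. This is an illustration under stated assumptions, not the Breeze Buddy implementation: `effective_vad_config` is a hypothetical name, the values in `SYSTEM_DEFAULTS` are placeholders because this guide does not list the real system defaults, and the sketch assumes a node without its own config keeps the template baseline.

```python
# Placeholder values: the real system defaults are not given in this guide.
SYSTEM_DEFAULTS = {
    "confidence": 0.7,
    "start_secs": 0.2,
    "stop_secs": 0.8,
    "min_volume": 0.6,
}


def effective_vad_config(template_vad, node_vad):
    """Resolve the VAD config in effect for a node."""
    if node_vad is None:
        # Assumption: a node without its own config keeps the template baseline.
        return {**SYSTEM_DEFAULTS, **(template_vad or {})}
    # Reset, then apply: template values are discarded, not merged.
    return {**SYSTEM_DEFAULTS, **node_vad}


template = {"confidence": 0.4, "stop_secs": 1.2}
node = {"stop_secs": 0.5}
resolved = effective_vad_config(template, node)
```

Here `resolved["stop_secs"]` is `0.5` from the node, but `resolved["confidence"]` is the placeholder system default `0.7`, not the template's `0.4`. To keep the template's value in the node, repeat `"confidence": 0.4` in the node-level config.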

Full example

VAD + Smart Turn (Production)

```json
{
  "configurations": {
    "vad_config": {
      "confidence": 0.5,
      "start_secs": 0.2,
      "stop_secs": 0.8,
      "min_volume": 0.6
    },
    "stt_configuration": {
      "provider": "deepgram",
      "language": "en",
      "turn_detection": "smart_turn",
      "deepgram": {
        "model": "nova-3-general",
        "endpointing_ms": 25,
        "no_delay": true
      },
      "smart_turn": {
        "stop_secs": 3.0,
        "pre_speech_ms": 500.0,
        "max_duration_secs": 8.0,
        "cpu_count": 1
      }
    }
  }
}
```