How-to

STT (speech-to-text)

Provider selection, language settings, turn detection, and provider-specific tuning for stt_configuration.

Overview

The stt_configuration field inside ConfigurationModel controls how incoming audio is transcribed. Choose a provider, set the language, pick a turn-detection strategy, and optionally fine-tune provider-specific parameters.

Top-level fields

Field	Type	Default	Description
`provider`	`STTProvider`	`soniox`	One of `soniox`, `deepgram`, `sarvam`, `openai`, `google`.
`language`	`str | list[str]`	`null`	BCP-47 code(s). Pass a list for multilingual recognition.
`payload_based_language_selection`	`bool`	`false`	When `true`, the LLM infers the language from the lead payload at runtime.
`turn_detection`	`TurnDetectionMode`	`stt_native`	One of `stt_native`, `smart_turn`, `timeout`.
`user_speech_timeout`	`float`	`0.3`	Seconds of silence before a turn is finalised. Honoured only when `turn_detection` is `"timeout"`; in every other mode it is silently reset to `0.0`.
`deepgram`	`DeepgramConfig`	`null`	Deepgram-specific tuning. See below.
`soniox`	`SonioxConfig`	`null`	Soniox-specific tuning.
`sarvam`	`SarvamConfig`	`null`	Sarvam-specific tuning.
`smart_turn`	`SmartTurnConfig`	`null`	Tuning for `turn_detection: "smart_turn"`.
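
The interaction between `turn_detection` and `user_speech_timeout` is easy to miss, so here is a minimal sketch of the rule as the table states it. The helper name `effective_speech_timeout` is hypothetical and only mirrors the documented behaviour; it is not the library's implementation.

```python
def effective_speech_timeout(stt_configuration: dict) -> float:
    """Return the timeout that applies under the rule in the table above.

    Hypothetical helper for illustration only.
    """
    # Documented defaults: turn_detection="stt_native", user_speech_timeout=0.3
    mode = stt_configuration.get("turn_detection", "stt_native")
    timeout = float(stt_configuration.get("user_speech_timeout", 0.3))
    # The value is honoured only in "timeout" mode; otherwise it is reset to 0.0.
    return timeout if mode == "timeout" else 0.0
```

In other words, setting `user_speech_timeout` while leaving `turn_detection` at its default has no effect.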

Provider comparison

Provider	Strengths	Native Turn Detection	Multilingual	Best Turn Mode
`soniox`	Low latency, `<end>` token	Yes	Yes	`stt_native`
`deepgram`	Nova-3, smart formatting	Limited	Auto-detect	`smart_turn`
`sarvam`	Indic language specialist	No	No	`timeout`
`google`	Broad language coverage	Limited	Yes	`timeout`
`openai`	Whisper-based real-time API	Yes	Yes	`stt_native`
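
If you build configurations programmatically, the "Best Turn Mode" column can be encoded as a lookup. The sketch below assumes a hypothetical helper, `with_suggested_turn_mode`; the mapping is copied from the table, and nothing here is part of the product's API.

```python
# Values copied from the "Best Turn Mode" column of the comparison table.
BEST_TURN_MODE = {
    "soniox": "stt_native",
    "deepgram": "smart_turn",
    "sarvam": "timeout",
    "google": "timeout",
    "openai": "stt_native",
}


def with_suggested_turn_mode(provider: str, language: str) -> dict:
    """Build a minimal stt_configuration using the table's suggested turn mode."""
    if provider not in BEST_TURN_MODE:
        raise ValueError(f"Unknown STT provider: {provider!r}")
    return {
        "stt_configuration": {
            "provider": provider,
            "language": language,
            "turn_detection": BEST_TURN_MODE[provider],
        }
    }
```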

Provider-specific config

Deepgram

Field	Default	Description
`model`	`nova-3-general`	Deepgram model name.
`endpointing_ms`	`25`	Silence threshold (ms) before endpoint.
`utterance_end_ms`	—	Max wait (ms) for utterance end signal.
`no_delay`	`true`	Disable internal buffering for lower latency.
`smart_format`	`true`	Auto capitalization and punctuation.
`numerals`	`true`	Convert spoken numbers to digits.
`profanity_filter`	`false`	Mask profane words.
`auto_detect_language`	`false`	Auto-detect spoken language.
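
One way to keep a `deepgram` block consistent with the table above is to start from the documented defaults and apply only the overrides you need. This is a sketch under that assumption; `deepgram_block` and the two constants are hypothetical names, and the real service may validate fields differently.

```python
# Defaults copied from the Deepgram table above.
DEEPGRAM_DEFAULTS = {
    "model": "nova-3-general",
    "endpointing_ms": 25,
    "no_delay": True,
    "smart_format": True,
    "numerals": True,
    "profanity_filter": False,
    "auto_detect_language": False,
}
# utterance_end_ms has no documented default, so it is allowed but not pre-filled.
DEEPGRAM_FIELDS = set(DEEPGRAM_DEFAULTS) | {"utterance_end_ms"}


def deepgram_block(**overrides) -> dict:
    """Merge overrides onto the documented defaults, rejecting unknown keys."""
    unknown = set(overrides) - DEEPGRAM_FIELDS
    if unknown:
        raise ValueError(f"Unknown Deepgram fields: {sorted(unknown)}")
    return {**DEEPGRAM_DEFAULTS, **overrides}
```

For example, `deepgram_block(utterance_end_ms=1000)` yields the defaults plus an explicit utterance-end wait, and a misspelt key such as `endpointing` fails fast instead of being silently ignored.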

Soniox

Fields: `context` (domain hint), `model` (model identifier).

Sarvam

Fields: `model`, `language_code` (e.g. `hi-IN`). Single-language only.
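
As an illustration, a Sarvam setup paired with the `timeout` turn mode that the comparison table suggests might look like the sketch below. The `model` field is omitted because this page does not list valid Sarvam model names, and whether the top-level `language` field must also be set alongside `language_code` is not specified here.

```python
import json

# Sketch only: field names come from this page, the values are illustrative.
sarvam_config = {
    "stt_configuration": {
        "provider": "sarvam",
        "turn_detection": "timeout",
        # Honoured because turn_detection is "timeout"; 0.3 is the documented default.
        "user_speech_timeout": 0.3,
        "sarvam": {"language_code": "hi-IN"},
    }
}

# Serialise to the JSON form used elsewhere on this page.
sarvam_json = json.dumps(sarvam_config, indent=2)
print(sarvam_json)
```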

Recommended configuration

Deepgram + Smart Turn (Recommended)

```json
{
  "stt_configuration": {
    "provider": "deepgram",
    "language": "en",
    "turn_detection": "smart_turn",
    "deepgram": {
      "model": "nova-3-general",
      "endpointing_ms": 25,
      "utterance_end_ms": 1000,
      "no_delay": true,
      "smart_format": true,
      "numerals": true
    },
    "smart_turn": {
      "stop_secs": 3.0,
      "pre_speech_ms": 500.0,
      "max_duration_secs": 8.0,
      "cpu_count": 1
    }
  }
}
```

Language selection

Single: `"language": "en"`
Multiple: `"language": ["en", "hi"]` (the provider must support multilingual recognition)
Payload-based: set `"payload_based_language_selection": true` and the LLM infers the language from the lead payload
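
The single-versus-multiple rule can be checked before a configuration is submitted. The sketch below assumes a hypothetical helper, `language_codes`, and treats only Sarvam as single-language, since it is the one provider this page marks as not multilingual; it does not check that codes are valid BCP-47.

```python
# Sarvam is the only provider listed as "Multilingual: No" on this page.
SINGLE_LANGUAGE_ONLY = {"sarvam"}


def language_codes(provider: str, language) -> list:
    """Normalise `language` (a str or a list of str) to a list and check provider support."""
    codes = [language] if isinstance(language, str) else list(language)
    if not codes:
        raise ValueError("language must contain at least one BCP-47 code")
    if len(codes) > 1 and provider in SINGLE_LANGUAGE_ONLY:
        raise ValueError(f"{provider} accepts a single language only")
    return codes
```

A call such as `language_codes("sarvam", ["en", "hi"])` raises `ValueError`, while the same list passes for `soniox`.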

Production recommendation

Use Deepgram + smart_turn for the best balance of transcription quality and natural turn-taking.
