
STT (speech-to-text)

Provider selection, language settings, turn detection, and provider-specific tuning for stt_configuration.

Overview

The stt_configuration field inside ConfigurationModel controls how incoming audio is transcribed. Choose a provider, set the language, pick a turn-detection strategy, and optionally fine-tune provider-specific parameters.

Top-level fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| provider | STTProvider | soniox | soniox, deepgram, sarvam, openai, google |
| language | str \| list[str] | null | BCP-47 code(s). Pass a list for multilingual recognition. |
| payload_based_language_selection | bool | false | LLM detects language from lead payload dynamically. |
| turn_detection | TurnDetectionMode | stt_native | stt_native, smart_turn, or timeout. |
| user_speech_timeout | float | 0.3 | Seconds of silence before finalising a turn. Honoured only when turn_detection: "timeout"; otherwise silently reset to 0.0. |
| deepgram | DeepgramConfig | null | Deepgram-specific tuning. See below. |
| soniox | SonioxConfig | null | Soniox-specific tuning. |
| sarvam | SarvamConfig | null | Sarvam-specific tuning. |
| smart_turn | SmartTurnConfig | null | Tuning for turn_detection: "smart_turn". |
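
The interaction between turn_detection and user_speech_timeout is easy to miss, so here is a minimal Python sketch of the documented behaviour. The helper name normalize_stt_config is hypothetical and is not part of the platform's API; it only mirrors the defaults and the reset rule listed in the table, working on a plain dict.

```python
from copy import deepcopy


def normalize_stt_config(stt: dict) -> dict:
    """Hypothetical helper: apply the documented defaults and the
    user_speech_timeout rule to a plain stt_configuration dict."""
    out = deepcopy(stt)  # do not mutate the caller's dict

    # Defaults as listed in the top-level fields table.
    out.setdefault("provider", "soniox")
    out.setdefault("turn_detection", "stt_native")
    out.setdefault("user_speech_timeout", 0.3)

    # The timeout is honoured only in "timeout" mode; in every other
    # mode the documentation says it is reset to 0.0.
    if out["turn_detection"] != "timeout":
        out["user_speech_timeout"] = 0.0
    return out


if __name__ == "__main__":
    print(normalize_stt_config({"turn_detection": "timeout", "user_speech_timeout": 0.8}))
    print(normalize_stt_config({"turn_detection": "smart_turn", "user_speech_timeout": 0.8}))
```

In practice this means a custom user_speech_timeout has no effect unless turn_detection is also set to "timeout".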

Provider comparison

| Provider | Strengths | Native Turn Detection | Multilingual | Best Turn Mode |
| --- | --- | --- | --- | --- |
| soniox | Low latency, `<end>` token | Yes | Yes | stt_native |
| deepgram | Nova-3, smart formatting | Limited | Auto-detect | smart_turn |
| sarvam | Indic language specialist | No | No | timeout |
| google | Broad language coverage | Limited | Yes | timeout |
| openai | Whisper-based real-time API | Yes | Yes | stt_native |
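
If configurations are generated programmatically, the "Best Turn Mode" column can be encoded as a simple lookup. The sketch below is illustrative only: BEST_TURN_MODE and recommended_turn_mode are hypothetical names, and the mapping is copied from the comparison table rather than read from the platform.

```python
# Mapping copied from the "Best Turn Mode" column of the comparison table.
BEST_TURN_MODE = {
    "soniox": "stt_native",
    "deepgram": "smart_turn",
    "sarvam": "timeout",
    "google": "timeout",
    "openai": "stt_native",
}


def recommended_turn_mode(provider: str) -> str:
    """Hypothetical helper: return the suggested turn_detection value
    for a provider, or raise ValueError for an unknown provider."""
    try:
        return BEST_TURN_MODE[provider]
    except KeyError:
        raise ValueError(f"unknown STT provider: {provider!r}") from None


if __name__ == "__main__":
    for name in BEST_TURN_MODE:
        print(name, "->", recommended_turn_mode(name))
```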

Provider-specific config

Deepgram

| Field | Default | Description |
| --- | --- | --- |
| model | nova-3-general | Deepgram model name. |
| endpointing_ms | 25 | Silence threshold (ms) before endpoint. |
| utterance_end_ms | | Max wait (ms) for utterance end signal. |
| no_delay | true | Disable internal buffering for lower latency. |
| smart_format | true | Auto capitalization and punctuation. |
| numerals | true | Convert spoken numbers to digits. |
| profanity_filter | false | Mask profane words. |
| auto_detect_language | false | Auto-detect spoken language. |

Soniox

Fields: context (domain hint), model (model identifier).

Sarvam

Fields: model, language_code (e.g. hi-IN). Single-language only.
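
As a rough illustration, a Sarvam configuration paired with the timeout turn mode could be assembled as below. This is a sketch under assumptions: sarvam_stt_config is a hypothetical helper, setting the top-level language to the same code as sarvam.language_code is a guess at how the two fields relate, and the model field is left out because no Sarvam model identifier is given here.

```python
import json


def sarvam_stt_config(language_code: str = "hi-IN", timeout_s: float = 0.3) -> dict:
    """Hypothetical helper: build a single-language Sarvam configuration
    that uses the "timeout" turn mode."""
    return {
        "stt_configuration": {
            "provider": "sarvam",
            # Assumption: the top-level language mirrors the Sarvam code.
            "language": language_code,
            "turn_detection": "timeout",
            # Honoured because turn_detection is "timeout".
            "user_speech_timeout": timeout_s,
            # "model" is omitted on purpose; no identifier is documented here.
            "sarvam": {"language_code": language_code},
        }
    }


if __name__ == "__main__":
    print(json.dumps(sarvam_stt_config(), indent=2))
```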

Deepgram + Smart Turn (Recommended)
```json
{
  "stt_configuration": {
    "provider": "deepgram",
    "language": "en",
    "turn_detection": "smart_turn",
    "deepgram": {
      "model": "nova-3-general",
      "endpointing_ms": 25,
      "utterance_end_ms": 1000,
      "no_delay": true,
      "smart_format": true,
      "numerals": true
    },
    "smart_turn": {
      "stop_secs": 3.0,
      "pre_speech_ms": 500.0,
      "max_duration_secs": 8.0,
      "cpu_count": 1
    }
  }
}
```

Language selection

  • Single: "language": "en"
  • Multiple: "language": ["en", "hi"] (the provider must support multilingual recognition)
  • Payload-based: set "payload_based_language_selection": true and the LLM infers the language from the lead payload
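
A list of languages only makes sense for a provider that supports multilingual recognition, so a pre-flight check can catch mismatches early. The sketch below is an assumption-laden illustration: check_language and SINGLE_LANGUAGE_ONLY are hypothetical names, and the set contains only sarvam because that is the one provider described here as single-language only.

```python
# Providers documented as single-language only (from the Sarvam section
# and the comparison table). Hypothetical constant, not a platform API.
SINGLE_LANGUAGE_ONLY = {"sarvam"}


def check_language(provider: str, language) -> list:
    """Hypothetical helper: normalise `language` (None, str, or list of
    str) to a list of codes and reject multi-language lists for
    single-language providers."""
    if language is None:
        return []
    codes = [language] if isinstance(language, str) else list(language)
    if len(codes) > 1 and provider in SINGLE_LANGUAGE_ONLY:
        raise ValueError(
            f"{provider} supports a single language, got {codes}"
        )
    return codes


if __name__ == "__main__":
    print(check_language("soniox", ["en", "hi"]))
    print(check_language("sarvam", "hi-IN"))
```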

Production recommendation

Use Deepgram + smart_turn for the best balance of transcription quality and natural turn-taking.
