How-to
STT (speech-to-text)
Provider selection, language settings, turn detection, and provider-specific tuning for stt_configuration.
Overview
The stt_configuration field inside ConfigurationModel controls how incoming audio is transcribed. Choose a provider, set the language, pick a turn-detection strategy, and optionally fine-tune provider-specific parameters.
Top-level fields
| Field | Type | Default | Description |
|---|---|---|---|
| provider | STTProvider | soniox | STT provider. One of soniox, deepgram, sarvam, openai, google. |
| language | str \| list[str] | null | BCP-47 code(s). Pass a list for multilingual recognition. |
| payload_based_language_selection | bool | false | When true, the LLM infers the language from the lead payload at runtime. |
| turn_detection | TurnDetectionMode | stt_native | Turn-detection strategy. One of stt_native, smart_turn, timeout. |
| user_speech_timeout | float | 0.3 | Seconds of silence before a turn is finalised. Honoured only when turn_detection is "timeout"; in any other mode it is silently reset to 0.0. |
| deepgram | DeepgramConfig | null | Deepgram-specific tuning. See below. |
| soniox | SonioxConfig | null | Soniox-specific tuning. |
| sarvam | SarvamConfig | null | Sarvam-specific tuning. |
| smart_turn | SmartTurnConfig | null | Tuning for turn_detection: "smart_turn". |
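
The interaction between turn_detection and user_speech_timeout is easy to miss, so here is a minimal sketch of the rule as the table states it. It is an illustration only, not the library's implementation: the helper name effective_speech_timeout is hypothetical, and the defaults are copied from the table above.

```python
from typing import Any, Mapping

# Defaults copied from the top-level fields table.
DEFAULT_TURN_DETECTION = "stt_native"
DEFAULT_USER_SPEECH_TIMEOUT = 0.3


def effective_speech_timeout(stt_configuration: Mapping[str, Any]) -> float:
    """Return the silence timeout that would actually apply (hypothetical helper)."""
    mode = stt_configuration.get("turn_detection", DEFAULT_TURN_DETECTION)
    timeout = stt_configuration.get("user_speech_timeout", DEFAULT_USER_SPEECH_TIMEOUT)
    # The timeout is honoured only in "timeout" mode; otherwise it is reset to 0.0.
    return float(timeout) if mode == "timeout" else 0.0
```

Because the default mode is stt_native, setting user_speech_timeout on its own has no effect under this rule; turn_detection must also be set to "timeout".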
Provider comparison
| Provider | Strengths | Native Turn Detection | Multilingual | Best Turn Mode |
|---|---|---|---|---|
| soniox | Low latency, <end> token | Yes | Yes | stt_native |
| deepgram | Nova-3, smart formatting | Limited | Auto-detect | smart_turn |
| sarvam | Indic language specialist | No | No | timeout |
| google | Broad language coverage | Limited | Yes | timeout |
| openai | Whisper-based real-time API | Yes | Yes | stt_native |
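
If configurations are generated programmatically, the "Best Turn Mode" column can be captured as a lookup. The sketch below simply transcribes that column; the names BEST_TURN_MODE and recommended_turn_mode are hypothetical and not part of any documented API.

```python
# Provider -> suggested turn_detection value, transcribed from the comparison table.
BEST_TURN_MODE = {
    "soniox": "stt_native",
    "deepgram": "smart_turn",
    "sarvam": "timeout",
    "google": "timeout",
    "openai": "stt_native",
}


def recommended_turn_mode(provider: str) -> str:
    """Look up the suggested turn_detection value (hypothetical helper)."""
    try:
        return BEST_TURN_MODE[provider]
    except KeyError:
        # Fail loudly rather than guess a mode for an unlisted provider.
        raise ValueError(f"unknown STT provider: {provider!r}") from None
```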
Provider-specific config
Deepgram
| Field | Default | Description |
|---|---|---|
| model | nova-3-general | Deepgram model name. |
| endpointing_ms | 25 | Silence threshold (ms) before endpoint. |
| utterance_end_ms | - | Max wait (ms) for utterance end signal. |
| no_delay | true | Disable internal buffering for lower latency. |
| smart_format | true | Auto capitalization and punctuation. |
| numerals | true | Convert spoken numbers to digits. |
| profanity_filter | false | Mask profane words. |
| auto_detect_language | false | Auto-detect spoken language. |
Soniox
Fields: context (domain hint), model (model identifier).
Sarvam
Fields: model, language_code (e.g. hi-IN). Single-language only.
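
As an illustration, a Sarvam setup paired with timeout-based turn detection (the mode the comparison table suggests for it) might look like the sketch below. It uses only fields named on this page; leaving out the top-level language and sarvam.model assumes their defaults are acceptable, which this page does not confirm.

```python
import json

# Sketch of a Sarvam configuration using timeout-based turn detection.
sarvam_stt = {
    "stt_configuration": {
        "provider": "sarvam",
        "turn_detection": "timeout",
        "user_speech_timeout": 0.3,  # table default; honoured only in timeout mode
        "sarvam": {"language_code": "hi-IN"},  # example value from the field list
    }
}

# Serialise to the JSON shape used elsewhere on this page.
print(json.dumps(sarvam_stt, indent=2))
```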
Recommended configuration
```json
{
  "stt_configuration": {
    "provider": "deepgram",
    "language": "en",
    "turn_detection": "smart_turn",
    "deepgram": {
      "model": "nova-3-general",
      "endpointing_ms": 25,
      "utterance_end_ms": 1000,
      "no_delay": true,
      "smart_format": true,
      "numerals": true
    },
    "smart_turn": {
      "stop_secs": 3.0,
      "pre_speech_ms": 500.0,
      "max_duration_secs": 8.0,
      "cpu_count": 1
    }
  }
}
```
Language selection
- Single: "language": "en"
- Multiple: "language": ["en", "hi"] (the provider must support multilingual recognition)
- Payload-based: set "payload_based_language_selection": true and the LLM infers the language from the lead payload
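
The single-versus-list distinction can be checked before a configuration is submitted. The following is a rough sketch under stated assumptions: normalize_language is a hypothetical helper, and only providers marked "No" in the Multilingual column of the comparison table (currently just sarvam) are treated as single-language.

```python
from typing import Union

# Providers marked "No" in the Multilingual column of the comparison table.
SINGLE_LANGUAGE_PROVIDERS = {"sarvam"}


def normalize_language(provider: str, language: Union[str, list[str], None]) -> list[str]:
    """Return the language setting as a list, rejecting unsupported lists."""
    if language is None:
        return []  # null: no language configured
    # Accept either a single code or a list of codes.
    codes = [language] if isinstance(language, str) else list(language)
    if len(codes) > 1 and provider in SINGLE_LANGUAGE_PROVIDERS:
        raise ValueError(f"{provider} is single-language; got {codes}")
    return codes
```

For example, normalize_language("soniox", ["en", "hi"]) returns both codes, while the same list with "sarvam" raises a ValueError.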
Production recommendation
Use Deepgram + smart_turn for the best balance of transcription quality and natural turn-taking.