TTS (text-to-speech)
Pick a TTS provider, configure the voice, and override per-language or per-provider with tts_configuration_overrides.
Overview
Breeze Buddy synthesises bot speech with one of four providers: ElevenLabs (default), Cartesia, Sarvam, and Google. Pick one by setting tts_configuration. If you need per-language or per-provider variants on the same template, add tts_configuration_overrides.
Top-level shape
Inside configurations:
| Field | Type | Required | Description |
|---|---|---|---|
| tts_configuration | TTSConfig | Yes | The default TTS provider and voice for the template. |
| tts_configuration_overrides | Dict[str, TTSConfig] | No | Per-provider override map. The dict key (e.g. "cartesia") auto-fills the entry’s provider field, so each override can target a specific provider without repeating it. |
| tts_selection_config | TTSSelectionConfig | No | LLM-based per-utterance provider selection. Use only when genuinely multilingual. |
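As a rough illustration of how the override map could be resolved, the sketch below looks up a provider in tts_configuration_overrides and fills in the provider field from the dict key. The helper name resolve_tts_config is hypothetical, and the behaviour is an assumption rather than the documented implementation: an override is used as-is and does not inherit fields from tts_configuration.

```python
from typing import Any, Dict


def resolve_tts_config(configurations: Dict[str, Any], provider: str) -> Dict[str, Any]:
    """Return the TTS config to use for `provider` (hypothetical helper)."""
    default = configurations["tts_configuration"]
    overrides = configurations.get("tts_configuration_overrides") or {}

    if provider in overrides:
        # The dict key supplies the provider field, so the entry need not repeat it.
        # Assumption: the override does not inherit fields from the default config.
        return {**overrides[provider], "provider": provider}

    if provider == default["provider"]:
        # No override for this provider: fall back to the template default.
        return dict(default)

    raise KeyError(f"no TTS configuration for provider {provider!r}")
```

With this sketch, asking for "cartesia" on a template whose override map has a "cartesia" entry returns that entry with "provider": "cartesia" added, while asking for the default provider returns tts_configuration unchanged.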
TTSConfig fields
| Field | Type | Provider | Description |
|---|---|---|---|
| provider | TTSProvider | all | elevenlabs, cartesia, sarvam, or google. |
| voice_id | str | all | Provider-specific voice identifier. |
| model_id | str | ElevenLabs | Model, e.g. eleven_turbo_v2_5, eleven_flash_v2_5. |
| speed | float | all | Speech rate. Ranges vary by provider: ElevenLabs 0.7–1.2, Cartesia 0.6–1.5. |
| volume | float | Cartesia | 0.5–2.0. |
| emotion | list[str] | Cartesia | Emotion tags, e.g. ["positivity:high", "curiosity"]. |
| language | str | all | BCP-47 language code. |
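The ranges and provider-specific fields in the table lend themselves to a quick pre-flight check before a template is saved. The following is a minimal sketch, not part of Breeze Buddy: check_tts_config and its constants are hypothetical names, and it only checks the limits quoted in the table (speed ranges for Sarvam and Google are not listed, so they are left unchecked).

```python
from typing import Any, Dict, List

# Limits copied from the table above; providers without a listed range are skipped.
SPEED_RANGES = {"elevenlabs": (0.7, 1.2), "cartesia": (0.6, 1.5)}
VOLUME_RANGE = (0.5, 2.0)  # documented for Cartesia only
PROVIDER_ONLY_FIELDS = {
    "model_id": "elevenlabs",
    "volume": "cartesia",
    "emotion": "cartesia",
}


def check_tts_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; empty means nothing was flagged."""
    problems: List[str] = []
    provider = config.get("provider")

    # Flag fields that the table documents for a different provider.
    for field, owner in PROVIDER_ONLY_FIELDS.items():
        if field in config and provider != owner:
            problems.append(f"{field} is documented for {owner}, not {provider}")

    # Check speed against the provider's documented range, when one is listed.
    speed = config.get("speed")
    if speed is not None and provider in SPEED_RANGES:
        low, high = SPEED_RANGES[provider]
        if not low <= speed <= high:
            problems.append(f"speed {speed} is outside {low}-{high} for {provider}")

    # Check volume only for Cartesia, the one provider with a documented range.
    volume = config.get("volume")
    if volume is not None and provider == "cartesia":
        low, high = VOLUME_RANGE
        if not low <= volume <= high:
            problems.append(f"volume {volume} is outside {low}-{high} for cartesia")

    return problems
```

Returning a list of messages rather than raising on the first problem makes it easier to surface every issue in a template at once.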
Picking a provider
| Feature | ElevenLabs | Cartesia | Sarvam | Google |
|---|---|---|---|---|
| Best for | Multilingual, natural prosody | Expressive English, low-latency | Indian languages | Broad language coverage |
| Emotion control | No | Yes (emotion tags) | No | Limited |
| Volume control | No | Yes (0.5–2.0) | No | Limited |
| Model selection | Yes (model_id) | No | No | No |
Example: default ElevenLabs
{
  "configurations": {
    "tts_configuration": {
      "provider": "elevenlabs",
      "voice_id": "pFZP5JQG7iQjIQuC4Bku",
      "model_id": "eleven_turbo_v2_5",
      "speed": 1.05,
      "language": "en"
    }
  }
}
Example: Cartesia with emotion
{
  "configurations": {
    "tts_configuration": {
      "provider": "cartesia",
      "voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091",
      "speed": 1.1,
      "volume": 1.2,
      "emotion": ["positivity:high", "curiosity"],
      "language": "en"
    }
  }
}
Example: per-provider overrides
Use tts_configuration_overrides when you want the same template to produce different voices depending on which provider is chosen at runtime (e.g. via tts_selection_config). The dict key is the provider, so you don’t repeat it inside the config.
{
  "configurations": {
    "tts_configuration": {
      "provider": "elevenlabs",
      "voice_id": "pFZP5JQG7iQjIQuC4Bku",
      "model_id": "eleven_turbo_v2_5",
      "language": "en"
    },
    "tts_configuration_overrides": {
      "cartesia": {
        "voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091",
        "language": "en",
        "speed": 1.1
      },
      "sarvam": {
        "voice_id": "anushka",
        "language": "hi-IN"
      }
    }
  }
}
LLM-based TTS selection
tts_selection_config lets a Gemini LLM pick the provider for each utterance. The extra LLM call adds latency, so enable it only when you need multilingual switching.
| Field | Type | Description |
|---|---|---|
| enabled | bool | Master switch. |
| prompt | str | Guidance for the LLM picking the provider. |
| providers | list[TTSProvider] | Candidate providers. |
{
  "configurations": {
    "tts_configuration": {
      "provider": "elevenlabs",
      "voice_id": "pFZP5JQG7iQjIQuC4Bku",
      "model_id": "eleven_turbo_v2_5",
      "language": "en"
    },
    "tts_configuration_overrides": {
      "cartesia": {
        "voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091",
        "language": "en"
      }
    },
    "tts_selection_config": {
      "enabled": true,
      "prompt": "Use cartesia for English, elevenlabs for anything else.",
      "providers": ["cartesia", "elevenlabs"]
    }
  }
}
Latency trade-off
LLM-based selection adds a small latency overhead per utterance. Use only when you need multilingual provider switching.
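To make the trade-off concrete, here is one way the selection step might be structured so that the LLM is consulted only when tts_selection_config is enabled. This is a sketch under stated assumptions: pick_provider is a hypothetical helper, the choose callable stands in for the Gemini call (whose real interface is not described here), and falling back to the default provider when the LLM returns something outside providers is an assumption.

```python
from typing import Any, Callable, Dict, List

# Stand-in signature for the LLM call: (prompt, utterance, candidates) -> provider name.
Chooser = Callable[[str, str, List[str]], str]


def pick_provider(configurations: Dict[str, Any], utterance: str, choose: Chooser) -> str:
    """Pick the TTS provider for one utterance (hypothetical helper)."""
    default = configurations["tts_configuration"]["provider"]
    selection = configurations.get("tts_selection_config") or {}

    if not selection.get("enabled"):
        # Master switch off: skip the LLM call entirely, so no latency is added.
        return default

    candidates = selection.get("providers") or []
    choice = choose(selection.get("prompt", ""), utterance, candidates)

    # Assumption: an answer outside the candidate list falls back to the default.
    return choice if choice in candidates else default
```

Passing the chooser in as a parameter keeps the sketch testable without a network call; a real integration would wrap the actual LLM client behind the same signature.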
Best practices
- Start with tts_configuration pointing at one provider. Add tts_configuration_overrides only if you genuinely need different voices per provider.
- Fine-tune speed and volume from test calls; defaults are neutral but rarely perfect.
- Use Cartesia emotion tags for explicit expressiveness.
- Enable tts_selection_config only for multilingual or multi-accent flows.