TTS (text-to-speech)

Pick a TTS provider, configure the voice, and override per-language or per-provider with tts_configuration_overrides.

Overview

Breeze Buddy synthesises bot speech with one of four providers: ElevenLabs (default), Cartesia, Sarvam, and Google. Pick one by setting tts_configuration. If you need per-language or per-provider variants on the same template, add tts_configuration_overrides.

Top-level shape

Inside configurations:

Field	Type	Required	Description
`tts_configuration`	`TTSConfig`	Yes	The default TTS provider and voice for the template.
`tts_configuration_overrides`	`Dict[str, TTSConfig]`	No	Per-provider override map. The dict key (e.g. `"cartesia"`) auto-fills the entry’s `provider` field, so each override can target a specific provider without repeating it.
`tts_selection_config`	`TTSSelectionConfig`	No	LLM-based per-utterance provider selection. Use only when genuinely multilingual.

TTSConfig fields

Field	Type	Provider	Description
`provider`	`TTSProvider`	all	`elevenlabs`, `cartesia`, `sarvam`, or `google`.
`voice_id`	`str`	all	Provider-specific voice identifier.
`model_id`	`str`	ElevenLabs	Model, e.g. `eleven_turbo_v2_5`, `eleven_flash_v2_5`.
`speed`	`float`	all	Speech rate. Ranges vary — ElevenLabs 0.7–1.2, Cartesia 0.6–1.5.
`volume`	`float`	Cartesia	0.5–2.0.
`emotion`	`list[str]`	Cartesia	Emotion tags, e.g. `["positivity:high", "curiosity"]`.
`language`	`str`	all	BCP-47 language code.
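The documented ranges can be checked before a template is saved. The following is a minimal sketch, not part of Breeze Buddy: the helper name `out_of_range_fields` is hypothetical, the limits are copied from the table above, and the bounds are assumed to be inclusive. Providers with no documented range are skipped rather than rejected.

```python
# Hypothetical helper, not a Breeze Buddy API. Limits come from the
# TTSConfig table; treating them as inclusive is an assumption.
SPEED_RANGES = {"elevenlabs": (0.7, 1.2), "cartesia": (0.6, 1.5)}
VOLUME_RANGES = {"cartesia": (0.5, 2.0)}


def out_of_range_fields(config: dict) -> list[str]:
    """Return the names of fields whose value is outside the documented range."""
    provider = config.get("provider")
    problems = []
    for field, ranges in (("speed", SPEED_RANGES), ("volume", VOLUME_RANGES)):
        value = config.get(field)
        bounds = ranges.get(provider)
        if value is None or bounds is None:
            continue  # field unset, or no documented range for this provider
        low, high = bounds
        if not low <= value <= high:
            problems.append(field)
    return problems
```

An empty list means every field with a documented range is within it.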

Picking a provider

Feature	ElevenLabs	Cartesia	Sarvam	Google
Best for	Multilingual, natural prosody	Expressive English, low-latency	Indian languages	Broad language coverage
Emotion control	No	Yes (`emotion` tags)	No	Limited
Volume control	No	Yes (0.5–2.0)	No	Limited
Model selection	Yes (`model_id`)	No	No	No
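The comparison can also be expressed as a lookup, which is handy when a script needs to shortlist providers by feature. This is an illustrative sketch only: `FEATURES` and `providers_with` are hypothetical names, and the values are transcribed from the table above, with "Limited" kept as its own level.

```python
# Hypothetical lookup transcribed from the comparison table above.
FEATURES = {
    "emotion_control": {"elevenlabs": "no", "cartesia": "yes", "sarvam": "no", "google": "limited"},
    "volume_control": {"elevenlabs": "no", "cartesia": "yes", "sarvam": "no", "google": "limited"},
    "model_selection": {"elevenlabs": "yes", "cartesia": "no", "sarvam": "no", "google": "no"},
}


def providers_with(feature: str, include_limited: bool = False) -> list[str]:
    """List providers that support a feature, optionally counting limited support."""
    accepted = {"yes", "limited"} if include_limited else {"yes"}
    return [name for name, level in FEATURES[feature].items() if level in accepted]
```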

Example — default ElevenLabs

tts-elevenlabs.json

```json
{
  "configurations": {
    "tts_configuration": {
      "provider": "elevenlabs",
      "voice_id": "pFZP5JQG7iQjIQuC4Bku",
      "model_id": "eleven_turbo_v2_5",
      "speed": 1.05,
      "language": "en"
    }
  }
}
```

Example — Cartesia with emotion

tts-cartesia.json

```json
{
  "configurations": {
    "tts_configuration": {
      "provider": "cartesia",
      "voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091",
      "speed": 1.1,
      "volume": 1.2,
      "emotion": ["positivity:high", "curiosity"],
      "language": "en"
    }
  }
}
```

Example — per-provider overrides

Use tts_configuration_overrides when you want the same template to produce different voices depending on which provider is chosen at runtime (e.g. via tts_selection_config). The dict key is the provider, so you don’t repeat it inside the config.

tts-overrides.json

```json
{
  "configurations": {
    "tts_configuration": {
      "provider": "elevenlabs",
      "voice_id": "pFZP5JQG7iQjIQuC4Bku",
      "model_id": "eleven_turbo_v2_5",
      "language": "en"
    },
    "tts_configuration_overrides": {
      "cartesia": {
        "voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091",
        "language": "en",
        "speed": 1.1
      },
      "sarvam": {
        "voice_id": "anushka",
        "language": "hi-IN"
      }
    }
  }
}
```
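To make the key-to-provider auto-fill concrete, here is a rough sketch of what the expansion might look like. It is not Breeze Buddy's actual implementation: `expand_overrides` is a hypothetical name, and letting the dict key win over any `provider` value already present in an entry is an assumption.

```python
# Hypothetical illustration of the auto-fill described above.
def expand_overrides(overrides: dict) -> dict:
    """Copy each override and set its provider from the dict key."""
    return {
        provider: {**entry, "provider": provider}  # key wins (assumption)
        for provider, entry in overrides.items()
    }


overrides = {
    "cartesia": {
        "voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091",
        "language": "en",
        "speed": 1.1,
    },
    "sarvam": {"voice_id": "anushka", "language": "hi-IN"},
}
expanded = expand_overrides(overrides)
```

The input map is left untouched, so the template as written stays free of repeated `provider` fields.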

LLM-based TTS selection

tts_selection_config lets a Gemini LLM pick the provider for each utterance. The extra LLM call adds latency, so use it only when you need multilingual switching.

Field	Type	Description
`enabled`	`bool`	Master switch.
`prompt`	`str`	Guidance for the LLM picking the provider.
`providers`	`list[TTSProvider]`	Candidate providers.

Multilingual TTS Selection

```json
{
  "configurations": {
    "tts_configuration": {
      "provider": "elevenlabs",
      "voice_id": "pFZP5JQG7iQjIQuC4Bku",
      "model_id": "eleven_turbo_v2_5",
      "language": "en"
    },
    "tts_configuration_overrides": {
      "cartesia": {
        "voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091",
        "language": "en"
      }
    },
    "tts_selection_config": {
      "enabled": true,
      "prompt": "Use cartesia for English, elevenlabs for anything else.",
      "providers": ["cartesia", "elevenlabs"]
    }
  }
}
```
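In the example above, every candidate in `providers` has a voice defined, either as the default or as an override. A quick pre-flight check for that property might look like the sketch below. The function name `uncovered_providers` is hypothetical, and the idea that a candidate without its own config is a mistake is an assumption; the source does not say how Breeze Buddy handles that case.

```python
# Hypothetical lint, not a Breeze Buddy API.
def uncovered_providers(configurations: dict) -> list[str]:
    """List selection candidates with neither the default config nor an override."""
    selection = configurations.get("tts_selection_config") or {}
    if not selection.get("enabled"):
        return []  # selection is off, so nothing to check
    covered = set(configurations.get("tts_configuration_overrides", {}))
    covered.add(configurations["tts_configuration"]["provider"])
    return [p for p in selection.get("providers", []) if p not in covered]


configurations = {
    "tts_configuration": {"provider": "elevenlabs", "voice_id": "pFZP5JQG7iQjIQuC4Bku"},
    "tts_configuration_overrides": {
        "cartesia": {"voice_id": "a0e99841-438c-4a64-b679-ae501e7d6091"}
    },
    "tts_selection_config": {"enabled": True, "providers": ["cartesia", "elevenlabs"]},
}
missing = uncovered_providers(configurations)
```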

Latency trade-off

LLM-based selection adds a small latency overhead per utterance. Use only when you need multilingual provider switching.

Best practices

Start with tts_configuration pointing at one provider. Add tts_configuration_overrides only if you genuinely need different voices per provider.
Fine-tune speed and, on Cartesia, volume from test calls; the defaults are neutral but rarely perfect.
Use Cartesia emotion tags for explicit expressiveness.
Enable tts_selection_config only for multilingual or multi-accent flows.

Next steps

STT: Pair TTS with a matching STT provider.
Configuration reference: Full ConfigurationModel shape.
Audio: Background sound, noise filter, keyword filter.
