Input collection

Accumulate multi-segment speech input for phone numbers, addresses, and other structured data.

Overview

Input Collection is a node-level configuration that changes how user speech is handled. Instead of sending each speech segment to the LLM immediately, the system waits for the user to finish speaking across multiple segments before processing.

This is essential for collecting structured data like phone numbers, email addresses, or postal addresses — where the user naturally pauses between segments (e.g. “nine eight seven… six five four… three two one zero”).

How it works

  1. User begins speaking in segments with natural pauses.
  2. Each segment is accumulated rather than sent immediately to the LLM.
  3. A timer starts after each segment — if no new speech arrives within user_speech_timeout seconds, the accumulated input is finalized.
  4. The complete accumulated text is sent to the LLM as a single user turn.
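The four steps above amount to a debounce: each new segment resets a timer, and the turn is finalized only when the timer fires. A minimal sketch of that loop, assuming an asyncio event loop (the class and callback names here are illustrative, not the framework's actual API):

```python
import asyncio

class InputAccumulator:
    """Sketch of the accumulate-then-finalize loop described above.
    `on_final` and the class itself are hypothetical names."""

    def __init__(self, user_speech_timeout: float, on_final):
        self.timeout = user_speech_timeout
        self.on_final = on_final                 # called with the full user turn
        self.segments: list[str] = []
        self._timer: asyncio.TimerHandle | None = None

    def add_segment(self, text: str) -> None:
        # Step 2: accumulate instead of forwarding to the LLM immediately.
        self.segments.append(text)
        # Step 3: (re)start the timer after each segment.
        if self._timer is not None:
            self._timer.cancel()
        loop = asyncio.get_running_loop()
        self._timer = loop.call_later(self.timeout, self._finalize)

    def _finalize(self) -> None:
        # Step 4: send the joined text to the LLM as a single user turn.
        self.on_final(" ".join(self.segments))
        self.segments = []
```

The key property is that the timer is reset on every segment, so a user who keeps pausing under the timeout never gets cut off mid-input.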

InputCollectionConfig

Input Collection is configured on individual FlowNodeModel nodes via the `input_collection` field.

| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | `false` | Enable input collection mode on this node. |
| `user_speech_timeout` | float | — | Seconds to wait after the last speech segment before finalizing. Must be >= 0.0. |
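The shape and constraint in the table can be expressed as a small dataclass. This is a sketch only: the field names match the documented JSON keys, but the class and its validation are our assumption, not the framework's actual model.

```python
from dataclasses import dataclass

@dataclass
class InputCollectionConfig:
    """Illustrative model of the documented config shape."""
    user_speech_timeout: float   # seconds; must be >= 0.0 (no documented default)
    enabled: bool = False        # documented default: false

    def __post_init__(self) -> None:
        if self.user_speech_timeout < 0.0:
            raise ValueError("user_speech_timeout must be >= 0.0")
```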

Node-Level Only

Input Collection is configured per flow node, not globally. Enable it only on nodes where multi-segment input is expected (e.g. “collect phone number” or “collect address” nodes).

Configuration example

```json
{
  "node_name": "collect_phone",
  "task_messages": [
    {
      "role": "system",
      "content": "Ask the customer for their phone number. Wait for them to speak the full number."
    }
  ],
  "input_collection": {
    "enabled": true,
    "user_speech_timeout": 2.5
  },
  "functions": [
    {
      "name": "phone_collected",
      "description": "Phone number has been collected",
      "properties": {
        "phone_number": { "type": "string", "description": "The collected phone number" }
      },
      "required": ["phone_number"],
      "transition_to": "next_step"
    }
  ]
}
```

Timeout tuning

The user_speech_timeout value controls how long the system waits after the last speech segment before considering the input complete:

| Value | Use case |
|---|---|
| 1.0–2.0 s | Short inputs — confirmation codes, yes/no with detail. |
| 2.0–3.0 s | Phone numbers — users typically pause between digit groups. |
| 3.0–5.0 s | Addresses — longer pauses between street, city, postcode. |

Too high vs. too low

Setting the timeout too low may cut off the user mid-input. Setting it too high adds unnecessary latency before the agent responds. Test with real speech patterns to find the right balance.
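The trade-off can be made concrete with a pure function that groups segments into turns based on the pause after each one. Everything here (function name, the example gap values) is illustrative; real systems work with live audio timing, not precomputed gaps.

```python
def group_turns(segments: list[str], gaps: list[float], timeout: float) -> list[str]:
    """Group speech segments into finalized turns: a pause longer than
    `timeout` seconds finalizes the accumulated input. `gaps[i]` is the
    silence (in seconds) after segments[i]."""
    turns, current = [], []
    for seg, gap in zip(segments, gaps):
        current.append(seg)
        if gap > timeout:
            turns.append(" ".join(current))
            current = []
    if current:
        turns.append(" ".join(current))
    return turns

segments = ["nine eight seven", "six five four", "three two one zero"]
gaps = [1.2, 1.2, 99.0]  # user pauses ~1.2 s between digit groups

group_turns(segments, gaps, timeout=2.5)
# -> ["nine eight seven six five four three two one zero"]  (one complete turn)
group_turns(segments, gaps, timeout=0.5)
# -> ["nine eight seven", "six five four", "three two one zero"]  (cut off mid-number)
```

With a 2.5 s timeout the 1.2 s pauses are bridged and the phone number arrives as one turn; with 0.5 s each digit group is finalized separately, so the LLM sees a partial number three times.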

When to use mute_stt instead

mute_stt / unmute_stt are pre/post-action handlers that stop the STT pipeline from processing audio entirely. Pair them with a node that reads a long fixed message (“please hold for verification”) where any user audio is noise you want to drop outright. Use them when:

  • You have a specific node that should ignore the user completely for a window of time.
  • Background noise during the bot’s speech is so heavy that keyword-filter and disabled_discard still leak through.

Use input_collection instead when:

  • You want the user’s speech, you just want to accumulate it across pauses (phone numbers, addresses, multi-segment answers).
  • The conversation is normal otherwise and you want the bot to react to the input.

See Global functions → Built-in handlers for the mute_stt action syntax.
