Langfuse alert response

On-call runbook for responding when a Langfuse LLM evaluator fires a zero-score Slack alert.

A Slack alert named “breeze buddy outcome correctness” (or another configured evaluator) has fired. A call just failed an LLM-judged evaluation — the bot said or did something wrong. This runbook gets you from “I got paged” to either “fixed” or “escalated” in under 15 minutes.

Prerequisites

  • You are on-call and have Slack access.
  • You have Langfuse viewer access on the configured base URL.
  • You can reach the recording URL (GCS / S3 depending on deployment).

Symptom

The Slack message contains:

  • Evaluator name (e.g., breeze buddy outcome correctness).
  • Trace ID (links to Langfuse).
  • Call SID.
  • Merchant ID.
  • Reported outcome.
  • Failure reason from the evaluator.
  • Recording URL.
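The fields above can be modeled as a small record for scripting or tooling. This is a sketch only: the field names are assumptions based on the list above, not the actual Slack message schema, and the Langfuse trace URL pattern is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class EvaluatorAlert:
    """Hypothetical shape of the zero-score alert; field names are assumptions."""
    evaluator_name: str    # e.g. "breeze buddy outcome correctness"
    trace_id: str          # links to the Langfuse trace
    call_sid: str
    merchant_id: str
    reported_outcome: str
    failure_reason: str    # the evaluator's explanation for the zero score
    recording_url: str


def langfuse_trace_url(base_url: str, project_id: str, trace_id: str) -> str:
    """Build a trace link from the configured base URL (URL layout is an assumption)."""
    return f"{base_url.rstrip('/')}/project/{project_id}/traces/{trace_id}"
```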

Diagnosis

Work through these in order. Most alerts resolve at step 2 or 3.

Step 1 — Confirm the alert is real

Open the Langfuse trace. Verify the evaluator actually ran, actually scored zero, and the reason is coherent. False positives from evaluator bugs happen; don’t chase them.

If the evaluator is clearly broken, mark the alert “evaluator bug” in the thread and ping the evaluation team. Stop here.
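To confirm the score programmatically rather than by eye, you can pull the trace from the Langfuse public API (basic auth with the project's public/secret keys) and filter its scores. A minimal stdlib-only sketch, assuming the trace response carries a `scores` list with `name` and `value` fields:

```python
import base64
import json
import urllib.request


def fetch_trace(base_url: str, public_key: str, secret_key: str, trace_id: str) -> dict:
    """Fetch a trace from the Langfuse public API using basic auth."""
    url = f"{base_url.rstrip('/')}/api/public/traces/{trace_id}"
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    req = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def zero_scores(trace: dict, evaluator_name: str) -> list:
    """Scores on this trace that match the evaluator name and are exactly 0."""
    return [
        s for s in trace.get("scores", [])
        if s.get("name") == evaluator_name and s.get("value") == 0
    ]
```

If `zero_scores(...)` comes back empty, the alert did not match what Langfuse recorded; treat that as a possible evaluator or alerting bug.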

Step 2 — Listen to the recording

Open the recording URL. Listen to the last 60–90 seconds. You are trying to confirm the evaluator’s failure reason matches what the bot actually did.

  • If the bot did the wrong thing: go to Step 3.
  • If the bot did the right thing but the evaluator scored it wrong: evaluator bug — see Step 1 disposition.
  • If the call ended abnormally (abrupt disconnect, telephony error): see Call troubleshooting instead.

Step 3 — Classify the failure

Match what you heard to one of these patterns:

  • Hallucinated data (invented an order number, price, policy): ping the prompt / template owner. Usually fixed by tightening node task_messages or adding an explicit task_instructions.
  • Wrong branch taken: inspect function_calls in the trace. The LLM called the wrong function; tighten function descriptions or node transitions.
  • Incorrect language / code-switch: check payload_based_language_selection and the language in the payload. Often a misconfigured template.
  • Stuck in a loop: look at node transitions in the trace. Usually a missing transition_to or a function returning the wrong outcome.
  • Silent / no audio: check STT transcription events; may be a VAD tuning issue, not an LLM issue. Redirect to voice-pipeline on-call.
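If these patterns recur often enough to automate, the classification step can be sketched as a keyword triage over the evaluator's failure reason. The pattern names and routing targets mirror the list above; the keywords themselves are illustrative assumptions, not a maintained taxonomy.

```python
# (keywords, pattern, where to route) — keywords are illustrative guesses.
TRIAGE = [
    (("invented", "hallucinat", "made up"), "hallucinated-data", "prompt/template owner"),
    (("wrong function", "wrong branch"), "wrong-branch", "template owner"),
    (("language", "code-switch"), "wrong-language", "template config"),
    (("loop", "repeated"), "stuck-in-loop", "template owner"),
    (("silent", "no audio", "no transcript"), "no-audio", "voice-pipeline on-call"),
]


def classify(failure_reason: str):
    """Return (pattern, route) for a failure reason, or None if nothing matches."""
    reason = failure_reason.lower()
    for keywords, pattern, route in TRIAGE:
        if any(k in reason for k in keywords):
            return pattern, route
    return None  # fall back to manual triage
```

An unmatched reason should still go through the manual steps above rather than being dropped.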

Step 4 — Mitigate or escalate

  • If the root cause is a template or prompt issue, open a fix PR against the template. Include the trace ID in the PR.
  • If the root cause is model behaviour (e.g., the LLM ignored the system prompt), escalate to the LLM team with the trace and recording.
  • If the root cause is infra (STT, telephony), open an incident in your usual channel and redirect.

Close out

Post a short summary in the alert thread: cause, action taken, whether a follow-up is needed. This is how the team learns what patterns recur.

Deduplication

You won’t be paged again for the same call_sid within the next hour. If the same type of failure keeps firing across different calls, that’s a real pattern — escalate.
