Langfuse alert response
On-call runbook for a Langfuse LLM evaluator firing a zero-score Slack alert.
A Slack alert named “breeze buddy outcome correctness” (or another configured evaluator) has fired. A call just failed an LLM-judged evaluation — the bot said or did something wrong. This runbook gets you from “I got paged” to either “fixed” or “escalated” in under 15 minutes.
Prerequisites
- You are on-call and have Slack access.
- You have Langfuse viewer access on the configured base URL.
- You can reach the recording URL (GCS / S3 depending on deployment).
Symptom
The Slack message contains:
- Evaluator name (e.g., breeze buddy outcome correctness).
- Trace ID (links to Langfuse).
- Call SID.
- Merchant ID.
- Reported outcome.
- Failure reason from the evaluator.
- Recording URL.
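If you need these fields programmatically (e.g. in a triage script), a minimal parser is sketched below. The "Key: value" line format is an assumption for illustration; the real Slack payload may be structured differently, so adapt the keys to what your alert actually sends.

```python
# Hypothetical mapping from alert-message labels to field names.
# The "Key: value" line format is an assumption, not the confirmed
# shape of the real Slack alert.
FIELDS = {
    "Evaluator": "evaluator",
    "Trace ID": "trace_id",
    "Call SID": "call_sid",
    "Merchant ID": "merchant_id",
    "Outcome": "outcome",
    "Reason": "reason",
    "Recording": "recording_url",
}

def parse_alert(text: str) -> dict:
    """Extract known fields from a zero-score alert message body."""
    out = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        if key in FIELDS and value.strip():
            out[FIELDS[key]] = value.strip()
    return out
```

Missing fields are simply absent from the result, so downstream code should use `.get()` rather than assume every key is present.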
Diagnosis
Work through these in order. Most alerts resolve at step 2 or 3.
Step 1 — Confirm the alert is real
Open the Langfuse trace. Verify the evaluator actually ran, actually scored zero, and the reason is coherent. False positives from evaluator bugs happen; don’t chase them.
If the evaluator is clearly broken, mark the alert “evaluator bug” in thread and ping the evaluation team. Stop here.
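The three checks in this step (evaluator ran, score is zero, reason is coherent) can be encoded as a small helper. The score shape below — a list of dicts with `name`, `value`, and `comment` — is an assumption about what your Langfuse client returns; verify it against your SDK version before relying on it.

```python
def alert_disposition(scores: list[dict], evaluator: str) -> str:
    """Classify a firing alert as 'real', 'evaluator-bug', or 'no-score'.

    `scores` is assumed to be a list of {'name', 'value', 'comment'}
    dicts for the trace; this shape is an assumption, not a confirmed
    Langfuse API contract.
    """
    matching = [s for s in scores if s.get("name") == evaluator]
    if not matching:
        return "no-score"       # evaluator never ran: likely a pipeline bug
    score = matching[-1]        # most recent score wins
    if score.get("value") != 0:
        return "evaluator-bug"  # alert fired but the score isn't zero
    if not (score.get("comment") or "").strip():
        return "evaluator-bug"  # zero score with no coherent reason
    return "real"
```

Anything other than `"real"` maps to the "evaluator bug" disposition above: mark the thread and ping the evaluation team.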
Step 2 — Listen to the recording
Open the recording URL. Listen to the last 60–90 seconds. You are trying to confirm the evaluator’s failure reason matches what the bot actually did.
- If the bot did the wrong thing: go to Step 3.
- If the bot did the right thing but the evaluator scored it wrong: evaluator bug — see Step 1 disposition.
- If the call ended abnormally (abrupt disconnect, telephony error): see Call troubleshooting instead.
Step 3 — Classify the failure
| Pattern | What to do |
|---|---|
| Hallucinated data (invented an order number, price, policy) | Ping the prompt / template owner. Usually fixed by tightening node task_messages or adding an explicit task_instructions. |
| Wrong branch taken | Inspect function_calls in the trace. The LLM called the wrong function — tighten function descriptions or node transitions. |
| Incorrect language / code-switch | Check payload_based_language_selection and the language in the payload. Often a misconfigured template. |
| Stuck in a loop | Look at node transitions in the trace. Usually a missing transition_to or a function returning the wrong outcome. |
| Silent / no audio | Check STT transcription events; may be a VAD tuning issue, not an LLM issue. Redirect to voice-pipeline on-call. |
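A first-pass triage of the evaluator's failure reason can mirror the table above. The keyword lists are illustrative guesses, not a tested taxonomy; the recording and trace remain the source of truth, and this only saves a minute on obvious cases.

```python
# Keyword heuristics mirroring the classification table; the keywords
# are illustrative assumptions, not derived from real evaluator output.
PATTERNS = [
    ("hallucinated-data", ("invented", "hallucinat", "made up", "nonexistent")),
    ("wrong-branch", ("wrong function", "wrong branch", "incorrect flow")),
    ("wrong-language", ("language", "code-switch")),
    ("loop", ("loop", "repeated", "stuck")),
    ("silence", ("silent", "no audio", "no response")),
]

def triage(reason: str) -> str:
    """Map a free-text failure reason to a pattern label, if any matches."""
    r = reason.lower()
    for label, keywords in PATTERNS:
        if any(k in r for k in keywords):
            return label
    return "unclassified"
```

An `"unclassified"` result just means you classify by ear from the recording, exactly as in steps 2 and 3.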
Step 4 — Mitigate or escalate
- If the root cause is a template or prompt issue, open a fix PR against the template. Include the trace ID in the PR.
- If the root cause is model behaviour (e.g., the LLM ignored the system prompt), escalate to the LLM team with the trace and recording.
- If the root cause is infra (STT, telephony), open an incident in your usual channel and redirect.
Close out
Post a short summary in the alert thread: cause, action taken, whether a follow-up is needed. This is how the team learns what patterns recur.
Deduplication
You won’t be paged again for the same call_sid within the next hour. If the same type of failure keeps firing across different calls, that’s a real pattern — escalate.
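The per-call suppression described above amounts to a TTL cache keyed on call_sid. A minimal sketch of that behaviour, assuming the production deduper lives in the alerting pipeline (the class name and one-hour default here are illustrative):

```python
import time

class AlertDeduper:
    """Suppress repeat pages for the same call_sid within a time window.

    A sketch of the one-hour-per-call_sid rule described above, not the
    actual pipeline implementation. The clock is injectable for testing.
    """

    def __init__(self, window_s: float = 3600.0, clock=time.monotonic):
        self.window_s = window_s
        self.clock = clock
        self._last_fired: dict[str, float] = {}

    def should_page(self, call_sid: str) -> bool:
        now = self.clock()
        last = self._last_fired.get(call_sid)
        if last is not None and now - last < self.window_s:
            return False  # already paged for this call within the window
        self._last_fired[call_sid] = now
        return True
```

Note the suppression is per call_sid only: the same failure pattern across different calls still pages each time, which is exactly the repeated signal worth escalating.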