Langfuse alert response
On-call runbook for a Langfuse LLM evaluator firing a zero-score Slack alert.
A Slack alert named “breeze buddy outcome correctness” (or another configured evaluator) has fired. A call just failed an LLM-judged evaluation — the bot said or did something wrong. This runbook gets you from “I got paged” to either “fixed” or “escalated” in under 15 minutes.
Prerequisites
- You are on-call and have Slack access.
- You have Langfuse viewer access on the configured base URL.
- You can reach the recording URL (GCS / S3 depending on deployment).
Symptom
The Slack message contains:
- Evaluator name (e.g., breeze buddy outcome correctness).
- Trace ID (links to Langfuse).
- Call SID.
- Merchant ID.
- Reported outcome.
- Failure reason from the evaluator.
- Recording URL.
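If you need these fields programmatically (e.g. in a triage script), a minimal parser is sketched below. The "Key: value" line format is an assumption for illustration; the real Slack payload may be structured differently, so adapt the keys to what your alert actually sends.

```python
# Hypothetical mapping from alert-message labels to field names.
# The "Key: value" line format is an assumption, not the confirmed
# shape of the real Slack alert.
FIELDS = {
    "Evaluator": "evaluator",
    "Trace ID": "trace_id",
    "Call SID": "call_sid",
    "Merchant ID": "merchant_id",
    "Outcome": "outcome",
    "Reason": "reason",
    "Recording": "recording_url",
}

def parse_alert(text: str) -> dict:
    """Extract known fields from a zero-score alert message body."""
    out = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        if key in FIELDS and value.strip():
            out[FIELDS[key]] = value.strip()
    return out
```

Missing fields are simply absent from the result, so downstream code should use `.get()` rather than assume every key is present.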
Diagnosis
Work through these in order. Most alerts resolve at step 2 or 3.
Step 1 — Confirm the alert is real
Open the Langfuse trace. Verify the evaluator actually ran, actually scored zero, and the reason is coherent. False positives from evaluator bugs happen; don’t chase them.
If the evaluator is clearly broken, mark the alert “evaluator bug” in thread and ping the evaluation team. Stop here.
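The three checks in this step (evaluator ran, score is zero, reason is coherent) can be encoded as a small helper. The score shape below — a list of dicts with `name`, `value`, and `comment` — is an assumption about what your Langfuse client returns; verify it against your SDK version before relying on it.

```python
def alert_disposition(scores: list[dict], evaluator: str) -> str:
    """Classify a firing alert as 'real', 'evaluator-bug', or 'no-score'.

    `scores` is assumed to be a list of {'name', 'value', 'comment'}
    dicts for the trace; this shape is an assumption, not a confirmed
    Langfuse API contract.
    """
    matching = [s for s in scores if s.get("name") == evaluator]
    if not matching:
        return "no-score"       # evaluator never ran: likely a pipeline bug
    score = matching[-1]        # most recent score wins
    if score.get("value") != 0:
        return "evaluator-bug"  # alert fired but the score isn't zero
    if not (score.get("comment") or "").strip():
        return "evaluator-bug"  # zero score with no coherent reason
    return "real"
```

Anything other than `"real"` maps to the "evaluator bug" disposition above: mark the thread and ping the evaluation team.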
Step 2 — Listen to the recording
Open the recording URL. Listen to the last 60–90 seconds. You are trying to confirm the evaluator’s failure reason matches what the bot actually did.
- If the bot did the wrong thing: go to Step 3.
- If the bot did the right thing but the evaluator scored it wrong: evaluator bug — see Step 1 disposition.
- If the call ended abnormally (abrupt disconnect, telephony error): see Call troubleshooting instead.
Step 3 — Classify the failure
| Pattern | What to do |
|---|---|
| Hallucinated data (invented an order number, price, policy) | Ping the prompt / template owner. Usually fixed by tightening node task_messages or adding an explicit task_instructions. |
| Wrong branch taken | Inspect function_calls in the trace. The LLM called the wrong function — tighten function descriptions or node transitions. |
| Incorrect language / code-switch | Check payload_based_language_selection and the language in the payload. Often a misconfigured template. |
| Stuck in a loop | Look at node transitions in the trace. Usually a missing transition_to or a function returning the wrong outcome. |
| Silent / no audio | Check STT transcription events; may be a VAD tuning issue, not an LLM issue. Redirect to voice-pipeline on-call. |
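A first-pass triage of the evaluator's failure reason can mirror the table above. The keyword lists are illustrative guesses, not a tested taxonomy; the recording and trace remain the source of truth, and this only saves a minute on obvious cases.

```python
# Keyword heuristics mirroring the classification table; the keywords
# are illustrative assumptions, not derived from real evaluator output.
PATTERNS = [
    ("hallucinated-data", ("invented", "hallucinat", "made up", "nonexistent")),
    ("wrong-branch", ("wrong function", "wrong branch", "incorrect flow")),
    ("wrong-language", ("language", "code-switch")),
    ("loop", ("loop", "repeated", "stuck")),
    ("silence", ("silent", "no audio", "no response")),
]

def triage(reason: str) -> str:
    """Map a free-text failure reason to a pattern label, if any matches."""
    r = reason.lower()
    for label, keywords in PATTERNS:
        if any(k in r for k in keywords):
            return label
    return "unclassified"
```

An `"unclassified"` result just means you classify by ear from the recording, exactly as in steps 2 and 3.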
Step 4 — Mitigate or escalate
- If the root cause is a template or prompt issue, open a fix PR against the template. Include the trace ID in the PR.
- If the root cause is model behaviour (e.g., the LLM ignored the system prompt), escalate to the LLM team with the trace and recording.
- If the root cause is infra (STT, telephony), open an incident in your usual channel and redirect.
Close out
Post a short summary in the alert thread: cause, action taken, whether a follow-up is needed. This is how the team learns what patterns recur.
Deduplication
You won’t be paged again for the same call_sid within the next hour. If the same type of failure keeps firing across different calls, that’s a real pattern — escalate.
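The per-call suppression described above amounts to a TTL cache keyed on call_sid. A minimal sketch of that behaviour, assuming the production deduper lives in the alerting pipeline (the class name and one-hour default here are illustrative):

```python
import time

class AlertDeduper:
    """Suppress repeat pages for the same call_sid within a time window.

    A sketch of the one-hour-per-call_sid rule described above, not the
    actual pipeline implementation. The clock is injectable for testing.
    """

    def __init__(self, window_s: float = 3600.0, clock=time.monotonic):
        self.window_s = window_s
        self.clock = clock
        self._last_fired: dict[str, float] = {}

    def should_page(self, call_sid: str) -> bool:
        now = self.clock()
        last = self._last_fired.get(call_sid)
        if last is not None and now - last < self.window_s:
            return False  # already paged for this call within the window
        self._last_fired[call_sid] = now
        return True
```

Note the suppression is per call_sid only: the same failure pattern across different calls still pages each time, which is exactly the repeated signal worth escalating.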