Observability

Langfuse tracing, OpenTelemetry propagation, contextual logging, Slack alerts, and auto-evaluation.

Overview

Breeze Buddy provides comprehensive observability across the entire voice pipeline — from LLM call tracing to structured logging with correlation IDs. This enables debugging, performance optimization, cost tracking, and automated quality monitoring.

Developers: observability ships on by default

Every call you push through the Leads API emits a Langfuse trace, contextual logs, and OpenTelemetry spans keyed on call_id, with no instrumentation code required on your side. When a call fails, jump straight to Debug a failed call, which walks through the trace step by step. This page is for operators configuring and tuning the stack.

Langfuse integration

Langfuse is integrated for full LLM observability. Every LLM call is traced with inputs, outputs, latency, and token usage.

Capabilities

| Feature | Description |
| --- | --- |
| LLM Tracing | Full request/response tracing for every LLM call, including system prompts, user messages, and assistant responses. |
| Auto-Evaluation | Background task scheduler runs periodic scoring on completed conversations. |
| Cost Tracking | Token-level cost attribution per call, template, and merchant. |
| Latency Metrics | Time-to-first-token and total response time for each LLM invocation. |
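Token-level cost attribution (the Cost Tracking row above) can be sketched as a simple roll-up over per-call token counts. The `RATES` table, model name, and field names below are illustrative rather than actual pricing or schema:

```python
from collections import defaultdict

# Hypothetical per-1K-token rates; real rates come from your provider's price sheet.
RATES = {"gpt-4o": {"input": 0.0025, "output": 0.01}}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Token-level cost for a single LLM call."""
    r = RATES[model]
    return (input_tokens / 1000) * r["input"] + (output_tokens / 1000) * r["output"]

def attribute_costs(calls: list) -> dict:
    """Roll token costs up per call, per template, and per merchant."""
    totals = defaultdict(float)
    for c in calls:
        cost = call_cost(c["model"], c["input_tokens"], c["output_tokens"])
        totals[("call", c["call_id"])] += cost
        totals[("template", c["template_id"])] += cost
        totals[("merchant", c["merchant_id"])] += cost
    return dict(totals)
```

The same per-call cost is added under three keys, so one pass over the call log yields all three attribution views.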

Langfuse Scores

Auto-evaluation scores are stored on the LeadCallTracker via the langfuse_scores field. These scores are computed by the background evaluation scheduler and include metrics like task completion, sentiment, and custom rubrics.

```json
{
  "langfuse_scores": {
    "task_completion": 1.0,
    "sentiment": 0.85,
    "greeting_quality": 0.9,
    "objection_handling": 0.75
  }
}
```
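A minimal sketch of how such scores might be validated and stored. The real LeadCallTracker is a richer model; only the `langfuse_scores` field is reproduced here, and the range check is an assumed convention (scores normalized to 0.0 to 1.0, as in the example above):

```python
from dataclasses import dataclass, field

@dataclass
class LeadCallTracker:
    # Simplified stand-in for the real tracker; only the scores field is sketched.
    lead_id: str
    langfuse_scores: dict = field(default_factory=dict)

def record_scores(tracker: LeadCallTracker, scores: dict) -> None:
    """Merge evaluation scores onto the tracker, rejecting out-of-range values."""
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score {name!r} out of range: {value}")
        tracker.langfuse_scores[name] = value
```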

OpenTelemetry

Breeze Buddy uses OpenTelemetry for distributed trace context propagation across async boundaries. This ensures that traces are connected across the voice pipeline’s concurrent processors.
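OpenTelemetry's Python context propagation is built on `contextvars`, which is enough to illustrate the mechanism without the OTel SDK. In this sketch, concurrent STT/LLM/TTS stages each get a copy of the caller's context at task creation, so all three see the same parent span even across await points (the span names are illustrative):

```python
import asyncio
import contextvars

# The active span id; OpenTelemetry's Context behaves the same way under the hood.
current_span = contextvars.ContextVar("current_span", default=None)

async def stage(name: str, results: list) -> None:
    """One pipeline stage: record which parent span it inherited, open its own."""
    parent = current_span.get()
    token = current_span.set(f"{name}-span")
    try:
        results.append((name, parent))
        await asyncio.sleep(0)  # cross an async boundary
    finally:
        current_span.reset(token)

async def pipeline(results: list) -> None:
    root = current_span.set("call-root")
    # Each gather task copies the current context, so every stage inherits call-root.
    await asyncio.gather(*(stage(n, results) for n in ("stt", "llm", "tts")))
    current_span.reset(root)

results = []
asyncio.run(pipeline(results))
```

Because each task mutates only its own copy of the context, a stage setting its own span never leaks into its siblings: the parent-child relationship stays correct without any manual plumbing.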

Features

| Feature | Description |
| --- | --- |
| Trace Context Propagation | Traces flow across async boundaries (STT → LLM → TTS) with proper parent-child span relationships. |
| Span Attributes | Custom attributes on spans include lead ID, template ID, node name, and provider details. |
| Exporter Support | Compatible with any OTLP-compatible backend (Jaeger, Zipkin, Grafana Tempo, etc.). |

Contextual logging

All log entries are enriched with contextual information using Python’s contextvars. This enables structured logging with automatic correlation across the entire call lifecycle.

Contextual Fields

| Field | Description |
| --- | --- |
| correlation_id | Unique ID for the entire call session — correlates all logs for one call. |
| lead_id | The lead being processed. |
| template_id | The template driving the conversation. |
| node_name | Current flow node (updated as conversation progresses). |
| provider | Active telephony or STT/TTS provider. |

```json
{
  "timestamp": "2025-01-15T10:35:02.123Z",
  "level": "INFO",
  "message": "LLM response generated",
  "correlation_id": "corr_abc123",
  "lead_id": "lead_abc123",
  "template_id": "tpl_xyz789",
  "node_name": "greeting",
  "latency_ms": 342
}
```
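A record like the one above can be produced with a `logging.Filter` that copies `contextvars` values onto each `LogRecord`. Field names match the table; the two fields and the formatter shown here are a simplified stand-in for the real structured formatter:

```python
import contextvars
import logging

# Set once at call start; every subsequent log line in this context inherits them.
correlation_id = contextvars.ContextVar("correlation_id", default="-")
lead_id = contextvars.ContextVar("lead_id", default="-")

class ContextFilter(logging.Filter):
    """Copy contextvars onto every LogRecord so formatters can emit them."""
    def filter(self, record):
        record.correlation_id = correlation_id.get()
        record.lead_id = lead_id.get()
        return True

logger = logging.getLogger("breeze")
logger.addFilter(ContextFilter())  # logger-level filter enriches all handlers
logger.setLevel(logging.INFO)

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    '{"level": "%(levelname)s", "message": "%(message)s", '
    '"correlation_id": "%(correlation_id)s", "lead_id": "%(lead_id)s"}'
))
logger.addHandler(handler)

correlation_id.set("corr_abc123")
lead_id.set("lead_abc123")
logger.info("LLM response generated")
```

Because the filter is attached to the logger rather than a handler, every handler sees the enriched record, and call handlers never need to pass the IDs explicitly.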

Slack alerts

Automated Slack alerting triggers notifications when evaluation scores fall below configured thresholds. This enables proactive monitoring of voice agent quality.

Alert Triggers

| Trigger | Description |
| --- | --- |
| Evaluation Failure | Langfuse auto-evaluation score drops below the threshold. |
| LLM Error Rate | Elevated error rate from the LLM provider. |
| Pipeline Failure | Voice pipeline crashes or fails to initialize. |

Configuration

Slack webhook URLs and alert thresholds are configured via environment variables. Each template can have its own alerting rules.
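The exact variable names are deployment-specific; assuming something like `SLACK_WEBHOOK_URL` and `EVAL_SCORE_THRESHOLD`, the alert path might look like this sketch (the payload builder is separated from the send so it can be tested without a network call):

```python
import json
import os
import urllib.request

# Hypothetical variable names; match them to your deployment's actual env config.
WEBHOOK_URL = os.environ.get("SLACK_WEBHOOK_URL", "")
SCORE_THRESHOLD = float(os.environ.get("EVAL_SCORE_THRESHOLD", "0.8"))

def build_alert(lead_id: str, scores: dict, threshold: float = SCORE_THRESHOLD):
    """Return a Slack payload for scores below threshold, or None if all pass."""
    failing = {k: v for k, v in scores.items() if v < threshold}
    if not failing:
        return None
    detail = ", ".join(f"{k}={v:.2f}" for k, v in sorted(failing.items()))
    return {"text": f":warning: Evaluation below {threshold} for {lead_id}: {detail}"}

def send_alert(payload: dict) -> None:
    """POST the payload to the configured Slack incoming webhook."""
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```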

Langfuse auto-evaluation

A background task scheduler periodically evaluates completed conversations against configurable scoring rubrics. Scores are written back to Langfuse traces and stored on the LeadCallTracker.

Evaluation Flow

  1. Call completes and transcription is finalized.
  2. Background scheduler picks up the conversation for evaluation.
  3. LLM-based scoring evaluates the conversation against rubrics.
  4. Scores are written to Langfuse and stored in langfuse_scores on the lead.
  5. If any score breaches the alert threshold, a Slack notification is dispatched.
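Steps 2 to 5 above can be sketched as a single function, with the LLM scorer, the Langfuse/tracker writer, and the Slack sender passed in as hypothetical stand-ins:

```python
def evaluate_completed_call(call, score_fn, write_scores, alert_fn, threshold=0.8):
    """Score a completed call, persist the scores, and alert on breaches."""
    scores = score_fn(call["transcript"])       # LLM-based rubric scoring
    write_scores(call["call_id"], scores)       # to Langfuse + langfuse_scores
    breaches = {k: v for k, v in scores.items() if v < threshold}
    if breaches:
        alert_fn(call["call_id"], breaches)     # Slack notification
    return scores
```

Injecting the scorer and sinks keeps the scheduler loop trivial: it only has to find completed calls and invoke this function for each one.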

Custom Rubrics

Define custom evaluation rubrics per template to measure domain-specific quality metrics like appointment confirmation rate, upsell success, or compliance adherence.
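One plausible shape for per-template rubrics is a mapping from template ID to named scoring instructions, with a fallback default; the IDs and prompt texts below are purely illustrative:

```python
# Hypothetical rubric shape: metric name -> instruction given to the scoring LLM.
TEMPLATE_RUBRICS = {
    "tpl_xyz789": {
        "appointment_confirmed": "Did the agent confirm a concrete appointment time?",
        "compliance": "Did the agent read the required disclosure verbatim?",
    },
}

DEFAULT_RUBRIC = {"task_completion": "Did the agent complete the configured task?"}

def rubric_for(template_id: str) -> dict:
    """Pick the template-specific rubric, falling back to the default."""
    return TEMPLATE_RUBRICS.get(template_id, DEFAULT_RUBRIC)
```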
