Observability
Langfuse tracing, OpenTelemetry propagation, contextual logging, Slack alerts, and auto-evaluation.
Overview
Breeze Buddy provides comprehensive observability across the entire voice pipeline — from LLM call tracing to structured logging with correlation IDs. This enables debugging, performance optimization, cost tracking, and automated quality monitoring.
Developers: observability ships on by default
Every call you push through the Leads API emits a Langfuse trace, contextual logs, and OpenTelemetry spans keyed on call_id — no instrumentation code required on your side. When a call fails, jump straight to Debug a failed call which walks the trace with you. This page is for operators configuring and tuning the stack.
Langfuse integration
Langfuse is integrated for full LLM observability. Every LLM call is traced with inputs, outputs, latency, and token usage.
Capabilities
| Feature | Description |
|---|---|
| LLM Tracing | Full request/response tracing for every LLM call including system prompts, user messages, and assistant responses. |
| Auto-Evaluation | Background task scheduler runs periodic scoring on completed conversations. |
| Cost Tracking | Token-level cost attribution per call, template, and merchant. |
| Latency Metrics | Time-to-first-token and total response time for each LLM invocation. |
Langfuse Scores
Auto-evaluation scores are stored on the LeadCallTracker via the langfuse_scores field. These scores are computed by the background evaluation scheduler and include metrics like task completion, sentiment, and custom rubrics.
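A downstream consumer might filter these scores to find metrics that need attention. A minimal sketch, assuming a plain dict of metric names to floats (the 0.8 threshold is illustrative, not a documented default):

```python
ALERT_THRESHOLD = 0.8  # hypothetical per-metric floor, for illustration only

def low_scores(langfuse_scores: dict[str, float],
               threshold: float = ALERT_THRESHOLD) -> dict[str, float]:
    """Return the subset of metrics whose score falls below the threshold."""
    return {name: score for name, score in langfuse_scores.items()
            if score < threshold}

scores = {
    "task_completion": 1.0,
    "sentiment": 0.85,
    "greeting_quality": 0.9,
    "objection_handling": 0.75,
}
print(low_scores(scores))  # {'objection_handling': 0.75}
```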
```json
{
  "langfuse_scores": {
    "task_completion": 1.0,
    "sentiment": 0.85,
    "greeting_quality": 0.9,
    "objection_handling": 0.75
  }
}
```

OpenTelemetry
Breeze Buddy uses OpenTelemetry for distributed trace context propagation across async boundaries. This ensures that traces are connected across the voice pipeline’s concurrent processors.
Features
| Feature | Description |
|---|---|
| Trace Context Propagation | Traces flow across async boundaries (STT → LLM → TTS) with proper parent-child span relationships. |
| Span Attributes | Custom attributes on spans include lead ID, template ID, node name, and provider details. |
| Exporter Support | Compatible with any OTLP-compatible backend (Jaeger, Zipkin, Grafana Tempo, etc.). |
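OpenTelemetry's Python SDK keeps the active span in a contextvar, which is why context survives `await` boundaries without manual plumbing. The sketch below mimics that mechanism with a plain contextvar to show how each stage inherits its caller's context; the stage names are illustrative, not Breeze Buddy's actual span names:

```python
import asyncio
import contextvars

# Stand-in for OpenTelemetry's active-span slot; OTel uses the same
# contextvars machinery under the hood for async propagation.
current_span = contextvars.ContextVar("current_span", default="root")

async def stage(name: str) -> str:
    parent = current_span.get()            # inherited from the caller's context
    token = current_span.set(f"{parent}/{name}")
    try:
        await asyncio.sleep(0)             # cross an async boundary
        return current_span.get()          # still the child span after await
    finally:
        current_span.reset(token)

async def pipeline() -> list[str]:
    spans = []
    for name in ("stt", "llm", "tts"):     # sequential pipeline stages
        spans.append(await stage(name))
    return spans

print(asyncio.run(pipeline()))  # ['root/stt', 'root/llm', 'root/tts']
```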
Contextual logging
All log entries are enriched with contextual information using Python’s contextvars. This enables structured logging with automatic correlation across the entire call lifecycle.
Contextual Fields
| Field | Description |
|---|---|
| correlation_id | Unique ID for the entire call session — correlates all logs for one call. |
| lead_id | The lead being processed. |
| template_id | The template driving the conversation. |
| node_name | Current flow node (updated as conversation progresses). |
| provider | Active telephony or STT/TTS provider. |
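A minimal sketch of this pattern using contextvars and a `logging.Filter` to stamp context fields onto every record; only two of the fields from the table are shown, and the logger name is an assumption:

```python
import contextvars
import logging

correlation_id = contextvars.ContextVar("correlation_id", default="-")
node_name = contextvars.ContextVar("node_name", default="-")

class ContextFilter(logging.Filter):
    """Copy the current context values onto every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        record.node_name = node_name.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(levelname)s %(correlation_id)s %(node_name)s %(message)s"))
handler.addFilter(ContextFilter())

log = logging.getLogger("breeze")  # hypothetical logger name
log.addHandler(handler)
log.setLevel(logging.INFO)

correlation_id.set("corr_abc123")
node_name.set("greeting")
log.info("LLM response generated")
# INFO corr_abc123 greeting LLM response generated
```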
```json
{
  "timestamp": "2025-01-15T10:35:02.123Z",
  "level": "INFO",
  "message": "LLM response generated",
  "correlation_id": "corr_abc123",
  "lead_id": "lead_abc123",
  "template_id": "tpl_xyz789",
  "node_name": "greeting",
  "latency_ms": 342
}
```

Slack alerts
Slack alerts are dispatched automatically when evaluation scores fall below configured thresholds, enabling proactive monitoring of voice agent quality.
Alert Triggers
| Trigger | Description |
|---|---|
| Evaluation Failure | Langfuse auto-evaluation score drops below the threshold. |
| LLM Error Rate | Elevated error rate from the LLM provider. |
| Pipeline Failure | Voice pipeline crashes or fails to initialize. |
Configuration
Slack webhook URLs and alert thresholds are configured via environment variables. Each template can have its own alerting rules.
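A sketch of the webhook call using only the standard library; the `SLACK_WEBHOOK_URL` variable name and the message format are assumptions, not the documented configuration:

```python
import json
import os
import urllib.request

def build_alert(metric: str, score: float, threshold: float) -> dict:
    """Build a Slack incoming-webhook payload for a breached metric."""
    return {"text": f":warning: {metric} scored {score:.2f} "
                    f"(threshold {threshold:.2f})"}

def send_alert(payload: dict) -> None:
    # Webhook URL comes from the environment, per deployment.
    url = os.environ["SLACK_WEBHOOK_URL"]
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

payload = build_alert("objection_handling", 0.75, 0.8)
print(payload["text"])  # :warning: objection_handling scored 0.75 (threshold 0.80)
```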
Langfuse Auto-Evaluation
A background task scheduler periodically evaluates completed conversations against configurable scoring rubrics. Scores are written back to Langfuse traces and stored on the LeadCallTracker.
Evaluation Flow
- Call completes and transcription is finalized.
- Background scheduler picks up the conversation for evaluation.
- LLM-based scoring evaluates the conversation against rubrics.
- Scores are written to Langfuse and stored in langfuse_scores on the lead.
- If any score breaches the alert threshold, a Slack notification is dispatched.
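The flow above can be sketched as a single scheduler tick. All four callables are hypothetical stand-ins for the real fetch, scoring, storage, and alerting steps:

```python
def evaluate_completed_calls(fetch_completed, score_call, store_scores,
                             alert, threshold: float = 0.8) -> None:
    """One scheduler tick: score each finalized conversation, persist the
    scores, and alert on any metric below the threshold."""
    for call in fetch_completed():
        scores = score_call(call)       # LLM-based rubric scoring
        store_scores(call, scores)      # written to Langfuse and the lead
        for metric, value in scores.items():
            if value < threshold:
                alert(call, metric, value)
```

Injecting the dependencies as callables keeps the loop testable with fakes, independent of the actual Langfuse and Slack clients.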
Custom Rubrics
Define custom evaluation rubrics per template to measure domain-specific quality metrics like appointment confirmation rate, upsell success, or compliance adherence.
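A rubric definition might look like the following. This schema is entirely hypothetical, shown only to illustrate the kind of per-template structure involved:

```python
# Hypothetical rubric definition; field names are illustrative,
# not a documented Breeze Buddy schema.
rubric = {
    "template_id": "tpl_xyz789",
    "metrics": [
        {"name": "appointment_confirmed",
         "prompt": "Did the agent confirm a specific appointment time?",
         "scale": [0.0, 1.0]},
        {"name": "compliance_adherence",
         "prompt": "Did the agent read the required disclosure?",
         "scale": [0.0, 1.0]},
    ],
    "alert_threshold": 0.8,
}
print([m["name"] for m in rubric["metrics"]])
```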