Observability
Langfuse tracing, OpenTelemetry propagation, contextual logging, Slack alerts, and auto-evaluation.
Overview
Breeze Buddy provides comprehensive observability across the entire voice pipeline — from LLM call tracing to structured logging with correlation IDs. This enables debugging, performance optimization, cost tracking, and automated quality monitoring.
Developers: observability ships on by default
Every call you push through the Leads API emits a Langfuse trace, contextual logs, and OpenTelemetry spans keyed on call_id — no instrumentation code required on your side. When a call fails, jump straight to Debug a failed call which walks the trace with you. This page is for operators configuring and tuning the stack.
Langfuse integration
Langfuse is integrated for full LLM observability. Every LLM call is traced with inputs, outputs, latency, and token usage.
Capabilities
| Feature | Description |
|---|---|
| LLM Tracing | Full request/response tracing for every LLM call including system prompts, user messages, and assistant responses. |
| Auto-Evaluation | Background task scheduler runs periodic scoring on completed conversations. |
| Cost Tracking | Token-level cost attribution per call, template, and merchant. |
| Latency Metrics | Time-to-first-token and total response time for each LLM invocation. |
Langfuse Scores
Auto-evaluation scores are stored on the LeadCallTracker via the langfuse_scores field. These scores are computed by the background evaluation scheduler and include metrics like task completion, sentiment, and custom rubrics.
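A downstream consumer might filter these scores to find metrics that need attention. A minimal sketch, assuming a plain dict of metric names to floats (the 0.8 threshold is illustrative, not a documented default):

```python
ALERT_THRESHOLD = 0.8  # hypothetical per-metric floor, for illustration only

def low_scores(langfuse_scores: dict[str, float],
               threshold: float = ALERT_THRESHOLD) -> dict[str, float]:
    """Return the subset of metrics whose score falls below the threshold."""
    return {name: score for name, score in langfuse_scores.items()
            if score < threshold}

scores = {
    "task_completion": 1.0,
    "sentiment": 0.85,
    "greeting_quality": 0.9,
    "objection_handling": 0.75,
}
print(low_scores(scores))  # {'objection_handling': 0.75}
```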
```json
{
  "langfuse_scores": {
    "task_completion": 1.0,
    "sentiment": 0.85,
    "greeting_quality": 0.9,
    "objection_handling": 0.75
  }
}
```

OpenTelemetry
Breeze Buddy uses OpenTelemetry for distributed trace context propagation across async boundaries. This ensures that traces are connected across the voice pipeline’s concurrent processors.
Features
| Feature | Description |
|---|---|
| Trace Context Propagation | Traces flow across async boundaries (STT → LLM → TTS) with proper parent-child span relationships. |
| Span Attributes | Custom attributes on spans include lead ID, template ID, node name, and provider details. |
| Exporter Support | Compatible with any OTLP-compatible backend (Jaeger, Zipkin, Grafana Tempo, etc.). |
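OpenTelemetry's Python SDK keeps the active span in a contextvar, which is why context survives `await` boundaries without manual plumbing. The sketch below mimics that mechanism with a plain contextvar to show how each stage inherits its caller's context; the stage names are illustrative, not Breeze Buddy's actual span names:

```python
import asyncio
import contextvars

# Stand-in for OpenTelemetry's active-span slot; OTel uses the same
# contextvars machinery under the hood for async propagation.
current_span = contextvars.ContextVar("current_span", default="root")

async def stage(name: str) -> str:
    parent = current_span.get()            # inherited from the caller's context
    token = current_span.set(f"{parent}/{name}")
    try:
        await asyncio.sleep(0)             # cross an async boundary
        return current_span.get()          # still the child span after await
    finally:
        current_span.reset(token)

async def pipeline() -> list[str]:
    spans = []
    for name in ("stt", "llm", "tts"):     # sequential pipeline stages
        spans.append(await stage(name))
    return spans

print(asyncio.run(pipeline()))  # ['root/stt', 'root/llm', 'root/tts']
```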
Contextual logging
All log entries are enriched with contextual information using Python’s contextvars. This enables structured logging with automatic correlation across the entire call lifecycle.
Contextual Fields
| Field | Description |
|---|---|
| correlation_id | Unique ID for the entire call session — correlates all logs for one call. |
| lead_id | The lead being processed. |
| template_id | The template driving the conversation. |
| node_name | Current flow node (updated as conversation progresses). |
| provider | Active telephony or STT/TTS provider. |
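A minimal sketch of this pattern using contextvars and a `logging.Filter` to stamp context fields onto every record; only two of the fields from the table are shown, and the logger name is an assumption:

```python
import contextvars
import logging

correlation_id = contextvars.ContextVar("correlation_id", default="-")
node_name = contextvars.ContextVar("node_name", default="-")

class ContextFilter(logging.Filter):
    """Copy the current context values onto every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        record.node_name = node_name.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(levelname)s %(correlation_id)s %(node_name)s %(message)s"))
handler.addFilter(ContextFilter())

log = logging.getLogger("breeze")  # hypothetical logger name
log.addHandler(handler)
log.setLevel(logging.INFO)

correlation_id.set("corr_abc123")
node_name.set("greeting")
log.info("LLM response generated")
# INFO corr_abc123 greeting LLM response generated
```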
```json
{
  "timestamp": "2025-01-15T10:35:02.123Z",
  "level": "INFO",
  "message": "LLM response generated",
  "correlation_id": "corr_abc123",
  "lead_id": "lead_abc123",
  "template_id": "tpl_xyz789",
  "node_name": "greeting",
  "latency_ms": 342
}
```

Slack alerts
Slack alerts are dispatched automatically when evaluation scores fall below configured thresholds, enabling proactive monitoring of voice agent quality.
Alert Triggers
| Trigger | Description |
|---|---|
| Evaluation Failure | Langfuse auto-evaluation score drops below the threshold. |
| LLM Error Rate | Elevated error rate from the LLM provider. |
| Pipeline Failure | Voice pipeline crashes or fails to initialize. |
Configuration
Slack webhook URLs and alert thresholds are configured via environment variables. Each template can have its own alerting rules.
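A sketch of the webhook call using only the standard library; the `SLACK_WEBHOOK_URL` variable name and the message format are assumptions, not the documented configuration:

```python
import json
import os
import urllib.request

def build_alert(metric: str, score: float, threshold: float) -> dict:
    """Build a Slack incoming-webhook payload for a breached metric."""
    return {"text": f":warning: {metric} scored {score:.2f} "
                    f"(threshold {threshold:.2f})"}

def send_alert(payload: dict) -> None:
    # Webhook URL comes from the environment, per deployment.
    url = os.environ["SLACK_WEBHOOK_URL"]
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

payload = build_alert("objection_handling", 0.75, 0.8)
print(payload["text"])  # :warning: objection_handling scored 0.75 (threshold 0.80)
```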
Langfuse Auto-Evaluation
A background task scheduler periodically evaluates completed conversations against configurable scoring rubrics. Scores are written back to Langfuse traces and stored on the LeadCallTracker.
Evaluation Flow
- Call completes and transcription is finalized.
- Background scheduler picks up the conversation for evaluation.
- LLM-based scoring evaluates the conversation against rubrics.
- Scores are written to Langfuse and stored in langfuse_scores on the lead.
- If any score breaches the alert threshold, a Slack notification is dispatched.
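The flow above can be sketched as a single scheduler tick. All four callables are hypothetical stand-ins for the real fetch, scoring, storage, and alerting steps:

```python
def evaluate_completed_calls(fetch_completed, score_call, store_scores,
                             alert, threshold: float = 0.8) -> None:
    """One scheduler tick: score each finalized conversation, persist the
    scores, and alert on any metric below the threshold."""
    for call in fetch_completed():
        scores = score_call(call)       # LLM-based rubric scoring
        store_scores(call, scores)      # written to Langfuse and the lead
        for metric, value in scores.items():
            if value < threshold:
                alert(call, metric, value)
```

Injecting the dependencies as callables keeps the loop testable with fakes, independent of the actual Langfuse and Slack clients.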
Custom Rubrics
Define custom evaluation rubrics per template to measure domain-specific quality metrics like appointment confirmation rate, upsell success, or compliance adherence.
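A rubric definition might look like the following. This schema is entirely hypothetical, shown only to illustrate the kind of per-template structure involved:

```python
# Hypothetical rubric definition; field names are illustrative,
# not a documented Breeze Buddy schema.
rubric = {
    "template_id": "tpl_xyz789",
    "metrics": [
        {"name": "appointment_confirmed",
         "prompt": "Did the agent confirm a specific appointment time?",
         "scale": [0.0, 1.0]},
        {"name": "compliance_adherence",
         "prompt": "Did the agent read the required disclosure?",
         "scale": [0.0, 1.0]},
    ],
    "alert_threshold": 0.8,
}
print([m["name"] for m in rubric["metrics"]])
```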