Health and metrics

Breeze Buddy exposes a small set of operational endpoints. This page lists the endpoints to watch and what normal looks like. Use them for load balancer health checks, uptime monitoring, and pool observability.

GET /health

Liveness probe. Returns 200 OK when the process is alive and ready to serve. Use this for container liveness and load balancer health checks.

{ "status": "ok" }

A failure to respond within 1–2 seconds usually means the app is wedged. If the failure persists, restart the pod.
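A probe along these lines can be sketched in Python with only the standard library. The function names `is_healthy` and `probe_health`, the `base_url` argument, and the 2-second default timeout are illustrative assumptions, not part of Breeze Buddy.

```python
import json
import urllib.error
import urllib.request


def is_healthy(status_code: int, body: bytes) -> bool:
    """True only for a 200 response whose JSON body has status == "ok"."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:  # malformed JSON or undecodable bytes
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def probe_health(base_url: str, timeout_s: float = 2.0) -> bool:
    """Call GET /health; a timeout, HTTP error, or connection error counts as unhealthy."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout_s) as resp:
            return is_healthy(resp.status, resp.read())
    except (urllib.error.URLError, OSError):
        return False
```

For example, `probe_health("http://localhost:8000")` would return `False` if the app does not answer within two seconds. The host and port here are placeholders.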

GET /metrics

Prometheus-formatted metrics. Standard FastAPI + Pipecat instrumentation — request counts, latencies, call-state counts, pool sizes.

Scrape interval: 15 s is usually fine. Typical high-signal metrics:

  • breeze_buddy_active_calls — concurrent voice sessions.
  • breeze_buddy_lead_backlog — leads in BACKLOG waiting for cron pickup.
  • breeze_buddy_voice_agent_pool_available — ready voice-agent processes.
  • breeze_buddy_daily_room_pool_available — ready Daily rooms.
  • http_request_duration_seconds — p99 per route.
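To read the gauges listed above outside of Prometheus (for example in a quick smoke test), a minimal parser for the Prometheus text format might look like the following. It is a sketch, not a full parser: `read_gauges` is a hypothetical helper, and summing across label sets assumes the gauges may be exported with labels, which the source does not specify.

```python
def read_gauges(metrics_text: str, names: set[str]) -> dict[str, float]:
    """Sum the sample values of the requested metric names across label sets."""
    totals: dict[str, float] = {}
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        if "{" in line:
            # Labeled sample: name{labels} value [timestamp]
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:]
        else:
            # Unlabeled sample: name value [timestamp]
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if name in names and fields:
            totals[name] = totals.get(name, 0.0) + float(fields[0])
    return totals
```

Passing the body of a `/metrics` response and a set such as `{"breeze_buddy_active_calls", "breeze_buddy_lead_backlog"}` returns a dictionary with one total per metric found.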

GET /agent/voice/automatic/pool/status

Voice-agent process pool status.

{
  "pool_size": 3,
  "available": 2,
  "in_use": 1,
  "warming": 0
}

available dropping to zero for more than 30 seconds is a capacity signal — consider raising VOICE_AGENT_POOL_SIZE or scaling pods.
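The "zero for more than 30 seconds" rule needs state between polls, because a single reading of zero is not a signal by itself. One way to sketch it is below. `ExhaustionTracker` is a hypothetical helper, and the caller is assumed to poll the status endpoint and pass in the decoded JSON along with a monotonic timestamp.

```python
from typing import Optional


class ExhaustionTracker:
    """Flags a pool whose `available` count has stayed at zero past a threshold."""

    def __init__(self, threshold_s: float = 30.0) -> None:
        self.threshold_s = threshold_s
        self._zero_since: Optional[float] = None  # time of first zero reading in the current run

    def observe(self, status: dict, now: float) -> bool:
        """Record one pool status payload; return True once zero has persisted too long."""
        if status["available"] > 0:
            self._zero_since = None  # pool recovered, reset the timer
            return False
        if self._zero_since is None:
            self._zero_since = now
        return now - self._zero_since > self.threshold_s
```

Because the Daily room pool status has the same shape, a second tracker instance can watch that pool in the same way.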

GET /agent/voice/automatic/pool/rooms/status

Daily room pool status. Same shape as the voice-agent pool status.

Normal ranges

Metric                              Healthy
/health p50 latency                 under 50 ms
/metrics scrape latency             under 200 ms
Voice-agent pool available          ≥ 1 almost always
Daily room pool available           ≥ 2 almost always
Langfuse auto-eval loop heartbeat   Log line every SCORE_CHECK_INTERVAL_SECONDS

When to page

  • /health returns non-200 for > 60 s on a single pod → restart the pod.
  • /health fails on > 50% of pods → incident; page on-call.
  • Voice-agent pool available = 0 across all pods for > 2 min → capacity incident.
  • Metric scrape stops → observability incident; dashboards and alerts go blind.
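The paging rules above can be expressed as a small pure function, which makes them easy to unit-test before wiring them into an alerting system. This is a sketch under assumptions: `paging_actions`, its inputs, and the action strings are hypothetical, and the caller is assumed to already track how long each pod's /health has been failing and how long the voice-agent pool has been empty across all pods.

```python
def paging_actions(
    unhealthy_seconds: dict[str, float],
    pool_empty_seconds: float,
    scrape_ok: bool,
) -> list[str]:
    """Map current observations to the actions listed under "When to page".

    unhealthy_seconds: per pod, how long /health has been non-200 (0 if healthy).
    pool_empty_seconds: how long voice-agent `available` has been 0 on all pods.
    scrape_ok: whether the last /metrics scrape succeeded.
    """
    actions: list[str] = []

    # Single-pod rule: non-200 for more than 60 s means restart that pod.
    for pod in sorted(unhealthy_seconds):
        if unhealthy_seconds[pod] > 60:
            actions.append(f"restart pod {pod}")

    # Fleet rule: /health failing on more than half of the pods.
    failing = [pod for pod, secs in unhealthy_seconds.items() if secs > 0]
    if unhealthy_seconds and len(failing) / len(unhealthy_seconds) > 0.5:
        actions.append("page on-call: /health failing on > 50% of pods")

    # Capacity rule: pool empty everywhere for more than 2 minutes.
    if pool_empty_seconds > 120:
        actions.append("capacity incident: voice-agent pool exhausted")

    if not scrape_ok:
        actions.append("observability incident: metric scrape stopped")

    return actions
```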

Instrument your own dashboards

The default metrics are enough for plumbing-level monitoring. Build a Grafana dashboard that tracks lead backlog depth, pool exhaustion rate, and Langfuse zero-score rate — those catch product-level incidents that infra metrics miss.
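As one illustration of a derived dashboard figure, pool exhaustion rate could be computed as the fraction of scrape samples in which the pool had nothing available. That definition is an assumption made for this sketch, as is the `exhaustion_rate` helper. The source does not define the metric.

```python
def exhaustion_rate(available_samples: list[int]) -> float:
    """Fraction of samples in which the pool's `available` count was zero."""
    if not available_samples:
        return 0.0  # no data: report zero rather than dividing by zero
    exhausted = sum(1 for value in available_samples if value == 0)
    return exhausted / len(available_samples)
```

Feeding it the `available` values from one dashboard window of scrapes gives a number between 0 and 1 that can be plotted alongside lead backlog depth.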
