Health and metrics
Endpoints to watch and what normal looks like. Useful for load balancer probes, uptime checks, and pool monitoring.
Breeze Buddy exposes a small set of operational endpoints. Each one is described below, followed by the values to expect from a healthy deployment and the conditions that warrant a page.
GET /health
Liveness probe. Returns 200 OK when the process is alive and ready to serve. Use this for container liveness and load balancer health checks.
{ "status": "ok" }
Failure to respond within 1–2 seconds usually means the app is wedged. Restart the pod.
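As a rough illustration of the probe described above, the sketch below polls /health with a 2-second timeout and treats anything other than a 200 response carrying { "status": "ok" } as unhealthy. The function names and the base URL argument are hypothetical; only the path, the status code, and the response body come from this page.

```python
import http.client
import json
import urllib.request


def response_is_healthy(status: int, body: bytes) -> bool:
    """Return True only for a 200 response whose JSON body is {"status": "ok"}."""
    if status != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        # Covers malformed JSON and undecodable bytes.
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def probe_health(base_url: str, timeout_s: float = 2.0) -> bool:
    """GET {base_url}/health; any error or timeout counts as unhealthy.

    base_url is a placeholder for wherever the service is reachable.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout_s) as resp:
            return response_is_healthy(resp.status, resp.read())
    except (OSError, http.client.HTTPException):
        # OSError covers refused connections, timeouts, and non-2xx HTTPError.
        return False
```

Keeping the response check separate from the network call makes the health rule easy to unit test without a running server.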
GET /metrics
Prometheus-formatted metrics. Standard FastAPI + Pipecat instrumentation — request counts, latencies, call-state counts, pool sizes.
Scrape interval: 15 s is usually fine. Typical high-signal metrics:
- breeze_buddy_active_calls: concurrent voice sessions.
- breeze_buddy_lead_backlog: leads in BACKLOG waiting for cron pickup.
- breeze_buddy_voice_agent_pool_available: ready voice-agent processes.
- breeze_buddy_daily_room_pool_available: ready Daily rooms.
- http_request_duration_seconds: p99 per route.
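For a quick look at these values without a full Prometheus setup, a small parser over the text exposition format can be enough. The sketch below assumes the four breeze_buddy_* metrics are exposed as plain gauge samples, and it sums samples that share a name across label sets, which is a simplifying assumption rather than documented behaviour. The function name and the regular expression are illustrative only.

```python
import re

# One sample line: <name>[{labels}] <value> [timestamp]
SAMPLE_LINE = re.compile(
    r"^([a-zA-Z_:][a-zA-Z0-9_:]*)"  # metric name
    r"(?:\{.*\})?"                   # optional label set
    r"\s+(\S+)"                      # sample value
    r"(?:\s+-?\d+)?$"                # optional timestamp
)

WATCHED = (
    "breeze_buddy_active_calls",
    "breeze_buddy_lead_backlog",
    "breeze_buddy_voice_agent_pool_available",
    "breeze_buddy_daily_room_pool_available",
)


def read_gauges(exposition: str, names=WATCHED) -> dict:
    """Extract the watched metrics from a /metrics response body."""
    totals = {}
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None or match.group(1) not in names:
            continue
        try:
            value = float(match.group(2))
        except ValueError:
            continue  # not a numeric sample; ignore it
        # Assumption: sum across label sets that share a metric name.
        totals[match.group(1)] = totals.get(match.group(1), 0.0) + value
    return totals
```

Metrics that are absent from the scrape are simply missing from the returned dictionary, so callers can distinguish "zero" from "not reported".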
GET /agent/voice/automatic/pool/status
Voice-agent process pool status.
{
  "pool_size": 3,
  "available": 2,
  "in_use": 1,
  "warming": 0
}
available dropping to zero for more than 30 seconds is a capacity signal: consider raising VOICE_AGENT_POOL_SIZE or scaling pods.
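The 30-second rule can be checked from a series of polled status responses. The sketch below is one possible way to do it, assuming the caller collects (timestamp, available) pairs in chronological order; the function name and the sampling scheme are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def pool_exhausted(
    samples: Iterable[Tuple[float, int]],
    threshold_s: float = 30.0,
) -> bool:
    """Return True if `available` has been 0 for more than threshold_s
    up to and including the most recent sample.

    samples: (timestamp_in_seconds, available) pairs, oldest first.
    """
    zero_since: Optional[float] = None
    last_ts: Optional[float] = None
    for ts, available in samples:
        last_ts = ts
        if available > 0:
            zero_since = None      # pool recovered; reset the streak
        elif zero_since is None:
            zero_since = ts        # first zero of a new streak
    if zero_since is None or last_ts is None:
        return False
    return last_ts - zero_since > threshold_s
```

Only the trailing run of zeros counts, so a brief dip that has already recovered does not raise the signal.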
GET /agent/voice/automatic/pool/rooms/status
Daily room pool status. Same shape as the voice-agent pool status.
Normal ranges
| Metric | Healthy |
|---|---|
| /health p50 latency | under 50 ms |
| /metrics scrape latency | under 200 ms |
| Voice-agent pool available | ≥ 1 almost always |
| Daily room pool available | ≥ 2 almost always |
| Langfuse auto-eval loop heartbeat | Log line every SCORE_CHECK_INTERVAL_SECONDS |
When to page
- /health returns non-200 for > 60 s on a single pod → restart the pod.
- /health fails on > 50% of pods → incident; page on-call.
- Voice-agent pool available = 0 across all pods for > 2 min → capacity incident.
- Metric scrape stops → observability incident; traces will go blind.
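The rules above can be written down as a small decision function, which is handy for testing alert logic before wiring it into an alerting system. This is a minimal sketch: the PodState fields, the action labels, and the idea of tracking "seconds failing" per pod are all assumptions made for illustration, and the rules are evaluated independently because this page does not define a precedence between them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PodState:
    name: str
    health_failing_s: float  # seconds /health has been non-200; 0 if healthy
    pool_empty_s: float      # seconds voice-agent pool `available` has been 0


def evaluate(pods: List[PodState], scrape_ok: bool) -> List[str]:
    """Map the current fleet state to the actions listed under "When to page"."""
    actions = []

    # Single pod failing /health for more than 60 s: restart that pod.
    for pod in pods:
        if pod.health_failing_s > 60:
            actions.append(f"restart:{pod.name}")

    # /health failing on more than 50% of pods: incident, page on-call.
    failing = sum(1 for pod in pods if pod.health_failing_s > 0)
    if pods and failing > len(pods) / 2:
        actions.append("incident:health")

    # Pool empty on every pod for more than 2 minutes: capacity incident.
    if pods and all(pod.pool_empty_s > 120 for pod in pods):
        actions.append("incident:capacity")

    # Metric scrape stopped: observability incident.
    if not scrape_ok:
        actions.append("incident:observability")

    return actions
```

An empty result means none of the listed conditions currently holds.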
Instrument your own dashboards
The default metrics are enough for plumbing-level monitoring. Build a Grafana dashboard that tracks lead backlog depth, pool exhaustion rate, and Langfuse zero-score rate — those catch product-level incidents that infra metrics miss.