
Health and metrics

Breeze Buddy exposes a small set of operational endpoints. Use them for load balancer health checks, uptime monitoring, and pool observability. This page lists the endpoints to watch and what normal looks like for each.

GET /health

Liveness probe. Returns 200 OK when the process is alive and ready to serve. Use this for container liveness and load balancer health checks.

{ "status": "ok" }

Failure to respond within 1–2 seconds usually means the app is wedged. Restart the pod.
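As a rough illustration of an external uptime check against this endpoint, the sketch below treats anything other than a 200 response with the documented body as unhealthy, including timeouts. The `base_url` argument and the 2-second default timeout are assumptions chosen to match the guidance above, not values defined by Breeze Buddy.

```python
import json
import urllib.error
import urllib.request


def is_healthy(status_code: int, body: bytes) -> bool:
    """True only for a 200 response whose JSON body is {"status": "ok"}."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def check_health(base_url: str, timeout: float = 2.0) -> bool:
    """Probe GET /health; a timeout or connection error counts as unhealthy."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return is_healthy(resp.status, resp.read())
    except (urllib.error.URLError, TimeoutError, OSError):
        return False
```

Keeping the response check in `is_healthy` separate from the network call makes the rule easy to unit test without a running pod.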

GET /metrics

Prometheus-formatted metrics. Standard FastAPI + Pipecat instrumentation — request counts, latencies, call-state counts, pool sizes.

Scrape interval: 15 s is usually fine. Typical high-signal metrics:

breeze_buddy_active_calls — concurrent voice sessions.
breeze_buddy_lead_backlog — leads in BACKLOG waiting for cron pickup.
breeze_buddy_voice_agent_pool_available — ready voice-agent processes.
breeze_buddy_daily_room_pool_available — ready Daily rooms.
http_request_duration_seconds — p99 per route.
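If you want to read these values outside Prometheus, for example in a smoke test, a minimal parser for the text exposition format might look like the sketch below. The sample payload is made up for illustration, and whether the real metrics carry labels such as `pod` is an assumption; the sketch sums a metric across all of its label sets.

```python
from __future__ import annotations

# Illustrative payload only; real output will contain many more series.
SAMPLE = """\
# TYPE breeze_buddy_active_calls gauge
breeze_buddy_active_calls 4
breeze_buddy_lead_backlog 12
breeze_buddy_voice_agent_pool_available{pod="a"} 2
breeze_buddy_voice_agent_pool_available{pod="b"} 1
"""


def parse_samples(text: str, wanted: set[str]) -> dict[str, float]:
    """Return the summed value of each wanted metric found in the scrape text."""
    totals: dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name = line.split("{", 1)[0]
            rest = line.rsplit("}", 1)[1]  # value follows the closing brace
        else:
            name, _, rest = line.partition(" ")
        if name not in wanted:
            continue
        fields = rest.split()
        if fields:
            totals[name] = totals.get(name, 0.0) + float(fields[0])
    return totals


signals = parse_samples(
    SAMPLE,
    {"breeze_buddy_active_calls", "breeze_buddy_voice_agent_pool_available"},
)
```

For anything beyond a quick check, a maintained Prometheus client library is a safer choice than hand parsing.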

GET /agent/voice/automatic/pool/status

Voice-agent process pool status.

{
  "pool_size": 3,
  "available": 2,
  "in_use": 1,
  "warming": 0
}

available dropping to zero for more than 30 seconds is a capacity signal — consider raising VOICE_AGENT_POOL_SIZE or scaling pods.
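The 30-second rule above can be sketched as a small tracker that is fed one pool status response per poll. The class name and the idea of passing the poll time in explicitly are illustrative choices, not part of Breeze Buddy; only the `available` field and the 30-second threshold come from this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExhaustionTracker:
    """Flags a pool whose `available` count has stayed at zero too long."""

    threshold_seconds: float = 30.0
    _zero_since: Optional[float] = None  # time of the first consecutive zero reading

    def observe(self, status: dict, now: float) -> bool:
        """Record one poll; return True once the pool has been empty
        for longer than the threshold."""
        if status.get("available", 0) > 0:
            self._zero_since = None  # pool recovered, reset the window
            return False
        if self._zero_since is None:
            self._zero_since = now
        return (now - self._zero_since) > self.threshold_seconds
```

In a real poller, `now` would typically come from `time.monotonic()` so that wall-clock adjustments do not distort the window.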

GET /agent/voice/automatic/pool/rooms/status

Daily room pool status. Same shape as the voice-agent pool status.

Normal ranges

Metric	Healthy
`/health` p50 latency	under 50 ms
`/metrics` scrape latency	under 200 ms
Voice-agent pool `available`	≥ 1 almost always
Daily room pool `available`	≥ 2 almost always
Langfuse auto-eval loop heartbeat	Log line every `SCORE_CHECK_INTERVAL_SECONDS`

When to page

/health returns non-200 for > 60 s on a single pod → restart the pod.
/health fails on > 50% of pods → incident; page on-call.
Voice-agent pool available = 0 across all pods for > 2 min → capacity incident.
Metric scrape stops → observability incident; dashboards and metric-based alerts will go blind.
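The health and pool rules above can be expressed as a small triage function. This is a sketch under stated assumptions: the function name, the input shapes (seconds of consecutive non-200 responses per pod, and seconds the voice-agent pool has been empty across all pods), and the action strings are all hypothetical; only the thresholds come from the list above.

```python
from __future__ import annotations


def triage(unhealthy_for: dict[str, float], pool_empty_for: float) -> list[str]:
    """Map current failure durations to the actions listed above.

    unhealthy_for: pod name -> seconds /health has returned non-200 (0 if healthy).
    pool_empty_for: seconds the voice-agent pool has had available = 0 on all pods.
    """
    actions: list[str] = []
    # Single-pod rule: non-200 for more than 60 s means restart that pod.
    for pod, seconds in sorted(unhealthy_for.items()):
        if seconds > 60:
            actions.append(f"restart {pod}")
    # Fleet rule: more than half of the pods currently failing is an incident.
    failing = sum(1 for seconds in unhealthy_for.values() if seconds > 0)
    if unhealthy_for and failing / len(unhealthy_for) > 0.5:
        actions.append("page on-call: /health failing on most pods")
    # Capacity rule: pool empty everywhere for more than 2 minutes.
    if pool_empty_for > 120:
        actions.append("capacity incident: voice-agent pool exhausted")
    return actions
```

In practice these rules would more likely live in your alerting system; the function is only meant to make the thresholds concrete.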

Instrument your own dashboards

The default metrics are enough for plumbing-level monitoring. Build a Grafana dashboard that tracks lead backlog depth, pool exhaustion rate, and Langfuse zero-score rate — those catch product-level incidents that infra metrics miss.

Next steps

Observability overview: Langfuse, OpenTelemetry, logs.
Pools: Tune pool sizes.
Rate limiting: Outbound call pressure valves.
