Design observability for an API with the right metrics, logs, traces, SLOs, and actionable alerting.
## CONTEXT

I run an API in production in 2026 and need observability that tells me when something is wrong, lets me debug it fast, and proves I meet my reliability promises. The three pillars (metrics, logs, traces) only help if designed deliberately: I need the RED/USE metrics, structured logs with correlation IDs, distributed tracing across services, SLOs with error budgets, and alerts that fire on user-impacting symptoms rather than noise. Many APIs are over-instrumented with vanity dashboards yet still blind during incidents. I want an observability design focused on detecting, diagnosing, and proving the health of the API.

## ROLE

Act as an SRE who has been on call for high-traffic APIs and designed the observability that made incidents short instead of catastrophic. You prioritize signal over noise and instrument for the questions you ask during incidents.

## RESPONSE GUIDELINES

- Organize around the three pillars plus SLOs, all tied to user impact.
- Recommend specific metrics (RED method) rather than generic dashboards.
- Insist on correlation/trace IDs threading logs, metrics, and traces together.
- Design alerts on symptoms (SLO burn) not causes, to cut noise.
- Cover OpenTelemetry-based instrumentation as the 2026 standard.

## TASK CRITERIA

### 1. Metrics (RED/USE)

- Instrument Rate, Errors, Duration per endpoint and dependency.
- Capture latency distributions (percentiles), not just averages.
- Add saturation/utilization metrics for capacity (USE).
- Tag metrics with useful dimensions without cardinality explosions.

### 2. Structured Logging

- Define a structured log schema with correlation/request IDs.
- Log at the right level and avoid logging sensitive data.
- Make logs queryable for incident investigation.
- Correlate logs to traces and the originating request.

### 3. Distributed Tracing

- Implement OpenTelemetry tracing across service boundaries.
- Propagate trace context through sync and async calls.
- Sample intelligently to balance cost and coverage.
- Capture spans for external dependencies and DB calls.

### 4. SLOs & Error Budgets

- Define SLIs (availability, latency) reflecting real user experience.
- Set SLO targets and the resulting error budget.
- Use error-budget burn rate for alerting decisions.
- Tie SLOs to the API's contractual or product commitments.

### 5. Alerting & Operations

- Alert on user-impacting symptoms and SLO burn, not every blip.
- Design alert severity, routing, and runbook links.
- Build dashboards that answer incident questions, not vanity metrics.
- List observability anti-patterns (alert fatigue, averages, no correlation).

## ASK THE USER FOR

- Your API's scale, architecture, and the dependencies it calls.
- Your current observability stack and tooling (OTel, vendors).
- Your reliability targets or SLAs and the most painful past incidents.
- Constraints on cost, data retention, and PII in telemetry.
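
To illustrate the error-budget burn-rate arithmetic that the SLO criteria refer to, here is a minimal sketch. Nothing in it comes from the prompt itself: the function names are hypothetical, the SLO targets are example inputs, and the 14.4 threshold is an assumption. It is a commonly cited fast-burn value for a 30-day window, because burning at 14.4x for one hour consumes about 2% of a 30-day budget.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail under the SLO (e.g. 0.001 for 99.9%)."""
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget.

    1.0 means the budget is being spent exactly at the sustainable pace;
    higher values mean it will run out before the SLO window ends.
    """
    if total == 0:
        return 0.0  # no traffic, nothing burned
    return (failed / total) / error_budget(slo_target)


def should_page(short_window_rate: float, long_window_rate: float,
                threshold: float = 14.4) -> bool:
    """Page only when both windows exceed the threshold (assumed value).

    Requiring a short and a long window to agree filters out brief blips
    while still reacting quickly to sustained, user-impacting burn.
    """
    return short_window_rate >= threshold and long_window_rate >= threshold


if __name__ == "__main__":
    # Hypothetical counts for a 5-minute and a 1-hour window at a 99.9% SLO.
    short = burn_rate(failed=30, total=1_000, slo_target=0.999)
    long_ = burn_rate(failed=200, total=10_000, slo_target=0.999)
    print(f"short={short:.1f} long={long_:.1f} page={should_page(short, long_)}")
```

The two-window check is one way to express "alert on SLO burn, not every blip": a single spike raises the short-window rate but leaves the long-window rate below the threshold, so no page fires.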