API Observability & Monitoring Designer

Author: FindPrompts

Design observability for an API with the right metrics, logs, traces, SLOs, and actionable alerting.

6/11/2026

Prompt

## CONTEXT
I run an API in production in 2026 and need observability that tells me when something is wrong, lets me debug it fast, and proves I meet my reliability promises. The three pillars (metrics, logs, traces) only help if they are designed deliberately: I need RED/USE metrics, structured logs with correlation IDs, distributed tracing across services, SLOs with error budgets, and alerts that fire on user-impacting symptoms rather than noise. Many APIs are over-instrumented with vanity dashboards yet still blind during incidents. I want an observability design focused on detecting and diagnosing problems and on proving the health of the API.

## ROLE
Act as an SRE who has been on call for high-traffic APIs and designed the observability that made incidents short instead of catastrophic. You prioritize signal over noise and instrument for the questions you ask during incidents.

## RESPONSE GUIDELINES
- Organize around the three pillars plus SLOs, all tied to user impact.
- Recommend specific metrics (RED method) rather than generic dashboards.
- Insist on correlation/trace IDs threading logs, metrics, and traces together.
- Design alerts on symptoms (SLO burn) not causes, to cut noise.
- Cover OpenTelemetry-based instrumentation as the 2026 standard.

## TASK CRITERIA

### 1. Metrics (RED/USE)
- Instrument Rate, Errors, Duration per endpoint and dependency.
- Capture latency distributions (percentiles), not just averages.
- Add saturation/utilization metrics for capacity (USE).
- Tag metrics with useful dimensions without cardinality explosions.
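
As a reference point for the level of detail I expect, here is a minimal sketch of RED recording in plain Python. It is illustrative only: `RedMetrics` and its method names are hypothetical, it keeps raw samples in memory (a real system would use histograms in a metrics backend), and it assumes the `endpoint` label is a route template such as `/orders/{id}` rather than the raw path, which is one way to keep cardinality bounded.

```python
import math
from collections import defaultdict


class RedMetrics:
    """Hypothetical in-memory RED recorder, keyed by route template."""

    def __init__(self):
        self.requests = defaultdict(int)    # Rate: request count per endpoint
        self.errors = defaultdict(int)      # Errors: 5xx count per endpoint
        self.durations = defaultdict(list)  # Duration: raw samples in seconds

    def observe(self, endpoint, status_code, duration_s):
        self.requests[endpoint] += 1
        if status_code >= 500:  # assumption: only server errors count as errors
            self.errors[endpoint] += 1
        self.durations[endpoint].append(duration_s)

    def error_ratio(self, endpoint):
        total = self.requests[endpoint]
        return self.errors[endpoint] / total if total else 0.0

    def mean(self, endpoint):
        values = self.durations[endpoint]
        return sum(values) / len(values) if values else None

    def percentile(self, endpoint, p):
        """Nearest-rank percentile for 0 < p <= 100."""
        values = sorted(self.durations[endpoint])
        if not values:
            return None
        rank = max(1, math.ceil(p * len(values) / 100))
        return values[rank - 1]


metrics = RedMetrics()
for _ in range(98):
    metrics.observe("/orders/{id}", 200, 0.05)  # fast, successful requests
for _ in range(2):
    metrics.observe("/orders/{id}", 503, 2.0)   # slow, failing requests
```

With this sample data the mean stays under 100 ms while the 99th percentile is 2 seconds, which is the reason I ask for percentiles rather than averages.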

### 2. Structured Logging
- Define a structured log schema with correlation/request IDs.
- Log at the appropriate severity level and never log sensitive data (credentials, tokens, PII).
- Make logs queryable for incident investigation.
- Correlate logs to traces and the originating request.
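
To make the schema request concrete, here is a rough sketch of JSON logging with a request ID, using only Python's standard `logging`, `json`, and `contextvars` modules. The field names (`ts`, `request_id`, and so on), the `fields` convention for extra attributes, and the `SENSITIVE_KEYS` list are my assumptions, not a required schema.

```python
import contextvars
import io
import json
import logging

# Holds the current request's correlation ID; middleware would set it per request.
request_id_var = contextvars.ContextVar("request_id", default=None)

# Assumed deny-list; a real one should reflect your own PII policy.
SENSITIVE_KEYS = {"password", "authorization", "token"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": request_id_var.get(),
        }
        # Extra attributes arrive via logger.info(..., extra={"fields": {...}}).
        for key, value in getattr(record, "fields", {}).items():
            redact = key.lower() in SENSITIVE_KEYS
            payload[key] = "[REDACTED]" if redact else value
        return json.dumps(payload)


def build_logger(stream):
    """Return an 'api' logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("api")
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


stream = io.StringIO()
log = build_logger(stream)
request_id_var.set("req-123")
log.info("order created", extra={"fields": {"endpoint": "/orders", "token": "abc"}})
entry = json.loads(stream.getvalue())
```

Because the ID lives in a context variable, every log line emitted while handling a request carries the same `request_id` without passing it through each function call. A `trace_id` field could be added the same way to link logs to traces.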

### 3. Distributed Tracing
- Implement OpenTelemetry tracing across service boundaries.
- Propagate trace context through sync and async calls.
- Sample intelligently to balance cost and coverage.
- Capture spans for external dependencies and DB calls.
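
To illustrate what I mean by propagating trace context, here is a small sketch of handling the W3C Trace Context `traceparent` header by hand. In practice I would expect the OpenTelemetry SDK's propagators to do this; the function names below are hypothetical, and defaulting a new root trace to "sampled" is an assumption rather than a sampling recommendation.

```python
import re
import secrets

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex), lowercase hex
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return the trace context as a dict, or None if the header is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C Trace Context spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming=None):
    """Build the header for an outgoing call, continuing the incoming trace."""
    ctx = parse_traceparent(incoming) if incoming else None
    trace_id = ctx["trace_id"] if ctx else secrets.token_hex(16)
    sampled = ctx["sampled"] if ctx else True  # assumption: sample new roots
    span_id = secrets.token_hex(8)  # fresh ID for this hop's span
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```

The outgoing header keeps the trace ID and sampling flag but carries a new span ID, so downstream services join the same trace. For async calls the same header value would travel in message metadata instead of HTTP headers.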

### 4. SLOs & Error Budgets
- Define SLIs (availability, latency) reflecting real user experience.
- Set SLO targets and the resulting error budget.
- Use error-budget burn rate for alerting decisions.
- Tie SLOs to the API's contractual or product commitments.
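
For clarity on the arithmetic I have in mind, here is a short sketch of error-budget and burn-rate calculations. The function names are hypothetical, and the 99.9% target over a 30-day window is only an example, not my actual SLO.

```python
def error_budget_minutes(slo_target, window_days):
    """Minutes of full unavailability the SLO tolerates over the window."""
    return (1 - slo_target) * window_days * 24 * 60


def burn_rate(error_ratio, slo_target):
    """How fast the budget is consumed; 1.0 means exactly on budget."""
    return error_ratio / (1 - slo_target)


def hours_to_exhaustion(rate, window_days):
    """Hours until the whole window's budget is gone at a constant burn rate."""
    return float("inf") if rate <= 0 else window_days * 24 / rate


budget = error_budget_minutes(0.999, 30)  # about 43.2 minutes
rate = burn_rate(0.0144, 0.999)           # about 14.4x the sustainable rate
remaining = hours_to_exhaustion(rate, 30)  # about 50 hours
```

A 1.44% error ratio looks small in isolation, but against a 99.9% target it would exhaust a 30-day budget in roughly two days, which is why I want alerts driven by burn rate rather than raw error counts.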

### 5. Alerting & Operations
- Alert on user-impacting symptoms and SLO burn, not every blip.
- Design alert severity, routing, and runbook links.
- Build dashboards that answer incident questions, not vanity metrics.
- List observability anti-patterns (alert fatigue, relying on averages, telemetry with no correlation IDs).
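
As an example of symptom-based alerting, here is a sketch of a multi-window burn-rate check. The thresholds (14.4 over 1h and 5m, 6 over 6h and 30m, 1 over 3d and 6h) are the commonly cited starting points from the Google SRE Workbook for a 30-day SLO; treat them as assumptions to tune, and `alert_severity` as a hypothetical name.

```python
# (long window, short window, burn-rate threshold) pairs that should page.
PAGE_RULES = [("1h", "5m", 14.4), ("6h", "30m", 6.0)]


def alert_severity(burn):
    """Map burn rates per window, e.g. {"1h": 15.0, "5m": 16.0}, to a severity.

    Both windows in a pair must exceed the threshold: the long window shows
    the burn is significant, the short window shows it is still happening.
    """
    for long_w, short_w, threshold in PAGE_RULES:
        if burn.get(long_w, 0.0) >= threshold and burn.get(short_w, 0.0) >= threshold:
            return "page"
    # Slow, sustained burn: worth a ticket, not a wake-up call.
    if burn.get("3d", 0.0) >= 1.0 and burn.get("6h", 0.0) >= 1.0:
        return "ticket"
    return None


ongoing_incident = alert_severity({"1h": 15.0, "5m": 16.0})
already_recovered = alert_severity({"1h": 15.0, "5m": 0.5})
slow_leak = alert_severity({"3d": 1.5, "6h": 1.2})
```

Requiring the short window to agree means an incident that has already recovered stops alerting quickly, which is one way to cut noise without missing real budget burn.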

## ASK THE USER FOR
- Your API's scale, architecture, and the dependencies it calls.
- Your current observability stack and tooling (OTel, vendors).
- Your reliability targets or SLAs and the most painful past incidents.
- Constraints on cost, data retention, and PII in telemetry.
