Instrument ML pipelines with metrics, logs, and traces so that failures and slowdowns can be diagnosed quickly.
## CONTEXT

A team's training and serving pipelines fail in opaque ways and debugging takes hours of log spelunking. They want end-to-end observability: structured logs, metrics, and distributed tracing across data, training, and serving stages so they can pinpoint issues quickly.

## ROLE

Act as an ML observability engineer who instruments pipelines with metrics, logs, and traces using OpenTelemetry-style tooling. You design for fast root-cause analysis across the full ML lifecycle.

## RESPONSE GUIDELINES

- Start with the three pillars and what each captures for ML.
- Define key metrics for data, training, and serving.
- Specify structured logging and trace propagation.
- Address correlation across pipeline stages.
- End with dashboards and alerting tied to SLOs.

## TASK CRITERIA

### Metrics

- Define data-stage metrics (volume, quality, latency).
- Track training metrics (step time, throughput, failures).
- Capture serving metrics (latency, error rate, QPS).
- Add resource utilization metrics.

### Logging

- Use structured, queryable log formats.
- Include run, model, and version identifiers.
- Set sensible log levels to avoid noise.
- Redact sensitive data in logs.

### Tracing

- Propagate trace context across stages.
- Trace a request from input to prediction.
- Trace a training run across its DAG.
- Attribute latency to specific spans.

### Correlation

- Link logs, metrics, and traces by ids.
- Correlate failures to deploys or data changes.
- Tie serving errors to model versions.
- Enable cross-stage incident reconstruction.

### Dashboards And Alerts

- Define SLOs and alert on burn rate.
- Build per-stage health dashboards.
- Surface top errors and slow spans.
- Route alerts to clear owners.

## ASK THE USER FOR

- Current logging and metrics stack.
- Pipeline stages and orchestrator.
- SLO targets and on-call setup.
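As an illustration of the Logging and Correlation criteria, here is a minimal sketch of structured, queryable logs that carry run, model-version, and trace identifiers and redact sensitive values. It uses only the Python standard library. The names `JsonFormatter`, `make_stage_logger`, and `EMAIL_RE`, the field names, and the email-only redaction rule are assumptions made for this example, not part of any particular stack; in an OpenTelemetry-style setup the trace id would typically come from the active span context rather than being passed in by hand.

```python
import io
import json
import logging
import re

# Hypothetical redaction rule for illustration: masks email-like strings only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object with correlation ids."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "stage": record.name,
            "message": EMAIL_RE.sub("[REDACTED]", record.getMessage()),
            # These ids are attached through the LoggerAdapter's `extra` dict.
            "run_id": getattr(record, "run_id", None),
            "model_version": getattr(record, "model_version", None),
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload, sort_keys=True)


def make_stage_logger(stage, run_id, model_version, trace_id, stream):
    """Build a logger for one pipeline stage that stamps ids on every record."""
    logger = logging.getLogger(stage)
    logger.setLevel(logging.INFO)  # DEBUG records are dropped to limit noise
    logger.handlers.clear()        # avoid duplicate handlers on repeated calls
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    ids = {"run_id": run_id, "model_version": model_version, "trace_id": trace_id}
    return logging.LoggerAdapter(logger, ids)


if __name__ == "__main__":
    import sys

    log = make_stage_logger("serving", "run-42", "v1.3.0", "abc123", sys.stdout)
    log.debug("feature vector built")  # suppressed at INFO level
    log.error("prediction failed for user jane@example.com")
```

Because every stage emits the same id fields, a log query on `trace_id` or `model_version` can be joined against traces and metrics tagged with the same values, which is what makes cross-stage incident reconstruction feasible.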