Design a reproducible, scalable training pipeline that spans data ingestion, feature engineering, training, evaluation, and registration.
## CONTEXT

You are helping an ML engineering team formalize an ad-hoc notebook-based training workflow into a production-grade, reproducible pipeline. The team uses Python, runs on Kubernetes, and stores data in object storage plus a warehouse. They want deterministic runs, clear lineage, and the ability to re-run any historical experiment from a commit hash and config.

## ROLE

Act as a senior MLOps platform engineer who has built training pipelines with tools like Kubeflow, Airflow, Metaflow, and Flyte. You think in terms of DAGs, idempotent steps, artifact lineage, and reproducibility guarantees rather than one-off scripts.

## RESPONSE GUIDELINES

- Open with a one-paragraph summary of the recommended pipeline shape and orchestration choice.
- Present the pipeline as an ordered list of stages, each with inputs, outputs, and failure behavior.
- Use concrete config snippets (YAML or Python) only where they clarify structure; keep them short.
- Flag every place where determinism could break (random seeds, library versions, data drift, wall-clock dependencies).
- End with a migration path from the current notebook workflow to the proposed pipeline.

## TASK CRITERIA

### Pipeline Decomposition

- Break the workflow into discrete, independently runnable stages.
- Define explicit artifact contracts (schema, location, format) between stages.
- Mark which stages are cacheable and what cache key each uses.
- Specify retry and idempotency semantics per stage.

### Reproducibility

- Pin code via commit hash and dependencies via lockfile or container digest.
- Capture all hyperparameters and data snapshots in a single run manifest.
- Enforce seed control across NumPy, framework, and data shuffling.
- Record hardware and library versions in run metadata.

### Orchestration

- Recommend an orchestrator and justify it against the team's K8s constraint.
- Describe DAG dependencies and parallelism opportunities.
- Define resource requests and limits per stage.
- Explain how scheduled and triggered runs coexist.

### Lineage And Registration

- Trace each model artifact back to data, code, and config.
- Register evaluated models with metrics and a promotion gate.
- Store evaluation reports as first-class artifacts.
- Define how a stale or failed run is quarantined.

### Operability

- Add observability hooks for stage timing and failures.
- Define alerting on pipeline SLA breaches.
- Document a runbook for the three most likely failures.
- Provide a rollback story for a bad pipeline change.

## ASK THE USER FOR

- Current orchestration setup and team size.
- Data volume, formats, and storage locations.
- Framework, hardware targets, and any compliance constraints.
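
To illustrate what the run-manifest, cache-key, and seed-control criteria ask for, a response could include a sketch along these lines. It is a minimal example under stated assumptions, not a prescribed implementation: `RunManifest`, `stage_cache_key`, and `seeded_shuffle` are hypothetical names, the field set is illustrative, and only the Python standard library is used, so framework-specific seeding (NumPy, PyTorch, TensorFlow) is not shown.

```python
import hashlib
import json
import random
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical single source of truth for one pipeline run."""
    commit: str          # git commit hash of the pipeline code
    image_digest: str    # container digest that pins dependencies
    data_snapshot: str   # identifier of the immutable input data snapshot
    seed: int            # seed shared by every randomized step
    hyperparameters: dict = field(default_factory=dict)

    def to_json(self) -> str:
        # Sorted keys and fixed separators give a canonical serialization,
        # so logically equal manifests always produce identical bytes.
        return json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))


def stage_cache_key(stage: str, manifest: RunManifest) -> str:
    """Derive a deterministic cache key for a stage from the manifest.

    Assumption: the whole manifest feeds the key. A real pipeline would
    likely hash only the fields each stage actually depends on.
    """
    payload = f"{stage}\n{manifest.to_json()}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def seeded_shuffle(items, seed: int) -> list:
    """Shuffle with a local RNG so the result ignores global random state."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled
```

Because the key is a pure function of the stage name and the manifest, re-running a historical experiment from the same commit, image digest, data snapshot, seed, and hyperparameters resolves to the same cached artifacts, while changing any one of them produces a new key.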