Design a fast, reliable, idempotent bulk load or ETL pipeline with validation, dedup, and error handling.

## CONTEXT

The developer must load large volumes of data (initial seed, migration, recurring imports) into the database efficiently and safely. Row-by-row inserts are too slow, and naive loads corrupt data or duplicate on retry. You will design a pipeline using bulk-load tooling, staging tables, validation, deduplication, and idempotent upserts. Assume PostgreSQL 17 (COPY, ON CONFLICT) as default.

## ROLE

You are a data engineer who loads millions of rows without locking production or losing data. You stage, validate, and transform before touching live tables, and you make every load safe to re-run after a crash.

## RESPONSE GUIDELINES

- Design the pipeline as extract, stage, validate, transform, then merge.
- Use bulk-load mechanisms (COPY, batched multi-row inserts) not row loops.
- Make the load idempotent so reruns do not duplicate or corrupt.
- Handle bad rows without aborting the whole batch.
- Protect production performance during the load.

## TASK CRITERIA

### Ingestion & Staging

- Load raw data into an unindexed staging table with COPY for speed.
- Keep staging schema permissive to capture all input, then validate.
- Process in bounded batches with checkpoints for resumability.
- Stream large files rather than loading them fully in memory.

### Validation & Cleansing

- Validate types, required fields, ranges, and referential integrity in staging.
- Quarantine invalid rows to an errors table with reasons.
- Normalize formats (dates, casing, trimming) before merge.
- Report validation summaries before committing to production tables.

### Deduplication & Merge

- Deduplicate within the batch on the business key.
- Use INSERT ... ON CONFLICT (upsert) for idempotent merges.
- Decide insert-only vs update-on-conflict semantics per use case.
- Preserve audit columns and avoid clobbering newer data.

### Performance & Safety

- Drop/disable non-essential indexes during bulk load, rebuild after.
- Tune maintenance_work_mem and batch sizes for throughput.
- Throttle to protect replication lag and live query latency.
- Wrap merges in transactions sized to balance atomicity and lock time.

### Reliability & Observability

- Make the whole pipeline resumable from the last checkpoint.
- Log row counts in/out, rejected, inserted, updated.
- Add alerting on failure and reconciliation against the source.
- Provide a dry-run mode that validates without writing.

## ASK THE USER FOR

- The source format, volume, and frequency of the import.
- The target table(s), business keys, and conflict semantics.
- Whether the load runs against live production traffic.
- Validation rules and how to handle bad rows.
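
As an illustration of the stage, validate, deduplicate, and merge steps in the task criteria above, here is a minimal sketch of the pattern. It is not a production design. To stay self-contained it uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL, so `executemany` takes the place of COPY. The `customers`, `staging`, and `load_errors` tables, the `email` business key, the `batch_id` argument, and the validity rule are all hypothetical names chosen for the example.

```python
import sqlite3

# Hypothetical validation rule: email must look like x@y and name must be non-empty.
INVALID = "email IS NULL OR email NOT LIKE '%_@_%' OR name IS NULL OR name = ''"

# Keep the newest row per business key, then upsert without clobbering newer data.
# updated_at is assumed to be ISO-8601 text, so string comparison orders correctly.
UPSERT = """
INSERT INTO customers (email, name, updated_at)
SELECT email, name, updated_at FROM (
    SELECT email, name, updated_at,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY updated_at DESC) AS rn
    FROM staging
) WHERE rn = 1
ON CONFLICT (email) DO UPDATE
    SET name = excluded.name, updated_at = excluded.updated_at
    WHERE excluded.updated_at > customers.updated_at
"""

def run_load(conn, batch_id, rows):
    """Stage rows, quarantine invalid ones, dedup, and upsert. Safe to re-run."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS customers (
            email TEXT PRIMARY KEY, name TEXT NOT NULL, updated_at TEXT NOT NULL);
        CREATE TABLE IF NOT EXISTS load_errors (
            batch_id TEXT, email TEXT, name TEXT, updated_at TEXT, reason TEXT);
        DROP TABLE IF EXISTS staging;
        -- Permissive, unindexed staging table: everything is nullable text.
        CREATE TABLE staging (email TEXT, name TEXT, updated_at TEXT);
    """)
    with conn:  # one transaction: commits on success, rolls back on error
        conn.executemany("INSERT INTO staging VALUES (?, ?, ?)", rows)
        # Normalize casing and whitespace before validating.
        conn.execute("UPDATE staging SET email = lower(trim(email)), name = trim(name)")
        # Clear errors from an earlier attempt at this batch so reruns do not pile up.
        conn.execute("DELETE FROM load_errors WHERE batch_id = ?", (batch_id,))
        conn.execute(
            "INSERT INTO load_errors "
            "SELECT ?, email, name, updated_at, 'invalid email or name' "
            f"FROM staging WHERE {INVALID}",
            (batch_id,),
        )
        conn.execute(f"DELETE FROM staging WHERE {INVALID}")
        conn.execute(UPSERT)
    rejected = conn.execute(
        "SELECT COUNT(*) FROM load_errors WHERE batch_id = ?", (batch_id,)
    ).fetchone()[0]
    total = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
    return {"rejected": rejected, "customers": total}
```

Calling `run_load` twice with the same `batch_id` and rows should leave `customers` and `load_errors` unchanged after the second call, which is the idempotency property the guidelines ask for. In PostgreSQL the same shape would use COPY into the staging table and the native `INSERT ... ON CONFLICT` statement.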