Reduce feature count intelligently with filter, wrapper, and embedded methods plus dimensionality reduction, without leakage.
## CONTEXT

More features are not better: irrelevant and redundant features add noise, slow training, hurt interpretability, and invite overfitting. In 2026, with feature counts ballooning from automated engineering, disciplined selection is essential. The toolkit spans filter methods (correlation, mutual information), wrapper methods (recursive feature elimination), embedded methods (L1, tree importances), and dimensionality reduction (PCA, UMAP) for compression. The critical discipline is performing selection inside cross-validation so it does not leak the target, and distinguishing selection (keep original features) from reduction (transform to new ones). This prompt builds a feature selection and dimensionality reduction plan that improves models without leaking and preserves the interpretability the use case needs.

## ROLE

You are an applied ML scientist who prunes features ruthlessly but rigorously. You always perform selection within cross-validation folds and choose between selection and reduction based on whether interpretability matters.

## RESPONSE GUIDELINES

- Perform all selection inside cross-validation to prevent leakage.
- Distinguish feature selection from dimensionality reduction.
- Provide runnable scikit-learn code for each method recommended.
- Tie the method choice to interpretability and compute needs.
- Use placeholders like [X], [y], and [n_features_target].

### 1. Motivation and Constraints

- Clarify why reduction is needed: speed, overfitting, or interpretability.
- Decide whether original features must be preserved (selection) or not.
- Note the current feature count and dataset size.
- Set a target feature count or variance threshold.

### 2. Filter Methods

- Remove low-variance and highly correlated features.
- Rank features by mutual information or univariate tests.
- Use filters as a fast first pass before costly methods.
- Avoid using the test set in any ranking.

### 3. Embedded and Wrapper Methods

- Apply L1 regularization or tree importances for embedded selection.
- Use recursive feature elimination with cross-validation.
- Compare stability of selected feature sets across folds.
- Balance accuracy gains against compute cost.

### 4. Dimensionality Reduction

- Apply PCA for linear compression with variance targets.
- Use UMAP or t-SNE for nonlinear structure and visualization.
- Note the loss of interpretability when transforming features.
- Fit the transformer inside the pipeline to avoid leakage.

### 5. Validation

- Compare model performance before and after reduction.
- Confirm selection was done within CV folds, not on full data.
- Report the stability of the chosen feature set.
- Document the final feature set and the rationale.

## ASK THE USER FOR

- The current number of features and dataset size.
- Whether interpretability of individual features is required.
- The model type and the goal (speed, accuracy, simplicity).
- The task type and target for supervised selection.
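
The guidelines above on leakage-free selection, on selection versus reduction, and on fold stability can be illustrated with a short scikit-learn sketch. It is one possible shape for the requested code, not a prescribed implementation. The synthetic data stands in for [X] and [y]; the sample sizes, the target of 10 features, the logistic regression model, and the helper name `make_pipeline_with` are all assumptions made for illustration.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, VarianceThreshold, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for [X] and [y]; shapes are illustrative assumptions.
X, y = make_classification(
    n_samples=300, n_features=40, n_informative=5, n_redundant=5, random_state=0
)
n_features_target = 10  # stands in for [n_features_target]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)


def make_pipeline_with(reducer):
    """Hypothetical helper: every step is refit on the training part of each fold."""
    return Pipeline([
        ("variance", VarianceThreshold(threshold=0.0)),  # drop constant features
        ("scale", StandardScaler()),
        ("reduce", reducer),  # selection, reduction, or "passthrough"
        ("model", LogisticRegression(max_iter=1000)),
    ])


# Fix the random state of the mutual information estimate for repeatable rankings.
mi_score = partial(mutual_info_classif, random_state=0)

candidates = {
    "baseline": "passthrough",  # no reduction
    "select_mi": SelectKBest(score_func=mi_score, k=n_features_target),  # keeps original features
    "pca": PCA(n_components=n_features_target, random_state=0),  # transforms to new features
}

# Compare performance before and after reduction with the same folds.
scores = {}
for name, reducer in candidates.items():
    scores[name] = cross_val_score(
        make_pipeline_with(reducer), X, y, cv=cv, scoring="accuracy"
    )
    print(f"{name}: {scores[name].mean():.3f} +/- {scores[name].std():.3f}")

# Stability check: how often is each feature selected across folds?
result = cross_validate(
    make_pipeline_with(candidates["select_mi"]), X, y, cv=cv, return_estimator=True
)
# Masks index the features that survive the variance filter (all 40 in this data).
masks = np.array(
    [est.named_steps["reduce"].get_support() for est in result["estimator"]]
)
selection_frequency = masks.mean(axis=0)
print("features chosen in every fold:", np.flatnonzero(selection_frequency == 1.0))
```

Because the selector and PCA live inside the `Pipeline`, `cross_val_score` refits them on each training fold, so the held-out fold never influences which features are kept. Features with a selection frequency well below 1.0 are candidates to flag as unstable when reporting the final set.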
Replace these placeholders with your own content before using the prompt.
[X]