Computational Budget Should Be Considered in Data Selection

Published in NeurIPS, 2025

Data selection is crucial for efficient training, but existing methods typically ignore the available computational budget, treating data importance as a static property. This paper argues that the optimal training subset depends heavily on the compute budget: limited budgets favor cleaner, easier-to-learn data (low-frequency features), while larger budgets benefit from more diverse, complex data.

To address this, the authors propose Computational budget-Aware Data Selection (CADS), a bilevel optimization framework that jointly optimizes the data subset and the model parameters under an explicit budget constraint. A probabilistic reparameterization of the subset choice, combined with a penalty-based relaxation to a single-level problem, sidesteps the prohibitive cost of solving the bilevel problem directly. The method comes in two variants: CADS-E for fine-grained example-level selection and CADS-S for scalable source-level weighting. Experiments on vision and language tasks show performance gains of up to 14.42% over baselines, along with significant speedups.
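To make the idea concrete, here is a minimal sketch of the general pattern: each example gets a selection probability (the probabilistic reparameterization), the expected selection cost is kept under a budget via a penalty term, and the model and selection logits are updated in an alternating single-level loop. This is not the authors' implementation; the toy task (logistic regression), the first-order gradient-alignment approximation of the outer objective, and all names and constants (`budget`, `lam`, `align`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: logistic regression, with a held-out validation split
# standing in for the outer (selection) objective.
n, n_val, d = 200, 50, 5
X = rng.normal(size=(n, d))
X_val = rng.normal(size=(n_val, d))
w_true = rng.normal(size=d)
y = (X @ w_true > 0).astype(float)
y_val = (X_val @ w_true > 0).astype(float)

budget = 0.5 * n   # hypothetical budget: expected number of selected examples
lam = 1.0          # penalty weight on budget violation (assumed value)
lr = 0.2

w = np.zeros(d)      # inner variables: model parameters
theta = np.zeros(n)  # outer variables: per-example selection logits

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def val_loss(w):
    q = sigmoid(X_val @ w)
    return -np.mean(y_val * np.log(q + 1e-9) + (1 - y_val) * np.log(1 - q + 1e-9))

loss_before = val_loss(w)
for step in range(300):
    p = sigmoid(theta)                 # P(select example i): reparameterized subset
    err = sigmoid(X @ w) - y           # per-example residuals
    # Inner step: gradient of the selection-weighted training loss.
    w -= lr * (X.T @ (p * err)) / n
    # Outer step (crude first-order approximation of the bilevel hypergradient):
    # an example is useful if its gradient aligns with the validation gradient.
    g_val = X_val.T @ (sigmoid(X_val @ w) - y_val) / n_val
    align = (X * err[:, None]) @ g_val
    # Penalty term activates when the expected selection cost exceeds the budget.
    over = 1.0 if p.sum() > budget else 0.0
    theta -= lr * (-align + lam * over) * p * (1 - p)

p = sigmoid(theta)
loss_after = val_loss(w)
```

The penalty converts the hard budget constraint into a soft term in a single objective, which is the structural trick the relaxation relies on; the actual method replaces the gradient-alignment heuristic above with a principled single-level reformulation of the bilevel problem.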