Add shuffle to dataset split in `_set_and_validate_datasets` in MIPROv2 #2009

Haknt · 2025-01-03T14:42:13Z

Summary

Added `random.shuffle` to `_set_and_validate_datasets` so that the train/validation split is randomized when no validation set is provided.

Notes

Improves the representativeness of the validation set.
Open to feedback on whether this responsibility fits within the method's scope.

okhat · 2025-01-03T14:46:09Z

Thanks @Haknt! I have a few different thoughts. One is: shuffling should ideally use an RNG object to avoid messing with the global random state, but maybe this is not already respected in the current MIPRO implementation? (can't recall how it was done)
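To illustrate the concern about global state, here is a minimal sketch using only Python's standard `random` module. The helper `next_global_draw_after` is hypothetical and is not part of MIPROv2; it only checks whether a shuffle changes what the global generator produces next.

```python
import random


def next_global_draw_after(shuffle_fn):
    """Seed the global RNG, run shuffle_fn on a list, return the next global draw."""
    random.seed(123)
    shuffle_fn(list(range(10)))
    return random.random()


# Reference value: nothing touches the global generator between seed and draw.
baseline = next_global_draw_after(lambda items: None)

# random.shuffle consumes values from the global generator.
with_global = next_global_draw_after(random.shuffle)

# A private random.Random instance leaves the global generator alone.
with_local = next_global_draw_after(random.Random(0).shuffle)

print("global shuffle changed the next draw:", with_global != baseline)
print("local shuffle changed the next draw:", with_local != baseline)
```

Any caller that seeded the global generator for its own reproducibility would see different downstream values after the global shuffle, but not after the local one.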

Another one is that ideally the user pre-shuffles their data. It's a little easier to reason about the current behavior as the first 20% vs last 80% but I do see the value of shuffling "just in case" too, so a bit conflicted.
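As a small illustration of the "just in case" argument, the sketch below builds a hypothetical dataset that happens to be sorted by a `difficulty` field and splits it at a fixed position. The field name, the sizes, and the one-fifth cutoff are assumptions for the example, not the values MIPROv2 uses.

```python
from collections import Counter

# Hypothetical data that arrives sorted: 80 easy examples, then 20 hard ones.
examples = [{"question": f"q{i}", "difficulty": "easy"} for i in range(80)]
examples += [{"question": f"q{i}", "difficulty": "hard"} for i in range(80, 100)]

# Positional split with no shuffling: first fifth vs. the rest.
cutoff = len(examples) // 5
first_slice, rest = examples[:cutoff], examples[cutoff:]

first_counts = Counter(ex["difficulty"] for ex in first_slice)
rest_counts = Counter(ex["difficulty"] for ex in rest)

print(dict(first_counts))
print(dict(rest_counts))
```

Here the first slice contains no hard examples at all, so whichever slice serves as validation does not reflect the full data. A user who pre-shuffles avoids this, which is the trade-off discussed above.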

Haknt · 2025-01-03T15:21:52Z

Thanks for the feedback! (1) You’re absolutely right, using random.shuffle directly does affect the global RNG state, and I overlooked this.

I’ve updated the code to use the existing self.rng for shuffling, ensuring it doesn’t interfere with the global state and aligns with the seed parameter for reproducibility. Now, self.rng.shuffle() is used instead of random.shuffle().
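A minimal sketch of the pattern described here, assuming `self.rng` behaves like a seeded `random.Random` instance. The function `shuffled_split` and its parameters `val_fraction` and `seed` are hypothetical names for illustration; this is not the actual body of `_set_and_validate_datasets`.

```python
import random


def shuffled_split(examples, val_fraction, seed):
    """Shuffle a copy of examples with a private RNG, then split it by position."""
    rng = random.Random(seed)   # stands in for self.rng (assumption)
    shuffled = list(examples)   # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    cutoff = int(len(shuffled) * val_fraction)
    valset, trainset = shuffled[:cutoff], shuffled[cutoff:]
    return trainset, valset


trainset_a, valset_a = shuffled_split(range(40), val_fraction=0.25, seed=9)
trainset_b, valset_b = shuffled_split(range(40), val_fraction=0.25, seed=9)

# Same seed, same split, regardless of what the global generator is doing.
print(valset_a == valset_b)
```

Because the generator is constructed from the seed, two runs with the same seed produce the same split, which is the reproducibility property mentioned above.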

(2) Thinking out loud: in the ideal scenario, the user should provide the validation set explicitly. If they don’t, it’s likely they are either unaware or prefer not to handle it themselves. In both cases, shuffling would be beneficial to ensure a representative split.

Alternatively, we could make the validation set mandatory and throw an error if it’s not provided. This would ensure users are deliberate about their validation strategy.
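For comparison, the stricter alternative could look roughly like the sketch below. The function name `require_explicit_valset` and the error messages are hypothetical and only show the shape of the check, not existing MIPROv2 behavior.

```python
def require_explicit_valset(trainset, valset=None):
    """Reject calls that do not supply both a training set and a validation set."""
    if not trainset:
        raise ValueError("trainset must not be empty")
    if not valset:
        # No silent fallback split: the caller has to choose a validation strategy.
        raise ValueError("valset is required; pass an explicit validation set")
    return trainset, valset


try:
    require_explicit_valset(["ex1", "ex2", "ex3"])
    raised = False
except ValueError as err:
    raised = True
    print(f"rejected: {err}")
```

This trades convenience for explicitness: no automatic split happens, so there is nothing left to shuffle.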

Let me know your thoughts!

Add shuffle to dataset split in _set_and_validate_datasets for better validation set randomness in MIPROv2

436fdb9

Replace random.shuffle with self.rng.shuffle to avoid affecting global random state in MIPROv2

401dc5e

okhat closed this Jan 6, 2025
