Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add shuffle to dataset split in _set_and_validate_datasetsin MIPROv2 #2009

Closed
wants to merge 2 commits into from

Conversation

Haknt
Copy link
Contributor

@Haknt Haknt commented Jan 3, 2025

Summary

Added random.shuffle to _set_and_validate_datasets for better randomness in train/validation split when no validation set is provided.

Notes

  • Improves representativeness of validation set.
  • Open to feedback on whether this responsibility fits within the method's scope.

@okhat
Copy link
Collaborator

okhat commented Jan 3, 2025

Thanks @Haknt ! I have a few different thoughts. One is: shuffling should ideally use an RNG object to avoid messing with the global seed, but maybe this is not already respected in the current MIPRO implementation? (can't recall how it was done)

Another one is that ideally the user pre-shuffles their data. It's a little easier to reason about the current behavior as the first 20% vs last 80% but I do see the value of shuffling "just in case" too, so a bit conflicted.

@Haknt
Copy link
Contributor Author

Haknt commented Jan 3, 2025

Thanks for the feedback! (1) You’re absolutely right, using random.shuffle directly does affect the global RNG state, and I overlooked this.

I’ve updated the code to use the existing self.rng for shuffling, ensuring it doesn’t interfere with the global state and aligns with the seed parameter for reproducibility. Now, self.rng.shuffle() is used instead of random.shuffle().

(2) I'm thinking out loud, In the ideal scenario, the user should provide the validation set explicitly. If they don’t, it’s likely they are either unaware or prefer not to handle it themselves. In both cases, shuffling would be beneficial to ensure a representative split.

Alternatively, we could make the validation set mandatory and throw an error if it’s not provided. This would ensure users are deliberate about their validation strategy.

Let me know your thoughts!

@okhat okhat closed this Jan 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants