[jvm-packages] Fix partition related issue #9491
base: master
Conversation
…sonable cache settings during retry
2. Make XGBoost always repartition the input to the number of workers.
3. Make the XGBoost partition key the group column if a group column exists, and a row hash if no group column exists.
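The partition-key rule described in the commit message above can be sketched as follows. This is a minimal illustration, not the actual jvm-packages code; the names `partitionKey` and `partitionOf` are hypothetical, and the row hash is stood in for by `Arrays.hashCode`:

```java
import java.util.Arrays;

public class PartitionKeySketch {
    /** Hypothetical helper: key on the group column when one exists,
     *  otherwise on a deterministic hash of the row's contents. */
    static int partitionKey(Integer groupId, Object[] row) {
        if (groupId != null) {
            return groupId;            // group column exists: key on the group
        }
        return Arrays.hashCode(row);   // no group column: hash the row content
    }

    /** Map a key to one of numWorkers partitions (non-negative modulo). */
    static int partitionOf(int key, int numWorkers) {
        return Math.floorMod(key, numWorkers);
    }

    public static void main(String[] args) {
        int numWorkers = 4;
        // Two rows sharing group 7 always land on the same partition.
        int p1 = partitionOf(partitionKey(7, null), numWorkers);
        int p2 = partitionOf(partitionKey(7, null), numWorkers);
        System.out.println(p1 == p2); // true
    }
}
```

Keying on the group column is what keeps a query group from being scattered; the row-hash fallback only needs to spread rows evenly when no grouping exists.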
Hi @jinmfeng001, thank you for working on the ranking implementation (and other JVM package issues). I will look into this after 2.0 is released. Please note that we have revised the LTR implementation in 2.0; this doesn't affect the current status of the JVM package, but it brings a couple of to-do items for us to explore and might create conflicts with the changes made in this PR:
@trivialfis Has this been fixed after version 2.0?
Hi @jinmfeng001, recently I've been working on rewriting the jvm package, and most of the work is done. The PR can fix the group issue; please also help review it. https://github.com/dmlc/xgboost/tree/jvm-rewrite
Hi, may I know whether this issue is resolved in the latest version? Thank you.
The PR is in review. BTW, would you like to take a look at it? #10639
Thank you for your response. The PR #10639 is a redesign that takes care of a lot of things, and it does not fix this issue. I think it will take a long time to merge it into master. The issue here is a bug that blocks users from training a correct model. I hope we can review this PR first and merge it into master, so that we can migrate from our self-maintained version to the open-source version. Thank you.
@jinmfeng Could you please share how many query groups and how many workers you are working with? In addition, could you please share the approximate size of each group if possible? Normally, when a query group is split between two workers, we simply sample pairs from the sub-groups and then aggregate the gradients during histogram construction. The number of samples is reduced for the split group, but since we are using pairwise comparison, the relevance degree is still correct. I would love to learn the impact of group splitting in your case.
Hi @trivialfis, you can see each group only has 5 items, so it's easy to end up with only 1 row per group on a worker. I think it's not reasonable to assume the group size is much bigger than the number of workers in order to train a correct model.
Based on my understanding of #6713, basically every group gets split across multiple workers, rather than just some unlucky groups sitting at the boundaries of the workers. Am I correct? If that's the case, then indeed the performance would be disastrous.
Yes, you're right. Every group gets split across multiple workers, and after the shuffle the records of the same group are not stored together.
Thank you for sharing. Then yes, I will sync with @wbo4958 and see what we can do with the latest packages. We definitely need to fix this.
Sure, I will have a PR to fix it.
In the jvm package, there are some partition-related issues.
After this fix, it is guaranteed that data from the same group are in the same partition, and XGBoost is always deterministic whether or not checkpointing is enabled.
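The guarantee claimed above can be sanity-checked with a small sketch (assumed names `partitionFor` and `groupIds`; this is an illustration of the partitioning property, not the PR's actual implementation): because the partition is a pure function of the group id, rows of one group can never be split, and reruns after a retry produce the same layout.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class GroupPartitionCheck {
    /** Deterministic partition assignment: rows sharing a group id always
     *  map to the same partition, independent of input order or retries. */
    static int partitionFor(int groupId, int numPartitions) {
        return Math.floorMod(Integer.hashCode(groupId), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        int[] groupIds = {1, 1, 2, 2, 2, 3}; // one entry per training row

        // Collect the set of partitions each group lands on.
        Map<Integer, Set<Integer>> groupToPartitions = new HashMap<>();
        for (int g : groupIds) {
            groupToPartitions
                .computeIfAbsent(g, k -> new HashSet<>())
                .add(partitionFor(g, numPartitions));
        }
        // Every group occupies exactly one partition.
        for (Set<Integer> parts : groupToPartitions.values()) {
            System.out.println(parts.size() == 1); // true
        }
    }
}
```

Determinism follows from the assignment depending only on the group id and the partition count, not on shuffle order or on whether a checkpoint restart re-reads the data.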