[jvm-packages] Fix partition related issue #9491

Open · wants to merge 2 commits into master

Conversation

jinmfeng001
Contributor

@jinmfeng001 commented Aug 16, 2023

In the JVM package, there are some partition-related issues.

  1. For the ranking case (group column defined), aggByGroupInfo assumes that the input data is partitioned by the group column, i.e. that all rows of the same group live in the same partition. But this assumption is not guaranteed. In our real production use case, we have to repartition the input dataset by the group column and set the partition number to the number of workers to bypass the repartition; otherwise our ranking model performs extremely badly.
  2. When checkpointing is enabled, needDeterministicRepartitioning is set to true and the dataset is repartitioned based on the attached key. But for the ranking case the repartition key should be the group column; otherwise rows of the same group are shuffled to different partitions, which breaks the ranking model's assumption. I don't see any reason to make XGBoost deterministic only when checkpointing is enabled. I think we can make XGBoost always deterministic, and partition the data based on the group column when it is defined and on the row hash when it is not.

After this fix, it is guaranteed that data with the same group end up in the same partition, and XGBoost is always deterministic whether or not checkpointing is enabled.
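
To make the intended rule concrete, here is a minimal Spark/Scala sketch of the described behavior (this is not the code in this PR; the helper name `repartitionForTraining` and its parameters are hypothetical):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, hash}

// Hypothetical helper, only to illustrate the partitioning rule described above.
def repartitionForTraining(df: DataFrame, groupCol: Option[String], numWorkers: Int): DataFrame =
  groupCol match {
    // Ranking: key the shuffle on the group column so every row of a query
    // group lands in the same partition, and the layout is deterministic.
    case Some(g) => df.repartition(numWorkers, col(g))
    // No group column: hash the whole row so the shuffle is still
    // deterministic across runs instead of relying on round-robin.
    case None    => df.repartition(numWorkers, hash(df.columns.map(col): _*))
  }
```
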

@trivialfis changed the title from "Fix partition related issue" to "[jvm-packages] Fix partition related issue" on Aug 16, 2023
@jinmfeng001 force-pushed the fix_partition_related_issue branch 2 times, most recently from 2106c43 to 51322bf on August 16, 2023 09:11
@jinmfeng001 force-pushed the fix_partition_related_issue branch 2 times, most recently from a6cde12 to 780cf7d on August 16, 2023 12:48
2. Make XGBoost always repartition the input to the number of workers.
3. Make the XGBoost partition key the group column if a group column exists, and the row hash otherwise.
@trivialfis
Member

Hi @jinmfeng001 , thank you for working on the ranking implementation (and other JVM package issues). I will look into this after 2.0 is released. Please note that we have revised the LTR implementation in 2.0, which doesn't affect the current status of the JVM package, but it brings a couple to-do items for us to explore and might create conflicts with changes made in this PR:

  • XGBoost might group and sort the samples internally based on query IDs.
  • XGBoost might change the strategy for training LTR models on distributed systems.

@jinmfeng001
Contributor Author

Hi @jinmfeng001 , thank you for working on the ranking implementation (and other JVM package issues). I will look into this after 2.0 is released. Please note that we have revised the LTR implementation in 2.0, which doesn't affect the current status of the JVM package, but it brings a couple to-do items for us to explore and might create conflicts with changes made in this PR:

  • XGBoost might group and sort the samples internally based on query IDs.
  • XGBoost might change the strategy for training LTR models on distributed systems.

@trivialfis Has this been fixed after version 2.0?

@wbo4958
Contributor

wbo4958 commented Jul 16, 2024

Hi @jinmfeng001, recently I've been working on rewriting the JVM package, and most of the work is done. The PR can fix the group issue; please also help review it: https://github.com/dmlc/xgboost/tree/jvm-rewrite

@jinmfeng

Hi @jinmfeng001 , thank you for working on the ranking implementation (and other JVM package issues). I will look into this after 2.0 is released. Please note that we have revised the LTR implementation in 2.0, which doesn't affect the current status of the JVM package, but it brings a couple to-do items for us to explore and might create conflicts with changes made in this PR:

  • XGBoost might group and sort the samples internally based on query IDs.
  • XGBoost might change the strategy for training LTR models on distributed systems.

Hi, may I know whether this issue is resolved in the latest version or not? Thank you.

@wbo4958
Contributor

wbo4958 commented Aug 23, 2024

The PR is in review. BTW, would you like to take a look at it? #10639

@jinmfeng

jinmfeng commented Sep 4, 2024

The PR is in review. BTW, would you like to take a look at it? #10639

Thank you for your response. PR #10639 is a redesign that takes care of a lot of things, and it does not fix this issue; I think it will take a long time to merge it to master. The issue here is a bug that blocks users from training a correct model. I hope we can review this PR first and merge it to master, so that we can migrate from our self-maintained version to the open-source version.

Thank you.

@trivialfis
Member

@jinmfeng Could you please share how many query groups and how many workers you are working with? In addition, could you please share the approximate size of each group, if possible?

Normally, when a query group is split between two workers, we simply sample pairs from the sub-groups and then aggregate the gradients during histogram construction. The number of sampled pairs is reduced for the split group, but since we are using pairwise comparison, the relative relevance degree is still correct. I would love to learn the impact of group splitting in your case.
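
To make the effect of splitting concrete, here is a toy Scala sketch (my own arithmetic, not XGBoost internals) counting how many candidate rank pairs remain when pairs can only be formed inside each worker-local sub-group:

```scala
// Number of unordered pairs in a group of n rows: n choose 2.
def pairs(n: Long): Long = n * (n - 1) / 2

val wholeGroup = pairs(6)                        // group of 6 on one worker: 15 pairs
val splitInTwo = Seq(4L, 2L).map(pairs).sum      // split 4/2 across two workers: 6 + 1 = 7 pairs
val fullySplit = Seq.fill(6)(1L).map(pairs).sum  // one row per worker: 0 pairs, no ranking signal
```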

@trivialfis
Member

Related #6713 . @wbo4958 Could you please help take a look at whether the issue persists?

@jinmfeng

@jinmfeng Could you please share how many query groups and how many workers you are working with? In addition, could you please share the approximate size of each group, if possible?

Normally, when a query group is split between two workers, we simply sample pairs from the sub-groups and then aggregate the gradients during histogram construction. The number of sampled pairs is reduced for the split group, but since we are using pairwise comparison, the relative relevance degree is still correct. I would love to learn the impact of group splitting in your case.

Hi @trivialfis,
we have about 5M query groups, and we train our model on 100 workers; the approximate size of each group is about 5.

As you can see, each group has only about 5 items, so it is easy to end up with only 1 row per group on a worker. I don't think it is reasonable to require group sizes to be much larger than the number of workers in order to train a correct model.
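
As a rough back-of-the-envelope check (my own estimate, assuming rows are hashed independently and uniformly across partitions, which only approximates Spark's shuffle), 5-row groups almost never keep even two rows together across 100 partitions:

```scala
val rowsPerGroup  = 5
val numPartitions = 100

// Expected number of same-partition row pairs per group: C(5, 2) / 100 = 0.1.
val expectedPairs = rowsPerGroup * (rowsPerGroup - 1) / 2.0 / numPartitions

// Probability that all 5 rows of a group land in distinct partitions (~0.90),
// i.e. roughly 90% of groups would contribute no ranking pairs at all.
val allSeparated = (0 until rowsPerGroup)
  .map(i => (numPartitions - i).toDouble / numPartitions)
  .product
```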

@trivialfis
Member

Based on my understanding of #6713, basically every group gets split across multiple workers, rather than just some unlucky groups sitting at the worker boundaries. Am I correct?

If that's the case, then indeed, the performance would be disastrous.

@jinmfeng

Based on my understanding of #6713, basically every group gets split across multiple workers, rather than just some unlucky groups sitting at the worker boundaries. Am I correct?

If that's the case, then indeed, the performance would be disastrous.

Yes, you're right. Every group gets split across multiple workers, and actually, after the shuffle, the records of the same group are not stored together.

@trivialfis
Member

Thank you for sharing. Then yes, I will sync with @wbo4958 and see what we can do with the latest packages. We definitely need to fix this.

@wbo4958
Contributor

wbo4958 commented Nov 25, 2024

Sure, I will open a PR to fix it.

@wbo4958
Contributor

wbo4958 commented Nov 27, 2024

Hi @jinmfeng, could you help review #11023 and verify it in your real-world case? I'd really appreciate it.
