[GSProcessing] Fix ParquetRowCounter bug when different types had same-name features #1140
A few questions here:
- What row count should we expect here? Should the correct count match the original row count?
- Previously all node/edge types shared one row count dict that included every feature's information, but now each type only stores the counts relevant to itself. Will this cause any changes?
graphstorm-processing/graphstorm_processing/graph_loaders/row_count_utils.py
For the high-level questions, @jalencato:
The row counter runs right after gs-processing has finished. It counts the rows of each generated file for every node feature/mask/label, for each edge structure (src, dst) file, and for each edge feature/mask/label. gs-repartition uses these values as input to determine whether re-partitioning is needed for any of the sets of files, since DGL assumes that the row counts for every node/edge type are the same. What was happening here was that some features that shared a name ended up without any row counts, because their key kept getting overwritten. Only the first-encountered key got row_counts; the rest had nothing. The "right" count is whatever the row counts of that node type's feature files actually are, which is the value that was previously missing.
In terms of expected input/output there won't be any changes. This just fixes the case where same-name features were not getting their row counts populated at all.
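To make the failure mode concrete, here is a minimal sketch of the pattern described above. It is not the actual `ParquetRowCounter` code: the function names, the `feature_files` structure, and the example counts are all hypothetical, chosen only to show how keying by feature name alone leaves a second type's same-name feature without row counts, while keying by type and feature name keeps both.

```python
from collections import defaultdict

# Hypothetical input: (node_type, feature_name, per-file row counts).
# Two node types share the feature name "age".
feature_files = [
    ("user", "age", [100, 120]),
    ("movie", "age", [50, 70]),
]


def count_keyed_by_name(entries):
    """Illustrates the bug: a lookup keyed by feature name alone."""
    seen = {}
    result = defaultdict(dict)
    for ntype, feat, counts in entries:
        if feat not in seen:
            seen[feat] = counts
            result[ntype][feat] = counts
        # Otherwise the name was already seen under another type,
        # so this type's feature ends up with no row counts.
    return dict(result)


def count_keyed_by_type(entries):
    """Illustrates the fix: counts are tracked per (type, feature name)."""
    result = defaultdict(dict)
    for ntype, feat, counts in entries:
        result[ntype][feat] = counts
    return dict(result)


buggy = count_keyed_by_name(feature_files)
fixed = count_keyed_by_type(feature_files)
```

In this sketch `buggy` contains counts only for the `user` type, while `fixed` contains counts for both `user` and `movie`. The output shape is otherwise the same in both functions, which matches the point above that expected input/output does not change.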
cfbc313 to fc53df3
fc53df3 to 82d5389
LGTM
Issue #, if available:
Description of changes:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.