
[GSProcessing] Fix ParquetRowCounter bug when different types had same-name features #1140

Merged
merged 1 commit from row-counts-fix into awslabs:main
Jan 24, 2025

Conversation

thvasilo
Contributor

Issue #, if available:

Description of changes:

  • Previously, if some node/edge types shared the same feature names, we would end up overwriting the original feature name dict, because it was shared between types.
  • Now we create a copy of the "data" dict for each type and write the row counts directly to the corresponding type's dictionary, under the "row_counts" key (see the sketch after this list).
  • We also apply the same fix to the graph structure entries (edges), although it's unlikely this issue would occur there.
  • Finally, we used GenAI to generate some test cases for the file, because it was previously untested.
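
Below is a minimal, hypothetical sketch of the shared-dictionary pitfall and the per-type copy fix described above. The names (`base_feature_dict`, `count_rows_for_type`, the metadata layout) are illustrative only, not the actual GSProcessing code:

```python
import copy

def count_rows_for_type(type_name: str) -> list:
    """Stand-in for the real per-file Parquet row counting."""
    return [100, 100]

def attach_row_counts_buggy(metadata: dict, base_feature_dict: dict) -> None:
    # Buggy pattern: every type stores a reference to the *same* dict, so
    # writing "row_counts" for one type clobbers the entry for all others.
    for type_name in metadata:
        metadata[type_name]["data"] = base_feature_dict  # shared reference
        metadata[type_name]["data"]["row_counts"] = count_rows_for_type(type_name)

def attach_row_counts_fixed(metadata: dict, base_feature_dict: dict) -> None:
    # Fixed pattern: each type gets its own copy of the "data" dict, and the
    # row counts are written directly to that type's dictionary.
    for type_name in metadata:
        per_type_data = copy.deepcopy(base_feature_dict)
        per_type_data["row_counts"] = count_rows_for_type(type_name)
        metadata[type_name]["data"] = per_type_data
```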

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@thvasilo thvasilo added the bug, ready, gsprocessing, and 0.4.1 labels Jan 18, 2025
@thvasilo thvasilo added this to the 0.4.1 release milestone Jan 18, 2025
@thvasilo thvasilo requested a review from jalencato January 18, 2025 01:52
@thvasilo thvasilo self-assigned this Jan 18, 2025
Collaborator

@jalencato jalencato left a comment


A few questions here:

  • What row count should we expect here? Should the correct count match the original row count?
  • Previously, all node/edge types shared the same row count dict containing every feature's information, but now each type only stores the entries relevant to itself. Will this bring any changes?

@thvasilo
Contributor Author

thvasilo commented Jan 22, 2025

For the high-level questions, @jalencato:

What row count should we expect here? Should the correct count match the original row count?

The row counter is used right after gs-processing has finished: it counts the rows of each generated file for every node feature/mask/label, for each edge structure (src, dst) file, and for each edge feature/mask/label.

gs-repartition uses these values as input to determine if re-partitioning is needed for any of the sets of files, as DGL assumes that all row counts for every node/edge type are the same.

In this case, what was happening was that some features that shared a name ended up without any row counts, because their key kept getting overwritten. Only the first-encountered key got row_counts; the rest had nothing. The "right" count is whatever that node type's feature files actually contain; before this fix, that value was simply missing.
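
To illustrate the root cause with made-up data (this is not the real metadata layout, just a sketch of the shared-reference problem):

```python
# Two node types reference the same feature entry object, so writing row
# counts for one type silently mutates the entry the other type points to.
shared_entry = {"files": ["age.parquet"]}

node_data = {
    "author": {"age": shared_entry},
    "paper": {"age": shared_entry},  # same object, not a copy
}

node_data["author"]["age"]["row_counts"] = [100]
node_data["paper"]["age"]["row_counts"] = [50]

# Both types now report whatever was written last; the per-type counts for
# "author" are lost.
print(node_data["author"]["age"]["row_counts"])  # [50]
```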

Previously, all node/edge types shared the same row count dict containing every feature's information, but now each type only stores the entries relevant to itself. Will this bring any changes?

In terms of expected input/output there won't be any changes; this just fixes the case where same-name features were not getting their row counts populated at all.

@thvasilo thvasilo force-pushed the row-counts-fix branch 3 times, most recently from cfbc313 to fc53df3 Compare January 22, 2025 23:23
Collaborator

@jalencato jalencato left a comment


LGTM

@thvasilo thvasilo merged commit 528de62 into awslabs:main Jan 24, 2025
3 checks passed