
[GSProcessing] Fix ParquetRowCounter bug when different types had same-name features #1140

Merged
merged 1 commit from row-counts-fix into awslabs:main
Jan 24, 2025

Conversation

thvasilo
Contributor

Issue #, if available:

Description of changes:

  • Previously, if some node/edge types shared the same feature names, we would end up overwriting the original feature name dict, because it was shared between types.
  • Now we create a copy of the "data" dict for each type and write the row counts directly to the corresponding type's dictionary, under the "row_counts" key (see the sketch after this list).
  • We also apply the same fix to the graph structure entries (edges), although it's unlikely this issue would occur there.
  • Finally, we used GenAI to generate some test cases for the file, because it was previously untested.
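
Below is a minimal, hypothetical sketch of the shared-dictionary pitfall and the per-type copy fix described above. The names (`base_feature_dict`, `count_rows_for_type`, the metadata layout) are illustrative only, not the actual GSProcessing code:

```python
import copy

def count_rows_for_type(type_name: str) -> list:
    """Stand-in for the real per-file Parquet row counting."""
    return [100, 100]

def attach_row_counts_buggy(metadata: dict, base_feature_dict: dict) -> None:
    # Buggy pattern: every type stores a reference to the *same* dict, so
    # writing "row_counts" for one type clobbers the entry for all others.
    for type_name in metadata:
        metadata[type_name]["data"] = base_feature_dict  # shared reference
        metadata[type_name]["data"]["row_counts"] = count_rows_for_type(type_name)

def attach_row_counts_fixed(metadata: dict, base_feature_dict: dict) -> None:
    # Fixed pattern: each type gets its own copy of the "data" dict, and the
    # row counts are written directly to that type's dictionary.
    for type_name in metadata:
        per_type_data = copy.deepcopy(base_feature_dict)
        per_type_data["row_counts"] = count_rows_for_type(type_name)
        metadata[type_name]["data"] = per_type_data
```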

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@thvasilo thvasilo added the bug, ready, gsprocessing, and 0.4.1 labels Jan 18, 2025
@thvasilo thvasilo added this to the 0.4.1 release milestone Jan 18, 2025
@thvasilo thvasilo requested a review from jalencato January 18, 2025 01:52
@thvasilo thvasilo self-assigned this Jan 18, 2025
Collaborator

@jalencato jalencato left a comment


A few questions here:

  • What row count should we expect here? Should the correct count match the original row count?
  • Previously, all node/edge types shared the same row count dict containing every feature's information, but now each type only stores the entries relevant to itself. Will this bring any changes?

@thvasilo
Contributor Author

thvasilo commented Jan 22, 2025

For the high-level questions, @jalencato:

What row count should we expect here? Should the correct count match the original row count?

The row counter is used right after gs-processing has finished: it counts the rows of each generated file for every node feature/mask/label, for each edge structure (src, dst) file, and for each edge feature/mask/label.

gs-repartition uses these values as input to determine if re-partitioning is needed for any of the sets of files, as DGL assumes that all row counts for every node/edge type are the same.

In this case, what was happening was that some features that shared a name ended up without any row counts, because their key kept getting overwritten. Only the first-encountered key got row_counts; the rest had nothing. The "right" count is whatever that node type's feature files actually contain; before this fix, that value was simply missing.
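
To illustrate the root cause with made-up data (this is not the real metadata layout, just a sketch of the shared-reference problem):

```python
# Two node types reference the same feature entry object, so writing row
# counts for one type silently mutates the entry the other type points to.
shared_entry = {"files": ["age.parquet"]}

node_data = {
    "author": {"age": shared_entry},
    "paper": {"age": shared_entry},  # same object, not a copy
}

node_data["author"]["age"]["row_counts"] = [100]
node_data["paper"]["age"]["row_counts"] = [50]

# Both types now report whatever was written last; the per-type counts for
# "author" are lost.
print(node_data["author"]["age"]["row_counts"])  # [50]
```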

Previously, all node/edge types shared the same row count dict containing every feature's information, but now each type only stores the entries relevant to itself. Will this bring any changes?

In terms of expected input/output there won't be any changes; this just fixes the case where same-name features were not getting their row counts populated at all.

@thvasilo thvasilo force-pushed the row-counts-fix branch 3 times, most recently from cfbc313 to fc53df3 Compare January 22, 2025 23:23
Collaborator

@jalencato jalencato left a comment


LGTM

@thvasilo thvasilo merged commit 528de62 into awslabs:main Jan 24, 2025
3 checks passed