Confusion Between `fill_na` and `na_value` in Dataset Configuration #144

rsliu94 · 2025-02-24T15:30:16Z

I noticed that the na_value parameter, which appears in the dataset_config documentation (link), is not actually utilized in the code. Instead, both the feature preprocessing (link) and tokenization process (link) still use fill_na to handle missing values.

This inconsistency can cause unexpected behavior. If I create a dataset_config file following the documentation and use na_value to pass missing values, the vocabulary size (for bucketized integer features) will differ—resulting in an extra token (null string) compared to using fill_na as seen in the BARS repo config files. This discrepancy affects the final model performance.
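To make the vocabulary effect concrete, here is a minimal, self-contained sketch. It does not call FuxiCTR: `build_vocab`, the feature name `I1`, and the sample column are hypothetical. The only assumption is that an unrecognized `na_value` key is silently ignored, so missing values fall back to an empty-string default.

```python
# Toy illustration only: this is NOT FuxiCTR code.
# Assumption: preprocessing reads `fill_na`, ignores `na_value`,
# and falls back to "" when `fill_na` is absent.

def build_vocab(values, feature_col, default_fill=""):
    """Fill missing entries, then collect the distinct tokens as strings."""
    fill = feature_col.get("fill_na", default_fill)  # `na_value` is never read
    filled = [fill if v is None else v for v in values]
    return sorted(set(str(v) for v in filled))

column = [3, 1, None, 3, 0]  # one missing value

vocab_fill_na = build_vocab(column, {"name": "I1", "fill_na": 0})
vocab_na_value = build_vocab(column, {"name": "I1", "na_value": 0})

print(vocab_fill_na)   # missing value merged into the existing "0" token
print(vocab_na_value)  # missing value became its own "" token
print(len(vocab_na_value) - len(vocab_fill_na))
```

Under these assumptions the second vocabulary contains one extra entry (the null string), which is the size difference described above.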

I see two possible solutions:

1. Update the documentation and clarify that fill_na should be used instead of na_value.
2. Modify the code to support na_value.
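For the second option, one possible shape is a small lookup that prefers `fill_na` and treats `na_value` as an alias. This is only a sketch of the idea: `resolve_fill_value` is a hypothetical helper, not an existing FuxiCTR function, and the empty-string default is an assumption.

```python
import warnings

def resolve_fill_value(feature_col, default=""):
    """Return the fill value, preferring `fill_na` over `na_value`.

    Hypothetical helper: `feature_col` is one feature entry from a
    dataset config, represented here as a plain dict.
    """
    if "fill_na" in feature_col:
        return feature_col["fill_na"]
    if "na_value" in feature_col:
        # Accept the documented key, but tell the user about the alias.
        warnings.warn("`na_value` is treated as an alias of `fill_na`",
                      FutureWarning)
        return feature_col["na_value"]
    return default

print(resolve_fill_value({"name": "I1", "fill_na": 0}))
print(resolve_fill_value({"name": "I1", "na_value": 0}))
print(repr(resolve_fill_value({"name": "C1"})))
```

With a fallback like this, configs written from the documentation and configs written with `fill_na` would resolve to the same fill value.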

Would love to hear the maintainers' thoughts on this! Was na_value intended to replace fill_na at some point? Should we align the documentation or the code itself? Happy to contribute a PR if needed.

zhujiem · 2025-02-26T06:54:11Z

This happens because the new version has been updated to use `fill_na`. However, we cannot update all BARS benchmarks to the new version because the overhead is too large, so we keep the version fixed for each benchmark so that the results can be reproduced. If you run old configurations with a new FuxiCTR version, issues like yours may occur. The suggestion is to update the configurations accordingly.
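As a rough illustration of updating a configuration accordingly, the sketch below renames `na_value` to `fill_na` in a config that has already been loaded into a dict. It assumes the features are listed under a `feature_cols` key; `rename_na_value` and the sample entries are hypothetical and not part of FuxiCTR.

```python
import copy

def rename_na_value(dataset_config):
    """Return a copy of the config with `na_value` renamed to `fill_na`.

    Assumption: features live in a list of dicts under `feature_cols`.
    An existing `fill_na` is never overwritten.
    """
    updated = copy.deepcopy(dataset_config)
    for col in updated.get("feature_cols", []):
        if "na_value" in col and "fill_na" not in col:
            col["fill_na"] = col.pop("na_value")
    return updated

old_config = {
    "feature_cols": [
        {"name": "I1", "dtype": "float", "na_value": 0},
        {"name": "C1", "dtype": "str", "fill_na": ""},
    ]
}
new_config = rename_na_value(old_config)
print(new_config["feature_cols"])
```

The input dict is left untouched, so the old and new configs can be compared before the change is written back to a file.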

zhujiem · 2025-02-26T06:55:09Z

You are very welcome to help update the docs for the new version.

rsliu94 · 2025-02-26T11:11:43Z

If fill_na is the correct parameter in the new version, updating na_value in the documentation to fill_na should resolve the misalignment. I'm happy to contribute a PR for this.

rsliu94 · 2025-02-26T11:28:38Z

tiny PR: #145

zhujiem closed this as completed Feb 26, 2025

rsliu94 mentioned this issue Feb 26, 2025

Update the docs to support new version (na_value -> fill_na) #145

Open
