Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Confusion Between fill_na and na_value in Dataset Configuration #144

Closed
rsliu94 opened this issue Feb 24, 2025 · 4 comments
Closed

Confusion Between fill_na and na_value in Dataset Configuration #144

rsliu94 opened this issue Feb 24, 2025 · 4 comments

Comments

@rsliu94
Copy link

rsliu94 commented Feb 24, 2025

I noticed that the na_value parameter, which appears in the dataset_config documentation (link), is not actually utilized in the code. Instead, both the feature preprocessing (link) and tokenization process (link) still use fill_na to handle missing values.

This inconsistency can cause unexpected behavior. If I create a dataset_config file following the documentation and use na_value to pass missing values, the vocabulary size (for bucketized integer features) will differ—resulting in an extra token (null string) compared to using fill_na as seen in the BARS repo config files. This discrepancy affects the final model performance.

Image
Image

I see two possible solutions:

  • Update the documentation and clarify that fill_na should be used instead of na_value.
  • Modify the code to support na_value.

Would love to hear the maintainers' thoughts on this! Was na_value intended to replace fill_na at some point? Should we align the documentation or the code itself? Happy to contribute a PR if needed.

@zhujiem
Copy link
Contributor

zhujiem commented Feb 26, 2025

This is caused since the new version has been updated to use ``fill_na". However, we cannot update all BARS benchmarks to new versions since the overhead is too large. So we keep the version fixed at each benchmark so that the results can be reproduced. If you use new FuxiCTR version to run old configurations, issues may happen like yours. The suggestion is to update configurations accordingly.

@zhujiem zhujiem closed this as completed Feb 26, 2025
@zhujiem
Copy link
Contributor

zhujiem commented Feb 26, 2025

It is very welcome to help update the docs to new versions.

@rsliu94
Copy link
Author

rsliu94 commented Feb 26, 2025

If fill_na is the correct parameter in the new version, updating na_value in the documentation to fill_na should resolve the misalignment. I'm happy to contribute a PR for this.

@rsliu94
Copy link
Author

rsliu94 commented Feb 26, 2025

tiny PR: #145

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants