Confusion Between fill_na and na_value in Dataset Configuration
#144
Comments
This happens because the new version has been updated to use "fill_na". However, we cannot update all BARS benchmarks to the new version because the overhead is too large, so we keep the FuxiCTR version fixed for each benchmark so that the results can be reproduced. If you run old configurations with a new FuxiCTR version, issues like yours may occur. The suggestion is to update the configurations accordingly.
Help updating the docs to the new version is very welcome.
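For anyone updating a configuration by hand, the rename suggested above could be scripted roughly as follows. This is only a sketch, not part of FuxiCTR: the helper name rename_na_value is hypothetical, and it assumes the dataset config has already been loaded into a dict whose feature_cols entry is a list of per-column dicts.

```python
import copy


def rename_na_value(dataset_config):
    """Return a copy of the config with each column's na_value renamed to fill_na.

    Hypothetical helper: assumes feature columns live under "feature_cols"
    as a list of dicts, which may not match every config layout.
    """
    updated = copy.deepcopy(dataset_config)
    for col in updated.get("feature_cols", []):
        if "na_value" in col:
            value = col.pop("na_value")
            # Keep an explicit fill_na if the column already defines one.
            col.setdefault("fill_na", value)
    return updated


# Illustrative config; the column names and values are made up.
old_config = {
    "feature_cols": [
        {"name": "I1", "dtype": "float", "type": "categorical", "na_value": 0},
        {"name": "C1", "dtype": "str", "type": "categorical", "fill_na": ""},
    ]
}
new_config = rename_na_value(old_config)
print(new_config["feature_cols"][0])
```

The input dict is left untouched, so the old and new configs can be compared before anything is written back to disk.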
tiny PR: #145
I noticed that the na_value parameter, which appears in the dataset_config documentation (link), is not actually utilized in the code. Instead, both the feature preprocessing (link) and tokenization process (link) still use fill_na to handle missing values.

This inconsistency can cause unexpected behavior. If I create a dataset_config file following the documentation and use na_value to pass missing values, the vocabulary size (for bucketized integer features) will differ, resulting in an extra token (null string) compared to using fill_na as seen in the BARS repo config files. This discrepancy affects the final model performance.

I see two possible solutions:
- Update the documentation to state that fill_na should be used instead of na_value.
- Update the code to support na_value.

Would love to hear the maintainers' thoughts on this! Was na_value intended to replace fill_na at some point? Should we align the documentation or the code itself? Happy to contribute a PR if needed.
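The extra-token effect described in this issue can be reproduced with a toy example. This is a sketch under stated assumptions, not FuxiCTR's actual preprocessing: build_vocab is a hypothetical function that reads only fill_na from a column config, ignores na_value, and falls back to an empty string for missing values.

```python
def build_vocab(values, col_config):
    """Toy vocabulary builder (hypothetical, not FuxiCTR code).

    Only fill_na is read; an na_value key is silently ignored, and missing
    values then fall back to the empty string.
    """
    fill = col_config.get("fill_na", "")
    filled = [fill if v is None else v for v in values]
    tokens = sorted({str(v) for v in filled})
    return {tok: idx for idx, tok in enumerate(tokens)}


# Made-up column data with one missing entry.
raw = [1, 2, None, 2, 1]

vocab_fill = build_vocab(raw, {"name": "I1", "fill_na": 1})
vocab_na = build_vocab(raw, {"name": "I1", "na_value": 1})

print(len(vocab_fill), len(vocab_na))  # prints "2 3"
print("" in vocab_na)                  # prints "True"
```

With fill_na the missing entry is mapped onto an existing value, while the ignored na_value leaves an empty-string token in the vocabulary, which is the size difference reported here.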