Trouble training/sampling on data with high-cardinality categorical features #23

Open
reed-peterson-947 opened this issue Aug 22, 2023 · 2 comments

Comments

@reed-peterson-947

I've had success training and generating data with this package on a variety of datasets, but I've noticed that when a very high-cardinality feature is present, the package fails with a very uninformative error message: "Killed", and nothing else. As soon as I remove the high-cardinality feature, it runs fine. By high-cardinality I mean on the order of tens of thousands of unique values in a single column. I'm not sure how to debug this or where to start, given how uninformative the error message is. The last line of code that seems to execute before the process gets killed is line 579 in lib/data.py. Any ideas? Has anyone else run into this?
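
In case it helps others hitting this, a quick way to spot which column is the culprit is to list per-column cardinalities before training (a small sketch assuming the data is loaded with pandas; the file path is hypothetical):

```python
import pandas as pd

# Load the training data (path is hypothetical).
df = pd.read_csv("train.csv")

# Number of unique values per categorical column, highest first.
# Columns with tens of thousands of unique values are the likely offenders.
cardinalities = (
    df.select_dtypes(include=["object", "category"])
    .nunique()
    .sort_values(ascending=False)
)
print(cardinalities)
```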

@rotot0
Collaborator

rotot0 commented Oct 3, 2023

Hello,

I am not sure, but maybe you are running out of RAM because of the OneHotEncoder combined with the high cardinality of the features.
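
For a rough sense of scale (a back-of-the-envelope sketch, not the package's actual code path; the row and category counts below are hypothetical): a dense float64 one-hot matrix needs about rows × categories × 8 bytes, which reaches hundreds of GiB very quickly, and the kernel OOM killer then terminates the process with exactly the bare "Killed" message described above.

```python
# Back-of-the-envelope memory estimate for a dense float64 one-hot encoding.
# Row and category counts are hypothetical.
n_rows = 1_000_000
n_categories = 50_000

bytes_needed = n_rows * n_categories * 8      # float64 = 8 bytes per cell
print(f"~{bytes_needed / 2**30:.0f} GiB")     # ~373 GiB, far beyond typical RAM
```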

@paulduf

paulduf commented Oct 18, 2023

Moreover, even such a sophisticated model won't work magic on a categorical feature with that many modalities ... unless you have millions of rows, and even then I'd bet many modalities will be unrepresented in the synthetic data.
So you could try pre-processing this column using domain knowledge? See the sketch below.
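
One common way to do that is to collapse rare modalities into a single "other" bucket before training, which caps the cardinality. A minimal sketch with pandas (the column name and count threshold are hypothetical; pick them from your domain knowledge):

```python
import pandas as pd

def cap_cardinality(df: pd.DataFrame, col: str, min_count: int = 50) -> pd.DataFrame:
    """Replace categories occurring fewer than min_count times with 'other'."""
    counts = df[col].value_counts()
    rare = counts[counts < min_count].index
    df = df.copy()
    df.loc[df[col].isin(rare), col] = "other"
    return df

# Hypothetical usage on a hypothetical high-cardinality column.
df = cap_cardinality(df, "merchant_id", min_count=50)
print(df["merchant_id"].nunique())  # cardinality is now bounded
```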
