Trouble training/sampling on data with high-cardinality categorical features #23

Open
reed-peterson-947 opened this issue Aug 22, 2023 · 2 comments

Comments

@reed-peterson-947

I've had success training and generating data with this package on a variety of datasets, but I've noticed that when a very high-cardinality feature is present, the package fails with a very uninformative error message: "Killed", and nothing else. As soon as I remove the high-cardinality feature, it runs fine. By high-cardinality I mean on the order of tens of thousands of unique values in a single column. I'm not sure how to debug this or where to start, given how uninformative the error message is. The last line of code that seems to execute before the process gets killed is line 579 in lib/data.py. Any ideas? Has anyone else run into this?
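
In case it helps others hitting this, a quick way to spot which column is the culprit is to list per-column cardinalities before training (a small sketch assuming the data is loaded with pandas; the file path is hypothetical):

```python
import pandas as pd

# Load the training data (path is hypothetical).
df = pd.read_csv("train.csv")

# Number of unique values per categorical column, highest first.
# Columns with tens of thousands of unique values are the likely offenders.
cardinalities = (
    df.select_dtypes(include=["object", "category"])
    .nunique()
    .sort_values(ascending=False)
)
print(cardinalities)
```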

@rotot0
Collaborator

rotot0 commented Oct 3, 2023

Hello,

I am not sure, but maybe you are running out of RAM because of the OneHotEncoder combined with the high cardinality of the features.
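
For a rough sense of scale (a back-of-the-envelope sketch, not the package's actual code path; the row and category counts below are hypothetical): a dense float64 one-hot matrix needs about rows × categories × 8 bytes, which reaches hundreds of GiB very quickly, and the kernel OOM killer then terminates the process with exactly the bare "Killed" message described above.

```python
# Back-of-the-envelope memory estimate for a dense float64 one-hot encoding.
# Row and category counts are hypothetical.
n_rows = 1_000_000
n_categories = 50_000

bytes_needed = n_rows * n_categories * 8      # float64 = 8 bytes per cell
print(f"~{bytes_needed / 2**30:.0f} GiB")     # ~373 GiB, far beyond typical RAM
```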

@paulduf

paulduf commented Oct 18, 2023

Moreover, even such a sophisticated model won't work magic on a categorical feature with that many modalities ... unless you have millions of rows, and even then I'd bet many modalities will be unrepresented in the synthetic data.
So you could try pre-processing this column using domain knowledge? See the sketch below.
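
One common way to do that is to collapse rare modalities into a single "other" bucket before training, which caps the cardinality. A minimal sketch with pandas (the column name and count threshold are hypothetical; pick them from your domain knowledge):

```python
import pandas as pd

def cap_cardinality(df: pd.DataFrame, col: str, min_count: int = 50) -> pd.DataFrame:
    """Replace categories occurring fewer than min_count times with 'other'."""
    counts = df[col].value_counts()
    rare = counts[counts < min_count].index
    df = df.copy()
    df.loc[df[col].isin(rare), col] = "other"
    return df

# Hypothetical usage on a hypothetical high-cardinality column.
df = cap_cardinality(df, "merchant_id", min_count=50)
print(df["merchant_id"].nunique())  # cardinality is now bounded
```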
