Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a "drop_similar" argument to the TableVectorizer #1249

Open
GaelVaroquaux opened this issue Feb 26, 2025 · 6 comments
Open

Add a "drop_similar" argument to the TableVectorizer #1249

GaelVaroquaux opened this issue Feb 26, 2025 · 6 comments

Comments

@GaelVaroquaux
Copy link
Member

It would be great to make it really easy drop the redundant columns in the TableVectorizer (Using the DropSimilar transformer inside the TableVectorizer), by a simple additional argument.

This would have improvements both is speed/memory and maybe in statistical performance.

I realize that it makes the TableVectorizer even more a swiss-army knife than it currently is, but honestly, it's sooo useful and we use it everywhere, even as an element in complex pipelines.

@Neilblaze
Copy link

@GaelVaroquaux is this up for grabs? If yes, I'd like to contribute! :)
I have experience with skrub, but this will be my first time contributing to the project itself, so I may ping you in this issue thread itself if I hit any roadblocks. Thanks!

@Neilblaze
Copy link

@GaelVaroquaux does this change look good? Tests are passing although I can see UserWarning warnings.

@GaelVaroquaux
Copy link
Member Author

Hi @Neilblaze ,

This is still up for grabs, and the change that you suggest are totally going in the right direction.

However, unless I am wrong, we don't have the DropSimilar transformer yet. It is issue #1001 , the parent issue. That issue would need to be tackled first, and I suggest that we go through the whole process of tackling it and merging it before moving on with this one. Understanding the process will then make you more productive.

Thanks!!

@Neilblaze
Copy link

@GaelVaroquaux Makes sense and thanks for the update! I'll keep track of #1001 for now, and hopefully when it gets merged into the main, we can continue with this.

@GaelVaroquaux
Copy link
Member Author

There is no PR for #1001 for now. I'm not sure when we'll get it done. We'd love to, but we're shuffling so many things 😁

@Neilblaze
Copy link

Yep, I saw that. So for the meantime, I'll try to grab some other open issues. Will ping here/on discord if I get stuck anywhere, and thanks! 😃

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants