Add Baumgartner-Weiss-Schindler test to sc.tl.rank_genes_groups() #3503
Comments
Happy to hear it! FYI: you merged JakeLehle#1 into your own fork of scanpy. That way the changes won’t make it into the package!
Do you have a paper to read about that application of it?
Hi, yes, sorry for the confusion. I'm making changes to my "1.11.1" locally and running a custom build of scanpy on my server before I open a pull request with you guys. I merged the branch locally and put the #"issue number" on my merges to match the style I saw you guys using for updates, so that it would be easy to make a pull request later. Haha, I didn't know that would close out this issue, funny. After I get the code working right I can re-fork the repo and make a clean swap of my changes before a pull request.

Anywho, here is a paper where they use BWS to find DEGs in microarray data, and it does include a comparison to Wilcoxon and the t-test: https://pubmed.ncbi.nlm.nih.gov/15284098/

That team claims the BWS test does better, but I think that should come with the caveat that it should only be applied to genes where you know the majority of cells aren't expressing the gene, so the valuable information in the distribution of the read counts will be seen in the tails. This is something I've hit in the past with my own analysis. We use rank_genes_groups to get marker genes, which are often super highly expressed, with most of the cells in the cluster being pushed by that gene as an eigenvector on the UMAP, and then we use that to get a good idea about cell types. But after that, we wanna move on to our team's own favorite gene families for pathway analysis or custom analysis, and often these aren't expressed highly, so the tails hold all the data: the majority of the cells have 0 reads and the gene has largely dropped out of the dataset. I don't think Wilcoxon handles these cases well, and thus these genes kinda hide in the data, so I wanted a test that really prioritized the little genes and focused on the tails of gene distributions.

I'll attach the changes that I have made to _rank_genes_groups.py below. So far the code looks like it's working, but it's slow and cumbersome. For each gene comparison in the for loop it's chewing up 100GB of RAM, and it's gonna have to do that 20K times! I'm trying to build off what you have already set up for the Wilcoxon test and just use BWS to come up with new scores and pvals. Let me know if I'm way off the mark or if I need to play around with writing out the math into the method function to increase the processing speed.
I think I got this figured out. I'm gonna run some tests and then submit a pull request probably tomorrow.
What kind of feature would you like to request?
Additional function parameters / changed functionality / changed defaults?
Please describe your wishes
Hello,
Long-time user, first-time complainer. Love the package. It quite literally has changed the way I do science.
So I have a request for an enhancement for the sc.tl.rank_genes_groups() function. I'm curious why there isn't an option to select a Baumgartner-Weiss-Schindler test when research groups are interested in ranking genes that are more highly variable and could be subject to drop-out in a dataset and thus would have heavy tails in their distributions. I recently encountered this issue while working on a family of genes that I'm interested in but which are also expressed at lower values and so many of the cells have a read count of 0. I got some interesting results and was thinking about the data but I was feeling cautious about how to interpret the results from the Wilcoxon comparison when I compare my groups.
I was thinking about using some autoencoder deep learning to impute the dropout values in the genes with scanpy.external.pp.dca() and then seeing how much my sc.get.rank_genes_groups_df() changes but I would also like to compare those results to the outputs from a SciPy bws_test() on just the raw data and nothing super processed.
I'm sure other people have hit this issue and, because those statistical functions exist, have made a weird hacky way to compute them for their own sanity check, but I figured it might be interesting to put this up here and see if this is something the community would like to see incorporated, just to streamline this kind of analysis.
Thanks,
Jake Lehle