Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

I digged into the code to see why I cannot reproduce stats::wilcox.test() result #29

Open
mvfki opened this issue Feb 21, 2024 · 0 comments

Comments

@mvfki
Copy link

mvfki commented Feb 21, 2024

Hi team,

Thanks for the incredibly fast implementation. I have been trying to clone the repo and learn from the code. When I was trying to extend it from a default two-sided hypothesis to a one-sided hypothesis, I found something potentially mistaken.

For the hypothesis extension, it's rather simple by switching a correction value and pnorm() call in compute_pval(), referring to R's implementation. However, when I'm testing it with a sparse matrix where one of the features/genes is all zero, I found the result pretty odd -- It got a <1 p-value! So I digged into the code and suspected the ties returned from Rcpp side lacks the counting on zeros, which I believe is intentional for sparse data. I can see that the ties on zeros are easily addressed when calculating the ustats, but I never found it even mentioned when you calculate p-values. With suspecting that this is the cause, I modified my the Rcpp code in my copy:

src/fast_wilcox.cpp function cpp_rank_matrix_dgc

std::vector<std::list<float> > cpp_rank_matrix_dgc(
        arma::vec& x, const arma::vec& p, int nrow, int ncol) {
    vector<list<float> > ties(ncol);
    int n_zero;
    for (int i = 0; i < ncol; i++) {
        n_zero = nrow - (p[i+1] - p[i]);
        if (p[i+1] == p[i]) {
            ties[i].push_back(n_zero);
            continue;
        }
        ties[i] = cpp_in_place_rank_mean(x, p[i], p[i + 1] - 1);
        ties[i].push_back(n_zero);
        x.rows(p[i], p[i + 1] - 1) += n_zero;
    }
    return ties;
}

and also in function cpp_in_place_rank_mean, it seems the lastly counted ties are also not pushed back as well.

// code originally between line 125-143
    for (i = 1U; i < v_sort.size(); i++) {
        if (v_sort[i].first != v_sort[i - 1].first) {
            // .....
        } else {
            // .....
        }
    }
    // Things look good until now but I guess we need to add the following line 
    // right after the loop ends
    if (n > 1) ties.push_back(n);

It's very easy to test with something like

vec <- rep(0:2, 5)
y <- factor(rep(c(1,2), c(7,8)))
sparse <- as(t(vec), "CsparseMatrix")
presto::wilcoxauc(sparse, y)
   feature group   avgExpr      logFC statistic       auc      pval      padj   pct_in  pct_out
1 Feature1     1 0.8571429 -0.2678571        23 0.4107143 0.5888988 0.5888988 57.14286 75.00000
2 Feature1     2 1.1250000  0.2678571        33 0.5892857 0.5888988 0.5888988 75.00000 57.14286

wilcox.test(vec[1:7], vec[8:15])$p.value
[1] 0.581541

And if I apply the modification I described above, we can then have

wilcoxauc(sparse, y)
   feature group   avgExpr      logFC statistic       auc     pval     padj   pct_in  pct_out
1 Feature1     1 0.8571429 -0.2678571        23 0.4107143 0.581541 0.581541 57.14286 75.00000
2 Feature1     2 1.1250000  0.2678571        33 0.5892857 0.581541 0.581541 75.00000 57.14286

OK, all of these are based on that I want the function to be able to replicate what stats::wilcox.test() can produce. If it was on purpose to ignore the ties on zeros and the last happening ties when calculating p-values, I would really love to learn about the reason. But overall I appreciate it so much that you have made this package!

-Yichen

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant