Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unjustified NA output? #98

Open
apoliakov opened this issue Jan 21, 2019 · 2 comments
Open

Unjustified NA output? #98

apoliakov opened this issue Jan 21, 2019 · 2 comments
Assignees

Comments

@apoliakov
Copy link

Hello and thanks for the work!

We've been looking at various association tools. Based on the code, it looks like PLINK2 computes the logistic regression model using a Cholesky decomposition method? If that is the case, we were wondering how potential numerical stability issues are handled.

Here is an example where PLINK2 appears to return NA. However if we load the same data into R and compute a logistic model using glm() we successfully get a result with a significant p-value.

The example data files are attached. The example is run from ~/plink_example

1. This is the PLINK version we used.

$ ~/plinks/plink2 --version
PLINK v2.00a2LM AVX2 Intel (2 Jan 2019)

2. Run the example in PLINK

Observe plink returns NA. There doesn't appear to be a warning

$ ~/plinks/plink2 --pfile ~/plink_example/plink_example --pheno ~/plink_example/pheno.txt --pheno-col-nums 3 --1 --covar-col-nums 4,5,6,7,8,9 --maf 0 --logistic --out ~/plink_example/out
PLINK v2.00a2LM AVX2 Intel (2 Jan 2019)        www.cog-genomics.org/plink/2.0/
(C) 2005-2018 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/scidb/plink_example/out.log.
Options in effect:
  --1
  --covar-col-nums 4,5,6,7,8,9
  --glm
  --maf 0
  --out /home/scidb/plink_example/out
  --pfile /home/scidb/plink_example/plink_example
  --pheno /home/scidb/plink_example/pheno.txt
  --pheno-col-nums 3

Start time: Sun Jan 20 21:11:37 2019
62959 MiB RAM detected; reserving 31479 MiB for main workspace.
Using up to 16 threads (change this with --threads).
20000 samples (10061 females, 9939 males; 20000 founders) loaded from
/home/scidb/plink_example/plink_example.psam.
1 variant loaded from /home/scidb/plink_example/plink_example.pvar.
1 binary phenotype loaded (2257 cases, 17743 controls).
6 covariates loaded from /home/scidb/plink_example/pheno.txt.
Calculating allele frequencies... done.
--glm logistic regression on phenotype 'PHENO1': done.
Results written to /home/scidb/plink_example/out.PHENO1.glm.logistic .
End time: Sun Jan 20 21:11:37 2019

$ cat out.PHENO1.glm.logistic
#CHROM	POS	ID	REF	ALT	A1	TEST	OBS_CT	OR	SE	Z_STAT	P
1	1	rs8675309	A	G	G	ADD	20000	NA	NA	NA	NA
1	1	rs8675309	A	G	G	COVAR1	20000	NA	NA	NA	NA
1	1	rs8675309	A	G	G	COVAR2	20000	NA	NA	NA	NA
1	1	rs8675309	A	G	G	COVAR3	20000	NA	NA	NA	NA
1	1	rs8675309	A	G	G	COVAR4	20000	NA	NA	NA	NA
1	1	rs8675309	A	G	G	COVAR5	20000	NA	NA	NA	NA
1	1	rs8675309	A	G	G	COVAR6	20000	NA	NA	NA	NA

3. Use PLINK export to convert to a text file

$ ~/plinks/plink2 --pfile ~/plink_example/plink_example --export A
PLINK v2.00a2LM AVX2 Intel (2 Jan 2019)        www.cog-genomics.org/plink/2.0/
(C) 2005-2018 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink2.log.
Options in effect:
  --export A
  --pfile /home/scidb/plink_example/plink_example

Start time: Sun Jan 20 21:12:08 2019
62959 MiB RAM detected; reserving 31479 MiB for main workspace.
Using up to 16 threads (change this with --threads).
20000 samples (10061 females, 9939 males; 20000 founders) loaded from
/home/scidb/plink_example/plink_example.psam.
1 variant loaded from /home/scidb/plink_example/plink_example.pvar.
Note: No phenotype data present.
--export A pass 1/1: writing... done.
--export A: plink2.raw written.
End time: Sun Jan 20 21:12:08 2019

4. Run that text file through R using standard glm()

For this case, R computes successfully and returns a significant p-value of 2.6e-57

$ R

R version 3.4.4 (2018-03-15) -- "Someone to Lean On"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> geno = read.table('~/plink_example/plink2.raw', header=T)
> pheno = read.table('~/plink_example/pheno.txt', header=T)
> M = pheno[c('pheno', 'sex', paste0('covar.',c(1:5)))]
> M$dose = geno$rs8675309_A
> coef(summary(glm(pheno ~ ., data=M, family=binomial())))
                 Estimate   Std. Error     z value     Pr(>|z|)
(Intercept)  9.384146e+00 7.215112e-01  13.0062369 1.127595e-38
sex         -7.266692e-02 4.734120e-02  -1.5349614 1.247933e-01
covar.1     -1.710821e+04 2.377739e+04  -0.7195161 4.718230e-01
covar.2      1.270773e-02 2.366693e-02   0.5369406 5.913086e-01
covar.3     -4.139534e-03 2.340506e-02  -0.1768650 8.596145e-01
covar.4      1.418545e-02 2.367348e-02   0.5992127 5.490311e-01
covar.5      4.802263e-02 2.379772e-02   2.0179509 4.359637e-02
dose        -5.735118e+00 3.594524e-01 -15.9551527 2.623314e-57

Thanks for your time! Please let us know if maybe we missed anything.

The data we used is attached:
plink_example.tar.gz

@chrchang
Copy link
Owner

chrchang commented Jan 22, 2019

The main issue here is that covar.1 is on a wildly different scale (never exceeding +/- 0.00001) than your other covariates (which have variance close to 1). plink2's current implementation of logistic regression isn't designed to handle this scale difference. The simplest way to address this is to add the --covar-variance-standardize flag, which linearly rescales all covariates to have mean 0, variance 1. This is sufficient to solve your immediate problem.

(I'm strongly considering replacing the current single-precision implementation with a slower double-precision one; but I haven't done so yet since, when used properly, single-precision should be enough. In fact, one recent trend in machine learning is to use even lower precision to achieve higher throughput.

Maybe the best temporary solution is for plink2 to error out when this large of a scale difference is present, unless a 'yes-really' modifier is present. I'll try to add this tonight.)

Another issue to keep in mind is that this variant has a rather low minor allele frequency (0.73%). Basic logistic regression becomes increasingly unstable as MAF decreases. The Firth logistic regression built into plink2, which you can invoke by replacing --logistic with "--glm firth" or "--glm firth-fallback" (the latter only requests Firth regression when the basic logistic regression yields "NA") is often a good solution here; in addition to almost always reporting an actual result instead of just "NA", it's designed to counter the basic logistic regression's tendency to yield overconfident p-values like 2.6e-57 when there are very few meaningful observations. In this particular case, it doesn't really make a difference when --covar-variance-standardize is already present (in fact, it reports a slightly more extreme p-value of 2.8e-59), since even 0.73% of 40000 allele observations isn't that small. But if you're analyzing more variants like this with plink2, it's practically always a good idea to use at least firth-fallback.

@chrchang
Copy link
Owner

The temporary solution was implemented on 22 Jan 2019.

A better solution is for plink2 to automatically perform variance-standardization, and then convert beta/OR/SE/CI back to the original units in the final report; I'll plan on implementing this within the next two weeks.

@chrchang chrchang self-assigned this Jan 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants