Unjustified NA output? #98

apoliakov · 2019-01-21T02:40:13Z

Hello and thanks for the work!

We've been looking at various association tools. Based on the code, it looks like PLINK2 computes the logistic regression model using a Cholesky decomposition method? If that is the case, we were wondering how potential numerical stability issues are handled.

Here is an example where PLINK2 appears to return NA. However if we load the same data into R and compute a logistic model using glm() we successfully get a result with a significant p-value.

The example data files are attached. The example is run from ~/plink_example

1. This is the PLINK version we used.

$ ~/plinks/plink2 --version
PLINK v2.00a2LM AVX2 Intel (2 Jan 2019)

2. Run the example in PLINK

Observe plink returns NA. There doesn't appear to be a warning

$ ~/plinks/plink2 --pfile ~/plink_example/plink_example --pheno ~/plink_example/pheno.txt --pheno-col-nums 3 --1 --covar-col-nums 4,5,6,7,8,9 --maf 0 --logistic --out ~/plink_example/out
PLINK v2.00a2LM AVX2 Intel (2 Jan 2019)        www.cog-genomics.org/plink/2.0/
(C) 2005-2018 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/scidb/plink_example/out.log.
Options in effect:
  --1
  --covar-col-nums 4,5,6,7,8,9
  --glm
  --maf 0
  --out /home/scidb/plink_example/out
  --pfile /home/scidb/plink_example/plink_example
  --pheno /home/scidb/plink_example/pheno.txt
  --pheno-col-nums 3

Start time: Sun Jan 20 21:11:37 2019
62959 MiB RAM detected; reserving 31479 MiB for main workspace.
Using up to 16 threads (change this with --threads).
20000 samples (10061 females, 9939 males; 20000 founders) loaded from
/home/scidb/plink_example/plink_example.psam.
1 variant loaded from /home/scidb/plink_example/plink_example.pvar.
1 binary phenotype loaded (2257 cases, 17743 controls).
6 covariates loaded from /home/scidb/plink_example/pheno.txt.
Calculating allele frequencies... done.
--glm logistic regression on phenotype 'PHENO1': done.
Results written to /home/scidb/plink_example/out.PHENO1.glm.logistic .
End time: Sun Jan 20 21:11:37 2019

$ cat out.PHENO1.glm.logistic
#CHROM	POS	ID	REF	ALT	A1	TEST	OBS_CT	OR	SE	Z_STAT	P
1	1	rs8675309	A	G	G	ADD	20000	NA	NA	NA	NA
1	1	rs8675309	A	G	G	COVAR1	20000	NA	NA	NA	NA
1	1	rs8675309	A	G	G	COVAR2	20000	NA	NA	NA	NA
1	1	rs8675309	A	G	G	COVAR3	20000	NA	NA	NA	NA
1	1	rs8675309	A	G	G	COVAR4	20000	NA	NA	NA	NA
1	1	rs8675309	A	G	G	COVAR5	20000	NA	NA	NA	NA
1	1	rs8675309	A	G	G	COVAR6	20000	NA	NA	NA	NA

3. Use PLINK export to convert to a text file

$ ~/plinks/plink2 --pfile ~/plink_example/plink_example --export A
PLINK v2.00a2LM AVX2 Intel (2 Jan 2019)        www.cog-genomics.org/plink/2.0/
(C) 2005-2018 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink2.log.
Options in effect:
  --export A
  --pfile /home/scidb/plink_example/plink_example

Start time: Sun Jan 20 21:12:08 2019
62959 MiB RAM detected; reserving 31479 MiB for main workspace.
Using up to 16 threads (change this with --threads).
20000 samples (10061 females, 9939 males; 20000 founders) loaded from
/home/scidb/plink_example/plink_example.psam.
1 variant loaded from /home/scidb/plink_example/plink_example.pvar.
Note: No phenotype data present.
--export A pass 1/1: writing... done.
--export A: plink2.raw written.
End time: Sun Jan 20 21:12:08 2019

4. Run that text file through R using standard glm()

For this case, R computes successfully and returns a significant p-value of 2.6e-57

$ R

R version 3.4.4 (2018-03-15) -- "Someone to Lean On"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> geno = read.table('~/plink_example/plink2.raw', header=T)
> pheno = read.table('~/plink_example/pheno.txt', header=T)
> M = pheno[c('pheno', 'sex', paste0('covar.',c(1:5)))]
> M$dose = geno$rs8675309_A
> coef(summary(glm(pheno ~ ., data=M, family=binomial())))
                 Estimate   Std. Error     z value     Pr(>|z|)
(Intercept)  9.384146e+00 7.215112e-01  13.0062369 1.127595e-38
sex         -7.266692e-02 4.734120e-02  -1.5349614 1.247933e-01
covar.1     -1.710821e+04 2.377739e+04  -0.7195161 4.718230e-01
covar.2      1.270773e-02 2.366693e-02   0.5369406 5.913086e-01
covar.3     -4.139534e-03 2.340506e-02  -0.1768650 8.596145e-01
covar.4      1.418545e-02 2.367348e-02   0.5992127 5.490311e-01
covar.5      4.802263e-02 2.379772e-02   2.0179509 4.359637e-02
dose        -5.735118e+00 3.594524e-01 -15.9551527 2.623314e-57

Thanks for your time! Please let us know if maybe we missed anything.

The data we used is attached:
plink_example.tar.gz

The text was updated successfully, but these errors were encountered:

chrchang · 2019-01-22T18:18:15Z

The main issue here is that covar.1 is on a wildly different scale (never exceeding +/- 0.00001) than your other covariates (which have variance close to 1). plink2's current implementation of logistic regression isn't designed to handle this scale difference. The simplest way to address this is to add the --covar-variance-standardize flag, which linearly rescales all covariates to have mean 0, variance 1. This is sufficient to solve your immediate problem.

(I'm strongly considering replacing the current single-precision implementation with a slower double-precision one; but I haven't done so yet since, when used properly, single-precision should be enough. In fact, one recent trend in machine learning is to use even lower precision to achieve higher throughput.

Maybe the best temporary solution is for plink2 to error out when this large of a scale difference is present, unless a 'yes-really' modifier is present. I'll try to add this tonight.)

Another issue to keep in mind is that this variant has a rather low minor allele frequency (0.73%). Basic logistic regression becomes increasingly unstable as MAF decreases. The Firth logistic regression built into plink2, which you can invoke by replacing --logistic with "--glm firth" or "--glm firth-fallback" (the latter only requests Firth regression when the basic logistic regression yields "NA") is often a good solution here; in addition to almost always reporting an actual result instead of just "NA", it's designed to counter the basic logistic regression's tendency to yield overconfident p-values like 2.6e-57 when there are very few meaningful observations. In this particular case, it doesn't really make a difference when --covar-variance-standardize is already present (in fact, it reports a slightly more extreme p-value of 2.8e-59), since even 0.73% of 40000 allele observations isn't that small. But if you're analyzing more variants like this with plink2, it's practically always a good idea to use at least firth-fallback.

chrchang · 2019-01-26T17:13:07Z

The temporary solution was implemented on 22 Jan 2019.

A better solution is for plink2 to automatically perform variance-standardization, and then convert beta/OR/SE/CI back to the original units in the final report; I'll plan on implementing this within the next two weeks.

chrchang self-assigned this Jan 26, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unjustified NA output? #98

Unjustified NA output? #98

apoliakov commented Jan 21, 2019

chrchang commented Jan 22, 2019 •

edited

Loading

chrchang commented Jan 26, 2019

Unjustified NA output? #98

Unjustified NA output? #98

Comments

apoliakov commented Jan 21, 2019

1. This is the PLINK version we used.

2. Run the example in PLINK

3. Use PLINK export to convert to a text file

4. Run that text file through R using standard glm()

chrchang commented Jan 22, 2019 • edited Loading

chrchang commented Jan 26, 2019

chrchang commented Jan 22, 2019 •

edited

Loading