Replies: 14 comments 1 reply
-
We analyzed 100 ng of Pierce HeLa digest using the Evosep One (40 SPD method), so very similar to your manuscript.
-
Hi, the two big caveats are that we don't have automatic parameter optimization or multi-step searches yet. To get the same results you will need to set the following parameters in the first search:
Then perform a second search without library prediction and select the
This should give you very similar performance :)
-
😟 Now I am really surprised. Currently there is no parameter optimization and no second search with the optimized library? Automating the second search with a script is easy to do, but how should I guess the optimal parameters without having seen the data a priori?
-
Hi, it's not as bad as it sounds. 😄 Lots of parameters are already being optimized automatically, but there are still three major ones: MS1 tolerance, MS2 tolerance and RT tolerance. MS1 and MS2 are self-explanatory; RT tolerance is the largest expected retention-time error after calibration. We usually set this to 200-300 s for the 60 and 40 SPD methods. The reason is that we want to offer the best performance across methods and gradients, so we are currently compiling a test suite with different methods and setups.
-
Ok, let's get practical. I have a set of DIA raw files from the Astral, acquired with lock-mass correction. So we basically know it is high-mass-accuracy data (lock-mass correction can be checked easily without doing a database search first). You are suggesting default values for all OT scans (MS1); here you used 4 ppm. Why exactly 4? And a default value for Astral scans (MS2); again, why exactly 7? And your iRT matching tolerance is simply based on historical knowledge (you know your column setup and the Evosep performance, so you go for ±2 min)? Why 2?
-
Yes, this is similar to how 5 ppm and 10 ppm have been the defaults for 70k and 35k resolution on Orbitrap instruments, or 15 ppm for default timsTOF DIA settings. We have seen that Astral data works well with 6-7 ppm. Regarding the RT tolerance, you should aim for 30% of the total gradient (21 min × 0.3 = 6.3 min; ±189 s) for the first search and 15% for the second search. Otherwise I would recommend looking at the methods section of the manuscript for inspiration. That being said, I absolutely agree that this is not practical, and we are working on a prototype to solve it. The reason is really that we had to set priorities, and since we have a very well-controlled and standardized instrument setup with Evosep, IonOpticks and the Astrals, this was fine for getting good performance while we focused on establishing confident FDR, speed, quantification etc. I'm curious to hear how alphaDIA performs on your data with the updated parameters! I can update you when we have the first release to test. As you are using the developer version anyway, any feedback on the automated optimization would be appreciated once it's part of a release candidate.
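The rule of thumb above can be turned into a one-liner; a minimal sketch, assuming the tolerance is reported as ± half of the stated gradient fraction (matching the 21 min × 0.3 → ±189 s example):

```shell
# RT tolerance rule of thumb from this thread: ~30% of the total gradient
# for the first search, ~15% for the second, expressed as +/- half of
# that window (21 min * 0.3 = 378 s total -> +/- 189 s).
gradient_min=21
first=$(awk -v g="$gradient_min" 'BEGIN { printf "%d", g * 60 * 0.30 / 2 }')
second=$(awk -v g="$gradient_min" 'BEGIN { printf "%d", g * 60 * 0.15 / 2 }')
echo "first search:  +/- ${first} s"
echo "second search: +/- ${second} s"
```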
-
As you predicted, using the 2-pass search with the above parameters gives:
I implemented this using a very basic slurm batch script
But I think you should really change the manuscript in this respect: "With these state-of-the art predicted libraries, we devised a two-step search workflow in alphaDIA consisting of library refinement and quantification (Fig. 5 a)." This sounds as if alphaDIA actively manages both steps as part of an integrated workflow. Are the precursors now counted at 1% FDR? Why is the MS1 accuracy zero? Is it possible to add the MS2 accuracy? Best,
-
Ah, that's good to see! I would assume there is some additional performance to gain with parameter optimization. I use a very similar SLURM script for my two-step searches and hope to include it in the docs soon. We wrote it like this in the manuscript because alphaDIA was designed for multi-step searches and therefore builds the MBR library etc. This should allow testing and benchmarking, and from a scientific point of view this was the priority. The FDR handling is therefore not arbitrary either but was made to work in this context. It's actually surprisingly hard to get all of this somewhat right :D Of course it's important for adoption to make it easily accessible in the GUI and to allow multi-step searches. At the same time, we don't just want a single checkbox that gives you some multi-step search; we want this in a configurable, transparent fashion, designed with good software engineering in mind. It's really cool to see that you have already explored alphaDIA a bit. If you have time I would be happy to schedule a Zoom call and hear about your experience from the perspective of a very technical user. This is meant to be an open-source community project, so all contributions are welcome. You can reach me at (lastname)@biochem.mpg.de 😀. Regarding the FDR, it's always controlled on a local precursor and global protein level, so it will be controlled for the first search, the MBR library and the second-search results. Something I would recommend is to set the
-
Ok! Will try. I was already guessing that you use alphaDIA within Slurm. I think this is in general something of interest for the community. Are there things one should be aware of when running alphaDIA by
-
Yes, we have some pipelines and do most processing on Slurm. I have some templates for transfer learning and two-step searches that I will share. We use sbatch with conda, similar to your script. I would generally recommend using only a single socket at a time: only use as many threads as there are cores on a single socket, and allow two tasks per node on two-socket machines. For simple searches 64 GB to 128 GB should be sufficient; for more complicated searches 256 GB is better. We haven't really optimized for memory yet. Another trick is to use the --config-dict argument in the CLI: https://alphadia.readthedocs.io/en/latest/methods/command-line.html
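A sketch of how those recommendations could look in an sbatch header. The resource numbers and the alphaDIA invocation are assumptions to adapt, not a tested template:

```shell
#!/bin/bash
# Hypothetical sbatch header following the advice above: one task per
# socket, threads capped at one socket's core count, ~128 GB per task.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2        # two tasks on a dual-socket node
#SBATCH --cpus-per-task=32         # adjust to the core count of one socket
#SBATCH --mem=256G                 # ~128 GB per task for a simple search

# Bind each task to a socket so threads don't migrate across sockets.
# The alphadia call and its flags are illustrative, not verified.
srun --cpu-bind=sockets alphadia --config first_search.yaml
```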
-
I am currently testing on a single cluster node like:
If I understand the above correctly, we have 128 CPUs, each sitting in a separate socket with a single core, and each core is set for a single thread. So your recommendation would mean using a single thread? 😄 So far I have been running 32- or 64-thread alphadia commands without setting anything in particular; the details at https://slurm.schedmd.com/cpu_management.html are not all perfectly clear to me.
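To untangle how many sockets and cores a node really has, it can help to print the topology before picking thread counts; a small sketch (`lscpu` output format varies by distribution):

```shell
# Count logical CPUs visible to this node/allocation, then show the
# socket / core / thread breakdown that the Slurm docs talk about.
logical_cpus=$(getconf _NPROCESSORS_ONLN)
echo "logical CPUs: ${logical_cpus}"
# Sockets x cores-per-socket x threads-per-core (Linux only):
lscpu 2>/dev/null | grep -E '^(Socket|Core|Thread|CPU\(s\))' || true
```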
-
I am constantly modifying the Slurm batch script, but currently the 2-pass search jobs look like:
-
The slurm batch script for the 2-pass search
not quite sure if the |
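The attached script itself isn't reproduced in the thread; purely to illustrate the two-pass structure being discussed, a minimal sketch in which all paths, flag names and the library filename are assumptions:

```shell
#!/bin/bash
#SBATCH --job-name=alphadia-2pass
#SBATCH --cpus-per-task=32
#SBATCH --mem=128G

# Hypothetical two-pass alphaDIA run: pass 1 searches with a predicted
# library and builds the MBR library, pass 2 re-searches against it.
# Raw file locations are assumed to live in the config files.
set -euo pipefail
OUT1=out_first_pass
OUT2=out_second_pass

# Pass 1: library prediction, wider tolerances.
alphadia --config first_search.yaml --output "$OUT1"

# Pass 2: no prediction, tighter RT tolerance, library from pass 1.
# The library filename below is a guess, not a documented output name.
alphadia --config second_search.yaml \
         --library "$OUT1/speclib.mbr.hdf" \
         --output "$OUT2"
```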
-
Hi, this looks good, although your script is already much more sophisticated than mine :D In your case, the config dict is JSON.
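Since the config dict is passed as inline JSON, one way to avoid quoting mistakes is to build and validate the string before submitting the job; a sketch in which the key names are illustrative placeholders, not alphaDIA's documented schema:

```shell
# Build an inline JSON config; the keys below are illustrative only.
config='{"search": {"target_ms1_tolerance": 4, "target_ms2_tolerance": 7}}'

# Validate up front -- a malformed dict fails fast here instead of
# inside the queued Slurm job.
echo "$config" | python3 -m json.tool > /dev/null && echo "config OK"
# alphadia --config-dict "$config" ...
```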
-
Describe the bug
The number of precursors reported by alphaDIA in stat.tsv deviates strongly from the precursor counts reported by DIA-NN (library-free mode) on the same files.
According to Fig. 5 of your manuscript, one would expect similar precursor and protein counts on Astral data (in-house generated using 2 Da fixed-window DIA). How do you filter the precursor data for the stat.tsv file?
Logs
attached
log.txt
Version (please complete the following information):