-
Known issues
-
This workflow is still getting ready to be fully deployed, therefore we have some ongoing issue that we are actively resolving.
-
-- The chunks between the Line 414 to 453 are not being rendered and should not be rendered as we are having some issue with the backend.
-
+
This workflow is still being prepared for full deployment, so there may be ongoing issues that we are actively resolving. Any issues we know about will be listed below.
+
For now, we are not aware of any problems in the code. If you run into an issue, first check that you have the latest devel version of all the packages. If updating the packages does not resolve the issue, please report it with a reproducible example on GitHub here.
If you have any other issues, do not hesitate to report them to us.
diff --git a/pkgdown.yml b/pkgdown.yml
index 4b2709f..1d1dcb4 100644
--- a/pkgdown.yml
+++ b/pkgdown.yml
@@ -1,9 +1,9 @@
-pandoc: '3.3'
+pandoc: '3.4'
pkgdown: 2.1.1
pkgdown_sha: ~
articles:
end-to-end-untargeted-metabolomics: end-to-end-untargeted-metabolomics.html
-last_built: 2024-09-25T16:39Z
+last_built: 2024-09-30T13:01Z
urls:
reference: https://rformassspectrometry.github.io/metabonaut/reference
article: https://rformassspectrometry.github.io/metabonaut/articles
diff --git a/search.json b/search.json
index b085b3e..7f2c0ca 100644
--- a/search.json
+++ b/search.json
@@ -1 +1 @@
-[{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"abstract","dir":"Articles","previous_headings":"","what":"Abstract","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"Metabolomics provides real-time view metabolic state examined samples, mass spectrometry serving key tool deciphering intricate differences metabolomes due specific factors. context metabolomic investigations, untargeted liquid chromatography tandem mass spectrometry (LC-MS/MS) emerges powerful approach thanks versatility resolution. paper focuses dataset aimed identifying differences plasma metabolite levels individuals suffering cardiovascular disease healthy controls. Despite potential insights offered untargeted LC-MS/MS data, significant challenge field lies generation reproducible scalable analysis workflows. struggle due aforementioned high versatility technique, results difficulty one-size-fits-workflow software adapt experimental setups. power R-based analysis workflows lies high customizability adaptability specific instrumental experimental setups; however, various specialized packages exist individual analysis steps, seamless integration application large cohort datasets remain elusive. Addressing gap, present innovative R workflow leverages xcms, packages RforMassSpectrometry environment encompass aspects pre-processing downstream analyses LC-MS/MS datasets reproducible manner allow easy customization generate data-set specific workflows. 
workflow seamlessly integrates Bioconductor packages, offering adaptability diverse study designs analysis requirements.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"keyword","dir":"Articles","previous_headings":"","what":"Keyword","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"LC-MS/MS, reproducibility, workflow, xcms, R, normalization, feature identification, Bioconductor,…","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"introduction","dir":"Articles","previous_headings":"","what":"Introduction","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"Liquid chromatography coupled tandem mass spectrometry (LC-MS/MS) powerful tool metabolomics investigations, providing comprehensive view metabolome. enables identification large number metabolites relative abundance biological samples. Liquid Chromatography (LC) separation technique relies different interactions analytes towards chromatographic column - stationary phase - eluent analysis - mobile phase. stronger affinity analyte stationary phase - dictated polarity, size, charges parameters - longer take compound leave column detected coupled technique - Mass Spectrometer. Mass Spectrometry allows identify quantify ions based mass--charge (m/z) ratio. high selectivity relies capability separate compounds small variations mass, also capacity promote fragmentation. ion initial m/z (parent ion) can broken characteristic fragments (daughter ions), help structure elucidation identification specific compound (Theodoridis et al. 2012). Therefore, LC-MS/MS data usually tridimensional datasets containing retention time compounds separation LC, detected m/z compounds given time, intensity signals. 
Furthermore, MS signal can two different levels, corresponding signal parent ion (called MS1) signals corresponding fragments (denominanted MS2). high sensitivity specificity LC-MS/MS make indispensable tool biomarker discovery elucidating metabolic pathways. untargeted approach particularly useful hypothesis-free investigations, allowing detection unexpected metabolites pathways. However, analysis LC-MS/MS data complex requires series preprocessing steps extract meaningful information raw data. main challenges include dealing lack ground truth data, high dimensionality data, presence noise artifacts (Gika, Wilson, Theodoridis 2014). Moreover, due different instrumental setups protocols definition single one-fits-workflow impossible. Finally, specialized software packages exist individual step analysis, seamless integration remains elusive. present complete analysis workflow untargeted LC-MS/MS data using R Bioconductor packages, particular RforMassSpectrometry package ecosystem. later initiative initiative aims implement expandable, flexible infrastructure analysis MS data, providing also comprehensive toolbox functions build customized analysis workflows. demonstrate various algorithms can adapted particular data set various R packages can seamlessly integrated ensure efficient reproducible processing. present workflow covers steps LC-MS/MS data analysis, preprocessing, data normalization, differential abundance analysis annotation significant features .e., collections signals retention time mass--charge ratios pertaining ions. 
Various options visualizations well quality assessment presented analysis steps.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"data-description","dir":"Articles","previous_headings":"","what":"Data description","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"workflow two datasets utilized, LC-MS-based (MS1 level ) untargeted metabolomics data set quantify small polar metabolites human plasma samples additional LC-MS/MS data set selected samples former study identification/annotation significant features. samples used randomly selected larger study identification metabolites differences abundances individuals suffering cardiovascular disease (CVD) healthy controls (CTR).subset analyzed comprises data three CVD three CTR well four quality control (QC) samples. QC samples represent pool serum samples large cohort repeatedly measured throughout experiment monitor stability signal. data metadata workflow accessible MetaboLight database ID: MTBLS8735. detailed materials method used analysis samples can also found metabolight database. especially pertinent analysis chosen parameters, want highlight samples analyzed using ultra-high-performance liquid chromatography (UHPLC) coupled Q-TOF mass spectrometer (TripleTOF 5600+). 
chromatographic separation based hydrophilic interaction liquid chromatography (HILIC).","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"workflow-description","dir":"Articles","previous_headings":"","what":"Workflow description","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"present workflow describes steps analysis LC-MS/MS experiment, includes preprocessing raw data generate abundance matrix features various samples, followed data normalization, differential abundance analysis finally annotation features metabolites. Note also alternative analysis options R packages used different steps examples mentioned throughout workflow. [jo: ’ll include maybe later. key justify workflow comprehensive] workflow therefore based following dependencies:","code":"## General bioconductor package library(Biobase) ## Data Import and handling library(readxl) library(MsExperiment) library(MsIO) library(MsBackendMetaboLights) library(SummarizedExperiment) ## Preprocessing of LC-MS data library(xcms) library(Spectra) library(MetaboCoreUtils) ## Statistical analysis library(limma) # Differential abundance library(matrixStats) # Summaries over matrices ## Visualisation library(pander) library(RColorBrewer) library(pheatmap) library(vioplot) library(ggfortify) # Plot PCA library(gridExtra) # To arrange multiple ggplots into single plots ## Annotation library(AnnotationHub) # Annotation resources library(CompoundDb) # Access small compound annotation data. 
library(MetaboAnnotation) # Functionality for metabolite annotation."},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"data-import","dir":"Articles","previous_headings":"","what":"Data import","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"Note different equipment generate various file extensions, conversion step might needed beforehand, though apply dataset. Spectra package supports variety ways store retrieve MS data, including mzML, mzXML, CDF files, simple flat files, database systems. necessary, several tools, ProteoWizard’s MSConvert, can used convert files .mzML format (Chambers et al. 2012). show extract dataset MetaboLigths database load MsExperiment object. information load data MetaboLights database, refer vignette. type data loading, check link: next configure parallel processing setup. functions xcms package allow per-sample parallel processing, can improve performance analysis, especially large data sets. xcms packages RforMassSpectrometry package ecosystem use parallel processing setup configured BiocParallel Bioconductor package. 
code use fork-based parallel processing unix system, socket-based parallel processing Windows operating system.","code":"param <- MetaboLightsParam(mtblsId = \"MTBLS8735\", assayName = paste0(\"a_MTBLS8735_LC-MS_positive_\", \"hilic_metabolite_profiling.txt\"), filePattern = \".mzML\") data <- readMsObject(MsExperiment(), param, keepOntology = FALSE, keepProtocol = FALSE, simplify = TRUE) #' Set up parallel processing using 2 cores if (.Platform$OS.type == \"unix\") { register(MulticoreParam(2)) } else{ register(SnowParam(2)) }"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"data-organisation","dir":"Articles","previous_headings":"","what":"Data organisation","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"experimental data now represented MsExperiment object MsExperiment package. MsExperiment object container metadata spectral data provides manages also linkage samples spectra. provide brief overview data structure content. sampleData() function extracts sample information object. next extract data use pander package render show information Table 1 . Throughout document use R pipe operator (|>) avoid nested function calls hence improve code readability. Table 1. Samples data set. (continued ) Table 1. Samples data set. (continued ) 11 samples data set. abbreviations essential proper interpretation metadata information: injection_index: index representing order (position) individual sample measured (injected) within LC-MS measurement run experiment. \"QC\": Quality control sample (pool serum samples external, large cohort). \"CVD\": Sample individual cardiovascular disease. \"CTR\": Sample presumably healthy control. sample_name: arbitrary name/identifier sample. age: (rounded) age individuals. 
define colors sample groups based sample group using RColorBrewer package: MS data experiment stored Spectra object (Spectra Bioconductor package) within MsExperiment object can accessed using spectra() function. element object spectrum - organised linearly combined Spectra object one (ordered retention time samples). access dataset’s Spectra object summarize available information provide, among things, total number spectra data set. can also summarize number spectra respective MS level (extracted msLevel() function). fromFile() function returns spectrum index sample (data file) can thus used split information (MS level case) sample summarize using base R table() function combine result matrix. Note number spectra acquired run, number spectral features sample. present data set thus contains MS1 data, ideal quantification signal. second (LC-MS/MS) data set also fragment (MS2) spectra samples used later workflow. Note users restrict data evaluation examples shown tutorials. Spectra package enables user-friendly access full MS data functionality extensively used explore, visualize summarize data. another example, determine retention time range entire data set. Data obtained LC-MS experiments typically analyzed along retention time axis, MS data organized spectrum, orthogonal retention time axis.","code":"data ## Object of class MsExperiment ## Spectra: MS1 (17210) ## Experiment data: 10 sample(s) ## Sample data links: ## - spectra: 10 sample(s) to 17210 element(s). #' Access Spectra Object spectra(data) ## MSn data (Spectra) with 17210 spectra in a MsBackendMetaboLights backend: ## msLevel rtime scanIndex ## ## 1 1 0.274 1 ## 2 1 0.553 2 ## 3 1 0.832 3 ## 4 1 1.111 4 ## 5 1 1.390 5 ## ... ... ... ... ## 17206 1 479.052 1717 ## 17207 1 479.331 1718 ## 17208 1 479.610 1719 ## 17209 1 479.889 1720 ## 17210 1 480.168 1721 ## ... 36 more variables/columns. ## ## file(s): ## MS_QC_POOL_1_POS.mzML ## MS_A_POS.mzML ## MS_B_POS.mzML ## ... 
7 more files #' Count the number of spectra with a specific MS level per file. spectra(data) |> msLevel() |> split(fromFile(data)) |> lapply(table) |> do.call(what = cbind) ## 1 2 3 4 5 6 7 8 9 10 ## 1 1721 1721 1721 1721 1721 1721 1721 1721 1721 1721 #' Retention time range for entire dataset spectra(data) |> rtime() |> range() ## [1] 0.273 480.169"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"data-visualization-and-general-quality-assessment","dir":"Articles","previous_headings":"","what":"Data visualization and general quality assessment","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"Effective visualization paramount inspecting assessing quality MS data. general overview LC-MS data, can: Combine mass peaks (MS1) spectra sample single spectrum mass peak represents maximum signal mass peaks similar m/z. spectrum might called Base Peak Spectrum (BPS), providing information abundant ions sample. Aggregate mass peak intensities spectrum, resulting Base Peak Chromatogram (BPC). BPC shows highest measured intensity distinct retention time (hence spectrum) thus orthogonal BPS. Sum mass peak intensities spectrum create Total Ion Chromatogram (TIC). Compare BPS samples experiment evaluate similarity ion content. Compare BPC samples experiment identify samples similar dissimilar chromatographic signal. addition general data evaluation visualization, also crucial investigate specific signal e.g. internal standards compounds/ions known present samples. 
providing reliable reference, internal standards help achieve consistent accurate analytical results.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"spectra-data-visualization-bps","dir":"Articles","previous_headings":"Data visualization and general quality assessment","what":"Spectra Data Visualization: BPS","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"BPS collapses data retention time dimension reveals prevalent ions present samples, creation BPS however straightforward. Mass peaks, even representing signals ion, never identical m/z values consecutive spectra due measurement error/resolution instrument. use combineSpectra function combine spectra one file (defined using parameter f = fromFile(data)) single spectrum. mass peaks difference m/z value smaller 3 parts-per-million (ppm) combined one mass peak, intensity representing maximum grouped mass peaks. reduce memory requirement, addition first bin spectrum combining mass peaks within spectrum, aggregating mass peaks bins 0.01 m/z width. case large datasets, also recommended set processingChunkSize() parameter MsExperiment object finite value (default Inf) causing data processed (loaded memory) chunks processingChunkSize() spectra. can reduce memory demand speed process. can now generate BPS sample plot() . , observable overlap ion content files, particularly around 300 m/z 700 m/z. however also differences sets samples. particular, BPS 1, 4, 7 10 (counting row-wise left right) seem different others. fact, four BPS QC samples, remaining six study samples. observed differences might explained fact QC samples pools serum samples different cohort, study samples represent plasma samples, different sample collection. Next visual inspection , can also calculate express similarity BPS heatmap. 
use compareSpectra() function calculate pairwise similarities BPS use pheatmap() function pheatmap package cluster visualize result. get first glance different samples distribute terms similarity. heatmap confirms observations made BPS, showing distinct clusters QCs study samples, owing different matrices sample collections. also strongly recommended delve deeper data exploring detail. can accomplished carefully assessing data extracting spectra regions interest examination. next chunk, look extract information specific spectrum distinct samples. significant dissimilarities peak distribution intensity confirm difference composition QCs study samples. next compare full MS1 spectrum CVD CTR sample. , can observe spectra CVD CTR samples entirely similar, exhibit similar main peaks 200 600 m/z general higher intensity control samples. However peak distribution (least intensity) seems vary m/z 10 210 m/z 600. CTR spectrum exhibits significant peaks around m/z 150 - 200 much lower intensity CVD sample. delve details specific spectrum, wide range functions can employed: NumericList length 1 [[1]] 18.3266733266736 45.1666666666667 … 27.1048951048951 34.9020979020979 [1] 34.872 NumericList length 1 [[1]] 51.1677328505635 53.0461968245186 … 999.139446289161 999.315208803072 Table 2. 
Intensity m/z values 125th spectrum one CTR sample.","code":"#' Setting the chunksize chunksize <- 1000 processingChunkSize(spectra(data)) <- chunksize #' Accessing a single spectrum - comparing with QC par(mfrow = c(1,2), mar = c(2, 2, 2, 2)) spec1 <- spectra(data[1])[125] spec2 <- spectra(data[3])[125] plotSpectra(spec1, main = \"QC sample\") plotSpectra(spec2, main = \"CTR sample\") #' Accessing a single spectrum - comparing CVD and CTR par(mfrow = c(1,2), mar = c(2, 2, 2, 2)) spec1 <- spectra(data[2])[125] spec2 <- spectra(data[3])[125] plotSpectra(spec1, main = \"CVD sample\") plotSpectra(spec2, main = \"CTR sample\") #' Checking its intensity intensity(spec2) #' Checking its rtime rtime(spec2) #' Checking its m/z mz(spec2) #' Filtering for a specific m/z range and viewing in a tabular format filt_spec <- filterMzRange(spec2,c(50,200)) data.frame(intensity = unlist(intensity(filt_spec)), mz = unlist(mz(filt_spec))) |> head() |> pandoc.table(style = \"rmarkdown\", caption = \"Table 2. Intensity and m/z values of the 125th spectrum of one CTR sample.\")"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"chromatographic-data-visualization-bpc-and-tic","dir":"Articles","previous_headings":"Data visualization and general quality assessment","what":"Chromatographic Data Visualization: BPC and TIC","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"chromatogram() function facilitates extraction intensities along retention time. However, access chromatographic information currently efficient seamless spectral information. Work underway develop/improve infrastructure chromatographic data new Chromatograms object aimed flexible user-friendly Spectra object. visualizing LC-MS data, BPC TIC serves valuable tool assess performance liquid chromatography across various samples experiment. case, extract BPC data create plot. 
BPC captures maximum peak signal spectrum data file plots information retention time spectrum y-axis. BPC can extracted using chromatogram function. setting parameter aggregationFun = \"max\", instruct function report maximum signal per spectrum. Conversely, setting aggregationFun = \"sum\", sums intensities spectrum, thereby creating TIC. 240 seconds signal seems measured. Thus, filter data removing part well first 10 seconds measured LC run. Initially, examined entire BPC subsequently filtered based desired retention times. results smaller file size also facilitates straightforward interpretation BPC. final plot illustrates BPC sample colored phenotype, providing insights signal measured along retention times sample. reveals points compounds eluted LC column. essence, BPC condenses three-dimensional LC-MS data (m/z retention time intensity) two dimensions (retention time intensity). can also compare similarities BPCs heatmap. retention times however identical different samples. Thus bin() chromatographic signal per sample along retention time axis bins two seconds resulting data number bins/data points. can calculate pairwise similarities data vectors using cor() function visualize result using pheatmap(). heatmap reinforces exploration spectra data showed, strong separation QC study samples. important bear mind later analyses. Additionally, study samples group two clusters, cluster containing samples C F cluster II samples. plot TIC samples, using different color cluster. TIC samples look similar, samples cluster show different signal retention time range 40 160 seconds. 
Whether, strong difference impact following analysis remains determined.","code":"#' Extract and plot BPC for full data bpc <- chromatogram(data, aggregationFun = \"max\") plot(bpc, col = paste0(col_sample, 80), main = \"BPC\", lwd = 1.5) grid() legend(\"topright\", col = col_phenotype, legend = names(col_phenotype), lty = 1, lwd = 2, horiz = TRUE, bty = \"n\") #' Filter the data based on retention time data <- filterRt(data, c(10, 240)) bpc <- chromatogram(data, aggregationFun = \"max\") #' Plot after filtering plot(bpc, col = paste0(col_sample, 80), main = \"BPC after filtering retention time\", lwd = 1.5) grid() legend(\"topright\", col = col_phenotype, legend = names(col_phenotype), lty = 1, lwd = 2, horiz = TRUE, bty = \"n\") #' Total ion chromatogram tic <- chromatogram(data, aggregationFun = \"sum\") |> bin(binSize = 2) #' Calculate similarity (Pearson correlation) between BPCs ticmap <- do.call(cbind, lapply(tic, intensity)) |> cor() rownames(ticmap) <- colnames(ticmap) <- sampleData(data)$sample_name ann <- data.frame(phenotype = sampleData(data)[, \"phenotype\"]) rownames(ann) <- rownames(ticmap) #' Plot heatmap pheatmap(ticmap, annotation_col = ann, annotation_colors = list(phenotype = col_phenotype)) cluster_I_idx <- sampleData(data)$sample_name %in% c(\"F\", \"C\") cluster_II_idx <- sampleData(data)$sample_name %in% c(\"A\", \"B\", \"D\", \"E\") temp_col <- c(\"grey\", \"red\") names(temp_col) <- c(\"Cluster II\", \"Cluster I\") col <- rep(temp_col[1], length(data)) col[cluster_I_idx] <- temp_col[2] col[sampleData(data)$phenotype == \"QC\"] <- NA data |> chromatogram(aggregationFun = \"sum\") |> plot( col = col, main = \"TIC after filtering retention time\", lwd = 1.5) grid() legend(\"topright\", col = temp_col, legend = names(temp_col), lty = 1, lwd = 2, horiz = TRUE, bty = 
\"n\")"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"known-compounds","dir":"Articles","previous_headings":"Data visualization and general quality assessment > Chromatographic Data Visualization: BPC and TIC","what":"Known compounds","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"Throughout entire process, crucial reference points within dataset, well-known ions. experiments nowadays include internal standards (), case . strongly recommend using visualization throughout entire analysis. experiment, set 15 spiked samples. reviewing signal , selected two guide analysis process. However, also advise plot evaluate ions steps. illustrate , generate Extracted Ion Chromatograms (EIC) selected test ions. restricting MS data intensities within restricted, small m/z range selected retention time window, EICs expected contain signal single type ion. expected m/z retention times set determined different experiment. Additionally, cases internal standards available, commonly present ions sample matrix can serve suitable alternatives. Ideally, compounds distributed across entire retention time range experiment. Table 3.Internal standard list respective m/z expected retention time [s]. (continued ) plot EICs isotope labeled cystine methionine. can observe clear concentration difference QCs study samples isotope labeled cystine ion. Meanwhile, labeled methionine internal standard exhibits discernible signal amidst noise noticeable retention time shift samples. artificially isotope labeled compounds spiked individual samples, also signal endogenous compounds serum (plasma) samples. Thus, calculate next mass m/z [M+H]+ ion endogenous cystine chemical formula extract also EIC ion. calculation exact mass m/z selected ion adduct use calculateMass() mass2mz() functions r Biocpkg(\"MetaboCoreUtils\") package. 
two cystine EICs look highly similar (endogenous shown left, isotope labeled right plot ), shift m/z, arises artificial labeling. shift allows us discriminate endogenous non-endogenous compound.","code":"#' Load our list of standard intern_standard <- read.delim(\"intern_standard_list.txt\") # Extract EICs for the list eic_is <- chromatogram( data, rt = as.matrix(intern_standard[, c(\"rtmin\", \"rtmax\")]), mz = as.matrix(intern_standard[, c(\"mzmin\", \"mzmax\")])) #' Add internal standard metadata fData(eic_is)$mz <- intern_standard$mz fData(eic_is)$rt <- intern_standard$RT fData(eic_is)$name <- intern_standard$name fData(eic_is)$abbreviation <- intern_standard$abbreviation rownames(fData(eic_is)) <- intern_standard$abbreviation #' Summary of IS information cpt <- paste(\"Table 3.Internal standard list with respective m/z and expected\", \"retention time [s].\") fData(eic_is)[, c(\"name\", \"mz\", \"rt\")] |> as.data.frame() |> pandoc.table(style = \"rmarkdown\", caption = cpt) #' Extract the two IS from the chromatogram object. eic_cystine <- eic_is[\"cystine_13C_15N\"] eic_met <- eic_is[\"methionine_13C_15N\"] #' plot both EIC par(mfrow = c(1, 2), mar = c(4, 2, 2, 0.5)) plot(eic_cystine, main = fData(eic_cystine)$name, cex.axis = 0.8, cex.main = 0.8, col = paste0(col_sample, 80)) grid() abline(v = fData(eic_cystine)$rt, col = \"red\", lty = 3) plot(eic_met, main = fData(eic_met)$name, cex.axis = 0.8, cex.main = 0.8, col = paste0(col_sample, 80)) grid() abline(v = fData(eic_met)$rt, col = \"red\", lty = 3) legend(\"topright\", col = col_phenotype, legend = names(col_phenotype), lty = 1, bty = \"n\") #' extract endogenous cystine mass and EIC and plot. 
cysmass <- calculateMass(\"C6H12N2O4S2\") cys_endo <- mass2mz(cysmass, adduct = \"[M+H]+\")[, 1] #' Plot versus spiked par(mfrow = c(1, 2)) chromatogram(data, mz = cys_endo + c(-0.005, 0.005), rt = unlist(fData(eic_cystine)[, c(\"rtmin\", \"rtmax\")]), aggregationFun = \"max\") |> plot(col = paste0(col_sample, 80)) |> grid() plot(eic_cystine, col = paste0(col_sample, 80)) grid() legend(\"topright\", col = col_phenotype, legend = names(col_phenotype), lty = 1, bty = \"n\")"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"data-preprocessing","dir":"Articles","previous_headings":"","what":"Data preprocessing","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"Preprocessing stands inaugural step analysis untargeted LC-MS. characterized 3 main stages: chromatographic peak detection, retention time shift correction (alignment) correspondence results features defined. primary objective preprocessing quantification signals ions measured sample, addressing potential retention time drifts samples, ensuring alignment quantified signals across samples within experiment. final result LC-MS data preprocessing numeric matrix abundances quantified entities samples experiment. [anna: silly question: isn’t goal preprocessing align group signals pertaining certain ion feature? obtain matrix abundances][phili: actually really like anna’s simple definition. think ?]","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"chromatographic-peak-detection","dir":"Articles","previous_headings":"Data preprocessing","what":"Chromatographic peak detection","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"initial preprocessing step involves detecting intensity peaks along retention time axis, called chromatographic peaks. 
achieve , employ findChromPeaks() function within xcms. function supports various algorithms peak detection, can selected configured respective parameter objects. preferred algorithm case, CentWave, utilizes continuous wavelet transformation (CWT)-based peak detection (Tautenhahn, Böttcher, Neumann 2008). method known effectiveness handling non-Gaussian shaped chromatographic peaks peaks varying retention time widths, commonly encountered HILIC separations. apply CentWave algorithm default settings extracted ion chromatogram cystine methionine ions evaluate results. CentWave highly performant algorithm, requires costumized dataset. implies parameters fine-tuned based user’s data. example serves clear motivation users familiarize various parameters need adapt data set. discuss main parameters can easily adjusted suit user’s dataset: peakwidth: Specifies minimal maximal expected width peaks retention time dimension. Highly dependent chromatographic settings used. ppm: maximal allowed difference mass peaks’ m/z values (parts-per-million) consecutive scans consider representing signal ion. integrate: parameter defines integration method. , primarily use integrate = 2 integrates also signal chromatographic peak’s tail considered accurate developers. determine peakwidth, recommend users refer previous EICs estimate range peak widths observe dataset. Ideally, examining multiple EICs goal. dataset, peak widths appear around 2 10 seconds. advise choosing range wide narrow peakwidth parameter can lead false positives negatives. determine ppm, deeper analysis dataset needed. clarified ppm depends instrument, users necessarily input vendor-advertised ppm. ’s determine accurately possible: following steps involve generating highly restricted MS area single mass peak per spectrum, representing cystine ion. m/z peaks extracted, absolute difference calculated finally expressed ppm. therefore, choose value close maximum within range parameter ppm, .e., 15 ppm. 
can now perform chromatographic peak detection adapted settings EICs. important note , properly estimate background noise, sufficient data points outside chromatographic peak need present. generally problem peak detection performed full LC-MS data set, peak detection EICs retention time range EIC needs sufficiently wide. function fails find peak EIC, initial troubleshooting step increase range. Additionally, signal--noise threshold snthresh reduced peak detection EICs, within small retention time range, enough signal present properly estimate background noise. Finally, case MS1 data points per peaks, setting CentWave’s advanced parameter extendLengthMSW TRUE can help peak detection. customized parameters, chromatographic peak detected sample. , use plot() function EICs visualize results. can see peak seems detected sample ions. indicates custom settings seem thus suitable dataset. now proceed apply entire dataset, extracting EICs ions evaluate confirm chromatographic peak detection worked expected. Note: revert value parameter snthresh default, , mentioned , background noise estimation reliable performed full data set. Parameter chunkSize findChromPeaks() defines number data files loaded memory processed simultaneously. parameter thus allows fine-tune memory demand well performance chromatographic peak detection step. plot EICs two selected internal standards evaluate chromatographic peak detection results. Peaks seem detected properly samples ions. 
indicates peak detection process entire dataset successful.","code":"#' Use default Centwave parameter param <- CentWaveParam() #' Look at the default parameters param ## Object of class: CentWaveParam ## Parameters: ## - ppm: [1] 25 ## - peakwidth: [1] 20 50 ## - snthresh: [1] 10 ## - prefilter: [1] 3 100 ## - mzCenterFun: [1] \"wMean\" ## - integrate: [1] 1 ## - mzdiff: [1] -0.001 ## - fitgauss: [1] FALSE ## - noise: [1] 0 ## - verboseColumns: [1] FALSE ## - roiList: list() ## - firstBaselineCheck: [1] TRUE ## - roiScales: numeric(0) ## - extendLengthMSW: [1] FALSE ## - verboseBetaColumns: [1] FALSE #' Evaluate for Cystine cystine_test <- findChromPeaks(eic_cystine, param = param) chromPeaks(cystine_test) ## rt rtmin rtmax into intb maxo sn row column #' Evaluate for Methionine met_test <- findChromPeaks(eic_met, param = param) chromPeaks(met_test) ## rt rtmin rtmax into intb maxo sn row column #' Restrict the data to signal from cystine in the first sample cst <- data[1L] |> spectra() |> filterRt(rt = c(208, 218)) |> filterMzRange(mz = fData(eic_cystine)[\"cystine_13C_15N\", c(\"mzmin\", \"mzmax\")]) #' Show the number of peaks per m/z filtered spectra lengths(cst) ## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 #' Calculate the difference in m/z values between scans mz_diff <- cst |> mz() |> unlist() |> diff() |> abs() #' Express differences in ppm range(mz_diff * 1e6 / mean(unlist(mz(cst)))) ## [1] 0.08829605 14.82188728 #' Parameters adapted for chromatographic peak detection on EICs. 
param <- CentWaveParam(peakwidth = c(1, 8), ppm = 15, integrate = 2, snthresh = 2) #' Evaluate on the cystine ion cystine_test <- findChromPeaks(eic_cystine, param = param) chromPeaks(cystine_test) ## rt rtmin rtmax into intb maxo sn row column ## [1,] 209.251 207.577 212.878 4085.675 2911.376 2157.459 4 1 1 ## [2,] 209.251 206.182 213.995 24625.728 19074.407 12907.487 4 1 2 ## [3,] 209.252 207.020 214.274 19467.836 14594.041 9996.466 4 1 3 ## [4,] 209.251 207.577 212.041 4648.229 3202.617 2458.485 3 1 4 ## [5,] 208.974 206.184 213.159 23801.825 18126.978 11300.289 3 1 5 ## [6,] 209.250 207.018 213.714 25990.327 21036.768 13650.329 5 1 6 ## [7,] 209.252 207.857 212.879 4528.767 3259.039 2445.841 4 1 7 ## [8,] 209.252 207.299 213.995 23119.449 17274.140 12153.410 4 1 8 ## [9,] 208.972 206.740 212.878 28943.188 23436.119 14451.023 4 1 9 ## [10,] 209.252 207.578 213.437 4470.552 3065.402 2292.881 4 1 10 #' Evaluate on the methionine ion met_test <- findChromPeaks(eic_met, param = param) chromPeaks(met_test) ## rt rtmin rtmax into intb maxo sn row column ## [1,] 159.867 157.913 162.378 20026.61 14715.42 12555.601 4 1 1 ## [2,] 160.425 157.077 163.215 16827.76 11843.39 8407.699 3 1 2 ## [3,] 160.425 157.356 163.215 18262.45 12881.67 9283.375 3 1 3 ## [4,] 159.588 157.635 161.820 20987.72 15424.25 13327.811 4 1 4 ## [5,] 160.985 156.799 163.217 16601.72 11968.46 10012.396 4 1 5 ## [6,] 160.982 157.634 163.214 17243.24 12389.94 9150.079 4 1 6 ## [7,] 159.867 158.193 162.099 21120.10 16202.05 13531.844 3 1 7 ## [8,] 160.426 157.356 162.937 18937.40 13739.73 10336.000 3 1 8 ## [9,] 160.704 158.472 163.215 17882.21 12299.43 9395.548 3 1 9 ## [10,] 160.146 157.914 162.379 20275.80 14279.50 12669.821 3 1 10 #' Using the same settings, but with default snthresh param <- CentWaveParam(peakwidth = c(1, 8), ppm = 15, integrate = 2) data <- findChromPeaks(data, param = param, chunkSize = 5) #' Update EIC internal standard object eics_is_noprocess <- eic_is eic_is <- 
chromatogram(data, rt = as.matrix(intern_standard[, c(\"rtmin\", \"rtmax\")]), mz = as.matrix(intern_standard[, c(\"mzmin\", \"mzmax\")])) fData(eic_is) <- fData(eics_is_noprocess)"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"refine-identified-chromatographic-peaks","dir":"Articles","previous_headings":"Data preprocessing > Chromatographic peak detection","what":"Refine identified chromatographic peaks","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"identification chromatographic peaks using CentWave algorithm can sometimes result artifacts, overlapping split peaks. address issue, refineChromPeaks() function utilized, conjunction MergeNeighboringPeaksParam, aims merging split peaks. show examples CentWave peak detection artifacts. examples pre-selected illustrate necessity next step: cases signal presumably single type ion split two separate chromatographic peaks (indicated vertical line). MergeNeighboringPeaksParam allows combine split peaks. parameters algorithm defined : expandMz: Suggested kept relatively small (0.0015) prevent merging isotopes. expandRt: Usually set approximately half size average retention time width used chromatographic peak detection (case, 2.5 seconds). minProp: Used determine whether candidates merged. Chromatographic peaks overlapping m/z ranges (expanded side expandMz) tail--head distance retention time dimension less 2 * expandRt, signal higher minProp apex intensity chromatographic peak lower intensity, merged. Values parameter small avoid merging closely co-eluting ions, isomers. test settings EICs split peaks. can observe artificially split peaks appropriately merged. Therefore, next apply settings entire dataset. peak merging, column \"merged\" result object’s chromPeakData() data frame can used evaluate chromatographic peaks result represent signal merged, originally identified chromatographic peaks. 
proceeding next preprocessing step generally suggested evaluate results chromatographic peak detection EICs e.g. internal standards compounds/ions known present samples. Additionally, evaluating comparing number identified chromatographic peaks samples data set can help spotting potentially problematic samples. count number chromatographic peaks per sample show numbers table. Table 4.Samples number identified chromatographic peaks. similar number chromatographic peaks identified within various samples data set. Additional options evaluate results chromatographic peak detection can implemented using plotChromPeaks() function summarizing results using base R commands.","code":"#' set up the parameter param <- MergeNeighboringPeaksParam(expandRt = 2.5, expandMz = 0.0015, minProp = 0.75) #' Perform the peak refinement on the EICs eics <- refineChromPeaks(eics, param = param) plot(eics) #' Apply on whole dataset data <- refineChromPeaks(data, param = param, chunkSize = 5) chromPeakData(data)$merged |> table() ## ## FALSE TRUE ## 79908 9274 eics_is_chrompeaks <- eic_is eic_is <- chromatogram(data, rt = as.matrix(intern_standard[, c(\"rtmin\", \"rtmax\")]), mz = as.matrix(intern_standard[, c(\"mzmin\", \"mzmax\")])) fData(eic_is) <- fData(eics_is_chrompeaks) eic_cystine <- eic_is[\"cystine_13C_15N\", ] eic_met <- eic_is[\"methionine_13C_15N\", ] #' Count the number of peaks per sample and summarize them in a table. 
data.frame(sample_name = sampleData(data)$sample_name, peak_count = as.integer(table(chromPeaks(data)[, \"sample\"]))) |> pandoc.table( style = \"rmarkdown\", caption = \"Table 4.Samples and number of identified chromatographic peaks.\")"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"retention-time-alignment","dir":"Articles","previous_headings":"Data preprocessing","what":"Retention time alignment","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"Despite using chromatographic settings conditions retention time shifts unavoidable. Indeed, performance instrument can change time, example due small variations environmental conditions, temperature pressure. shifts generally small samples measured within batch/measurement run, can considerable data experiment acquired across longer time period. evaluate presence shift extract plot BPC QC samples. QC samples representing sample (pool) measured regular intervals measurement run experiment measured day. Still, small shifts can observed, especially region 100 150 seconds. facilitate proper correspondence signals across samples (hence definition LC-MS features), essential minimize differences retention times. Theoretically, proceed two steps: first select QC samples dataset first alignment , using -called anchor peaks. way can assume linear shift time, since always measuring sample different regular time intervals. Despite external QCs data set, still use subset-based alignment assuming retention time shifts independent different sample matrix (human serum plasma) instead mostly instrument-dependent. Note also possible manually specify anchor peaks, respectively retention times align data set external, reference, data set. information provided vignettes xcms package. calculating much adjust retention time samples, apply shift also study samples. 
xcms retention time alignment can performed using adjustRtime() function alignment algorithm. example use PeakGroups method (Smith et al. 2006) performs alignment minimizing differences retention times set anchor peaks different samples. method requires initial correspondence analysis match/group chromatographic peaks across samples algorithm selects anchor peaks alignment. initial correspondence, use PeakDensity approach (Smith et al. 2006) groups chromatographic peaks similar m/z retention time LC-MS features. parameters algorithm, can configured using PeakDensityParam object, sampleGroups, minFraction, binSize, ppm bw. binSize, ppm bw allow specify similar chromatographic peaks’ m/z retention time values need consider grouping feature. binSize ppm define required similarity m/z values. Within m/z bin (defined binSize ppm) areas along retention time axis high chromatographic peak density (considering peaks samples) identified, chromatographic peaks within regions considered grouping feature. High density areas identified using base R density() function, bw parameter: higher values define wider retention time areas, lower values require chromatographic peaks similar retention times. parameter can seen black line plot , corresponding smoothness density curve. Whether candidate peaks get grouped feature depends also parameters sampleGroups minFraction: sampleGroups provide, sample, sample group belongs . minFraction expected value 0 1 defining proportion samples within least one sample groups (defined sampleGroups) chromatographic peaks detected group feature. initial correspondence, parameters don’t need fully optimized. Selection dataset-specific parameter values described detail next section. dataset, use small values binSize ppm , importantly, also parameter bw, since data set ultra high performance (UHP) LC setup used [anna: maybe field long, don’t see connection UHPLC choice small values parameters. something empirical? phili: jo can help ?]. 
minFraction use high value (0.9) ensure features defined chromatographic peaks present almost samples one sample group (can used anchor peaks actual alignment). base alignment later QC samples hence define sampleGroups binary variable grouping samples either study, QC group. PeakGroups-based alignment can next performed using adjustRtime() function PeakGroupsParam parameter object. parameters algorithm : subsetAdjust subset: Allows subset alignment. base retention time alignment QC samples, .e., retention time shifts estimated based repeatedly measured samples. resulting adjustment applied entire data. data sets QC samples (e.g. sample pools) measured repeatedly, strongly suggest use method. Note also subset-based alignment samples ordered injection index (.e., order measured measurement run). minFraction: value 0 1 defining proportion samples (full data set, data subset defined subset) chromatographic peak identified use anchor peak. contrast PeakDensityParam parameter used define proportion within sample group. span: PeakGroups method allows, depending data, adjust regions along retention time axis differently. enable local alignments LOESS function used parameter defines degree smoothing function. Generally, values 0.4 0.6 used, however, suggested evaluate alignment results eventually adapt parameters result satisfactory. perform alignment data set based retention times anchor peaks defined subset QC samples. Alignment adjusted retention times spectra data set, well retention times identified chromatographic peaks. alignment performed, user evaluate results using plotAdjustedRtime() function. function visualizes difference adjusted raw retention time sample y-axis along adjusted retention time x-axis. Dot points represent position used anchor peak along retention time axis. optimal alignment areas along retention time axis, anchor peaks scattered retention time dimension. 
samples present data set measured within measurement run, resulting small retention time shifts. Therefore, little adjustments needed performed (shifts maximum 1 second can seen plot ). Generally, magnitude adjustment seen plots match expectation analyst. can also compare BPC alignment. get original data, .e. raw retention times, can use dropAdjustedRtime() function: largest shift can observed retention time range 120 130s. Apart retention time range, little changes can observed. next evaluate impact alignment EICs selected internal standards. thus first extract ion chromatograms alignment. can now evaluate alignment effect test ions. plot EICs alignment isotope labeled cystine methionine. non-endogenous cystine ion already well aligned difference minimal. methionine ion, however, shows improvement alignment. addition visual inspection results, also evaluate impact alignment comparing variance retention times internal standards alignment. end, first need identify chromatographic peaks sample m/z retention time close expected values internal standard. use matchValues() function MetaboAnnotation package (Rainer et al. 2022) using MzRtParam method identify chromatographic peaks similar m/z (+/- 50 ppm) retention time (+/- 10 seconds) internal standard’s values. parameters mzColname rtColname specify column names query () target (chromatographic peaks) contain m/z retention time values match entities. perform matching separately sample. internal standard every sample, use filterMatches() function SingleMatchParam() parameter select chromatographic peak highest intensity. now internal standard ID chromatographic peak sample likely represents signal ion. can now extract retention times chromatographic peaks alignment. 
can now evaluate impact alignment retention times internal standards across full data set: average, variation retention times internal standards across samples slightly reduced alignment.[Phili: actually don’t think can say plot]","code":"#' Get QC samples QC_samples <- sampleData(data)$phenotype == \"QC\" #' extract BPC data[QC_samples] |> chromatogram(aggregationFun = \"max\", chromPeaks = \"none\") |> plot(col = col_phenotype[\"QC\"], main = \"BPC of QC samples\") |> grid() # Initial correspondence analysis param <- PeakDensityParam(sampleGroups = sampleData(data)$phenotype == \"QC\", minFraction = 0.9, binSize = 0.01, ppm = 10, bw = 2) data <- groupChromPeaks(data, param = param) plotChromPeakDensity( eic_cystine, param = param, col = paste0(col_sample, \"80\"), peakCol = col_sample[chromPeaks(eic_cystine)[, \"sample\"]], peakBg = paste0(col_sample[chromPeaks(eic_cystine)[, \"sample\"]], 20), peakPch = 16) #' Define parameters of choice subset <- which(sampleData(data)$phenotype == \"QC\") param <- PeakGroupsParam(minFraction = 0.9, extraPeaks = 50, span = 0.5, subsetAdjust = \"average\", subset = subset) #' Perform the alignment data <- adjustRtime(data, param = param) #' Visualize alignment results plotAdjustedRtime(data, col = paste0(col_sample, 80), peakGroupsPch = 1) grid() legend(\"topright\", col = col_phenotype, legend = names(col_phenotype), lty = 1, bty = \"n\") #' Get data before alignment data_raw <- dropAdjustedRtime(data) #' Apply the adjusted retention time to our dataset data <- applyAdjustedRtime(data) #' Plot the BPC before and after alignment par(mfrow = c(2, 1), mar = c(2, 1, 1, 0.5)) chromatogram(data_raw, aggregationFun = \"max\", chromPeaks = \"none\") |> plot(main = \"BPC before alignment\", col = paste0(col_sample, 80)) grid() legend(\"topright\", col = col_phenotype, legend = names(col_phenotype), lty = 1, bty = \"n\", horiz = TRUE) chromatogram(data, aggregationFun = \"max\", chromPeaks = \"none\") |> plot(main = \"BPC after 
alignment\", col = paste0(col_sample, 80)) grid() legend(\"topright\", col = col_phenotype, legend = names(col_phenotype), lty = 1, bty = \"n\", horiz = TRUE) #' Store the EICs before alignment eics_is_refined <- eic_is #' Update the EICs eic_is <- chromatogram(data, rt = as.matrix(intern_standard[, c(\"rtmin\", \"rtmax\")]), mz = as.matrix(intern_standard[, c(\"mzmin\", \"mzmax\")])) fData(eic_is) <- fData(eics_is_refined) #' Extract the EICs for the test ions eic_cystine <- eic_is[\"cystine_13C_15N\"] eic_met <- eic_is[\"methionine_13C_15N\"] par(mfrow = c(2, 2), mar = c(4, 4.5, 2, 1)) old_eic_cystine <- eics_is_refined[\"cystine_13C_15N\"] plot(old_eic_cystine, main = \"Cystine before alignment\", peakType = \"none\", col = paste0(col_sample, 80)) grid() abline(v = intern_standard[\"cystine_13C_15N\", \"RT\"], col = \"red\", lty = 3) old_eic_met <- eics_is_refined[\"methionine_13C_15N\"] plot(old_eic_met, main = \"Methionine before alignment\", peakType = \"none\", col = paste0(col_sample, 80)) grid() abline(v = intern_standard[\"methionine_13C_15N\", \"RT\"], col = \"red\", lty = 3) plot(eic_cystine, main = \"Cystine after alignment\", peakType = \"none\", col = paste0(col_sample, 80)) grid() abline(v = intern_standard[\"cystine_13C_15N\", \"RT\"], col = \"red\", lty = 3) legend(\"topright\", col = col_phenotype, legend = names(col_phenotype), lty = 1, bty = \"n\") plot(eic_met, main = \"Methionine after alignment\", peakType = \"none\", col = paste0(col_sample, 80)) grid() abline(v = intern_standard[\"methionine_13C_15N\", \"RT\"], col = \"red\", lty = 3) #' Extract the matrix with all chromatographic peaks and add a column #' with the ID of the chromatographic peak chrom_peaks <- chromPeaks(data) |> as.data.frame() chrom_peaks$peak_id <- rownames(chrom_peaks) #' Define the parameters for the matching and filtering of the matches p_1 <- MzRtParam(ppm = 50, toleranceRt = 10) p_2 <- SingleMatchParam(duplicates = \"top_ranked\", column = \"target_maxo\", 
decreasing = TRUE) #' Iterate over samples and identify for each the chromatographic peaks #' with similar m/z and retention time than the onse from the internal #' standard, and extract among them the ID of the peaks with the #' highest intensity. intern_standard_peaks <- lapply(seq_along(data), function(i) { tmp <- chrom_peaks[chrom_peaks[, \"sample\"] == i, , drop = FALSE] mtch <- matchValues(intern_standard, tmp, mzColname = c(\"mz\", \"mz\"), rtColname = c(\"RT\", \"rt\"), param = p_1) mtch <- filterMatches(mtch, p_2) mtch$target_peak_id }) |> do.call(what = cbind) #' Define the index of the selected chromatographic peaks in the #' full chromPeaks matrix idx <- match(intern_standard_peaks, rownames(chromPeaks(data))) #' Extract the raw retention times for these rt_raw <- chromPeaks(data_raw)[idx, \"rt\"] |> matrix(ncol = length(data_raw)) #' Extract the adjusted retention times for these rt_adj <- chromPeaks(data)[idx, \"rt\"] |> matrix(ncol = length(data_raw)) list(all_raw = rowSds(rt_raw, na.rm = TRUE), all_adj = rowSds(rt_adj, na.rm = TRUE) ) |> vioplot(ylab = \"sd(retention time)\") grid()"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"correspondence","dir":"Articles","previous_headings":"Data preprocessing","what":"Correspondence","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"briefly touched subject correspondence determine anchor peaks alignment. Generally, goal correspondence analysis identify chromatographic peaks originate types ions samples experiment group LC-MS features. point, proper configuration parameter bw crucial. illustrate sensible choices parameter’s value can made. use plotChromPeakDensity() function simulate correspondence analysis default values PeakGroups extracted ion chromatograms two selected isotope labeled ions. 
plot shows EIC top panel, apex position chromatographic peaks different samples (y-axis), along retention time (x-axis) lower panel. Grouping peaks depends smoothness previously mentioned density curve can configured parameter bw. seen , smoothness high properly group features. looking default parameters, can observe indeed, bw parameter set bw = 30, high modern UHPLC-MS setups. reduce value parameter 1.8 evaluate impact. can observe peaks now grouped accurately single feature test ion. important parameters optimized : binSize: data generated high resolution MS instrument, thus select low value parameter. ppm: TOF instruments, suggested use value ppm larger 0 accommodate higher measurement error instrument larger m/z values. minFraction: set minFraction = 0.75, hence defining features chromatographic peak identified least 75% samples one sample groups. sampleGroups: use information available sampleData’s \"phenotype\" column. correspondence analysis suggested evaluate results selected EICs. extract signal m/z similar isotope labeled methionine larger retention time range. Importantly, show actual correspondence results, set simulate = FALSE plotChromPeakDensity() function. hoped, signal two different ions now grouped separate features. Generally, correspondence results evaluated extracted chromatograms. Another interesting information look distribution features along retention time axis. Table 5. Distribution features along retention time axis (seconds). results correspondence analysis now stored, along results preprocessing steps, within XcmsExperiment result object. correspondence results, .e., definition LC-MS features, can extracted using featureDefinitions() function. data frame provides average m/z retention time (columns \"mzmed\" \"rtmed\") characterize LC-MS feature. Column \"peakidx\" contains indices chromatographic peaks assigned feature. 
actual abundances features, represent also final preprocessing results, can extracted featureValues() function: can note features (e.g. F0003 F0006) missing values samples. expected certain degree samples features, respectively ions, need present. address next section.","code":"#' Default parameter for the grouping and apply them to the test ions BPC param <- PeakDensityParam(sampleGroups = sampleData(data)$phenotype, bw = 30) param ## Object of class: PeakDensityParam ## Parameters: ## - sampleGroups: [1] \"QC\" \"CVD\" \"CTR\" \"QC\" \"CTR\" \"CVD\" \"QC\" \"CTR\" \"CVD\" \"QC\" ## - bw: [1] 30 ## - minFraction: [1] 0.5 ## - minSamples: [1] 1 ## - binSize: [1] 0.25 ## - maxFeatures: [1] 50 ## - ppm: [1] 0 plotChromPeakDensity( eic_cystine, param = param, col = paste0(col_sample, \"80\"), peakCol = col_sample[chromPeaks(eic_cystine)[, \"sample\"]], peakBg = paste0(col_sample[chromPeaks(eic_cystine)[, \"sample\"]], 20), peakPch = 16) plotChromPeakDensity(eic_met, param = param, col = paste0(col_sample, \"80\"), peakCol = col_sample[chromPeaks(eic_met)[, \"sample\"]], peakBg = paste0(col_sample[chromPeaks(eic_met)[, \"sample\"]], 20), peakPch = 16) #' Updating parameters param <- PeakDensityParam(sampleGroups = sampleData(data)$phenotype, bw = 1.8) plotChromPeakDensity( eic_cystine, param = param, col = paste0(col_sample, \"80\"), peakCol = col_sample[chromPeaks(eic_cystine)[, \"sample\"]], peakBg = paste0(col_sample[chromPeaks(eic_cystine)[, \"sample\"]], 20), peakPch = 16) plotChromPeakDensity(eic_met, param = param, col = paste0(col_sample, \"80\"), peakCol = col_sample[chromPeaks(eic_met)[, \"sample\"]], peakBg = paste0(col_sample[chromPeaks(eic_met)[, \"sample\"]], 20), peakPch = 16) #' Define the settings for the param param <- PeakDensityParam(sampleGroups = sampleData(data)$phenotype, minFraction = 0.75, binSize = 0.01, ppm = 10, bw = 1.8) #' Apply to whole data data <- groupChromPeaks(data, param = param) #' Extract chromatogram for an m/z similar to the 
one of the labeled methionine chr_test <- chromatogram(data, mz = as.matrix(intern_standard[\"methionine_13C_15N\", c(\"mzmin\", \"mzmax\")]), rt = c(145, 200), aggregationFun = \"max\") plotChromPeakDensity( chr_test, simulate = FALSE, col = paste0(col_sample, \"80\"), peakCol = col_sample[chromPeaks(chr_test)[, \"sample\"]], peakBg = paste0(col_sample[chromPeaks(chr_test)[, \"sample\"]], 20), peakPch = 16) # Bin features per RT slices vc <- featureDefinitions(data)$rtmed breaks <- seq(0, max(vc, na.rm = TRUE) + 1, length.out = 15) |> round(0) cuts <- cut(vc, breaks = breaks, include.lowest = TRUE) table(cuts) |> pandoc.table( style = \"rmarkdown\", caption = \"Table 5.Distribution of features along the retention time axis (in seconds.\") #' Definition of the features featureDefinitions(data) |> head() ## mzmed mzmin mzmax rtmed rtmin rtmax npeaks CTR CVD QC ## FT0001 50.98979 50.98949 50.99038 203.6001 203.1181 204.2331 8 1 3 4 ## FT0002 51.05904 51.05880 51.05941 191.1675 190.8787 191.5050 9 2 3 4 ## FT0003 51.98657 51.98631 51.98699 203.1467 202.6406 203.6710 7 0 3 4 ## FT0004 53.02036 53.02009 53.02043 203.2343 202.5652 204.0901 10 3 3 4 ## FT0005 53.52080 53.52051 53.52102 203.1936 202.8490 204.0901 10 3 3 4 ## FT0006 54.01007 54.00988 54.01015 159.2816 158.8499 159.4484 6 1 3 2 ## peakidx ms_level ## FT0001 7702, 16.... 1 ## FT0002 7176, 16.... 1 ## FT0003 7680, 17.... 1 ## FT0004 7763, 17.... 1 ## FT0005 8353, 17.... 1 ## FT0006 5800, 15.... 
1 #' Extract feature abundances featureValues(data, method = \"sum\") |> head() ## MS_QC_POOL_1_POS.mzML MS_A_POS.mzML MS_B_POS.mzML MS_QC_POOL_2_POS.mzML ## FT0001 421.6162 689.2422 NA 481.7436 ## FT0002 710.8078 875.9192 NA 693.6997 ## FT0003 445.5711 613.4410 NA 497.8866 ## FT0004 16994.5260 24605.7340 19766.707 17808.0933 ## FT0005 3284.2664 4526.0531 3521.822 3379.8909 ## FT0006 10681.7476 10009.6602 NA 10800.5449 ## MS_C_POS.mzML MS_D_POS.mzML MS_QC_POOL_3_POS.mzML MS_E_POS.mzML ## FT0001 NA 635.2732 439.6086 570.5849 ## FT0002 781.2416 648.4344 700.9716 1054.0207 ## FT0003 NA 634.9370 449.0933 NA ## FT0004 22780.6683 22873.1061 16965.7762 23432.1252 ## FT0005 4396.0762 4317.7734 3270.5290 4533.8667 ## FT0006 NA 7296.4262 NA 9236.9799 ## MS_F_POS.mzML MS_QC_POOL_4_POS.mzML ## FT0001 579.9360 437.0340 ## FT0002 534.4577 711.0361 ## FT0003 461.0465 232.1075 ## FT0004 22198.4607 16796.4497 ## FT0005 4161.0132 3142.2268 ## FT0006 6817.8785 NA"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"gap-filling","dir":"Articles","previous_headings":"Data preprocessing","what":"Gap filling","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"previously observed missing values (NA) attributed various reasons. Although might represent genuinely missing value, indicating ion (feature) truly present particular sample, also result failure preceding chromatographic peak detection step. crucial able recover missing values latter category much possible reduce eventual need data imputation. next examine prevalent missing values present dataset: can observe substantial number missing values values dataset. Let’s therefore delve process gap-filling. first evaluate example features chromatographic peak detected samples: instances, chromatographic peak identified one two selected samples (red line), hence missing value reported feature particular samples (blue line). 
However, cases, signal measured samples, thus, reporting missing value correct example. signal feature low, likely reason peak detection failed. rescue signal cases, fillChromPeaks() function can used ChromPeakAreaParam approach. method defines m/z-retention time area feature based detected peaks, signal respective ion expected. integrates intensities within area samples missing values feature. reported feature abundance. apply method using default parameters. fillChromPeaks() thus rescue missing data data set. Note , even sample ion present, worst case noise integrated, expected much lower actual chromatographic peak signal. Let’s look previously missing values : gap-filling, also blue colored sample chromatographic peak present peak area reported feature abundance sample. assess effectiveness gap-filling method rescuing signals, can also plot average signal features least one missing value average filled-signal. advisable perform analysis repeatedly measured samples; case, QC samples used. , extract: Feature values detected chromatographic peaks setting filled = FALSE featureValues() call. filled-signal first extracting detected gap-filled abundances replace values detected chromatographic peaks NA. , calculate row averages matrices plot . detected (x-axis) gap-filled (y-axis) values QC samples highly correlated. Especially higher abundances, agreement high, low intensities, can expected, differences higher trending correlation line. , addition, fit linear regression line data summarize results linear regression line slope 1.12 intercept -1.62. 
indicates filled-signal average 1.12 times higher detected signal.","code":"#' Percentage of missing values sum(is.na(featureValues(data))) / length(featureValues(data)) * 100 ## [1] 26.41597 ftidx <- which(is.na(rowSums(featureValues(data)))) fts <- rownames(featureDefinitions(data))[ftidx] farea <- featureArea(data, features = fts[1:2]) chromatogram(data[c(2, 3)], rt = farea[, c(\"rtmin\", \"rtmax\")], mz = farea[, c(\"mzmin\", \"mzmax\")]) |> plot(col = c(\"red\", \"blue\"), lwd = 2) #' Fill in the missing values in the whole dataset data <- fillChromPeaks(data, param = ChromPeakAreaParam(), chunkSize = 5) #' Percentage of missing values after gap-filling sum(is.na(featureValues(data))) / length(featureValues(data)) * 100 ## [1] 5.155492 #' Get only detected signal in QC samples vals_detect <- featureValues(data, filled = FALSE)[, QC_samples] #' Get detected and filled-in signal vals_filled <- featureValues(data)[, QC_samples] #' Replace detected signal with NA vals_filled[!is.na(vals_detect)] <- NA #' Identify features with at least one filled peak has_filled <- is.na(rowSums(vals_detect)) #' Calculate row averages for features with missing values avg_detect <- rowMeans(vals_detect[has_filled, ], na.rm = TRUE) avg_filled <- rowMeans(vals_filled[has_filled, ], na.rm = TRUE) #' Plot the values against each other (in log2 scale) plot(log2(avg_detect), log2(avg_filled), xlim = range(log2(c(avg_detect, avg_filled)), na.rm = TRUE), ylim = range(log2(c(avg_detect, avg_filled)), na.rm = TRUE), pch = 21, bg = \"#00000020\", col = \"#00000080\") grid() abline(0, 1) #' fit a linear regression line to the data l <- lm(log2(avg_filled) ~ log2(avg_detect)) summary(l) ## ## Call: ## lm(formula = log2(avg_filled) ~ log2(avg_detect)) ## ## Residuals: ## Min 1Q Median 3Q Max ## -6.8176 -0.3807 0.1725 0.5492 6.7504 ## ## Coefficients: ## Estimate Std. 
Error t value Pr(>|t|) ## (Intercept) -1.62359 0.11545 -14.06 <2e-16 *** ## log2(avg_detect) 1.11763 0.01259 88.75 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.9366 on 2846 degrees of freedom ## (846 observations deleted due to missingness) ## Multiple R-squared: 0.7346, Adjusted R-squared: 0.7345 ## F-statistic: 7877 on 1 and 2846 DF, p-value: < 2.2e-16"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"preprocessing-results","dir":"Articles","previous_headings":"Data preprocessing","what":"Preprocessing results","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"final results LC-MS data preprocessing stored within XcmsExperiment object. includes identified chromatographic peaks, alignment results, well correspondence results. addition, guarantee reproducibility, result object keeps track performed processing steps, including individual parameter objects used configure . processHistory() function returns list various applied processing steps chronological order. , extract information first step performed preprocessing. processParam() function used extract actual parameter class used configure processing step. final result whole LC-MS data preprocessing two-dimensional matrix abundances -called LC-MS features samples. Note stage analysis features characterized m/z retention time don’t yet information metabolite feature represent. seen , feature matrix can extracted featureValues() function corresponding feature characteristics (.e., m/z retention time values) using featureDefinitions() function. Thus, two arrays extracted xcms result object used/imported analysis packages processing. example also exported tab delimited text files, used external tool, used, also MS2 spectra available, feature-based molecular networking GNPS analysis environment (Nothias et al. 2020). 
processing R, reference link raw MS data required, suggested extract xcms preprocessing result using quantify() function SummarizedExperiment object, Bioconductor’s default container data biological assays/experiments. simplifies integration Bioconductor analysis packages. quantify() function takes parameters featureValues() function, thus, call extract SummarizedExperiment detected, gap-filled, feature abundances: Sample identifications xcms result’s sampleData() now available colData() (column, sample annotations) featureDefinitions() rowData() (row, feature annotations). feature values added first assay() SummarizedExperiment even processing history available object’s metadata(). SummarizedExperiment supports multiple assays, numeric matrices dimensions. thus add detected gap-filled feature abundances additional assay SummarizedExperiment. Feature abundances can extracted assay() function. extract first 6 lines detected gap-filled feature abundances: advantage, addition container full preprocessing results also possibility easy intuitive creation data subsets ensuring data integrity. example easy subset full data selection features /samples: XcmsExperiment object can also saved later use using storeResults() function. data can exported different formats, enable easier integration non-R-based software. Currently, possible export data R-specific RData format (separate) plain text files. Export community-developed open mzTab-M format currently developed supported future. 
export xcms result object R’s default binary format object serialization.","code":"#' Check first step of the process history processHistory(data)[[1]] ## Object of class \"XProcessHistory\" ## type: Peak detection ## date: Wed Sep 25 16:44:55 2024 ## info: ## fileIndex: 1,2,3,4,5,6,7,8,9,10 ## Parameter class: CentWaveParam ## MS level(s) 1 #' Extract results as a SummarizedExperiment res <- quantify(data, method = \"sum\", filled = FALSE) res ## class: SummarizedExperiment ## dim: 9068 10 ## metadata(6): '' '' ... '' '' ## assays(1): raw ## rownames(9068): FT0001 FT0002 ... FT9067 FT9068 ## rowData names(11): mzmed mzmin ... QC ms_level ## colnames(10): MS_QC_POOL_1_POS.mzML MS_A_POS.mzML ... MS_F_POS.mzML ## MS_QC_POOL_4_POS.mzML ## colData names(11): sample_name derived_spectra_data_file ... phenotype ## injection_index assays(res)$raw_filled <- featureValues(data, method = \"sum\", filled = TRUE ) #' Different assay in the SummarizedExperiment object assayNames(res) ## [1] \"raw\" \"raw_filled\" assay(res, \"raw_filled\") |> head() ## MS_QC_POOL_1_POS.mzML MS_A_POS.mzML MS_B_POS.mzML MS_QC_POOL_2_POS.mzML ## FT0001 421.6162 689.2422 411.3295 481.7436 ## FT0002 710.8078 875.9192 457.5920 693.6997 ## FT0003 445.5711 613.4410 277.5022 497.8866 ## FT0004 16994.5260 24605.7340 19766.7069 17808.0933 ## FT0005 3284.2664 4526.0531 3521.8221 3379.8909 ## FT0006 10681.7476 10009.6602 9599.9701 10800.5449 ## MS_C_POS.mzML MS_D_POS.mzML MS_QC_POOL_3_POS.mzML MS_E_POS.mzML ## FT0001 314.7567 635.2732 439.6086 570.5849 ## FT0002 781.2416 648.4344 700.9716 1054.0207 ## FT0003 425.3774 634.9370 449.0933 556.2544 ## FT0004 22780.6683 22873.1061 16965.7762 23432.1252 ## FT0005 4396.0762 4317.7734 3270.5290 4533.8667 ## FT0006 4792.2390 7296.4262 2382.1788 9236.9799 ## MS_F_POS.mzML MS_QC_POOL_4_POS.mzML ## FT0001 579.9360 437.0340 ## FT0002 534.4577 711.0361 ## FT0003 461.0465 232.1075 ## FT0004 22198.4607 16796.4497 ## FT0005 4161.0132 3142.2268 ## FT0006 6817.8785 6911.5439 
res[1:14, 3:8] ## class: SummarizedExperiment ## dim: 14 6 ## metadata(6): '' '' ... '' '' ## assays(2): raw raw_filled ## rownames(14): FT0001 FT0002 ... FT0013 FT0014 ## rowData names(11): mzmed mzmin ... QC ms_level ## colnames(6): MS_B_POS.mzML MS_QC_POOL_2_POS.mzML ... ## MS_QC_POOL_3_POS.mzML MS_E_POS.mzML ## colData names(11): sample_name derived_spectra_data_file ... phenotype ## injection_index"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"data-normalization","dir":"Articles","previous_headings":"","what":"Data normalization","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"preprocessing, data normalization scaling might need applied remove technical variances data. simple approaches like median scaling can implemented lines R code, advanced normalization algorithms available packages Bioconductor’s preprocessCore. comprehensive workflow “Notame” also propose interesting normalization approach adaptable scalable user dataset (Klåvus et al. 2020). Generally, LC-MS data, bias can categorized three main groups (Broadhurst et al. 2018): Variances introduced sample collection initial processing, can include differences sample amounts. type bias expected sample-specific affect signals sample way. Methods like median scaling, LOESS quantiles normalization can adjust bias. Signal drifts along measurement samples experiment. Reasons drifts can related aging instrumentation used (columns, detector), also changes metabolite abundances characteristics due reactions modifications, oxidation. changes expected affect samples measured later run rather ones measured beginning. reason, bias can play major role large experiments measured long time range usually considered affect individual metabolites (metabolite groups) differently. adjustment, moving average linear regression-based approaches can used. 
latter can example performed using adjust_lm() function MetaboCoreUtils package. Batch-related biases. comprise noise specific larger set samples, can set samples measured one LC-MS measurement run (.e. one analysis plate) samples measured using specific batch reagents. noise assumed affect samples one batch way linear modeling-based approaches can used adjust . Unwanted variation can arise various sources highly dependent experiment. Therefore, data normalization chosen carefully based experimental design, statistical aims, balance accuracy precision achieved use auxiliary information. Sample preparation biases can evaluated using internal standards, depending however also added sample mixes sample processing. Repeated measurements QC samples hand allows estimate correct LC-MS specific biases. Also, proper planning experiment, measurement study samples random order, can largely avoid biases introduced mentioned sources variance. workflow present tools assess data quality evaluate need normalization well options normalization. space reasons able provide solutions adjust possible sources variation.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"initial-quality-assessment","dir":"Articles","previous_headings":"Data normalization","what":"Initial quality assessment","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"principal component analysis (PCA) helpful tool initial, unsupervised, visualization data also provides insights potential quality issues data. order apply PCA measured feature abundances, need however impute (still present) missing values. assume missing values (gap-filling step) represent signal detection limit. cases, missing values can replaced random values sampled uniform distribution, ranging half smallest measured value smallest measured value specific feature. 
uniform distribution defined two parameters (minimum maximum) values equal probability selected. impute missing values approach add resulting data matrix new assay result object.","code":"#' Load preprocessing results ## load(\"SumExp.RData\") ## loadResults(RDataParam(\"data.RData\")) #' Impute missing values using an uniform distribution na_unidis <- function(z) { na <- is.na(z) if (any(na)) { min = min(z, na.rm = TRUE) z[na] <- runif(sum(na), min = min/2, max = min) } z } #' Row-wise impute missing values and add the data as a new assay tmp <- apply(assay(res, \"raw_filled\"), MARGIN = 1, na_unidis) assays(res)$raw_filled_imputed <- t(tmp)"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"principal-component-analysis","dir":"Articles","previous_headings":"Data normalization > Initial quality assessment","what":"Principal Component Analysis","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"PCA powerful tool detecting biases data. dimensionality reduction technique, enables visualization data lower-dimensional space. context LC-MS data, PCA can used identify overall biases batch, sample, injection index, etc. However, important note PCA linear method may able detect biases data. plotting PCA, apply log2 transform, center scale data. log2 transformation applied stabilize variance centering remove dependency absolute abundances. PCA shows clear separation study samples (plasma) QC samples (serum) first principal component (PC1). separation based phenotype visible third principal component (PC3). cases, can better option remove imputed values evaluate PCA . 
especially true imputed values replacing large proportion data.","code":"#' Log2 transform and scale data vals <- assay(res, \"raw_filled_imputed\") |> log2() |> t() |> scale(center = TRUE, scale = TRUE) #' Perform the PCA pca_res <- prcomp(vals, scale = FALSE, center = FALSE) #' Plot the results vals_st <- cbind(vals, phenotype = res$phenotype) pca_12 <- autoplot(pca_res, data = vals_st , colour = 'phenotype', scale = 0) + scale_color_manual(values = col_phenotype) pca_34 <- autoplot(pca_res, data = vals_st, colour = 'phenotype', x = 3, y = 4, scale = 0) + scale_color_manual(values = col_phenotype) grid.arrange(pca_12, pca_34, ncol = 2)"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"intensity-evaluation","dir":"Articles","previous_headings":"Data normalization > Initial quality assessment","what":"Intensity evaluation","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"Global differences feature abundances samples (e.g. due sample-specific biases) can evaluated plotting distribution log2 transformed feature abundances using boxplots violin plots. show number detected chromatographic peaks per sample distribution log2 transformed feature abundances. upper part plot show gap filling steps allowed rescue substantial number NAs allowed us consistent number feature values per sample. consistency aligns assumption every sample similar amount features detected. Additionally observe , average, signal distribution individual samples similar. alternative way evaluate differences abundances samples relative log abundance (RLA) plots (De Livera et al. 2012). RLA value abundance feature sample relative median abundance feature across multiple samples. can discriminate within group across group RLAs, depending whether abundance compared samples within sample group across samples. 
Within group RLA plots assess tightness replicates within groups median close zero low variation around . used across groups, allow compare behavior groups. Generally, -sample differences can easily spotted using RLA plots. calculate visualize within group RLA values using rowRla() function MsCoreUtils package defining parameter f sample groups. RLA plot raw data filled data. Note: outliers drawn. RLA plot , can observe medians samples indeed centered around 0. Exception two CVD samples. Thus, distribution signals across samples comparable, differences seem present require sample normalization.","code":"layout(mat = matrix(1:3, ncol = 1), height = c(0.2, 0.2, 0.8)) par(mar = c(0.2, 4.5, 0.2, 3)) barplot(apply(assay(res, \"raw\"), MARGIN = 2, function(x) sum(!is.na(x))), col = paste0(col_sample, 80), border = col_sample, ylab = \"# detected peaks\", xaxt = \"n\", space = 0.012) grid(nx = NA, ny = NULL) barplot(apply(assay(res, \"raw_filled\"), MARGIN = 2, function(x) sum(!is.na(x))), col = paste0(col_sample, 80), border = col_sample, ylab = \"# detected + filled peaks\", xaxt = \"n\", space = 0.012) grid(nx = NA, ny = NULL) vioplot(log2(assay(res, \"raw_filled\")), xaxt = \"n\", ylab = expression(log[2]~feature~abundance), col = paste0(col_sample, 80), border = col_sample) points(colMedians(log2(assay(res, \"raw_filled\")), na.rm = TRUE), type = \"b\", pch = 1) grid(nx = NA, ny = NULL) legend(\"topright\", col = col_phenotype, legend = names(col_phenotype), lty=1, lwd = 2, xpd = TRUE, ncol = 3, cex = 0.8, bty = \"n\") par(mfrow = c(1, 1), mar = c(3.5, 4.5, 2.5, 1)) boxplot(MsCoreUtils::rowRla(assay(res, \"raw_filled\"), f = res$phenotype, transform = \"log2\"), cex = 0.5, pch = 16, col = paste0(col_sample, 80), ylab = \"RLA\", border = col_sample, boxwex = 1, outline = FALSE, xaxt = \"n\", main = \"Relative log abundance\", cex.main = 1) axis(side = 1, at = seq_len(ncol(res)), labels = colData(res)$sample_name) grid(nx = NA, ny = NULL) abline(h = 0, 
lty=3, lwd = 1, col = \"black\") legend(\"topright\", col = col_phenotype, legend = names(col_phenotype), lty=1, lwd = 2, xpd = TRUE, ncol = 3, cex = 0.8, bty = \"n\")"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"internal-standards","dir":"Articles","previous_headings":"Data normalization > Initial quality assessment","what":"Internal standards","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"Depending added sample mixes, allow evaluation variances introduced subsequent processing analysis steps. present experiment, added original plasma samples sample extraction included also protein lipid removal steps. can therefore used evaluate variances introduced sample extraction subsequent steps, can however used infer conclusions performance differences original sample collection (blood drawing, storage, plasma creation). use matchValues() function identify features representing signal . filter matches keep match single feature using filterMatches() function combination SingleMatchParam. internal standards play crucial role guiding normalization process. Given assumption samples artificially spiked, possess known ground truth—abundance intensity internal standard consistent. difference expected due technical differences/variance. Consequently, normalization aims minimize variation samples internal standard, reinforcing reliability analyses.","code":"# Do we keep IS in normalisation ? Does not give much info... Would simplify a bit #' Creating a column within our IS table intern_standard$feature_id <- NA_character_ #' Identify features matching m/z and RT of internal standards. 
fdef <- featureDefinitions(data) fdef$feature_id <- rownames(fdef) match_intern_standard <- matchValues( query = intern_standard, target = fdef, mzColname = c(\"mz\", \"mzmed\"), rtColname = c(\"RT\", \"rtmed\"), param = MzRtParam(ppm = 50, toleranceRt = 10)) #' Keep only matches with a 1:1 mapping standard to feature. param <- SingleMatchParam(duplicates = \"remove\", column = \"score_rt\", decreasing = TRUE) match_intern_standard <- filterMatches(match_intern_standard, param) intern_standard$feature_id <- match_intern_standard$target_feature_id intern_standard <- intern_standard[!is.na(intern_standard$feature_id), ]"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"between-sample-normalisation","dir":"Articles","previous_headings":"Data normalization","what":"Between sample normalisation","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"previous RLA plot showed data biases need corrected. Therefore, implement -sample normalization using filled-features. process effectively mitigates variations influenced technical issues, differences sample preparation, processing injection methods. instance, employ commonly used technique known median scaling (De Livera et al. 2012).","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"median-scaling","dir":"Articles","previous_headings":"Data normalization > Between sample normalisation","what":"Median scaling","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"method involves computing median sample, followed determining median individual sample medians. ensures consistent median values sample throughout entire data set. Maintaining uniformity average total metabolite abundance across samples crucial effective implementation. 
process aims establish shared baseline central tendency metabolite abundance, mitigating impact sample-specific technical variations. approach fosters robust comparable analysis top features across data set. assumption normalizing based median, known lower sensitivity extreme values, enhances comparability top features ensures consistent average abundance across samples. median scaling calculated imputed non-imputed data, set stored separately within SummarizedExperiment object. approach facilitates testing various normalization strategies maintaining record processing steps undertaken, enabling easy regression previous stages necessary.","code":"#' Compute median and generate normalization factor mdns <- apply(assay(res, \"raw_filled\"), MARGIN = 2, median, na.rm = TRUE ) nf_mdn <- mdns / median(mdns) #' divide dataset by median of median and create a new assay. assays(res)$norm <- sweep(assay(res, \"raw_filled\"), MARGIN = 2, nf_mdn, '/') assays(res)$norm_imputed <- sweep(assay(res, \"raw_filled_imputed\"), MARGIN = 2, nf_mdn, '/')"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"assessing-overall-effectiveness-of-the-normalization-approach","dir":"Articles","previous_headings":"Data normalization","what":"Assessing overall effectiveness of the normalization approach","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"crucial evaluate effectiveness normalization process. can achieved comparing distribution log2 transformed feature abundances normalization. 
Additionally, RLA plots can used assess tightness replicates within groups compare behavior groups.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"principal-component-analysis-1","dir":"Articles","previous_headings":"Data normalization > Assessing overall effectiveness of the normalization approach","what":"Principal Component Analysis","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"Normalization large impact PC1 PC2, separation study groups PC3 seems better difference QC samples lower normalization (see ). PCA plots show normalization process changed overall structure data. separation study QC samples remains . expected results normalization correct biological variance technical.","code":"#' Data before normalization vals_st <- cbind(vals, phenotype = res$phenotype) pca_raw <- autoplot(pca_res, data = vals_st, colour = 'phenotype', scale = 0) + scale_color_manual(values = col_phenotype) #' Data after normalization vals_norm <- apply(assay(res, \"norm\"), MARGIN = 1, na_unidis) |> log2() |> scale(center = TRUE, scale = TRUE) pca_res_norm <- prcomp(vals_norm, scale = FALSE, center = FALSE) vals_st_norm <- cbind(vals_norm, phenotype = res$phenotype) pca_adj <- autoplot(pca_res_norm, data = vals_st_norm, colour = 'phenotype', scale = 0) + scale_color_manual(values = col_phenotype) grid.arrange(pca_raw, pca_adj, ncol = 2) pca_raw <- autoplot(pca_res, data = vals_st , colour = 'phenotype', x = 3, y = 4, scale = 0) + scale_color_manual(values = col_phenotype) pca_adj <- autoplot(pca_res_norm, data = vals_st_norm, colour = 'phenotype', x = 3, y = 4, scale = 0) + scale_color_manual(values = col_phenotype) grid.arrange(pca_raw, pca_adj, ncol = 2)"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"intensity-evaluation-1","dir":"Articles","previous_headings":"Data 
normalization > Assessing overall effectiveness of the normalization approach","what":"Intensity evaluation","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"compare RLA plots -sample normalization evaluate impact data. RLA plot normalization. Note: outliers drawn. normalization process effectively centered data around median medians samples now closer zero.","code":"par(mfrow = c(2, 1), mar = c(3.5, 4.5, 2.5, 1)) boxplot(rowRla(assay(res, \"raw_filled\"), group = res$phenotype), cex = 0.5, pch = 16, ylab = \"RLA\", border = col_sample, col = paste0(col_sample, 80), cex.main = 1, outline = FALSE, xaxt = \"n\", main = \"Raw data\", boxwex = 1) grid(nx = NA, ny = NULL) legend(\"topright\", inset = c(0, -0.2), col = col_phenotype, legend = names(col_phenotype), lty=1, lwd = 2, xpd = TRUE, ncol = 3, cex = 0.7, bty = \"n\") abline(h = 0, lty=3, lwd = 1, col = \"black\") boxplot(rowRla(assay(res, \"norm\"), group = res$phenotype), cex = 0.5, pch = 16, ylab = \"RLA\", border = col_sample, col = paste0(col_sample, 80), boxwex = 1, outline = FALSE, xaxt = \"n\", main = \"Normalized data\", cex.main = 1) axis(side = 1, at = seq_len(ncol(res)), labels = res$sample_name) grid(nx = NA, ny = NULL) abline(h = 0, lty = 3, lwd = 1, col = \"black\")"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"coefficient-of-variation","dir":"Articles","previous_headings":"Data normalization > Assessing overall effectiveness of the normalization approach","what":"Coefficient of variation","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"next evaluate coefficient variation (CV, also referred relative standard deviation RSD) features across samples either QC study samples. QC samples, CV represent technical noise, study samples include also expected biological differences. 
Thus, normalization reduce CV QC samples, slightly reducing CV study samples. CV calculated using rowRsd() function MetaboCoreUtils package. setting mad = TRUE use robust calculation using median absolute deviation instead standard deviation. Table 6. Distribution CV values across samples raw normalized data. table shows distribution CV raw normalized data. first column highlights % data given CV value, e.g. 25% data CV equal lower 0.04557 QC_raw data. anticipated, CV values QCs, reflect technical variance, lower compared study samples, include technical biological variance. Overall, minimal disparity exists raw normalized data, positive indication normalization process introduced bias dataset, also reflects little differences average abundances sample raw data.","code":"index_study <- res$phenotype %in% c(\"CTR\", \"CVD\") index_QC <- res$phenotype == \"QC\" sample_res <- cbind( QC_raw = rowRsd(assay(res, \"raw_filled\")[, index_QC], na.rm = TRUE, mad = TRUE), QC_norm = rowRsd(assay(res, \"norm\")[, index_QC], na.rm = TRUE, mad = TRUE), Study_raw = rowRsd(assay(res, \"raw_filled\")[, index_study], na.rm = TRUE, mad = TRUE), Study_norm = rowRsd(assay(res, \"norm\")[, index_study], na.rm = TRUE, mad = TRUE) ) #' Summarize the values across features res_df <- data.frame( QC_raw = quantile(sample_res[, \"QC_raw\"], na.rm = TRUE), QC_norm = quantile(sample_res[, \"QC_norm\"], na.rm = TRUE), Study_raw = quantile(sample_res[, \"Study_raw\"], na.rm = TRUE), Study_norm = quantile(sample_res[, \"Study_norm\"], na.rm = TRUE) ) cpt <- paste0(\"Table 6. 
Distribution of CV values across samples for the raw and \", \"normalized data.\") pandoc.table(res_df, style = \"rmarkdown\", caption = cpt) save(data, file = \"data_afternorm.RData\") save(res, file = \"SumExp_afternorm.RData\")"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"conclusion-on-normalization","dir":"Articles","previous_headings":"Data normalization > Assessing overall effectiveness of the normalization approach","what":"Conclusion on normalization","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"overall conclusion normalization process little variance present beginning, normalization however able center data around median (shown RLA plot). Given simplicity limited size example dataset, conclude normalization process stage. intricate datasets diverse biases, tailored approach devised. include also approaches adjust signal drifts batch effects. One possible option use linear-model based approach can example applied adjust_lm() function MetaboCoreUtils package.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"quality-control-feature-prefiltering","dir":"Articles","previous_headings":"","what":"Quality control: Feature prefiltering","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"normalizing data can now pre-filter clean data performing statistical analysis. general, pre-filtering samples features performed remove outliers. copy original result object also keep unfiltered data later comparisons. eliminate features exhibit high variability dataset. Repeatedly measured QC samples typically serve robust basis cleansing datasets allowing identify features excessively high noise. data set external QC samples used, .e. 
pooled samples different collection using slightly different sample matrix, utility filtering somewhat limited. comprehensive description guidelines data filtering untargeted metabolomic studies, please refer (Broadhurst et al. 2018). first restrict data set features chromatographic peak detected least 2/3 samples least one study samples groups. ensures statistical tests carried later study samples performed reliable signal. Also, filter remove features mostly detected QC samples, study samples. filter can performed filterFeatures() function xcms package PercentMissingFilter setting. parameters filter: threshold: defines maximal acceptable percentage samples missing value(s) least one sample groups defined parameter f. f: factor defining sample groups. replacing \"QC\" sample group NA parameter f exclude QC samples evaluation consider study samples. threshold = 40 keep features peak detected 2 3 samples one sample groups. consider detected chromatographic peaks per sample, apply filter \"raw\" assay result object, contains abundance values detected chromatographic peaks (prior gap-filling). Following guidelines stated decided still use QC samples pre-filtering, basis represent similar bio-fluids study samples, thus, anticipate observing relatively similar metabolites affected similar measurement biases. therefore evaluate dispersion ratio (Dratio) (Broadhurst et al. 2018) features data set. accomplish task using function time DratioFilter parameter. filters exist function invite user explore decide best dataset. Dratio filter powerful tool identify features exhibit high variability data, relating variance observed QC samples study samples. setting threshold 0.4, remove features high degree variability QC study samples. example, feature deviation QC higher 40% (threshold = 0.4) deviation study samples removed. filtering step ensures features retained considerably lower technical biological variance. 
Note rowDratio() rowRsd() functions MetaboCoreUtils package used calculate actual numeric values estimates used filtering, e.g. evaluate distribution whole data set identify data set-dependent threshold values. Finally, evaluate number features left filtering steps calculate percentage features removed. dataset reduced 9068 4275 features. remove considerable amount features expected want focus reliable features analysis. rest analysis need separate QC samples study samples. store QC samples separate object later use. addition calculate CV QC samples add additional column rowData() result object. used later prioritize identified significant features e.g. low technical noise. Now data set preprocessed, normalized filtered, can start evaluate distribution data estimate variation due biology.","code":"load(\"SumExp_afternorm.RData\") load(\"data_afternorm.RData\") #' Number of features before filtering nrow(res) ## [1] 9068 #' keep unfiltered object res_unfilt <- res #' Limit features to those with at least two detected peaks in one study group. #' Setting the value for QC samples to NA excludes QC samples from the #' calculation. 
f <- res$phenotype f[f == \"QC\"] <- NA f <- as.factor(f) res <- filterFeatures(res, PercentMissingFilter(f = f, threshold = 40), assay = \"raw\") #' Compute and filter based on the Dratio filter_dratio <- DratioFilter(threshold = 0.4, qcIndex = res$phenotype == \"QC\", studyIndex = res$phenotype != \"QC\", mad = TRUE) res <- filterFeatures(res, filter = filter_dratio, assay = \"norm_imputed\") #' Number of features after analysis nrow(res) ## [1] 4275 #' Percentage left: end/beginning nrow(res)/nrow(res_unfilt) * 100 ## [1] 47.1438 res_qc <- res[, res$phenotype == \"QC\"] res <- res[, res$phenotype != \"QC\"] #' Calculate the QC's CV and add as feature variable to the data set rowData(res)$qc_cv <- assay(res_qc, \"norm\") |> rowRsd()"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"differential-abundance-analysis","dir":"Articles","previous_headings":"","what":"Differential abundance analysis","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"normalization quality control, next step identify features differentially abundant study groups. crucial step allows us identify potential biomarkers metabolites associated study groups. various approaches methods available identification features interest. workflow use multiple linear regression analysis identify features significantly difference abundances CVD CTR study group. performing tests evaluate similarities study samples using PCA (excluding QC samples avoid influencing results). samples clearly separate study group PCA indicating differences metabolite profiles two groups. However, drives separation PC1 clear. evaluate whether explained available variable study, .e., age: According PCA , PC1 seem related age. Even variance data set can’t explain stage, proceed (supervised) statistical tests identify features interest. 
compute linear models metabolite explaining observed feature abundance available study variables. also use base R function lm(), utilize R Biocpkg(\"limma\") package conduct differential abundance analysis: moderated test statistics (Smyth 2004) provided package specifically well suited experiments limited number replicates. tests use linear model ~ phenotype + age, hence explaining abundances one metabolite accounting study group assignment age individual. analysis might benefit inclusion study covariate associated PC2 explaining variance seen principal component, present analysis participant’s age disease association provided. define design study model.matrix() function fit feature-wise linear models log2-transformed abundances using lmFit() function. P-values significance association calculated using eBayes() function, also performs empirical Bayes-based robust estimation standard errors. See also excellent vignette/user guide limma package examples details linear model procedure. linear models fitted, can now proceed extract results. create data frame containing coefficients, raw adjusted p-values (applying Benjamini-Hochberg correction, .e., method = \"BH\" improved control false discovery rate), average intensity signals CVD CTR samples, indication whether feature deemed significant . consider metabolites adjusted p-value smaller 0.05 significant, also include (absolute) difference abundances cut-criteria. last, add differential abundance results result object’s rowData(). can now proceed visualize distribution raw adjusted p-values. Distribution raw (left) adjusted p-values (right). histograms show distribution raw adjusted p-values. Except enrichment small p-values, raw p-values (less) uniformly distributed, indicates absence strong systematic biases data. adjusted p-values conservative account multiple testing; important fit linear model feature therefore perform large number tests leads high chance false positive findings. 
see features low p-values, indicating likely significantly different two study groups. plot adjusted p-values log2 fold change (average) abundances. volcano plot allow us visualize features significantly different two study groups. highlighted blue color plot . Volcano plot showing analysis results. interesting features top corners volcano plot (.e., features large difference abundance groups small p-value). significant features negative coefficient (log2 fold change value) indicating abundance lower CVD samples compared CTR samples. features listed, along average difference (log2) abundance compared groups, adjusted p-values, average (log2) abundance sample group RSD (CV) QC samples table . Table 7.Features significant differences abundances. (continued ) visualize EICs significant features evaluate (raw) signal. restrict MS data set study samples. Parameters keepFeatures = TRUE: ensures identified features retained `subset object. peakBg: defines (background) color individual chromatographic peak EIC object. EICs significant features show clear single peak. intensities (already observed ) much larger CTR CVD samples. exception second feature (second EIC top row), intensities significant features however generally low. might make challenging identify using LC-MS/MS setup.","code":"col_sample <- col_phenotype[res$phenotype] #' Log transform and scale the data for PCA analysis vals <- assay(res, \"norm_imputed\") |> t() |> log2() |> scale(center = TRUE, scale = TRUE) pca_res <- prcomp(vals, scale = FALSE, center = FALSE) vals_st <- cbind(vals, phenotype = res$phenotype) autoplot(pca_res, data = vals_st , colour = 'phenotype', scale = 0) + scale_color_manual(values = col_phenotype) vals_st <- cbind(vals, age = res$age) autoplot(pca_res, data = vals_st , colour = 'age', scale = 0) #' Define the linear model to be applied to the data p.cut <- 0.05 # cut-off for significance. 
m.cut <- 0.5 # cut-off for log2 fold change age <- res$age phenotype <- factor(res$phenotype) design <- model.matrix(~ phenotype + age) #' Fit the linear model to the data, explaining metabolite #' concentrations by phenotype and age. fit <- lmFit(log2(assay(res, \"norm_imputed\")), design = design) fit <- eBayes(fit) #' Compile a result data frame tmp <- data.frame( coef.CVD = fit$coefficients[, \"phenotypeCVD\"], pvalue.CVD = fit$p.value[, \"phenotypeCVD\"], adjp.CVD = p.adjust(fit$p.value[, \"phenotypeCVD\"], method = \"BH\"), avg.CVD = rowMeans( log2(assay(res, \"norm_imputed\")[, res$phenotype == \"CVD\"])), avg.CTR = rowMeans( log2(assay(res, \"norm_imputed\")[, res$phenotype == \"CTR\"])) ) tmp$significant.CVD <- tmp$adjp.CVD < 0.05 #' Add the results to the object's rowData rowData(res) <- cbind(rowData(res), tmp) #' Restrict the raw data to study samples. data_study <- data[sampleData(data)$phenotype != \"QC\", keepFeatures = TRUE] #' Extract EICs for the significant features eic_sign <- featureChromatograms( data_study, features = rownames(tab), expandRt = 5, filled = TRUE) #' Plot the EICs. plot(eic_sign, col = col_sample, peakBg = paste0(col_sample[chromPeaks(eic_sign)[, \"sample\"]], 40)) legend(\"topright\", col = col_phenotype, legend = names(col_phenotype), lty = 1) save(data, file = \"data_after_DA.RData\") save(res, file = \"Sum_Exp_afterDA.RData\")"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"annotation","dir":"Articles","previous_headings":"","what":"Annotation","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"now identified features significant differences abundances two study groups. provide information metabolic pathways differentiate affected healthy individuals might hence also serve biomarkers. However, stage analysis know compounds/metabolites actually represent. thus need now annotate signals. 
Annotation can performed different level confidence Schymanski et al. (2014). lowest level annotation, highest rate false positive hits, bases features m/z ratios. Higher levels annotations employ fragment spectra (MS2 spectra) ions interest requiring however acquisition additional data. section, demonstrate multiple ways annotate significant features using functionality provided Bioconductor packages. Alternative approaches external software tools, may better suited, also discussed later section.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"ms1-based-annotation","dir":"Articles","previous_headings":"Annotation","what":"MS1-based annotation","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"data set acquired using LC-MS setup features thus characterized m/z retention times. retention time LC-setup-specific , without prior data/knowledge provide little information features’ identity. Modern MS instruments high accuracy m/z values therefore reliable estimates compound ion’s mass--charge ratio. first approach, use features’ m/z values match reference values, .e., exact masses chemical compounds provided reference database, case MassBank database. full MassBank data re-distributed Bioconductor’s AnnotationHub resource, simplifies integration reproducible R-based analysis workflows. load resource, list available MassBank data sets/releases load one . MassBank data provided self-contained SQLite database data can queried accessed CompoundDb Bioconductor package. use compounds() function extract small compound annotations database. MassBank (small compound annotation databases) provides (exact) molecular mass compound. Since almost small compounds neutral natural state, need first converted m/z values allow matching feature’s m/z. 
calculate m/z neutral mass, need assume ion (adduct) might generated measured metabolites employed electro-spray ionization. positive polarity, human serum samples, common ions protonated ([M+H]+), bear addition sodium ([M+Na]+) ammonium ([M+H-NH3]+) ions. match observed m/z values reference values potential ions use matchValues() function Mass2MzParam approach, allows specify types expected ions adducts parameter maximal allowed difference compared values using tolerance ppm parameters. first prepare data.frame significant features, set parameters matching perform comparison query features reference database. resulting Matched object shows 4 6 significant features matched ions compounds MassBank database. extract full result Matched object. Thus, total 237 ions compounds MassBank matched significant features based specified tolerance settings. Many compounds, different structure thus function/chemical property, identical chemical formula thus mass. Matching exclusively m/z features hence result many potentially false positive hits thus considered provide low confidence annotation. additional complication annotation resources, like MassBank, community maintained, contain large amount redundant information. reduce redundancy result table iterate hits feature keep matches unique compounds (identified INCHIKEY). INCHI INCHIKEY combine information compound’s chemical formula structure, different compounds can share chemical formula, different structure thus INCHI. Table 8.MS1 annotation results (continued ) table shows results MS1-based annotation process. can see four significant features matched. matches seem pretty accurate low ppm errors. deduplication performed considerably reduced number hits feature, first still matches ions large number compounds (chemical formula). Considering features’ m/z retention times MS1-based annotation increase annotation confidence, requires additional data, recording retention time pure standard compound LC setup. 
alternative approach might provide better insight annotations help choose different annotations feature evaluate certain chemical properties possible matches. instance, LogP value, available several databases HMDB, provides insight given compound’s polarity. property highly affects interaction analyte column, usually also directly affects separation. Therefore, comparison analyte’s retention time polarity can help rule possible misidentifications. low confidence, MS1-based annotation can provide first candidate annotations confirmed rejected additional analyses.","code":"#' load reference data ah <- AnnotationHub() #' List available MassBank data sets query(ah, \"MassBank\") ## AnnotationHub with 6 records ## # snapshotDate(): 2024-08-01 ## # $dataprovider: MassBank ## # $species: NA ## # $rdataclass: CompDb ## # additional mcols(): taxonomyid, genome, description, ## # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags, ## # rdatapath, sourceurl, sourcetype ## # retrieve records with, e.g., 'object[[\"AH107048\"]]' ## ## title ## AH107048 | MassBank CompDb for release 2021.03 ## AH107049 | MassBank CompDb for release 2022.06 ## AH111334 | MassBank CompDb for release 2022.12.1 ## AH116164 | MassBank CompDb for release 2023.06 ## AH116165 | MassBank CompDb for release 2023.09 ## AH116166 | MassBank CompDb for release 2023.11 #' Load one MassBank release mb <- ah[[\"AH116166\"]] cmps <- compounds(mb, columns = c(\"compound_id\", \"name\", \"formula\", \"exactmass\", \"inchikey\")) head(cmps) ## compound_id formula exactmass inchikey ## 1 1 C27H29NO11 543.1741 AOJJSUZBOXZQNB-UHFFFAOYSA-N ## 2 2 C40H54O4 598.4022 KFNGKYUGHHQDEE-AXWOCEAUSA-N ## 3 3 C10H24N2O2 204.1838 AEUTYOVWOVBAKS-UWVGGRQHSA-N ## 4 4 C16H27NO5 313.1889 LMFKRLGHEKVMNT-UJDVCPFMSA-N ## 5 5 C20H15Cl3N2OS 435.9971 JLGKQTAYUIMGRK-UHFFFAOYSA-N ## 6 6 C15H14O5 274.0841 BWNCKEBBYADFPQ-UHFFFAOYSA-N ## name ## 1 Epirubicin ## 2 Crassostreaxanthin A ## 3 Ethambutol ## 4 Heliotrine ## 5 
Sertaconazole ## 6 (R)Semivioxanthin #' Prepare query data frame rowData(res)$feature_id <- rownames(rowData(res)) res_sig <- res[rowData(res)$significant.CVD, ] #' Setup parameters for the matching param <- Mass2MzParam(adducts = c(\"[M+H]+\", \"[M+Na]+\", \"[M+H-NH3]+\"), tolerance = 0, ppm = 5) #' Perform the matching. mtch <- matchValues(res_sig, cmps, param = param, mzColname = \"mzmed\") mtch ## Object of class Matched ## Total number of matches: 237 ## Number of query objects: 6 (4 matched) ## Number of target objects: 117732 (237 matched) #' Extracting the results mtch_res <- matchedData(mtch, c(\"feature_id\", \"mzmed\", \"rtmed\", \"adduct\", \"ppm_error\", \"target_formula\", \"target_name\", \"target_inchikey\")) mtch_res ## DataFrame with 239 rows and 8 columns ## feature_id mzmed rtmed adduct ppm_error target_formula ## ## FT0371 FT0371 138.055 148.396 [M+H]+ 2.08055 C7H7NO2 ## FT0371 FT0371 138.055 148.396 [M+H]+ 1.93568 C7H7NO2 ## FT0371 FT0371 138.055 148.396 [M+H]+ 1.93568 C7H7NO2 ## FT0371 FT0371 138.055 148.396 [M+H]+ 1.93568 C7H7NO2 ## FT0371 FT0371 138.055 148.396 [M+H]+ 2.08055 C7H7NO2 ## ... ... ... ... ... ... ... ## FT1171 FT1171 229.13 181.0883 [M+Na]+ 3.07708 C12H18N2O ## FT1171 FT1171 229.13 181.0883 [M+Na]+ 3.07708 C12H18N2O ## FT1171 FT1171 229.13 181.0883 [M+Na]+ 3.07708 C12H18N2O ## FT1171 FT1171 229.13 181.0883 [M+Na]+ 3.07708 C12H18N2O ## FT5606 FT5606 560.36 33.5492 NA NA NA ## target_name target_inchikey ## ## FT0371 Benzohydro... VDEUYMSGMP... ## FT0371 Trigonelli... WWNNZCOKKK... ## FT0371 Trigonelli... WWNNZCOKKK... ## FT0371 Trigonelli... WWNNZCOKKK... ## FT0371 Salicylami... SKZKKFZAGN... ## ... ... ... ## FT1171 Isoproturo... PUIYMUZLKQ... ## FT1171 Isoproturo... PUIYMUZLKQ... ## FT1171 Isoproturo... PUIYMUZLKQ... ## FT1171 Isoproturo... PUIYMUZLKQ... 
## FT5606 NA NA rownames(mtch_res) <- NULL #' Keep only info on features that matched - create a utility function for that mtch_res <- split(mtch_res, mtch_res$feature_id) |> lapply(function(x) { lapply(split(x, x$target_inchikey), function(z) { z[which.min(z$ppm_error), ] }) }) |> unlist(recursive = FALSE) |> do.call(what = rbind) #' Display the results mtch_res |> as.data.frame() |> pandoc.table(style = \"rmarkdown\", caption = \"Table 8.MS1 annotation results\")"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"ms2-based-annotation","dir":"Articles","previous_headings":"Annotation","what":"MS2-based annotation","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"MS1 annotation fast efficient method annotate features therefore give first insight compounds significantly different two study groups. However, always accurate. MS2 data can provide higher level confidence annotation process provides, observed fragmentation pattern, information structure compound. MS2 data can generated LC-MS/MS measurement MS2 spectra recorded ions either data dependent acquisition (DDA) data independent acquisition (DIA) mode. Generally, advised include LC-MS/MS runs QC samples randomly selected study samples already acquisition MS1 data used quantification signals. alternative, addition, post-hoc LC-MS/MS acquisition can performed generate MS2 data needed annotation. present experiment, separate LC-MS/MS measurement conducted QC samples selected study samples generate data using inclusion list pre-selected ions. represent features found significantly different CVD CTR samples initial analysis full experiment. use subset second LC-MS/MS data set show data can used MS2-based annotation. differential abundance analysis found features significantly higher abundances CTR samples. Consequently, utilize MS2 data obtained CTR samples annotate significant features. 
load LC-MS/MS data experiment restrict data acquired CTR sample. total 3 LC-MS/MS data files control samples, different collision energy fragment ions. show number MS1 MS2 spectra files. Compared number MS2 spectra, far less MS1 spectra acquired. configuration MS instrument set ensure ions specified inclusion list selected fragmentation, even intensity might low. setting, however, recorded MS2 spectra represent noise. plot shows location precursor ions m/z - retention time plane three files. can see MS2 spectra recorded m/z interest along full retention time range, even actual ions eluting within certain retention time windows. next extract Spectra object MS data data object assign new spectra variable employed collision energy, extract data object sampleData. next filter MS data first restricting MS2 spectra removing mass peaks spectrum intensity lower 5% highest intensity spectrum, assuming low intensity peaks represent background signal. next remove also mass peaks m/z value greater equal precursor m/z ion. puts, later matching reference spectra, weight fragmentation pattern ions avoids hits based precursor m/z peak (hence similar mass compared compounds). last, restrict data spectra least two fragment peaks scale intensities sum 1 spectrum. similarity calculations affected scaling, makes visual comparison fragment spectra easier read. Finally, also speed later comparison spectra reference database, load full MS data memory (changing backend MsBackendMemory) apply processing steps performed data far. Keeping MS data memory performance benefits, generally suggested large data sets. evaluate impact present data set print addition size data object changing backend. thus moderate increase memory demand loading MS data memory (also filtered cleaned MS2 data). proceed match experimental MS2 spectra reference fragment spectra, workflow aim annotate features found significant differential abundance analysis. 
goal thus identify MS2 spectra second (LC-MS/MS) run represent fragments ions features data first (LC-MS) run. approach match MS2 spectra significant features determined earlier based precursor m/z retention time (given acceptable tolerance) feature’s m/z retention time. can easily done using featureArea() function effectively considers actual m/z retention time ranges features’ chromatographic peaks therefore increase chance finding correct match. however also assumes retention times first second run don’t differ much. Alternatively, need align retention times second LC-MS/MS data set first. first extract feature area, .e., m/z retention time ranges, significant features. next identify fragment spectra precursor m/z retention times within ranges. use filterRanges() function allows filter Spectra object using multiple ranges simultaneously. apply function separately feature (row matrix) extract MS2 spectra representing fragmentation information presumed feature’s ions. result apply() call list Spectra, element representing result one feature. exception last feature, multiple MS2 spectra identified. next combine list Spectra single Spectra object using concatenateSpectra() function add additional spectra variable containing respective feature identifier. now Spectra object fragment spectra significant features differential expression analysis. next build reference data need process way query spectra. extract fragment spectra MassBank database, restrict positive polarity data (since experiment acquired positive polarity) perform processing fragment spectra MassBank database. Note switch MsBackendMemory backend hence loading full data reference database memory. positive impact performance subsequent spectra matching, however also increase memory demand present analysis. Now Spectra object second run database spectra prepared, can proceed matching process. use matchSpectra() function MetaboAnnotation package CompareSpectraParam define settings matching. 
following parameters: requirePrecursor = TRUE: Limits spectra similarity calculations fragment spectra similar precursor m/z. tolerance ppm: Defines acceptable difference compared m/z values. relaxed tolerance settings ensure find matches even reference spectra acquired instruments lower accuracy. THRESHFUN: Defines matches report. , keep matches resulting spectra similarity score (calculated normalized dot product (Stein Scott 1994), default similarity function) larger 0.6. Thus, total 315 query MS2 spectra, 16 matched (least) one reference fragment spectrum. restrict results matching spectra extract metadata query target spectra well similarity score (complete list available metadata information can listed colnames() function). Now, query-target pairs spectra similarity higher 0.6. Similar MS1-based annotation also result table contains redundant information: multiple fragment spectra per feature also MassBank contains several fragment spectra compound, measured using differing collision energies MS instruments, different laboratories. thus iterate feature-compound pairs select one highest score. identifier compound, use fragment spectra’s INCHI-key, since compound names MassBank accepted consensus/controlled vocabularies. Table 9.MS2 annotation results. Thus, 6 significant features, one annotated compound based MS2-based approach. many reasons failure find matches features. Although MS2 spectra selected feature, appear represent noise, features, LC-MS/MS run, low MS1 signal recorded, indicating selected sample original compound might (longer) present. Also, reference databases contain predominantly fragment spectra protonated ([M+H]+) ions compounds, features might represent signal types ions result different fragmentation pattern. Finally, fragment spectra compounds interest might also simply present used reference database. Thus, combining information MS1- MS2 based annotation can annotate one feature considerable confidence. 
feature m/z 195.0879 retention time 32 seconds seems ion caffeine. result somewhat disappointing also clearly shows importance proper experimental planning need control potential confounding factors. present experiment, disease-specific biomarker identified, life-style property individuals suffering disease: coffee consumption probably contraindicated patients CVD group reduce risk heart arrhythmia. plot EIC feature highlighting retention time highest scoring MS2 spectra recorded create mirror plot comparing MS2 spectra reference fragment spectra caffeine. plot clearly shows higher signal feature CTR compared CVD samples. QC samples exhibit lower highly consistent signal, suggesting absence strong technical noise biases raw data experiment. vertical line indicates retention time fragment spectrum best match reference spectrum. noted , since fragment spectra measured separate LC-MS/MS experiment, considered indication approximate retention time ions fragmented second experiment. fragment spectrum feature, shown upper panel right plot highly similar reference spectrum caffeine MassBank (shown lower panel). addition matching precursor m/z, two fragments (m/z intensity) present spectra. 
can also extract additional metadata matching reference spectrum, used collision energy, fragmentation mode, instrument type, instrument well ion (adduct) fragmented.","code":"#' Load from the MetaboLights database param <- MetaboLightsParam(mtblsId = \"MTBLS8735\", assayName = paste0(\"a_MTBLS8735_LC-MSMS_positive_\", \"hilic_metabolite_profiling.txt\"), filePattern = \".mzML\") msms_data <- readMsObject(MsExperiment(), param, keepOntology = FALSE, keepProtocol = FALSE, simplify = TRUE) #adjust sampleData colnames(sampleData(msms_data)) <- c(\"sample_name\", \"derived_spectra_data_file\", \"metabolite_asssignment_file\", \"source_name\", \"organism\", \"blood_sample_type\", \"sample_type\", \"age\", \"unit\", \"phenotype\") # filter samples to keep MSMS data from CTR samples: sampleData(msms_data) <- sampleData(msms_data)[sampleData(msms_data)$phenotype == \"CTR\", ] sampleData(msms_data) <- sampleData(msms_data)[grepl(\"MSMS\", sampleData(msms_data)$derived_spectra_data_file), ] # Add fragmentation data information (from filenames) sampleData(msms_data)$fragmentation_mode <- c(\"CE20\", \"CE30\", \"CES\") #let's look at the updated sample data sampleData(msms_data)[, c(\"derived_spectra_data_file\", \"phenotype\", \"sample_name\", \"age\")] |> as.data.frame() |> pandoc.table(style = \"rmarkdown\", caption = \"Table 1. Samples from the data set.\") ## ## ## | derived_spectra_data_file | phenotype | sample_name | age | ## |:----------------------------:|:---------:|:-----------:|:---:| ## | FILES/MSMS_2_E_CE20_POS.mzML | CTR | E | 66 | ## | FILES/MSMS_2_E_CE30_POS.mzML | CTR | E | 66 | ## | FILES/MSMS_2_E_CES_POS.mzML | CTR | E | 66 | ## ## Table: Table 1. Samples from the data set. 
#' Filter the data to the same RT range as the LC-MS run msms_data <- filterRt(msms_data, c(10, 240)) #' check the number of spectra per ms level spectra(msms_data) |> msLevel() |> split(spectraSampleIndex(msms_data)) |> lapply(table) |> do.call(what = cbind) ## 1 2 3 4 5 6 7 8 9 10 11 12 ## 1 825 186 186 186 825 186 186 186 825 185 186 185 ## 2 825 3121 3118 3124 825 3123 3118 3120 825 3117 3117 3116 plotPrecursorIons(msms_data) ms2_ctr <- spectra(msms_data) ms2_ctr$collision_energy <- sampleData(msms_data)$fragmentation_mode[spectraSampleIndex(msms_data)] #' Remove low intensity peaks low_int <- function(x, ...) { x > max(x, na.rm = TRUE) * 0.05 } ms2_ctr <- filterMsLevel(ms2_ctr, 2L) |> filterIntensity(intensity = low_int) #' Remove precursor peaks and restrict to spectra with a minimum #' number of peaks ms2_ctr <- filterPrecursorPeaks(ms2_ctr, ppm = 50, mz = \">=\") ms2_ctr <- ms2_ctr[lengths(ms2_ctr) > 1] |> scalePeaks() #' Size of the object before loading into memory print(object.size(ms2_ctr), units = \"MB\") ## 5.1 Mb #' Load the MS data subset into memory ms2_ctr <- setBackend(ms2_ctr, MsBackendMemory()) ms2_ctr <- applyProcessing(ms2_ctr) #' Size of the object after loading into memory print(object.size(ms2_ctr), units = \"MB\") ## 18.2 Mb #' Define the m/z and retention time ranges for the significant features target <- featureArea(data)[rownames(res_sig), ] target ## mzmin mzmax rtmin rtmax ## FT0371 138.0544 138.0552 146.32270 152.86115 ## FT0565 161.0391 161.0407 159.00234 164.30799 ## FT0732 182.0726 182.0756 32.71242 42.28755 ## FT0845 195.0799 195.0887 30.73235 35.67337 ## FT1171 229.1282 229.1335 178.01450 183.35303 ## FT5606 560.3539 560.3656 32.06570 35.33456 #' Identify for each feature MS2 spectra with their precursor m/z and #' retention time within the feature's m/z and retention time range ms2_ctr_fts <- apply(target[, c(\"rtmin\", \"rtmax\", \"mzmin\", \"mzmax\")], MARGIN = 1, FUN = filterRanges, object = ms2_ctr, spectraVariables = 
c(\"rtime\", \"precursorMz\")) lengths(ms2_ctr_fts) ## FT0371 FT0565 FT0732 FT0845 FT1171 FT5606 ## 38 36 135 68 38 0 l <- lengths(ms2_ctr_fts) #' Combine the individual Spectra objects ms2_ctr_fts <- concatenateSpectra(ms2_ctr_fts) #' Assign the feature identifier to each MS2 spectrum ms2_ctr_fts$feature_id <- rep(rownames(res_sig), l) ms2_ref <- Spectra(mb) |> filterPolarity(1L) |> filterIntensity(intensity = low_int) |> filterPrecursorPeaks(ppm = 50, mz = \">=\") ms2_ref <- ms2_ref[lengths(ms2_ref) > 1] |> scalePeaks() register(SerialParam()) #' Define the settings for the spectra matching. prm <- CompareSpectraParam(ppm = 40, tolerance = 0.05, requirePrecursor = TRUE, THRESHFUN = function(x) which(x >= 0.6)) ms2_mtch <- matchSpectra(ms2_ctr_fts, ms2_ref, param = prm) ms2_mtch ## Object of class MatchedSpectra ## Total number of matches: 214 ## Number of query objects: 315 (16 matched) ## Number of target objects: 69561 (21 matched) #' Keep only query spectra with matching reference spectra ms2_mtch <- ms2_mtch[whichQuery(ms2_mtch)] #' Extract the results ms2_mtch_res <- matchedData(ms2_mtch) nrow(ms2_mtch_res) ## [1] 214 #' - split the result per feature #' - select for each feature the best matching result for each compound #' - combine the result again into a data frame ms2_mtch_res <- ms2_mtch_res |> split(f = paste(ms2_mtch_res$feature_id, ms2_mtch_res$target_inchikey)) |> lapply(function(z) { z[which.max(z$score), ] }) |> do.call(what = rbind) |> as.data.frame() #' List the best matching feature-compound pair pandoc.table(ms2_mtch_res[, c(\"feature_id\", \"target_name\", \"score\", \"target_inchikey\")], style = \"rmarkdown\", caption = \"Table 9.MS2 annotation results.\", split.table = Inf) par(mfrow = c(1, 2)) col_sample <- col_phenotype[sampleData(data)$phenotype] #' Extract and plot EIC for the annotated feature eic <- featureChromatograms(data, features = ms2_mtch_res$feature_id[1]) plot(eic, col = col_sample, peakCol = col_sample[chromPeaks(eic)[, 
\"sample\"]], peakBg = paste0(col_sample[chromPeaks(eic)[, \"sample\"]], 20)) legend(\"topright\", col = col_phenotype, legend = names(col_phenotype), lty = 1) #' Identify the best matching query-target spectra pair idx <- which.max(ms2_mtch_res$score) #' Indicate the retention time of the MS2 spectrum in the EIC plot abline(v = ms2_mtch_res$rtime[idx]) #' Get the index of the MS2 spectrum in the query object query_idx <- which(query(ms2_mtch)$.original_query_index == ms2_mtch_res$.original_query_index[idx]) query_ms2 <- query(ms2_mtch)[query_idx] #' Get the index of the MS2 spectrum in the target object target_idx <- which(target(ms2_mtch)$spectrum_id == ms2_mtch_res$target_spectrum_id[idx]) target_ms2 <- target(ms2_mtch)[target_idx] #' Create a mirror plot comparing the two best matching spectra plotSpectraMirror(query_ms2, target_ms2) legend(\"topleft\", legend = paste0(\"precursor m/z: \", format(precursorMz(query_ms2), 3))) spectraData(target_ms2, c(\"collisionEnergy_text\", \"fragmentation_mode\", \"instrument_type\", \"instrument\", \"adduct\")) |> as.data.frame() ## collisionEnergy_text fragmentation_mode instrument_type ## 1 55 (nominal) HCD LC-ESI-ITFT ## instrument adduct ## 1 LTQ Orbitrap XL Thermo Scientific [M+H]+"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"external-tools-or-alternative-annotation-approaches","dir":"Articles","previous_headings":"Annotation","what":"External tools or alternative annotation approaches","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"present workflow highlights annotation performed within R using packages Bioconductor project, also excellent external softwares used alternative, SIRIUS (Dührkop et al. 2019), mummichog (Li et al. 2013) GNPS (Nothias et al. 2020) among others. use , data need exported format supported . 
MS2 spectra, data easily exported required MGF file format using MsBackendMgf Bioconductor package. Integration xcms feature-based molecular networking GNPS described GNPS documentation. alternative, addition, evidence potential matching chemical formula feature derived evaluating isotope pattern full MS1 scan. provide information isotope composition. Also, various functions isotopologues() MetaboCoreUtils package functionality enviPat R package (Loos et al. 2015) used.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"summary","dir":"Articles","previous_headings":"","what":"Summary","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"tutorial, describe end-to-end workflow LC-MS-based untargeted metabolomics experiments, conducted entirely within R using packages Bioconductor project base R functionality. excellent software exists perform similar analyses, power R-based workflow lies adaptability individual data sets research questions ability build reproducible workflows documentation. Due space restrictions don’t provide comprehensive listing methodologies individual analysis steps. advanced options approaches available, e.g., normalization data, however also heavily dependent size properties analyzed data set, well annotation features. result, found present analysis set features significant abundance differences compared groups. however reliably annotate single feature, related lifestyle individuals rather pathological properties investigated disease. 
low proportion annotated signals however uncommon untargeted metabolomics experiments reflects need comprehensive reliable reference annotation libraries.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"session-information","dir":"Articles","previous_headings":"","what":"Session information","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"","code":"sessionInfo() ## R version 4.4.1 (2024-06-14) ## Platform: x86_64-pc-linux-gnu ## Running under: Ubuntu 22.04.4 LTS ## ## Matrix products: default ## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 ## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0 ## ## locale: ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C ## [9] LC_ADDRESS=C LC_TELEPHONE=C ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C ## ## time zone: Etc/UTC ## tzcode source: system (glibc) ## ## attached base packages: ## [1] stats4 stats graphics grDevices utils datasets methods ## [8] base ## ## other attached packages: ## [1] MetaboAnnotation_1.9.1 CompoundDb_1.9.4 ## [3] AnnotationFilter_1.29.0 AnnotationHub_3.13.3 ## [5] BiocFileCache_2.13.0 dbplyr_2.5.0 ## [7] gridExtra_2.3 ggfortify_0.4.17 ## [9] ggplot2_3.5.1 vioplot_0.5.0 ## [11] zoo_1.8-12 sm_2.2-6.0 ## [13] pheatmap_1.0.12 RColorBrewer_1.1-3 ## [15] pander_0.6.5 limma_3.61.10 ## [17] MetaboCoreUtils_1.13.0 Spectra_1.15.8 ## [19] xcms_4.3.3 BiocParallel_1.39.0 ## [21] SummarizedExperiment_1.35.1 GenomicRanges_1.57.1 ## [23] GenomeInfoDb_1.41.1 IRanges_2.39.2 ## [25] S4Vectors_0.43.2 MatrixGenerics_1.17.0 ## [27] matrixStats_1.4.1 MsBackendMetaboLights_0.99.0 ## [29] MsIO_0.0.6 MsExperiment_1.7.0 ## [31] ProtGenerics_1.37.1 readxl_1.4.3 ## [33] Biobase_2.65.1 
BiocGenerics_0.51.1 ## [35] rmarkdown_2.28 knitr_1.48 ## [37] BiocStyle_2.33.1 ## ## loaded via a namespace (and not attached): ## [1] bitops_1.0-8 filelock_1.0.3 ## [3] tibble_3.2.1 cellranger_1.1.0 ## [5] preprocessCore_1.67.0 XML_3.99-0.17 ## [7] lifecycle_1.0.4 doParallel_1.0.17 ## [9] lattice_0.22-6 MASS_7.3-61 ## [11] alabaster.base_1.5.9 MultiAssayExperiment_1.31.5 ## [13] magrittr_2.0.3 sass_0.4.9 ## [15] jquerylib_0.1.4 yaml_2.3.10 ## [17] MsCoreUtils_1.17.2 DBI_1.2.3 ## [19] abind_1.4-8 zlibbioc_1.51.1 ## [21] purrr_1.0.2 RCurl_1.98-1.16 ## [23] rappdirs_0.3.3 GenomeInfoDbData_1.2.12 ## [25] MSnbase_2.31.1 pkgdown_2.1.1 ## [27] ncdf4_1.23 codetools_0.2-20 ## [29] DelayedArray_0.31.11 DT_0.33 ## [31] xml2_1.3.6 tidyselect_1.2.1 ## [33] farver_2.1.2 UCSC.utils_1.1.0 ## [35] base64enc_0.1-3 jsonlite_1.8.9 ## [37] iterators_1.0.14 systemfonts_1.1.0 ## [39] foreach_1.5.2 tools_4.4.1 ## [41] progress_1.2.3 ragg_1.3.3 ## [43] Rcpp_1.0.13 glue_1.7.0 ## [45] SparseArray_1.5.39 xfun_0.47 ## [47] dplyr_1.1.4 withr_3.0.1 ## [49] BiocManager_1.30.25 fastmap_1.2.0 ## [51] rhdf5filters_1.17.0 fansi_1.0.6 ## [53] digest_0.6.37 mime_0.12 ## [55] R6_2.5.1 textshaping_0.4.0 ## [57] colorspace_2.1-1 rsvg_2.6.1 ## [59] RSQLite_2.3.7 utf8_1.2.4 ## [61] tidyr_1.3.1 generics_0.1.3 ## [63] prettyunits_1.2.0 PSMatch_1.9.0 ## [65] httr_1.4.7 htmlwidgets_1.6.4 ## [67] S4Arrays_1.5.8 pkgconfig_2.0.3 ## [69] gtable_0.3.5 blob_1.2.4 ## [71] impute_1.79.0 MassSpecWavelet_1.71.0 ## [73] XVector_0.45.0 htmltools_0.5.8.1 ## [75] bookdown_0.40 MALDIquant_1.22.3 ## [77] clue_0.3-65 scales_1.3.0 ## [79] png_0.1-8 reshape2_1.4.4 ## [81] rjson_0.2.23 curl_5.2.3 ## [83] cachem_1.1.0 rhdf5_2.49.0 ## [85] stringr_1.5.1 BiocVersion_3.20.0 ## [87] parallel_4.4.1 AnnotationDbi_1.67.0 ## [89] mzID_1.43.0 vsn_3.73.0 ## [91] desc_1.4.3 pillar_1.9.0 ## [93] grid_4.4.1 alabaster.schemas_1.5.0 ## [95] vctrs_0.6.5 MsFeatures_1.13.0 ## [97] pcaMethods_1.97.0 cluster_2.1.6 ## [99] evaluate_1.0.0 cli_3.6.3 ## 
[101] compiler_4.4.1 rlang_1.1.4 ## [103] crayon_1.5.3 labeling_0.4.3 ## [105] QFeatures_1.15.3 ChemmineR_3.57.0 ## [107] affy_1.83.0 plyr_1.8.9 ## [109] fs_1.6.4 stringi_1.8.4 ## [111] munsell_0.5.1 Biostrings_2.73.1 ## [113] lazyeval_0.2.2 Matrix_1.7-0 ## [115] hms_1.1.3 bit64_4.5.2 ## [117] Rhdf5lib_1.27.0 KEGGREST_1.45.1 ## [119] statmod_1.5.0 highr_0.11 ## [121] mzR_2.39.0 igraph_2.0.3 ## [123] memoise_2.0.1 affyio_1.75.0 ## [125] bslib_0.8.0 bit_4.5.0"},{"path":[]},{"path":[]},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"aknowledgment","dir":"Articles","previous_headings":"Appendix","what":"Aknowledgment","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"Thanks Steffen Neumann continuous work develop maintain xcms software. …","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"alignment-using-manually-selected-anchor-peaks","dir":"Articles","previous_headings":"Appendix","what":"Alignment using manually selected anchor peaks","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"align data set using internal standards. suggested eventually enrich anchor peaks signal ions retention time regions covered internal standards.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"additional-informations","dir":"Articles","previous_headings":"","what":"Additional informations","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"","code":"#possible extra info: # -"},{"path":"https://rformassspectrometry.github.io/metabonaut/authors.html","id":null,"dir":"","previous_headings":"","what":"Authors","title":"Authors and Citation","text":"Philippine Louail. Author, maintainer. Johannes Rainer. 
Author.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/authors.html","id":"citation","dir":"","previous_headings":"","what":"Citation","title":"Authors and Citation","text":"Louail P, Rainer J (2024). metabonaut: Exploring Analyzing LC-MS data. R package version 0.0.1, https://rformassspectrometry.github.io/metabonaut/, https://github.com/rformassspectrometry/metabonaut/.","code":"@Manual{, title = {metabonaut: Exploring and Analyzing LC-MS data}, author = {Philippine Louail and Johannes Rainer}, year = {2024}, note = {R package version 0.0.1, https://rformassspectrometry.github.io/metabonaut/}, url = {https://github.com/rformassspectrometry/metabonaut/}, }"},{"path":"https://rformassspectrometry.github.io/metabonaut/index.html","id":"exploring-and-analyzing-lc-ms-data","dir":"","previous_headings":"","what":"Exploring and Analyzing LC-MS data","title":"Exploring and Analyzing LC-MS data","text":"walks preprocessing small data set emphasizing selection data-dependent settings individual preprocessing steps. full R code examples along comprehensive descriptions provided end--end-untargeted-metabolomics.Rmd file. file can opened e.g. RStudio allows execution individual R commands (see section additionally required R packages). R command rmarkdown::render(\"xcms-preprocessing.Rmd\") generate html file xcms-preprocessing.html.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/index.html","id":"important-to-note","dir":"","previous_headings":"","what":"Important to note","title":"Exploring and Analyzing LC-MS data","text":"tutorial expect user basic knowledge R Rmarkdown. advise go short tutorial order comfortable testing code easily adapting data. 
Rmarkdown, click R, ","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/index.html","id":"installation","dir":"","previous_headings":"","what":"Installation","title":"Exploring and Analyzing LC-MS data","text":"workshop files along R runtime environment including required packages RStudio (Posit) editor bundled docker container. installation, docker container can run computer code examples workshop can evaluated within environment (without need install additional packages files). version workshop uses packages Bioconductor devel hence bases Bioconductor’s docker container development version packages. stable version come soon. required steps installation : don’t already , install docker. Find installation information . Get docker image tutorial e.g. command line docker pull rformassspectrometry/metabonaut:latest. Start docker container, either Docker Desktop, command line Enter http://localhost:8787 web browser log username rstudio password bioc. RStudio server version: open R-markdown (.Rmd) files vignettes folder evaluate R code blocks document. manual installation, R version >= 4.4.0 required well recent versions packages used workflow. now 2 packages used workflow bioconductor therefore need downloaded github. Run code follow:","code":"docker run \\ -e PASSWORD=bioc \\ -p 8787:8787 \\ rformassspectrometry/metabonaut:latest install.packages(\"BiocManager\") BiocManager::install(\"RforMassSpectrometry/MsBackendMetaboLights\", dependencies = TRUE) BiocManager::install(\"RforMassSpectrometry/MsIO\", dependencies = TRUE)"},{"path":"https://rformassspectrometry.github.io/metabonaut/index.html","id":"known-issues","dir":"","previous_headings":"","what":"Known issues","title":"Exploring and Analyzing LC-MS data","text":"workflow still getting ready fully deployed, therefore ongoing issue actively resolving. chunks Line 414 453 rendered rendered issue backend. 
issue, hesitate report us.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/index.html","id":"contribution","dir":"","previous_headings":"","what":"Contribution","title":"Exploring and Analyzing LC-MS data","text":"contributions, see RforMassSpectrometry contributions guideline.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/index.html","id":"code-of-conduct","dir":"","previous_headings":"","what":"Code of Conduct","title":"Exploring and Analyzing LC-MS data","text":"See RforMassSpectrometry Code Conduct.","code":""},{"path":[]},{"path":"https://rformassspectrometry.github.io/metabonaut/news/index.html","id":"changes-in-0-0-1","dir":"Changelog","previous_headings":"","what":"Changes in 0.0.1","title":"metabonaut 0.0.1","text":"Addition basic files workflow package. Addition end--end vignette.","code":""}]
+[{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"abstract","dir":"Articles","previous_headings":"","what":"Abstract","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"Metabolomics provides real-time view metabolic state examined samples, mass spectrometry serving key tool deciphering intricate differences metabolomes due specific factors. context metabolomic investigations, untargeted liquid chromatography tandem mass spectrometry (LC-MS/MS) emerges powerful approach thanks versatility resolution. paper focuses dataset aimed identifying differences plasma metabolite levels individuals suffering cardiovascular disease healthy controls. Despite potential insights offered untargeted LC-MS/MS data, significant challenge field lies generation reproducible scalable analysis workflows. struggle due aforementioned high versatility technique, results difficulty one-size-fits-workflow software adapt experimental setups. power R-based analysis workflows lies high customizability adaptability specific instrumental experimental setups; however, various specialized packages exist individual analysis steps, seamless integration application large cohort datasets remain elusive. Addressing gap, present innovative R workflow leverages xcms, packages RforMassSpectrometry environment encompass aspects pre-processing downstream analyses LC-MS/MS datasets reproducible manner allow easy customization generate data-set specific workflows. 
workflow seamlessly integrates Bioconductor packages, offering adaptability diverse study designs analysis requirements.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"keyword","dir":"Articles","previous_headings":"","what":"Keyword","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"LC-MS/MS, reproducibility, workflow, xcms, R, normalization, feature identification, Bioconductor,…","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"introduction","dir":"Articles","previous_headings":"","what":"Introduction","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"Liquid chromatography coupled tandem mass spectrometry (LC-MS/MS) powerful tool metabolomics investigations, providing comprehensive view metabolome. enables identification large number metabolites relative abundance biological samples. Liquid Chromatography (LC) separation technique relies different interactions analytes towards chromatographic column - stationary phase - eluent analysis - mobile phase. stronger affinity analyte stationary phase - dictated polarity, size, charges parameters - longer take compound leave column detected coupled technique - Mass Spectrometer. Mass Spectrometry allows identify quantify ions based mass--charge (m/z) ratio. high selectivity relies capability separate compounds small variations mass, also capacity promote fragmentation. ion initial m/z (parent ion) can broken characteristic fragments (daughter ions), help structure elucidation identification specific compound (Theodoridis et al. 2012). Therefore, LC-MS/MS data usually tridimensional datasets containing retention time compounds separation LC, detected m/z compounds given time, intensity signals. 
Furthermore, MS signal can two different levels, corresponding signal parent ion (called MS1) signals corresponding fragments (denominated MS2). high sensitivity specificity LC-MS/MS make indispensable tool biomarker discovery elucidating metabolic pathways. untargeted approach particularly useful hypothesis-free investigations, allowing detection unexpected metabolites pathways. However, analysis LC-MS/MS data complex requires series preprocessing steps extract meaningful information raw data. main challenges include dealing lack ground truth data, high dimensionality data, presence noise artifacts (Gika, Wilson, Theodoridis 2014). Moreover, due different instrumental setups protocols definition single one-fits-workflow impossible. Finally, specialized software packages exist individual step analysis, seamless integration remains elusive. present complete analysis workflow untargeted LC-MS/MS data using R Bioconductor packages, particular RforMassSpectrometry package ecosystem. latter initiative aims implement expandable, flexible infrastructure analysis MS data, providing also comprehensive toolbox functions build customized analysis workflows. demonstrate various algorithms can adapted particular data set various R packages can seamlessly integrated ensure efficient reproducible processing. present workflow covers steps LC-MS/MS data analysis, preprocessing, data normalization, differential abundance analysis annotation significant features .e., collections signals retention time mass--charge ratios pertaining ions. 
Various options visualizations well quality assessment presented analysis steps.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"data-description","dir":"Articles","previous_headings":"","what":"Data description","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"workflow two datasets utilized, LC-MS-based (MS1 level) untargeted metabolomics data set quantify small polar metabolites human plasma samples additional LC-MS/MS data set selected samples former study identification/annotation significant features. samples used randomly selected larger study identification metabolites differences abundances individuals suffering cardiovascular disease (CVD) healthy controls (CTR). subset analyzed comprises data three CVD three CTR well four quality control (QC) samples. QC samples represent pool serum samples large cohort repeatedly measured throughout experiment monitor stability signal. data metadata workflow accessible MetaboLights database ID: MTBLS8735. detailed materials method used analysis samples can also found MetaboLights database. especially pertinent analysis chosen parameters, want highlight samples analyzed using ultra-high-performance liquid chromatography (UHPLC) coupled Q-TOF mass spectrometer (TripleTOF 5600+). 
chromatographic separation based hydrophilic interaction liquid chromatography (HILIC).","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"workflow-description","dir":"Articles","previous_headings":"","what":"Workflow description","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"present workflow describes steps analysis LC-MS/MS experiment, includes preprocessing raw data generate abundance matrix features various samples, followed data normalization, differential abundance analysis finally annotation features metabolites. Note also alternative analysis options R packages used different steps examples mentioned throughout workflow. [jo: ’ll include maybe later. key justify workflow comprehensive] workflow therefore based following dependencies:","code":"## General bioconductor package library(Biobase) ## Data Import and handling library(readxl) library(MsExperiment) library(MsIO) library(MsBackendMetaboLights) library(SummarizedExperiment) ## Preprocessing of LC-MS data library(xcms) library(Spectra) library(MetaboCoreUtils) ## Statistical analysis library(limma) # Differential abundance library(matrixStats) # Summaries over matrices ## Visualisation library(pander) library(RColorBrewer) library(pheatmap) library(vioplot) library(ggfortify) # Plot PCA library(gridExtra) # To arrange multiple ggplots into single plots ## Annotation library(AnnotationHub) # Annotation resources library(CompoundDb) # Access small compound annotation data. 
library(MetaboAnnotation) # Functionality for metabolite annotation."},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"data-import","dir":"Articles","previous_headings":"","what":"Data import","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"Note different equipment generate various file extensions, conversion step might needed beforehand, though apply dataset. Spectra package supports variety ways store retrieve MS data, including mzML, mzXML, CDF files, simple flat files, database systems. necessary, several tools, ProteoWizard’s MSConvert, can used convert files .mzML format (Chambers et al. 2012). show extract dataset MetaboLights database load MsExperiment object. information load data MetaboLights database, refer vignette. type data loading, check link: next configure parallel processing setup. functions xcms package allow per-sample parallel processing, can improve performance analysis, especially large data sets. xcms packages RforMassSpectrometry package ecosystem use parallel processing setup configured BiocParallel Bioconductor package. 
code use fork-based parallel processing unix system, socket-based parallel processing Windows operating system.","code":"param <- MetaboLightsParam(mtblsId = \"MTBLS8735\", assayName = paste0(\"a_MTBLS8735_LC-MS_positive_\", \"hilic_metabolite_profiling.txt\"), filePattern = \".mzML\") data <- readMsObject(MsExperiment(), param, keepOntology = FALSE, keepProtocol = FALSE, simplify = TRUE) #' Set up parallel processing using 2 cores if (.Platform$OS.type == \"unix\") { register(MulticoreParam(2)) } else{ register(SnowParam(2)) }"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"data-organisation","dir":"Articles","previous_headings":"","what":"Data organisation","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"experimental data now represented MsExperiment object MsExperiment package. MsExperiment object container metadata spectral data provides manages also linkage samples spectra. provide brief overview data structure content. sampleData() function extracts sample information object. next extract data use pander package render show information Table 1 . Throughout document use R pipe operator (|>) avoid nested function calls hence improve code readability. Table 1. Samples data set. (continued ) Table 1. Samples data set. (continued ) 11 samples data set. abbreviations essential proper interpretation metadata information: injection_index: index representing order (position) individual sample measured (injected) within LC-MS measurement run experiment. \"QC\": Quality control sample (pool serum samples external, large cohort). \"CVD\": Sample individual cardiovascular disease. \"CTR\": Sample presumably healthy control. sample_name: arbitrary name/identifier sample. age: (rounded) age individuals. 
define colors sample groups based sample group using RColorBrewer package: MS data experiment stored Spectra object (Spectra Bioconductor package) within MsExperiment object can accessed using spectra() function. element object spectrum - organised linearly combined Spectra object one (ordered retention time samples). access dataset’s Spectra object summarize available information provide, among things, total number spectra data set. can also summarize number spectra respective MS level (extracted msLevel() function). fromFile() function returns spectrum index sample (data file) can thus used split information (MS level case) sample summarize using base R table() function combine result matrix. Note number spectra acquired run, number spectral features sample. present data set thus contains MS1 data, ideal quantification signal. second (LC-MS/MS) data set also fragment (MS2) spectra samples used later workflow. Note users restrict data evaluation examples shown tutorials. Spectra package enables user-friendly access full MS data functionality extensively used explore, visualize summarize data. another example, determine retention time range entire data set. Data obtained LC-MS experiments typically analyzed along retention time axis, MS data organized spectrum, orthogonal retention time axis.","code":"data ## Object of class MsExperiment ## Spectra: MS1 (17210) ## Experiment data: 10 sample(s) ## Sample data links: ## - spectra: 10 sample(s) to 17210 element(s). #' Access Spectra Object spectra(data) ## MSn data (Spectra) with 17210 spectra in a MsBackendMetaboLights backend: ## msLevel rtime scanIndex ## ## 1 1 0.274 1 ## 2 1 0.553 2 ## 3 1 0.832 3 ## 4 1 1.111 4 ## 5 1 1.390 5 ## ... ... ... ... ## 17206 1 479.052 1717 ## 17207 1 479.331 1718 ## 17208 1 479.610 1719 ## 17209 1 479.889 1720 ## 17210 1 480.168 1721 ## ... 36 more variables/columns. ## ## file(s): ## MS_QC_POOL_1_POS.mzML ## MS_A_POS.mzML ## MS_B_POS.mzML ## ... 
7 more files #' Count the number of spectra with a specific MS level per file. spectra(data) |> msLevel() |> split(fromFile(data)) |> lapply(table) |> do.call(what = cbind) ## 1 2 3 4 5 6 7 8 9 10 ## 1 1721 1721 1721 1721 1721 1721 1721 1721 1721 1721 #' Retention time range for entire dataset spectra(data) |> rtime() |> range() ## [1] 0.273 480.169"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"data-visualization-and-general-quality-assessment","dir":"Articles","previous_headings":"","what":"Data visualization and general quality assessment","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"Effective visualization paramount inspecting assessing quality MS data. general overview LC-MS data, can: Combine mass peaks (MS1) spectra sample single spectrum mass peak represents maximum signal mass peaks similar m/z. spectrum might called Base Peak Spectrum (BPS), providing information abundant ions sample. Aggregate mass peak intensities spectrum, resulting Base Peak Chromatogram (BPC). BPC shows highest measured intensity distinct retention time (hence spectrum) thus orthogonal BPS. Sum mass peak intensities spectrum create Total Ion Chromatogram (TIC). Compare BPS samples experiment evaluate similarity ion content. Compare BPC samples experiment identify samples similar dissimilar chromatographic signal. addition general data evaluation visualization, also crucial investigate specific signal e.g. internal standards compounds/ions known present samples. 
providing reliable reference, internal standards help achieve consistent accurate analytical results.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"spectra-data-visualization-bps","dir":"Articles","previous_headings":"Data visualization and general quality assessment","what":"Spectra Data Visualization: BPS","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"BPS collapses data retention time dimension reveals prevalent ions present samples, creation BPS however straightforward. Mass peaks, even representing signals ion, never identical m/z values consecutive spectra due measurement error/resolution instrument. use combineSpectra function combine spectra one file (defined using parameter f = fromFile(data)) single spectrum. mass peaks difference m/z value smaller 3 parts-per-million (ppm) combined one mass peak, intensity representing maximum grouped mass peaks. reduce memory requirement, addition first bin spectrum combining mass peaks within spectrum, aggregating mass peaks bins 0.01 m/z width. case large datasets, also recommended set processingChunkSize() parameter MsExperiment object finite value (default Inf) causing data processed (loaded memory) chunks processingChunkSize() spectra. can reduce memory demand speed process. can now generate BPS sample plot() . , observable overlap ion content files, particularly around 300 m/z 700 m/z. however also differences sets samples. particular, BPS 1, 4, 7 10 (counting row-wise left right) seem different others. fact, four BPS QC samples, remaining six study samples. observed differences might explained fact QC samples pools serum samples different cohort, study samples represent plasma samples, different sample collection. Next visual inspection , can also calculate express similarity BPS heatmap. 
use compareSpectra() function calculate pairwise similarities BPS use pheatmap() function pheatmap package cluster visualize result. get first glance different samples distribute terms similarity. heatmap confirms observations made BPS, showing distinct clusters QCs study samples, owing different matrices sample collections. also strongly recommended delve deeper data exploring detail. can accomplished carefully assessing data extracting spectra regions interest examination. next chunk, look extract information specific spectrum distinct samples. significant dissimilarities peak distribution intensity confirm difference composition QCs study samples. next compare full MS1 spectrum CVD CTR sample. , can observe spectra CVD CTR samples entirely similar, exhibit similar main peaks 200 600 m/z general higher intensity control samples. However peak distribution (least intensity) seems vary m/z 10 210 m/z 600. CTR spectrum exhibits significant peaks around m/z 150 - 200 much lower intensity CVD sample. delve details specific spectrum, wide range functions can employed: NumericList length 1 [[1]] 18.3266733266736 45.1666666666667 … 27.1048951048951 34.9020979020979 [1] 34.872 NumericList length 1 [[1]] 51.1677328505635 53.0461968245186 … 999.139446289161 999.315208803072 Table 2. 
Intensity m/z values 125th spectrum one CTR sample.","code":"#' Setting the chunksize chunksize <- 1000 processingChunkSize(spectra(data)) <- chunksize #' Accessing a single spectrum - comparing with QC par(mfrow = c(1,2), mar = c(2, 2, 2, 2)) spec1 <- spectra(data[1])[125] spec2 <- spectra(data[3])[125] plotSpectra(spec1, main = \"QC sample\") plotSpectra(spec2, main = \"CTR sample\") #' Accessing a single spectrum - comparing CVD and CTR par(mfrow = c(1,2), mar = c(2, 2, 2, 2)) spec1 <- spectra(data[2])[125] spec2 <- spectra(data[3])[125] plotSpectra(spec1, main = \"CVD sample\") plotSpectra(spec2, main = \"CTR sample\") #' Checking its intensity intensity(spec2) #' Checking its rtime rtime(spec2) #' Checking its m/z mz(spec2) #' Filtering for a specific m/z range and viewing in a tabular format filt_spec <- filterMzRange(spec2,c(50,200)) data.frame(intensity = unlist(intensity(filt_spec)), mz = unlist(mz(filt_spec))) |> head() |> pandoc.table(style = \"rmarkdown\", caption = \"Table 2. Intensity and m/z values of the 125th spectrum of one CTR sample.\")"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"chromatographic-data-visualization-bpc-and-tic","dir":"Articles","previous_headings":"Data visualization and general quality assessment","what":"Chromatographic Data Visualization: BPC and TIC","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"chromatogram() function facilitates extraction intensities along retention time. However, access chromatographic information currently efficient seamless spectral information. Work underway develop/improve infrastructure chromatographic data new Chromatograms object aimed flexible user-friendly Spectra object. visualizing LC-MS data, BPC TIC serves valuable tool assess performance liquid chromatography across various samples experiment. case, extract BPC data create plot. 
BPC captures maximum peak signal spectrum data file plots information retention time spectrum y-axis. BPC can extracted using chromatogram function. setting parameter aggregationFun = \"max\", instruct function report maximum signal per spectrum. Conversely, setting aggregationFun = \"sum\", sums intensities spectrum, thereby creating TIC. 240 seconds signal seems measured. Thus, filter data removing part well first 10 seconds measured LC run. Initially, examined entire BPC subsequently filtered based desired retention times. results smaller file size also facilitates straightforward interpretation BPC. final plot illustrates BPC sample colored phenotype, providing insights signal measured along retention times sample. reveals points compounds eluted LC column. essence, BPC condenses three-dimensional LC-MS data (m/z retention time intensity) two dimensions (retention time intensity). can also compare similarities BPCs heatmap. retention times however identical different samples. Thus bin() chromatographic signal per sample along retention time axis bins two seconds resulting data number bins/data points. can calculate pairwise similarities data vectors using cor() function visualize result using pheatmap(). heatmap reinforces exploration spectra data showed, strong separation QC study samples. important bear mind later analyses. Additionally, study samples group two clusters, cluster containing samples C F cluster II samples. plot TIC samples, using different color cluster. TIC samples look similar, samples cluster show different signal retention time range 40 160 seconds. 
Whether, strong difference impact following analysis remains determined.","code":"#' Extract and plot BPC for full data bpc <- chromatogram(data, aggregationFun = \"max\") plot(bpc, col = paste0(col_sample, 80), main = \"BPC\", lwd = 1.5) grid() legend(\"topright\", col = col_phenotype, legend = names(col_phenotype), lty = 1, lwd = 2, horiz = TRUE, bty = \"n\") #' Filter the data based on retention time data <- filterRt(data, c(10, 240)) bpc <- chromatogram(data, aggregationFun = \"max\") #' Plot after filtering plot(bpc, col = paste0(col_sample, 80), main = \"BPC after filtering retention time\", lwd = 1.5) grid() legend(\"topright\", col = col_phenotype, legend = names(col_phenotype), lty = 1, lwd = 2, horiz = TRUE, bty = \"n\") #' Total ion chromatogram tic <- chromatogram(data, aggregationFun = \"sum\") |> bin(binSize = 2) #' Calculate similarity (Pearson correlation) between BPCs ticmap <- do.call(cbind, lapply(tic, intensity)) |> cor() rownames(ticmap) <- colnames(ticmap) <- sampleData(data)$sample_name ann <- data.frame(phenotype = sampleData(data)[, \"phenotype\"]) rownames(ann) <- rownames(ticmap) #' Plot heatmap pheatmap(ticmap, annotation_col = ann, annotation_colors = list(phenotype = col_phenotype)) cluster_I_idx <- sampleData(data)$sample_name %in% c(\"F\", \"C\") cluster_II_idx <- sampleData(data)$sample_name %in% c(\"A\", \"B\", \"D\", \"E\") temp_col <- c(\"grey\", \"red\") names(temp_col) <- c(\"Cluster II\", \"Cluster I\") col <- rep(temp_col[1], length(data)) col[cluster_I_idx] <- temp_col[2] col[sampleData(data)$phenotype == \"QC\"] <- NA data |> chromatogram(aggregationFun = \"sum\") |> plot( col = col, main = \"TIC after filtering retention time\", lwd = 1.5) grid() legend(\"topright\", col = temp_col, legend = names(temp_col), lty = 1, lwd = 2, horiz = TRUE, bty = 
\"n\")"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"known-compounds","dir":"Articles","previous_headings":"Data visualization and general quality assessment > Chromatographic Data Visualization: BPC and TIC","what":"Known compounds","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"Throughout entire process, crucial reference points within dataset, well-known ions. experiments nowadays include internal standards (), case . strongly recommend using visualization throughout entire analysis. experiment, set 15 spiked samples. reviewing signal , selected two guide analysis process. However, also advise plot evaluate ions steps. illustrate , generate Extracted Ion Chromatograms (EIC) selected test ions. restricting MS data intensities within restricted, small m/z range selected retention time window, EICs expected contain signal single type ion. expected m/z retention times set determined different experiment. Additionally, cases internal standards available, commonly present ions sample matrix can serve suitable alternatives. Ideally, compounds distributed across entire retention time range experiment. Table 3.Internal standard list respective m/z expected retention time [s]. (continued ) plot EICs isotope labeled cystine methionine. can observe clear concentration difference QCs study samples isotope labeled cystine ion. Meanwhile, labeled methionine internal standard exhibits discernible signal amidst noise noticeable retention time shift samples. artificially isotope labeled compounds spiked individual samples, also signal endogenous compounds serum (plasma) samples. Thus, calculate next mass m/z [M+H]+ ion endogenous cystine chemical formula extract also EIC ion. calculation exact mass m/z selected ion adduct use calculateMass() mass2mz() functions r Biocpkg(\"MetaboCoreUtils\") package. 
two cystine EICs look highly similar (endogenous shown left, isotope labeled right plot ), shift m/z, arises artificial labeling. shift allows us discriminate endogenous non-endogenous compound.","code":"#' Load our list of standard intern_standard <- read.delim(\"intern_standard_list.txt\") # Extract EICs for the list eic_is <- chromatogram( data, rt = as.matrix(intern_standard[, c(\"rtmin\", \"rtmax\")]), mz = as.matrix(intern_standard[, c(\"mzmin\", \"mzmax\")])) #' Add internal standard metadata fData(eic_is)$mz <- intern_standard$mz fData(eic_is)$rt <- intern_standard$RT fData(eic_is)$name <- intern_standard$name fData(eic_is)$abbreviation <- intern_standard$abbreviation rownames(fData(eic_is)) <- intern_standard$abbreviation #' Summary of IS information cpt <- paste(\"Table 3.Internal standard list with respective m/z and expected\", \"retention time [s].\") fData(eic_is)[, c(\"name\", \"mz\", \"rt\")] |> as.data.frame() |> pandoc.table(style = \"rmarkdown\", caption = cpt) #' Extract the two IS from the chromatogram object. eic_cystine <- eic_is[\"cystine_13C_15N\"] eic_met <- eic_is[\"methionine_13C_15N\"] #' plot both EIC par(mfrow = c(1, 2), mar = c(4, 2, 2, 0.5)) plot(eic_cystine, main = fData(eic_cystine)$name, cex.axis = 0.8, cex.main = 0.8, col = paste0(col_sample, 80)) grid() abline(v = fData(eic_cystine)$rt, col = \"red\", lty = 3) plot(eic_met, main = fData(eic_met)$name, cex.axis = 0.8, cex.main = 0.8, col = paste0(col_sample, 80)) grid() abline(v = fData(eic_met)$rt, col = \"red\", lty = 3) legend(\"topright\", col = col_phenotype, legend = names(col_phenotype), lty = 1, bty = \"n\") #' extract endogenous cystine mass and EIC and plot. 
cysmass <- calculateMass(\"C6H12N2O4S2\") cys_endo <- mass2mz(cysmass, adduct = \"[M+H]+\")[, 1] #' Plot versus spiked par(mfrow = c(1, 2)) chromatogram(data, mz = cys_endo + c(-0.005, 0.005), rt = unlist(fData(eic_cystine)[, c(\"rtmin\", \"rtmax\")]), aggregationFun = \"max\") |> plot(col = paste0(col_sample, 80)) |> grid() plot(eic_cystine, col = paste0(col_sample, 80)) grid() legend(\"topright\", col = col_phenotype, legend = names(col_phenotype), lty = 1, bty = \"n\")"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"data-preprocessing","dir":"Articles","previous_headings":"","what":"Data preprocessing","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"Preprocessing stands inaugural step analysis untargeted LC-MS. characterized 3 main stages: chromatographic peak detection, retention time shift correction (alignment) correspondence results features defined. primary objective preprocessing quantification signals ions measured sample, addressing potential retention time drifts samples, ensuring alignment quantified signals across samples within experiment. final result LC-MS data preprocessing numeric matrix abundances quantified entities samples experiment.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"chromatographic-peak-detection","dir":"Articles","previous_headings":"Data preprocessing","what":"Chromatographic peak detection","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"initial preprocessing step involves detecting intensity peaks along retention time axis, called chromatographic peaks. 
achieve , employ findChromPeaks() function within xcms. function supports various algorithms peak detection, can selected configured respective parameter objects. preferred algorithm case, CentWave, utilizes continuous wavelet transformation (CWT)-based peak detection (Tautenhahn, Böttcher, Neumann 2008). method known effectiveness handling non-Gaussian shaped chromatographic peaks peaks varying retention time widths, commonly encountered HILIC separations. apply CentWave algorithm default settings extracted ion chromatogram cystine methionine ions evaluate results. CentWave highly performant algorithm, requires customized dataset. implies parameters fine-tuned based user’s data. example serves clear motivation users familiarize various parameters need adapt data set. discuss main parameters can easily adjusted suit user’s dataset: peakwidth: Specifies minimal maximal expected width peaks retention time dimension. Highly dependent chromatographic settings used. ppm: maximal allowed difference mass peaks’ m/z values (parts-per-million) consecutive scans consider representing signal ion. integrate: parameter defines integration method. , primarily use integrate = 2 integrates also signal chromatographic peak’s tail considered accurate developers. determine peakwidth, recommend users refer previous EICs estimate range peak widths observe dataset. Ideally, examining multiple EICs goal. dataset, peak widths appear around 2 10 seconds. advise choosing range wide narrow peakwidth parameter can lead false positives negatives. determine ppm, deeper analysis dataset needed. clarified ppm depends instrument, users necessarily input vendor-advertised ppm. ’s determine accurately possible: following steps involve generating highly restricted MS area single mass peak per spectrum, representing cystine ion. m/z peaks extracted, absolute difference calculated finally expressed ppm. therefore, choose value close maximum within range parameter ppm, .e., 15 ppm. 
can now perform chromatographic peak detection adapted settings EICs. important note , properly estimate background noise, sufficient data points outside chromatographic peak need present. generally problem peak detection performed full LC-MS data set, peak detection EICs retention time range EIC needs sufficiently wide. function fails find peak EIC, initial troubleshooting step increase range. Additionally, signal--noise threshold snthresh reduced peak detection EICs, within small retention time range, enough signal present properly estimate background noise. Finally, case MS1 data points per peaks, setting CentWave’s advanced parameter extendLengthMSW TRUE can help peak detection. customized parameters, chromatographic peak detected sample. , use plot() function EICs visualize results. can see peak seems detected sample ions. indicates custom settings seem thus suitable dataset. now proceed apply entire dataset, extracting EICs ions evaluate confirm chromatographic peak detection worked expected. Note: revert value parameter snthresh default, , mentioned , background noise estimation reliable performed full data set. Parameter chunkSize findChromPeaks() defines number data files loaded memory processed simultaneously. parameter thus allows fine-tune memory demand well performance chromatographic peak detection step. plot EICs two selected internal standards evaluate chromatographic peak detection results. Peaks seem detected properly samples ions. 
indicates peak detection process entire dataset successful.","code":"#' Use default Centwave parameter param <- CentWaveParam() #' Look at the default parameters param ## Object of class: CentWaveParam ## Parameters: ## - ppm: [1] 25 ## - peakwidth: [1] 20 50 ## - snthresh: [1] 10 ## - prefilter: [1] 3 100 ## - mzCenterFun: [1] \"wMean\" ## - integrate: [1] 1 ## - mzdiff: [1] -0.001 ## - fitgauss: [1] FALSE ## - noise: [1] 0 ## - verboseColumns: [1] FALSE ## - roiList: list() ## - firstBaselineCheck: [1] TRUE ## - roiScales: numeric(0) ## - extendLengthMSW: [1] FALSE ## - verboseBetaColumns: [1] FALSE #' Evaluate for Cystine cystine_test <- findChromPeaks(eic_cystine, param = param) chromPeaks(cystine_test) ## rt rtmin rtmax into intb maxo sn row column #' Evaluate for Methionine met_test <- findChromPeaks(eic_met, param = param) chromPeaks(met_test) ## rt rtmin rtmax into intb maxo sn row column #' Restrict the data to signal from cystine in the first sample cst <- data[1L] |> spectra() |> filterRt(rt = c(208, 218)) |> filterMzRange(mz = fData(eic_cystine)[\"cystine_13C_15N\", c(\"mzmin\", \"mzmax\")]) #' Show the number of peaks per m/z filtered spectra lengths(cst) ## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 #' Calculate the difference in m/z values between scans mz_diff <- cst |> mz() |> unlist() |> diff() |> abs() #' Express differences in ppm range(mz_diff * 1e6 / mean(unlist(mz(cst)))) ## [1] 0.08829605 14.82188728 #' Parameters adapted for chromatographic peak detection on EICs. 
param <- CentWaveParam(peakwidth = c(1, 8), ppm = 15, integrate = 2, snthresh = 2) #' Evaluate on the cystine ion cystine_test <- findChromPeaks(eic_cystine, param = param) chromPeaks(cystine_test) ## rt rtmin rtmax into intb maxo sn row column ## [1,] 209.251 207.577 212.878 4085.675 2911.376 2157.459 4 1 1 ## [2,] 209.251 206.182 213.995 24625.728 19074.407 12907.487 4 1 2 ## [3,] 209.252 207.020 214.274 19467.836 14594.041 9996.466 4 1 3 ## [4,] 209.251 207.577 212.041 4648.229 3202.617 2458.485 3 1 4 ## [5,] 208.974 206.184 213.159 23801.825 18126.978 11300.289 3 1 5 ## [6,] 209.250 207.018 213.714 25990.327 21036.768 13650.329 5 1 6 ## [7,] 209.252 207.857 212.879 4528.767 3259.039 2445.841 4 1 7 ## [8,] 209.252 207.299 213.995 23119.449 17274.140 12153.410 4 1 8 ## [9,] 208.972 206.740 212.878 28943.188 23436.119 14451.023 4 1 9 ## [10,] 209.252 207.578 213.437 4470.552 3065.402 2292.881 4 1 10 #' Evaluate on the methionine ion met_test <- findChromPeaks(eic_met, param = param) chromPeaks(met_test) ## rt rtmin rtmax into intb maxo sn row column ## [1,] 159.867 157.913 162.378 20026.61 14715.42 12555.601 4 1 1 ## [2,] 160.425 157.077 163.215 16827.76 11843.39 8407.699 3 1 2 ## [3,] 160.425 157.356 163.215 18262.45 12881.67 9283.375 3 1 3 ## [4,] 159.588 157.635 161.820 20987.72 15424.25 13327.811 4 1 4 ## [5,] 160.985 156.799 163.217 16601.72 11968.46 10012.396 4 1 5 ## [6,] 160.982 157.634 163.214 17243.24 12389.94 9150.079 4 1 6 ## [7,] 159.867 158.193 162.099 21120.10 16202.05 13531.844 3 1 7 ## [8,] 160.426 157.356 162.937 18937.40 13739.73 10336.000 3 1 8 ## [9,] 160.704 158.472 163.215 17882.21 12299.43 9395.548 3 1 9 ## [10,] 160.146 157.914 162.379 20275.80 14279.50 12669.821 3 1 10 #' Using the same settings, but with default snthresh param <- CentWaveParam(peakwidth = c(1, 8), ppm = 15, integrate = 2) data <- findChromPeaks(data, param = param, chunkSize = 5) #' Update EIC internal standard object eics_is_noprocess <- eic_is eic_is <- 
chromatogram(data, rt = as.matrix(intern_standard[, c(\"rtmin\", \"rtmax\")]), mz = as.matrix(intern_standard[, c(\"mzmin\", \"mzmax\")])) fData(eic_is) <- fData(eics_is_noprocess)"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"refine-identified-chromatographic-peaks","dir":"Articles","previous_headings":"Data preprocessing > Chromatographic peak detection","what":"Refine identified chromatographic peaks","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"identification chromatographic peaks using CentWave algorithm can sometimes result artifacts, overlapping split peaks. address issue, refineChromPeaks() function utilized, conjunction MergeNeighboringPeaksParam, aims merging split peaks. show examples CentWave peak detection artifacts. examples pre-selected illustrate necessity next step: cases signal presumably single type ion split two separate chromatographic peaks (indicated vertical line). MergeNeighboringPeaksParam allows combine split peaks. parameters algorithm defined : expandMz: Suggested kept relatively small (0.0015) prevent merging isotopes. expandRt: Usually set approximately half size average retention time width used chromatographic peak detection (case, 2.5 seconds). minProp: Used determine whether candidates merged. Chromatographic peaks overlapping m/z ranges (expanded side expandMz) tail--head distance retention time dimension less 2 * expandRt, signal higher minProp apex intensity chromatographic peak lower intensity, merged. Values parameter small avoid merging closely co-eluting ions, isomers. test settings EICs split peaks. can observe artificially split peaks appropriately merged. Therefore, next apply settings entire dataset. peak merging, column \"merged\" result object’s chromPeakData() data frame can used evaluate chromatographic peaks result represent signal merged, originally identified chromatographic peaks. 
proceeding next preprocessing step generally suggested evaluate results chromatographic peak detection EICs e.g. internal standards compounds/ions known present samples. Additionally, evaluating comparing number identified chromatographic peaks samples data set can help spotting potentially problematic samples. count number chromatographic peaks per sample show numbers table. Table 4.Samples number identified chromatographic peaks. similar number chromatographic peaks identified within various samples data set. Additional options evaluate results chromatographic peak detection can implemented using plotChromPeaks() function summarizing results using base R commands.","code":"#' set up the parameter param <- MergeNeighboringPeaksParam(expandRt = 2.5, expandMz = 0.0015, minProp = 0.75) #' Perform the peak refinement on the EICs eics <- refineChromPeaks(eics, param = param) plot(eics) #' Apply on whole dataset data <- refineChromPeaks(data, param = param, chunkSize = 5) chromPeakData(data)$merged |> table() ## ## FALSE TRUE ## 79908 9274 eics_is_chrompeaks <- eic_is eic_is <- chromatogram(data, rt = as.matrix(intern_standard[, c(\"rtmin\", \"rtmax\")]), mz = as.matrix(intern_standard[, c(\"mzmin\", \"mzmax\")])) fData(eic_is) <- fData(eics_is_chrompeaks) eic_cystine <- eic_is[\"cystine_13C_15N\", ] eic_met <- eic_is[\"methionine_13C_15N\", ] #' Count the number of peaks per sample and summarize them in a table. 
data.frame(sample_name = sampleData(data)$sample_name, peak_count = as.integer(table(chromPeaks(data)[, \"sample\"]))) |> pandoc.table( style = \"rmarkdown\", caption = \"Table 4.Samples and number of identified chromatographic peaks.\")"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"retention-time-alignment","dir":"Articles","previous_headings":"Data preprocessing","what":"Retention time alignment","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"Despite using chromatographic settings conditions retention time shifts unavoidable. Indeed, performance instrument can change time, example due small variations environmental conditions, temperature pressure. shifts generally small samples measured within batch/measurement run, can considerable data experiment acquired across longer time period. evaluate presence shift extract plot BPC QC samples. QC samples representing sample (pool) measured regular intervals measurement run experiment measured day. Still, small shifts can observed, especially region 100 150 seconds. facilitate proper correspondence signals across samples (hence definition LC-MS features), essential minimize differences retention times. Theoretically, proceed two steps: first select QC samples dataset first alignment , using -called anchor peaks. way can assume linear shift time, since always measuring sample different regular time intervals. Despite external QCs data set, still use subset-based alignment assuming retention time shifts independent different sample matrix (human serum plasma) instead mostly instrument-dependent. Note also possible manually specify anchor peaks, respectively retention times align data set external, reference, data set. information provided vignettes xcms package. calculating much adjust retention time samples, apply shift also study samples. 
xcms retention time alignment can performed using adjustRtime() function alignment algorithm. example use PeakGroups method (Smith et al. 2006) performs alignment minimizing differences retention times set anchor peaks different samples. method requires initial correspondence analysis match/group chromatographic peaks across samples algorithm selects anchor peaks alignment. initial correspondence, use PeakDensity approach (Smith et al. 2006) groups chromatographic peaks similar m/z retention time LC-MS features. parameters algorithm, can configured using PeakDensityParam object, sampleGroups, minFraction, binSize, ppm bw. binSize, ppm bw allow specify similar chromatographic peaks’ m/z retention time values need consider grouping feature. binSize ppm define required similarity m/z values. Within m/z bin (defined binSize ppm) areas along retention time axis high chromatographic peak density (considering peaks samples) identified, chromatographic peaks within regions considered grouping feature. High density areas identified using base R density() function, bw parameter: higher values define wider retention time areas, lower values require chromatographic peaks similar retention times. parameter can seen black line plot , corresponding smoothness density curve. Whether candidate peaks get grouped feature depends also parameters sampleGroups minFraction: sampleGroups provide, sample, sample group belongs . minFraction expected value 0 1 defining proportion samples within least one sample groups (defined sampleGroups) chromatographic peaks detected group feature. initial correspondence, parameters don’t need fully optimized. Selection dataset-specific parameter values described detail next section. dataset, use small values binSize ppm , importantly, also parameter bw, since data set ultra high performance (UHP) LC setup used. 
minFraction use high value (0.9) ensure features defined chromatographic peaks present almost samples one sample group (can used anchor peaks actual alignment). base alignment later QC samples hence define sampleGroups binary variable grouping samples either study, QC group. PeakGroups-based alignment can next performed using adjustRtime() function PeakGroupsParam parameter object. parameters algorithm : subsetAdjust subset: Allows subset alignment. base retention time alignment QC samples, .e., retention time shifts estimated based repeatedly measured samples. resulting adjustment applied entire data. data sets QC samples (e.g. sample pools) measured repeatedly, strongly suggest use method. Note also subset-based alignment samples ordered injection index (.e., order measured measurement run). minFraction: value 0 1 defining proportion samples (full data set, data subset defined subset) chromatographic peak identified use anchor peak. contrast PeakDensityParam parameter used define proportion within sample group. span: PeakGroups method allows, depending data, adjust regions along retention time axis differently. enable local alignments LOESS function used parameter defines degree smoothing function. Generally, values 0.4 0.6 used, however, suggested evaluate alignment results eventually adapt parameters result satisfactory. perform alignment data set based retention times anchor peaks defined subset QC samples. Alignment adjusted retention times spectra data set, well retention times identified chromatographic peaks. alignment performed, user evaluate results using plotAdjustedRtime() function. function visualizes difference adjusted raw retention time sample y-axis along adjusted retention time x-axis. Dot points represent position used anchor peak along retention time axis. optimal alignment areas along retention time axis, anchor peaks scattered retention time dimension. 
samples present data set measured within measurement run, resulting small retention time shifts. Therefore, little adjustments needed performed (shifts maximum 1 second can seen plot ). Generally, magnitude adjustment seen plots match expectation analyst. can also compare BPC alignment. get original data, .e. raw retention times, can use dropAdjustedRtime() function: largest shift can observed retention time range 120 130s. Apart retention time range, little changes can observed. next evaluate impact alignment EICs selected internal standards. thus first extract ion chromatograms alignment. can now evaluate alignment effect test ions. plot EICs alignment isotope labeled cystine methionine. non-endogenous cystine ion already well aligned difference minimal. methionine ion, however, shows improvement alignment. addition visual inspection results, also evaluate impact alignment comparing variance retention times internal standards alignment. end, first need identify chromatographic peaks sample m/z retention time close expected values internal standard. use matchValues() function MetaboAnnotation package (Rainer et al. 2022) using MzRtParam method identify chromatographic peaks similar m/z (+/- 50 ppm) retention time (+/- 10 seconds) internal standard’s values. parameters mzColname rtColname specify column names query () target (chromatographic peaks) contain m/z retention time values match entities. perform matching separately sample. internal standard every sample, use filterMatches() function SingleMatchParam() parameter select chromatographic peak highest intensity. now internal standard ID chromatographic peak sample likely represents signal ion. can now extract retention times chromatographic peaks alignment. 
can now evaluate impact alignment retention times internal standards across full data set: average, variation retention times internal standards across samples slightly reduced alignment.","code":"#' Get QC samples QC_samples <- sampleData(data)$phenotype == \"QC\" #' extract BPC data[QC_samples] |> chromatogram(aggregationFun = \"max\", chromPeaks = \"none\") |> plot(col = col_phenotype[\"QC\"], main = \"BPC of QC samples\") |> grid() # Initial correspondence analysis param <- PeakDensityParam(sampleGroups = sampleData(data)$phenotype == \"QC\", minFraction = 0.9, binSize = 0.01, ppm = 10, bw = 2) data <- groupChromPeaks(data, param = param) plotChromPeakDensity( eic_cystine, param = param, col = paste0(col_sample, \"80\"), peakCol = col_sample[chromPeaks(eic_cystine)[, \"sample\"]], peakBg = paste0(col_sample[chromPeaks(eic_cystine)[, \"sample\"]], 20), peakPch = 16) #' Define parameters of choice subset <- which(sampleData(data)$phenotype == \"QC\") param <- PeakGroupsParam(minFraction = 0.9, extraPeaks = 50, span = 0.5, subsetAdjust = \"average\", subset = subset) #' Perform the alignment data <- adjustRtime(data, param = param) #' Visualize alignment results plotAdjustedRtime(data, col = paste0(col_sample, 80), peakGroupsPch = 1) grid() legend(\"topright\", col = col_phenotype, legend = names(col_phenotype), lty = 1, bty = \"n\") #' Get data before alignment data_raw <- dropAdjustedRtime(data) #' Apply the adjusted retention time to our dataset data <- applyAdjustedRtime(data) #' Plot the BPC before and after alignment par(mfrow = c(2, 1), mar = c(2, 1, 1, 0.5)) chromatogram(data_raw, aggregationFun = \"max\", chromPeaks = \"none\") |> plot(main = \"BPC before alignment\", col = paste0(col_sample, 80)) grid() legend(\"topright\", col = col_phenotype, legend = names(col_phenotype), lty = 1, bty = \"n\", horiz = TRUE) chromatogram(data, aggregationFun = \"max\", chromPeaks = \"none\") |> plot(main = \"BPC after 
alignment\", col = paste0(col_sample, 80)) grid() legend(\"topright\", col = col_phenotype, legend = names(col_phenotype), lty = 1, bty = \"n\", horiz = TRUE) #' Store the EICs before alignment eics_is_refined <- eic_is #' Update the EICs eic_is <- chromatogram(data, rt = as.matrix(intern_standard[, c(\"rtmin\", \"rtmax\")]), mz = as.matrix(intern_standard[, c(\"mzmin\", \"mzmax\")])) fData(eic_is) <- fData(eics_is_refined) #' Extract the EICs for the test ions eic_cystine <- eic_is[\"cystine_13C_15N\"] eic_met <- eic_is[\"methionine_13C_15N\"] par(mfrow = c(2, 2), mar = c(4, 4.5, 2, 1)) old_eic_cystine <- eics_is_refined[\"cystine_13C_15N\"] plot(old_eic_cystine, main = \"Cystine before alignment\", peakType = \"none\", col = paste0(col_sample, 80)) grid() abline(v = intern_standard[\"cystine_13C_15N\", \"RT\"], col = \"red\", lty = 3) old_eic_met <- eics_is_refined[\"methionine_13C_15N\"] plot(old_eic_met, main = \"Methionine before alignment\", peakType = \"none\", col = paste0(col_sample, 80)) grid() abline(v = intern_standard[\"methionine_13C_15N\", \"RT\"], col = \"red\", lty = 3) plot(eic_cystine, main = \"Cystine after alignment\", peakType = \"none\", col = paste0(col_sample, 80)) grid() abline(v = intern_standard[\"cystine_13C_15N\", \"RT\"], col = \"red\", lty = 3) legend(\"topright\", col = col_phenotype, legend = names(col_phenotype), lty = 1, bty = \"n\") plot(eic_met, main = \"Methionine after alignment\", peakType = \"none\", col = paste0(col_sample, 80)) grid() abline(v = intern_standard[\"methionine_13C_15N\", \"RT\"], col = \"red\", lty = 3) #' Extract the matrix with all chromatographic peaks and add a column #' with the ID of the chromatographic peak chrom_peaks <- chromPeaks(data) |> as.data.frame() chrom_peaks$peak_id <- rownames(chrom_peaks) #' Define the parameters for the matching and filtering of the matches p_1 <- MzRtParam(ppm = 50, toleranceRt = 10) p_2 <- SingleMatchParam(duplicates = \"top_ranked\", column = \"target_maxo\", 
decreasing = TRUE) #' Iterate over samples and identify for each the chromatographic peaks #' with similar m/z and retention time than the ones from the internal #' standard, and extract among them the ID of the peaks with the #' highest intensity. intern_standard_peaks <- lapply(seq_along(data), function(i) { tmp <- chrom_peaks[chrom_peaks[, \"sample\"] == i, , drop = FALSE] mtch <- matchValues(intern_standard, tmp, mzColname = c(\"mz\", \"mz\"), rtColname = c(\"RT\", \"rt\"), param = p_1) mtch <- filterMatches(mtch, p_2) mtch$target_peak_id }) |> do.call(what = cbind) #' Define the index of the selected chromatographic peaks in the #' full chromPeaks matrix idx <- match(intern_standard_peaks, rownames(chromPeaks(data))) #' Extract the raw retention times for these rt_raw <- chromPeaks(data_raw)[idx, \"rt\"] |> matrix(ncol = length(data_raw)) #' Extract the adjusted retention times for these rt_adj <- chromPeaks(data)[idx, \"rt\"] |> matrix(ncol = length(data_raw)) list(all_raw = rowSds(rt_raw, na.rm = TRUE), all_adj = rowSds(rt_adj, na.rm = TRUE) ) |> vioplot(ylab = \"sd(retention time)\") grid()"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"correspondence","dir":"Articles","previous_headings":"Data preprocessing","what":"Correspondence","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"briefly touched subject correspondence determine anchor peaks alignment. Generally, goal correspondence analysis identify chromatographic peaks originate types ions samples experiment group LC-MS features. point, proper configuration parameter bw crucial. illustrate sensible choices parameter’s value can made. use plotChromPeakDensity() function simulate correspondence analysis default values PeakGroups extracted ion chromatograms two selected isotope labeled ions. 
plot shows EIC top panel, apex position chromatographic peaks different samples (y-axis), along retention time (x-axis) lower panel. Grouping peaks depends smoothness previously mentioned density curve can configured parameter bw. seen , smoothness high properly group features. looking default parameters, can observe indeed, bw parameter set bw = 30, high modern UHPLC-MS setups. reduce value parameter 1.8 evaluate impact. can observe peaks now grouped accurately single feature test ion. important parameters optimized : binSize: data generated high resolution MS instrument, thus select low value parameter. ppm: TOF instruments, suggested use value ppm larger 0 accommodate higher measurement error instrument larger m/z values. minFraction: set minFraction = 0.75, hence defining features chromatographic peak identified least 75% samples one sample groups. sampleGroups: use information available sampleData’s \"phenotype\" column. correspondence analysis suggested evaluate results selected EICs. extract signal m/z similar isotope labeled methionine larger retention time range. Importantly, show actual correspondence results, set simulate = FALSE plotChromPeakDensity() function. hoped, signal two different ions now grouped separate features. Generally, correspondence results evaluated extracted chromatograms. Another interesting information look distribution features along retention time axis. Table 5.Distribution features along retention time axis (seconds. (continued ) Table continues results correspondence analysis now stored, along results preprocessing steps, within XcmsExperiment result object. correspondence results, .e., definition LC-MS features, can extracted using featureDefinitions() function. data frame provides average m/z retention time (columns \"mzmed\" \"rtmed\") characterize LC-MS feature. Column, \"peakidx\" contains indices chromatographic peaks assigned feature. 
actual abundances features, represent also final preprocessing results, can extracted featureValues() function: can note features (e.g. F0003 F0006) missing values samples. expected certain degree samples features, respectively ions, need present. address next section.","code":"#' Default parameter for the grouping and apply them to the test ions BPC param <- PeakDensityParam(sampleGroups = sampleData(data)$phenotype, bw = 30) param ## Object of class: PeakDensityParam ## Parameters: ## - sampleGroups: [1] \"QC\" \"CVD\" \"CTR\" \"QC\" \"CTR\" \"CVD\" \"QC\" \"CTR\" \"CVD\" \"QC\" ## - bw: [1] 30 ## - minFraction: [1] 0.5 ## - minSamples: [1] 1 ## - binSize: [1] 0.25 ## - maxFeatures: [1] 50 ## - ppm: [1] 0 plotChromPeakDensity( eic_cystine, param = param, col = paste0(col_sample, \"80\"), peakCol = col_sample[chromPeaks(eic_cystine)[, \"sample\"]], peakBg = paste0(col_sample[chromPeaks(eic_cystine)[, \"sample\"]], 20), peakPch = 16) plotChromPeakDensity(eic_met, param = param, col = paste0(col_sample, \"80\"), peakCol = col_sample[chromPeaks(eic_met)[, \"sample\"]], peakBg = paste0(col_sample[chromPeaks(eic_met)[, \"sample\"]], 20), peakPch = 16) #' Updating parameters param <- PeakDensityParam(sampleGroups = sampleData(data)$phenotype, bw = 1.8) plotChromPeakDensity( eic_cystine, param = param, col = paste0(col_sample, \"80\"), peakCol = col_sample[chromPeaks(eic_cystine)[, \"sample\"]], peakBg = paste0(col_sample[chromPeaks(eic_cystine)[, \"sample\"]], 20), peakPch = 16) plotChromPeakDensity(eic_met, param = param, col = paste0(col_sample, \"80\"), peakCol = col_sample[chromPeaks(eic_met)[, \"sample\"]], peakBg = paste0(col_sample[chromPeaks(eic_met)[, \"sample\"]], 20), peakPch = 16) #' Define the settings for the param param <- PeakDensityParam(sampleGroups = sampleData(data)$phenotype, minFraction = 0.75, binSize = 0.01, ppm = 10, bw = 1.8) #' Apply to whole data data <- groupChromPeaks(data, param = param) #' Extract chromatogram for an m/z similar to the 
one of the labeled methionine chr_test <- chromatogram(data, mz = as.matrix(intern_standard[\"methionine_13C_15N\", c(\"mzmin\", \"mzmax\")]), rt = c(145, 200), aggregationFun = \"max\") plotChromPeakDensity( chr_test, simulate = FALSE, col = paste0(col_sample, \"80\"), peakCol = col_sample[chromPeaks(chr_test)[, \"sample\"]], peakBg = paste0(col_sample[chromPeaks(chr_test)[, \"sample\"]], 20), peakPch = 16) # Bin features per RT slices vc <- featureDefinitions(data)$rtmed breaks <- seq(0, max(vc, na.rm = TRUE) + 1, length.out = 15) |> round(0) cuts <- cut(vc, breaks = breaks, include.lowest = TRUE) table(cuts) |> pandoc.table( style = \"rmarkdown\", caption = \"Table 5.Distribution of features along the retention time axis (in seconds.\") #' Definition of the features featureDefinitions(data) |> head() ## mzmed mzmin mzmax rtmed rtmin rtmax npeaks CTR CVD QC ## FT0001 50.98979 50.98949 50.99038 203.6001 203.1181 204.2331 8 1 3 4 ## FT0002 51.05904 51.05880 51.05941 191.1675 190.8787 191.5050 9 2 3 4 ## FT0003 51.98657 51.98631 51.98699 203.1467 202.6406 203.6710 7 0 3 4 ## FT0004 53.02036 53.02009 53.02043 203.2343 202.5652 204.0901 10 3 3 4 ## FT0005 53.52080 53.52051 53.52102 203.1936 202.8490 204.0901 10 3 3 4 ## FT0006 54.01007 54.00988 54.01015 159.2816 158.8499 159.4484 6 1 3 2 ## peakidx ms_level ## FT0001 7702, 16.... 1 ## FT0002 7176, 16.... 1 ## FT0003 7680, 17.... 1 ## FT0004 7763, 17.... 1 ## FT0005 8353, 17.... 1 ## FT0006 5800, 15.... 
1 #' Extract feature abundances featureValues(data, method = \"sum\") |> head() ## MS_QC_POOL_1_POS.mzML MS_A_POS.mzML MS_B_POS.mzML MS_QC_POOL_2_POS.mzML ## FT0001 421.6162 689.2422 NA 481.7436 ## FT0002 710.8078 875.9192 NA 693.6997 ## FT0003 445.5711 613.4410 NA 497.8866 ## FT0004 16994.5260 24605.7340 19766.707 17808.0933 ## FT0005 3284.2664 4526.0531 3521.822 3379.8909 ## FT0006 10681.7476 10009.6602 NA 10800.5449 ## MS_C_POS.mzML MS_D_POS.mzML MS_QC_POOL_3_POS.mzML MS_E_POS.mzML ## FT0001 NA 635.2732 439.6086 570.5849 ## FT0002 781.2416 648.4344 700.9716 1054.0207 ## FT0003 NA 634.9370 449.0933 NA ## FT0004 22780.6683 22873.1061 16965.7762 23432.1252 ## FT0005 4396.0762 4317.7734 3270.5290 4533.8667 ## FT0006 NA 7296.4262 NA 9236.9799 ## MS_F_POS.mzML MS_QC_POOL_4_POS.mzML ## FT0001 579.9360 437.0340 ## FT0002 534.4577 711.0361 ## FT0003 461.0465 232.1075 ## FT0004 22198.4607 16796.4497 ## FT0005 4161.0132 3142.2268 ## FT0006 6817.8785 NA"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"gap-filling","dir":"Articles","previous_headings":"Data preprocessing","what":"Gap filling","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"previously observed missing values (NA) attributed various reasons. Although might represent genuinely missing value, indicating ion (feature) truly present particular sample, also result failure preceding chromatographic peak detection step. crucial able recover missing values latter category much possible reduce eventual need data imputation. next examine prevalent missing values present dataset: can observe substantial number missing values values dataset. Let’s therefore delve process gap-filling. first evaluate example features chromatographic peak detected samples: instances, chromatographic peak identified one two selected samples (red line), hence missing value reported feature particular samples (blue line). 
However, cases, signal measured samples, thus, reporting missing value correct example. signal feature low, likely reason peak detection failed. rescue signal cases, fillChromPeaks() function can used ChromPeakAreaParam approach. method defines m/z-retention time area feature based detected peaks, signal respective ion expected. integrates intensities within area samples missing values feature. reported feature abundance. apply method using default parameters. fillChromPeaks() thus rescue missing data data set. Note , even sample ion present, worst case noise integrated, expected much lower actual chromatographic peak signal. Let’s look previously missing values : gap-filling, also blue colored sample chromatographic peak present peak area reported feature abundance sample. assess effectiveness gap-filling method rescuing signals, can also plot average signal features least one missing value average filled-signal. advisable perform analysis repeatedly measured samples; case, QC samples used. , extract: Feature values detected chromatographic peaks setting filled = FALSE featuresValues() call. filled-signal first extracting detected gap-filled abundances replace values detected chromatographic peaks NA. , calculate row averages matrices plot . detected (x-axis) gap-filled (y-axis) values QC samples highly correlated. Especially higher abundances, agreement high, low intensities, can expected, differences higher trending correlation line. , addition, fit linear regression line data summarize results linear regression line slope 1.12 intercept -1.62. 
indicates filled-signal average 1.12 times higher detected signal.","code":"#' Percentage of missing values sum(is.na(featureValues(data))) / length(featureValues(data)) * 100 ## [1] 26.41597 ftidx <- which(is.na(rowSums(featureValues(data)))) fts <- rownames(featureDefinitions(data))[ftidx] farea <- featureArea(data, features = fts[1:2]) chromatogram(data[c(2, 3)], rt = farea[, c(\"rtmin\", \"rtmax\")], mz = farea[, c(\"mzmin\", \"mzmax\")]) |> plot(col = c(\"red\", \"blue\"), lwd = 2) #' Fill in the missing values in the whole dataset data <- fillChromPeaks(data, param = ChromPeakAreaParam(), chunkSize = 5) #' Percentage of missing values after gap-filling sum(is.na(featureValues(data))) / length(featureValues(data)) * 100 ## [1] 5.155492 #' Get only detected signal in QC samples vals_detect <- featureValues(data, filled = FALSE)[, QC_samples] #' Get detected and filled-in signal vals_filled <- featureValues(data)[, QC_samples] #' Replace detected signal with NA vals_filled[!is.na(vals_detect)] <- NA #' Identify features with at least one filled peak has_filled <- is.na(rowSums(vals_detect)) #' Calculate row averages for features with missing values avg_detect <- rowMeans(vals_detect[has_filled, ], na.rm = TRUE) avg_filled <- rowMeans(vals_filled[has_filled, ], na.rm = TRUE) #' Plot the values against each other (in log2 scale) plot(log2(avg_detect), log2(avg_filled), xlim = range(log2(c(avg_detect, avg_filled)), na.rm = TRUE), ylim = range(log2(c(avg_detect, avg_filled)), na.rm = TRUE), pch = 21, bg = \"#00000020\", col = \"#00000080\") grid() abline(0, 1) #' fit a linear regression line to the data l <- lm(log2(avg_filled) ~ log2(avg_detect)) summary(l) ## ## Call: ## lm(formula = log2(avg_filled) ~ log2(avg_detect)) ## ## Residuals: ## Min 1Q Median 3Q Max ## -6.8176 -0.3807 0.1725 0.5492 6.7504 ## ## Coefficients: ## Estimate Std. 
Error t value Pr(>|t|) ## (Intercept) -1.62359 0.11545 -14.06 <2e-16 *** ## log2(avg_detect) 1.11763 0.01259 88.75 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.9366 on 2846 degrees of freedom ## (846 observations deleted due to missingness) ## Multiple R-squared: 0.7346, Adjusted R-squared: 0.7345 ## F-statistic: 7877 on 1 and 2846 DF, p-value: < 2.2e-16"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"preprocessing-results","dir":"Articles","previous_headings":"Data preprocessing","what":"Preprocessing results","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"final results LC-MS data preprocessing stored within XcmsExperiment object. includes identified chromatographic peaks, alignment results, well correspondence results. addition, guarantee reproducibility, result object keeps track performed processing steps, including individual parameter objects used configure . processHistory() function returns list various applied processing steps chronological order. , extract information first step performed preprocessing. processParam() function used extract actual parameter class used configure processing step. final result whole LC-MS data preprocessing two-dimensional matrix abundances -called LC-MS features samples. Note stage analysis features characterized m/z retention time don’t yet information metabolite feature represent. seen , feature matrix can extracted featureValues() function corresponding feature characteristics (.e., m/z retention time values) using featureDefinitions() function. Thus, two arrays extracted xcms result object used/imported analysis packages processing. example also exported tab delimited text files, used external tool, used, also MS2 spectra available, feature-based molecular networking GNPS analysis environment (Nothias et al. 2020). 
processing R, reference link raw MS data required, suggested extract xcms preprocessing result using quantify() function SummarizedExperiment object, Bioconductor’s default container data biological assays/experiments. simplifies integration Bioconductor analysis packages. quantify() function takes parameters featureValues() function, thus, call extract SummarizedExperiment detected, gap-filled, feature abundances: Sample identifications xcms result’s sampleData() now available colData() (column, sample annotations) featureDefinitions() rowData() (row, feature annotations). feature values added first assay() SummarizedExperiment even processing history available object’s metadata(). SummarizedExperiment supports multiple assays, numeric matrices dimensions. thus add detected gap-filled feature abundances additional assay SummarizedExperiment. Feature abundances can extracted assay() function. extract first 6 lines detected gap-filled feature abundances: advantage, addition container full preprocessing results also possibility easy intuitive creation data subsets ensuring data integrity. example easy subset full data selection features /samples: XcmsExperiment object can also saved later use using storeResults() function. data can exported different formats, enable easier integration non-R-based software. Currently, possible export data R-specific RData format (separate) plain text files. Export community-developed open mzTab-M format currently developed supported future. 
export xcms result object R’s default binary format object serialization.","code":"#' Check first step of the process history processHistory(data)[[1]] ## Object of class \"XProcessHistory\" ## type: Peak detection ## date: Mon Sep 30 13:06:57 2024 ## info: ## fileIndex: 1,2,3,4,5,6,7,8,9,10 ## Parameter class: CentWaveParam ## MS level(s) 1 #' Extract results as a SummarizedExperiment res <- quantify(data, method = \"sum\", filled = FALSE) res ## class: SummarizedExperiment ## dim: 9068 10 ## metadata(6): '' '' ... '' '' ## assays(1): raw ## rownames(9068): FT0001 FT0002 ... FT9067 FT9068 ## rowData names(11): mzmed mzmin ... QC ms_level ## colnames(10): MS_QC_POOL_1_POS.mzML MS_A_POS.mzML ... MS_F_POS.mzML ## MS_QC_POOL_4_POS.mzML ## colData names(11): sample_name derived_spectra_data_file ... phenotype ## injection_index assays(res)$raw_filled <- featureValues(data, method = \"sum\", filled = TRUE ) #' Different assay in the SummarizedExperiment object assayNames(res) ## [1] \"raw\" \"raw_filled\" assay(res, \"raw_filled\") |> head() ## MS_QC_POOL_1_POS.mzML MS_A_POS.mzML MS_B_POS.mzML MS_QC_POOL_2_POS.mzML ## FT0001 421.6162 689.2422 411.3295 481.7436 ## FT0002 710.8078 875.9192 457.5920 693.6997 ## FT0003 445.5711 613.4410 277.5022 497.8866 ## FT0004 16994.5260 24605.7340 19766.7069 17808.0933 ## FT0005 3284.2664 4526.0531 3521.8221 3379.8909 ## FT0006 10681.7476 10009.6602 9599.9701 10800.5449 ## MS_C_POS.mzML MS_D_POS.mzML MS_QC_POOL_3_POS.mzML MS_E_POS.mzML ## FT0001 314.7567 635.2732 439.6086 570.5849 ## FT0002 781.2416 648.4344 700.9716 1054.0207 ## FT0003 425.3774 634.9370 449.0933 556.2544 ## FT0004 22780.6683 22873.1061 16965.7762 23432.1252 ## FT0005 4396.0762 4317.7734 3270.5290 4533.8667 ## FT0006 4792.2390 7296.4262 2382.1788 9236.9799 ## MS_F_POS.mzML MS_QC_POOL_4_POS.mzML ## FT0001 579.9360 437.0340 ## FT0002 534.4577 711.0361 ## FT0003 461.0465 232.1075 ## FT0004 22198.4607 16796.4497 ## FT0005 4161.0132 3142.2268 ## FT0006 6817.8785 6911.5439 
res[1:14, 3:8] ## class: SummarizedExperiment ## dim: 14 6 ## metadata(6): '' '' ... '' '' ## assays(2): raw raw_filled ## rownames(14): FT0001 FT0002 ... FT0013 FT0014 ## rowData names(11): mzmed mzmin ... QC ms_level ## colnames(6): MS_B_POS.mzML MS_QC_POOL_2_POS.mzML ... ## MS_QC_POOL_3_POS.mzML MS_E_POS.mzML ## colData names(11): sample_name derived_spectra_data_file ... phenotype ## injection_index"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"data-normalization","dir":"Articles","previous_headings":"","what":"Data normalization","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"preprocessing, data normalization scaling might need applied remove technical variances data. simple approaches like median scaling can implemented lines R code, advanced normalization algorithms available packages Bioconductor’s preprocessCore. comprehensive workflow “Notame” also propose interesting normalization approach adaptable scalable user dataset (Klåvus et al. 2020). Generally, LC-MS data, bias can categorized three main groups(Broadhurst et al. 2018): Variances introduced sample collection initial processing, can include differences sample amounts. type bias expected sample-specific affect signals sample way. Methods like median scaling, LOESS quantiles normalization can adjust bias. Signal drifts along measurement samples experiment. Reasons drifts can related aging instrumentation used (columns, detector), also changes metabolite abundances characteristics due reactions modifications, oxidation. changes expected affect samples measured later run rather ones measured beginning. reason, bias can play major role large experiments bias can play major role large experiments measured long time range usually considered affect individual metabolites (metabolite groups) differently. adjustment, moving average linear regression-based approaches can used. 
latter can example performed using adjust_lm() function MetaboCoreUtils package. Batch-related biases. comprise noise specific larger set samples, can set samples measured one LC-MS measurement run (.e. one analysis plate) samples measured using specific batch reagents. noise assumed affect samples one batch way linear modeling-based approaches can used adjust . Unwanted variation can arise various sources highly dependent experiment. Therefore, data normalization chosen carefully based experimental design, statistical aims, balance accuracy precision achieved use auxiliary information. Sample preparation biases can evaluated using internal standards, depending however also added sample mixes sample processing. Repeated measurements QC samples hand allows estimate correct LC-MS specific biases. Also, proper planning experiment, measurement study samples random order, can largely avoid biases introduced mentioned sources variance. workflow present tools assess data quality evaluate need normalization well options normalization. space reasons able provide solutions adjust possible sources variation.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"initial-quality-assessment","dir":"Articles","previous_headings":"Data normalization","what":"Initial quality assessment","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"principal component analysis (PCA) helpful tool initial, unsupervised, visualization data also provides insights potential quality issues data. order apply PCA measured feature abundances, need however impute (still present) missing values. assume missing values (gap-filling step) represent signal detection limit. cases, missing values can replaced random values sampled uniform distribution, ranging half smallest measured value smallest measured value specific feature. 
uniform distribution defined two parameters (minimum maximum) values equal probability selected. impute missing values approach add resulting data matrix new assay result object.","code":"#' Load preprocessing results ## load(\"SumExp.RData\") ## loadResults(RDataParam(\"data.RData\")) #' Impute missing values using an uniform distribution na_unidis <- function(z) { na <- is.na(z) if (any(na)) { min = min(z, na.rm = TRUE) z[na] <- runif(sum(na), min = min/2, max = min) } z } #' Row-wise impute missing values and add the data as a new assay tmp <- apply(assay(res, \"raw_filled\"), MARGIN = 1, na_unidis) assays(res)$raw_filled_imputed <- t(tmp)"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"principal-component-analysis","dir":"Articles","previous_headings":"Data normalization > Initial quality assessment","what":"Principal Component Analysis","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"PCA powerful tool detecting biases data. dimensionality reduction technique, enables visualization data lower-dimensional space. context LC-MS data, PCA can used identify overall biases batch, sample, injection index, etc. However, important note PCA linear method may able detect biases data. plotting PCA, apply log2 transform, center scale data. log2 transformation applied stabilize variance centering remove dependency absolute abundances. PCA shows clear separation study samples (plasma) QC samples (serum) first principal component (PC1). separation based phenotype visible third principal component (PC3). cases, can better option remove imputed values evaluate PCA . 
especially true imputed values replacing large proportion data.","code":"#' Log2 transform and scale data vals <- assay(res, \"raw_filled_imputed\") |> log2() |> t() |> scale(center = TRUE, scale = TRUE) #' Perform the PCA pca_res <- prcomp(vals, scale = FALSE, center = FALSE) #' Plot the results vals_st <- cbind(vals, phenotype = res$phenotype) pca_12 <- autoplot(pca_res, data = vals_st , colour = 'phenotype', scale = 0) + scale_color_manual(values = col_phenotype) pca_34 <- autoplot(pca_res, data = vals_st, colour = 'phenotype', x = 3, y = 4, scale = 0) + scale_color_manual(values = col_phenotype) grid.arrange(pca_12, pca_34, ncol = 2)"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"intensity-evaluation","dir":"Articles","previous_headings":"Data normalization > Initial quality assessment","what":"Intensity evaluation","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"Global differences feature abundances samples (e.g. due sample-specific biases) can evaluated plotting distribution log2 transformed feature abundances using boxplots violin plots. show number detected chromatographic peaks per sample distribution log2 transformed feature abundances. upper part plot show gap filling steps allowed rescue substantial number NAs allowed us consistent number feature values per sample. consistency aligns asspumption every sample similar amount features detected. Additionally observe , average, signal distribution individual samples similar. alternative way evaluate differences abundances samples relative log abundance (RLA) plots (De Livera et al. 2012). RLA value abundance feature sample relative median abundance feature across multiple samples. can discriminate within group across group RLAs, depending whether abundance compared samples within sample group across samples. 
Within group RLA plots assess tightness replicates within groups median close zero low variation around . used across groups, allow compare behavior groups. Generally, -sample differences can easily spotted using RLA plots. calculate visualize within group RLA values using rowRla() function r Biocpkg(\"MsCoreUtils\") package defining parameter f sample groups. RLA plot raw data filled data. Note: outliers drawn. RLA plot , can observe medians samples indeed centered around 0. Exception two CVD samples. Thus, distribution signals across samples comparable, differences seem present require sample normalization.","code":"layout(mat = matrix(1:3, ncol = 1), height = c(0.2, 0.2, 0.8)) par(mar = c(0.2, 4.5, 0.2, 3)) barplot(apply(assay(res, \"raw\"), MARGIN = 2, function(x) sum(!is.na(x))), col = paste0(col_sample, 80), border = col_sample, ylab = \"# detected peaks\", xaxt = \"n\", space = 0.012) grid(nx = NA, ny = NULL) barplot(apply(assay(res, \"raw_filled\"), MARGIN = 2, function(x) sum(!is.na(x))), col = paste0(col_sample, 80), border = col_sample, ylab = \"# detected + filled peaks\", xaxt = \"n\", space = 0.012) grid(nx = NA, ny = NULL) vioplot(log2(assay(res, \"raw_filled\")), xaxt = \"n\", ylab = expression(log[2]~feature~abundance), col = paste0(col_sample, 80), border = col_sample) points(colMedians(log2(assay(res, \"raw_filled\")), na.rm = TRUE), type = \"b\", pch = 1) grid(nx = NA, ny = NULL) legend(\"topright\", col = col_phenotype, legend = names(col_phenotype), lty=1, lwd = 2, xpd = TRUE, ncol = 3, cex = 0.8, bty = \"n\") par(mfrow = c(1, 1), mar = c(3.5, 4.5, 2.5, 1)) boxplot(MsCoreUtils::rowRla(assay(res, \"raw_filled\"), f = res$phenotype, transform = \"log2\"), cex = 0.5, pch = 16, col = paste0(col_sample, 80), ylab = \"RLA\", border = col_sample, boxwex = 1, outline = FALSE, xaxt = \"n\", main = \"Relative log abundance\", cex.main = 1) axis(side = 1, at = seq_len(ncol(res)), labels = colData(res)$sample_name) grid(nx = NA, ny = NULL) abline(h = 0, 
lty=3, lwd = 1, col = \"black\") legend(\"topright\", col = col_phenotype, legend = names(col_phenotype), lty=1, lwd = 2, xpd = TRUE, ncol = 3, cex = 0.8, bty = \"n\")"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"internal-standards","dir":"Articles","previous_headings":"Data normalization > Initial quality assessment","what":"Internal standards","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"Depending added sample mixes, allow evaluation variances introduced subsequent processing analysis steps. present experiment, added original plasma samples sample extraction included also protein lipid removal steps. can therefore used evaluate variances introduced sample extraction subsequent steps, can however used infer conclusions performance differences original sample collection (blood drawing, storage, plasma creation). use matchValues() function identify features representing signal . filter matches keep match single feature using filterMatches() function combination SingleMatchParam. internal standards play crucial role guiding normalization process. Given assumption samples artificially spiked, possess known ground truth—abundance intensity internal standard consistent. difference expected due technical differences/variance. Consequently, normalization aims minimize variation samples internal standard, reinforcing reliability analyses.","code":"# Do we keep IS in normalisation ? Does not give much info... Would simplify a bit #' Creating a column within our IS table intern_standard$feature_id <- NA_character_ #' Identify features matching m/z and RT of internal standards. 
fdef <- featureDefinitions(data) fdef$feature_id <- rownames(fdef) match_intern_standard <- matchValues( query = intern_standard, target = fdef, mzColname = c(\"mz\", \"mzmed\"), rtColname = c(\"RT\", \"rtmed\"), param = MzRtParam(ppm = 50, toleranceRt = 10)) #' Keep only matches with a 1:1 mapping standard to feature. param <- SingleMatchParam(duplicates = \"remove\", column = \"score_rt\", decreasing = TRUE) match_intern_standard <- filterMatches(match_intern_standard, param) intern_standard$feature_id <- match_intern_standard$target_feature_id intern_standard <- intern_standard[!is.na(intern_standard$feature_id), ]"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"between-sample-normalisation","dir":"Articles","previous_headings":"Data normalization","what":"Between sample normalisation","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"previous RLA plot showed data biases need corrected. Therefore, implement -sample normalization using filled-features. process effectively mitigates variations influenced technical issues, differences sample preparation, processing injection methods. instance, employ commonly used technique known median scaling (De Livera et al. 2012).","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"median-scaling","dir":"Articles","previous_headings":"Data normalization > Between sample normalisation","what":"Median scaling","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"method involves computing median sample, followed determining median individual sample medians. ensures consistent median values sample throughout entire data set. Maintaining uniformity average total metabolite abundance across samples crucial effective implementation. 
process aims establish shared baseline central tendency metabolite abundance, mitigating impact sample-specific technical variations. approach fosters robust comparable analysis top features across data set. assumption normalizing based median, known lower sensitivity extreme values, enhances comparability top features ensures consistent average abundance across samples. median scaling calculated imputed non-imputed data, set stored separately within SummarizedExperiment object. approach facilitates testing various normalization strategies maintaining record processing steps undertaken, enabling easy regression previous stages necessary.","code":"#' Compute median and generate normalization factor mdns <- apply(assay(res, \"raw_filled\"), MARGIN = 2, median, na.rm = TRUE ) nf_mdn <- mdns / median(mdns) #' divide dataset by median of median and create a new assay. assays(res)$norm <- sweep(assay(res, \"raw_filled\"), MARGIN = 2, nf_mdn, '/') assays(res)$norm_imputed <- sweep(assay(res, \"raw_filled_imputed\"), MARGIN = 2, nf_mdn, '/')"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"assessing-overall-effectiveness-of-the-normalization-approach","dir":"Articles","previous_headings":"Data normalization","what":"Assessing overall effectiveness of the normalization approach","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"crucial evaluate effectiveness normalization process. can achieved comparing distribution log2 transformed feature abundances normalization. 
Additionally, RLA plots can used assess tightness replicates within groups compare behavior groups.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"principal-component-analysis-1","dir":"Articles","previous_headings":"Data normalization > Assessing overall effectiveness of the normalization approach","what":"Principal Component Analysis","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"Normalization large impact PC1 PC2, separation study groups PC3 seems better difference QC samples lower normalization (see ). PCA plots show normalization process changed overall structure data. separation study QC samples remains . expected results normalization correct biological variance technical.","code":"#' Data before normalization vals_st <- cbind(vals, phenotype = res$phenotype) pca_raw <- autoplot(pca_res, data = vals_st, colour = 'phenotype', scale = 0) + scale_color_manual(values = col_phenotype) #' Data after normalization vals_norm <- apply(assay(res, \"norm\"), MARGIN = 1, na_unidis) |> log2() |> scale(center = TRUE, scale = TRUE) pca_res_norm <- prcomp(vals_norm, scale = FALSE, center = FALSE) vals_st_norm <- cbind(vals_norm, phenotype = res$phenotype) pca_adj <- autoplot(pca_res_norm, data = vals_st_norm, colour = 'phenotype', scale = 0) + scale_color_manual(values = col_phenotype) grid.arrange(pca_raw, pca_adj, ncol = 2) pca_raw <- autoplot(pca_res, data = vals_st , colour = 'phenotype', x = 3, y = 4, scale = 0) + scale_color_manual(values = col_phenotype) pca_adj <- autoplot(pca_res_norm, data = vals_st_norm, colour = 'phenotype', x = 3, y = 4, scale = 0) + scale_color_manual(values = col_phenotype) grid.arrange(pca_raw, pca_adj, ncol = 2)"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"intensity-evaluation-1","dir":"Articles","previous_headings":"Data 
normalization > Assessing overall effectiveness of the normalization approach","what":"Intensity evaluation","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"compare RLA plots -sample normalization evaluate impact data. RLA plot normalization. Note: outliers drawn. normalization process effectively centered data around median medians samples now closer zero.","code":"par(mfrow = c(2, 1), mar = c(3.5, 4.5, 2.5, 1)) boxplot(rowRla(assay(res, \"raw_filled\"), group = res$phenotype), cex = 0.5, pch = 16, ylab = \"RLA\", border = col_sample, col = paste0(col_sample, 80), cex.main = 1, outline = FALSE, xaxt = \"n\", main = \"Raw data\", boxwex = 1) grid(nx = NA, ny = NULL) legend(\"topright\", inset = c(0, -0.2), col = col_phenotype, legend = names(col_phenotype), lty=1, lwd = 2, xpd = TRUE, ncol = 3, cex = 0.7, bty = \"n\") abline(h = 0, lty=3, lwd = 1, col = \"black\") boxplot(rowRla(assay(res, \"norm\"), group = res$phenotype), cex = 0.5, pch = 16, ylab = \"RLA\", border = col_sample, col = paste0(col_sample, 80), boxwex = 1, outline = FALSE, xaxt = \"n\", main = \"Normallized data\", cex.main = 1) axis(side = 1, at = seq_len(ncol(res)), labels = res$sample_name) grid(nx = NA, ny = NULL) abline(h = 0, lty = 3, lwd = 1, col = \"black\")"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"coefficient-of-variation","dir":"Articles","previous_headings":"Data normalization > Assessing overall effectiveness of the normalization approach","what":"Coefficient of variation","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"next evaluate coefficient variation (CV, also referred relative standard deviation RSD) features across samples either QC study samples. QC samples, CV represent technical noise, study samples include also expected biological differences. 
Thus, normalization reduce CV QC samples, slightly reducing CV study samples. CV calculated using rowRsd() function MetaboCoreUtils package. setting mad = TRUE use robust calculation using median absolute deviation instead standard deviation. Table 6. Distribution CV values across samples raw normalized data. table shows distribution CV raw normalized data. first column highlights % data given CV value, e.g. 25% data CV equal lower 0.04557 QC_raw data. anticipated, CV values QCs, reflect technical variance, lower compared study samples, include technical biological variance. Overall, minimal disparity exists raw normalized data, positive indication normalization process introduced bias dataset, also reflects little differences average abundances sample raw data.","code":"index_study <- res$phenotype %in% c(\"CTR\", \"CVD\") index_QC <- res$phenotype == \"QC\" sample_res <- cbind( QC_raw = rowRsd(assay(res, \"raw_filled\")[, index_QC], na.rm = TRUE, mad = TRUE), QC_norm = rowRsd(assay(res, \"norm\")[, index_QC], na.rm = TRUE, mad = TRUE), Study_raw = rowRsd(assay(res, \"raw_filled\")[, index_study], na.rm = TRUE, mad = TRUE), Study_norm = rowRsd(assay(res, \"norm\")[, index_study], na.rm = TRUE, mad = TRUE) ) #' Summarize the values across features res_df <- data.frame( QC_raw = quantile(sample_res[, \"QC_raw\"], na.rm = TRUE), QC_norm = quantile(sample_res[, \"QC_norm\"], na.rm = TRUE), Study_raw = quantile(sample_res[, \"Study_raw\"], na.rm = TRUE), Study_norm = quantile(sample_res[, \"Study_norm\"], na.rm = TRUE) ) cpt <- paste0(\"Table 6. 
Distribution of CV values across samples for the raw and \", \"normalized data.\") pandoc.table(res_df, style = \"rmarkdown\", caption = cpt) save(data, file = \"data_afternorm.RData\") save(res, file = \"SumExp_afternorm.RData\")"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"conclusion-on-normalization","dir":"Articles","previous_headings":"Data normalization > Assessing overall effectiveness of the normalization approach","what":"Conclusion on normalization","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"overall conclusion normalization process little variance present beginning, normalization however able center data around median (shown RLA plot). Given simplicity limited size example dataset, conclude normalization process stage. intricate datasets diverse biases, tailored approach devised. include also approaches adjust signal drifts batch effects. One possible option use linear-model based approach can example applied adjust_lm() function MetaboCoreUtils package.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"quality-control-feature-prefiltering","dir":"Articles","previous_headings":"","what":"Quality control: Feature prefiltering","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"normalizing data can now pre-filter clean data performing statistical analysis. general, pre-filtering samples features performed remove outliers. copy original result object also keep unfiltered data later comparisons. eliminate features exhibit high variability dataset. Repeatedly measured QC samples typically serve robust basis cleansing datasets allowing identify features excessively high noise. data set external QC samples used, .e. 
pooled samples different collection using slightly different sample matrix, utility filtering somewhat limited. comprehensive description guidelines data filtering untargeted metabolomic studies, please refer (Broadhurst et al. 2018). first restrict data set features chromatographic peak detected least 2/3 samples least one study samples groups. ensures statistical tests carried later study samples performed reliable signal. Also, filter remove features mostly detected QC samples, study samples. filter can performed filterFeatures() function xcms package PercentMissingFilter setting. parameters filter: threshold: defines maximal acceptable percentage samples missing value(s) least one sample groups defined parameter f. f: factor defining sample groups. replacing \"QC\" sample group NA parameter f exclude QC samples evaluation consider study samples. threshold = 40 keep features peak detected 2 3 samples one sample groups. consider detected chromatographic peaks per sample, apply filter \"raw\" assay result object, contains abundance values detected chromatographic peaks (prior gap-filling). Following guidelines stated decided still use QC samples pre-filtering, basis represent similar bio-fluids study samples, thus, anticipate observing relatively similar metabolites affected similar measurement biases. therefore evaluate dispersion ratio (Dratio) (Broadhurst et al. 2018) features data set. accomplish task using function time DratioFilter parameter. filters exist function invite user explore decide best dataset. Dratio filter powerful tool identify features exhibit high variability data, relating variance observed QC samples study samples. setting threshold 0.4, remove features high degree variability QC study samples. example, feature deviation QC higher 40% (threshold = 0.4) deviation study samples removed. filtering step ensures features retained considerably lower technical biological variance. 
Note rowDratio() rowRsd() functions MetaboCoreUtils package used calculate actual numeric values estimates used filtering, e.g. evaluate distribution whole data set identify data set-dependent threshold values. Finally, evaluate number features left filtering steps calculate percentage features removed. dataset reduced 9068 4275 features. remove considerable amount features expected want focus reliable features analysis. rest analysis need separate QC samples study samples. store QC samples separate object later use. addition calculate CV QC samples add additional column rowData() result object. used later prioritize identified significant features e.g. low technical noise. Now data set preprocessed, normalized filtered, can start evaluate distribution data estimate variation due biology.","code":"load(\"SumExp_afternorm.RData\") load(\"data_afternorm.RData\") #' Number of features before filtering nrow(res) ## [1] 9068 #' keep unfiltered object res_unfilt <- res #' Limit features to those with at least two detected peaks in one study group. #' Setting the value for QC samples to NA excludes QC samples from the #' calculation. 
f <- res$phenotype f[f == \"QC\"] <- NA f <- as.factor(f) res <- filterFeatures(res, PercentMissingFilter(f = f, threshold = 40), assay = \"raw\") #' Compute and filter based on the Dratio filter_dratio <- DratioFilter(threshold = 0.4, qcIndex = res$phenotype == \"QC\", studyIndex = res$phenotype != \"QC\", mad = TRUE) res <- filterFeatures(res, filter = filter_dratio, assay = \"norm_imputed\") #' Number of features after analysis nrow(res) ## [1] 4275 #' Percentage left: end/beginning nrow(res)/nrow(res_unfilt) * 100 ## [1] 47.1438 res_qc <- res[, res$phenotype == \"QC\"] res <- res[, res$phenotype != \"QC\"] #' Calculate the QC's CV and add as feature variable to the data set rowData(res)$qc_cv <- assay(res_qc, \"norm\") |> rowRsd()"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"differential-abundance-analysis","dir":"Articles","previous_headings":"","what":"Differential abundance analysis","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"normalization quality control, next step identify features differentially abundant study groups. crucial step allows us identify potential biomarkers metabolites associated study groups. various approaches methods available identification features interest. workflow use multiple linear regression analysis identify features significantly difference abundances CVD CTR study group. performing tests evaluate similarities study samples using PCA (excluding QC samples avoid influencing results). samples clearly separate study group PCA indicating differences metabolite profiles two groups. However, drives separation PC1 clear. evaluate whether explained available variable study, .e., age: According PCA , PC1 seem related age. Even variance data set can’t explain stage, proceed (supervised) statistical tests identify features interest. 
compute linear models metabolite explaining observed feature abundance available study variables. also use base R function lm(), utilize limma package conduct differential abundance analysis: moderated test statistics (Smyth 2004) provided package specifically well suited experiments limited number replicates. tests use linear model ~ phenotype + age, hence explaining abundances one metabolite accounting study group assignment age individual. analysis might benefit inclusion study covariate associated PC2 explaining variance seen principal component, present analysis participant’s age disease association provided. define design study model.matrix() function fit feature-wise linear models log2-transformed abundances using lmFit() function. P-values significance association calculated using eBayes() function, also performs empirical Bayes-based robust estimation standard errors. See also excellent vignette/user guide limma package examples details linear model procedure. linear models fitted, can now proceed extract results. create data frame containing coefficients, raw adjusted p-values (applying Benjamini-Hochberg correction, .e., method = \"BH\" improved control false discovery rate), average intensity signals CVD CTR samples, indication whether feature deemed significant . consider metabolites adjusted p-value smaller 0.05 significant, also include (absolute) difference abundances cut-criteria. last, add differential abundance results result object’s rowData(). can now proceed visualize distribution raw adjusted p-values. Distribution raw (left) adjusted p-values (right). histograms show distribution raw adjusted p-values. Except enrichment small p-values, raw p-values (less) uniformly distributed, indicates absence strong systematic biases data. adjusted p-values conservative account multiple testing; important fit linear model feature therefore perform large number tests leads high chance false positive findings. 
see features low p-values, indicating likely significantly different two study groups. plot adjusted p-values log2 fold change (average) abundances. volcano plot allow us visualize features significantly different two study groups. highlighted blue color plot . Volcano plot showing analysis results. interesting features top corners volcano plot (.e., features large difference abundance groups small p-value). significant features negative coefficient (log2 fold change value) indicating abundance lower CVD samples compared CTR samples. features listed, along average difference (log2) abundance compared groups, adjusted p-values, average (log2) abundance sample group RSD (CV) QC samples table . Table 7.Features significant differences abundances. (continued ) visualize EICs significant features evaluate (raw) signal. restrict MS data set study samples. Parameters keepFeatures = TRUE: ensures identified features retained `subset object. peakBg: defines (background) color individual chromatographic peak EIC object. EICs significant features show clear single peak. intensities (already observed ) much larger CTR CVD samples. exception second feature (second EIC top row), intensities significant features however generally low. might make challenging identify using LC-MS/MS setup.","code":"col_sample <- col_phenotype[res$phenotype] #' Log transform and scale the data for PCA analysis vals <- assay(res, \"norm_imputed\") |> t() |> log2() |> scale(center = TRUE, scale = TRUE) pca_res <- prcomp(vals, scale = FALSE, center = FALSE) vals_st <- cbind(vals, phenotype = res$phenotype) autoplot(pca_res, data = vals_st , colour = 'phenotype', scale = 0) + scale_color_manual(values = col_phenotype) vals_st <- cbind(vals, age = res$age) autoplot(pca_res, data = vals_st , colour = 'age', scale = 0) #' Define the linear model to be applied to the data p.cut <- 0.05 # cut-off for significance. 
m.cut <- 0.5 # cut-off for log2 fold change age <- res$age phenotype <- factor(res$phenotype) design <- model.matrix(~ phenotype + age) #' Fit the linear model to the data, explaining metabolite #' concentrations by phenotype and age. fit <- lmFit(log2(assay(res, \"norm_imputed\")), design = design) fit <- eBayes(fit) #' Compile a result data frame tmp <- data.frame( coef.CVD = fit$coefficients[, \"phenotypeCVD\"], pvalue.CVD = fit$p.value[, \"phenotypeCVD\"], adjp.CVD = p.adjust(fit$p.value[, \"phenotypeCVD\"], method = \"BH\"), avg.CVD = rowMeans( log2(assay(res, \"norm_imputed\")[, res$phenotype == \"CVD\"])), avg.CTR = rowMeans( log2(assay(res, \"norm_imputed\")[, res$phenotype == \"CTR\"])) ) tmp$significant.CVD <- tmp$adjp.CVD < 0.05 #' Add the results to the object's rowData rowData(res) <- cbind(rowData(res), tmp) #' Restrict the raw data to study samples. data_study <- data[sampleData(data)$phenotype != \"QC\", keepFeatures = TRUE] #' Extract EICs for the significant features eic_sign <- featureChromatograms( data_study, features = rownames(tab), expandRt = 5, filled = TRUE) #' Plot the EICs. plot(eic_sign, col = col_sample, peakBg = paste0(col_sample[chromPeaks(eic_sign)[, \"sample\"]], 40)) legend(\"topright\", col = col_phenotype, legend = names(col_phenotype), lty = 1) save(data, file = \"data_after_DA.RData\") save(res, file = \"Sum_Exp_afterDA.RData\")"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"annotation","dir":"Articles","previous_headings":"","what":"Annotation","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"now identified features significant differences abundances two study groups. provide information metabolic pathways differentiate affected healthy individuals might hence also serve biomarkers. However, stage analysis know compounds/metabolites actually represent. thus need now annotate signals. 
Annotation can performed different level confidence Schymanski et al. (2014). lowest level annotation, highest rate false positive hits, bases features m/z ratios. Higher levels annotations employ fragment spectra (MS2 spectra) ions interest requiring however acquisition additional data. section, demonstrate multiple ways annotate significant features using functionality provided Bioconductor packages. Alternative approaches external software tools, may better suited, also discussed later section.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"ms1-based-annotation","dir":"Articles","previous_headings":"Annotation","what":"MS1-based annotation","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"data set acquired using LC-MS setup features thus characterized m/z retention times. retention time LC-setup-specific , without prior data/knowledge provide little information features’ identity. Modern MS instruments high accuracy m/z values therefore reliable estimates compound ion’s mass--charge ratio. first approach, use features’ m/z values match reference values, .e., exact masses chemical compounds provided reference database, case MassBank database. full MassBank data re-distributed Bioconductor’s AnnotationHub resource, simplifies integration reproducible R-based analysis workflows. load resource, list available MassBank data sets/releases load one . MassBank data provided self-contained SQLite database data can queried accessed CompoundDb Bioconductor package. use compounds() function extract small compound annotations database. MassBank (small compound annotation databases) provides (exact) molecular mass compound. Since almost small compounds neutral natural state, need first converted m/z values allow matching feature’s m/z. 
calculate m/z neutral mass, need assume ion (adduct) might generated measured metabolites employed electro-spray ionization. positive polarity, human serum samples, common ions protonated ([M+H]+), bear addition sodium ([M+Na]+) ammonium ([M+H-NH3]+) ions. match observed m/z values reference values potential ions use matchValues() function Mass2MzParam approach, allows specify types expected ions adducts parameter maximal allowed difference compared values using tolerance ppm parameters. first prepare data.frame significant features, set parameters matching perform comparison query features reference database. resulting Matched object shows 4 6 significant features matched ions compounds MassBank database. extract full result Matched object. Thus, total 237 ions compounds MassBank matched significant features based specified tolerance settings. Many compounds, different structure thus function/chemical property, identical chemical formula thus mass. Matching exclusively m/z features hence result many potentially false positive hits thus considered provide low confidence annotation. additional complication annotation resources, like MassBank, community maintained, contain large amount redundant information. reduce redundancy result table iterate hits feature keep matches unique compounds (identified INCHIKEY). INCHI INCHIKEY combine information compound’s chemical formula structure, different compounds can share chemical formula, different structure thus INCHI. Table 8.MS1 annotation results (continued ) table shows results MS1-based annotation process. can see four significant features matched. matches seem pretty accurate low ppm errors. deduplication performed considerably reduced number hits feature, first still matches ions large number compounds (chemical formula). Considering features’ m/z retention times MS1-based annotation increase annotation confidence, requires additional data, recording retention time pure standard compound LC setup. 
alternative approach might provide better insight annotations help choose different annotations feature evaluate certain chemical properties possible matches. instance, LogP value, available several databases HMDB, provides insight given compound’s polarity. property highly affects interaction analyte column, usually also directly affects separation. Therefore, comparison analyte’s retention time polarity can help rule possible misidentifications. low confidence, MS1-based annotation can provide first candidate annotations confirmed rejected additional analyses.","code":"#' load reference data ah <- AnnotationHub() #' List available MassBank data sets query(ah, \"MassBank\") ## AnnotationHub with 6 records ## # snapshotDate(): 2024-08-01 ## # $dataprovider: MassBank ## # $species: NA ## # $rdataclass: CompDb ## # additional mcols(): taxonomyid, genome, description, ## # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags, ## # rdatapath, sourceurl, sourcetype ## # retrieve records with, e.g., 'object[[\"AH107048\"]]' ## ## title ## AH107048 | MassBank CompDb for release 2021.03 ## AH107049 | MassBank CompDb for release 2022.06 ## AH111334 | MassBank CompDb for release 2022.12.1 ## AH116164 | MassBank CompDb for release 2023.06 ## AH116165 | MassBank CompDb for release 2023.09 ## AH116166 | MassBank CompDb for release 2023.11 #' Load one MassBank release mb <- ah[[\"AH116166\"]] cmps <- compounds(mb, columns = c(\"compound_id\", \"name\", \"formula\", \"exactmass\", \"inchikey\")) head(cmps) ## compound_id formula exactmass inchikey ## 1 1 C27H29NO11 543.1741 AOJJSUZBOXZQNB-UHFFFAOYSA-N ## 2 2 C40H54O4 598.4022 KFNGKYUGHHQDEE-AXWOCEAUSA-N ## 3 3 C10H24N2O2 204.1838 AEUTYOVWOVBAKS-UWVGGRQHSA-N ## 4 4 C16H27NO5 313.1889 LMFKRLGHEKVMNT-UJDVCPFMSA-N ## 5 5 C20H15Cl3N2OS 435.9971 JLGKQTAYUIMGRK-UHFFFAOYSA-N ## 6 6 C15H14O5 274.0841 BWNCKEBBYADFPQ-UHFFFAOYSA-N ## name ## 1 Epirubicin ## 2 Crassostreaxanthin A ## 3 Ethambutol ## 4 Heliotrine ## 5 
Sertaconazole ## 6 (R)Semivioxanthin #' Prepare query data frame rowData(res)$feature_id <- rownames(rowData(res)) res_sig <- res[rowData(res)$significant.CVD, ] #' Setup parameters for the matching param <- Mass2MzParam(adducts = c(\"[M+H]+\", \"[M+Na]+\", \"[M+H-NH3]+\"), tolerance = 0, ppm = 5) #' Perform the matching. mtch <- matchValues(res_sig, cmps, param = param, mzColname = \"mzmed\") mtch ## Object of class Matched ## Total number of matches: 237 ## Number of query objects: 6 (4 matched) ## Number of target objects: 117732 (237 matched) #' Extracting the results mtch_res <- matchedData(mtch, c(\"feature_id\", \"mzmed\", \"rtmed\", \"adduct\", \"ppm_error\", \"target_formula\", \"target_name\", \"target_inchikey\")) mtch_res ## DataFrame with 239 rows and 8 columns ## feature_id mzmed rtmed adduct ppm_error target_formula ## ## FT0371 FT0371 138.055 148.396 [M+H]+ 2.08055 C7H7NO2 ## FT0371 FT0371 138.055 148.396 [M+H]+ 1.93568 C7H7NO2 ## FT0371 FT0371 138.055 148.396 [M+H]+ 1.93568 C7H7NO2 ## FT0371 FT0371 138.055 148.396 [M+H]+ 1.93568 C7H7NO2 ## FT0371 FT0371 138.055 148.396 [M+H]+ 2.08055 C7H7NO2 ## ... ... ... ... ... ... ... ## FT1171 FT1171 229.13 181.0883 [M+Na]+ 3.07708 C12H18N2O ## FT1171 FT1171 229.13 181.0883 [M+Na]+ 3.07708 C12H18N2O ## FT1171 FT1171 229.13 181.0883 [M+Na]+ 3.07708 C12H18N2O ## FT1171 FT1171 229.13 181.0883 [M+Na]+ 3.07708 C12H18N2O ## FT5606 FT5606 560.36 33.5492 NA NA NA ## target_name target_inchikey ## ## FT0371 Benzohydro... VDEUYMSGMP... ## FT0371 Trigonelli... WWNNZCOKKK... ## FT0371 Trigonelli... WWNNZCOKKK... ## FT0371 Trigonelli... WWNNZCOKKK... ## FT0371 Salicylami... SKZKKFZAGN... ## ... ... ... ## FT1171 Isoproturo... PUIYMUZLKQ... ## FT1171 Isoproturo... PUIYMUZLKQ... ## FT1171 Isoproturo... PUIYMUZLKQ... ## FT1171 Isoproturo... PUIYMUZLKQ... 
## FT5606 NA NA rownames(mtch_res) <- NULL #' Keep only info on features that matched - create a utility function for that mtch_res <- split(mtch_res, mtch_res$feature_id) |> lapply(function(x) { lapply(split(x, x$target_inchikey), function(z) { z[which.min(z$ppm_error), ] }) }) |> unlist(recursive = FALSE) |> do.call(what = rbind) #' Display the results mtch_res |> as.data.frame() |> pandoc.table(style = \"rmarkdown\", caption = \"Table 8.MS1 annotation results\")"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"ms2-based-annotation","dir":"Articles","previous_headings":"Annotation","what":"MS2-based annotation","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"MS1 annotation fast efficient method annotate features therefore give first insight compounds significantly different two study groups. However, always accurate. MS2 data can provide higher level confidence annotation process provides, observed fragmentation pattern, information structure compound. MS2 data can generated LC-MS/MS measurement MS2 spectra recorded ions either data dependent acquisition (DDA) data independent acquisition (DIA) mode. Generally, advised include LC-MS/MS runs QC samples randomly selected study samples already acquisition MS1 data used quantification signals. alternative, addition, post-hoc LC-MS/MS acquisition can performed generate MS2 data needed annotation. present experiment, separate LC-MS/MS measurement conducted QC samples selected study samples generate data using inclusion list pre-selected ions. represent features found significantly different CVD CTR samples initial analysis full experiment. use subset second LC-MS/MS data set show data can used MS2-based annotation. differential abundance analysis found features significantly higher abundances CTR samples. Consequently, utilize MS2 data obtained CTR samples annotate significant features. 
load LC-MS/MS data experiment restrict data acquired CTR sample. total 3 LC-MS/MS data files control samples, different collision energy fragment ions. show number MS1 MS2 spectra files. Compared number MS2 spectra, far less MS1 spectra acquired. configuration MS instrument set ensure ions specified inclusion list selected fragmentation, even intensity might low. setting, however, recorded MS2 spectra represent noise. plot shows location precursor ions m/z - retention time plane three files. can see MS2 spectra recorded m/z interest along full retention time range, even actual ions eluting within certain retention time windows. next extract Spectra object MS data data object assign new spectra variable employed collision energy, extract data object sampleData. next filter MS data first restricting MS2 spectra removing mass peaks spectrum intensity lower 5% highest intensity spectrum, assuming low intensity peaks represent background signal. next remove also mass peaks m/z value greater equal precursor m/z ion. puts, later matching reference spectra, weight fragmentation pattern ions avoids hits based precursor m/z peak (hence similar mass compared compounds). last, restrict data spectra least two fragment peaks scale intensities sum 1 spectrum. similarity calculations affected scaling, makes visual comparison fragment spectra easier read. Finally, also speed later comparison spectra reference database, load full MS data memory (changing backend MsBackendMemory) apply processing steps performed data far. Keeping MS data memory performance benefits, generally suggested large data sets. evaluate impact present data set print addition size data object changing backend. thus moderate increase memory demand loading MS data memory (also filtered cleaned MS2 data). proceed match experimental MS2 spectra reference fragment spectra, workflow aim annotate features found significant differential abundance analysis. 
goal thus identify MS2 spectra second (LC-MS/MS) run represent fragments ions features data first (LC-MS) run. approach match MS2 spectra significant features determined earlier based precursor m/z retention time (given acceptable tolerance) feature’s m/z retention time. can easily done using featureArea() function effectively considers actual m/z retention time ranges features’ chromatographic peaks therefore increase chance finding correct match. however also assumes retention times first second run don’t differ much. Alternatively, need align retention times second LC-MS/MS data set first. first extract feature area, .e., m/z retention time ranges, significant features. next identify fragment spectra precursor m/z retention times within ranges. use filterRanges() function allows filter Spectra object using multiple ranges simultaneously. apply function separately feature (row matrix) extract MS2 spectra representing fragmentation information presumed feature’s ions. result apply() call list Spectra, element representing result one feature. exception last feature, multiple MS2 spectra identified. next combine list Spectra single Spectra object using concatenateSpectra() function add additional spectra variable containing respective feature identifier. now Spectra object fragment spectra significant features differential expression analysis. next build reference data need process way query spectra. extract fragment spectra MassBank database, restrict positive polarity data (since experiment acquired positive polarity) perform processing fragment spectra MassBank database. Note switch MsBackendMemory backend hence loading full data reference database memory. positive impact performance subsequent spectra matching, however also increase memory demand present analysis. Now Spectra object second run database spectra prepared, can proceed matching process. use matchSpectra() function MetaboAnnotation package CompareSpectraParam define settings matching. 
following parameters: requirePrecursor = TRUE: Limits spectra similarity calculations fragment spectra similar precursor m/z. tolerance ppm: Defines acceptable difference compared m/z values. relaxed tolerance settings ensure find matches even reference spectra acquired instruments lower accuracy. THRESHFUN: Defines matches report. , keep matches resulting spectra similarity score (calculated normalized dot product (Stein Scott 1994), default similarity function) larger 0.6. Thus, total 315 query MS2 spectra, 16 matched (least) one reference fragment spectrum. restrict results matching spectra extract metadata query target spectra well similarity score (complete list available metadata information can listed colnames() function). Now, query-target pairs spectra similarity higher 0.6. Similar MS1-based annotation also result table contains redundant information: multiple fragment spectra per feature also MassBank contains several fragment spectra compound, measured using differing collision energies MS instruments, different laboratories. thus iterate feature-compound pairs select one highest score. identifier compound, use fragment spectra’s INCHI-key, since compound names MassBank accepted consensus/controlled vocabularies. Table 9.MS2 annotation results. Thus, 6 significant features, one annotated compound based MS2-based approach. many reasons failure find matches features. Although MS2 spectra selected feature, appear represent noise, features, LC-MS/MS run, low MS1 signal recorded, indicating selected sample original compound might (longer) present. Also, reference databases contain predominantly fragment spectra protonated ([M+H]+) ions compounds, features might represent signal types ions result different fragmentation pattern. Finally, fragment spectra compounds interest might also simply present used reference database. Thus, combining information MS1- MS2 based annotation can annotate one feature considerable confidence. 
feature m/z 195.0879 retention time 32 seconds seems ion caffeine. result somewhat disappointing also clearly shows importance proper experimental planning need control potential confounding factors. present experiment, disease-specific biomarker identified, life-style property individuals suffering disease: coffee consumption probably contraindicated patients CVD group reduce risk heart arrhythmia. plot EIC feature highlighting retention time highest scoring MS2 spectra recorded create mirror plot comparing MS2 spectra reference fragment spectra caffeine. plot clearly shows higher signal feature CTR compared CVD samples. QC samples exhibit lower highly consistent signal, suggesting absence strong technical noise biases raw data experiment. vertical line indicates retention time fragment spectrum best match reference spectrum. noted , since fragment spectra measured separate LC-MS/MS experiment, considered indication approximate retention time ions fragmented second experiment. fragment spectrum feature, shown upper panel right plot highly similar reference spectrum caffeine MassBank (shown lower panel). addition matching precursor m/z, two fragments (m/z intensity) present spectra. 
can also extract additional metadata matching reference spectrum, used collision energy, fragmentation mode, instrument type, instrument well ion (adduct) fragmented.","code":"#' Load from the MetaboLights Database param <- MetaboLightsParam(mtblsId = \"MTBLS8735\", assayName = paste0(\"a_MTBLS8735_LC-MSMS_positive_\", \"hilic_metabolite_profiling.txt\"), filePattern = \".mzML\") msms_data <- readMsObject(MsExperiment(), param, keepOntology = FALSE, keepProtocol = FALSE, simplify = TRUE) #adjust sampleData colnames(sampleData(msms_data)) <- c(\"sample_name\", \"derived_spectra_data_file\", \"metabolite_asssignment_file\", \"source_name\", \"organism\", \"blood_sample_type\", \"sample_type\", \"age\", \"unit\", \"phenotype\") # filter samples to keep MSMS data from CTR samples: sampleData(msms_data) <- sampleData(msms_data)[sampleData(msms_data)$phenotype == \"CTR\", ] sampleData(msms_data) <- sampleData(msms_data)[grepl(\"MSMS\", sampleData(msms_data)$derived_spectra_data_file), ] # Add fragmentation data information (from filenames) sampleData(msms_data)$fragmentation_mode <- c(\"CE20\", \"CE30\", \"CES\") #let's look at the updated sample data sampleData(msms_data)[, c(\"derived_spectra_data_file\", \"phenotype\", \"sample_name\", \"age\")] |> as.data.frame() |> pandoc.table(style = \"rmarkdown\", caption = \"Table 1. Samples from the data set.\") ## ## ## | derived_spectra_data_file | phenotype | sample_name | age | ## |:----------------------------:|:---------:|:-----------:|:---:| ## | FILES/MSMS_2_E_CE20_POS.mzML | CTR | E | 66 | ## | FILES/MSMS_2_E_CE30_POS.mzML | CTR | E | 66 | ## | FILES/MSMS_2_E_CES_POS.mzML | CTR | E | 66 | ## ## Table: Table 1. Samples from the data set. 
#' Filter the data to the same RT range as the LC-MS run msms_data <- filterRt(msms_data, c(10, 240)) #' check the number of spectra per ms level spectra(msms_data) |> msLevel() |> split(spectraSampleIndex(msms_data)) |> lapply(table) |> do.call(what = cbind) ## 1 2 3 4 5 6 7 8 9 10 11 12 ## 1 825 186 186 186 825 186 186 186 825 185 186 185 ## 2 825 3121 3118 3124 825 3123 3118 3120 825 3117 3117 3116 plotPrecursorIons(msms_data) ms2_ctr <- spectra(msms_data) ms2_ctr$collision_energy <- sampleData(msms_data)$fragmentation_mode[spectraSampleIndex(msms_data)] #' Remove low intensity peaks low_int <- function(x, ...) { x > max(x, na.rm = TRUE) * 0.05 } ms2_ctr <- filterMsLevel(ms2_ctr, 2L) |> filterIntensity(intensity = low_int) #' Remove precursor peaks and restrict to spectra with a minimum #' number of peaks ms2_ctr <- filterPrecursorPeaks(ms2_ctr, ppm = 50, mz = \">=\") ms2_ctr <- ms2_ctr[lengths(ms2_ctr) > 1] |> scalePeaks() #' Size of the object before loading into memory print(object.size(ms2_ctr), units = \"MB\") ## 5.1 Mb #' Load the MS data subset into memory ms2_ctr <- setBackend(ms2_ctr, MsBackendMemory()) ms2_ctr <- applyProcessing(ms2_ctr) #' Size of the object after loading into memory print(object.size(ms2_ctr), units = \"MB\") ## 18.2 Mb #' Define the m/z and retention time ranges for the significant features target <- featureArea(data)[rownames(res_sig), ] target ## mzmin mzmax rtmin rtmax ## FT0371 138.0544 138.0552 146.32270 152.86115 ## FT0565 161.0391 161.0407 159.00234 164.30799 ## FT0732 182.0726 182.0756 32.71242 42.28755 ## FT0845 195.0799 195.0887 30.73235 35.67337 ## FT1171 229.1282 229.1335 178.01450 183.35303 ## FT5606 560.3539 560.3656 32.06570 35.33456 #' Identify for each feature MS2 spectra with their precursor m/z and #' retention time within the feature's m/z and retention time range ms2_ctr_fts <- apply(target[, c(\"rtmin\", \"rtmax\", \"mzmin\", \"mzmax\")], MARGIN = 1, FUN = filterRanges, object = ms2_ctr, spectraVariables = 
c(\"rtime\", \"precursorMz\")) lengths(ms2_ctr_fts) ## FT0371 FT0565 FT0732 FT0845 FT1171 FT5606 ## 38 36 135 68 38 0 l <- lengths(ms2_ctr_fts) #' Combine the individual Spectra objects ms2_ctr_fts <- concatenateSpectra(ms2_ctr_fts) #' Assign the feature identifier to each MS2 spectrum ms2_ctr_fts$feature_id <- rep(rownames(res_sig), l) ms2_ref <- Spectra(mb) |> filterPolarity(1L) |> filterIntensity(intensity = low_int) |> filterPrecursorPeaks(ppm = 50, mz = \">=\") ms2_ref <- ms2_ref[lengths(ms2_ref) > 1] |> scalePeaks() register(SerialParam()) #' Define the settings for the spectra matching. prm <- CompareSpectraParam(ppm = 40, tolerance = 0.05, requirePrecursor = TRUE, THRESHFUN = function(x) which(x >= 0.6)) ms2_mtch <- matchSpectra(ms2_ctr_fts, ms2_ref, param = prm) ms2_mtch ## Object of class MatchedSpectra ## Total number of matches: 214 ## Number of query objects: 315 (16 matched) ## Number of target objects: 69561 (21 matched) #' Keep only query spectra with matching reference spectra ms2_mtch <- ms2_mtch[whichQuery(ms2_mtch)] #' Extract the results ms2_mtch_res <- matchedData(ms2_mtch) nrow(ms2_mtch_res) ## [1] 214 #' - split the result per feature #' - select for each feature the best matching result for each compound #' - combine the result again into a data frame ms2_mtch_res <- ms2_mtch_res |> split(f = paste(ms2_mtch_res$feature_id, ms2_mtch_res$target_inchikey)) |> lapply(function(z) { z[which.max(z$score), ] }) |> do.call(what = rbind) |> as.data.frame() #' List the best matching feature-compound pair pandoc.table(ms2_mtch_res[, c(\"feature_id\", \"target_name\", \"score\", \"target_inchikey\")], style = \"rmarkdown\", caption = \"Table 9.MS2 annotation results.\", split.table = Inf) par(mfrow = c(1, 2)) col_sample <- col_phenotype[sampleData(data)$phenotype] #' Extract and plot EIC for the annotated feature eic <- featureChromatograms(data, features = ms2_mtch_res$feature_id[1]) plot(eic, col = col_sample, peakCol = col_sample[chromPeaks(eic)[, 
\"sample\"]], peakBg = paste0(col_sample[chromPeaks(eic)[, \"sample\"]], 20)) legend(\"topright\", col = col_phenotype, legend = names(col_phenotype), lty = 1) #' Identify the best matching query-target spectra pair idx <- which.max(ms2_mtch_res$score) #' Indicate the retention time of the MS2 spectrum in the EIC plot abline(v = ms2_mtch_res$rtime[idx]) #' Get the index of the MS2 spectrum in the query object query_idx <- which(query(ms2_mtch)$.original_query_index == ms2_mtch_res$.original_query_index[idx]) query_ms2 <- query(ms2_mtch)[query_idx] #' Get the index of the MS2 spectrum in the target object target_idx <- which(target(ms2_mtch)$spectrum_id == ms2_mtch_res$target_spectrum_id[idx]) target_ms2 <- target(ms2_mtch)[target_idx] #' Create a mirror plot comparing the two best matching spectra plotSpectraMirror(query_ms2, target_ms2) legend(\"topleft\", legend = paste0(\"precursor m/z: \", format(precursorMz(query_ms2), 3))) spectraData(target_ms2, c(\"collisionEnergy_text\", \"fragmentation_mode\", \"instrument_type\", \"instrument\", \"adduct\")) |> as.data.frame() ## collisionEnergy_text fragmentation_mode instrument_type ## 1 55 (nominal) HCD LC-ESI-ITFT ## instrument adduct ## 1 LTQ Orbitrap XL Thermo Scientific [M+H]+"},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"external-tools-or-alternative-annotation-approaches","dir":"Articles","previous_headings":"Annotation","what":"External tools or alternative annotation approaches","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"present workflow highlights annotation performed within R using packages Bioconductor project, also excellent external softwares used alternative, SIRIUS (Dührkop et al. 2019), mummichog (Li et al. 2013) GNPS (Nothias et al. 2020) among others. use , data need exported format supported . 
MS2 spectra, data easily exported required MGF file format using r Biocpkg(\"MsBackendMgf\") Bioconductor package. Integration xcms feature-based molecular networking GNPS described GNPS documentation. alternative, addition, evidence potential matching chemical formula feature derived evaluating isotope pattern full MS1 scan. provide information isotope composition. Also , various functions isotopologues() r Biocpkg(\"MetaboCoreUtils\") package functionality envipat R package (Loos et al. 2015) used.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"summary","dir":"Articles","previous_headings":"","what":"Summary","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"tutorial, describe end--end workflow LC-MS-based untargeted metabolomics experiments, conducted entirely within R using packages Bioconductor project base R functionality. excellent software exists perform similar analyses, power R-based workflow lies adaptability individual data sets research questions ability build reproducible workflows documentation. Due space restrictions don’t provide comprehensive listing methodologies individual analysis steps. advanced options approaches available, e.g., normalization data, however also heavily dependent size properties analyzed data set, well annotation features. result, found present analysis set features significant abundance differences compared groups. however reliably annotate single feature, related lifestyle individuals rather pathological properties investigated disease. 
low proportion annotated signals however uncommon untargeted metabolomics experiments reflects need comprehensive reliable reference annotation libraries.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"session-information","dir":"Articles","previous_headings":"","what":"Session information","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"","code":"sessionInfo() ## R version 4.4.1 (2024-06-14) ## Platform: x86_64-pc-linux-gnu ## Running under: Ubuntu 22.04.5 LTS ## ## Matrix products: default ## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 ## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0 ## ## locale: ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C ## [9] LC_ADDRESS=C LC_TELEPHONE=C ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C ## ## time zone: Etc/UTC ## tzcode source: system (glibc) ## ## attached base packages: ## [1] stats4 stats graphics grDevices utils datasets methods ## [8] base ## ## other attached packages: ## [1] MetaboAnnotation_1.9.1 CompoundDb_1.9.5 ## [3] AnnotationFilter_1.29.0 AnnotationHub_3.13.3 ## [5] BiocFileCache_2.13.0 dbplyr_2.5.0 ## [7] gridExtra_2.3 ggfortify_0.4.17 ## [9] ggplot2_3.5.1 vioplot_0.5.0 ## [11] zoo_1.8-12 sm_2.2-6.0 ## [13] pheatmap_1.0.12 RColorBrewer_1.1-3 ## [15] pander_0.6.5 limma_3.61.11 ## [17] MetaboCoreUtils_1.13.0 Spectra_1.15.10 ## [19] xcms_4.3.3 BiocParallel_1.39.0 ## [21] SummarizedExperiment_1.35.2 GenomicRanges_1.57.1 ## [23] GenomeInfoDb_1.41.1 IRanges_2.39.2 ## [25] S4Vectors_0.43.2 MatrixGenerics_1.17.0 ## [27] matrixStats_1.4.1 MsBackendMetaboLights_0.99.0 ## [29] MsIO_0.0.6 MsExperiment_1.7.0 ## [31] ProtGenerics_1.37.1 readxl_1.4.3 ## [33] Biobase_2.65.1 
BiocGenerics_0.51.2 ## [35] rmarkdown_2.28 knitr_1.48 ## [37] BiocStyle_2.33.1 ## ## loaded via a namespace (and not attached): ## [1] bitops_1.0-8 filelock_1.0.3 ## [3] tibble_3.2.1 cellranger_1.1.0 ## [5] preprocessCore_1.67.1 XML_3.99-0.17 ## [7] lifecycle_1.0.4 doParallel_1.0.17 ## [9] lattice_0.22-6 MASS_7.3-61 ## [11] alabaster.base_1.5.9 MultiAssayExperiment_1.31.5 ## [13] magrittr_2.0.3 sass_0.4.9 ## [15] jquerylib_0.1.4 yaml_2.3.10 ## [17] MsCoreUtils_1.17.2 DBI_1.2.3 ## [19] abind_1.4-8 zlibbioc_1.51.1 ## [21] purrr_1.0.2 RCurl_1.98-1.16 ## [23] rappdirs_0.3.3 GenomeInfoDbData_1.2.12 ## [25] MSnbase_2.31.1 pkgdown_2.1.1 ## [27] ncdf4_1.23 codetools_0.2-20 ## [29] DelayedArray_0.31.12 DT_0.33 ## [31] xml2_1.3.6 tidyselect_1.2.1 ## [33] farver_2.1.2 UCSC.utils_1.1.0 ## [35] base64enc_0.1-3 jsonlite_1.8.9 ## [37] iterators_1.0.14 systemfonts_1.1.0 ## [39] foreach_1.5.2 tools_4.4.1 ## [41] progress_1.2.3 ragg_1.3.3 ## [43] Rcpp_1.0.13 glue_1.7.0 ## [45] SparseArray_1.5.41 xfun_0.47 ## [47] dplyr_1.1.4 withr_3.0.1 ## [49] BiocManager_1.30.25 fastmap_1.2.0 ## [51] rhdf5filters_1.17.0 fansi_1.0.6 ## [53] digest_0.6.37 mime_0.12 ## [55] R6_2.5.1 textshaping_0.4.0 ## [57] colorspace_2.1-1 rsvg_2.6.1 ## [59] RSQLite_2.3.7 utf8_1.2.4 ## [61] tidyr_1.3.1 generics_0.1.3 ## [63] prettyunits_1.2.0 PSMatch_1.9.0 ## [65] httr_1.4.7 htmlwidgets_1.6.4 ## [67] S4Arrays_1.5.9 pkgconfig_2.0.3 ## [69] gtable_0.3.5 blob_1.2.4 ## [71] impute_1.79.0 MassSpecWavelet_1.71.0 ## [73] XVector_0.45.0 htmltools_0.5.8.1 ## [75] bookdown_0.40 MALDIquant_1.22.3 ## [77] clue_0.3-65 scales_1.3.0 ## [79] png_0.1-8 reshape2_1.4.4 ## [81] rjson_0.2.23 curl_5.2.3 ## [83] cachem_1.1.0 rhdf5_2.49.0 ## [85] stringr_1.5.1 BiocVersion_3.20.0 ## [87] parallel_4.4.1 AnnotationDbi_1.67.0 ## [89] mzID_1.43.0 vsn_3.73.0 ## [91] desc_1.4.3 pillar_1.9.0 ## [93] grid_4.4.1 alabaster.schemas_1.5.0 ## [95] vctrs_0.6.5 MsFeatures_1.13.0 ## [97] pcaMethods_1.97.0 cluster_2.1.6 ## [99] evaluate_1.0.0 cli_3.6.3 ## 
[101] compiler_4.4.1 rlang_1.1.4 ## [103] crayon_1.5.3 labeling_0.4.3 ## [105] QFeatures_1.15.3 ChemmineR_3.57.1 ## [107] affy_1.83.1 plyr_1.8.9 ## [109] fs_1.6.4 stringi_1.8.4 ## [111] munsell_0.5.1 Biostrings_2.73.2 ## [113] lazyeval_0.2.2 Matrix_1.7-0 ## [115] hms_1.1.3 bit64_4.5.2 ## [117] Rhdf5lib_1.27.0 KEGGREST_1.45.1 ## [119] statmod_1.5.0 highr_0.11 ## [121] mzR_2.39.0 igraph_2.0.3 ## [123] memoise_2.0.1 affyio_1.75.1 ## [125] bslib_0.8.0 bit_4.5.0"},{"path":[]},{"path":[]},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"aknowledgment","dir":"Articles","previous_headings":"Appendix","what":"Aknowledgment","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"Thanks Steffen Neumann continuous work develop maintain xcms software. …","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"alignment-using-manually-selected-anchor-peaks","dir":"Articles","previous_headings":"Appendix","what":"Alignment using manually selected anchor peaks","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"align data set using internal standards. suggested eventually enrich anchor peaks signal ions retention time regions covered internal standards.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/articles/end-to-end-untargeted-metabolomics.html","id":"additional-informations","dir":"Articles","previous_headings":"","what":"Additional informations","title":"A Complete End-to-End Workflow for untargeted LC-MS/MS Metabolomics Data Analysis in R","text":"","code":"#possible extra info: # -"},{"path":"https://rformassspectrometry.github.io/metabonaut/authors.html","id":null,"dir":"","previous_headings":"","what":"Authors","title":"Authors and Citation","text":"Philippine Louail. Author, maintainer. Johannes Rainer. 
Author.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/authors.html","id":"citation","dir":"","previous_headings":"","what":"Citation","title":"Authors and Citation","text":"Louail P, Rainer J (2024). metabonaut: Exploring Analyzing LC-MS data. R package version 0.0.1, https://rformassspectrometry.github.io/metabonaut/, https://github.com/rformassspectrometry/metabonaut/.","code":"@Manual{, title = {metabonaut: Exploring and Analyzing LC-MS data}, author = {Philippine Louail and Johannes Rainer}, year = {2024}, note = {R package version 0.0.1, https://rformassspectrometry.github.io/metabonaut/}, url = {https://github.com/rformassspectrometry/metabonaut/}, }"},{"path":"https://rformassspectrometry.github.io/metabonaut/index.html","id":"exploring-and-analyzing-untargeted-metabolomics-data","dir":"","previous_headings":"","what":"Exploring and Analyzing LC-MS data","title":"Exploring and Analyzing LC-MS data","text":"Welcome Metabonaut ! 🧑‍🚀 initiative present series workflows based small LC-MS/MS dataset using R Bioconductor packages. Throughout workflows, demonstrate various algorithms can adapted particular data set various R packages can seamlessly integrated ensure efficient reproducible processing. main workflow presented “Complete end--end LC-MS/MS Metabolomic Data analysis” full R code examples along comprehensive descriptions provided end--end-untargeted-metabolomics.Rmd file. file can opened e.g. RStudio allows execution individual R commands (see section additionally required R packages). R command rmarkdown::render(\"xcms-preprocessing.Rmd\") generate html file xcms-preprocessing.html.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/index.html","id":"important-to-note","dir":"","previous_headings":"","what":"Important to note","title":"Exploring and Analyzing LC-MS data","text":"tutorial expect user basic knowledge R Rmarkdown. advise go short tutorial order comfortable testing code easily adapting data. 
Rmarkdown, click R, can find really fun way learn basic R programming interactive short course","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/index.html","id":"installation","dir":"","previous_headings":"","what":"Installation","title":"Exploring and Analyzing LC-MS data","text":"workshop files along R runtime environment including required packages RStudio (Posit) editor bundled docker container. installation, docker container can run computer code examples workshop can evaluated within environment (without need install additional packages files). version workshop uses packages Bioconductor devel hence bases Bioconductor’s docker container development version packages. stable version come soon. required steps installation : don’t already , install docker. Find installation information . Get docker image tutorial e.g. command line docker pull rformassspectrometry/metabonaut:latest. Start docker container, either Docker Desktop, command line Enter http://localhost:8787 web browser log username rstudio password bioc. RStudio server version: open R-markdown (.Rmd) files vignettes folder evaluate R code blocks document. manual installation, R version >= 4.4.0 required well recent versions packages used workflow. now 2 packages used workflow bioconductor therefore need downloaded github. Run code follow:","code":"docker run \\ -e PASSWORD=bioc \\ -p 8787:8787 \\ rformassspectrometry/metabonaut:latest install.packages(\"BiocManager\") BiocManager::install(\"rformassspectrometry/metabonaut\")"},{"path":"https://rformassspectrometry.github.io/metabonaut/index.html","id":"known-issues","dir":"","previous_headings":"","what":"Known issues","title":"Exploring and Analyzing LC-MS data","text":"workflow still getting ready fully deployed, therefore might ongoing issue actively resolving. know list . now, aware problem code. issue sure check latest devel version packages. 
issue resolved updating packages please report reproducible example github issue, hesitate report us.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/index.html","id":"contribution","dir":"","previous_headings":"","what":"Contribution","title":"Exploring and Analyzing LC-MS data","text":"contributions, see RforMassSpectrometry contributions guideline.","code":""},{"path":"https://rformassspectrometry.github.io/metabonaut/index.html","id":"code-of-conduct","dir":"","previous_headings":"","what":"Code of Conduct","title":"Exploring and Analyzing LC-MS data","text":"See RforMassSpectrometry Code Conduct.","code":""},{"path":[]},{"path":"https://rformassspectrometry.github.io/metabonaut/news/index.html","id":"changes-in-0-0-1","dir":"Changelog","previous_headings":"","what":"Changes in 0.0.1","title":"metabonaut 0.0.1","text":"Addition basic files workflow package. Addition end--end vignette.","code":""}]