diff --git a/Dockerfile b/Dockerfile index 26d4402..6f8aa72 100644 --- a/Dockerfile +++ b/Dockerfile @@ -25,11 +25,11 @@ RUN Rscript -e "BiocManager::install(c('RCurl', 'xcms', 'MsExperiment', 'Summari ## Install the current package with vignettes RUN Rscript -e "devtools::install('.', dependencies = TRUE, type = 'source', build_vignettes = FALSE)" -## USER rstudio +USER rstudio ## build article for end-to-end:test -##RUN Rscript -e "quarto::quarto_render('vignettes/a-end-to-end-untargeted-metabolomics.qmd', quiet = FALSE)" +RUN Rscript -e "quarto::quarto_render('vignettes/a-end-to-end-untargeted-metabolomics.qmd', quiet = FALSE)" -##USER root +USER root -##RUN find vignettes/ -name "*.html" -type f -delete && find vignettes/ -name "*_files" -type d -exec rm -r {} + +RUN find vignettes/ -name "*.html" -type f -delete && find vignettes/ -name "*_files" -type d -exec rm -r {} + diff --git a/vignettes/a-end-to-end-untargeted-metabolomics.qmd b/vignettes/a-end-to-end-untargeted-metabolomics.qmd new file mode 100644 index 0000000..853277a --- /dev/null +++ b/vignettes/a-end-to-end-untargeted-metabolomics.qmd @@ -0,0 +1,3147 @@ +--- +title: "Complete end-to-end LC-MS/MS Metabolomic Data analysis" +format: html +tbl-cap-location: bottom +editor: visual +minimal: true +date: 'Compiled: `r format(Sys.Date(), "%B %d, %Y")`' +vignette: > + %\VignetteIndexEntry{Complete end-to-end LC-MS/MS Metabolomic Data analysis} + %\VignetteEngine{quarto::html} + %\VignetteEncoding{UTF-8} +--- + +```{r setup, include=FALSE} +library(knitr) +library(quarto) +knitr::opts_knit$set(root.dir = './') +``` + +# Introduction + +The present workflow describes all steps for the analysis of an LC-MS/MS +experiment, which includes the preprocessing of the raw data to generate the +*abundance* matrix for the *features* in the various samples, followed by data +normalization, differential abundance analysis and finally the annotation of +features to metabolites. Note that also alternative analysis options and R +packages could be used for different steps and some examples are mentioned +throughout the workflow. + +![Steps of the end-to-end workflow and possible +alternatives](images/workflow.png) + +# Data description + +See the [data +description](https://rformassspectrometry.github.io/metabonaut/articles/dataset-investigation.html) +vignette for detailed explanation of the dataset we go through in this workflow +and general tips on what should be done when you first get your data. + +# Packages needed + +Our workflow is therefore based on the following dependencies: + +```{r packages used, message=FALSE, warning=FALSE} +## General bioconductor package +library(BiocStyle) +library(knitr) + +## Data Import and handling +library(readxl) +library(MsExperiment) +library(MsIO) +library(MsBackendMetaboLights) +library(SummarizedExperiment) + +## Preprocessing of LC-MS data +library(xcms) +library(Spectra) +library(MetaboCoreUtils) + +## Statistical analysis +library(limma) # Differential abundance +library(matrixStats) # Summaries over matrices + +## Visualisation +library(pander) +library(RColorBrewer) +library(pheatmap) +library(vioplot) +library(ggfortify) # Plot PCA +library(gridExtra) # To arrange multiple ggplots into single plots + +## Annotation +library(AnnotationHub) # Annotation resources +library(CompoundDb) # Access small compound annotation data. +library(MetaboAnnotation) # Functionality for metabolite annotation. +``` + +# Data import + +Note that different equipment will generate various file extensions, so a +conversion step might be needed beforehand, though it does not apply to this +dataset. The Spectra package supports a variety of ways to store and retrieve MS +data, including mzML, mzXML, CDF files, simple flat files, or database systems. +If necessary, several tools, such as ProteoWizard's MSConvert, can be used to +convert files to the .mzML format [@chambers_cross-platform_2012]. + +Below we will show how to extract our dataset from the MetaboLigths database and +load it as an `MsExperiment` object. For more information on how to load your +data from the MetaboLights database, refer to the +[vignette](https://rformassspectrometry.github.io/MsIO/articles/MsIO.html#loading-data-from-metabolights). +For other type of data loading, check out this xcms +[vignette](https://jorainer.github.io/xcmsTutorials/articles/xcms-preprocessing.html#data-import-and-exploration) +A more specific vignette will be created for data import soon. + +```{r import, message=FALSE, warning=FALSE} +param <- MetaboLightsParam(mtblsId = "MTBLS8735", + assayName = paste0("a_MTBLS8735_LC-MS_positive_", + "hilic_metabolite_profiling.txt"), + filePattern = ".mzML") + +lcms1 <- readMsObject(MsExperiment(), + param, + keepOntology = FALSE, + keepProtocol = FALSE, + simplify = TRUE) +``` + +We next configure the parallel processing setup. Most functions from the *xcms* +package allow per-sample parallel processing, which can improve the performance +of the analysis, especially for large data sets. *xcms* and all packages from +the *RforMassSpectrometry* package ecosystem use the parallel processing setup +configured through the `r Biocpkg("BiocParallel")` Bioconductor package. With +the code below we use a *fork-based* parallel processing on unix system, and a +*socket-based* parallel processing on the Windows operating system. + +```{r par-process, eval = FALSE} +register(SerialParam()) +#' Set up parallel processing using 2 cores +## if (.Platform$OS.type == "unix") { +## register(MulticoreParam(2)) +## } else { +## register(SnowParam(2)) +## } + +``` + + +# Data organisation + +The experimental data is now represented by a `MsExperiment` object from the +`r Biocpkg("MsExperiment")` package. The `MsExperiment` object is a container +for metadata and spectral data that provides and manages also the linkage +between samples and spectra. + +```{r} +lcms1 +``` + +Below we provide a brief overview of the data structure and content. The +`sampleData()` function extracts sample information from the object. We next +extract this data and use the *pander* package to render and show this +information in Table 1 below. Throughout the document we use the R pipe operator +(`|>`) to avoid nested function calls and hence improve code readability. + +The `sampleData()` output from *MetaboLights* is not always ideal for direct and +easy access to the data. We therefore rename it and transform it in a more +user-friendly way. The user can add, transform and remove any column they want +using base R functionalities. + +```{r phenodata} +#| tbl-cap: "Table 1. Samples from the data set." +sampleData(lcms1)[, c("Derived_Spectral_Data_File", + "Characteristics[Sample type]", + "Factor Value[Phenotype]", + "Sample Name", + "Factor Value[Age]")] |> + kable(format = "pipe") +``` + +```{r update-phenodata} +#| tbl-cap: "Table 2. Simplified sample data." +# Let's rename the column for easier access +colnames(sampleData(lcms1)) <- c("sample_name", "derived_spectra_data_file", + "metabolite_asssignment_file", + "source_name", + "organism", + "blood_sample_type", + "sample_type", "age", "unit", "phenotype") + +# Add "QC" to the phenotype of the QC samples +sampleData(lcms1)$phenotype[sampleData(lcms1)$sample_name == "POOL"] <- "QC" +sampleData(lcms1)$sample_name[sampleData(lcms1)$sample_name == "POOL" ] <- c("POOL1", "POOL2", "POOL3", "POOL4") + +# Add injection index column +sampleData(lcms1)$injection_index <- seq_len(nrow(sampleData(lcms1))) + +#let's look at the updated sample data +sampleData(lcms1)[, c("derived_spectra_data_file", + "phenotype", "sample_name", "age", + "injection_index")] |> + kable(format = "pipe") +``` + +There are `r length(sampleData(lcms1))` samples in this data set. Below are +abbreviations essential for proper interpretation of this metadata information: + +- *injection_index*: An index representing the order (position) in which an + individual sample was measured (injected) within the LC-MS measurement run + of the experiment. +- *phenotype*: The sample groups of the experiment: + - `"QC"`: Quality control sample (pool of serum samples from an external, + large cohort). + - `"CVD"`: Sample from an individual with a cardiovascular disease. + - `"CTR"`: Sample from a presumably healthy control. +- *sample_name*: An arbitrary name/identifier of the sample. +- *age*: The (rounded) age of the individuals. + +We will define colors for each of the sample groups based on their sample group +using the *RColorBrewer* package: + +```{r define-colors, include=FALSE} +#' Define colors for the different phenotypes +col_phenotype <- brewer.pal(9, name = "Set1")[c(9, 5, 4)] +names(col_phenotype) <- c("QC", # grey + "CVD", # orange + "CTR") # purple +col_sample <- col_phenotype[sampleData(lcms1)$phenotype] +``` + +The MS data of this experiment is stored as a `Spectra` object (from the +`r Biocpkg("Spectra")` Bioconductor package) within the `MsExperiment` object +and can be accessed using `spectra()` function. Each element in this object is a +spectrum - they are organised linearly and are all combined in the same +`Spectra` object one after the other (ordered by retention time and samples). + +Below we access the dataset's `Spectra` object which will summarize its +available information and provide, among other things, the total number of +spectra of the data set. + +```{r} +#' Access Spectra Object +spectra(lcms1) +``` + +We can also summarize the number of spectra and their respective MS level +(extracted with the `msLevel()` function). The `fromFile()` function returns for +each spectrum the index of its sample (data file) and can thus be used to split +the information (MS level in this case) by sample to further summarize using the +base R `table()` function and combine the result into a matrix. Note that this +is the number of spectra acquired at each run, and not the number of spectral +features in each sample. + +```{r} +#' Count the number of spectra with a specific MS level per file. +spectra(lcms1) |> + msLevel() |> + split(fromFile(lcms1)) |> + lapply(table) |> + do.call(what = cbind) +``` + +The present data set thus contains only MS1 data, which is ideal for +quantification of the signal. A second (LC-MS/MS) data set also with fragment +(MS2) spectra of the same samples will be used later on in the workflow. + +Note that users should not restrict themselves to data evaluation examples shown +here or in other tutorials. The *Spectra* package enables user-friendly access +to the full MS data and its functionality should be extensively used to explore, +visualize and summarize the data. + +As another example, we below determine the retention time range for the entire +data set. + +```{r} +#' Retention time range for entire dataset +spectra(lcms1) |> + rtime() |> + range() +``` + +Data obtained from LC-MS experiments are typically analyzed along the retention +time axis, while MS data is organized by spectrum, orthogonal to the retention +time axis. + +# Data visualization and general quality assessment + +Effective visualization is paramount for inspecting and assessing the quality of +MS data. For a general overview of our LC-MS data, we can: + +- Combine all mass peaks from all (MS1) spectra of a sample into a single + spectrum in which each mass peak then represents the maximum signal of all + mass peaks with a similar *m/z*. This spectrum might then be called Base + Peak Spectrum (BPS), providing information on the most abundant ions of a + sample. +- Aggregate mass peak intensities for each spectrum, resulting in the Base + Peak Chromatogram (BPC). The BPC shows the highest measured intensity for + each distinct retention time (hence spectrum) and is thus orthogonal to the + BPS. +- Sum the mass peak intensities for each spectrum to create a Total Ion + Chromatogram (TIC). +- Compare the BPS of all samples in an experiment to evaluate similarity of + their ion content. +- Compare the BPC of all samples in an experiment to identify samples with + similar or dissimilar chromatographic signal. + +In addition to such general data evaluation and visualization, it is also +crucial to investigate specific signal of e.g. internal standards or +compounds/ions known to be present in the samples. By providing a reliable +reference, internal standards help achieve consistent and accurate analytical +results. + +## Spectra Data Visualization: BPS + +The BPS collapses data in the retention time dimension and reveals the most +prevalent ions present in each of the samples, creation of such BPS is however +not straightforward. Mass peaks, even if representing signals from the same ion, +will never have identical *m/z* values in consecutive spectra due to the +measurement error/resolution of the instrument. + +Below we use the `combineSpectra` function to combine all spectra from one file +(defined using parameter `f = fromFile(data)`) into a single spectrum. All mass +peaks with a difference in *m/z* value smaller than 3 parts-per-million (ppm) +are combined into one mass peak, with an intensity representing the maximum of +all such grouped mass peaks. To reduce memory requirement, we in addition first +*bin* each spectrum combining all mass peaks within a spectrum, aggregating mass +peaks into bins with 0.01 *m/z* width. In case of large datasets, it is also +recommended to set the `processingChunkSize()` parameter of the `MsExperiment` +object to a finite value (default is `Inf`) causing the data to be processed +(and loaded into memory) in chunks of `processingChunkSize()` spectra. This can +reduce memory demand and speed up the process. + +```{r} +#' Setting the chunksize +chunksize <- 1000 +processingChunkSize(spectra(lcms1)) <- chunksize +``` + +We can now generate BPS for each sample and `plot()` them. + +```{r bps, eval=FALSE} +#| fig-cap: "Figure 1. BPS of all samples." +#' Combining all spectra per file into a single spectrum +bps <- spectra(lcms1) |> + bin(binSize = 0.01) |> + combineSpectra(f = fromFile(lcms1), intensityFun = max, ppm = 3) + +#' Plot the base peak spectra +par(mar = c(2, 1, 1, 1)) +plotSpectra(bps, main= "") +``` + +Here, there is observable overlap in ion content between the files, particularly +around 300 *m/z* and 700 *m/z*. There are however also differences between sets +of samples. In particular, BPS 1, 4, 7 and 10 (counting row-wise from left to +right) seem different than the others. In fact, these four BPS are from QC +samples, and the remaining six from the study samples. The observed differences +might be explained by the fact that the QC samples are pools of serum samples +from a different cohort, while the study samples represent plasma samples, from +a different sample collection. + +Next to the visual inspection above, we can also calculate and express the +similarity between the BPS with a heatmap. Below we use the `compareSpectra()` +function to calculate pairwise similarities between all BPS and use then the +`pheatmap()` function from the *pheatmap* package to cluster and visualize this +result. + +```{r compare-spectra, eval=FALSE} +#| fig-cap: "Figure 2. Heatmap of MS signals similarities." +#' Calculate similarities between BPS +sim_matrix <- compareSpectra(bps) + +#' Add sample names as rownames and colnames +rownames(sim_matrix) <- colnames(sim_matrix) <- sampleData(lcms1)$sample_name +ann <- data.frame(phenotype = sampleData(lcms1)[, "phenotype"]) +rownames(ann) <- rownames(sim_matrix) + +#' Plot the heatmap +pheatmap(sim_matrix, annotation_col = ann, + annotation_colors = list(phenotype = col_phenotype)) +``` + +We get a first glance at how our different samples distribute in terms of +similarity. The heatmap confirms the observations made with the BPS, showing +distinct clusters for the QCs and the study samples, owing to the different +matrices and sample collections. + +It is also strongly recommended to delve deeper into the data by exploring it in +more detail. This can be accomplished by carefully assessing our data and +extracting spectra or regions of interest for further examination. In the next +chunk, we will look at how to extract information for a specific spectrum from +distinct samples. + +```{r} +#| fig-cap: "Figure 3. Intensity and m/z values of the 125th spectrum of two CTR samples." +#| out.width: 90% +#| fig-align: center +#' Accessing a single spectrum - comparing with QC +par(mfrow = c(1,2), mar = c(2, 2, 2, 2)) +spec1 <- spectra(lcms1[1])[125] +spec2 <- spectra(lcms1[3])[125] +plotSpectra(spec1, main = "QC sample") +plotSpectra(spec2, main = "CTR sample") +``` + +The significant dissimilarities in peak distribution and intensity confirm the +difference in composition between QCs and study samples. We next compare a full +MS1 spectrum from a CVD and a CTR sample. + +```{r} +#| fig-cap: "Figure 4. Intensity and m/z values of the 125th spectrum of one CVD and one CTR sample." +#| out.width: 90% +#| fig-align: center +#' Accessing a single spectrum - comparing CVD and CTR +par(mfrow = c(1,2), mar = c(2, 2, 2, 2)) +spec1 <- spectra(lcms1[2])[125] +spec2 <- spectra(lcms1[3])[125] +plotSpectra(spec1, main = "CVD sample") +plotSpectra(spec2, main = "CTR sample") +``` + +Above, we can observe that the spectra between CVD and CTR samples are not +entirely similar, but they do exhibit similar main peaks between 200 and 600 m/z +with a general higher intensity in control samples. However the peak +distribution (or at least intensity) seems to vary the most between an *m/z* of +10 to 210 and after an *m/z* of 600. + +The CTR spectrum above exhibits significant peaks around an *m/z* of 150 - 200 +that have a much lower intensity in the CVD sample. To delve into more details +about this specific spectrum, a wide range of functions can be employed: + +```{r} +#| tbl-cap: "Table 3. Intensity and m/z values of the 125th spectrum of one CTR sample." +#| tbl-style: "border: 1px solid black; width: 50%; border-collapse: collapse;" +#' Checking its intensity +intensity(spec2) + +#' Checking its rtime +rtime(spec2) + +#' Checking its m/z +mz(spec2) + +#' Filtering for a specific m/z range and viewing in a tabular format +filt_spec <- filterMzRange(spec2,c(50,200)) + +data.frame(intensity = unlist(intensity(filt_spec)), + mz = unlist(mz(filt_spec))) |> + head() |> kable(format = "markdown") +``` + +## Chromatographic Data Visualization: BPC and TIC + +The `chromatogram()` function facilitates the extraction of intensities along +the retention time. However, access to chromatographic information is currently +not as efficient and seamless as it is for spectral information. Work is +underway to develop/improve the infrastructure for chromatographic data through +a new `Chromatograms` object aimed to be as flexible and user-friendly as the +`Spectra` object. + +For visualizing LC-MS data, a BPC or TIC serves as a valuable tool to assess the +performance of liquid chromatography across various samples in an experiment. In +our case, we extract the BPC from our data to create such a plot. The BPC +captures the maximum peak signal from each spectrum in a data file and plots +this information against the retention time for that spectrum on the y-axis. The +BPC can be extracted using the `chromatogram` function. + +By setting the parameter `aggregationFun = "max"`, we instruct the function to +report the maximum signal per spectrum. Conversely, when setting +`aggregationFun = "sum"`, it sums up all intensities of a spectrum, thereby +creating a TIC. + +```{r bpc1, echo=TRUE} +#| fig-cap: "Figure 5. BPC of all samples colored by phenotype." +#' Extract and plot BPC for full data +bpc <- chromatogram(lcms1, aggregationFun = "max") + +plot(bpc, col = paste0(col_sample, 80), main = "BPC", lwd = 1.5) +grid() +legend("topright", col = col_phenotype, + legend = names(col_phenotype), lty = 1, lwd = 2, horiz = TRUE, + bty = "n") +``` + +After about 240 seconds no signal seems to be measured. Thus, we filter the data +removing that part as well as the first 10 seconds measured in the LC run. + +```{r filter rt} +#| fig-cap: "Figure 6. BPC after filtering retention time." +#' Filter the data based on retention time +lcms1 <- filterRt(lcms1, c(10, 240)) +bpc <- chromatogram(lcms1, aggregationFun = "max") + +#' Plot after filtering +plot(bpc, col = paste0(col_sample, 80), + main = "BPC after filtering retention time", lwd = 1.5) +grid() +legend("topright", col = col_phenotype, + legend = names(col_phenotype), lty = 1, lwd = 2, horiz = TRUE, bty = "n") +``` + +Initially, we examined the entire BPC and subsequently filtered it based on the +desired retention times. This not only results in a smaller file size but also +facilitates a more straightforward interpretation of the BPC. + +The final plot illustrates the BPC for each sample colored by phenotype, +providing insights on the signal measured along the retention times of each +sample. It reveals the points at which compounds eluted from the LC column. In +essence, a BPC condenses the three-dimensional LC-MS data (*m/z* by retention +time by intensity) into two dimensions (retention time by intensity). + +We can also here compare similarities of the BPCs in a heatmap. The retention +times will however not be identical between different samples. Thus we *bin()* +the chromatographic signal per sample along the retention time axis into bins of +two seconds resulting in data with the same number of bins/data points. We can +then calculate pairwise similarities between these data vectors using the +`cor()` function and visualize the result using `pheatmap()`. + +```{r heatmap1} +#| fig-cap: "Figure 7. Heatmap of BPC similarities." +#' Total ion chromatogram +tic <- chromatogram(lcms1, aggregationFun = "sum") |> + bin(binSize = 2) + +#' Calculate similarity (Pearson correlation) between BPCs +ticmap <- do.call(cbind, lapply(tic, intensity)) |> + cor() + +rownames(ticmap) <- colnames(ticmap) <- sampleData(lcms1)$sample_name +ann <- data.frame(phenotype = sampleData(lcms1)[, "phenotype"]) +rownames(ann) <- rownames(ticmap) + +#' Plot heatmap +pheatmap(ticmap, annotation_col = ann, + annotation_colors = list(phenotype = col_phenotype)) +``` + +The heatmap above reinforces what our exploration of spectra data showed, which +is a strong separation between the QC and study samples. This is important to +bear in mind for later analyses. + +Additionally, study samples group into two clusters, cluster *I* containing +samples *C* and *F* and cluster *II* with all other samples. Below we plot the +TIC of all samples, using a different color for each cluster. + +```{r} +#| fig-cap: "Figure 8. Example of TIC of unusual signal." +cluster_I_idx <- sampleData(lcms1)$sample_name %in% c("F", "C") +cluster_II_idx <- sampleData(lcms1)$sample_name %in% c("A", "B", "D", "E") + +temp_col <- c("grey", "red") +names(temp_col) <- c("Cluster II", "Cluster I") +col <- rep(temp_col[1], length(lcms1)) +col[cluster_I_idx] <- temp_col[2] +col[sampleData(lcms1)$phenotype == "QC"] <- NA + +lcms1 |> + chromatogram(aggregationFun = "sum") |> + plot( col = col, + main = "TIC after filtering retention time", lwd = 1.5) +grid() +legend("topright", col = temp_col, + legend = names(temp_col), lty = 1, lwd = 2, + horiz = TRUE, bty = "n") +``` + +While the TIC of all samples look similar, the samples from cluster I show a +different signal in the retention time range from about 40 to 160 seconds. +Whether, and how strong this difference will impact the following analysis +remains to be determined. + +### Known compounds + +Throughout the entire process, it is crucial to have reference points within the +dataset, such as well-known ions. Most experiments nowadays include internal +standards (IS), and it was the case here. We strongly recommend using them for +visualization throughout the entire analysis. For this experiment, a set of 15 +IS was spiked to all samples. After reviewing signal of all of them, we selected +two to guide this analysis process. However, we also advise to plot and evaluate +all the ions after each steps. + +To illustrate this, we generate Extracted Ion Chromatograms (EIC) for these +selected *test ions*. By restricting the MS data to intensities within a +restricted, small *m/z* range and a selected retention time window, EICs are +expected to contain only signal from a single type of ion. The expected *m/z* +and retention times for our set of IS was determined in a different experiment. +Additionally, in cases where internal standards are not available, commonly +present ions in the sample matrix can serve as suitable alternatives. Ideally, +these compounds should be distributed across the entire retention time range of +the experiment. + +```{r EIC extract for internal standard} +#| tbl-cap: "Table 4. Internal standard list with respective m/z and expected retention time [s]." +#| tbl-style: "font-size: 0.8em" +#| tbl-colwidths: [20, 20, 20] +#' Load our list of standard +intern_standard <- read.delim("intern_standard_list.txt") + +# Extract EICs for the list +eic_is <- chromatogram( + lcms1, + rt = as.matrix(intern_standard[, c("rtmin", "rtmax")]), + mz = as.matrix(intern_standard[, c("mzmin", "mzmax")])) + +#' Add internal standard metadata +fData(eic_is)$mz <- intern_standard$mz +fData(eic_is)$rt <- intern_standard$RT +fData(eic_is)$name <- intern_standard$name +fData(eic_is)$abbreviation <- intern_standard$abbreviation +rownames(fData(eic_is)) <- intern_standard$abbreviation + +#' Summary of IS information +fData(eic_is)[, c("name", "mz", "rt")] |> + kable(format = "pipe") +``` + +We below plot the EICs for isotope labeled cystine and methionine. + +```{r} +#| fig-cap: "Figure 9. EIC of cystine and methionine." +#| fig-align: center +#' Extract the two IS from the chromatogram object. +eic_cystine <- eic_is["cystine_13C_15N"] +eic_met <- eic_is["methionine_13C_15N"] + +#' plot both EIC +par(mfrow = c(1, 2), mar = c(4, 2, 2, 0.5)) +plot(eic_cystine, main = fData(eic_cystine)$name, cex.axis = 0.8, + cex.main = 0.8, + col = paste0(col_sample, 80)) +grid() +abline(v = fData(eic_cystine)$rt, col = "red", lty = 3) + +plot(eic_met, main = fData(eic_met)$name, cex.axis = 0.8, cex.main = 0.8, + col = paste0(col_sample, 80)) +grid() +abline(v = fData(eic_met)$rt, col = "red", lty = 3) +legend("topright", col = col_phenotype, legend = names(col_phenotype), lty = 1, + bty = "n") +``` + +We can observe a clear concentration difference between QCs and study samples +for the isotope labeled cystine ion. Meanwhile, the labeled methionine internal +standard exhibits a discernible signal amidst some noise and a noticeable +retention time shift between samples. + +While the artificially isotope labeled compounds were spiked to the individual +samples, there should also be the signal from the endogenous compounds in serum +(or plasma) samples. Thus, we calculate next the mass and *m/z* of an *\[M+H\]+* +ion of the endogenous cystine from its chemical formula and extract also the EIC +from this ion. For calculation of the exact mass and the *m/z* of the selected +ion adduct we use the `calculateMass()` and `mass2mz()` functions from the +`r Biocpkg("MetaboCoreUtils")` package. + +```{r} +#| fig-cap: "Figure 10. EIC of endogenous cystine vs spiked." +#' extract endogenous cystine mass and EIC and plot. +cysmass <- calculateMass("C6H12N2O4S2") +cys_endo <- mass2mz(cysmass, adduct = "[M+H]+")[, 1] + +#' Plot versus spiked +par(mfrow = c(1, 2)) +chromatogram(lcms1, mz = cys_endo + c(-0.005, 0.005), + rt = unlist(fData(eic_cystine)[, c("rtmin", "rtmax")]), + aggregationFun = "max") |> + plot(col = paste0(col_sample, 80)) |> + grid() + +plot(eic_cystine, col = paste0(col_sample, 80)) +grid() +legend("topright", col = col_phenotype, legend = names(col_phenotype), lty = 1, + bty = "n") +``` + +The two cystine EICs above look highly similar (the endogenous shown left, the +isotope labeled right in the plot above), if not for the shift in m/z, which +arises from the artificial labeling. This shift allows us to discriminate +between the endogenous and non-endogenous compound. + +# Data preprocessing + +Preprocessing stands as the inaugural step in the analysis of untargeted LC-MS. +It is characterized by 3 main stages: chromatographic peak detection, retention +time shift correction (alignment) and correspondence which results in features +defined. The primary objective of preprocessing is the quantification of signals +from ions measured in a sample, addressing any potential retention time drifts +between samples, and ensuring alignment of quantified signals across samples +within an experiment. The final result of LC-MS data preprocessing is a numeric +matrix with abundances of quantified entities in the samples of the experiment. + +## Chromatographic peak detection + +The initial preprocessing step involves detecting intensity peaks along the +retention time axis, the so called *chromatographic peaks*. To achieve this, we +employ the `findChromPeaks()` function within *xcms*. This function supports +various algorithms for peak detection, which can be selected and configured with +their respective parameter objects. + +The preferred algorithm in this case, *CentWave*, utilizes continuous wavelet +transformation (CWT)-based peak detection [@tautenhahn_highly_2008]. This method +is known for its effectiveness in handling non-Gaussian shaped chromatographic +peaks or peaks with varying retention time widths, which are commonly +encountered in HILIC separations. + +Below we apply the CentWave algorithm with its default settings on the extracted +ion chromatogram for cystine and methionine ions and evaluate its results. + +```{r Default centwave param test} +#' Use default Centwave parameter +param <- CentWaveParam() + +#' Look at the default parameters +param + +#' Evaluate for Cystine +cystine_test <- findChromPeaks(eic_cystine, param = param) +chromPeaks(cystine_test) + +#' Evaluate for Methionine +met_test <- findChromPeaks(eic_met, param = param) +chromPeaks(met_test) +``` + +While *CentWave* is a highly performant algorithm, it requires to be costumized +to each dataset. This implies that the parameters should be fine-tuned based on +the user's data. The example above serves as a clear motivation for users to +familiarize themselves with the various parameters and the need to adapt them to +a data set. We will discuss the main parameters that can be easily adjusted to +suit the user's dataset: + +- `peakwidth`: Specifies the minimal and maximal expected width of the peaks + in the retention time dimension. Highly dependent on the chromatographic + settings used. + +- `ppm`: The maximal allowed difference of mass peaks' *m/z* values (in + parts-per-million) in consecutive scans to consider them representing signal + from the same ion. + +- `integrate`: This parameter defines the integration method. Here, we + primarily use `integrate = 2` because it integrates also signal of a + chromatographic peak's tail and is considered more accurate by the + developers. + +To determine `peakwidth`, we recommend that users refer to previous EICs and +estimate the range of peak widths they observe in their dataset. Ideally, +examining multiple EICs should be the goal. For this dataset, the peak widths +appear to be around 2 to 10 seconds. We do not advise choosing a range that is +too wide or too narrow with the `peakwidth` parameter as it can lead to false +positives or negatives. + +To determine the `ppm`, a deeper analysis of the dataset is needed. It is +clarified above that `ppm` depends on the instrument, but users should not +necessarily input the vendor-advertised ppm. Here's how to determine it as +accurately as possible: + +The following steps involve generating a highly restricted MS area with a single +mass peak per spectrum, representing the cystine ion. The *m/z* of these peaks +is then extracted, their absolute difference calculated and finally expressed in +ppm. + +```{r ppm parameter1 } +#' Restrict the data to signal from cystine in the first sample +cst <- lcms1[1L] |> + spectra() |> + filterRt(rt = c(208, 218)) |> + filterMzRange(mz = fData(eic_cystine)["cystine_13C_15N", c("mzmin", "mzmax")]) + +#' Show the number of peaks per m/z filtered spectra +lengths(cst) + +#' Calculate the difference in m/z values between scans +mz_diff <- cst |> + mz() |> + unlist() |> + diff() |> + abs() + +#' Express differences in ppm +range(mz_diff * 1e6 / mean(unlist(mz(cst)))) +``` + +We therefore, choose a value close to the maximum within this range for +parameter `ppm`, i.e., 15 ppm. + +We can now perform the chromatographic peak detection with the adapted settings +on our EICs. It is important to note that, to properly estimate background +noise, sufficient data points outside the chromatographic peak need to be +present. This is generally no problem if peak detection is performed on the full +LC-MS data set, but for peak detection on EICs the retention time range of the +EIC needs to be sufficiently wide. If the function fails to find a peak in an +EIC, the initial troubleshooting step should be to increase this range. +Additionally, the signal-to-noise threshold `snthresh` should be reduced for +peak detection in EICs, because within the small retention time range, not +enough signal is present to properly estimate the background noise. Finally, in +case of too few MS1 data points per peaks, setting CentWave's advanced parameter +`extendLengthMSW` to `TRUE` can help with peak detection. + +```{r} +#' Parameters adapted for chromatographic peak detection on EICs. +param <- CentWaveParam(peakwidth = c(1, 8), ppm = 15, integrate = 2, + snthresh = 2) + +#' Evaluate on the cystine ion +cystine_test <- findChromPeaks(eic_cystine, param = param) +chromPeaks(cystine_test) + +#' Evaluate on the methionine ion +met_test <- findChromPeaks(eic_met, param = param) +chromPeaks(met_test) +``` + +With the customized parameters, a chromatographic peak was detected in each +sample. Below, we use the `plot()` function on the EICs to visualize these +results. + +```{r echo=FALSE} +#| fig-cap: "Figure 11. Chromatographic peak detection on EICs." +#' Plot test chromatogram +par(mfrow = c(1, 2)) +plot(cystine_test, main = fData(cystine_test)$name, + col = paste0(col_sample, 80), + peakBg = paste0(col_sample, 40)[chromPeaks(cystine_test)[, "column"]], + cex.main = 0.8, cex.axis = 0.8) +grid() + +plot(met_test, main = fData(met_test)$name, + col = paste0(col_sample, 80), + peakBg = paste0(col_sample, 40)[chromPeaks(met_test)[, "column"]], + cex.main = 0.8, cex.axis = 0.8) +grid() +legend("topright", col = col_phenotype, + legend = names(col_phenotype), lty = 1, bty = "n") +``` + +We can see a peak seems ot be detected in each sample for both ions. This +indicates that our custom settings seem thus to be suitable for our dataset. We +now proceed and apply them to the entire dataset, extracting EICs again for the +same ions to evaluate and confirm that chromatographic peak detection worked as +expected. Note: + +- We revert the value for parameter `snthresh` to its default, because, as + mentioned above, background noise estimation is more reliable when performed + on the full data set. + +- Parameter `chunkSize` of `findChromPeaks()` defines the number of data files + that are loaded into memory and processed simultaneously. This parameter + thus allows to fine-tune the memory demand as well as performance of the + chromatographic peak detection step. + +```{r run find chrompeak on entire dataset} +#' Using the same settings, but with default snthresh +param <- CentWaveParam(peakwidth = c(1, 8), ppm = 15, integrate = 2) +lcms1 <- findChromPeaks(lcms1, param = param, chunkSize = 5) + +#' Update EIC internal standard object +eics_is_noprocess <- eic_is +eic_is <- chromatogram(lcms1,, + rt = as.matrix(intern_standard[, c("rtmin", "rtmax")]), + mz = as.matrix(intern_standard[, c("mzmin", "mzmax")])) +fData(eic_is) <- fData(eics_is_noprocess) +``` + +Below we plot the EICs of the two selected internal standards to evaluate the +chromatographic peak detection results. + +```{r plot new eics, echo=FALSE} +#| fig-cap: "Figure 12. Chromatographic peak detection on EICs after processing." +#' Test if we find peaks for Cystine and Methionine again +eic_cystine <- eic_is["cystine_13C_15N", ] +eic_met <- eic_is["methionine_13C_15N", ] + +#' Plot +par(mfrow = c(1, 2)) +plot(eic_cystine, main = "L-Cystine (13C6, 99%; 15N2, 99%)", + col = paste0(col_sample, 80), + peakBg = paste0(col_sample[chromPeaks(eic_cystine)[, "sample"]], 40)) +grid() + +plot(eic_met, main = "L-Methionine (13C5, 99%; 15N, 99%)", + col = paste0(col_sample, 80), + peakBg = paste0(col_sample[chromPeaks(eic_met)[, "sample"]], 40)) +grid() +legend("topright", col = col_phenotype, + legend = names(col_phenotype), lty = 1) +``` + +Peaks seem to have been detected properly in all samples for both ions. This +indicates that the peak detection process over the entire dataset was +successful. + +### Refine identified chromatographic peaks + +The identification of chromatographic peaks using the *CentWave* algorithm can +sometimes result in artifacts, such as overlapping or split peaks. To address +this issue, the `refineChromPeaks()` function is utilized, in conjunction with +`MergeNeighboringPeaksParam`, which aims at merging such split peaks. + +Below we show some examples of *CentWave* peak detection artifacts. These +examples are pre-selected to illustrate the necessity of the next step: + +```{r echo=FALSE} +#| fig-cap: "Figure 13. Examples of CentWave peak detection artifacts." +#' Extract m/z-rt regions for selected peaks +mz_rt <- cbind(rtmin = c(155.2120, 181.71800), + rtmax = c(201.0710, 228.13500), + mzmin = c(191.0288, 53.50964), + mzmax = c(191.0527, 53.53183)) + +#' Extract the EICs +eics <- chromatogram(lcms1[3], rt = mz_rt[, c("rtmin", "rtmax")], + mz = mz_rt[, c("mzmin", "mzmax")]) +#' Plot the EICs +plot(eics) +``` + +In both cases the signal presumably from a single type of ion was split into two +separate chromatographic peaks (indicated by the vertical line). The +`MergeNeigboringPeaksParam` allows to combine such split peaks. The parameters +for this algorithm are defined below: + +- `expandMz`and `expandRt`: Define which chromatographic peaks should be + evaluated for merging. + - `expandMz`: Suggested to be kept relatively small (here at 0.0015) to + prevent the merging of isotopes. + - `expandRt`: Usually set to approximately half the size of the average + retention time width used for chromatographic peak detection (in this + case, 2.5 seconds). +- `minProp`: Used to determine whether candidates will be merged. + Chromatographic peaks with overlapping *m/z* ranges (expanded on each side + by `expandMz`) and with a tail-to-head distance in the retention time + dimension that is less than `2 * expandRt`, and for which the signal between + them is higher than `minProp` of the apex intensity of the chromatographic + peak with the lower intensity, are merged. Values for this parameter should + not be too small to avoid merging closely co-eluting ions, such as isomers. + +We test these settings below on the EICs with the split peaks. + +```{r test merging} +#| fig-cap: "Figure 14. Examples of CentWave peak detection artifacts after merging." +#' set up the parameter +param <- MergeNeighboringPeaksParam(expandRt = 2.5, expandMz = 0.0015, + minProp = 0.75) + +#' Perform the peak refinement on the EICs +eics <- refineChromPeaks(eics, param = param) +plot(eics) +``` + +We can observe that the artificially split peaks have been appropriately merged. +Therefore, we next apply these settings to the entire dataset. After peak +merging, column `"merged"` in the result object's `chromPeakData()` data frame +can be used to evaluate which chromatographic peaks in the result represent +signal from merged, or from the originally identified chromatographic peaks. + +```{r} +#' Apply on whole dataset +lcms1 <- refineChromPeaks(lcms1, param = param, chunkSize = 5) +chromPeakData(lcms1)$merged |> + table() +``` + +Before proceeding with the next preprocessing step it is generally suggested to +evaluate the results of the chromatographic peak detection on EICs of e.g. +internal standards or other compounds/ions known to be present in the samples. + +```{r} +eics_is_chrompeaks <- eic_is + +eic_is <- chromatogram(lcms1, + rt = as.matrix(intern_standard[, c("rtmin", "rtmax")]), + mz = as.matrix(intern_standard[, c("mzmin", "mzmax")])) +fData(eic_is) <- fData(eics_is_chrompeaks) + +eic_cystine <- eic_is["cystine_13C_15N", ] +eic_met <- eic_is["methionine_13C_15N", ] +``` + +Additionally, evaluating and comparing the number of identified chromatographic +peaks in all samples of a data set can help spotting potentially problematic +samples. Below we count the number of chromatographic peaks per sample and show +these numbers in a table. + +```{r} +#| tbl-cap: "Table 5. Samples and number of identified chromatographic peaks." +#' Count the number of peaks per sample and summarize them in a table. +data.frame(sample_name = sampleData(lcms1)$sample_name, + peak_count = as.integer(table(chromPeaks(lcms1)[, "sample"]))) |> + kable(format = "pipe") +``` + +A similar number of chromatographic peaks was identified within the various +samples of the data set. + +Additional options to evaluate the results of the chromatographic peak detection +can be implemented using the `plotChromPeaks()` function or by summarizing the +results using base R commands. + +## Retention time alignment + +Despite using the same chromatographic settings and conditions retention time +shifts are unavoidable. Indeed, the performance of the instrument can change +over time, for example due to small variations in environmental conditions, such +as temperature and pressure. These shifts will be generally small if samples are +measured within the same batch/measurement run, but can be considerable if data +of an experiment was acquired across a longer time period. + +To evaluate the presence of a shift we extract and plot below the BPC from the +QC samples. + +```{r} +#| fig-cap: "Figure 15. BPC of QC samples." +#' Get QC samples +QC_samples <- sampleData(lcms1)$phenotype == "QC" + +#' extract BPC +lcms1[QC_samples] |> + chromatogram(aggregationFun = "max", chromPeaks = "none") |> + plot(col = col_phenotype["QC"], main = "BPC of QC samples") |> + grid() +``` + +These QC samples representing the same sample (pool) were measured at regular +intervals during the measurement run of the experiment and were all measured on +the same day. Still, small shifts can be observed, especially in the region +between 100 and 150 seconds. To facilitate proper correspondence of signals +across samples (and hence definition of the LC-MS features), it is essential to +minimize these differences in retention times. + +Theoretically, we proceed in two steps: first we select only the QC samples of +our dataset and do a first alignment on these, using the so-called *anchor +peaks*. In this way we can assume a linear shift in time, since we are always +measuring the same sample in different regular time intervals. Despite having +external QCs in our data set, we still use the subset-based alignment assuming +retention time shifts to be independent of the different sample matrix (human +serum or plasma) and instead are mostly instrument-dependent. Note that it would +also be possible to manually specify anchor peaks, respectively their retention +times or to align a data set against an external, reference, data set. More +information is provided in the vignettes of the `r Biocpkg("xcms")` package. + +After calculating how much to adjust the retention time in these samples, we +apply this shift also on the study samples. + +In *xcms* retention time alignment can be performed using the `adjustRtime()` +function with an alignment algorithm. For this example we use the *PeakGroups* +method [@smith_xcms_2006] that performs the alignment by minimizing differences +in retention times of a set of *anchor peaks* in the different samples. This +method requires an initial correspondence analysis to match/group +chromatographic peaks across samples from which the algorithm then selects the +anchor peaks for the alignment. + +For the initial correspondence, we use the *PeakDensity* approach +[@smith_xcms_2006] that groups chromatographic peaks with similar *m/z* and +retention time into LC-MS features. The parameters for this algorithm, that can +be configured using the `PeakDensityParam` object, are `sampleGroups`, +`minFraction`, `binSize`, `ppm` and `bw`. + +`binSize`, `ppm` and `bw` allow to specify how similar the chromatographic +peaks' *m/z* and retention time values need to be to consider them for grouping +into a feature. + +- `binSize` and `ppm` define the required similarity of *m/z* values. Within + each *m/z* bin (defined by `binSize` and `ppm`) areas along the retention + time axis with a high chromatographic peak density (considering all peaks in + all samples) are identified, and chromatographic peaks within these regions + are considered for grouping into a feature. + +- High density areas are identified using the base R `density()` function, for + which `bw` is a parameter: higher values define wider retention time areas, + lower values require chromatographic peaks to have more similar retention + times. This parameter can be seen as the black line on the plot below, + corresponding to the smoothness of the density curve. + +Whether such candidate peaks get grouped into a feature depends also on +parameters `sampleGroups` and `minFraction`: + +- `sampleGroups` should provide, for each sample, the sample group it belongs + to. + +- `minFraction` is expected to be a value between 0 and 1 defining the + proportion of samples within at least one of the sample groups (defined with + `sampleGroups`) in which a chromatographic peaks was detected to group them + into a feature. + +For the initial correspondence, parameters don't need to be fully optimized. +Selection of dataset-specific parameter values is described in more detail in +the next section. For our dataset, we use small values for `binSize` and `ppm` +and, importantly, also for parameter `bw`, since for our data set an ultra high +performance (UHP) LC setup was used. For `minFraction` we use a high value (0.9) +to ensure only features are defined for chromatographic peaks present in almost +all samples of one sample group (which can then be used as anchor peaks for the +actual alignment). We will base the alignment later on QC samples only and hence +define for `sampleGroups` a binary variable grouping samples either into a +study, or QC group. + +```{r} +#| fig-cap: "Figure 16. Initial correspondence analysis." +# Initial correspondence analysis +param <- PeakDensityParam(sampleGroups = sampleData(lcms1)$phenotype == "QC", + minFraction = 0.9, + binSize = 0.01, ppm = 10, + bw = 2) +lcms1 <- groupChromPeaks(lcms1, param = param) + +plotChromPeakDensity( + eic_cystine, param = param, + col = paste0(col_sample, "80"), + peakCol = col_sample[chromPeaks(eic_cystine)[, "sample"]], + peakBg = paste0(col_sample[chromPeaks(eic_cystine)[, "sample"]], 20), + peakPch = 16) +``` + +*PeakGroups*-based alignment can next be performed using the `adjustRtime()` +function with a `PeakGroupsParam` parameter object. The parameters for this +algorithm are: + +- `subsetAdjust` and `subset`: Allows for subset alignment. Here we base the + retention time alignment on the QC samples, i.e., retention time shifts will + be estimated based on these repeatedly measured samples. The resulting + adjustment is then applied to the entire data. For data sets in which QC + samples (e.g. sample pools) are measured repeatedly, we strongly suggest to + use this method. Note also that for subset-based alignment the samples + should be ordered by injection index (i.e., in the order in which they were + measured during the measurement run). + +- `minFraction`: A value between 0 and 1 defining the proportion of samples + (of the full data set, or the data subset defined with `subset`) in which a + chromatographic peak has to be identified to use it as *anchor peak*. This + is in contrast to the `PeakDensityParam` where this parameter was used to + define a proportion within a sample group. + +- `span`: The *PeakGroups* method allows, depending on the data, to adjust + regions along the retention time axis differently. To enable such local + alignments the *LOESS* function is used and this parameter defines the + degree of smoothing of this function. Generally, values between 0.4 and 0.6 + are used, however, it is suggested to evaluate alignment results and + eventually adapt parameters if the result was not satisfactory. + +Below we perform the alignment of our data set based on retention times of +anchor peaks defined in the subset of QC samples. + +```{r} +#' Define parameters of choice +subset <- which(sampleData(lcms1)$phenotype == "QC") +param <- PeakGroupsParam(minFraction = 0.9, extraPeaks = 50, span = 0.5, + subsetAdjust = "average", + subset = subset) + +#' Perform the alignment +lcms1 <- adjustRtime(lcms1, param = param) +``` + +Alignment adjusted the retention times of all spectra in the data set, as well +as the retention times of all identified chromatographic peaks. + +Once the alignment has been performed, the user should evaluate the results +using the `plotAdjustedRtime()` function. This function visualizes the +difference between adjusted and raw retention time for each sample on the y-axis +along the adjusted retention time on the x-axis. Dot points represent the +position of the used anchor peak along the retention time axis. For optimal +alignment in all areas along the retention time axis, these anchor peaks should +be scattered all over the retention time dimension. + +```{r} +#| fig-cap: "Figure 17. Retention time alignment results." +#' Visualize alignment results +plotAdjustedRtime(lcms1, col = paste0(col_sample, 80), peakGroupsPch = 1) +grid() +legend("topright", col = col_phenotype, + legend = names(col_phenotype), lty = 1, bty = "n") +``` + +All samples from the present data set were measured within the same measurement +run, resulting in small retention time shifts. Therefore, only little +adjustments needed to be performed (shifts of at maximum 1 second as can be seen +in the plot above). Generally, the magnitude of adjustment seen in such plots +should match the expectation from the analyst. + +We can also compare the BPC before and after alignment. To get the original +data, i.e. the raw retention times, we can use the `dropAdjustedRtime()` +function: + +```{r} +#' Get data before alignment +lcms1_raw <- dropAdjustedRtime(lcms1) + +#' Apply the adjusted retention time to our dataset +lcms1 <- applyAdjustedRtime(lcms1) +``` + +```{r bpc before and after1} +#| fig-cap: "Figure 18. BPC before and after alignment." +#' Plot the BPC before and after alignment +par(mfrow = c(2, 1), mar = c(2, 1, 1, 0.5)) +chromatogram(lcms1_raw, aggregationFun = "max", chromPeaks = "none") |> + plot(main = "BPC before alignment", col = paste0(col_sample, 80)) +grid() +legend("topright", col = col_phenotype, + legend = names(col_phenotype), lty = 1, bty = "n", horiz = TRUE) + +chromatogram(lcms1, aggregationFun = "max", chromPeaks = "none") |> + plot(main = "BPC after alignment", + col = paste0(col_sample, 80)) +grid() +legend("topright", col = col_phenotype, + legend = names(col_phenotype), lty = 1, bty = "n", horiz = TRUE) +``` + +The largest shift can be observed in the retention time range from 120 to 130s. +Apart from that retention time range, only little changes can be observed. + +We next evaluate the impact of the alignment on the EICs of the selected +internal standards. We thus below first extract the ion chromatograms after +alignment. + +```{r} +#' Store the EICs before alignment +eics_is_refined <- eic_is + +#' Update the EICs +eic_is <- chromatogram(lcms1, + rt = as.matrix(intern_standard[, c("rtmin", "rtmax")]), + mz = as.matrix(intern_standard[, c("mzmin", "mzmax")])) +fData(eic_is) <- fData(eics_is_refined) + +#' Extract the EICs for the test ions +eic_cystine <- eic_is["cystine_13C_15N"] +eic_met <- eic_is["methionine_13C_15N"] +``` + +We can now evaluate the alignment effect in our test ions. We will plot the EICs +before and after alignment for both the isotope labeled cystine and methionine. + +```{r specific ion before and after1} +#| fig-cap: "Figure 19. EICs of cystine and methionine before and after alignment." +par(mfrow = c(2, 2), mar = c(4, 4.5, 2, 1)) + +old_eic_cystine <- eics_is_refined["cystine_13C_15N"] +plot(old_eic_cystine, main = "Cystine before alignment", peakType = "none", + col = paste0(col_sample, 80)) +grid() +abline(v = intern_standard["cystine_13C_15N", "RT"], col = "red", lty = 3) + +old_eic_met <- eics_is_refined["methionine_13C_15N"] +plot(old_eic_met, main = "Methionine before alignment", + peakType = "none", col = paste0(col_sample, 80)) +grid() +abline(v = intern_standard["methionine_13C_15N", "RT"], col = "red", lty = 3) + +plot(eic_cystine, main = "Cystine after alignment", peakType = "none", + col = paste0(col_sample, 80)) +grid() +abline(v = intern_standard["cystine_13C_15N", "RT"], col = "red", lty = 3) +legend("topright", col = col_phenotype, + legend = names(col_phenotype), lty = 1, bty = "n") + +plot(eic_met, main = "Methionine after alignment", + peakType = "none", col = paste0(col_sample, 80)) +grid() +abline(v = intern_standard["methionine_13C_15N", "RT"], col = "red", lty = 3) +``` + +The non-endogenous cystine ion was already well aligned so the difference is +minimal. The methionine ion, however, shows an improvement in alignment. + +In addition to a visual inspection of the results, we also evaluate the impact +of the alignment by comparing the variance in retention times for internal +standards before and after alignment. To this end, we first need to identify +chromatographic peaks in each sample with an *m/z* and retention time close to +the expected values for each internal standard. For this we use the +`matchValues()` function from the `r Biocpkg("MetaboAnnotation")` package +[@rainer_modular_2022] using the `MzRtParam` method to identify all +chromatographic peaks with similar *m/z* (+/- 50 ppm) and retention time (+/- 10 +seconds) to the internal standard's values. With parameters `mzColname` and +`rtColname` we specify the column names in the query (our IS) and target +(chromatographic peaks) that contain the *m/z* and retention time values on +which to match entities. We perform this matching separately for each sample. +For each internal standard in every sample, we use the `filterMatches()` +function and the `SingleMatchParam()` parameter to select the chromatographic +peak with the highest intensity. + +```{r} +#' Extract the matrix with all chromatographic peaks and add a column +#' with the ID of the chromatographic peak +chrom_peaks <- chromPeaks(lcms1) |> as.data.frame() +chrom_peaks$peak_id <- rownames(chrom_peaks) + +#' Define the parameters for the matching and filtering of the matches +p_1 <- MzRtParam(ppm = 50, toleranceRt = 10) +p_2 <- SingleMatchParam(duplicates = "top_ranked", column = "target_maxo", + decreasing = TRUE) + +#' Iterate over samples and identify for each the chromatographic peaks +#' with similar m/z and retention time than the onse from the internal +#' standard, and extract among them the ID of the peaks with the +#' highest intensity. +intern_standard_peaks <- lapply(seq_along(lcms1), function(i) { + tmp <- chrom_peaks[chrom_peaks[, "sample"] == i, , drop = FALSE] + mtch <- matchValues(intern_standard, tmp, + mzColname = c("mz", "mz"), + rtColname = c("RT", "rt"), + param = p_1) + mtch <- filterMatches(mtch, p_2) + mtch$target_peak_id +}) |> + do.call(what = cbind) +``` + +We have now for each internal standard the ID of the chromatographic peak in +each sample that most likely represents signal from that ion. We can now extract +the retention times for these chromatographic peaks before and after alignment. + +```{r} +#' Define the index of the selected chromatographic peaks in the +#' full chromPeaks matrix +idx <- match(intern_standard_peaks, rownames(chromPeaks(lcms1))) + +#' Extract the raw retention times for these +rt_raw <- chromPeaks(lcms1_raw)[idx, "rt"] |> + matrix(ncol = length(lcms1_raw)) + +#' Extract the adjusted retention times for these +rt_adj <- chromPeaks(lcms1)[idx, "rt"] |> + matrix(ncol = length(lcms1_raw)) + +``` + +We can now evaluate the impact of the alignment on the retention times of +internal standards across the full data set: + +```{r} +#| fig-cap: "Figure 20. Retention time variation of internal standards before and after alignment." +list(all_raw = rowSds(rt_raw, na.rm = TRUE), + all_adj = rowSds(rt_adj, na.rm = TRUE) + ) |> + vioplot(ylab = "sd(retention time)") +grid() +``` + +On average, the variation in retention times of internal standards across +samples was *very* slightly reduced by the alignment. + +## Correspondence + +We briefly touched on the subject of correspondence before to determine anchor +peaks for alignment. Generally, the goal of the correspondence analysis is to +identify chromatographic peaks that originate from the same types of ions in all +samples of an experiment and to group them into LC-MS *features*. At this point, +proper configuration of parameter `bw` is crucial. Here we illustrate how +sensible choices for this parameter's value can be made. We use below the +`plotChromPeakDensity()` function to *simulate* a correspondence analysis with +the default values for *PeakGroups* on the extracted ion chromatograms of our +two selected isotope labeled ions. This plot shows the EIC in the top panel, and +the apex position of chromatographic peaks in the different samples (y-axis), +along retention time (x-axis) in the lower panel. + +```{r} +#| fig-cap: "Figure 21. Initial correspondence analysis, Cystine." +#' Default parameter for the grouping and apply them to the test ions BPC +param <- PeakDensityParam(sampleGroups = sampleData(lcms1)$phenotype, bw = 30) + +param + +plotChromPeakDensity( + eic_cystine, param = param, + col = paste0(col_sample, "80"), + peakCol = col_sample[chromPeaks(eic_cystine)[, "sample"]], + peakBg = paste0(col_sample[chromPeaks(eic_cystine)[, "sample"]], 20), + peakPch = 16) +``` + +```{r} +#| fig-cap: "Figure 22. Initial correspondence analysis, Methionine." +plotChromPeakDensity(eic_met, param = param, + col = paste0(col_sample, "80"), + peakCol = col_sample[chromPeaks(eic_met)[, "sample"]], + peakBg = paste0(col_sample[chromPeaks(eic_met)[, "sample"]], 20), + peakPch = 16) +``` + +Grouping of peaks depends on the smoothness of the previousl mentionned density +curve that can be configured with the parameter `bw`. As seen above, the +smoothness is too high to properly group our features. When looking at the +default parameters, we can observe that indeed, the `bw` parameter is set to +`bw = 30`, which is too high for modern UHPLC-MS setups. We reduce the value of +this parameter to 1.8 and evaluate its impact. + +```{r} +#| fig-cap: "Figure 23. Correspondence analysis with optimized parameters, Cystine." +#' Updating parameters +param <- PeakDensityParam(sampleGroups = sampleData(lcms1)$phenotype, bw = 1.8) + +plotChromPeakDensity( + eic_cystine, param = param, + col = paste0(col_sample, "80"), + peakCol = col_sample[chromPeaks(eic_cystine)[, "sample"]], + peakBg = paste0(col_sample[chromPeaks(eic_cystine)[, "sample"]], 20), + peakPch = 16) +``` + +```{r} +#| fig-cap: "Figure 24. Correspondence analysis with optimized parameters, Methionine." +plotChromPeakDensity(eic_met, param = param, + col = paste0(col_sample, "80"), + peakCol = col_sample[chromPeaks(eic_met)[, "sample"]], + peakBg = paste0(col_sample[chromPeaks(eic_met)[, "sample"]], 20), + peakPch = 16) +``` + +We can observe that the peaks are now grouped more accurately into a single +feature for each test ion. The other important parameters optimized here are: + +- `binsize`: Our data was generated on a high resolution MS instrument, thus + we select a low value for this paramete. + +- `ppm`: For TOF instruments, it is suggested to use a value for `ppm` larger + than 0 to accommodate the higher measurement error of the instrument for + larger *m/z* values. + +- `minFraction`: We set it to `minFraction = 0.75`, hence defining features + only if a chromatographic peak was identified in at least 75% of the samples + of one of the sample groups. + +- `sampleGroups`: We use the information available in our `sampleData`'s + `"phenotype"` column. + +```{r} +#' Define the settings for the param +param <- PeakDensityParam(sampleGroups = sampleData(lcms1)$phenotype, + minFraction = 0.75, binSize = 0.01, ppm = 10, + bw = 1.8) + +#' Apply to whole data +lcms1 <- groupChromPeaks(lcms1, param = param) +``` + +After correspondence analysis it is suggested to evaluate the results again for +selected EICs. Below we extract signal for an *m/z* similar to that of the +isotope labeled methionine for a larger retention time range. Importantly, to +show the actual correspondence results, we set `simulate = FALSE` for the +`plotChromPeakDensity()` function. + +```{r} +#| fig-cap: "Figure 25. Correspondence analysis results, Methionine." +#' Extract chromatogram for an m/z similar to the one of the labeled methionine +chr_test <- chromatogram(lcms1, + mz = as.matrix(intern_standard["methionine_13C_15N", + c("mzmin", "mzmax")]), + rt = c(145, 200), + aggregationFun = "max") +plotChromPeakDensity( + chr_test, simulate = FALSE, + col = paste0(col_sample, "80"), + peakCol = col_sample[chromPeaks(chr_test)[, "sample"]], + peakBg = paste0(col_sample[chromPeaks(chr_test)[, "sample"]], 20), + peakPch = 16) +``` + +As hoped, signal from the two different ions are now grouped into separate +features. Generally, correspondence results should be evaluated on more such +extracted chromatograms. + +Another interesting information to look a the the distribution of these features +along the retention time axis. + +```{r} +#| tbl-cap: "Table 5. Distribution of features along the retention time axis." +# Bin features per RT slices +vc <- featureDefinitions(lcms1)$rtmed +breaks <- seq(0, max(vc, na.rm = TRUE) + 1, length.out = 15) |> + round(0) +cuts <- cut(vc, breaks = breaks, include.lowest = TRUE) + +table(cuts) |> + kable(format = "pipe") +``` + +The results from the correspondence analysis are now stored, along with the +results from the other preprocessing steps, within our `XcmsExperiment` result +object. The correspondence results, i.e., the definition of the LC-MS features, +can be extracted using the `featureDefinitions()` function. + +```{r} +#' Definition of the features +featureDefinitions(lcms1) |> + head() +``` + +This data frame provides the average *m/z* and retention time (in columns +`"mzmed"` and `"rtmed"`) that characterize a LC-MS feature. Column, `"peakidx"` +contains the indices of all chromatographic peaks that were assigned to that +feature. The actual abundances for these features, which represent also the +final preprocessing results, can be extracted with the `featureValues()` +function: + +```{r} +#' Extract feature abundances +featureValues(lcms1, method = "sum") |> + head() +``` + +We can note that some features (e.g. F0003 and F0006) have missing values in +some samples. This is expected to a certain degree as not in all samples +features, respectively their ions, need to be present. We will address this in +the next section. + +## Gap filling + +The previously observed missing values (*NA*) could be attributed to various +reasons. Although they might represent a genuinely missing value, indicating +that an ion (feature) was truly not present in a particular sample, they could +also be a result of a failure in the preceding chromatographic peak detection +step. It is crucial to be able to recover missing values of the latter category +as much as possible to reduce the eventual need for data imputation. We next +examine how prevalent missing values are in our present dataset: + +```{r} +#' Percentage of missing values +sum(is.na(featureValues(lcms1))) / + length(featureValues(lcms1)) * 100 +``` + +We can observe a substantial number of missing values values in our dataset. +Let's therefore delve into the process of *gap-filling*. We first evaluate some +example features for which a chromatographic peak was only detected in some +samples: + +```{r} +#| fig-cap: "Figure 26. Examples of chromatographic peaks with missing values." +ftidx <- which(is.na(rowSums(featureValues(lcms1)))) +fts <- rownames(featureDefinitions(lcms1))[ftidx] +farea <- featureArea(lcms1, features = fts[1:2]) + +chromatogram(lcms1[c(2, 3)], + rt = farea[, c("rtmin", "rtmax")], + mz = farea[, c("mzmin", "mzmax")]) |> + plot(col = c("red", "blue"), lwd = 2) +``` + +In both instances, a chromatographic peak was only identified in one of the two +selected samples (red line), hence a missing value is reported for this feature +in those particular samples (blue line). However, in both cases, signal was +measured in both samples, thus, reporting a missing value would not be correct +in this example. The signal for this feature is very low, which is the most +likely reason peak detection failed. To rescue signal in such cases, the +`fillChromPeaks()` function can be used with the `ChromPeakAreaParam` approach. +This method defines an *m/z*-retention time area for each feature based on +detected peaks, where the signal for the respective ion is expected. It then +integrates all intensities within this area in samples that have missing values +for that feature. This is then reported as feature abundance. Below we apply +this method using the default parameters. + +```{r} +#' Fill in the missing values in the whole dataset +lcms1 <- fillChromPeaks(lcms1, param = ChromPeakAreaParam(), chunkSize = 5) + +#' Percentage of missing values after gap-filling +sum(is.na(featureValues(lcms1))) / + length(featureValues(lcms1)) * 100 +``` + +With `fillChromPeaks()` we could thus rescue most of the missing data in the +data set. Note that, even if in a sample no ion would be present, in the worst +case noise would be integrated, which is expected to be much lower than actual +chromatographic peak signal. Let's look at our previously missing values again: + +```{r echo=FALSE} +#| fig-cap: "Figure 27. Examples of chromatographic peaks with missing values after gap-filling." +#' Extract EICs again and plot them +chromatogram(lcms1[c(2, 3)], + mz = farea[, c("mzmin", "mzmax")], + rt = farea[, c("rtmin", "rtmax")]) |> + plot(col = c("red", "blue"), lwd = 2) +``` + +After gap-filling, also in the blue colored sample a chromatographic peak is +present and its peak area would be reported as feature abundance for that +sample. + +To further assess the effectiveness of the gap-filling method for rescuing +signals, we can also plot the average signal of features with at least one +missing value against the average of filled-in signal. It is advisable to +perform this analysis on repeatedly measured samples; in this case, our QC +samples will be used. + +For this, we extract: + +- Feature values from detected chromatographic peaks by setting + `filled = FALSE` in the `featuresValues()` call. + +- The filled-in signal by first extracting both detected and gap-filled + abundances and then replace the values for detected chromatographic peaks + with `NA`. + +Then, we calculate the row averages of both of these matrices and plot them +against each other. + +```{r Detected vs filled signal1} +#' Get only detected signal in QC samples +vals_detect <- featureValues(lcms1, filled = FALSE)[, QC_samples] + +#' Get detected and filled-in signal +vals_filled <- featureValues(lcms1)[, QC_samples] + +#' Replace detected signal with NA +vals_filled[!is.na(vals_detect)] <- NA + +#' Identify features with at least one filled peak +has_filled <- is.na(rowSums(vals_detect)) + +#' Calculate row averages for features with missing values +avg_detect <- rowMeans(vals_detect[has_filled, ], na.rm = TRUE) +avg_filled <- rowMeans(vals_filled[has_filled, ], na.rm = TRUE) + +#' Plot the values against each other (in log2 scale) +plot(log2(avg_detect), log2(avg_filled), + xlim = range(log2(c(avg_detect, avg_filled)), na.rm = TRUE), + ylim = range(log2(c(avg_detect, avg_filled)), na.rm = TRUE), + pch = 21, bg = "#00000020", col = "#00000080") +grid() +abline(0, 1) +``` + +The detected (x-axis) and gap-filled (y-axis) values for QC samples are highly +correlated. Especially for higher abundances, the agreement is very high, while +for low intensities, as can be expected, differences are higher and trending to +below the correlation line. Below we, in addition, fit a linear regression line +to the data and summarize its results + +```{r} +#' fit a linear regression line to the data +l <- lm(log2(avg_filled) ~ log2(avg_detect)) +summary(l) +``` + +The linear regression line has a slope of `r round(coef(l)[2], 2)` and an +intercept of `r round(coef(l)[1], 2)`. This indicates that the filled-in signal +is on average `r round(coef(l)[2], 2)` times higher than the detected signal. + +## Preprocessing results + +The final results of the LC-MS data preprocessing are stored within the +`XcmsExperiment` object. This includes the identified chromatographic peaks, the +alignment results, as well as the correspondence results. In addition, to +guarantee reproducibility, this result object keeps track of all performed +processing steps, including the individual parameter objects used to configure +these. The `processHistory()` function returns a list of the various applied +processing steps in chronological order. Below, we extract the information for +the first step of the performed preprocessing. + +```{r Process history} +#' Check first step of the process history +processHistory(lcms1)[[1]] +``` + +The `processParam()` function could then be used to extract the actual parameter +class used to configure this processing step. + +The final result of the whole LC-MS data preprocessing is a two-dimensional +matrix with abundances of the so-called LC-MS features in all samples. Note that +at this stage of the analysis features are only characterized by their *m/z* and +retention time and we don't have yet any information which metabolite a feature +could represent. + +As we have seen before, such feature matrix can be extracted with the +`featureValues()` function and the corresponding feature characteristics (i.e., +their *m/z* and retention time values) using the `featureDefinitions()` +function. Thus, these two arrays could be extracted from the *xcms* result +object and used/imported in other analysis packages for further processing. They +could for example also be exported to tab delimited text files, and used in an +external tool, or used, if also MS2 spectra were available, for feature-based +molecular networking in the GNPS analysis environment +[@nothias_feature-based_2020]. + +For further processing in R, if no reference or link to the raw MS data is +required, it is suggested to extract the *xcms* preprocessing result using the +`quantify()` function as a `SummarizedExperiment` object, Bioconductor's default +container for data from biological assays/experiments. This simplifies +integration with other Bioconductor analysis packages. The `quantify()` function +takes the same parameters than the `featureValues()` function, thus, with the +call below we extract a `SummarizedExperiment` with only the detected, but not +gap-filled, feature abundances: + +```{r} +#' Extract results as a SummarizedExperiment +res <- quantify(lcms1, method = "sum", filled = FALSE) +res +``` + +Sample identifications from the *xcms* result's `sampleData()` are now available +as `colData()` (column, sample annotations) and the `featureDefinitions()` as +`rowData()` (row, feature annotations). The feature values have been added as +the first `assay()` in the `SummarizedExperiment` and even the processing +history is available in the object's `metadata()`. A `SummarizedExperiment` +supports multiple assays, all being numeric matrices with the same dimensions. +Below we thus add the detected and gap-filled feature abundances as an +additional assay to the `SummarizedExperiment`. + +```{r} +assays(res)$raw_filled <- featureValues(lcms1, method = "sum", + filled = TRUE ) + +#' Different assay in the SummarizedExperiment object +assayNames(res) +``` + +Feature abundances can be extracted with the `assay()` function. Below we +extract the first 6 lines of the detected and gap-filled feature abundances: + +```{r} +assay(res, "raw_filled") |> head() +``` + +An advantage, in addition to being a container for the full preprocessing +results is also the possibility of an easy and intuitive creation of data +subsets ensuring data integrity. It would for example be very easy to subset the +full data to a selection of features and/or samples: + +```{r} +res[1:14, 3:8] +``` + +Before moving to the next step of the analysis, it is advisable to save the +preprocessing results. We have multiple format options to save into, and they +can be found in the `MsIO` package +[documentation](https://rformassspectrometry.github.io/MsIO/articles/MsIO.html). +Below we will save our `XcmsExperiment` object into a file format handled by the +[alabster framework](https://github.com/ArtifactDB/alabaster.base), which +ensures that our object can be easily read from other languages like Python and +Javascript as well as loaded easily back into R. + +```{r} +#' Save the preprocessing results +# d <- file.path(tempdir(), "objects/lcms1") +# saveMsObject(lcms1, AlabasterParam(path = d)) +#for now let's do R object because the previous method is not implemented yet. +save(lcms1, file = "preprocessed_lcms1.RData") +``` + +# Data normalization + +After preprocessing, data normalization or scaling might need to be applied to +remove any technical variances from the data. While simple approaches like +median scaling can be implemented with a few lines of R code, more advanced +normalization algorithms are available in packages such as Bioconductor's +`r Biocpkg("preprocessCore")`. The comprehensive workflow "Notame" also propose +a very interesting normalization approach adaptable and scalable to the user +dataset [@klavus_notame_2020]. + +Generally, for LC-MS data, bias can be categorized into three main +groups[@broadhurst_guidelines_2018]: + +- Variances introduced by sample collection and initial processing, which can + include differences in sample amounts. This type of bias is expected to be + sample-specific and affect all signals in a sample in the same way. Methods + like median scaling, LOESS or quantiles normalization can adjust this bias. + +- Signal drifts along measurement of samples from an experiment. Reasons for + such drifts can be related to aging of the instrumentation used (columns, + detector), but also to changes in metabolite abundances and characteristics + due to reactions and modifications, such as oxidation. These changes are + expected to affect more the samples measured later in a run rather than the + ones measured at the beginning. For this reason, this bias can play a major + role in large experiments bias can play a major role in large experiments + measured over a long time range and is usually considered to affect + individual metabolites (or metabolite groups) differently. For adjustment, + moving average or linear regression-based approaches can be used. The latter + can for example be performed using the `adjust_lm()` function from the + `r Biocpkg("MetaboCoreUtils")` package. + +- Batch-related biases. These comprise any noise that is specific to a larger + set of samples, which can be the set of samples measured in one LC-MS + measurement run (i.e. from one analysis plate) or samples measured using a + specific batch of reagents. This noise is assumed to affect all samples in + one batch in the same way and linear modeling-based approaches can be used + to adjust for this. + +Unwanted variation can arise from various sources and is highly dependent on the +experiment. Therefore, data normalization should be chosen carefully based on +experimental design, statistical aims, and the balance of accuracy and precision +achieved through the use of auxiliary information. + +Sample preparation biases can be evaluated using internal standards, depending +however also on when they were added to the sample mixes during sample +processing. Repeated measurements of QC samples on the other hand allows to +estimate and correct for LC-MS specific biases. Also, proper planning of an +experiment, such as measurement of study samples in random order, can largely +avoid biases introduced by most of the above mentioned sources of variance. + +In this workflow we present some tools to assess data quality and evaluate need +for normalization as well as options for normalization. For space reasons we are +not able to provide solutions to adjust for all possible sources of variation. + +## Initial quality assessment + +A principal component analysis (PCA) is a very helpful tool for an initial, +unsupervised, visualization of the data that also provides insights into +potential quality issues in the data. In order to apply a PCA to the measured +feature abundances, we need however to impute (the still present) missing +values. We assume that most of these missing values (after the gap-filling step) +represent signal which is below detection limit. In such cases, missing values +can be replaced with random values sampled from a uniform distribution, ranging +from half of the smallest measured value to the smallest measured value for a +specific feature. The uniform distribution is defined with two parameters +(minimum and maximum) and all values between them have an equal probability of +being selected. + +Below we impute missing values with this approach and add the resulting data +matrix as a new *assay* to our result object. + +```{r} +#' Load preprocessing results +## load("SumExp.RData") +## loadResults(RDataParam("data.RData")) + +#' Impute missing values using an uniform distribution +na_unidis <- function(z) { + na <- is.na(z) + if (any(na)) { + min = min(z, na.rm = TRUE) + z[na] <- runif(sum(na), min = min/2, max = min) + } + z +} + +#' Row-wise impute missing values and add the data as a new assay +tmp <- apply(assay(res, "raw_filled"), MARGIN = 1, na_unidis) +assays(res)$raw_filled_imputed <- t(tmp) +``` + +### Principal Component Analysis + +PCA is a powerful tool for detecting biases in data. As a dimensionality +reduction technique, it enables visualization of data in a lower-dimensional +space. In the context of LC-MS data, PCA can be used to identify overall biases +in batch, sample, injection index, etc. However, it is important to note that +PCA is a linear method and may not be able to detect all biases in the data. + +Before plotting the PCA, we apply a log2 transform, center and scale the data. +The log2 transformation is applied to stabilize the variance while centering to +remove dependency on absolute abundances. + +```{r unsupervised checks1} +#| fig-cap: "Figure 28. PCA of the data." +#' Log2 transform and scale data +vals <- assay(res, "raw_filled_imputed") |> + log2() |> + t() |> + scale(center = TRUE, scale = TRUE) + +#' Perform the PCA +pca_res <- prcomp(vals, scale = FALSE, center = FALSE) + +#' Plot the results +vals_st <- cbind(vals, phenotype = res$phenotype) +pca_12 <- autoplot(pca_res, data = vals_st , colour = 'phenotype', scale = 0) + + scale_color_manual(values = col_phenotype) +pca_34 <- autoplot(pca_res, data = vals_st, colour = 'phenotype', + x = 3, y = 4, scale = 0) + + scale_color_manual(values = col_phenotype) +grid.arrange(pca_12, pca_34, ncol = 2) +``` + +The PCA above shows a clear separation of the study samples (plasma) from the QC +samples (serum) on the first principal component (PC1). The separation based on +phenotype is visible on the third principal component (PC3). + +In some cases, it can be a better option to remove the imputed values and +evaluate the PCA again. This is especially true if the imputed values are +replacing a large proportion of the data. + +### Intensity evaluation + +Global differences in feature abundances between samples (e.g. due to +sample-specific biases) can be evaluated by plotting the distribution of the +log2 transformed feature abundances using boxplots of violin plots. Below we +show the number of detected chromatographic peaks per sample and the +distribution of the log2 transformed feature abundances. + +```{r counts1} +#| fig-cap: "Figure 29. Number of detected peaks and feature abundances." +layout(mat = matrix(1:3, ncol = 1), height = c(0.2, 0.2, 0.8)) + +par(mar = c(0.2, 4.5, 0.2, 3)) +barplot(apply(assay(res, "raw"), MARGIN = 2, function(x) sum(!is.na(x))), + col = paste0(col_sample, 80), border = col_sample, + ylab = "# detected peaks", xaxt = "n", space = 0.012) +grid(nx = NA, ny = NULL) +barplot(apply(assay(res, "raw_filled"), MARGIN = 2, function(x) sum(!is.na(x))), + col = paste0(col_sample, 80), border = col_sample, + ylab = "# detected + filled peaks", xaxt = "n", + space = 0.012) +grid(nx = NA, ny = NULL) +vioplot(log2(assay(res, "raw_filled")), xaxt = "n", + ylab = expression(log[2]~feature~abundance), + col = paste0(col_sample, 80), border = col_sample) +points(colMedians(log2(assay(res, "raw_filled")), na.rm = TRUE), type = "b", + pch = 1) +grid(nx = NA, ny = NULL) +legend("topright", col = col_phenotype, + legend = names(col_phenotype), lty=1, lwd = 2, xpd = TRUE, ncol = 3, + cex = 0.8, bty = "n") +``` + +The upper part of the plot show that the gap filling steps allowed to rescue a +substantial number of *NAs* and allowed us to have a more consistent number of +feature values per sample. This consistency aligns with our asspumption that +every sample should have a similar amount of features detected. Additionally we +observe that, on average, the signal distribution from the individual samples is +very similar. + +An alternative way to evaluate differences in abundances between samples are +*relative log abundance* (RLA) plots [@de_livera_normalizing_2012]. An RLA value +is the abundance of a feature in a sample relative to the median abundance of +the same feature across multiple samples. We can discriminate between *within +group* and *across group* RLAs, depending whether the abundance is compared +against samples within the same sample group or across all samples. Within group +RLA plots assess the tightness of replicates within groups and should have a +median close to zero and low variation around it. When used across groups, they +allow to compare behavior between groups. Generally, between-sample differences +can be easily spotted using RLA plots. Below we calculate and visualize within +group RLA values using the `rowRla()` function from the +`r Biocpkg("MsCoreUtils")` package defining with parameter `f` the sample +groups. + +```{r rla-plot raw and filled1} +#| fig-cap: "Figure 30. RLA plot for the raw data and filled data." +par(mfrow = c(1, 1), mar = c(3.5, 4.5, 2.5, 1)) +boxplot(MsCoreUtils::rowRla(assay(res, "raw_filled"), + f = res$phenotype, transform = "log2"), + cex = 0.5, pch = 16, + col = paste0(col_sample, 80), ylab = "RLA", + border = col_sample, boxwex = 1, + outline = FALSE, xaxt = "n", main = "Relative log abundance", + cex.main = 1) +axis(side = 1, at = seq_len(ncol(res)), labels = colData(res)$sample_name) +grid(nx = NA, ny = NULL) +abline(h = 0, lty=3, lwd = 1, col = "black") +legend("topright", col = col_phenotype, + legend = names(col_phenotype), lty=1, lwd = 2, xpd = TRUE, ncol = 3, + cex = 0.8, bty = "n") +``` + +On the RLA plot above, we can observe that the medians for most samples are +indeed centered around 0. Exception are two of the *CVD* samples. Thus, while +the distribution of signals across samples is comparable, some differences seem +to be present that require a between sample normalization. + +### Internal standards + +Depending when IS are added to the sample mixes, they allow evaluation of +variances introduced by the subsequent processing and analysis steps. For the +present experiment, IS were added to the original plasma samples before sample +extraction which included also protein and lipid removal steps. They can +therefore be used to evaluate variances introduced by sample extraction and any +subsequent steps, can however not be used to infer conclusions on performance or +differences on the original sample collection (blood drawing, storage, plasma +creation). + +Here we again use the `matchValues()` function to identify features representing +signal from the IS. We then filter these matches to only keep IS that match with +a single feature using the `filterMatches()` function in combination with +`SingleMatchParam`. + +```{r} +# Do we keep IS in normalisation ? Does not give much info... Would simplify a bit +#' Creating a column within our IS table +intern_standard$feature_id <- NA_character_ + +#' Identify features matching m/z and RT of internal standards. +fdef <- featureDefinitions(lcms1) +fdef$feature_id <- rownames(fdef) +match_intern_standard <- matchValues( + query = intern_standard, + target = fdef, + mzColname = c("mz", "mzmed"), + rtColname = c("RT", "rtmed"), + param = MzRtParam(ppm = 50, toleranceRt = 10)) + +#' Keep only matches with a 1:1 mapping standard to feature. +param <- SingleMatchParam(duplicates = "remove", column = "score_rt", + decreasing = TRUE) +match_intern_standard <- filterMatches(match_intern_standard, param) + +intern_standard$feature_id <- match_intern_standard$target_feature_id +intern_standard <- intern_standard[!is.na(intern_standard$feature_id), ] +``` + +These internal standards play a crucial role in guiding the normalization +process. Given the assumption that the samples were artificially spiked, we +possess a known ground truth—that the abundance or intensity of the internal +standard should be consistent. Any difference is expected to be due to technical +differences/variance. Consequently, normalization aims to minimize variation +between samples for the internal standard, reinforcing the reliability of our +analyses. + +## Between sample normalisation + +The previous RLA plot showed that the data had some biases that need to be +corrected. Therefore, we will implement between-sample normalization using +filled-in features. This process effectively mitigates variations influenced by +technical issues, such as differences in sample preparation, processing and +injection methods. In this instance, we will employ a commonly used technique +known as median scaling [@de_livera_normalizing_2012]. + +### Median scaling + +This method involves computing the median for each sample, followed by +determining the median of these individual sample medians. This ensures +consistent median values for each sample throughout the entire data set. +Maintaining uniformity in the average total metabolite abundance across all +samples is crucial for effective implementation. + +This process aims to establish a shared baseline for the central tendency of +metabolite abundance, mitigating the impact of sample-specific technical +variations. This approach fosters a more robust and comparable analysis of the +top features across the data set. The assumption is that normalizing based on +the median, known for its lower sensitivity to extreme values, enhances the +comparability of top features and ensures a consistent average abundance across +samples. + +```{r} +#' Compute median and generate normalization factor +mdns <- apply(assay(res, "raw_filled"), MARGIN = 2, + median, na.rm = TRUE ) +nf_mdn <- mdns / median(mdns) + +#' divide dataset by median of median and create a new assay. +assays(res)$norm <- sweep(assay(res, "raw_filled"), MARGIN = 2, nf_mdn, '/') +assays(res)$norm_imputed <- sweep(assay(res, "raw_filled_imputed"), MARGIN = 2, + nf_mdn, '/') +``` + +The median scaling is calculated for both imputed and non-imputed data, with +each set stored separately within the `SummarizedExperiment` object. This +approach facilitates testing various normalization strategies while maintaining +a record of all processing steps undertaken, enabling easy regression to +previous stages if necessary. + +## Assessing overall effectiveness of the normalization approach + +It is crucial to evaluate the effectiveness of the normalization process. This +can be achieved by comparing the distribution of the log2 transformed feature +abundances before and after normalization. Additionally, the RLA plots can be +used to assess the tightness of replicates within groups and compare the +behavior between groups. + +### Principal Component Analysis + +```{r} +#| fig-cap: "Figure 31. PC1 and PC2 of the data before and after normalization." +#' Data before normalization +vals_st <- cbind(vals, phenotype = res$phenotype) +pca_raw <- autoplot(pca_res, data = vals_st, + colour = 'phenotype', scale = 0) + + scale_color_manual(values = col_phenotype) + +#' Data after normalization +vals_norm <- apply(assay(res, "norm"), MARGIN = 1, na_unidis) |> + log2() |> + scale(center = TRUE, scale = TRUE) + +pca_res_norm <- prcomp(vals_norm, scale = FALSE, center = FALSE) +vals_st_norm <- cbind(vals_norm, phenotype = res$phenotype) +pca_adj <- autoplot(pca_res_norm, data = vals_st_norm, + colour = 'phenotype', scale = 0) + + scale_color_manual(values = col_phenotype) + +grid.arrange(pca_raw, pca_adj, ncol = 2) +``` + +Normalization did not have a large impact on PC1 or PC2, but the separation of +the study groups on PC3 seems to be better and difference between QC samples +lower after normalization (see below). + +```{r} +#| fig-cap: "Figure 32. PC3 and PC4 of the data before and after normalization." +pca_raw <- autoplot(pca_res, data = vals_st , + colour = 'phenotype', x = 3, y = 4, scale = 0) + + scale_color_manual(values = col_phenotype) +pca_adj <- autoplot(pca_res_norm, data = vals_st_norm, + colour = 'phenotype', x = 3, y = 4, scale = 0) + + scale_color_manual(values = col_phenotype) + +grid.arrange(pca_raw, pca_adj, ncol = 2) +``` + +The PCA plots above show that the normalization process has not changed the +overall structure of the data. The separation between the study and QC samples +remains the same. This is an expected results as normalization should not +correct for biological variance and only technical. + +### Intensity evaluation + +We compare the RLA plots before and after between-sample normalization to +evaluate its impact on the data. + +```{r rla-plot after norm2} +#| fig-cap: "Figure 33. RLA plot before and after normalization." +par(mfrow = c(2, 1), mar = c(3.5, 4.5, 2.5, 1)) + +boxplot(rowRla(assay(res, "raw_filled"), group = res$phenotype), + cex = 0.5, pch = 16, ylab = "RLA", border = col_sample, + col = paste0(col_sample, 80), cex.main = 1, outline = FALSE, + xaxt = "n", main = "Raw data", boxwex = 1) +grid(nx = NA, ny = NULL) +legend("topright", inset = c(0, -0.2), col = col_phenotype, + legend = names(col_phenotype), lty=1, lwd = 2, xpd = TRUE, + ncol = 3, cex = 0.7, bty = "n") +abline(h = 0, lty=3, lwd = 1, col = "black") + +boxplot(rowRla(assay(res, "norm"), group = res$phenotype), + cex = 0.5, pch = 16, ylab = "RLA", border = col_sample, + col = paste0(col_sample, 80), boxwex = 1, outline = FALSE, + xaxt = "n", main = "Normallized data", cex.main = 1) +axis(side = 1, at = seq_len(ncol(res)), labels = res$sample_name) +grid(nx = NA, ny = NULL) +abline(h = 0, lty = 3, lwd = 1, col = "black") +``` + +The normalization process has effectively centered the data around the median +and medians for all samples are now closer to zero. + +### Coefficient of variation + +We next evaluate the coefficient of variation (CV, also referred to as *relative +standard deviation* RSD) for features across samples from either QC or study +samples. In QC samples, the CV would represent technical noise, while in study +samples it would include also expected biological differences. Thus, +normalization should reduce the CV in QC samples, while only slightly reducing +the CV in study samples. The CV is calculated below using the `rowRsd()` +function from the `r Biocpkg("MetaboCoreUtils")` package. By setting +`mad = TRUE` we use a more robust calculation using the median absolute +deviation instead of the standard deviation. + +```{r include=TRUE, results = "asis"} +#| tbl-cap: "Table 6. Distribution of CV values across samples for the raw and normalized data." +#' Calculate the CV values +index_study <- res$phenotype %in% c("CTR", "CVD") +index_QC <- res$phenotype == "QC" + +sample_res <- cbind( + QC_raw = rowRsd(assay(res, "raw_filled")[, index_QC], + na.rm = TRUE, mad = TRUE), + QC_norm = rowRsd(assay(res, "norm")[, index_QC], + na.rm = TRUE, mad = TRUE), + Study_raw = rowRsd(assay(res, "raw_filled")[, index_study], + na.rm = TRUE, mad = TRUE), + Study_norm = rowRsd(assay(res, "norm")[, index_study], + na.rm = TRUE, mad = TRUE) +) + +#' Summarize the values across features +res_df <- data.frame( + QC_raw = quantile(sample_res[, "QC_raw"], na.rm = TRUE), + QC_norm = quantile(sample_res[, "QC_norm"], na.rm = TRUE), + Study_raw = quantile(sample_res[, "Study_raw"], na.rm = TRUE), + Study_norm = quantile(sample_res[, "Study_norm"], na.rm = TRUE) +) + +kable(res_df, format = "pipe") +``` + +The table above shows the distribution of the CV for both raw and normalized +data. The first column highlights the % of the data which is below a given CV +value, e.g. 25% of the data has a CV equal or lower than 0.04557 at the *QC_raw* +data. As anticipated, the CV values for the QCs, which reflect technical +variance, are lower compared to those for the study samples, which include both +technical and biological variance. Overall, minimal disparity exists between the +raw and normalized data, which is a positive indication that the normalization +process has not introduced bias into the dataset, but also reflects the little +differences in average abundances between sample in the raw data. + +### Conclusion on normalization + +The overall conclusion of the normalization process is that very little variance +was present from the beginning, normalization was however able to center the +data around the median (as shown by the RLA plot). Given the simplicity and +limited size of our example dataset, we will conclude the normalization process +at this stage. For more intricate datasets with diverse biases, a tailored +approach would be devised. This could include also approaches to adjust for +signal drifts or batch effects. One possible option would be to use a +linear-model based approach such as can for example be applied with the +`adjust_lm()` function from the `r Biocpkg("MetaboCoreUtils")` package. + +# Quality control: Feature prefiltering + +After normalizing our data we can now pre-filter to clean the data before +performing any statistical analysis. In general, pre-filtering of samples and +features is performed to remove outliers. + +Below we copy the original result object to also keep the unfiltered data for +later comparisons. + +```{r} +#' Number of features before filtering +nrow(res) + +#' keep unfiltered object +res_unfilt <- res +``` + +Here we will eliminate features that exhibit high variability in our dataset. +Repeatedly measured QC samples typically serve as a robust basis for cleansing +datasets allowing to identify features with excessively high noise. As in our +data set external QC samples were used, i.e. pooled samples from a different +collection and using a slightly different sample matrix, their utility for +filtering is somewhat limited. For a comprehensive description and guidelines +for data filtering in untargeted metabolomic studies, please refer to +[@broadhurst_guidelines_2018]. + +We first restrict the data set to features for which a chromatographic peak was +detected in at least 2/3 of samples of at least one of the study samples groups. +This ensures the statistical tests carried out later on the study samples are +being performed on *reliable* signal. Also, with this filter we remove features +that were mostly detected in QC samples, but not the study samples. Such filter +can be performed with `filterFeatures()` function from the `r Biocpkg("xcms")` +package with the `PercentMissingFilter` setting. The parameters of this filer: + +- `threshold`: defines the maximal acceptable percentage of samples with + missing value(s) in at least one of the sample groups defined by parameter + `f`. +- `f`: a factor defining the sample groups. By replacing the `"QC"` sample + group with `NA` in parameter `f` we exclude the QC samples from the + evaluation and consider only the study samples. With `threshold = 40` we + keep only features for which a peak was detected in 2 out of the 3 samples + in one of the sample groups. + +To consider detected chromatographic peaks per sample, we apply the filter on +the`"raw"` assay of our result object, that contains abundance values only for +detected chromatographic peaks (prior gap-filling). + +```{r} +#' Limit features to those with at least two detected peaks in one study group. +#' Setting the value for QC samples to NA excludes QC samples from the +#' calculation. +f <- res$phenotype +f[f == "QC"] <- NA +f <- as.factor(f) +res <- filterFeatures(res, PercentMissingFilter(f = f, threshold = 40), + assay = "raw") +``` + +Following the guidelines stated above we decided to still use the QC samples for +pre-filtering, on the basis that they represent similar bio-fluids to our study +samples, and thus, we anticipate observing relatively similar metabolites +affected by similar measurement biases. + +We therefore evaluate the *dispersion ratio* (Dratio) +[@broadhurst_guidelines_2018] for all features in the data set. We accomplish +this task using the same function as above but this time with the `DratioFilter` +parameter. More filters exist for this function and we invite the user to +explore them as to decide what is best for their dataset. + +```{r} +#' Compute and filter based on the Dratio +filter_dratio <- DratioFilter(threshold = 0.4, + qcIndex = res$phenotype == "QC", + studyIndex = res$phenotype != "QC", + mad = TRUE) +res <- filterFeatures(res, filter = filter_dratio, assay = "norm_imputed") +``` + +The Dratio filter is a powerful tool to identify features that exhibit high +variability in the data, relating the variance observed in QC samples with that +in study samples. By setting a threshold of 0.4, we remove features that have a +high degree of variability between the QC and study samples. In this example, +any feature in which the deviation at the QC is higher than 40% +(`threshold = 0.4`)of the deviation on the study samples is removed. This +filtering step ensures that only features are retained that have considerably +lower technical than biological variance. + +Note that the `rowDratio()` and `rowRsd()` functions from the +`r Biocpkg("MetaboCoreUtils")` package could be used to calculate the actual +numeric values for the estimates used for filtering, to e.g. to evaluate their +distribution in the whole data set or identify data set-dependent threshold +values. + +Finally, we evaluate the number of features left after the filtering steps and +calculate the percentage of features that were removed. + +```{r} +#' Number of features after analysis +nrow(res) + +#' Percentage left: end/beginning +nrow(res)/nrow(res_unfilt) * 100 +``` + +The dataset has been reduced from `r nrow(res_unfilt)` to `r nrow(res)` +features. We did remove a considerable amount of features but this is expected +as we want to focus on the most reliable features for our analysis. For the rest +of our analysis we need to separate the QC samples from the study samples. We +will store the QC samples in a separate object for later use. + +```{r} +res_qc <- res[, res$phenotype == "QC"] +res <- res[, res$phenotype != "QC"] +``` + +We in addition calculate the CV of the QC samples and add that as an additional +column to the `rowData()` of our result object. These could be used later to +prioritize identified *significant* features with e.g. low technical noise. + +```{r} +#' Calculate the QC's CV and add as feature variable to the data set +rowData(res)$qc_cv <- assay(res_qc, "norm") |> + rowRsd() +``` + +Now that our data set has been preprocessed, normalized and filtered, we can +start to evaluate the distribution of the data and estimate the variation due to +biology. + +# Differential abundance analysis + +After normalization and quality control, the next step is to identify features +that are differentially abundant between the study groups. This crucial step +allows us to identify potential biomarkers or metabolites that are associated +with the study groups. There are various approaches and methods available for +the identification of such features of interest. In this workflow we use a +multiple linear regression analysis to identify features with significantly +difference in abundances between the CVD or CTR study group. + +Before performing the tests we evaluate similarities between study samples using +a PCA (after excluding the QC samples to avoid them influencing the results). + +```{r} +#| fig-cap: "Figure 34. PCA of the data after normalization and quality control." +#' Define the colors for the plot +col_sample <- col_phenotype[res$phenotype] + +#' Log transform and scale the data for PCA analysis +vals <- assay(res, "norm_imputed") |> + t() |> + log2() |> + scale(center = TRUE, scale = TRUE) +pca_res <- prcomp(vals, scale = FALSE, center = FALSE) + +vals_st <- cbind(vals, phenotype = res$phenotype) +autoplot(pca_res, data = vals_st , colour = 'phenotype', scale = 0) + + scale_color_manual(values = col_phenotype) +``` + +The samples clearly separate by study group on PCA indicating that there are +differences in the metabolite profiles between the two groups. However, what +drives the separation on PC1 is not clear. Below we evaluate whether this could +be explained by the other available variable of our study, i.e., age: + +```{r} +#| fig-cap: "Figure 35. PCA colored by age of the data after normalization and quality control." +#' Add age to the PCA plot +vals_st <- cbind(vals, age = res$age) +autoplot(pca_res, data = vals_st , colour = 'age', scale = 0) +``` + +According to the PCA above, PC1 does not seem to be related to age. + +Even if there is some variance in the data set we can't explain at this stage, +we proceed with the (supervised) statistical tests to identify features of +interest. We compute linear models for each metabolite explaining the observed +feature abundance by the available study variables. While we could also use the +base R function `lm()`, we utilize the `R Biocpkg("limma")` package to conduct +the differential abundance analysis: the *moderated* test statistics +[@smyth_linear_2004] provided by this package are specifically well suited for +experiments with a limited number of replicates. For our tests we use a linear +model `~ phenotype + age`, hence explaining the abundances of one metabolite +accounting for the study group assignment and age of each individual. The +analysis might benefit from inclusion of a study covariate associated with PC2 +or explaining the variance seen in that principal component, but for the present +analysis only the participant's age and disease association was provided. + +Below we define the design of the study with the `model.matrix()` function and +fit feature-wise linear models to the log2-transformed abundances using the +`lmFit()` function. P-values for significance of association are then calculated +using the `eBayes()` function, that also performs the empirical Bayes-based +robust estimation of the standard errors. See also the excellent vignette/user +guide of the `r Biocpkg("limma")` package for examples and details on the linear +model procedure. + +```{r} +#' Define the linear model to be applied to the data +p.cut <- 0.05 # cut-off for significance. +m.cut <- 0.5 # cut-off for log2 fold change + +age <- res$age +phenotype <- factor(res$phenotype) +design <- model.matrix(~ phenotype + age) + +#' Fit the linear model to the data, explaining metabolite +#' concentrations by phenotype and age. +fit <- lmFit(log2(assay(res, "norm_imputed")), design = design) +fit <- eBayes(fit) +``` + +After the linear models have been fitted, we can now proceed to extract the +results. We create a data frame containing the coefficients, raw and adjusted +p-values (applying a Benjamini-Hochberg correction, i.e., `method = "BH"` for +improved control of the false discovery rate), the average intensity of signals +in CVD and CTR samples, and an indication of whether a feature is deemed +significant or not. We consider all metabolites with an adjusted p-value smaller +than 0.05 to be significant, but we could also include the (absolute) difference +in abundances in the cut-off criteria. At last, we add the differential +abundance results to the result object's `rowData()`. + +```{r} +#' Compile a result data frame +tmp <- data.frame( + coef.CVD = fit$coefficients[, "phenotypeCVD"], + pvalue.CVD = fit$p.value[, "phenotypeCVD"], + adjp.CVD = p.adjust(fit$p.value[, "phenotypeCVD"], method = "BH"), + avg.CVD = rowMeans( + log2(assay(res, "norm_imputed")[, res$phenotype == "CVD"])), + avg.CTR = rowMeans( + log2(assay(res, "norm_imputed")[, res$phenotype == "CTR"])) +) +tmp$significant.CVD <- tmp$adjp.CVD < 0.05 +#' Add the results to the object's rowData +rowData(res) <- cbind(rowData(res), tmp) +``` + +We can now proceed to visualize the distribution of the raw and adjusted +p-values. + +```{r histogram} +#| fig-cap: "Figure 36. Distribution of raw (left) and adjusted p-values (right)." +#' Plot the distribution of p-values +par(mfrow = c(1, 2)) +hist(rowData(res)$pvalue.CVD, breaks = 64, xlab = "p value", + main = "Distribution of raw p-values", + cex.main = 1, cex.lab = 1, cex.axis = 1) +hist(rowData(res)$adjp.CVD, breaks = 64, xlab = expression(p[BH]~value), + main = "Distribution of adjusted p-values", + cex.main = 1, cex.lab = 1, cex.axis = 1) +``` + +The histograms above show the distribution of raw and adjusted p-values. Except +for an enrichment of small p-values, the raw p-values are (more or less) +uniformly distributed, which indicates the absence of any strong systematic +biases in the data. The adjusted p-values are more conservative and account for +multiple testing; this is important here as we fit a linear model to each +feature and therefore perform a large number of tests which leads to a high +chance of false positive findings. We do see that some features have very low +p-values, indicating that they are likely to be significantly different between +the two study groups. + +Below we plot the adjusted p-values against the log2 fold change of (average) +abundances. This volcano plot will allow us to visualize the features that are +significantly different between the two study groups. These are highlighted with +a blue color in the plot below. + +```{r volcano} +#| fig-cap: "Figure 37. Volcano plot showing the analysis results." +#' Plot volcano plot of the statistical results +par(mfrow = c(1, 1), mar = c(5, 5, 5, 1)) +plot(rowData(res)$coef.CVD, -log10(rowData(res)$adjp.CVD), + xlab = expression(log[2]~difference), + ylab = expression(-log[10]~p[BH]), pch = 16, col = "#00000060", + cex.main = 1.5, cex.lab = 1.5, cex.axis = 1.3) +grid() +abline(h = -log10(0.05), col = "#0000ffcc") +if (any(rowData(res)$significant.CVD)) { + points(rowData(res)$coef.CVD[rowData(res)$significant.CVD], + -log10(rowData(res)$adjp.CVD[rowData(res)$significant.CVD]), + col = "#0000ffcc") +} +``` + +The most interesting features are in the top corners in the volcano plot (i.e., +features with a large difference in abundance between the groups and a small +p-value). All significant features have a negative coefficient (log2 fold change +value) indicating that their abundance is lower in CVD samples compared to the +CTR samples. These features are listed, along with their average difference in +(log2) abundance between the compared groups, their adjusted p-values, their +average (log2) abundance in each sample group and their RSD (CV) in QC samples +in the table below. + +```{r result-table} +#| tbl-cap: "Table 7. Features with significant differences in abundances." +# Table of significant features +tab <- rowData(res)[rowData(res)$significant.CVD, + c("mzmed", "rtmed", "coef.CVD", "adjp.CVD", + "avg.CTR", "avg.CVD", "qc_cv")] |> + as.data.frame() +tab <- tab[order(abs(tab$coef.CVD), decreasing = TRUE), ] +kable(tab, format = "pipe") +``` + +Below we visualize the EICs of the significant features to evaluate their (raw) +signal. We restrict the MS data set to study samples. Parameters + +- `keepFeatures = TRUE`: ensures that identified features are retained in the + \`subset object. +- `peakBg`: defines the (background) color for each individual chromatographic + peak in the EIC object. + +```{r} +#| fig-cap: "Figure 38. Extracted ion chromatograms of the significant features." +#' Restrict the raw data to study samples. +lcms1_study <- lcms1[sampleData(lcms1)$phenotype != "QC", keepFeatures = TRUE] +#' Extract EICs for the significant features +eic_sign <- featureChromatograms( + lcms1_study, features = rownames(tab), expandRt = 5, filled = TRUE) + +#' Plot the EICs. +plot(eic_sign, col = col_sample, + peakBg = paste0(col_sample[chromPeaks(eic_sign)[, "sample"]], 40)) +legend("topright", col = col_phenotype, legend = names(col_phenotype), lty = 1) +``` + +The EICs for all significant features show a clear and single peak. Their +intensities are (as already observed above) much larger in the CTR than in the +CVD samples. With the exception of the second feature (second EIC in the top +row), the intensities for all significant features are however generally low. +This might make it challenging to identify them using an LC-MS/MS setup. + +# Annotation + +We have now identified the features with significant differences in abundances +between the two study groups. These provide information on metabolic pathways +that differentiate affected from healthy individuals and might hence also serve +as biomarkers. However, at this stage of the analysis we do not know what +compounds/metabolites they actually represent. We thus need now to annotate +these signals. Annotation can be performed at different level of confidence +[@sumner_proposed_2007,@schymanski_identifying_2014]. The lowest level of +annotation, with the highest rate of false positive hits, bases on the features +*m/z* ratios. Higher levels of annotations employ fragment spectra (MS2 spectra) +of ions of interest requiring this however acquisition of additional data. In +this section, we will demonstrate multiple ways to annotate our significant +features using functionality provided through Bioconductor packages. Alternative +approaches and external software tools, which may be better suited, will also be +discussed later in the section. + +## MS1-based annotation + +Our data set was acquired using an LC-MS setup and our features are thus only +characterized by their *m/z* and retention times. The retention time is +LC-setup-specific and, without prior data/knowledge provide only little +information on the features' identity. Modern MS instruments have a high +accuracy and the *m/z* values should therefore be reliable estimates of the +compound ion's mass-to-charge ratio. As first approach, we use the features' +*m/z* values and match them against reference values, i.e., exact masses of +chemical compounds that are provided by a reference database, in our case the +MassBank database. The full MassBank data is re-distributed through +Bioconductor's `r Biocpkg("AnnotationHub")` resource, which simplifies their +integration into reproducible R-based analysis workflows. + +Below we load the resource, list all available MassBank data sets/releases and +load one of them. + +```{r} +#' load reference data +ah <- AnnotationHub() +#' List available MassBank data sets +query(ah, "MassBank") +#' Load one MAssBank release +mb <- ah[["AH116166"]] +``` + +The MassBank data is provided as a self-contained SQLite database and data can +be queried and accessed through the `r Biocpkg("CompoundDb")` Bioconductor +package. Below we use the `compounds()` function to extract small compound +annotations from the database. + +```{r} +#| tbl-cap: "Table 8. MassBank database extract." +#' Extract compound annotations +cmps <- compounds(mb, columns = c("compound_id", "name", "formula", + "exactmass", "inchikey")) +head(cmps) +``` + +MassBank (as most other small compound annotation databases) provides the +(exact) molecular mass of each compound. Since almost all small compounds are +neutral in their natural state, they need to be first converted to *m/z* values +to allow matching against the feature's *m/z*. To calculate an *m/z* from a +neutral mass, we need to assume which ion (adduct) might be generated from the +measured metabolites by the employed electro-spray ionization. In positive +polarity, for human serum samples, the most common ions will be protonated +(*\[M+H\]+*), or bear the addition of sodium (*\[M+Na\]+*) or ammonium +(*\[M+H-NH3\]+*) ions. To match the observed *m/z* values against reference +values from these potential ions we use again the `matchValues()` function with +the `Mass2MzParam` approach, that allows to specify the types of expected ions +with its `adducts` parameter and the maximal allowed difference between the +compared values using the `tolerance` and `ppm` parameters. Below we first +prepare the data.frame with the significant features, set up the parameters for +the matching and perform the comparison of the query features against the +reference database. + +```{r} +#' Prepare query data frame +rowData(res)$feature_id <- rownames(rowData(res)) +res_sig <- res[rowData(res)$significant.CVD, ] + +#' Setup parameters for the matching +param <- Mass2MzParam(adducts = c("[M+H]+", "[M+Na]+", "[M+H-NH3]+"), + tolerance = 0, ppm = 5) + +#' Perform the matching. +mtch <- matchValues(res_sig, cmps, param = param, mzColname = "mzmed") +mtch +``` + +The resulting `Matched` object shows that 4 of our 6 significant features could +be matched to ions of compounds from the MassBank database. Below we extract the +full result from the `Matched` object. + +```{r} +#' Extracting the results +mtch_res <- matchedData(mtch, c("feature_id", "mzmed", "rtmed", + "adduct", "ppm_error", + "target_formula", "target_name", + "target_inchikey")) +mtch_res +``` + +Thus, in total `r sum(!is.na(mtch_res$ppm_error))` ions of compounds in MassBank +were matched to our significant features based on the specified tolerance +settings. Many compounds, while having a different structure and thus +function/chemical property, have an identical chemical formula and thus mass. +Matching exclusively on the *m/z* of features will hence result in many +potentially false positive hits and is thus considered to provide only low +confidence annotation. An additional complication is that some annotation +resources, like MassBank, being community maintained, contain a large amount of +redundant information. To reduce redundancy in the result table we below iterate +over the hits of a feature and keep only matches to unique compounds (identified +by their INCHIKEY). The INCHI or INCHIKEY combine information from compound's +chemical formula and structure, so while different compounds can share the same +chemical formula, they should have a different structure and thus INCHI. + +```{r} +#| tbl-cap: "Table 9. MS1 annotation results." +rownames(mtch_res) <- NULL + +#' Keep only info on features that machted - create a utility function for that +mtch_res <- split(mtch_res, mtch_res$feature_id) |> + lapply(function(x) { + lapply(split(x, x$target_inchikey), function(z) { + z[which.min(z$ppm_error), ] + }) + }) |> + unlist(recursive = FALSE) |> + do.call(what = rbind) + +#' Display the results +kable(mtch_res, format = "pipe") +``` + +The table above shows the results of the MS1-based annotation process. We can +see that four of our significant features were matched. The matches seem to be +pretty accurate with low ppm errors. The deduplication performed above +considerably reduced the number of hits for each feature, but the first still +matches ions of a large number of compounds (all with the same chemical +formula). + +Considering both features' *m/z* and retention times in the MS1-based annotation +will increase the annotation confidence, but requires additional data, such as +recording of the retention time of thepure standard for a compound on the same +LC setup. An alternative approach which might provide better inside on +annotations and help to choose between different annotations for a same feature +is to evaluate certain chemical properties of the possible matches. For +instance, the LogP value, available in several databases such HMDB, provides an +insight on a given compound's polarity. As this property highly affects the +interaction of the analyte with the column, it usually also directly affects +separation. Therefore, a comparison between an analyte's retention time and +polarity can help to rule out some possible misidentifications. + +While being of low confidence, MS1-based annotation can provide first candidate +annotations that could be confirmed or rejected by additional analyses. + +## MS2-based annotation + +MS1 annotation is a fast and efficient method to annotate features and therefore +give a first insight into the compounds that are significantly different between +the two study groups. However, it is not always the most accurate. MS2 data can +provide a higher level of confidence in the annotation process as it provides, +through the observed fragmentation pattern, information on the structure of the +compound. + +MS2 data can be generated through LC-MS/MS measurement in which MS2 spectra are +recorded for ions either in a data dependent acquisition (DDA) or data +independent acquisition (DIA) mode. Generally, it is advised to include some +LC-MS/MS runs of QC samples or randomly selected study samples already during +the acquisition of the MS1 data that is used for quantification of the signals. +As an alternative, or in addition, a post-hoc LC-MS/MS acquisition can be +performed to generate the MS2 data needed for annotation. For the present +experiment, a separate LC-MS/MS measurement was conducted on QC samples and +selected study samples to generate such data using an *inclusion list* of +pre-selected ions. These represent features found to be significantly different +between CVD and CTR samples in an initial analysis of the full experiment. We +use a subset of this second LC-MS/MS data set to show how such data can be used +for MS2-based annotation. In our differential abundance analysis we found +features with significantly higher abundances in CTR samples. Consequently, we +will utilize the MS2 data obtained from the CTR samples to annotate these +significant features. + +Below we load the LC-MS/MS data from the experiment and restrict it to data +acquired from the CTR sample. + +```{r, message=FALSE, warning=FALSE} +#' Load form the MetaboLights Database +param <- MetaboLightsParam(mtblsId = "MTBLS8735", + assayName = paste0("a_MTBLS8735_LC-MSMS_positive_", + "hilic_metabolite_profiling.txt"), + filePattern = ".mzML") + +lcms2 <- readMsObject(MsExperiment(), + param, + keepOntology = FALSE, + keepProtocol = FALSE, + simplify = TRUE) +``` + +```{r} +#| tbl-cap: "Table 10. Samples from the LC-MS/MS data set." +#adjust sampleData +colnames(sampleData(lcms2)) <- c("sample_name", "derived_spectra_data_file", + "metabolite_asssignment_file", + "source_name", + "organism", + "blood_sample_type", + "sample_type", "age", "unit", "phenotype") + +# filter samples to keep MSMS data from CTR samples: +sampleData(lcms2) <- sampleData(lcms2)[sampleData(lcms2)$phenotype == "CTR", ] + +sampleData(lcms2) <- sampleData(lcms2)[grepl("MSMS", sampleData(lcms2)$derived_spectra_data_file), ] + +# Add fragmentation data information (from filenames) +sampleData(lcms2)$fragmentation_mode <- c("CE20", "CE30", "CES") + +#let's look at the updated sample data +sampleData(lcms2)[, c("derived_spectra_data_file", + "phenotype", "sample_name", "age")] |> + kable(format = "pipe") + +#' Filter the data to the same RT range as the LC-MS run +lcms2 <- filterRt(lcms2, c(10, 240)) +``` + +In total we have `r length(lcms2)` LC-MS/MS data files for the control samples, +each with a different collision energy to fragment the ions. Below we show the +number of MS1 and MS2 spectra for each of the files. + +```{r} +#' check the number of spectra per ms level +spectra(lcms2) |> + msLevel() |> + split(spectraSampleIndex(lcms2)) |> + lapply(table) |> + do.call(what = cbind) +``` + +Compared to the number of MS2 spectra, far less MS1 spectra were acquired. The +configuration of the MS instrument was set to ensure that the ions specified in +the *inclusion list* were selected for fragmentation, even if their intensity +might be very low. With this setting, however, most of the recorded MS2 spectra +will represent only noise. The plot below shows the location of precursor ions +in the *m/z* - retention time plane for the three files. + +```{r echo=TRUE} +plotPrecursorIons(lcms2) +``` + +We can see MS2 spectra being recorded for the *m/z* of interest along the full +retention time range, even if the actual ions will only be eluting within +certain retention time windows. + +We next extract the `Spectra` object with the MS data from the data object and +assign a new spectra variable with the employed collision energy, which we +extract from the data object `sampleData`. + +```{r} +ms2_ctr <- spectra(lcms2) +ms2_ctr$collision_energy <- + sampleData(lcms2)$fragmentation_mode[spectraSampleIndex(lcms2)] +``` + +We next filter the MS data by first restricting to MS2 spectra and then removing +mass peaks from each spectrum with an intensity that is lower than 5% of the +highest intensity for that spectrum, assuming that these very low intensity +peaks represent background signal. + +```{r} +#' Remove low intensity peaks +low_int <- function(x, ...) { + x > max(x, na.rm = TRUE) * 0.05 +} + +ms2_ctr <- filterMsLevel(ms2_ctr, 2L) |> + filterIntensity(intensity = low_int) +``` + +We next remove also mass peaks with an *m/z* value greater or equal to the +precursor *m/z* of the ion. This puts, for the later matching against reference +spectra, more weight on the fragmentation pattern of ions and avoids hits based +only on the precursor *m/z* peak (and hence a similar mass of the compared +compounds). At last, we restrict the data to spectra with at least two fragment +peaks and scale their intensities such that their sum is 1 for each spectrum. +While similarity calculations are not affected by this scaling, it makes visual +comparison of fragment spectra easier to read. + +```{r} +#' Remove precursor peaks and restrict to spectra with a minimum +#' number of peaks +ms2_ctr <- filterPrecursorPeaks(ms2_ctr, ppm = 50, mz = ">=") +ms2_ctr <- ms2_ctr[lengths(ms2_ctr) > 1] |> + scalePeaks() +``` + +Finally, also to speed up the later comparison of the spectra against the +reference database, we load the full MS data into memory (by changing the +*backend* to `MsBackendMemory`) and *apply* all processing steps performed on +this data so far. Keeping the MS data in memory has performance benefits, but is +generally not suggested for large data sets. To evaluate the impact for the +present data set we print in addition the size of the data object before and +after changing the backend. + +```{r} +#' Size of the object before loading into memory +print(object.size(ms2_ctr), units = "MB") + +#' Load the MS data subset into memory +ms2_ctr <- setBackend(ms2_ctr, MsBackendMemory()) +ms2_ctr <- applyProcessing(ms2_ctr) + +#' Size of the object after loading into memory +print(object.size(ms2_ctr), units = "MB") +``` + +There is thus only a moderate increase in memory demand after loading the MS +data into memory (also because we filtered and cleaned the MS2 data). + +While we could proceed and match all these experimental MS2 spectra against +reference fragment spectra, for our workflow we aim to annotate the features +found significant in the differential abundance analysis. The goal is thus to +identify the MS2 spectra from the second (LC-MS/MS) run that could represent +fragments of the ions of features from the data in the first (LC-MS) run. Our +approach is to match MS2 spectra against the significant features determined +earlier based on their precursor *m/z* and retention time (given an acceptable +tolerance) to the feature's *m/z* and retention time. This can be easily done +using the `featureArea()` function that effectively considers the actual *m/z* +and retention time ranges of the features' chromatographic peaks and therefore +increase the chance of finding a correct match. This however also assumes that +retention times between the first and second run don't differ by much. +Alternatively, we would need to align the retention times of the second LC-MS/MS +data set to those of the first. + +Below we first extract the *feature area*, i.e., the *m/z* and retention time +ranges, for the significant features. + +```{r} +#' Define the m/z and retention time ranges for the significant features +target <- featureArea(lcms1)[rownames(res_sig), ] +target +``` + +We next identify the fragment spectra with their precursor *m/z* and retention +times within these ranges. We use the `filterRanges()` function that allows to +filter a `Spectra` object using multiple ranges simultaneously. We apply this +function separately to each feature (row in the above matrix) to extract the MS2 +spectra representing fragmentation information of the presumed feature's ions. + +```{r} +#' Identify for each feature MS2 spectra with their precursor m/z and +#' retention time within the feature's m/z and retention time range +ms2_ctr_fts <- apply(target[, c("rtmin", "rtmax", "mzmin", "mzmax")], + MARGIN = 1, FUN = filterRanges, object = ms2_ctr, + spectraVariables = c("rtime", "precursorMz")) +lengths(ms2_ctr_fts) +``` + +The result from this `apply()` call is a `list` of `Spectra`, each element +representing the result for one feature. With the exception of the last feature, +multiple MS2 spectra could be identified. We next combine the `list` of +`Spectra` into a single `Spectra` object using the `concatenateSpectra()` +function and add an additional spectra variable containing the respective +feature identifier. + +```{r} +l <- lengths(ms2_ctr_fts) +#' Combine the individual Spectra objects +ms2_ctr_fts <- concatenateSpectra(ms2_ctr_fts) +#' Assign the feature identifier to each MS2 spectrum +ms2_ctr_fts$feature_id <- rep(rownames(res_sig), l) +``` + +We have now a `Spectra` object with fragment spectra for the significant +features from our differential expression analysis. + +We next build our reference data which we need to process the same way as our +*query* spectra. We extract all fragment spectra from the MassBank database, +restrict to positive polarity data (since our experiment was acquired in +positive polarity) and perform the same processing to the fragment spectra from +the MassBank database. + +```{r prep spectra object} +ms2_ref <- Spectra(mb) |> + filterPolarity(1L) |> + filterIntensity(intensity = low_int) |> + filterPrecursorPeaks(ppm = 50, mz = ">=") +ms2_ref <- ms2_ref[lengths(ms2_ref) > 1] |> + scalePeaks() +``` + +Note that here we could switch to a `MsBackendMemory` backend hence loading the +full data from the reference database into memory. This could have a positive +impact on the performance of the subsequent spectra matching, however it would +also increase the memory demand of the present analysis. + +Now that both the `Spectra` object for the second run and the database spectra +have been prepared, we can proceed with the matching process. We use the +`matchSpectra()` function from the `r Biocpkg("MetaboAnnotation")` package with +the `CompareSpectraParam` to define the settings for the matching. With the +following parameters: + +- `requirePrecursor = TRUE`: Limits spectra similarity calculations to + fragment spectra with a similar precursor *m/z*. +- `tolerance` and `ppm`: Defines the acceptable difference between compared + *m/z* values. These relaxed tolerance settings ensure that we find matches + even if reference spectra were acquired on instruments with a lower + accuracy. +- `THRESHFUN`: Defines which matches to report. Here, we keep all matches + resulting in a spectra similarity score (calculated with the normalized dot + product [@stein_optimization_1994], the default similarity function) larger + than 0.6. + +```{r} +register(SerialParam()) +#' Define the settings for the spectra matching. +prm <- CompareSpectraParam(ppm = 40, tolerance = 0.05, + requirePrecursor = TRUE, + THRESHFUN = function(x) which(x >= 0.6)) + +ms2_mtch <- matchSpectra(ms2_ctr_fts, ms2_ref, param = prm) +ms2_mtch +``` + +Thus, from the total `r length(ms2_ctr_fts)` query MS2 spectra, only +`r length(whichQuery(ms2_mtch))` could be matched to (at least) one reference +fragment spectrum. + +Below we restrict the results to these matching spectra and extract all metadata +of the query and target spectra as well as the similarity score (the complete +list of available metadata information can be listed with the `colnames()` +function). + +```{r} +#' Keep only query spectra with matching reference spectra +ms2_mtch <- ms2_mtch[whichQuery(ms2_mtch)] + +#' Extract the results +ms2_mtch_res <- matchedData(ms2_mtch) +nrow(ms2_mtch_res) +``` + +Now, we have query-target pairs with a spectra similarity higher than 0.6. +Similar to the MS1-based annotation also this result table contains redundant +information: we have multiple fragment spectra per feature and also MassBank +contains several fragment spectra for each compound, measured using differing +collision energies or MS instruments, by different laboratories. Below we thus +iterate over the feature-compound pairs and select for each the one with the +highest score. As an identifier for a compound, we use its fragment spectra's +INCHI-key, since compound names in MassBank do not have accepted +consensus/controlled vocabularies. + +```{r results='asis'} +#' - split the result per feature +#' - select for each feature the best matching result for each compound +#' - combine the result again into a data frame +ms2_mtch_res <- + ms2_mtch_res |> + split(f = paste(ms2_mtch_res$feature_id, ms2_mtch_res$target_inchikey)) |> + lapply(function(z) { + z[which.max(z$score), ] + }) |> + do.call(what = rbind) |> + as.data.frame() + +#' List the best matching feature-compound pair +pandoc.table(ms2_mtch_res[, c("feature_id", "target_name", "score", + "target_inchikey")], + style = "rmarkdown", + caption = "Table 9.MS2 annotation results.", + split.table = Inf) +``` + +Thus, from the `r length(res_sig)` significant features, only one could be +annotated to a compound based on the MS2-based approach. There could be many +reasons for the failure to find matches for the other features. Although MS2 +spectra were selected for each feature, most appear to only represent noise, as +for most features, in the LC-MS/MS run, no or only very low MS1 signal was +recorded, indicating that in that selected sample the original compound might +not (or no longer) be present. Also, most reference databases contain +predominantly fragment spectra for protonated (*\[M+H\]+*) ions of compounds, +while some of the features might represent signal from other types of ions which +would result in a different fragmentation pattern. Finally, fragment spectra of +compounds of interest might also simply not be present in the used reference +database. + +Thus, combining the information from the MS1- and MS2 based annotation we can +only annotate one feature with a considerable confidence. The feature with the +*m/z* of 195.0879 and a retention time of 32 seconds seems to be an ion of +caffeine. While the result is somewhat disappointing it also clearly shows the +importance of proper experimental planning and the need to control for all +potential confounding factors. For the present experiment, no disease-specific +biomarker could be identified, but a life-style property of individuals +suffering from this disease: coffee consumption was probably contraindicated to +patients of the CVD group to reduce the risk of heart arrhythmia. + +We below plot the EIC for this feature highlighting the retention time at which +the highest scoring MS2 spectra was recorded and create a mirror plot comparing +this MS2 spectra to the reference fragment spectra for caffeine. + +```{r fig.height=5, fig.width=5} +par(mfrow = c(1, 2)) + +col_sample <- col_phenotype[sampleData(lcms1)$phenotype] +#' Extract and plot EIC for the annotated feature +eic <- featureChromatograms(lcms1, features = ms2_mtch_res$feature_id[1]) +plot(eic, col = col_sample, peakCol = col_sample[chromPeaks(eic)[, "sample"]], + peakBg = paste0(col_sample[chromPeaks(eic)[, "sample"]], 20)) +legend("topright", col = col_phenotype, legend = names(col_phenotype), lty = 1) + +#' Identify the best matching query-target spectra pair +idx <- which.max(ms2_mtch_res$score) + +#' Indicate the retention time of the MS2 spectrum in the EIC plot +abline(v = ms2_mtch_res$rtime[idx]) + +#' Get the index of the MS2 spectrum in the query object +query_idx <- which(query(ms2_mtch)$.original_query_index == + ms2_mtch_res$.original_query_index[idx]) +query_ms2 <- query(ms2_mtch)[query_idx] +#' Get the index of the MS2 spectrum in the target object +target_idx <- which(target(ms2_mtch)$spectrum_id == + ms2_mtch_res$target_spectrum_id[idx]) +target_ms2 <- target(ms2_mtch)[target_idx] + +#' Create a mirror plot comparing the two best matching spectra +plotSpectraMirror(query_ms2, target_ms2) +legend("topleft", + legend = paste0("precursor m/z: ", format(precursorMz(query_ms2), 3))) +``` + +The plot above clearly shows the higher signal for this feature in CTR compared +to CVD samples. The QC samples exhibit a lower but highly consistent signal, +suggesting the absence of strong technical noise or biases in the raw data of +this experiment. The vertical line indicates the retention time of the fragment +spectrum with the best match against a reference spectrum. It has to be noted +that, since the fragment spectra were measured in a separate LC-MS/MS +experiment, this should only be considered as an indication of the approximate +retention time in which ions were fragmented in the second experiment. The +fragment spectrum for this feature, shown in the upper panel of the right plot +above is highly similar to the reference spectrum for caffeine from MassBank +(shown in the lower panel). In addition to their matching precursor *m/z*, the +same two fragments (same *m/z* and intensity) are present in both spectra. + +We can also extract additional metadata for the matching reference spectrum, +such as the used collision energy, fragmentation mode, instrument type, +instrument as well as the ion (adduct) that was fragmented. + +```{r} +spectraData(target_ms2, c("collisionEnergy_text", "fragmentation_mode", + "instrument_type", "instrument", "adduct")) |> + as.data.frame() +``` + +## External tools or alternative annotation approaches + +The present workflow highlights how annotation could be performed within R using +packages from the Bioconductor project, but there are also other excellent +external softwares that could be used as an alternative, such as SIRIUS +[@duhrkop_sirius_2019], mummichog [@li_predicting_2013] or GNPS +[@nothias_feature-based_2020] among others. To use these, the data would need to +be exported in a format supported by these. For MS2 spectra, the data could +easily be exported in the required MGF file format using the +`r Biocpkg("MsBackendMgf")` Bioconductor package. Integration of *xcms* into +feature-based molecular networking with GNPS is described in the [GNPS +documentation](https://ccms-ucsd.github.io/GNPSDocumentation/featurebasedmolecularnetworking-with-xcms3/). + +In alternative, or in addition, evidence for the potential matching chemical +formula of a feature could be derived by evaluating the isotope pattern of its +full MS1 scan. This could provide information on the isotope composition. Also +for this, various functions such as the `isotopologues()` from the or +`r Biocpkg("MetaboCoreUtils")` package or the functionality of the *envipat* R +package [@loos_accelerated_2015] could be used. + +# Summary + +In this tutorial, we describe an end-to-end workflow for LC-MS-based untargeted +metabolomics experiments, conducted entirely within R using packages from the +Bioconductor project or base R functionality. While other excellent software +exists to perform similar analyses, the power of an R-based workflow lies in its +adaptability to individual data sets or research questions and its ability to +build reproducible workflows and documentation. + +Due to space restrictions we don't provide a comprehensive listing of +methodologies for the individual analysis steps. More advanced options or +approaches would be available, e.g., for normalization of the data, which will +however also be heavily dependent on the size and properties of the analyzed +data set, as well as for annotation of the features. + +As a result, we found in the present analysis a set of features with significant +abundance differences between the compared groups. From these we could however +only reliably annotate a single feature, that related to the lifestyle of +individuals rather then to the pathological properties of the investigated +disease. This low proportion of annotated signals is however not uncommon for +untargeted metabolomics experiments and reflects the need for more comprehensive +and reliable reference annotation libraries. + +# Session information + +```{r} +sessionInfo() +``` + +# References + +# Appendix + +## Aknowledgment + +Thanks to Steffen Neumann for his continuous work to develop and maintain the +xcms software. ... + +## Alignment using manually selected anchor peaks + +- align the data set using internal standards. +- suggested to eventually *enrich* the anchor peaks with signal from other + ions in retention time regions not covered by the internal standards. + +# Additional informations + +```{r eval=FALSE, include=FALSE} +plot(eics_is_noprocess) + +#' Notes on the EICs: +#' - Alanine 13C 15N: signal very low, maybe other ion? +#' - Arginine: very nice signal, RT a bit off. +#' - Aspartic acid: very nice signal. +#' - Carnitine: nothing there, maybe other ion? +#' - Creatinine: nothing there, maybe other ion? +#' - Cystine: very nice signal. +#' - Glucose: nice signal. +#' - Glutamic acid: very nice signal. +#' - Glycine: signal low but ~ OK. +#' - cystine: very nice signal, RT a bit off. +#' - Isoleucine: multiple peaks, most likely both leucine and isoleucine. +#' - Leucine: same as Isoleucine. +#' - Methionine: nice signal, but some other peaks close by. +#' - Phenylalanine: nice signal, but some other peaks close by. +#' - Proline: signal low but ~ OK (maybe other ion?) +#' - Serine: nice signal. +#' - Threonine: nice signal. +#' - Tyrosine: signal low but ~ OK. +#' - Valine: signal OK. Other peaks close by. +#' + +plot(eics_is_chrompeaks) # show chrompeak detection +plot(eics_is_refined) # show refinement effect +plot(eic_is) # show alignment effect +``` + +```{r} +#possible extra info: +# - +``` diff --git a/vignettes/images/sitcker_draft.xcf b/vignettes/images/sitcker_draft.xcf new file mode 100644 index 0000000..1394b73 Binary files /dev/null and b/vignettes/images/sitcker_draft.xcf differ diff --git a/vignettes/install_v0.qmd b/vignettes/install_v0.qmd deleted file mode 100644 index 1066125..0000000 --- a/vignettes/install_v0.qmd +++ /dev/null @@ -1,60 +0,0 @@ ---- -title: "Install" -format: html -editor: visual -minimal: true -date: 'Compiled: `r format(Sys.Date(), "%B %d, %Y")`' -vignette: > - %\VignetteIndexEntry{Install} - %\VignetteEngine{quarto::html} - %\VignetteEncoding{UTF-8} ---- - -For manual installation, an [R](https://www.r-project.org/) version \>= 4.4.0 is -required. - -# Install packages required for metabonaut tutorials - -Notes: only a development version of metabonaut is available as of now but a -stable version will come soon. - -The following packages need to be installed for proper functioning of the -tutorials: - -Run the code as follow: - -```{r eval=FALSE} -install.packages("BiocManager") -BiocManager::install(c('RCurl', 'xcms', 'MsExperiment', 'SummarizedExperiment', - 'Spectra', 'MetaboCoreUtils', 'limma', 'matrixStats', 'pander', - 'RColorBrewer', 'pheatmap', 'vioplot', 'ggfortify', 'gridExtra', - 'AnnotationHub', 'CompoundDb', 'MetaboAnnotation', - 'RforMassSpectrometry/MsIO', 'RforMassSpectrometry/MsBackendMetaboLights')) -``` - -# Docker image - -The vignettes files along with an R runtime environment including all required -packages and the RStudio (Posit) editor are all bundled in a *docker* container. - -After installation, this docker container can be run on the computer and the -code and examples from the vignettes can be evaluated within this environment -(without the need to install any additional packages or files). - -- If you don't already have, install [docker](https://www.docker.com/). Find - installation information [here](https://docs.docker.com/desktop/). -- Get the [docker - image](https://hub.docker.com/r/rformassspectrometry/metabonaut) of this - tutorial e.g. from the command line with - `docker pull rformassspectrometry/metabonaut:latest`. -- Start the docker container, either through the Docker Desktop, or on the - command line with - -``` -docker run -e PASSWORD=bioc -p 8787:8787 rformassspectrometry metabonaut:latest -``` - -- Enter [`http://localhost:8787`](http://localhost:8787) in a web browser and - log in with username `rstudio` and password `bioc`. -- In the RStudio server version: open any of the R-markdown (*.Rmd*) files in - the *vignettes* folder and evaluate the R code blocks in that document.