Chromatographic peak detection
The initial preprocessing step involves detecting intensity peaks along the retention time axis, the so-called chromatographic peaks. To achieve this, we employ the findChromPeaks() function within xcms. This function supports various algorithms for peak detection, which can be selected and configured with their respective parameter objects.
The preferred algorithm in this case, CentWave, utilizes continuous wavelet transformation (CWT)-based peak detection (Tautenhahn, Böttcher, and Neumann 2008). This method is known for its effectiveness in handling non-Gaussian shaped chromatographic peaks or peaks with varying retention time widths, which are commonly encountered in HILIC separations.
Below we apply the CentWave algorithm with its default settings on the extracted ion chromatogram for cystine and methionine ions and evaluate its results.
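A minimal sketch of such a call is shown here; the object name chr_ions is a placeholder for the ion chromatograms of cystine and methionine extracted earlier with chromatogram():

library(xcms)
#' Detect peaks with default CentWave settings in the extracted
#' chromatograms (chr_ions is an assumed object from a previous step)
chr_pks <- findChromPeaks(chr_ions, param = CentWaveParam())
#' List the detected chromatographic peaks
chromPeaks(chr_pks)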
These QC samples, all representing the same pooled sample, were measured at regular intervals during the measurement run of the experiment, all on the same day. Still, small shifts can be observed, especially in the region between 100 and 150 seconds. To facilitate proper correspondence of signals across samples (and hence definition of the LC-MS features), it is essential to minimize these differences in retention times.
Conceptually, we proceed in two steps: first, we select only the QC samples of our dataset and perform an initial alignment on these, using so-called anchor peaks. Because the same (pool) sample was measured at regular intervals, we can assume a linear shift in time. Despite having external QCs in our data set, we still use this subset-based alignment, assuming that retention time shifts are independent of the sample matrix (human serum or plasma) and instead mostly instrument-dependent. Note that it would also be possible to manually specify anchor peaks (respectively, their retention times) or to align a data set against an external reference data set. More information is provided in the vignettes of the xcms package.
After calculating how much to adjust the retention time in these samples, we apply this shift also on the study samples.
In xcms, retention time alignment can be performed using the adjustRtime() function with an alignment algorithm. For this example we use the PeakGroups method (Smith et al. 2006), which performs the alignment by minimizing differences in retention times of a set of anchor peaks across samples. This method requires an initial correspondence analysis to match/group chromatographic peaks across samples, from which the algorithm then selects the anchor peaks for the alignment.
For the initial correspondence, we use the PeakDensity approach (Smith et al. 2006), which groups chromatographic peaks with similar m/z and retention time into LC-MS features. The parameters for this algorithm, which can be configured using the PeakDensityParam object, are sampleGroups, minFraction, binSize, ppm and bw.
binSize, ppm and bw specify how similar the chromatographic peaks' m/z and retention time values need to be to consider them for grouping into a feature. binSize and ppm define the required similarity of m/z values. Within each m/z bin (defined by binSize and ppm), areas along the retention time axis with a high chromatographic peak density (considering all peaks in all samples) are identified, and chromatographic peaks within these regions are considered for grouping into a feature.
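A sketch of how these two steps could be combined, assuming data is the xcms result object after peak detection and that sampleData(data)$sample_type labels the QC samples; all parameter values are illustrative, not tuned recommendations:

#' Initial correspondence: group peaks with similar m/z and retention time
pdp <- PeakDensityParam(sampleGroups = sampleData(data)$sample_type,
                        minFraction = 0.9, binSize = 0.01, ppm = 10, bw = 2)
data <- groupChromPeaks(data, param = pdp)
#' Subset-based alignment using anchor peaks defined on the QC samples only
pgp <- PeakGroupsParam(minFraction = 0.9,
                       subset = which(sampleData(data)$sample_type == "QC"),
                       subsetAdjust = "average")
data <- adjustRtime(data, param = pgp)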
The non-endogenous cystine ion was already well aligned, so the difference is minimal. The methionine ion, however, shows a clear improvement in alignment.
In addition to a visual inspection of the results, we also evaluate the impact of the alignment by comparing the variance in retention times for internal standards before and after alignment. To this end, we first need to identify chromatographic peaks in each sample with an m/z and retention time close to the expected values for each internal standard. For this we use the matchValues() function from the MetaboAnnotation package (Rainer et al. 2022) with the MzRtParam method to identify all chromatographic peaks with similar m/z (+/- 50 ppm) and retention time (+/- 10 seconds) to the internal standard's values. With parameters mzColname and rtColname we specify the column names in the query (our IS) and target (chromatographic peaks) that contain the m/z and retention time values on which to match entities. We perform this matching separately for each sample. For each internal standard in every sample, we use the filterMatches() function and the SingleMatchParam() parameter object to select the chromatographic peak with the highest intensity.
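For a single sample, such a matching could look like the sketch below; is_info (a data.frame with the internal standards' expected "mz" and "RT" values), pks (that sample's chromatographic peaks as a data.frame) and the intensity column name target_maxo are assumptions:

library(MetaboAnnotation)
#' Match internal standards against one sample's chromatographic peaks
mtch <- matchValues(is_info, pks,
                    param = MzRtParam(ppm = 50, toleranceRt = 10),
                    mzColname = c("mz", "mz"), rtColname = c("RT", "rt"))
#' Keep for each internal standard only the most intense matching peak
mtch <- filterMatches(mtch, param = SingleMatchParam(
    duplicates = "top_ranked", column = "target_maxo", decreasing = TRUE))
matchedData(mtch)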
#' Extract the matrix with all chromatographic peaks and add a column
Object of class "XProcessHistory"
type: Peak detection
date: Tue Oct 22 16:33:02 2024
info:
fileIndex: 1,2,3,4,5,6,7,8,9,10
Parameter class: CentWaveParam
The processParam() function could then be used to extract the actual parameter class used to configure this processing step.
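For example (a short sketch; data is assumed to be the xcms result object shown above):

#' Extract the parameter object of the first processing step, here the
#' CentWaveParam used for chromatographic peak detection
processParam(processHistory(data)[[1]])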
The final result of the whole LC-MS data preprocessing is a two-dimensional matrix with abundances of the so-called LC-MS features in all samples. Note that at this stage of the analysis features are only characterized by their m/z and retention time; we do not yet have any information on which metabolite a feature could represent.
As we have seen before, such a feature matrix can be extracted with the featureValues() function and the corresponding feature characteristics (i.e., their m/z and retention time values) using the featureDefinitions() function. Thus, these two arrays could be extracted from the xcms result object and used/imported in other analysis packages for further processing. They could for example also be exported to tab-delimited text files and used in an external tool, or, if MS2 spectra were also available, for feature-based molecular networking in the GNPS analysis environment (Nothias et al. 2020).
For further processing in R, if no reference or link to the raw MS data is required, it is suggested to extract the xcms preprocessing result using the quantify() function as a SummarizedExperiment object, Bioconductor's default container for data from biological assays/experiments. This simplifies integration with other Bioconductor analysis packages. The quantify() function takes the same parameters as the featureValues() function; thus, with the call below we extract a SummarizedExperiment with only the detected, but not gap-filled, feature abundances:
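A sketch of such a call (data is assumed to be the xcms preprocessing result; the additional arguments are passed through to featureValues()):

#' Extract a SummarizedExperiment with detected, not gap-filled, abundances
res <- quantify(data, method = "sum", filled = FALSE)
res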
Data normalization
After preprocessing, data normalization or scaling might need to be applied to remove technical variances from the data. While simple approaches like median scaling can be implemented with a few lines of R code, more advanced normalization algorithms are available in packages such as Bioconductor's preprocessCore. The comprehensive workflow "Notame" also proposes a very interesting normalization approach, adaptable and scalable to the user's dataset (Klåvus et al. 2020).
Generally, for LC-MS data, bias can be categorized into three main groups (Broadhurst et al. 2018):
Variances introduced by sample collection and initial processing, which can include differences in sample amounts. This type of bias is expected to be sample-specific and to affect all signals in a sample in the same way. Methods like median scaling, LOESS or quantile normalization can adjust this bias.
Signal drifts along the measurement of samples from an experiment. Reasons for such drifts can be related to aging of the instrumentation used (columns, detector), but also to changes in metabolite abundances and characteristics due to reactions and modifications, such as oxidation. These changes are expected to affect the samples measured later in a run more than those measured at the beginning. For this reason, this bias can play a major role in large experiments measured over a long time range and is usually considered to affect individual metabolites (or metabolite groups) differently. For adjustment, moving average or linear regression-based approaches can be used; the latter can for example be performed using the adjust_lm() function from the MetaboCoreUtils package (a conceptual sketch follows below).
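A conceptual base-R sketch of such an injection-order drift correction for a single feature; this illustrates the idea only and is not the adjust_lm() API, with y and idx assumed vectors of feature abundances and injection indices:

#' Fit a linear drift of the log2 abundances along the injection index,
#' remove the trend and re-center at the overall mean abundance
drift_correct <- function(y, idx) {
    fit <- lm(log2(y) ~ idx, na.action = na.exclude)
    2^(log2(y) - predict(fit) + mean(log2(y), na.rm = TRUE))
}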
The upper part of the plot shows that the gap filling step rescued a substantial number of NAs, giving a more consistent number of feature values per sample. This consistency aligns with our assumption that every sample should have a similar number of detected features. Additionally, we observe that, on average, the signal distribution of the individual samples is very similar.
Relative log abundance (RLA) plots (De Livera et al. 2012) offer an alternative way to evaluate differences in abundances between samples. An RLA value is the abundance of a feature in a sample relative to the median abundance of the same feature across multiple samples. We can discriminate between within-group and across-group RLAs, depending on whether the abundance is compared against samples within the same sample group or across all samples. Within-group RLA plots assess the tightness of replicates within groups and should have a median close to zero and low variation around it. When used across groups, they allow comparing the behavior between groups. Generally, between-sample differences can be easily spotted using RLA plots. Below we calculate and visualize within-group RLA values using the rowRla() function from the MsCoreUtils package, defining the sample groups with parameter f.
par(mfrow = c(1, 1), mar = c(3.5, 4.5, 2.5, 1))
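#' Sketch of how this chunk might continue (assumed: vals is the feature
#' abundance matrix, grp the sample group of each column/sample)
rla <- MsCoreUtils::rowRla(vals, f = grp)
#' One box per sample; within-group medians should be close to zero
boxplot(rla, outline = FALSE, las = 2, ylab = "RLA")
grid(nx = NA, ny = NULL)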
These internal standards play a crucial role in guiding the normalization process. Because the samples were artificially spiked, we possess a known ground truth: the abundance or intensity of the internal standards should be consistent across samples. Any difference is expected to be due to technical differences/variance. Consequently, normalization aims to minimize the variation of the internal standards between samples, reinforcing the reliability of our analyses.
Between sample normalisation
The previous RLA plot showed that the data had some biases that need to be corrected. Therefore, we will implement between-sample normalization using filled-in features. This process effectively mitigates variations influenced by technical issues, such as differences in sample preparation, processing and injection methods. In this instance, we will employ a commonly used technique known as median scaling (De Livera et al. 2012).
This method involves computing the median for each sample, followed by determining the median of these individual sample medians. Each sample is then scaled by the ratio of its own median to this overall median, ensuring consistent median values for each sample throughout the entire data set. The approach relies on the assumption that the average total metabolite abundance is uniform across all samples.
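A minimal sketch of median scaling, assuming vals is the feature abundance matrix with features in rows and samples in columns:

#' Per-sample medians; scale each sample so all sample medians become
#' equal to the median of the sample medians
meds <- apply(vals, 2, median, na.rm = TRUE)
vals_norm <- sweep(vals, 2, meds / median(meds), FUN = "/")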
#' keep unfiltered object
res_unfilt <- res
Here we will eliminate features that exhibit high variability in our dataset. Repeatedly measured QC samples typically serve as a robust basis for cleaning datasets, allowing identification of features with excessively high noise. As external QC samples were used in our data set, i.e. pooled samples from a different collection and with a slightly different sample matrix, their utility for filtering is somewhat limited. For a comprehensive description of and guidelines for data filtering in untargeted metabolomic studies, please refer to Broadhurst et al. (2018).
We first restrict the data set to features for which a chromatographic peak was detected in at least 2/3 of the samples of at least one of the study sample groups. This ensures that the statistical tests carried out later on the study samples are performed on reliable signal. Also, with this filter we remove features that were mostly detected in QC samples, but not in the study samples. Such a filter can be performed with the filterFeatures() function from the xcms package with the PercentMissingFilter setting; a sketch of such a call is shown below. The parameters of this filter:
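A sketch of the call referenced above; res is assumed to be the SummarizedExperiment and res$phenotype an assumed column with the study group labels (a threshold of 100/3 percent missing values corresponds to detection in at least 2/3 of the samples of a group):

#' Keep features detected in at least 2/3 of the samples of one group
res <- filterFeatures(res, PercentMissingFilter(threshold = 100 / 3,
                                                f = res$phenotype))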
Following the guidelines stated above, we decided to still use the QC samples for pre-filtering, on the basis that they represent bio-fluids similar to our study samples and, thus, we anticipate relatively similar metabolites affected by similar measurement biases.
The Dratio filter is a powerful tool to identify features that exhibit high variability in the data, relating the variance observed in QC samples to that in study samples. By setting a threshold of 0.4, we remove features that have a high degree of variability between the QC and study samples. In this example, any feature for which the deviation in the QC samples is higher than 40% (threshold = 0.4) of the deviation in the study samples is removed. This filtering step ensures that only features with considerably lower technical than biological variance are retained.
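A sketch of such a call, assuming res$sample_type distinguishes QC from study samples:

#' Remove features whose QC variability exceeds 40% of the variability
#' observed in the study samples (D-ratio > 0.4)
res <- filterFeatures(res, DratioFilter(threshold = 0.4,
    qcIndex = res$sample_type == "QC",
    studyIndex = res$sample_type != "QC"))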