-
Notifications
You must be signed in to change notification settings - Fork 1
Tutorials
2023-10-24
This guide explains how to perform some common microbial sequencing count data analysis tasks using tidytacos. We will illustrate those using a dataset with human microbiome samples from the upper respiratory tract (URT), taken from this paper by De Boeck et al. It contains nose as well as nasopharynx samples. Most samples were taken using a swab method, but a minority was taking with the aspirate method.
A tidytacos object is simply a list of three tables:
- counts: These are the counts of reads for each taxon (OTU/ASV/phylotype) in each sample. Each row represents such a read count.
- samples: This table contains the sample metadata. Each row represents a sample.
- taxa: This table contains the taxonomy and other metadata for the taxa. Each row represents a taxon.
The package is called tidytacos because each of the tables is tidy: every row represents an observation and every column a variable (more on data tidying can be found at https://r4ds.hadley.nz/data-tidy.html#sec-tidy-data). For this quick start guide, we assume you are familiar with the tidyverse (especially dplyr and ggplot).
The main differences with the popular phyloseq package are:
- The abundance table of tidytacos is in long format, while the otu_table object of phyloseq is in wide (matrix) format. A tidy read count table is more compact for very sparse data (such as microbiome data), and also easier to handle and visualize within the tidyverse framework.
- All tables in tidytacos are data frames (technically tibbles). In phyloseq, each table has its own special data type (i.e. the classes “otu_table”, “tax_table” and “sample_data”). This makes it sometimes difficult to access the data directly and do simple stuff with it. Also, the otu_table and tax_table classes are based on the matrix type, while the sample_data class is based on the dataframe type. This also makes data manipulation sometimes un-straightforward.
- In a phyloseq otu table, sometimes the rows are the taxa and sometimes the columns. This can easily lead to errors.
In case you haven’t installed tidytacos yet, it can be installed using devtools:
install.packages("devtools")
devtools::install_github("LebeerLab/tidytacos")
For this guide, we only need to load two packages: tidytacos (of course) and the tidyverse set of packages.
library(tidyverse)
library(tidytacos)
The first step will usually be the conversion our data into a usable
tidytacos format. If, for example, you used DADA2 as explained
here,
you can hand off the data to tidytacos using from_dada
. You can also
convert a phyloseq object to a tidytacos object using the as_tidytacos
function. More options to import and convert your data can be found
here.
A tidytacos object is read and stored as three sparse tables (counts-, taxa- and samples.csv). To read in existing data from a folder called 'my_data' you would run:
taco <- read_tidytacos("/path/to/my_data")
To write your data to a folder called "my_data_filtered" you can run:
taco %>% write_tidytacos("/path/to/my_data_filtered")
However, our example dataset is available in the tidytacos package and doesn’t need to be imported or converted. It is called “urt” and we start by inspecting the samples table:
glimpse(urt$samples)
## Rows: 217
## Columns: 9
## $ run <chr> "20161207_ilke_urt100", "20161207_ilke_urt100", "20161207_…
## $ condition <chr> "CON", "CON", "CON", "CON", "CON", "CON", "CON", "CON", "C…
## $ participant <chr> "CON100", "CON100", "CON10", "CON10", "CON11", "CON11", "C…
## $ location <chr> "NF", "N", "NF", "NF", "NF", "N", "NF", "NF", "N", "NF", "…
## $ method <chr> "S", "S", "A", "S", "S", "S", "S", "S", "S", "S", "S", "S"…
## $ plate <chr> "3", "3", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"…
## $ passes_qc <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
## $ sample_id <chr> "s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9", "s10…
## $ sample <chr> "CON100-NF-S", "CON100-N-S", "CON10-NF-A", "CON10-NF-S", "…
We then have a quick look at the total number of samples, ASVs, and reads in the tidytacos object:
tacosum(urt)
## n_samples n_taxa n_reads
## 217 1957 3873478
We can very easily create a plot to explore a subset of our samples (e.g., only nose samples taken with the swab method) in the following way:
urt %>%
filter_samples(location == "N", method == "S") %>%
tacoplot_stack()
The filter_samples
function does what it says: filtering samples. It
will also delete taxa from the taxa table that have zero total reads in
the remaining samples. The tacoplot_stack
function returns a nice
stacked bar plot visualization of the most abundant taxa in our samples.
Our next question for this dataset is to what extent nose and nasopharynx are linked from a microbiological point of view. To get an idea we can first visualize this:
urt_s <- urt %>% filter_samples(method == "S")
tacoplot_stack(urt_s)+
geom_point(aes(y=-0.02,color=location))
tacoplot_stack(urt_s, x = location) +
facet_wrap(~ participant, nrow = 10)
First, we make sure to include only swab samples. We can then visualize the data in stacked bar plots. The tacoplot functions consist of ggplot function calls and therefore geoms and other ggplot features can be added.
To explore alpha diversity, let's create a rarefied version of the dataset:
urt_rar <- urt %>%
add_total_count() %>%
filter_samples(total_count >= 2000) %>%
rarefy(2000) %>%
add_alpha()
We can visualize alpha diversities of nose versus the nasopoharynx:
urt_rar %>%
samples() %>%
ggplot(aes(x = location, y = observed, fill = location)) +
geom_boxplot(outlier.shape = NA)+
geom_jitter(height = NULL)
The add_total_count
function will add total sample read numbers to the
sample table.
The rarefy
function will randomly subsample all samples n times. It
only works if the read count of each sample equals or exceeds n. For
determining ASV richness, we chose to rarefy first, but this may depend
on your data.
The add_alpha
function can be used to add several metrics of alpha
diversity to the sample table.
We would like to address the differences between nose and nasopharynx. We’re also more interested in genera than ASVs. A PCA might offer insight:
urt_genus <- urt %>%
filter_samples(method == "S") %>%
aggregate_taxa(rank = "genus")
tacoplot_ord_ly(urt_genus,location, samplenames = sample, dim = 3)
The aggregate_taxa
function merges all rows of the taxa table together
at a specified taxonomic level, in this case the genus level. As for all
tidytacos functions, all other tables in the tidytacos object are
adjusted accordingly.
The tacoplot_ord_ly
function will determine relative abundances of
taxa in the samples, and will then use Bray-Curtis dissimilarities to
ordinate samples in a 2-dimensional space based on their taxonomic
composition. The plotly “_ly” addition makes the plot interactive,
which is really nice for exploratory work. This also works for other plot
functions.
A logical next question is to what extent the niche (nose versus nasopharynx) determines the community composition variability. Let’s not forget that everyone has their unique microbiome and include the variable "participant" in the model.
perform_adonis(urt_genus, c("participant", "location"))
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
##
## adonis2(formula = as.formula(paste("counts_matrix", formula_RHS, sep = " ~ ")), data = metadata, permutations = permutations)
## Df SumOfSqs R2 F Pr(>F)
## participant 97 31.590 0.65236 1.916 0.001 ***
## location 1 3.066 0.06331 18.037 0.001 ***
## Residual 81 13.768 0.28432
## Total 179 48.424 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The perform_adonis
function will perform a PERMutational ANOVA to
determine the effect of sample variables on the Bray-Curtis
dissimilarities of the communities. The result shows that participant is
a really important contributor to community composition (R squared =
0.65). Furthermore, there are consistent significant differences between
the communities of nose and nasopharynx (R squared = 0.06).
Next, we would like to know which of the 20 most abundant genera, are significantly more abundant in the nasopharynx compared to the nose.
urt_genus <- urt_genus %>% add_codifab(location, max_taxa = 40)
urt_genus$taxon_pairs <- filter(urt_genus$taxon_pairs, wilcox_p < 0.05)
tacoplot_codifab(urt_genus, NF_vs_N)
The add_codifab
function will add a table called taxon_pairs to the
tidytacos object, with for each pair of a taxon and a reference taxon,
the differential abundance of the taxon between the two conditions (with
respect to the reference taxon).
The codifab_plot
function returns a plot to visualize differential
abundance of taxa between conditions, compared to all other taxa as
references. We can observe that Haemophilus is most likely to be
typical for the Nasopharynx, whereas Anaerococcus is most likely to be
typical for the nose.
Of note, there are many differential abundance analysis methods out there and none of them are perfect. Interpret your results with care.
2023-09-28
This vignette explains in more detail how to modify the stacked bar plots of tidytacos. We will illustrate those using a dataset with human microbiome samples from the upper respiratory tract (URT), taken from this paper by De Boeck et al. It contains nose as well as nasopharynx samples. Most samples were taken using a swab method, but a minority was taking with the aspirate method.
We need only two packages: tidytacos (of course) and the tidyverse set of packages.
library(tidyverse)
library(tidytacos)
We start by selecting the samples we are interesting in: only the ones from the nose, taken with the swab method.
urt_ns <- urt%>%
filter_samples(location == "N", method == "S")
We can very easily create explorative plots of our samples in the following ways:
tacoplot_stack(urt_ns)
tacoplot_stack_ly(urt_ns)
The tacoplot_stack
function returns a nice bar plot visualization of
the most abundant taxa in our samples. The tacoplot_stack_ly
function
returns an interactive version of the same bar plot. The order of the
samples on the x-axis is determined by hierarchical clustering of the
visualized sample composition. In addition, these functions do some
things behind the screens:
- Add relative abundances to the counts table.
- For each taxon, calculate the mean relative abundance across all samples where this taxon occurs.
- Give all taxa a human-understandable name so that the taxon name is unique. This is just the genus name of the taxon, followed by a number to make it unique. Taxa with a higher mean relative abundance get a smaller number. E.g. “Lactobacillus 1” is the Lactobacillus taxon with the largest mean relative abundance.
- Make a new variable that is equal to the taxon name, except that only the top-11 taxa retain their name and all others are changed to “Other taxa”. This is for visualization purposes; the human eye can only discriminate clearly between about 12 colors.
All these new variables are created under the hood, but are gone when the function execution is finished (there are no so-called “side-effects”). Luckily, for each of these variables there also exists a function that will create it and keep it! These functions are the following (there names are intended to be self-explanatory as much as possible):
-
add_rel_abundance
: adds to counts table -
add_mean_rel_abundance
: adds to taxa table -
add_taxon_name
: adds to taxa table -
add_taxon_name_color
: adds to taxa table
There are more of such functions that add an extra variable to one of the tables; we will explore them further in other parts of this vignette.
There are two interesting types of facet wrapped bar plots we can make: facets with fixed x-axis categories, and facets without fixed x-axis categories.
Facets with fixed x-axis categories
The first type is a plot where the variable on the x-axis is not the
sample name (the not-very-informative default). We want the categories
of that variable in the same order in all our subplots. For example, we
have a subplot for every participant, and for each participant we want
to see the nose sample and the nasopharynx sample, in that order. This
is achieved by adding a x = location
extra argument to the barplot
function, and adding a facet_wrap()
layer. Putting the x-axis
categories in the same order in all subplots is the default behaviour of
the facet_wrap()
function. Note that the order of the samples in each
subplot is now determined by the categories of the variable we put there
(e.g. alphabetically sorted), and no longer by a sample clustering
procedure!
urt_s <- urt%>%
filter_samples(method == "S")
tacoplot_stack(urt_s, x = location) +
facet_wrap(~ participant, nrow = 10)
Facets without fixed x-axis categories
The second type of facet wrapped plot is one where we have the sample names on the x-axis, as usual, and they are sorted according to a clustering procedure. For example, we want one facet per sampling location, with only the samples belonging to that location. This can be achieved in the following way:
tacoplot_stack(urt_s) +
facet_wrap(~ location, nrow = 2)
This is not quite right. The default behavior of facet_wrap()
is to
repeat all possible x-axis values in all facets, even if there is no
information there! In our case, the default behaviour would be to put
all samples names in the nose facet and nasopharynx facet, and plot
empty space if a sample name - sampling location combination doesn’t
exist. Adding the argument scales = "free_x"
corrects this behavior
and also makes sure that the samples are plotted in order of clustering:
tacoplot_stack(urt_s) +
facet_wrap(~ location, scales = "free_x", nrow = 2)
To be able to do this, we need to do a number of things by hand that
normally happen “under the hood” in the tacoplot_stack()
function:
- Step 1: We add relative abundances. We want them to be calculated with respect to the full samples, before the taxa we’re not interested in are removed!
- Step 2: We select only the taxa we want (e.g. family Lactobacillaceae).
- Step 3: We add the variable “taxon_name_color”; this variable is equal to “taxon_name”, except that everything apart from the n most abundant taxa (which are at this moment all Lactobacillaceae!) will be called “Other taxa”.
- Step 4: We make the plot.
Thanks to the %>%
(pipe) operator from the magrittr package, we can
achieve all this using the following elegant code:
urt %>%
add_rel_abundance() %>%
filter_taxa(family == "Lactobacillaceae") %>%
add_taxon_name_color() %>%
tacoplot_stack()+
geom_point(aes(y=-0.01,color=location,shape=method))