Data Visualization basic #18

Open
wants to merge 6 commits into
base: main
301 changes: 301 additions & 0 deletions inst/tutorials/data_visualization_basic/data_visualization_basic.Rmd
@@ -0,0 +1,301 @@
---
title: "Introduction to data visualization"
author: "Julia Anstett and Dr. Stephan Koenig (adapted from Dr. Kim Dill-McFarland)"
date: "version `r format(Sys.time(), '%B %d, %Y')`"
output:
  learnr::tutorial:
    progressive: true
    allow_skip: true
runtime: shiny_prerendered
description: Basic data visualization with ggplot2 focusing on dot plots.
---

```{r setup, include = FALSE}
# General learnr setup
library(learnr)
knitr::opts_chunk$set(echo = TRUE)
library(educer)
# Helper function to set path to images to "/images" etc.
setup_resources()

library(dplyr)
library(readr)

raw_dat <- geochemicals

dat <- mutate(raw_dat, Depth = as.factor(Depth),
Cruise = as.factor(Cruise))

dat <- filter(dat, Depth %in% c(10, 100, 200),
# The cruises that took place in February can be determined
# using the Date variable and functions that we won't discuss in
# these data science modules
Cruise %in% c(18, 30, 42, 54, 66, 80, 92))
```

## Introduction
This tutorial provides guided, hands-on instruction to create and modify plots using the `ggplot2` package in R. It also includes plots of an ANOVA analysis.

After this tutorial, you will be able to:

* Create dot and box plots in `ggplot2`
* Modify attributes of ggplots
* Complete and interpret the output of ANOVAs in R

ANOVA isn't actually covered in this tutorial


## Setup
Prior to starting this tutorial, please complete the *Pre-module download assignment* to obtain all the necessary software and data.

Is this tutorial a part of a larger curriculum? Where do students go to access the pre-module download assignment?


Create a new script named "ANOVA". As before, you begin by loading your packages with `library()`.
```{r message=FALSE, warning=FALSE}
library(tidyverse)
```

## Explore the metadata
In addition to measurements of microbial communities, you also have geochemical data for the Saanich Inlet samples. For a brief introduction to these data, see Hallam SJ et al. 2017. Monitoring microbial responses to ocean deoxygenation in a model oxygen minimum zone. Sci Data 4: 170158 [doi:10.1038/sdata.2017.158](https://www.nature.com/articles/sdata2017158).

A subset of these data has been provided in `4.MICB301_stats_extend_data.csv` on Canvas including:

Also list "Depth" here since you talk about it in the next section


- Depth: depth in meters

Might want to explain why you ended up converting this to a factor

- In micromolar (uM) or nanomolar (nM)
    - NO3_uM: nitrate
    - NO2_uM: nitrite
    - N2O_uM: nitrous oxide
    - NH4_uM: ammonium
    - H2S_uM: hydrogen sulfide
    - CTD_O2: oxygen
    - CH4_uM: methane

Read `4.MICB301_stats_extend_data.csv` into R using `read_csv` and save as `raw_dat`.
```{r}
raw_dat <- geochemicals

I think it's confusing to tell students to save the csv file as raw_dat, but the code saves the object geochemicals to raw_dat instead

```

## Data cleaning

A data structure we haven't used in R yet is the factor, which represents a categorical variable. In general, a categorical variable can take on a fixed number of possible values. For example, `Cruise` is currently numeric, but it actually represents a category (i.e. each cruise just indicates a group of measurements). To convert a vector to a factor, we use the `as.factor()` function.
```{r}
dat <- mutate(raw_dat, Depth = as.factor(Depth),
Cruise = as.factor(Cruise))
```
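To see what the conversion does, here is a minimal sketch on a made-up vector. The values are illustrative only and are not taken from the Saanich Inlet data.

```r
# Hypothetical cruise numbers, stored as plain numbers
cruise <- c(18, 30, 18, 42, 30)

# as.factor() keeps the values but treats them as category labels
cruise_f <- as.factor(cruise)

class(cruise)      # "numeric"
class(cruise_f)    # "factor"
levels(cruise_f)   # the distinct categories: "18" "30" "42"
nlevels(cruise_f)  # 3
```

Once a variable is a factor, `ggplot2` treats it as discrete, which is why each depth later gets its own evenly spaced position on the x-axis.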

```{r factors, echo=FALSE}
question("Which of these variables could potentially be represented as categorical vectors? (select ALL that apply)",
answer("Colour of flower petals of 9 different plants", correct = TRUE),
answer("Oxygen concentration in uM"),
answer("Ammonium concentration in nM"),
answer("Number of days (3-10) it takes a plant to sprout", correct = TRUE),
incorrect = "Incorrect. Hint: Categorical variables usually take on a fixed set of values")
```

The `%in%` operator tests whether each value belongs to a given set, so we can use it to filter for several values at once. Subset the data to 3 depths in the 7 cruises (i.e. specimens) that took place in February:
```{r}
dat <- filter(dat, Depth %in% c(10, 100, 200),

I think it would be nice to have this part be an interactive exercise (assuming we expect students to know basic data wrangling skills)

# The cruises that took place in February can be determined
# using the Date variable and functions that we won't discuss in
# these data science modules
Cruise %in% c(18, 30, 42, 54, 66, 80, 92))
```
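As a quick illustration of how `%in%` behaves on its own, here is a small sketch using a made-up vector of depths (hypothetical values, not the tutorial data):

```r
# Hypothetical depths in metres
depth <- c(10, 20, 100, 150, 200)

# %in% returns TRUE for each element found in the set on the right
keep <- depth %in% c(10, 100, 200)

keep         # TRUE FALSE TRUE FALSE TRUE
depth[keep]  # 10 100 200
```

`filter()` uses a logical vector like `keep` to decide which rows to retain.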


## Introduction to ggplot
"ggplot2" is an R package for creating plots that is included in the "tidyverse" package. It is preferred over base R graphics because it has:

- handsome default settings
- snap-together building blocks
- automatic legends, colors, facets
- statistical overlays
Comment on lines +103 to +106

I'm not sure if people will understand what these bullet points mean. Are we assuming that they have experience with base R graphics?


`ggplot2` builds plots by adding several different pieces together:

- data: 2D table of *variables*
- *aesthetics*: map variables to visual attributes, always contained in `aes()`
- *geoms*: graphical representation of data (points, lines, etc.)
- *stats*: statistical transformations (binning, summarizing, smoothing)
- *scales*: control *how* to map a variable to an aesthetic
- *guides*: axes, legend, etc.

The best way to learn about these pieces is to use them, as you will in this tutorial. Further `ggplot2` documentation is available at
[docs.ggplot2.org](http://docs.ggplot2.org/current/)
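The pieces listed above can be sketched in a single plot. This example assumes a small made-up data frame, `toy`, standing in for the tutorial's `dat`; the values are hypothetical, and the comments indicate which piece each line corresponds to.

```r
library(ggplot2)

# Toy data (hypothetical values) standing in for the tutorial's `dat`
toy <- data.frame(
  Depth  = factor(c(10, 10, 100, 100, 200, 200)),
  CTD_O2 = c(220, 210, 40, 35, 2, 1)
)

p <- ggplot(toy, aes(x = Depth, y = CTD_O2)) +  # data + aesthetics
  geom_point() +                                 # geom: draw each row as a point
  stat_summary(fun = mean, geom = "point",       # stat: mean of y per depth
               shape = 4, size = 4) +
  scale_y_continuous(name = "Oxygen [uM]")       # scale: how y is mapped and labelled

p
```

The guides (axes and their tick labels) are generated automatically from the scales, so they need no code of their own here.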



## Dot plot
Any `ggplot` needs at least 3 pieces: data, aesthetics, and a geom. So, if you want to plot the relationship between a geochemical variable and depth, you would need:

* data: `dat`
* aesthetics: x and y variables
* geom: `geom_point` to plot these data as points

Let's do so for the relationship between depth (as a categorical variable) and oxygen (O~2~), treated here as a continuous variable (as opposed to the categorical oxic vs. anoxic grouping in your previous t-tests).

The first argument of ggplot is the data. We see here that since we have not told ggplot anything about these data, we create a blank plot.
```{r blank, exercise=TRUE}
ggplot(dat)
```

The second argument is the aesthetics `aes`, where we specify visual attributes of our plot like the x- and y-variables. Now, we see that the plot has axes with labels but we still have not told ggplot how we would like to add the data values.
```{r dot1, exercise=TRUE}
ggplot(dat, aes(x = Depth, y = CTD_O2))
```

*NOTE: The x-axis showing depth is not linear because we converted `Depth` to a factor, so each depth is treated as a separate category.*

Finally we add the geom to specify how we want to map our data onto these axes (*i.e.* points, boxplot, lines, etc.). Here, we will use points for the data.

Plot depth and oxygen:
```{r depthplt, exercise=TRUE}
ggplot(dat, aes(x = Depth, y = CTD_O2)) +
geom_point()
```

Importantly, aesthetics in the `ggplot` layer will be applied to all layers while those in a specific geom will only be applied to that one layer. So in the above code, we have given aesthetics in `ggplot` so they will be used in `ggplot` and `geom_point` (and any other additional layers we might add).

In contrast, in the code below, we specify the aesthetics in the geom so only `geom_point` will use it. In this case, the plots are the same.
```{r depthplta, exercise=TRUE}
ggplot(dat) +
geom_point(aes(x = Depth, y = CTD_O2))
```
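The difference only becomes visible once a plot has more than one layer. The sketch below assumes a small made-up data frame, `toy` (hypothetical values), and uses `geom_boxplot` purely as a second layer for illustration.

```r
library(ggplot2)

# Toy data (hypothetical values) standing in for the tutorial's `dat`
toy <- data.frame(
  Depth  = factor(c(10, 10, 100, 100, 200, 200)),
  CTD_O2 = c(220, 210, 40, 35, 2, 1)
)

# Aesthetics set in ggplot(): inherited by both layers
shared <- ggplot(toy, aes(x = Depth, y = CTD_O2)) +
  geom_boxplot() +
  geom_point()

# Aesthetics set inside each geom: every layer must supply its own aes()
per_layer <- ggplot(toy) +
  geom_boxplot(aes(x = Depth, y = CTD_O2)) +
  geom_point(aes(x = Depth, y = CTD_O2))
```

Both objects draw the same plot. Setting the aesthetics once in `ggplot()` avoids the repetition, while per-layer aesthetics are useful when layers should show different variables.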

Let's check your understanding by visualizing the realtionship between depth and methane using a dot plot. The resulting plot should look like this:

Suggested change
Let's check your understanding by visualizing the realtionship between depth and methane using a dot plot. The resulting plot should look like this:
Let's check your understanding by visualizing the relationship between depth and methane using a dot plot. The resulting plot should look like this:


```{r plotexample, echo = FALSE, warning = FALSE}
ggplot(dat, aes(x = Depth, y=Mean_CH4)) +
geom_point()
```

```{r exercise1, exercise=TRUE}
ggplot(dat, aes(x = , y=))
```

Since you only specified the minimum pieces, the rest are filled in with `ggplot2` defaults.

Now, you can add pieces to alter these defaults. Below, see how we keep adding to the plot with additional pieces to:

Change the axis labels:
```{r labels, exercise=TRUE}
ggplot(dat, aes(x= Depth, y = CTD_O2)) +
geom_point() +
labs(x="Depth [m]", y="Oxygen [uM]")
```

Change the point color. You can set a single color such as "orange":
```{r example, exercise=TRUE}
ggplot(dat, aes(x= Depth, y = CTD_O2)) +
geom_point(color = "orange") +
labs(x="Depth [m]", y="Oxygen [uM]")
```

Build onto the previous plot visualizing depth and methane by:

1. changing the axis labels to Depth [m] and Methane [uM]
2. specifying the points to be blue

```{r plotexample2, echo = FALSE, warning = FALSE}
ggplot(dat, aes(x = Depth, y=Mean_CH4)) +
geom_point(color="blue") +
labs(x="Depth [m]", y="Methane [uM]")
```

```{r exercise2, exercise=TRUE}
ggplot(dat, aes(x = , y=))
```

There are 7 data points for each depth, but a few of them fall in the same range and cannot be resolved visually (the data are *overplotted*). One way to deal with overplotting is to use transparency, set with the `alpha` argument.

```{r example3, exercise=TRUE}
ggplot(dat, aes(x = Depth, y = CTD_O2)) +
geom_point(alpha = 0.5) +
labs(x="Depth [m]", y="Oxygen [uM]")
```

Change the point shape. Similar to color, this can be a single shape or mapped to a variable:

Mapping colour to a variable wasn't addressed above, so might be best to not bring it up here. You could also cover it in the previous section, though, since it is useful to know.

```{r example4, exercise=TRUE}
ggplot(dat, aes(x = Depth, y = CTD_O2)) +
geom_point(alpha = 0.5, shape = 17) +

Mention that shapes are specified using numbers and provide a link to where they can find the corresponding numbers for shapes.

labs(x="Depth [m]", y="Oxygen [uM]")
```
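The distinction between a single shape and a shape mapped to a variable can be sketched as follows. The example assumes a small made-up data frame, `toy`, with hypothetical values; only the column names mirror the tutorial data.

```r
library(ggplot2)

# Toy data (hypothetical values) standing in for the tutorial's `dat`
toy <- data.frame(
  Depth  = factor(c(10, 10, 100, 100, 200, 200)),
  Cruise = factor(c(18, 30, 18, 30, 18, 30)),
  CTD_O2 = c(220, 210, 40, 35, 2, 1)
)

# Fixed shape: set outside aes(), so every point is a filled triangle (17)
fixed <- ggplot(toy, aes(x = Depth, y = CTD_O2)) +
  geom_point(shape = 17)

# Mapped shape: set inside aes(), so each cruise gets its own shape and a legend
mapped <- ggplot(toy, aes(x = Depth, y = CTD_O2, shape = Cruise)) +
  geom_point()
```

The same rule applies to `color`: a value outside `aes()` is applied to all points, while a variable inside `aes()` is mapped through a scale.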

## Box plot
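Another way to visualize data points that fall into the same range is a box plot. A box plot displays five summary statistics: the median, the 1st and 3rd quartiles, and the minimum and maximum values (by default, `geom_boxplot` draws the whiskers only as far as 1.5 times the interquartile range from the box and plots more extreme values as individual points). To plot the relationship between a geochemical variable and depth, you would need: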

* data: `dat`
* aesthetics: x and y variables
* geom: `geom_boxplot` to plot these data as boxplots

Let's do so for the relationship between depth (as a categorical variable) and oxygen (O~2~), treated here as a continuous variable (as opposed to the categorical oxic vs. anoxic grouping in your previous t-tests).

```{r example5, exercise=TRUE}
ggplot(dat, aes(x = Depth, y = CTD_O2)) +
geom_boxplot() +
labs(x="Depth [m]", y="Oxygen [uM]")
```
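To connect the box plot to the numbers behind it, the five summary statistics can be computed directly with base R's `quantile()`. This is a sketch on a made-up vector of readings (hypothetical values for a single depth), not the tutorial data.

```r
# Hypothetical oxygen readings at a single depth
o2 <- c(1, 2, 3, 4, 5, 6, 7)

# Minimum, 1st quartile, median, 3rd quartile, maximum
five <- quantile(o2)
five
#   0%  25%  50%  75% 100%
#  1.0  2.5  4.0  5.5  7.0
```

The box spans the 25% and 75% values, and the line inside the box marks the 50% value (the median).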

As with dot plots, we can use colour to represent a categorical variable. In this case, we set `fill` (the colour of the boxes) to represent depth.

```{r example6, exercise=TRUE}
ggplot(dat, aes(x = Depth, y = CTD_O2, fill= Depth)) +
geom_boxplot() +
labs(x="Depth [m]", y="Oxygen [uM]")
```

## Layers

So far, we have learned how to visualize data using dot plots and box plots. A powerful feature of ggplot is the ease of layering different plots together. For example, we can visualize depth and oxygen concentration as a scatter plot on top of a box plot like this:
```{r example7, exercise=TRUE}
ggplot(dat, aes(x = Depth, y = CTD_O2)) +
geom_boxplot() +
geom_point(alpha = 0.5) +
labs(x="Depth [m]", y="Oxygen [uM]")
```

In addition, we can include a horizontal line to visualize the overall mean of all O~2~ values.
```{r example8, exercise=TRUE}
ggplot(dat, aes(x = Depth, y = CTD_O2)) +
geom_boxplot() +
geom_point(alpha = 0.5) +
# Add a horizontal line for the overall mean
geom_hline(aes(yintercept=mean(CTD_O2))) +
labs(x="Depth [m]", y="Oxygen [uM]")
```

A list of shape codes can be found [here](http://sape.inf.usi.ch/quick-reference/ggplot2/shape).

Change the overall look with a theme:

Provide a link to all of the themes

```{r example9, exercise=TRUE}
ggplot(dat, aes(x = Depth, y = CTD_O2)) +
geom_boxplot() +
geom_point(alpha = 0.5) +
labs(x="Depth [m]", y="Oxygen [uM]") +
geom_hline(aes(yintercept=mean(CTD_O2))) +
theme_classic()
```

There is also a handy [ggplot cheatsheet](https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf) to show you many more options!


## ggplot Exercise
Recreate the following plots:

### Exercise 1
```{r plot1, echo = FALSE, warning=FALSE}
ggplot(dat, aes(x = Depth, y = Mean_CH4, colour = Cruise)) +
geom_point(stat = "identity")
```

```{r plot1-exercise, exercise=TRUE}
ggplot(dat, aes(x = , y = , colour = )) +
geom_point(stat = "identity")
```

### Exercise 2
```{r plot2, echo = FALSE, warning = FALSE}
ggplot(dat, aes(x = Mean_H2S, y = CTD_O2)) +
geom_point()
```

```{r plot2-exercise, exercise = TRUE, exercise.lines = 5}

```

## Additional resources

* [R cheatsheets](https://www.rstudio.com/resources/cheatsheets/) also available in RStudio under Help > Cheatsheets
* Applied Statistics and Data Science Group ([ASDa](https://asda.stat.ubc.ca/)) at UBC