---
title: "Introduction to the Tidyverse and Data Wrangling"
author: "Brian High, Nancy Carmona & Chris Zuidema"
date: "![CC BY-SA 4.0](images/cc_by-sa_4.png)"
output:
  ioslides_presentation:
    fig_caption: yes
    fig_height: 3
    fig_retina: 1
    fig_width: 5
    keep_md: yes
    logo: images/logo_128.png
    smaller: yes
editor_options: 
  chunk_output_type: console
---

```{r set_knitr_options, echo=FALSE, message=FALSE, warning=FALSE}
suppressMessages(library(knitr))
opts_chunk$set(tidy=FALSE, cache=TRUE, echo=TRUE, message=FALSE)
```

## Learning Objectives

You will learn:

* What the "tidyverse" is
* What data "wrangling" is
* Packages and functions used to tidy and wrangle data in R
* How to tidy and wrangle data with R

## Introduction to the Tidyverse

Before we get into data wrangling, let's look at the [Tidyverse](https://www.tidyverse.org/).

<div class="columns-2">

![](images/tidyverse.png)

- "The tidyverse is an opinionated collection of R packages designed for data science." 
- "All packages share an underlying design philosophy, grammar, and data structures."
- There are tidyverse packages for data wrangling, modeling, and visualization.

</div>

## Data wrangling with *dplyr*

"[*dplyr*](https://dplyr.tidyverse.org/) is a grammar of data manipulation, 
providing a consistent set of verbs that help you solve the most common data 
manipulation challenges."

Common functions include:
 
* Choosing variables based on their name with `select()` 
* Choosing rows based on a condition with `filter()`
* Grouping data by a variable with `group_by()`
* Creating new variables or changing old variables with `mutate()`

## Loading packages with *pacman*
 
* *pacman* will load packages and automatically install missing packages 
* *pacman* is not a tidyverse package, but it works well with the tidyverse
 
```{r, message = FALSE}
# Install pacman if it is not already available
if (!requireNamespace("pacman", quietly = TRUE)) { install.packages("pacman") }

# Load other packages, installing as needed
pacman::p_load(dplyr, ggplot2)
```

This will load these packages (installing first if necessary):

- *dplyr* for data wrangling
- *ggplot2* for data visualization

## Example Dataset *airquality*

R has many built-in datasets - let's get an air quality dataset from New York in 1973. 

```{r}
# Load the "airquality" dataset into the environment
data(airquality)
```

We can show the structure of this dataset with `str()`:

```{r}
# Show the structure of the dataset
str(airquality)
```

## Example Dataset *airquality*
 
Alternatively, you can use `class()`, `dim()`, and `head()`:
 
```{r}
# Show the class and dimensions (153 rows and 6 columns) of the dataset
class(airquality)
dim(airquality)
 
# Display the first three rows of data
head(airquality, n = 3)
```
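
## Example Dataset *airquality*

Missing values will matter later when we compute averages, so it can help to
count them up front. This is a minimal sketch using only base R;
`datasets::airquality` refers to the built-in copy of the data, and `na_counts`
is just an illustrative name.

```{r}
# Count the missing values (NAs) in each column of the built-in dataset
na_counts <- colSums(is.na(datasets::airquality))

# Show the counts, one per column
na_counts
```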

## Wrangle *airquality*

Common data tasks are simplified with *dplyr*, once you learn the verbs.

Let's add a variable to *airquality* using `mutate()`.

```{r}

# Add the year of the measurements and look at the first 10 rows
head(mutate(airquality, Year = 1973), n = 10)

```

## Sneak preview: Using "pipes"
 
Alternatively, we can do the same thing with "pipes," `|>`.

```{r}
# Add the year of the measurements and look at the first 10 rows
airquality |> mutate(Year = 1973) |> head(n = 10)
```
 
Is this code with pipes more readable? Why? We'll see pipes again a little later...

## Wrangle *airquality*
 
Let's take a look at the new *airquality* dataframe with `head()`.
 
```{r}
# Show the dataframe, looking at the first few rows
head(airquality)
```

What happened here? Where did the `Year` column go? 

In the previous step, the dataframe with the new `Year` variable was printed to
the console (if you entered the command at the prompt) or displayed below the
code chunk (if you ran the command from the R Markdown document).

We passed *airquality* to `mutate()`, but because we did not assign the result
(e.g., with `<-`), the change was not "saved" to the object in the environment.
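
## Wrangle *airquality*

The same behavior can be seen on a tiny dataframe. This sketch uses a made-up
dataframe, `df` (not part of the lesson's data), to show that `mutate()` returns
a modified copy and leaves its input unchanged unless we assign the result.

```{r}
library(dplyr)

# A small, made-up dataframe for illustration only
df <- data.frame(x = 1:3)

# mutate() returns a new dataframe; df itself is not modified
df_new <- mutate(df, y = x * 2)

names(df)      # the original still has only "x"
names(df_new)  # the result of mutate() has "x" and "y"
```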

## Wrangle *airquality*

Now we'll add `Year` and make a new assignment to save it to memory. 

```{r}
# Add year to the dataframe
airquality <- mutate(airquality, Year = 1973)

# Take a look!
head(airquality)
```

Better! 

## Wrangle *airquality*

Now let's create a date column. We'll create a character string first to break
the steps down.

```{r}
# Make a character variable, then a date variable.
airquality <- mutate(airquality, Date_char = paste(Year, Month, Day, sep = "-"))
airquality <- mutate(airquality, Date = as.Date(Date_char, format = "%Y-%m-%d"))

head(airquality)

# Check the class.
class(airquality$Date)
```
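
## Wrangle *airquality*

The pasted strings are not zero-padded (e.g., "1973-5-1"), so it is worth
checking that `as.Date()` parses them as intended. A minimal base R sketch,
using a few hand-typed example values rather than the dataset itself:

```{r}
# Build date strings the same way as above, from example values
date_char <- paste(1973, 5, c(1, 15, 31), sep = "-")
date_char

# Convert to Date; single-digit months and days are parsed here as well
dates <- as.Date(date_char, format = "%Y-%m-%d")
dates

# A string that is not a valid calendar date becomes NA rather than an error
as.Date("1973-2-30", format = "%Y-%m-%d")
```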

## Wrangle *airquality*

Say we're only interested in data from May. Let's filter our data now. 

The `filter()` function takes a logical condition. To test for equality, use `==`
("is equal to?"), which is different from `=`, used to assign values to
arguments within function calls.

```{r}
airquality <- filter(airquality, Month == 5)
head(airquality)
dim(airquality)
```
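
## Wrangle *airquality*

`filter()` also accepts several conditions at once; conditions separated by
commas must all be true for a row to be kept. A brief sketch, using the built-in
copy `datasets::airquality` so the `airquality` object above is not affected
(the 80-degree cutoff and the name `warm_summer` are arbitrary choices for
illustration):

```{r}
library(dplyr)

# Keep rows from June or July where the temperature was above 80 degrees F
warm_summer <- filter(datasets::airquality, Month %in% c(6, 7), Temp > 80)

# Confirm which months remain and the lowest remaining temperature
unique(warm_summer$Month)
min(warm_summer$Temp)
```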

## The pipe operator

Next, let's consider wrangling *airquality* using the pipe operator. There are
two common pipe operators in R:

* The "native" pipe operator, `|>` (starting with R version 4.1.0)
* The *magrittr* pipe operator, `%>%`

These two pipes are largely the same, but there are some
[differences](https://www.tidyverse.org/blog/2023/04/base-vs-magrittr-pipe/) to
be aware of. You're also likely to encounter both types of pipes.

Pipes streamline consecutive operations, making code clearer and more succinct.

Pipe operators feed the output of one function into the input of the next, so
we can avoid making multiple intermediate assignments. In the following slides,
we'll use the *magrittr* pipe operator, `%>%`.
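
## The pipe operator

To see that a pipeline is just another way of writing nested function calls, we
can compare the two directly. This is a small sketch using the built-in copy
`datasets::airquality`; the object names are illustrative, and the native pipe
requires R 4.1.0 or later.

```{r}
library(dplyr)

# Nested calls: read from the inside out
nested <- head(mutate(datasets::airquality, Year = 1973), n = 3)

# Native pipe: read from left to right
piped_native <- datasets::airquality |> mutate(Year = 1973) |> head(n = 3)

# magrittr pipe (made available when dplyr is loaded)
piped_magrittr <- datasets::airquality %>% mutate(Year = 1973) %>% head(n = 3)

# Check whether all three approaches give the same result
identical(nested, piped_native)
identical(nested, piped_magrittr)
```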

## Wrangle *airquality*
### with the [*magrittr*](https://magrittr.tidyverse.org/) pipe operator

```{r}
# Reload dataset to start fresh.
data(airquality)

# Wrangle airquality with pipe operators starting with original dataframe
airquality_may <- airquality %>% 
  
  # Add year and date variables using mutate
  mutate(Year = 1973,
         Date_char = paste(Year, Month, Day, sep = "-"), 
         Date = as.Date(Date_char, format = "%Y-%m-%d")) %>% 
  
  # Filter dataframe to the month of May
  filter(Month == 5) %>% 
  
  # Select the variables of interest.
  select(Ozone, Temp, Date)
```

## Wrangle *airquality*

Let's take a look.

```{r}
# Show the dataframe.
head(airquality_may)
```

## Wrangle *airquality*

What if we're interested in the average ozone and temperature by month? Would we
have to repeat this process for each month? *dplyr* has easy and powerful
functions to perform data wrangling tasks on groups: `group_by()` and
`summarise()`.

```{r}
# Average by month, using the argument `na.rm = TRUE` to ignore the NA values
# If you omit `na.rm = TRUE`, then any NAs will cause the mean to be NA
airquality_by_month <- airquality %>% 
  
  # Group dataframe by month.
  group_by(Month) %>% 
  
  # Calculate average ozone concentration and temperature.
  summarise(Ozone_avg = mean(Ozone, na.rm = TRUE), 
            Temp_avg = mean(Temp, na.rm = TRUE),
            .groups = "keep")

```

The last argument, `.groups`, specifies what grouping the result should have
when the operation is finished. In this case, we "keep" the grouping as it is,
so the result is still grouped by `Month`. (The *dplyr* documentation marks
`.groups` as "experimental".)
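
## Wrangle *airquality*

To see what `.groups` changes, we can compare the grouping left behind by two
settings. A minimal sketch using the built-in copy `datasets::airquality` and
*dplyr*'s `group_vars()`; the object names are illustrative.

```{r}
library(dplyr)

# Group the built-in data by month
by_month <- group_by(datasets::airquality, Month)

# "keep" leaves the result grouped by Month
kept <- summarise(by_month, Temp_avg = mean(Temp), .groups = "keep")
group_vars(kept)

# "drop" returns an ungrouped result
dropped <- summarise(by_month, Temp_avg = mean(Temp), .groups = "drop")
group_vars(dropped)
```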

## Wrangle *airquality*

```{r}
# Show new dataframe
airquality_by_month
```


## Make a plot using *ggplot2*

Now let's make a simple plot so we have an example for our rendered document.
The goal here is just to introduce the main plotting tool of the Tidyverse -
we'll be covering *ggplot2* in more detail in the future.

```{r}

# Create a ggplot object and save it as "p"
p <- ggplot(data = airquality_may) + 
  
  # Specify the type of geometry and the aesthetic features
  geom_line(aes(x = Date, y = Ozone)) + 
  
  # Specify the labels.
  labs(y = "Ozone Conc. (ppb)") + 
  
  # Choose the theme.
  theme_classic()
```

## Make a plot using *ggplot2*

```{r show_plot}
# Show the plot.
print(p)
```

## 

```{r child = 'images/questions.html'}
```