-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy path01-intro.qmd
673 lines (487 loc) · 25 KB
/
01-intro.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
---
aliases:
- intro.html
---
# Introduction {#intro}
::: {.alert .alert-secondary}
Hi there!
this is the material for the first lecture.
To sign up, check out the [welcome page](https://jmbuhr.de/dataintro/index.html)
:::
> ... in which we get started with R and RStudio, learn about literate programming and build our first plot by discovering a Grammar of Graphics.
{{< youtube r0bWxrzu4tg >}}
## What You will Learn
Throughout your scientific career --- and potentially outside of it --- you will encounter various forms of data.
Maybe you do an experiment and measured the fluorescence of a molecular probe, or you simply count the penguins at your local zoo.
Everything is data in some form or another.
But raw numbers without context are meaningless and tables of numbers are not only boring to look at, but often hide the actual structure in the data.
In this course you will learn to handle different kinds of data.
You will learn to create pretty and insightful visualizations, compute different statistics on your data and also what these statistical concepts mean.
From penguins to p-values, I got you covered.
The course will be held in English, as the concepts covered will directly transfer to the research you do, where the working language is English.
That being said, feel free to ask questions in any language that I understand, so German is also fine.
My Latin is a little rusty, thought.
In this course, we will be using the programming language R.
R is a language particularly well suited for data analysis, because it was initially designed by statisticians and because of the interactive nature of the language, which makes it easier to get started.
So don't fret if this is your first encounter with programming, we will take one step at a time.
The datasets chosen to illustrate the various concepts and tools are not particularly centered around Biology.
Rather, I chose general datasets that require less introduction and enable us to focus on learning R and statistics.
This is why we will be talking about penguins, racing games or life expectancy instead of intricate molecular measurements.
## Execute R Code
You can now execute commands in the R console in the bottom left.
For example we can calculate a mathematical expression:
```{r}
1 + 1
```
Or generate the numbers from one to 10:
```{r}
1:10
```
But I rarely type directly into the console.
Because we want our results to be reproducible, we write our code in a **script** first, so that the next person [^01-intro-1] can see what we did and replicate our analysis.
You will see that reproducibility is quite near and dear to me, so it will pop up once or twice.
And as scientists, I am sure you understand the importance.
[^01-intro-1]: This will most likely be future You.
And you will thank yourself later
> A script is like a recipe.
> It is the most important part of your data analysis workflow, because as long as you have the recipe, you can recreate whatever products (e.g. plots, statistics, tables) you have with ease.
To create a new script, click the little button in the top left corner.
In a script you can type regular R code, but it won't get executed straight away.
To send a line of code to the console to be executed, hit **Ctrl+Enter**.
Go ahead, try it with:
## Get to know RStudio
Before we get deeper into R, let's talk a little bit about our Home when working with R: RStudio.
There is one important setting that I would like you to change: Under Tools -\> Global Options make sure that "Restore .RData into workspace at startup" is **unchecked**.
The workspace that RStudio would save as `.RData` contains all objects created in a session, which is, what we can see in the **Environment** pane (by default in the top right panel, bottom right in my setup).
Why would we not want to load the objects we created in the last session into our current session automatically?
The reason is reproducibility.
We want to make sure that everything our analysis needs is in the script.
It creates our variables and plots from the raw data and should be the sole source of truth.
Check out the lecture video for further customization of RStudio e.g. with themes and make sure to also use *RStudio Projects* to structure your work.
## Expressions: Tell R to do things
R can do lot's of things, but let's start with some basics, like calculating.
Everything that starts with `#` is a comment and will be ignored by R.
```{r}
1 + 1 # addition
32 / 11 # division
3 * 4 # multiplication
13 %% 5 # modulo
13 %/% 5 # integer division
```
Create vectors with the `:` operator, e.g. numbers from:to
```{r}
1:4
```
And mathematical operations are automatically "vectorized":
```{r}
1:3 + 1:3
```
In fact, R as no scalars (individual values), those are just vectors of length 1.
## Variables: Boxes for things
Often, you will want to store the result of a computation for reuse, or to give it a sensible name and make your code more readable.
This is what **variables** are for.
We can assign a value to a variable using the assignment operator `<-` (In RStudio, there is a shortcut for it: **Alt+Minus**):
```{r}
my_number <- 42
```
Executing the above code will not give you any output, but when you use the name of the variable, you can see its content.
```{r}
my_number
```
And you can do operations with those variables:
```{r}
x <- 41
y <- 1
x + y
```
> **NOTE** Be careful about the order of execution!
> R enables you to work interactively and to execute the code you write in your script in any order with `Ctrl+Enter`, but when you execute (="source") the whole script, it will be executed from top to bottom.
Furthermore, code is not executed again automatically, if you change some dependency of the expression later on.
So the second assignment to x doesn't change y.
```{r}
x <- 1
y <- x + 1
x <- 1000
y
```
Variable names can contain letters (capitalization matters), numbers (but not as the first character) and underscores `_`.
[^01-intro-2]
[^01-intro-2]: They can also contain dots (`.`), but it is considered bad practice, because it can lead to some confusing edge cases.
```{r}
# snake_case
main_character_name <- "Kvothe"
# or camelCase
bookTitle <- "The Name of the Wind"
# you can have numbers in the name
x1 <- 12
```
![A depiction of various naming styles by @ArtworkAllisonHorst](images/coding_cases.png)
A good convention is to always use `snake_case`.
## Atomic datatype
First we have numbers (which internally are called `numeric` or `double`)
```{r}
#| eval: false
12
12.5
```
Then, there are whole numbers (`integer`)
```{r}
#| eval: false
1L # denoted by L
```
as well as the rarely used complex numbers (`complex`)
```{r}
#| eval: false
1 + 3i # denoted by the small i for the imaginary part
```
Text data however will be used more often (`character`, `string`).
Everything enclosed in quotation marks will be treated as text.
Double or single quotation marks are both fine.
```{r}
#| eval: false
"It was night again."
'This is also text'
```
Logical values can only contain yes or no, or rather `TRUE` and `FALSE` in programming terms (`boolean`, `logical`).
```{r}
#| eval: false
TRUE
FALSE
```
There are some special types that mix with any other type.
Like `NULL` for no value and `NA` for "Not Assigned".
```{r}
#| eval: false
NULL
NA
```
`NA` is contagious.
Any computation involving `NA` will return `NA` (because R has no way of knowing the answer):
```{r}
NA + 1
max(NA, 12, 1)
```
But some functions can remove `NA`s before giving us an answer:
```{r}
max(NA, 12, 1, na.rm = TRUE)
```
You can ask for the datatype of an object with the function `typeof`:
```{r}
typeof("hello")
```
There is also a concept called factors (`factor`) for categorical data, but we will talk about that later, when we get deeper into vectors.
## Functions: Calculate, run and automate things
> In R, everything that exists is an object, everything that does something is a function.
Functions are the main workhorse of our data analysis.
For example, there are mathematical functions, like `sin`, `cos` etc.
```{r}
sin(x = 0)
```
Functions take arguments (sometimes called parameters) and sometimes they also return things.
The `sin` function takes just one argument `x` and returns its sine.
What we do with the returned value is up to us.
We can use it directly in another computation or store it in a variable.
If we don't do anything with the return value, R simply prints it to the console.
Note, that the `=` inside the function parenthesis gives `x = 0` to the function and is separate from any `x` defined outside of the function.
For example:
```{r}
x <- 10
cos(x = 0)
# x outside of the function is still 10
x
```
To learn more about a function in R, execute `?` with the function name or press **F1** with your mouse over the function.
This is actually one of the most important things to learn today, because the help pages can be... well... incredibly helpful.
```{r, eval = FALSE}
?sin
```
We can pass arguments by name or by order of appearance.
The following two expressions are equivalent.
```{r}
#| eval: false
sin(x = 12)
sin(12)
```
Other notable functions to start out with:
Combine elements into a vector:
```{r}
c(1, 3, 5, 31)
```
Convert between datatypes with:
```{r}
as.numeric("1")
as.character(1)
```
Calculate summary values of a vectore:
```{r}
x <- c(1, 3, 5, 42)
max(x)
min(x)
mean(x)
range(x)
```
Create sequences of numbers:
```{r}
seq(1, 10, by = 2)
```
You just learned about the functions `sin`, `seq` and `max`.
But wait, there is more!
Not only in the sense that there are more functions in R (what kind of language would that be with only two verbs?!), but also in a more powerful way:
> We can define our own functions!
The syntax ($\leftarrow$ grammar for programming languages) is as follows.
```{r}
name_for_the_function <- function(parameter1, parameter2, ...) { # etc.
# body of the function
# things happen
result <- parameter1 + parameter2
# Something the function should return to the caller
return(result)
}
```
The function ends when it reaches the `return` keyword.
It also ends when it reaches the end of the function body and implicitly returns the last expression.
So we could have written it a bit shorter and in fact you will often see people omitting the explicit return at the end:
```{r}
add <- function(x, y) {
x + y
}
```
And we can call our freshly defined function:
```{r}
add(23, 19)
```
Got an error like `Error in add(23, 19) : could not find function "add"`?
Check that you did in fact execute the code that defines the function (i.e. put your cursor on the line with the `function` keyword and hit Ctrl+Enter.).
## Packages: Sharing functions
You are not the only one using R.
There is a welcoming and helpful community out there.
Some people also write a bunch of functions and put them together in a so called `package`.
And some people even went a step further.
The `tidyverse` is a **collection of packages** that play very well together and also iron out some of the quirkier ways in which R works [@wickhamWelcomeTidyverse2019b].
They provide a consistent interface to enable us to do more while having to learn less special cases.
The R function `install.packages("<package_name_here>")` installs packages from CRAN a curated set of R packages.
This is one exception to our effort of having everything in our script and not just in the console.
We don't want R trying to install the package every time we run the script, as this needs to happen only once.
So you can either turn it into a comment, delete it from the script, or only type it in the console.
You can also use RStudio's built-in panel for package installation.
R packages, especially the ones we will be using, often come with great manuals and help pages and I added a link to the package website for each of the packages to the hexagonal icons for each package in the script, so make sure to **click the icons**.
If you don't have the link at hand you can also always find help on the internet.
Most of these packages publish their source code on a site called GitHub, so you will be able to find further links, help and documentation by searching for *r* <the package> github.
Sometimes it can be helpful to write our R's full name when searching (turns out there are a lot of thing with the letter R): `rstats`.
## Literate Programming with Quarto (previously Rmarkdown): Code is communication
::: aside
[![](./images/quarto.png){width="200"}](https://quarto.org)
:::
**Quarto** enables us, to combine text with `code` and then produce a range of output formats like pdf, html, word documents, presentations etc.
In fact, this whole website was created with Quarto.
Sounds exciting?
Let's dive into it!
Open up a new Quarto document with the file extension `.qmd` from the *New File* menu in the top left corner of RStudio: **File → New File → Quarto Document** and choose **html** as the output format.
I particularly like html, because you don't have to worry about page breaks and it easily works on screens of different sizes, like your phone.
A Quarto document consists of three things:
1. **Metadata**:\
Information about your document such as the author or the date in a format called `YAML`. This YAML header starts and ends with three minus signs `---`.
2. **Text**:\
Regular text is interpreted as markdown, meaning it supports things like creating headings by prefixing a line with `#`, or text that will be bold in the output by surrounding it with `**`.
3. **Code chunks**:\
Starting with a line with 3 backticks and {r} and ending with 3 backticks. They will be interpreted as R code. This is where you write the code like you did in the `.R` script file. You can insert new chunks with the button on the top right of the editor window or use the shortcut **Ctrl+Alt+i**.
Use these to document your thoughts alongside your code when you are doing data analysis.
Future you (and reviewer number 2) will be happy!
To run code inside of chunks, use,the little play button on the chunk, the tried and true **Ctrl+Enter** to run one line, or **Ctrl+Shift+Enter** to run the whole chunk.
Your chunks can be as large or small as you want, but try to maintain some sensible structure.
The lecture video also demonstrates the different output formats (though for the exercises we will only be using `html`) and the visual editor.
![Cute little monsters as Rmarkdown Wizards by Allison Horst](images/rmarkdown_wizards.png)
### The Tidyverse
<aside><a href="https://tidyverse.org/"> ![](images/tidyverse.png){width="200"} </a></aside>
Go ahead and install the tidyverse packages with
```{r}
#| eval: false
install.packages("tidyverse")
```
## Our First Dataset: The Palmer Penguins
![The three penguin species of the Palmer Archipelago, by Allison Horst](images/lter_penguins.png){#fig-penguins}
So let's explore our first dataset together in a fresh Quarto document.
The `setup` chunk is special.
It gets executed automatically before any other chunk in the document is run.
This makes it a good place to load packages.
The dataset we are working with today actually comes in its own package, so we need to install this as well (Yes, there is a lot of installing today, but you will have to do this only once):
```{r}
#| eval: false
install.packages("palmerpenguins")
```
And then we populate our `setup` chunk with
```{r}
library(tidyverse)
library(palmerpenguins)
```
This gives us the `penguins` dataset [@rpalmerpenguins]:
```{r}
penguins
```
### Dataframes: R's powerfull tables
Let's talk about the shape of the `penguins` object.
The `str` function reveals the **structure** of an object to us.
```{r}
str(penguins)
```
The penguins variable contains a `tibble`, which is the tidyverse version of a `dataframe`.
It behaves the same way but prints out nicer.
Both are a list of columns, where columns are (usually) vectors.
We will learn more about their underlying datastructure, `list`s, [next week](02-data-wrangling.qmd).
## The Grammar of Graphics: Translate data into visualizations
<aside><a href="https://ggplot2.tidyverse.org/"> ![](images/ggplot2.png){width="200"} </a></aside>
You probably took this course because you want to build some cool visualizations for you data.
In order to do that, let us talk about how we can describe visualizations.
Just like language has grammar, some smart people came up with a **grammar of graphics** [@wilkinsonGrammarGraphics2005], which was then slightly modified and turned into an R package so that we can not only talk about but also create visualizations using this grammar [@wickhamLayeredGrammarGraphics2010].
The package is called `ggplot2`, and we already have it loaded because it is included in the tidyverse.
Before looking at the code, we can describe what we need in order to create this graphic.
```{r}
library(tidyverse)
library(palmerpenguins)
penguins %>%
ggplot(aes(flipper_length_mm, bill_length_mm,
color = species,
shape = sex)) +
geom_point(size = 2.5) +
geom_smooth(aes(group = species), method = "lm", se = FALSE,
show.legend = FALSE) +
labs(x = "Flipper length [mm]",
y = "Bill length [mm]",
title = "Penguins!",
subtitle = "The 3 penguin species can be differentiated by their flipper- and bill-lengths.",
caption = "Datasource:\nHorst AM, Hill AP, Gorman KB (2020). palmerpenguins:\nPalmer Archipelago (Antarctica) penguin data.\nR package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/",
color = "Species",
shape = "Sex") +
theme_minimal() +
scale_color_brewer(type = "qual") +
theme(plot.caption = element_text(hjust = 0))
```
Having a grammar means: - we can build complex visualizations with basic building blocks that fit together according to some rules (the grammar) - just like lego bricks - we just have to learn the building blocks and not a different function for all the different types of plots (e.g. barplot, scatterplot, lineplot, piechart)
We can build this plot up step by step.
The data is the foundation of our plot, but this just gives us an empty plotting canvas.
I am assigning the individual steps we are going through to a variable, so that we can sequentially add elements, but you can do this in one step as shown above.
```{r}
plt <- ggplot(penguins)
plt
```
Then, we add and `aesthetic mapping` to the plot.
It creates a relation from the features of our dataset (like the flipper length of each penguin) to a visual property, like position of the x-axis, color or shape.
```{r}
plt <- ggplot(penguins,
aes(x = flipper_length_mm,
y = bill_length_mm,
color = species,
shape = sex))
plt
```
Still, the plot is empty, it only has a coordinate system with a certain scale.
This is because we have no geometric objects to represent our aesthetics.
Elements of the plot are added using the `+` operator and all geometric elements that ggplot knows start with `geom_`.
Let's add some points:
```{r}
plt <- plt +
geom_point()
plt
```
Look at the help page for `geom_point` to find out what aesthetics it understands.
The exact way that features are mapped to aesthetics is regulated by **scales** starting with `scale_` and the name of an aesthetic:
```{r}
plt <- plt +
scale_color_manual(values = c("red", "blue", "orange"))
plt
```
We can add or change labels (like the x-axis-label) by adding the `labs` function.
```{r}
plt <- plt +
labs(x = "Flipper length [mm]",
y = "Bill length [mm]",
title = "Penguins!",
subtitle = "The 3 penguin species can differentiated by their flipper and bill lengths")
```
The overall look of the plot is regulated by themes like the pre-made `theme_` functions or more finely regulated with the `theme()` function, which uses `element` functions to create the look of individual elements.
Autocomplete helps us out a lot here (**Ctrl+Space**).
```{r}
plt <- plt +
theme_minimal() +
theme(legend.text = element_text(face = "bold"))
plt
```
In summary, this is what our plot needs:
- data
- aesthetic mapping
- geom(s)
- (stat(s))
- coordinate system
- guides
- scales
- theme
```{r}
my_plot <- ggplot(penguins,
aes(x = flipper_length_mm,
y = bill_length_mm,
shape = sex,
color = species)) +
geom_point() +
scale_color_manual(values = c("red", "blue", "orange")) +
labs(title = "Penguins") +
theme(plot.title = element_text(colour = "purple"))
my_plot
```
We can save our plot with the `ggsave` function.
It also has more arguments to control the dimentions and resolution of the image.
```{r}
#| eval: false
ggsave("my_plot.png", my_plot)
```
Next week we will be get rid of the annoying `NA` in the legend for `sex`.
## The Community: There to catch You.
![Comunity Teamwork by Allison Horst](images/code_hero.jpg){width="50%"}
![Googling the Error Message](images/errors.jpg){width="50%"}
## Bonus: Get more RStudio themes
- and talk about where packages come from
- <https://github.com/gadenbuie/rsthemes>
## Exercises
This course is not graded, but I need some way of confirming that you did indeed take part in this course.
In order to get the confirmation, you will send your solutions for a minimum of 6 out of the 8 exercises to me before the Seminar Tuesdays
For each week I would like you to create a fresh quarto document with your solutions as code as well as any questions that arose during the lecture.
This will help me a lot in improving this course.
When you are done solving the exercises, hit the `knit` button (at the top of the editor panel) and send me the resulting **html** document via discord (confirm that it looks the way you expected beforehand).
Here are today's tasks:
### Put your flippers in the air!
In a fresh quarto document (without the example template content), load the tidyverse and the palmerpenguins packages.
- Write a section of text about your **previous experience** with data analysis and/or programming (optional, but I can use this information to customize the seminars to your needs).
- Create a **vector of all odd numbers from 1 to 99** and store it in a variable.
- Create a second variable that contains the squares of the first.
- Have a look at the `tibble` function. Remember that you can always access the help page for a function using the `?` syntax, e.g. `?tibble::tibble` (The two colons `::` specify the package a function is coming from. You only need `tibble(...)` in the code because the `tibble` package is loaded automatically with the tidyverse. Here, I specify it directly to send you to the correct help page).
- Create a `tibble` where the columns are the vectors `x` and `y`.
- Create a scatterplot (points) of the two columns using `ggplot`.
- What `geom_` function do you need to add to the plot to add a line that connects your points?
- Load the **penguins** dataset from the `palmerpenguins` package.
Produce a scatterplot of the bill length vs. the bill depth, colorcoded by species.
- Imaginary bonus points if you manage to use the same colors as in the [penguin-image](#fig-penguins) (hint: look at the help page for `scale_color_manual()` to find out how. Note, that R can work with it's built-in color names, `rgb()` specifications or as hex-codes such as `#1573c7`). Even more bonus points if you also look into the `theme()` function and it's arguments, or the `theme_<...>()` functions to make the plot prettier.
- Check the metadata (YAML) of your quarto document and make sure it contains your name as the `author:`.
- The filename of your solution should also contain your name, e.g. `lecture1-jannik.qmd`, which gets rendered to `lecture1-jannik.html` when you knit it.
- Make the output document [self contained](https://quarto.org/docs/output-formats/html-basics.html#self-contained) by adding `embed-resources: true` to the yaml header.
- [Here](https://quarto.org/docs/reference/formats/html.html) are a couple more YAML options you can try if you feel adventurous.
The top of your Quarto document should now look like this between the `---`:
``` yaml
title: "Lecture 1"
author: <Your Name>
format:
html:
embed-resources: true
# more html options if you want
# like e.g. theme: <...>
execute:
warning: false
```
- Knit it and ship it! (=press the render button and send me the rendered html document via discord)
::: callout-caution
- Some operating systems (looking at you, Windows), will not show you file extensions by default (e.g. `.html`, `.png`, `docx`). [Here](https://www.howtogeek.com/205086/beginner-how-to-make-windows-show-file-extensions/) is a guide on how to turn them on in Windows 11 and 10. I recommend to to so, to make it easier for you to send the correct file (the rendered html file instead of the source qmd).
- You can test that your file is truely self contained, by copying it to a different location on your computer and opening it (a simple double-click should suffice) with your default browser. If you can see your plots, you are good to go. If you don't see them that means the plots are not embedded into the output (just linked to their relative location). Check the second-to-last point of the exercise list again.
:::
### Exercise Tips
<!-- TODO -->
{{< youtube Ycl4CMJdneM >}}
## Learn more:
Check out the dedicated [Resources](Resources) page.