---
aliases:
- distributions-summaries-and-dimensionality-reduction.html
---
# Distributions, Summaries and Dimensionality Reduction
```{r}
#| include: false
source("./_common.R")
```
> ... in which we explore continuous distributions with Spotify data,
find out about the central limit theorem and related statistical tests
and become N-dimensional whale sharks.
As it turns out, this lecture got quite long,
so I separated two small interludes that are not
crucial to the bigger picture into a bonus video.
You can recognize the corresponding parts of the
script by a prefix of [Sidenote] in the section heading.
They should be interesting to watch and read but
are not required for the exercises.
Here is the main lecture video:
{{< youtube uBCsxqw2kEM >}}
And here are the bonus bits:
{{< youtube zGJ_Zsw8QqY >}}
## Some Preparation
Today, we will explore the process of modeling
and look at different types of models.
In part, we will do so using the **tidymodels** framework.
The tidymodels framework extends the tidyverse
with specialized tools for all kinds of modeling tasks
that fit neatly in with all the tools
we already know. Go ahead and install them with:
```{r inst-tidy-models, eval=FALSE}
install.packages("tidymodels")
```
Now we are ready to get started:
```{r}
library(tidyverse)
library(tidymodels)
```
### [Sidenote] on Reproducible Environments with `renv`
<aside>
<a href="https://rstudio.github.io/renv/">
![](images/renv.svg){width=200}
</a>
</aside>
At this point, we have installed quite a lot of packages.
On one hand, this is great fun, because they extend what we
can do and make tedious tasks enjoyable.
On the other hand, every package that we add introduces
what is called a dependency.
If a user doesn't have the package installed,
our analysis will not run.
If we are feeling experimental and use functions
from packages that are under active development and might
change in the future, we will run into trouble
when we update the package.
But never updating anything ever again is no fun!
I will show you how to get the best of both worlds:
All the packages and functions that your heart desires
while maintaining complete reproducibility.
This is to make sure that you can come back to your old
projects 2 years from now and they still just run
as they did at the time.
This solution is a package called `renv`.
The idea is as follows:
Instead of installing all your packages
into one place, where you can only have one version
of a package at a time, `renv` installs packages
locally **in your project** folder.
It also meticulously writes down the version numbers
of all the packages you installed and keeps a cache,
so it will not copy the same version twice.
It is an R package like any other, so first,
we install it with:
```{r inst-renv, eval=FALSE}
install.packages("renv")
```
Then, in our RStudio project in the R console,
we initialize the project to use `renv` with:
```{r eval=FALSE}
renv::init()
```
This does a couple of things.
It creates a file named `.Rprofile`, in which it
writes `source("renv/activate.R")`.
The `.Rprofile` file is run automatically every
time you start an R session in this folder,
so it makes sure `renv` is active every time
you open the project.
It also creates a folder called `renv`.
This is where it will install the packages
you want to use in the project.
The most important file is the `renv.lock` file.
You can have a look at it, it is just a text file
with all the packages and their exact versions.
You will notice that after initializing renv,
we have no packages available, so for example we
can't load the tidyverse as usual.
We will have to install it again!
However, in this case it should be fairly
fast, because renv knows that it was
already installed globally and
simply copies the files.
After having installed a new package,
we call:
```{r}
#| eval: false
renv::snapshot()
```
Renv tells us what we changed in our environment
and, after we confirm, it notes down the changes.
Now it is also really easy to collaborate with
other people, because after we send them
our project folder, all they have to do is run:
```{r}
#| eval: false
renv::restore()
```
to install all packages noted down in the lockfile.
We can also use this ourselves if we installed
a few too many packages or did an update we
regret, and want to go back
to what is written in the lockfile.
Finally, renv also provides functions to update
or install new packages. They work like `install.packages`,
but a bit more versatile.
For example, let me show you the different
locations from which we can install packages.
1. The main location is [CRAN](https://cran.r-project.org/)
(The Comprehensive R Archive Network).
This is also from where you installed R itself.
R packages on there are subject to certain standards
and usually stable and tested.
2. We can also install packages directly from the source
code other people uploaded.
[GitHub](https://github.com/) is a platform where
you can upload code and track changes to it.
A lot of times, you can find the current development
version of an R package, or packages that are not
yet on CRAN, on GitHub.
`renv` can install packages from GitHub as well,
for example let us say, we want to test out the
latest version of the `purrr` package to give feedback
to the developers.
<https://github.com/tidyverse/purrr> here it says:
```{r eval=FALSE}
# ...
# Or the development version from GitHub:
# install.packages("devtools")
devtools::install_github("tidyverse/purrr")
```
Well, we don't need devtools for this, because `renv` can
do this with the regular install function:
```{r eval=FALSE}
renv::install("tidyverse/purrr")
```
Giving it just a package name installs a package from CRAN,
a pattern of `"username/packagename"` installs from GitHub.
Now, back to the actual topic of today!
After having initialized `renv` we need to install
the packages that we need for the project even
if we already have them in our global package cache,
just so that `renv` knows about them.
## All models are wrong, but some are useful
> _"All models are wrong, but some are useful"_
> --- George Box
What this means is that any model is but a simplification
of reality and must always omit details.
No model can depict the complete underlying
reality. However, models are useful, and to
understand what they are useful for, we must first look at
the different types of models out there.
### Types of Models
<aside>
<a href="https://www.tidymodels.org/">
![](images/tidymodels.svg){width=200}
</a>
</aside>
The [tidymodels book](https://www.tmwr.org/software-modeling.html#types-of-models)
names three types of models,
where any particular model can fall into multiple
categories at once:
1. **Descriptive Models**\
are purely used to describe the underlying
data to make patterns easier to see.
When we add a smooth line to a ggplot
with `geom_smooth`, the default method is a so-called LOESS curve,
which stands for Locally Estimated Scatterplot Smoothing.
It does produce insights by revealing patterns to us,
but by itself can not be used e.g. to make predictions.
It is just a pretty looking smooth line.
```{r echo=FALSE, out.width="20%"}
mpg %>%
ggplot(aes(displ, hwy)) +
geom_smooth() +
geom_point()
```
2. **Inferential Models**\
are designed to test hypotheses or
make decisions. They rely heavily on
our assumptions about the data
(e.g. what probability distribution
the populations follows) and will
be most likely encountered by you
to answer research questions.
They are the models that typically
produce a p-value, which you compare
to a threshold like we did last week.
3. **Predictive Models**\
are designed to process the data we
have and make predictions about
some response variable upon receiving
new data. When done correctly,
we also hold out some of the data
that our model never gets to see,
until it is time to evaluate and
test how it performs on unseen data.
Depending on how much we know
(or want to know) about the
underlying processes, we differentiate
between **mechanistic** models like
fitting a physically meaningful
function to data and **empirically driven** models, which are mainly
concerned with creating good
predictions, no matter the underlying
mechanism.
We will now explore different examples.
First, let me introduce our dataset for today:
## Say Hello to Spotify Data
I created a playlist on Spotify,
which is quite diverse, so that we can
look at a range of features. You can
even listen to it [here](https://open.spotify.com/playlist/2Ljg7QjvftQ7qNbGiM2Qzd?si=_DqZqE1xTmKoFqp898cftA) while you do the
exercises if you want. I am doing so as I write
this. The cool thing about Spotify is
that they have an API, an Application Programming Interface. APIs are ways for
computer programs to talk to each other.
So while we use the Spotify app to look up songs, computers
use the API to talk to the Spotify server.
And because R has a rich ecosystem of packages,
someone already wrote a package that allows R to talk to
this API: [`spotifyr`](https://www.rcharlie.com/spotifyr/index.html).
If you check out the R folder in this lecture,
you can see how I downloaded and processed the data about
the playlist. Note that the script will not work for you
right away, because you first need to register
with Spotify as a developer and then get a so-called token
(like a username and password combined into one long string) to be allowed
to send bots their way.
You probably just want to download the data from
my GitHub repository.
Let's have a look, shall we?
```{r}
songs <- read_csv("data/06/spotify_playlist.csv")
```
We can get a quick overview of all columns with:
```{r}
glimpse(songs)
```
Finally some decent numbers!
Not just these measly discrete values
we had last week.
For each song in the playlist, we get the artist,
the year it was released and a number of features, like
how danceable, how loud or how fast the song is.
You can easily imagine Spotify using these
numbers to suggest new songs based on the features
of those that you have listened to.
And in fact, we are going to lay the foundations
for such an algorithm today.
## Visualising Continuous Distributions
When dealing with a continuous distribution,
like we have for a lot of our features in the spotify
songs dataset, there are always multiple
ways to represent the same data.
First, we just look at the numbers. We will
use the `valence` values for our songs:
```{r}
head(songs$valence)
```
Notice anything interesting in the numbers?
I don't either.
Our brain is way better suited for looking at graphical representations,
so:
**To the ggplot cave**!
```{r}
songs %>%
ggplot(aes(x = "", y = valence)) +
geom_point()
```
This is kind of hard to see, because points overlap.
We can get a better picture of the distribution
by using transparency or a bit of jitter:
```{r}
songs %>%
ggplot(aes(x = "", y = valence)) +
geom_jitter(width = 0.1)
```
Using a histogram, we can put the points into bins
and get a plot similar to what we got for discrete values.
Note that the plot is flipped on its side now.
```{r}
songs %>%
ggplot(aes(valence)) +
geom_histogram()
```
And we might want to play around with the bin size to
get a better feel for the distribution.
Another way is to apply a smoothing function
and estimate the density of points along a continuous
range, even in places where we originally had no points:
```{r}
songs %>%
ggplot(aes(valence)) +
geom_density(fill = "midnightblue", alpha = 0.6)
```
Both of these plots can be misleading if the original
number of points is quite small, and in most cases
we are better off showing the actual individual
points as well. This is the reason why my first plots
were vertical: there is a cool way
of showing both the points and the distribution
while still having space to show multiple
distributions next to each other.
Imagine taking the density plot, turning it 90 degrees
and then mirroring it through the middle.
What we get is a so-called **violin plot**.
To overlay the points on top, we will use something
a little more predictable than `jitter` this time:
From the `ggbeeswarm` package I present: `geom_quasirandom`.
```{r}
songs %>%
ggplot(aes(x = "", y = valence)) +
geom_violin(fill = "midnightblue", alpha = 0.6) +
ggbeeswarm::geom_quasirandom(width = 0.35)
```
This is cool, because now we can easily
compare two different distributions
next to each other and still see all the individual
points.
For example, we might ask:
> "Do songs in a major key have a higher valence than
songs in a minor key in our dataset?"
```{r}
songs %>%
filter(!is.na(mode), !is.na(valence)) %>%
ggplot(aes(x = factor(mode), y = valence)) +
geom_violin(fill = "midnightblue", alpha = 0.6) +
ggbeeswarm::geom_quasirandom(width = 0.35) +
scale_x_discrete(labels = c(`0` = "minor", `1` = "major"))
```
> **Note**: This jittering only works because the
feature on the x-axis is discrete. If it were continuous,
we would be changing the data by jittering on the x-axis.
We might also want to add summaries like the mean for each
group to the plot with an additional marker.
This leads us to the general concept of **summary statistics**.
There are a number of them, and they can be quite useful
to, well, summarise a complex distribution.
But they can also be very misleading, as any simplification can be.
## Summary Statistics...
### Mean, Median (and other Quartiles), Range
Let us start by considering different things we
can say about our distribution in one number.
First, we might look at the range of our numbers,
the maximum and minimum.
We will do this per mode, so we can compare the values.
Next, we want to know the centers of the points.
There are different notions of being at the center
of the distribution.
The **mean** or *average* is the sum of all values
divided by the number of values.
The **median** is what we call a quantile, a point that
divides a distribution into equally sized parts,
specifically such that 50% of the values are below and 50%
are above the median.
```{r}
songs %>%
drop_na(valence, mode) %>%
group_by(mode) %>%
summarise(
min = min(valence),
max = max(valence),
mean = mean(valence),
median = median(valence)
)
```
It appears the valence can assume values between 0 and 1.
A shortcut for this is the `range` function:
```{r}
range(songs$valence)
```
The median is just one of the many percentiles
we can think of. If we display the 50th as
well as the 25th and 75th percentile on one plot,
we get what is called a boxplot:
```{r}
# in the lecture I used filter(!is.na(mode), !is.na(valence)),
# but drop_na is more elegant.
songs %>%
drop_na(valence, mode) %>%
ggplot(aes(x = factor(mode), y = valence)) +
geom_boxplot(fill = "midnightblue", alpha = 0.6, outlier.color = NA) +
ggbeeswarm::geom_quasirandom(width = 0.35) +
scale_x_discrete(labels = c(`0` = "minor", `1` = "major"))
```
The "whiskers" of the box extend to 1.5 times the height of the
box (the interquartile range) or to the most extreme data point,
whichever makes shorter whiskers.
Points that are more extreme than the whiskers are
labeled outliers by the boxplot and usually displayed as
their own points. Like with the violin plot,
we also have the option to plot the original un-summarized
points on top. In this case, we need to make sure
to change the outlier color for the boxplot to
`NA`, because otherwise we are plotting them twice.
This hints at one downside of boxplots
(when used without adding the raw datapoints as well):
The box is a very prominent focus point
of the plot, but by definition,
it only contains 50% of all datapoints.
The rest is delegated to thin whiskers.
### Variance
Finally, we want to know how far the values
scatter around their mean and the potential
population mean. This is encompassed
in two closely related measures: the **variance**
and the **standard deviation**.
For illustrative purposes, we can plot all datapoints
for e.g. the valence in the order in which they appear in the data
and add a line for the mean.
```{r valence-mean}
songs %>%
filter(!is.na(valence)) %>%
mutate(index = 1:n()) %>%
ggplot(aes(index, valence)) +
geom_segment(aes(y = mean(valence),
yend = mean(valence),
x = 0,
xend = length(valence))) +
geom_segment(aes(xend = index, yend = mean(valence)),
color = "darkred", alpha = 0.6) +
annotate(x = length(songs$valence),
y = mean(songs$valence, na.rm = TRUE),
label = "Mean",
geom = "text") +
geom_point()
```
> The variance is the expected value of the squared deviation
of a random variable from its mean.
In other words: take the squared distances of all points
to the mean, sum them (square the lengths of all the red lines in the plot
above and add them up) and then divide by $n-1$.
$$var(X) = \frac{\sum_{i=1}^{n}{(x_i-\bar x)^2}}{n-1}$$
"Hang on!", I hear you saying: "Why $n-1$?".
And it is an excellent question. The first
statement talked about an expected value.
(One example of an expected value is the mean,
which is the expected value of... well, the values.)
And indeed, an expected value often has
the term $1/n$. But the statement was talking
about the expected value (of the squared deviation)
for **the whole population**.
We can only use the uncorrected version when we
have the whole population (e.g. all songs that ever existed)
and want to talk about that population.
But usually, all we have is a **sample**, from
which we want to draw conclusions about the population.
But when we are using the sample to estimate
the variance of the population, it will be biased.
We can correct for this bias by using $n-1$ instead
of $n$.
This is known as Bessel's correction.
I have yet to come across a really intuitive explanation,
but here is one idea: the thing we are dividing
by is not necessarily the sample size whenever
we calculate the expected
value of an estimator; it just happens to be the
sample size in a bunch of cases.
What the term really represents here is the
**degrees of freedom** (DF) of the deviations.
DFs can be thought of as the number of independent things.
The degrees of freedom are $n$ reduced by $1$, because
if we know the mean of a sample (we use it in our
calculation), once we know all but $1$
of the individual values, the last value is automatically
known and thus doesn't count towards the degrees of freedom.
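To make the $n-1$ denominator concrete, here is a small sketch (with made-up numbers) showing that R's built-in `var` uses exactly this corrected formula:

```{r}
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)
# variance by hand: summed squared deviations divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)
# R's var() applies the same Bessel-corrected denominator
all.equal(manual_var, var(x))
```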
### Standard deviation
Next up: the **Standard Deviation** (SD) is the square root
of the variance. It is more commonly used for error
bars, because the square root inverts the squaring that
was done to get the variance, so we are back
in the units of the data.
$$\sigma_X=\sqrt{var(X)}$$
### Standard Error of the Mean
Finally, we have the **Standard Error of the Mean**,
sometimes only called Standard Error (SEM, SE).
It is also used very commonly in error bars.
The reason a lot of people favor it over
the SD might just be that it is smaller,
but the two have distinct use cases.
$$SEM=\sigma / \sqrt{n}$$
We take the standard deviation and divide it
by the square-root of $n$. Imagine this:
We actually have the whole population available.
Like for example all penguins on earth.
And then we repeatedly take samples of
size $n$. The means of these individual samples
will vary, so they have their own mean,
standard deviation and variance. The
standard error is the standard deviation of these means.
So it is a measure of how far the means of repeated samples
scatter around the true population mean.
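This idea is easy to simulate. In the following sketch (purely made-up numbers, pretending we know the whole population), the standard deviation of many sample means comes out close to $\sigma / \sqrt{n}$:

```{r}
set.seed(42)
sigma <- 2
n <- 25
# draw 10000 samples of size n from the "population" and record each mean
sample_means <- replicate(10000, mean(rnorm(n, mean = 0, sd = sigma)))
# the SD of these means is approximately the theoretical SEM
sd(sample_means)
sigma / sqrt(n)
```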
However, we don't usually have the whole population!
Measuring some property of all penguins in the
world takes a long time, and running
an experiment in the lab for all cells that exist
and will ever exist takes an infinite amount of
time. This is probably more than our research grant
money can finance.
So, instead, the Standard Error of the Mean
uses the standard deviation of our sample in
the formula above. It is our best estimate
for the standard deviation of the whole population.
So, when you are trying to make inferences
about the mean of the whole population based on your sample,
it makes sense to also give the SEM as a way of
quantifying the uncertainty.
While R has the functions `sd`, `mean` and `var`,
there is no built-in function for the SEM,
but we can easily write one ourselves:
```{r}
sem <- function(x) sd(x) / sqrt(length(x))
```
```{r}
songs %>%
drop_na(mode, valence) %>%
group_by(mode) %>%
summarise(
mean = mean(valence),
var = var(valence),
sd = sd(valence),
sem = sem(valence)
)
```
## ... or: How to Lie with Graphs
However, be very wary of simple bar graphs with error bars;
there is a lot that can be misleading about them.
```{r horrible-plot}
songs %>%
drop_na(speechiness, mode) %>%
group_by(mode) %>%
summarise(across(speechiness, list(m = mean, sd = sd, sem = sem))) %>%
ggplot(aes(factor(mode), speechiness_m, fill = factor(mode))) +
geom_errorbar(aes(ymin = speechiness_m - speechiness_sem,
ymax = speechiness_m + speechiness_sem,
color = factor(mode)),
size = 1.3, width = 0.3, show.legend = FALSE) +
geom_col(size = 1.3, show.legend = FALSE) +
coord_cartesian(ylim = c(0.06, 0.08)) +
scale_fill_manual(values = c("#1f6293", "#323232")) +
scale_color_manual(values = c("#1f6293", "#323232")) +
labs(title = "Don't Do This at Home!",
y = "Speechiness",
x = "Mode (Minor / Major)") +
theme(
plot.title = element_text(size = 44, family = "Daubmark",
color = "darkred")
)
```
When people say "The y-axis has to include 0",
this is the reason for it. It is not always true
when there is another sensible baseline that is not 0,
but especially for barplots, not having the y-axis start
at 0 is about the most misleading thing you can do.
The main reason for this is that humans perceive
the height of the bars via their area,
and this is no longer proportional when the bars
don't start at 0.
This plot also gives no indication of the type of
error bars used or the sample size in each group.
It uses the `speechiness` feature, but it hides the
actual distribution behind just 2 numbers
(mean and SEM) per group.
```{r}
songs %>%
ggplot(aes(speechiness, color = factor(mode), fill = factor(mode))) +
geom_density(alpha = 0.3)
```
So the next time you see a barplot ask the question:
![Summary statistics by @ArtworkAllisonHorst](images/summary_statistics.png)
I hope you can take some inspiration from
this chapter and now have the vocabulary to
know where to look when it comes to your own data.
## Graphic Devices, Fonts and the ggplot Book
### ggplot book
Firstly, for all things `ggplot`, the third edition
of the **ggplot book** is currently being worked on
by three absolute legends of their craft
[@WelcomeGgplot2; @wickhamGgplot2ElegantGraphics2016].
[Hadley Wickham](http://hadley.nz/) is the author of the original ggplot and
ggplot2 package, [Danielle Navarro](https://djnavarro.net/)
makes amazing artwork
with and teaches ggplot and
[Thomas Lin Pedersen](https://www.data-imaginist.com/about)
is the current maintainer of ggplot2 and constantly
makes cool features for it.
The book, while still under development, is already
available online for free: <https://ggplot2-book.org/>.
### [Sidenote] Graphics Devices
Secondly, we need to briefly talk about a concept
we have only brushed by: **graphics devices** are
to R what your printer is to your computer.
When we create a plot in R, it starts out as mere numbers,
but something has to turn these numbers into
pixels (in the case of raster-images) or vectors
(in the case of vector images; you might know svg or pdf files.
Sorry, but these are not the vectors in R but rather
descriptions of lines).
This is the job of the graphics device.
When we use the `ggsave` function for example,
it figures out what to use based on the file extension,
but we can also specify it manually.
I am mentioning this here, because
in the plot I just showed you, I used a different font
than the default. This is something that can be
incredibly tricky for graphics devices,
because fonts are handled differently on every operating
system. Luckily, it is about to get way easier,
because Thomas Lin Pedersen is working on another
package, a graphics device, that is both really
fast and works well with fonts.
You can check the current development version here:
<https://ragg.r-lib.org/>
Here are some examples of using graphics devices manually
by opening the device first and then finalizing the plot
by closing the device:
```{r}
#| eval: false
png("myplot.png")
songs %>%
ggplot(aes(speechiness, color = factor(mode), fill = factor(mode))) +
geom_density(alpha = 0.3)
dev.off()
```
```{r}
#| eval: false
svg("myplot.svg")
songs %>%
ggplot(aes(speechiness, color = factor(mode), fill = factor(mode))) +
geom_density(alpha = 0.3)
dev.off()
```
Or by manually specifying the device in `ggsave`.
```{r}
#| eval: false
plt <- songs %>%
ggplot(aes(speechiness, color = factor(mode), fill = factor(mode))) +
geom_density(alpha = 0.3)
ggsave("newplot.png", plt, device = ragg::agg_png)
```
## The Normal Distribution and the Central Limit Theorem
There are many different distributions out there.
Luckily, one of them is quite special and can
be used in a multitude of settings.
It is the harmlessly named **Normal Distribution**.
R has the usual functions for it (density,
probability, quantile, random).
```{r, out.width="50%", fig.show='hold'}
tibble(x = seq(-3, 3, 0.01)) %>%
ggplot(aes(x)) +
geom_function(fun = dnorm) +
stat_function(geom = "area", fun = dnorm,
fill = "darkblue", alpha = 0.3) +
labs(y = "density", title = "Normal Distribution Density")
tibble(x = seq(-3, 3, 0.01)) %>%
ggplot(aes(x)) +
geom_function(fun = pnorm) +
labs(y = "probability", title = "Cumulative Probability")
```
Now, why is this distribution so special?
> The **Central Limit Theorem** (CLT) states that the
sample mean of a sufficiently large number
of independent random variables is approximately
normally distributed.
The larger the sample, the better the approximation.
For a great visualization of the central limit
theorem, check out this interactive tutorial by
[Seeing Theory](https://seeing-theory.brown.edu/probability-distributions/index.html#section3).
Because a lot of values we measure are actually the
sum of many random processes, distributions of things
we measure can often be approximated with a normal distribution.
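To see the CLT at work, we can sum values from a distribution that is decidedly not normal, the uniform distribution (a quick simulated sketch, not the songs data):

```{r}
set.seed(1)
# each observation is the sum of 30 independent uniform values;
# a single uniform value is flat, but the sums pile up in a bell shape
sums <- replicate(5000, sum(runif(30)))
hist(sums, breaks = 40, main = "Sums of 30 uniform values")
```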
We can visually test if some values follow the normal
distribution by using a quantile-quantile plot,
which plots the quantiles of our sample against
where the quantiles should be on the normal distribution.
If the points fall on a straight line, the sample is (approximately) normally distributed.
```{r}
qqnorm(songs$valence)
qqline(songs$valence, col = "red")
```
The values close to the mean are pretty normal,
but the tails of the distribution stray further
from the normal distribution. There are way more
very small and very large values than would
be expected from a normal distribution.
### Log-normality
There is one thing that comes up a lot in biological data:
because a lot of processes in biology are reliant on
signal cascades, they tend to be the result of many
multiplicative effects, rather than additive effects,
as would be required for the Central Limit Theorem.
As a result, they are not distributed normally,
but rather log-normally,
because taking the logarithm of all values
transforms multiplicative effects into additive effects!
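A simulated sketch of this (made-up multiplicative effects): the product of many positive random factors is skewed, but its logarithm, being a sum, looks approximately normal again:

```{r}
set.seed(1)
# each value is the product of 20 random multiplicative effects
products <- replicate(5000, prod(runif(20, min = 0.5, max = 1.5)))
# the raw products are right-skewed ...
hist(products, breaks = 40, main = "Products")
# ... but the log turns the products into sums, so the CLT applies
hist(log(products), breaks = 40, main = "log(Products)")
```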
## The T-Distribution
The CLT is only valid for **large sample sizes**.
For smaller sample sizes, the distribution of
means has fatter tails than a normal distribution.
This is why for most statistical tests,
we use the **t-distribution** instead of the
normal distribution.
As the degrees of freedom get higher, the
t-distribution approaches the normal distribution.
```{r tdist, fig.cap="t-distributions; normal distribution in black."}
base <- ggplot() + xlim(-5, 5)
base +
geom_function(aes(colour = "t, df = 1"), fun = dt, args = list(df = 1), size = 1.2) +
geom_function(aes(colour = "t, df = 3"), fun = dt, args = list(df = 3), size = 1.2) +
geom_function(aes(colour = "t, df = 30"), fun = dt, args = list(df = 30), size = 1.2) +
geom_function(aes(colour = "normal"), fun = dnorm, size = 1.2) +
guides(color = guide_legend(title = "")) +
scale_color_viridis_d()
```
Remember the valence plot by mode?
```{r}
songs %>%
ggplot(aes(factor(mode), valence)) +
geom_violin(fill = "darkblue", alpha = 0.3) +
ggbeeswarm::geom_quasirandom(alpha = 0.6)
```
For our purposes we are going to treat these two distributions
as close enough to a normal distribution at first
so that we can look at some hypothesis tests:
## Student's T-Test
The first test is called Student's t-test. "Student"
was the pseudonym of its inventor, and the "t" stands
for the t-distribution. We can use it to test
the null hypothesis that two samples come from the
same (approximately normal) distribution.
```{r}
t.test(valence ~ mode, data = songs)
```
The two samples are so similar that it is quite likely for those
values to have come from the same distribution, so we would not
reject the null hypothesis.
Let us pretend for a moment that there is in fact a
difference by creating some fake data (don't do this in the lab...).
```{r}
fake_songs <- songs %>%
drop_na(mode, valence) %>%
mutate(valence = if_else(mode == 1, valence + 0.2, valence))
```
```{r}
fake_songs %>%
ggplot(aes(valence, color = factor(mode), fill = factor(mode))) +
geom_density(alpha = 0.3)
```
Now we end up with a statistically significant p-value.
```{r}
t.test(valence ~ mode, data = fake_songs)
```
Note that the p-value itself says nothing about the effect size,
the difference in means between the samples.
You can get a significant p-value either by showing a tiny
difference with lots of data points or by showing a larger
difference with fewer data points.
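We can illustrate this with simulated data (made-up numbers, not the songs dataset): a negligible true difference in means still yields a tiny p-value once the sample is huge:

```{r}
set.seed(7)
# a tiny true difference of 0.02 between the group means ...
group_a <- rnorm(2e5, mean = 0)
group_b <- rnorm(2e5, mean = 0.02)
# ... is still "statistically significant" with this many data points
t.test(group_a, group_b)$p.value
```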
Tests that rely on the assumption of normality
are called **parametric tests**, but when this
assumption cannot be met, we need **non-parametric tests**.
## Wilcoxon rank-sum test
The Wilcoxon rank-sum test, or
Mann–Whitney U test, is one of these.
It gets around the assumption of normality by
transforming the data into **ranks** first.
That is, all points (independent of group) are
ordered and their values replaced by their
position in the ordering (their rank).
If we think of the t-test as testing for a
difference in means, we can think of the
Wilcoxon rank-sum test as testing for a difference
in medians.
```{r}
x <- c(1, 3, 2, 42, 5, 1000)
x
rank(x)
```
```{r}
wilcox.test(valence ~ mode, data = songs)
```
### Direction of Testing
Both tests have the argument `alternative`,
which can be any of `c("two.sided", "less", "greater")`.
This is the direction of our alternative hypothesis.
Are we testing for x being greater or less than y?
Or are we testing for a difference in any direction (the default)?
Having a hypothesis about the direction beforehand will
result in smaller p-values (half of the two-sided ones),
but you need to have this hypothesis before looking at
the data, and especially not after running e.g. the
two-sided test and then deciding that you want a
smaller p-value! This is not how p-values work.
```{r}
t.test(valence ~ mode, data = fake_songs, alternative = "less")
```
```{r}
fake_songs %>%
ggplot(aes(valence, color = factor(mode), fill = factor(mode))) +
geom_density(alpha = 0.3)
```
If you are unsure about how to tell the functions
which of two groups is supposed to be greater or lesser,
you can also supply the data as `x` and `y` instead
of using the formula interface as I did above:
```{r}