-
Notifications
You must be signed in to change notification settings - Fork 4.2k
/
Copy pathfunctions.qmd
899 lines (671 loc) · 31 KB
/
functions.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
# Functions {#sec-functions}
```{r}
#| echo: false
source("_common.R")
```
## Introduction
One of the best ways to improve your reach as a data scientist is to write functions.
Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting.
Writing a function has four big advantages over using copy-and-paste:
1. You can give a function an evocative name that makes your code easier to understand.
2. As requirements change, you only need to update code in one place, instead of many.
3. You eliminate the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another).
4. It makes it easier to reuse work from project-to-project, increasing your productivity over time.
A good rule of thumb is to consider writing a function whenever you've copied and pasted a block of code more than twice (i.e. you now have three copies of the same code).
In this chapter, you'll learn about three useful types of functions:
- Vector functions take one or more vectors as input and return a vector as output.
- Data frame functions take a data frame as input and return a data frame as output.
- Plot functions that take a data frame as input and return a plot as output.
Each of these sections includes many examples to help you generalize the patterns that you see.
These examples wouldn't be possible without the help of folks of twitter, and we encourage you to follow the links in the comments to see the original inspirations.
You might also want to read the original motivating tweets for [general functions](https://twitter.com/hadleywickham/status/1571603361350164486) and [plotting functions](https://twitter.com/hadleywickham/status/1574373127349575680) to see even more functions.
### Prerequisites
We'll wrap up a variety of functions from around the tidyverse.
We'll also use nycflights13 as a source of familiar data to use our functions with.
```{r}
#| message: false
library(tidyverse)
library(nycflights13)
```
## Vector functions
We'll begin with vector functions: functions that take one or more vectors and return a vector result.
For example, take a look at this code.
What does it do?
```{r}
df <- tibble(
a = rnorm(5),
b = rnorm(5),
c = rnorm(5),
d = rnorm(5),
)
df |> mutate(
a = (a - min(a, na.rm = TRUE)) /
(max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
b = (b - min(a, na.rm = TRUE)) /
(max(b, na.rm = TRUE) - min(b, na.rm = TRUE)),
c = (c - min(c, na.rm = TRUE)) /
(max(c, na.rm = TRUE) - min(c, na.rm = TRUE)),
d = (d - min(d, na.rm = TRUE)) /
(max(d, na.rm = TRUE) - min(d, na.rm = TRUE)),
)
```
You might be able to puzzle out that this rescales each column to have a range from 0 to 1.
But did you spot the mistake?
When Hadley wrote this code he made an error when copying-and-pasting and forgot to change an `a` to a `b`.
Preventing this type of mistake is one very good reason to learn how to write functions.
### Writing a function
To write a function you need to first analyse your repeated code to figure what parts are constant and what parts vary.
If we take the code above and pull it outside of `mutate()`, it's a little easier to see the pattern because each repetition is now one line:
```{r}
#| eval: false
(a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE))
(b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE))
(c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))
(d - min(d, na.rm = TRUE)) / (max(d, na.rm = TRUE) - min(d, na.rm = TRUE))
```
To make this a bit clearer we can replace the bit that varies with `█`:
```{r}
#| eval: false
(█ - min(█, na.rm = TRUE)) / (max(█, na.rm = TRUE) - min(█, na.rm = TRUE))
```
To turn this into a function you need three things:
1. A **name**.
Here we'll use `rescale01` because this function rescales a vector to lie between 0 and 1.
2. The **arguments**.
The arguments are things that vary across calls and our analysis above tells us that we have just one.
We'll call it `x` because this is the conventional name for a numeric vector.
3. The **body**.
The body is the code that's repeated across all the calls.
Then you create a function by following the template:
```{r}
name <- function(arguments) {
body
}
```
For this case that leads to:
```{r}
rescale01 <- function(x) {
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
```
At this point you might test with a few simple inputs to make sure you've captured the logic correctly:
```{r}
rescale01(c(-10, 0, 10))
rescale01(c(1, 2, 3, NA, 5))
```
Then you can rewrite the call to `mutate()` as:
```{r}
df |> mutate(
a = rescale01(a),
b = rescale01(b),
c = rescale01(c),
d = rescale01(d),
)
```
(In @sec-iteration, you'll learn how to use `across()` to reduce the duplication even further so all you need is `df |> mutate(across(a:d, rescale01))`).
### Improving our function
You might notice that the `rescale01()` function does some unnecessary work --- instead of computing `min()` twice and `max()` once we could instead compute both the minimum and maximum in one step with `range()`:
```{r}
rescale01 <- function(x) {
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}
```
Or you might try this function on a vector that includes an infinite value:
```{r}
x <- c(1:10, Inf)
rescale01(x)
```
That result is not particularly useful so we could ask `range()` to ignore infinite values:
```{r}
rescale01 <- function(x) {
rng <- range(x, na.rm = TRUE, finite = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}
rescale01(x)
```
These changes illustrate an important benefit of functions: because we've moved the repeated code into a function, we only need to make the change in one place.
### Mutate functions
Now that you've got the basic idea of functions, let's take a look at a whole bunch of examples.
We'll start by looking at "mutate" functions, i.e. functions that work well inside of `mutate()` and `filter()` because they return an output of the same length as the input.
Let's start with a simple variation of `rescale01()`.
Maybe you want to compute the Z-score, rescaling a vector to have a mean of zero and a standard deviation of one:
```{r}
z_score <- function(x) {
(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}
```
Or maybe you want to wrap up a straightforward `case_when()` and give it a useful name.
For example, this `clamp()` function ensures all values of a vector lie in between a minimum or a maximum:
```{r}
clamp <- function(x, min, max) {
case_when(
x < min ~ min,
x > max ~ max,
.default = x
)
}
clamp(1:10, min = 3, max = 7)
```
Of course functions don't just need to work with numeric variables.
You might want to do some repeated string manipulation.
Maybe you need to make the first character upper case:
```{r}
first_upper <- function(x) {
str_sub(x, 1, 1) <- str_to_upper(str_sub(x, 1, 1))
x
}
first_upper("hello")
```
Or maybe you want to strip percent signs, commas, and dollar signs from a string before converting it into a number:
```{r}
# https://twitter.com/NVlabormarket/status/1571939851922198530
clean_number <- function(x) {
is_pct <- str_detect(x, "%")
num <- x |>
str_remove_all("%") |>
str_remove_all(",") |>
str_remove_all(fixed("$")) |>
as.numeric()
if_else(is_pct, num / 100, num)
}
clean_number("$12,300")
clean_number("45%")
```
Sometimes your functions will be highly specialized for one data analysis step.
For example, if you have a bunch of variables that record missing values as 997, 998, or 999, you might want to write a function to replace them with `NA`:
```{r}
fix_na <- function(x) {
if_else(x %in% c(997, 998, 999), NA, x)
}
```
We've focused on examples that take a single vector because we think they're the most common.
But there's no reason that your function can't take multiple vector inputs.
### Summary functions
Another important family of vector functions is summary functions, functions that return a single value for use in `summarize()`.
Sometimes this can just be a matter of setting a default argument or two:
```{r}
commas <- function(x) {
str_flatten(x, collapse = ", ", last = " and ")
}
commas(c("cat", "dog", "pigeon"))
```
Or you might wrap up a simple computation, like for the coefficient of variation, which divides the standard deviation by the mean:
```{r}
cv <- function(x, na.rm = FALSE) {
sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}
cv(runif(100, min = 0, max = 50))
cv(runif(100, min = 0, max = 500))
```
Or maybe you just want to make a common pattern easier to remember by giving it a memorable name:
```{r}
# https://twitter.com/gbganalyst/status/1571619641390252033
n_missing <- function(x) {
sum(is.na(x))
}
```
You can also write functions with multiple vector inputs.
For example, maybe you want to compute the mean absolute percentage error to help you compare model predictions with actual values:
```{r}
# https://twitter.com/neilgcurrie/status/1571607727255834625
mape <- function(actual, predicted) {
sum(abs((actual - predicted) / actual)) / length(actual)
}
```
::: callout-note
## RStudio
Once you start writing functions, there are two RStudio shortcuts that are super useful:
- To find the definition of a function that you've written, place the cursor on the name of the function and press `F2`.
- To quickly jump to a function, press `Ctrl + .` to open the fuzzy file and function finder and type the first few letters of your function name.
You can also navigate to files, Quarto sections, and more, making it a very handy navigation tool.
:::
### Exercises
1. Practice turning the following code snippets into functions.
Think about what each function does.
What would you call it?
How many arguments does it need?
```{r}
#| eval: false
mean(is.na(x))
mean(is.na(y))
mean(is.na(z))
x / sum(x, na.rm = TRUE)
y / sum(y, na.rm = TRUE)
z / sum(z, na.rm = TRUE)
round(x / sum(x, na.rm = TRUE) * 100, 1)
round(y / sum(y, na.rm = TRUE) * 100, 1)
round(z / sum(z, na.rm = TRUE) * 100, 1)
```
2. In the second variant of `rescale01()`, infinite values are left unchanged.
Can you rewrite `rescale01()` so that `-Inf` is mapped to 0, and `Inf` is mapped to 1?
3. Given a vector of birthdates, write a function to compute the age in years.
4. Write your own functions to compute the variance and skewness of a numeric vector.
You can look up the definitions on Wikipedia or elsewhere.
5. Write `both_na()`, a summary function that takes two vectors of the same length and returns the number of positions that have an `NA` in both vectors.
6. Read the documentation to figure out what the following functions do.
Why are they useful even though they are so short?
```{r}
is_directory <- function(x) {
file.info(x)$isdir
}
is_readable <- function(x) {
file.access(x, 4) == 0
}
```
## Data frame functions
Vector functions are useful for pulling out code that's repeated within a dplyr verb.
But you'll often also repeat the verbs themselves, particularly within a large pipeline.
When you notice yourself copying and pasting multiple verbs multiple times, you might think about writing a data frame function.
Data frame functions work like dplyr verbs: they take a data frame as the first argument, some extra arguments that say what to do with it, and return a data frame or a vector.
To let you write a function that uses dplyr verbs, we'll first introduce you to the challenge of indirection and how you can overcome it with embracing, `{{ }}`.
With this theory under your belt, we'll then show you a bunch of examples to illustrate what you might do with it.
### Indirection and tidy evaluation
When you start writing functions that use dplyr verbs you rapidly hit the problem of indirection.
Let's illustrate the problem with a very simple function: `grouped_mean()`.
The goal of this function is to compute the mean of `mean_var` grouped by `group_var`:
```{r}
grouped_mean <- function(df, group_var, mean_var) {
df |>
group_by(group_var) |>
summarize(mean(mean_var))
}
```
If we try and use it, we get an error:
```{r}
#| error: true
diamonds |> grouped_mean(cut, carat)
```
To make the problem a bit more clear, we can use a made up data frame:
```{r}
df <- tibble(
mean_var = 1,
group_var = "g",
group = 1,
x = 10,
y = 100
)
df |> grouped_mean(group, x)
df |> grouped_mean(group, y)
```
Regardless of how we call `grouped_mean()` it always does `df |> group_by(group_var) |> summarize(mean(mean_var))`, instead of `df |> group_by(group) |> summarize(mean(x))` or `df |> group_by(group) |> summarize(mean(y))`.
This is a problem of indirection, and it arises because dplyr uses **tidy evaluation** to allow you to refer to the names of variables inside your data frame without any special treatment.
Tidy evaluation is great 95% of the time because it makes your data analyses very concise as you never have to say which data frame a variable comes from; it's obvious from the context.
The downside of tidy evaluation comes when we want to wrap up repeated tidyverse code into a function.
Here we need some way to tell `group_by()` and `summarize()` not to treat `group_var` and `mean_var` as the name of the variables, but instead look inside them for the variable we actually want to use.
Tidy evaluation includes a solution to this problem called **embracing** 🤗.
Embracing a variable means to wrap it in braces so (e.g.) `var` becomes `{{ var }}`.
Embracing a variable tells dplyr to use the value stored inside the argument, not the argument as the literal variable name.
One way to remember what's happening is to think of `{{ }}` as looking down a tunnel --- `{{ var }}` will make a dplyr function look inside of `var` rather than looking for a variable called `var`.
So to make `grouped_mean()` work, we need to surround `group_var` and `mean_var` with `{{ }}`:
```{r}
grouped_mean <- function(df, group_var, mean_var) {
df |>
group_by({{ group_var }}) |>
summarize(mean({{ mean_var }}))
}
df |> grouped_mean(group, x)
```
Success!
### When to embrace? {#sec-embracing}
So the key challenge in writing data frame functions is figuring out which arguments need to be embraced.
Fortunately, this is easy because you can look it up from the documentation 😄.
There are two terms to look for in the docs which correspond to the two most common sub-types of tidy evaluation:
- **Data-masking**: this is used in functions like `arrange()`, `filter()`, and `summarize()` that compute with variables.
- **Tidy-selection**: this is used for functions like `select()`, `relocate()`, and `rename()` that select variables.
Your intuition about which arguments use tidy evaluation should be good for many common functions --- just think about whether you can compute (e.g., `x + 1`) or select (e.g., `a:x`).
In the following sections, we'll explore the sorts of handy functions you might write once you understand embracing.
### Common use cases
If you commonly perform the same set of summaries when doing initial data exploration, you might consider wrapping them up in a helper function:
```{r}
summary6 <- function(data, var) {
data |> summarize(
min = min({{ var }}, na.rm = TRUE),
mean = mean({{ var }}, na.rm = TRUE),
median = median({{ var }}, na.rm = TRUE),
max = max({{ var }}, na.rm = TRUE),
n = n(),
n_miss = sum(is.na({{ var }})),
.groups = "drop"
)
}
diamonds |> summary6(carat)
```
(Whenever you wrap `summarize()` in a helper, we think it's good practice to set `.groups = "drop"` to both avoid the message and leave the data in an ungrouped state.)
The nice thing about this function is, because it wraps `summarize()`, you can use it on grouped data:
```{r}
diamonds |>
group_by(cut) |>
summary6(carat)
```
Furthermore, since the arguments to summarize are data-masking, so is the `var` argument to `summary6()`.
That means you can also summarize computed variables:
```{r}
diamonds |>
group_by(cut) |>
summary6(log10(carat))
```
To summarize multiple variables, you'll need to wait until @sec-across, where you'll learn how to use `across()`.
Another popular `summarize()` helper function is a version of `count()` that also computes proportions:
```{r}
# https://twitter.com/Diabb6/status/1571635146658402309
count_prop <- function(df, var, sort = FALSE) {
df |>
count({{ var }}, sort = sort) |>
mutate(prop = n / sum(n))
}
diamonds |> count_prop(clarity)
```
This function has three arguments: `df`, `var`, and `sort`, and only `var` needs to be embraced because it's passed to `count()` which uses data-masking for all variables.
Note that we use a default value for `sort` so that if the user doesn't supply their own value it will default to `FALSE`.
Or maybe you want to find the sorted unique values of a variable for a subset of the data.
Rather than supplying a variable and a value to do the filtering, we'll allow the user to supply a condition:
```{r}
unique_where <- function(df, condition, var) {
df |>
filter({{ condition }}) |>
distinct({{ var }}) |>
arrange({{ var }})
}
# Find all the destinations in December
flights |> unique_where(month == 12, dest)
```
Here we embrace `condition` because it's passed to `filter()` and `var` because it's passed to `distinct()` and `arrange()`.
We've made all these examples to take a data frame as the first argument, but if you're working repeatedly with the same data, it can make sense to hardcode it.
For example, the following function always works with the flights dataset and always selects `time_hour`, `carrier`, and `flight` since they form the compound primary key that allows you to identify a row.
```{r}
subset_flights <- function(rows, cols) {
flights |>
filter({{ rows }}) |>
select(time_hour, carrier, flight, {{ cols }})
}
```
### Data-masking vs. tidy-selection
Sometimes you want to select variables inside a function that uses data-masking.
For example, imagine you want to write a `count_missing()` that counts the number of missing observations in rows.
You might try writing something like:
```{r}
#| error: true
count_missing <- function(df, group_vars, x_var) {
df |>
group_by({{ group_vars }}) |>
summarize(
n_miss = sum(is.na({{ x_var }})),
.groups = "drop"
)
}
flights |>
count_missing(c(year, month, day), dep_time)
```
This doesn't work because `group_by()` uses data-masking, not tidy-selection.
We can work around that problem by using the handy `pick()` function, which allows you to use tidy-selection inside data-masking functions:
```{r}
count_missing <- function(df, group_vars, x_var) {
df |>
group_by(pick({{ group_vars }})) |>
summarize(
n_miss = sum(is.na({{ x_var }})),
.groups = "drop"
)
}
flights |>
count_missing(c(year, month, day), dep_time)
```
Another convenient use of `pick()` is to make a 2d table of counts.
Here we count using all the variables in the `rows` and `columns`, then use `pivot_wider()` to rearrange the counts into a grid:
```{r}
# https://twitter.com/pollicipes/status/1571606508944719876
count_wide <- function(data, rows, cols) {
data |>
count(pick(c({{ rows }}, {{ cols }}))) |>
pivot_wider(
names_from = {{ cols }},
values_from = n,
names_sort = TRUE,
values_fill = 0
)
}
diamonds |> count_wide(c(clarity, color), cut)
```
While our examples have mostly focused on dplyr, tidy evaluation also underpins tidyr, and if you look at the `pivot_wider()` docs you can see that `names_from` uses tidy-selection.
### Exercises
1. Using the datasets from nycflights13, write a function that:
1. Finds all flights that were cancelled (i.e. `is.na(arr_time)`) or delayed by more than an hour.
```{r}
#| eval: false
flights |> filter_severe()
```
2. Counts the number of cancelled flights and the number of flights delayed by more than an hour.
```{r}
#| eval: false
flights |> group_by(dest) |> summarize_severe()
```
3. Finds all flights that were cancelled or delayed by more than a user supplied number of hours:
```{r}
#| eval: false
flights |> filter_severe(hours = 2)
```
4. Summarizes the weather to compute the minimum, mean, and maximum, of a user supplied variable:
```{r}
#| eval: false
weather |> summarize_weather(temp)
```
5. Converts the user supplied variable that uses clock time (e.g., `dep_time`, `arr_time`, etc.) into a decimal time (i.e. hours + (minutes / 60)).
```{r}
#| eval: false
flights |> standardize_time(sched_dep_time)
```
2. For each of the following functions list all arguments that use tidy evaluation and describe whether they use data-masking or tidy-selection: `distinct()`, `count()`, `group_by()`, `rename_with()`, `slice_min()`, `slice_sample()`.
3. Generalize the following function so that you can supply any number of variables to count.
```{r}
count_prop <- function(df, var, sort = FALSE) {
df |>
count({{ var }}, sort = sort) |>
mutate(prop = n / sum(n))
}
```
## Plot functions
Instead of returning a data frame, you might want to return a plot.
Fortunately, you can use the same techniques with ggplot2, because `aes()` is a data-masking function.
For example, imagine that you're making a lot of histograms:
```{r}
#| fig-show: hide
diamonds |>
ggplot(aes(x = carat)) +
geom_histogram(binwidth = 0.1)
diamonds |>
ggplot(aes(x = carat)) +
geom_histogram(binwidth = 0.05)
```
Wouldn't it be nice if you could wrap this up into a histogram function?
This is easy as pie once you know that `aes()` is a data-masking function and you need to embrace:
```{r}
#| fig-alt: |
#| A histogram of carats of diamonds, ranging from 0 to 5, showing a unimodal,
#| right-skewed distribution with a peak between 0 to 1 carats.
histogram <- function(df, var, binwidth = NULL) {
df |>
ggplot(aes(x = {{ var }})) +
geom_histogram(binwidth = binwidth)
}
diamonds |> histogram(carat, 0.1)
```
Note that `histogram()` returns a ggplot2 plot, meaning you can still add on additional components if you want.
Just remember to switch from `|>` to `+`:
```{r}
#| fig.show: hide
diamonds |>
histogram(carat, 0.1) +
labs(x = "Size (in carats)", y = "Number of diamonds")
```
### More variables
It's straightforward to add more variables to the mix.
For example, maybe you want an easy way to eyeball whether or not a dataset is linear by overlaying a smooth line and a straight line:
```{r}
#| fig-alt: |
#| Scatterplot of height vs. mass of StarWars characters showing a positive
#| relationship. A smooth curve of the relationship is plotted in red, and
#| the best fit line is ploted in blue.
# https://twitter.com/tyler_js_smith/status/1574377116988104704
linearity_check <- function(df, x, y) {
df |>
ggplot(aes(x = {{ x }}, y = {{ y }})) +
geom_point() +
geom_smooth(method = "loess", formula = y ~ x, color = "red", se = FALSE) +
geom_smooth(method = "lm", formula = y ~ x, color = "blue", se = FALSE)
}
starwars |>
filter(mass < 1000) |>
linearity_check(mass, height)
```
Or maybe you want an alternative to colored scatterplots for very large datasets where overplotting is a problem:
```{r}
#| fig-alt: |
#| Hex plot of price vs. carat of diamonds showing a positive relationship.
#| There are more diamonds that are less than 2 carats than more than 2 carats.
# https://twitter.com/ppaxisa/status/1574398423175921665
hex_plot <- function(df, x, y, z, bins = 20, fun = "mean") {
df |>
ggplot(aes(x = {{ x }}, y = {{ y }}, z = {{ z }})) +
stat_summary_hex(
aes(color = after_scale(fill)), # make border same color as fill
bins = bins,
fun = fun,
)
}
diamonds |> hex_plot(carat, price, depth)
```
### Combining with other tidyverse
Some of the most useful helpers combine a dash of data manipulation with ggplot2.
For example, if you might want to do a vertical bar chart where you automatically sort the bars in frequency order using `fct_infreq()`.
Since the bar chart is vertical, we also need to reverse the usual order to get the highest values at the top:
```{r}
#| fig-alt: |
#| Bar plot of clarify of diamonds, where clarity is on the y-axis and counts
#| are on the x-axis, and the bars are ordered in order of frequency: SI1,
#| VS2, SI2, VS1, VVS2, VVS1, IF, I1.
sorted_bars <- function(df, var) {
df |>
mutate({{ var }} := fct_rev(fct_infreq({{ var }}))) |>
ggplot(aes(y = {{ var }})) +
geom_bar()
}
diamonds |> sorted_bars(clarity)
```
We have to use a new operator here, `:=` (commonly referred to as the "walrus operator"), because we are generating the variable name based on user-supplied data.
Variable names go on the left hand side of `=`, but R's syntax doesn't allow anything to the left of `=` except for a single literal name.
To work around this problem, we use the special operator `:=` which tidy evaluation treats in exactly the same way as `=`.
Or maybe you want to make it easy to draw a bar plot just for a subset of the data:
```{r}
#| fig-alt: |
#| Bar plot of clarity of diamonds. The most common is SI1, then SI2, then
#| VS2, then VS1, then VVS2, then VVS1, then I1, then lastly IF.
conditional_bars <- function(df, condition, var) {
df |>
filter({{ condition }}) |>
ggplot(aes(x = {{ var }})) +
geom_bar()
}
diamonds |> conditional_bars(cut == "Good", clarity)
```
You can also get creative and display data summaries in other ways.
You can find a cool application at <https://gist.github.com/GShotwell/b19ef520b6d56f61a830fabb3454965b>; it uses the axis labels to display the highest value.
As you learn more about ggplot2, the power of your functions will continue to increase.
We'll finish with a more complicated case: labelling the plots you create.
### Labeling
Remember the histogram function we showed you earlier?
```{r}
histogram <- function(df, var, binwidth = NULL) {
df |>
ggplot(aes(x = {{ var }})) +
geom_histogram(binwidth = binwidth)
}
```
Wouldn't it be nice if we could label the output with the variable and the bin width that was used?
To do so, we're going to have to go under the covers of tidy evaluation and use a function from the package we haven't talked about yet: rlang.
rlang is a low-level package that's used by just about every other package in the tidyverse because it implements tidy evaluation (as well as many other useful tools).
To solve the labeling problem we can use `rlang::englue()`.
This works similarly to `str_glue()`, so any value wrapped in `{ }` will be inserted into the string.
But it also understands `{{ }}`, which automatically inserts the appropriate variable name:
```{r}
#| fig-alt: |
#| Histogram of carats of diamonds, ranging from 0 to 5. The distribution is
#| unimodal and right skewed with a peak between 0 to 1 carats.
histogram <- function(df, var, binwidth) {
label <- rlang::englue("A histogram of {{var}} with binwidth {binwidth}")
df |>
ggplot(aes(x = {{ var }})) +
geom_histogram(binwidth = binwidth) +
labs(title = label)
}
diamonds |> histogram(carat, 0.1)
```
You can use the same approach in any other place where you want to supply a string in a ggplot2 plot.
### Exercises
Build up a rich plotting function by incrementally implementing each of the steps below:
1. Draw a scatterplot given dataset and `x` and `y` variables.
2. Add a line of best fit (i.e. a linear model with no standard errors).
3. Add a title.
## Style
R doesn't care what your function or arguments are called but the names make a big difference for humans.
Ideally, the name of your function will be short, but clearly evoke what the function does.
That's hard!
But it's better to be clear than short, as RStudio's autocomplete makes it easy to type long names.
Generally, function names should be verbs, and arguments should be nouns.
There are some exceptions: nouns are ok if the function computes a very well known noun (i.e. `mean()` is better than `compute_mean()`), or accessing some property of an object (i.e. `coef()` is better than `get_coefficients()`).
Use your best judgement and don't be afraid to rename a function if you figure out a better name later.
```{r}
#| eval: false
# Too short
f()
# Not a verb, or descriptive
my_awesome_function()
# Long, but clear
impute_missing()
collapse_years()
```
R also doesn't care about how you use white space in your functions but future readers will.
Continue to follow the rules from @sec-workflow-style.
Additionally, `function()` should always be followed by squiggly brackets (`{}`), and the contents should be indented by an additional two spaces.
This makes it easier to see the hierarchy in your code by skimming the left-hand margin.
```{r}
# Missing extra two spaces
density <- function(color, facets, binwidth = 0.1) {
diamonds |>
ggplot(aes(x = carat, y = after_stat(density), color = {{ color }})) +
geom_freqpoly(binwidth = binwidth) +
facet_wrap(vars({{ facets }}))
}
# Pipe indented incorrectly
density <- function(color, facets, binwidth = 0.1) {
diamonds |>
ggplot(aes(x = carat, y = after_stat(density), color = {{ color }})) +
geom_freqpoly(binwidth = binwidth) +
facet_wrap(vars({{ facets }}))
}
```
As you can see we recommend putting extra spaces inside of `{{ }}`.
This makes it very obvious that something unusual is happening.
### Exercises
1. Read the source code for each of the following two functions, puzzle out what they do, and then brainstorm better names.
```{r}
f1 <- function(string, prefix) {
str_sub(string, 1, str_length(prefix)) == prefix
}
f3 <- function(x, y) {
rep(y, length.out = length(x))
}
```
2. Take a function that you've written recently and spend 5 minutes brainstorming a better name for it and its arguments.
3. Make a case for why `norm_r()`, `norm_d()` etc. would be better than `rnorm()`, `dnorm()`.
Make a case for the opposite.
How could you make the names even clearer?
## Summary
In this chapter, you learned how to write functions for three useful scenarios: creating a vector, creating a data frame, or creating a plot.
Along the way you saw many examples, which hopefully started to get your creative juices flowing, and gave you some ideas for where functions might help your analysis code.
We have only shown you the bare minimum to get started with functions and there's much more to learn.
A few places to learn more are:
- To learn more about programming with tidy evaluation, see useful recipes in [programming with dplyr](https://dplyr.tidyverse.org/articles/programming.html) and [programming with tidyr](https://tidyr.tidyverse.org/articles/programming.html) and learn more about the theory in [What is data-masking and why do I need {{?](https://rlang.r-lib.org/reference/topic-data-mask.html).
- To learn more about reducing duplication in your ggplot2 code, read the [Programming with ggplot2](https://ggplot2-book.org/programming.html){.uri} chapter of the ggplot2 book.
- For more advice on function style, see the [tidyverse style guide](https://style.tidyverse.org/functions.html){.uri}.
In the next chapter, we'll dive into iteration which gives you further tools for reducing code duplication.