Commit

Merge branch 'master' of github.com:hadley/r4ds

hadley committed Aug 24, 2018
2 parents 5f87402 + 03eb8d0 commit b2de993

Showing 18 changed files with 65 additions and 53 deletions.
2 changes: 1 addition & 1 deletion EDA.Rmd
@@ -494,7 +494,7 @@ ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
1. Visualise the distribution of carat, partitioned by price.

1. How does the price distribution of very large diamonds compare to small
diamonds. Is it as you expect, or does it surprise you?
diamonds? Is it as you expect, or does it surprise you?

1. Combine two of the techniques you've learned to visualise the
combined distribution of cut, carat, and price.
4 changes: 2 additions & 2 deletions datetimes.Rmd
@@ -384,7 +384,7 @@ one_pm
one_pm + ddays(1)
```

Why is one day after 1pm on March 12, 2pm on March 13?! If you look carefully at the date you might also notice that the time zones have changed. Because of DST, March 12 only has 23 hours, so if add a full days worth of seconds we end up with a different time.
Why is one day after 1pm on March 12, 2pm on March 13?! If you look carefully at the date you might also notice that the time zones have changed. Because of DST, March 12 only has 23 hours, so if we add a full day's worth of seconds we end up with a different time.
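
A hedged illustration of the duration/period contrast (assuming `one_pm` is the DST-affected date-time created above, with lubridate loaded; periods are introduced in the next section):

```{r}
one_pm + ddays(1)  # a duration adds exactly 86,400 seconds, so the clock reads 2pm
one_pm + days(1)   # a period adds one calendar day, so the clock still reads 1pm
```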

### Periods

@@ -538,7 +538,7 @@ x1 - x2
x1 - x3
```

Unless other specified, lubridate always uses UTC. UTC (Coordinated Universal Time) is the standard time zone used by the scientific community and roughly equivalent to its predecessor GMT (Greenwich Mean Time). It does not have DST, which makes a convenient representation for computation. Operations that combine date-times, like `c()`, will often drop the time zone. In that case, the date-times will display in your local time zone:
Unless otherwise specified, lubridate always uses UTC. UTC (Coordinated Universal Time) is the standard time zone used by the scientific community and roughly equivalent to its predecessor GMT (Greenwich Mean Time). It does not have DST, which makes it a convenient representation for computation. Operations that combine date-times, like `c()`, will often drop the time zone. In that case, the date-times will display in your local time zone:

```{r}
x4 <- c(x1, x2, x3)
4 changes: 2 additions & 2 deletions factors.Rmd
@@ -209,8 +209,8 @@ Another type of reordering is useful when you are colouring the lines on a plot.
```{r, fig.align = "default", out.width = "50%", fig.width = 4}
by_age <- gss_cat %>%
filter(!is.na(age)) %>%
group_by(age, marital) %>%
count() %>%
count(age, marital) %>%
group_by(age) %>%
mutate(prop = n / sum(n))
ggplot(by_age, aes(age, prop, colour = marital)) +
2 changes: 1 addition & 1 deletion functions.Rmd
@@ -280,7 +280,7 @@ if (condition) {

To get help on `if` you need to surround it in backticks: `` ?`if` ``. The help isn't particularly helpful if you're not already an experienced programmer, but at least you know how to get to it!

Here's a simple function that uses an if statement. The goal of this function is to return a logical vector describing whether or not each element of a vector is named.
Here's a simple function that uses an `if` statement. The goal of this function is to return a logical vector describing whether or not each element of a vector is named.

```{r}
has_name <- function(x) {
2 changes: 1 addition & 1 deletion iteration.Rmd
@@ -158,7 +158,7 @@ That's all there is to the for loop! Now is a good time to practice creating som
## For loop variations
Once you have the basic for loop under your belt, there are some variations that you should be aware of. These variations are important regardless of how you do iteration, so don't forget about them once you've master the FP techniques you'll learn about in the next section.
Once you have the basic for loop under your belt, there are some variations that you should be aware of. These variations are important regardless of how you do iteration, so don't forget about them once you've mastered the FP techniques you'll learn about in the next section.
There are four variations on the basic theme of the for loop:
18 changes: 9 additions & 9 deletions model-assess.Rmd
@@ -51,7 +51,7 @@ There are lots of high-level helpers to do these resampling methods in R. We're

<http://topepo.github.io/caret>. [Applied Predictive Modeling](https://amzn.com/1461468485), by Max Kuhn and Kjell Johnson.

If you're competing in competitions, like Kaggle, that are predominantly about creating good predicitons, developing a good strategy for avoiding overfitting is very important. Otherwise you risk tricking yourself into thinking that you have a good model, when in reality you just have a model that does a good job of fitting your data.
If you're competing in competitions, like Kaggle, that are predominantly about creating good predictions, developing a good strategy for avoiding overfitting is very important. Otherwise you risk tricking yourself into thinking that you have a good model, when in reality you just have a model that does a good job of fitting your data.

There is a closely related family that uses a similar idea: model ensembles. However, instead of trying to find the best models, ensembles make use of all the models, acknowledging that even models that don't fit all the data particularly well can still model some subsets well. In general, you can think of model ensemble techniques as functions that take a list of models, and return a single model that attempts to take the best part of each.
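
As a hedged sketch of that idea (nothing package-specific; `ensemble_predict`, `mods`, and `newdata` are hypothetical names, with `mods` assumed to be a list of fitted models that all support `predict()` on the same new data):

```{r}
ensemble_predict <- function(mods, newdata) {
  # one column of predictions per model in the list
  preds <- vapply(mods, predict, numeric(nrow(newdata)), newdata = newdata)
  # a simple averaging ensemble: combine them into a single prediction per row
  rowMeans(preds)
}
```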

@@ -155,7 +155,7 @@ models %>%

But do you think this model will do well if we apply it to new data from the same population?

In real-life you can't easily go out and recollect your data. There are two approach to help you get around this problem. I'll introduce them briefly here, and then we'll go into more depth in the following sections.
In real-life you can't easily go out and recollect your data. There are two approaches to help you get around this problem. I'll introduce them briefly here, and then we'll go into more depth in the following sections.

```{r}
boot <- bootstrap(df, 100) %>%
@@ -181,7 +181,7 @@ last_plot() +

Bootstrapping is a useful tool to help us understand how the model might vary if we'd collected a different sample from the population. A related technique is cross-validation, which allows us to explore the quality of the model. It works by repeatedly splitting the data into two pieces. One piece, the training set, is used to fit the model, and the other piece, the test set, is used to measure the model quality.

The following code generates 100 test-training splits, holding out 20% of the data for testing each time. We then fit a model to the training set, and evalute the error on the test set:
The following code generates 100 test-training splits, holding out 20% of the data for testing each time. We then fit a model to the training set, and evaluate the error on the test set:

```{r}
cv <- crossv_mc(df, 100) %>%
@@ -192,7 +192,7 @@ cv
cv
```

Obviously, a plot is going to help us see distribution more easily. I've added our original estimate of the model error as a white vertical line (where the same dataset is used for both training and teseting), and you can see it's very optimistic.
Obviously, a plot is going to help us see the distribution more easily. I've added our original estimate of the model error as a white vertical line (where the same dataset is used for both training and testing), and you can see it's very optimistic.

```{r}
cv %>%
@@ -202,7 +202,7 @@ cv %>%
geom_rug()
```

The distribution of errors is highly skewed: there are a few cases which have very high errors. These respresent samples where we ended up with a few cases on all with low values or high values of x. Let's take a look:
The distribution of errors is highly skewed: there are a few cases which have very high errors. These represent samples where we ended up with few cases at all with low values or high values of x. Let's take a look:

```{r}
filter(cv, rmse > 1.5) %>%
@@ -214,13 +214,13 @@ filter(cv, rmse > 1.5) %>%

All of the models that fit particularly poorly were fit to samples that either missed the first one or two or the last one or two observations. Because polynomials shoot off to positive and negative infinity, they give very bad predictions for those values.

Now that we've given you a quick overview and intuition for these techniques, lets dive in more more detail.
Now that we've given you a quick overview and intuition for these techniques, let's dive in more detail.

## Resamples

### Building blocks

Both the boostrap and cross-validation are build on top of a "resample" object. In modelr, you can access these low-level tools directly with the `resample_*` functions.
Both the bootstrap and cross-validation are built on top of a "resample" object. In modelr, you can access these low-level tools directly with the `resample_*` functions.

These functions return an object of class "resample", which represents the resample in a memory efficient way. Instead of storing the resampled dataset itself, it instead stores the integer indices, and a "pointer" to the original dataset. This makes resamples take up much less memory.
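
A hedged illustration of these building blocks (`rs` is a hypothetical name; `mtcars` stands in for any data frame):

```{r}
library(modelr)
rs <- resample_bootstrap(mtcars)
rs                      # prints as a compact resample object, not a copy of the data
head(as.integer(rs))    # the stored row indices
dim(as.data.frame(rs))  # materialise the resampled rows only when you need them
```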

@@ -250,7 +250,7 @@ If you get a strange error, it's probably because the modelling function doesn't
```
`strap` gives the bootstrap sample dataset, and `.id` assigns a
unique identifer to each model (this is often useful for plotting)
unique identifier to each model (this is often useful for plotting)
* `crossv_mc()` returns a data frame with three columns:
@@ -290,7 +290,7 @@ It's called the $R^2$ because for simple models like this, it's just the square
cor(heights$income, heights$height) ^ 2
```

The $R^2$ is an ok single number summary, but I prefer to think about the unscaled residuals because it's easier to interpret in the context of the original data. As you'll also learn later, it's also a rather optimistic interpretation of the model. Because you're asssessing the model using the same data that was used to fit it, it really gives more of an upper bound on the quality of the model, not a fair assessment.
The $R^2$ is an ok single number summary, but I prefer to think about the unscaled residuals because it's easier to interpret in the context of the original data. As you'll also learn later, it's also a rather optimistic interpretation of the model. Because you're assessing the model using the same data that was used to fit it, it really gives more of an upper bound on the quality of the model, not a fair assessment.
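
A hedged check of that equivalence (assumes the `heights` data from modelr; `fit` is a hypothetical name for the simple regression used in this chapter):

```{r}
library(modelr)
fit <- lm(income ~ height, data = heights)
summary(fit)$r.squared                   # the R^2 reported by the model
cor(heights$income, heights$height) ^ 2  # the same number for a simple linear model
```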



6 changes: 3 additions & 3 deletions model-basics.Rmd
@@ -10,7 +10,7 @@ There are two parts to a model:

1. First, you define a __family of models__ that express a precise, but
generic, pattern that you want to capture. For example, the pattern
might be a straight line, or a quadatric curve. You will express
might be a straight line, or a quadratic curve. You will express
the model family as an equation like `y = a_1 * x + a_2` or
`y = a_1 * x ^ a_2`. Here, `x` and `y` are known variables from your
data, and `a_1` and `a_2` are parameters that can vary to capture
@@ -185,7 +185,7 @@ ggplot(sim1, aes(x, y)) +

Don't worry too much about the details of how `optim()` works. It's the intuition that's important here. If you have a function that defines the distance between a model and a dataset, and an algorithm that can minimise that distance by modifying the parameters of the model, you can find the best model. The neat thing about this approach is that it will work for any family of models that you can write an equation for.
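
A compressed recap of that recipe (assuming `sim1` from modelr and the `measure_distance()` helper defined earlier in the chapter):

```{r}
best <- optim(c(0, 0), measure_distance, data = sim1)
best$par  # the two parameters of the best straight line optim() could find
```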

There's one more approach that we can use for this model, because it's is a special case of a broader family: linear models. A linear model has the general form `y = a_1 + a_2 * x_1 + a_3 * x_2 + ... + a_n * x_(n - 1)`. So this simple model is equivalent to a general linear model where n is 2 and `x_1` is `x`. R has a tool specifically designed for fitting linear models called `lm()`. `lm()` has a special way to specify the model family: formulas. Formulas look like `y ~ x`, which `lm()` will translate to a function like `y = a_1 + a_2 * x`. We can fit the model and look at the output:
There's one more approach that we can use for this model, because it's a special case of a broader family: linear models. A linear model has the general form `y = a_1 + a_2 * x_1 + a_3 * x_2 + ... + a_n * x_(n - 1)`. So this simple model is equivalent to a general linear model where n is 2 and `x_1` is `x`. R has a tool specifically designed for fitting linear models called `lm()`. `lm()` has a special way to specify the model family: formulas. Formulas look like `y ~ x`, which `lm()` will translate to a function like `y = a_1 + a_2 * x`. We can fit the model and look at the output:

```{r}
sim1_mod <- lm(y ~ x, data = sim1)
@@ -214,7 +214,7 @@ These are exactly the same values we got with `optim()`! Behind the scenes `lm()
```{r}
measure_distance <- function(mod, data) {
diff <- data$y - make_prediction(mod, data)
diff <- data$y - model1(mod, data)
mean(abs(diff))
}
```
4 changes: 2 additions & 2 deletions model-building.Rmd
@@ -165,10 +165,10 @@ Nothing really jumps out at me here, but it's probably worth spending time consi
the relationship between `price` and `carat`?

1. Extract the diamonds that have very high and very low residuals.
Is there anything unusual about these diamonds? Are the particularly bad
Is there anything unusual about these diamonds? Are they particularly bad
or good, or do you think these are pricing errors?

1. Does the final model, `mod_diamonds2`, do a good job of predicting
1. Does the final model, `mod_diamond2`, do a good job of predicting
diamond prices? Would you trust it to tell you how much to spend
if you were buying a diamond?

10 changes: 5 additions & 5 deletions model-many.Rmd
@@ -15,7 +15,7 @@ In this chapter you're going to learn three powerful ideas that help you to work
because once you have tidy data, you can apply all of the techniques that
you've learned about earlier in the book.

We'll start by diving into a motivating example using data about life expectancy around the world. It's a small dataset but it illustrates how important modelling can be for improving your visualisations. We'll use a large number of simple models to partition out some of the strongest signal so we can see the subtler signals that remain. We'll also see how model summaries can help us pick out outliers and unusual trends.
We'll start by diving into a motivating example using data about life expectancy around the world. It's a small dataset but it illustrates how important modelling can be for improving your visualisations. We'll use a large number of simple models to partition out some of the strongest signals so we can see the subtler signals that remain. We'll also see how model summaries can help us pick out outliers and unusual trends.

The following sections will dive into more detail about the individual techniques:

@@ -107,7 +107,7 @@ by_country

(I'm cheating a little by grouping on both `continent` and `country`. Given `country`, `continent` is fixed, so this doesn't add any more groups, but it's an easy way to carry an extra variable along for the ride.)

This creates an data frame that has one row per group (per country), and a rather unusual column: `data`. `data` is a list of data frames (or tibbles, to be precise). This seems like a crazy idea: we have a data frame with a column that is a list of other data frames! I'll explain shortly why I think this is a good idea.
This creates a data frame that has one row per group (per country), and a rather unusual column: `data`. `data` is a list of data frames (or tibbles, to be precise). This seems like a crazy idea: we have a data frame with a column that is a list of other data frames! I'll explain shortly why I think this is a good idea.

The `data` column is a little tricky to look at because it's a moderately complicated list, and we're still working on good tools to explore these objects. Unfortunately using `str()` is not recommended as it will often produce very long output. But if you pluck out a single element from the `data` column you'll see that it contains all the data for that country (in this case, Afghanistan).
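
A hedged illustration (assuming the `by_country` nested data frame built just above):

```{r}
by_country$data[[1]]  # all of the yearly observations for the first country
```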

@@ -266,7 +266,7 @@ We see two main effects here: the tragedies of the HIV/AIDS epidemic and the Rwa
1. To create the last plot (showing the data for the countries with the
worst model fits), we needed two steps: we created a data frame with
one row per country and then semi-joined it to the original dataset.
It's possible avoid this join if we use `unnest()` instead of
It's possible to avoid this join if we use `unnest()` instead of
`unnest(.drop = TRUE)`. How?

## List-columns
@@ -377,14 +377,14 @@ df %>%
unnest()
```

(If you find yourself using this pattern a lot, make sure to check out `tidyr:separate_rows()` which is a wrapper around this common pattern).
(If you find yourself using this pattern a lot, make sure to check out `tidyr::separate_rows()` which is a wrapper around this common pattern).
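
A hedged sketch of that shortcut (assuming the tidyverse is loaded; `df` mirrors the comma-separated example above):

```{r}
df <- tibble(x1 = c("a,b,c", "d,e,f,g"))
df %>% separate_rows(x1, sep = ",")
```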

Another example of this pattern is using `map()`, `map2()`, or `pmap()` from purrr. For example, we could take the final example from [Invoking different functions] and rewrite it to use `mutate()`:

```{r}
sim <- tribble(
~f, ~params,
"runif", list(min = -1, max = -1),
"runif", list(min = -1, max = 1),
"rnorm", list(sd = 5),
"rpois", list(lambda = 10)
)
6 changes: 3 additions & 3 deletions pipes.Rmd
@@ -14,7 +14,7 @@ library(magrittr)

## Piping alternatives

The point of the pipe is to help you write code in a way that easier to read and understand. To see why the pipe is so useful, we're going to explore a number of ways of writing the same code. Let's use code to tell a story about a little bunny named Foo Foo:
The point of the pipe is to help you write code in a way that is easier to read and understand. To see why the pipe is so useful, we're going to explore a number of ways of writing the same code. Let's use code to tell a story about a little bunny named Foo Foo:

> Little bunny Foo Foo
> Went hopping through the forest
@@ -127,11 +127,11 @@ Finally, we can use the pipe:
```{r, eval = FALSE}
foo_foo %>%
hop(through = forest) %>%
scoop(up = field_mouse) %>%
scoop(up = field_mice) %>%
bop(on = head)
```

This is my favourite form, because it focusses on verbs, not nouns. You can read this series of function compositions like it's a set of imperative actions. Foo Foo hops, then scoops, then bops. The downside, of course, is that you need to be familiar with the pipe. If you've never seen `%>%` before, you'll have no idea what this code does. Fortunately, most people pick up the idea very quickly, so when you share you code with others who aren't familiar with the pipe, you can easily teach them.
This is my favourite form, because it focusses on verbs, not nouns. You can read this series of function compositions like it's a set of imperative actions. Foo Foo hops, then scoops, then bops. The downside, of course, is that you need to be familiar with the pipe. If you've never seen `%>%` before, you'll have no idea what this code does. Fortunately, most people pick up the idea very quickly, so when you share your code with others who aren't familiar with the pipe, you can easily teach them.

The pipe works by performing a "lexical transformation": behind the scenes, magrittr reassembles the code in the pipe to a form that works by overwriting an intermediate object. When you run a pipe like the one above, magrittr does something like this:
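
A hedged sketch of the kind of rewriting meant here (the flavour of the transformation, not magrittr's actual internals):

```{r, eval = FALSE}
# roughly: each step overwrites a temporary object, and the last call's value is returned
. <- hop(foo_foo, through = forest)
. <- scoop(., up = field_mice)
bop(., on = head)
```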

2 changes: 1 addition & 1 deletion relational-data.Rmd
@@ -368,7 +368,7 @@ So far, the pairs of tables have always been joined by a single variable, and th
variables from `x` will be used in the output.
For example, if we want to draw a map we need to combine the flights data
with the airports data which contains the location (`lat` and `long`) of
with the airports data which contains the location (`lat` and `lon`) of
each airport. Each flight has an origin and destination `airport`, so we
need to specify which one we want to join to:
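
A hedged sketch of one such join (assuming the `flights2` and `airports` tables used in this chapter; here joining on the destination code):

```{r}
flights2 %>%
  left_join(airports, by = c("dest" = "faa"))
```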