About 85% done
Brian Fannin authored and Brian Fannin committed Sep 8, 2016
1 parent 8fda97e commit 6ee2635
Showing 15 changed files with 418 additions and 219 deletions.
10 changes: 10 additions & 0 deletions .travis.yml
@@ -0,0 +1,10 @@
language: r
cache: packages

before_script:
- chmod +x ./_build.sh
- chmod +x ./_deploy.sh

script:
- ./_build.sh
- ./_deploy.sh
185 changes: 121 additions & 64 deletions DataFrames.Rmd
@@ -1,125 +1,149 @@
# Data Frames

Finally! This
This is the big one! All of that stuff about vectors and lists was prologue to this. The data frame is a seminal concept in R. Most statistical and predictive models expect one and they are the most common way to pass data in and out of R. Although critical to understand, data frames are very, very easy to get. What's a data frame? It's a table. That's it. No, really, that's it.

By the end of this chapter you will know:
By the end of this chapter you will know the following:

* What's a data frame and how do I create one?
* How can I access and assign elements to and from a data frame?
* How do I read and write external data?

## What's a data frame?

All of that about vectors and lists was prologue to this. The data frame is a seminal concept in R. Most statistical operations expect one and they are the most common way to pass data in and out of R.

Although critical to understand, this is very, very easy to get. What's a data frame? It's a table. That's it. No, really, that's it.

A data frame is a `list` of `vectors`. Each `vector` may have a different data type, but all must be the same length.
Under the hood, a data frame is actually a mashup of lists and vectors. Every data frame is a list, but it's constrained so that each list element is a vector with the same length. We can see how this exploits some of the fundamental properties of lists and vectors. Because every element of a vector must have the same data type, there's no danger that I'll get character data in a column when I only want dates or integers. At the same time, the list's flexibility means that we can store columns of different data types in one single data construct.
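
One way to convince yourself of this is to ask R directly. The sketch below builds a small, made-up data frame (`dfDemo` and its values are illustrative only) and checks that it is a list whose elements are vectors of different types but equal length.

```r
# A small illustrative data frame; the name and values are hypothetical
dfDemo <- data.frame(State = c("TX", "NY", "CA")
                     , EarnedPremium = c(50000, 72000, 31000)
                     , stringsAsFactors = FALSE)

is.list(dfDemo)          # a data frame is a list ...
sapply(dfDemo, class)    # ... whose elements may have different types ...
sapply(dfDemo, length)   # ... but must all have the same length
```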

### Creating a data frame

Although there are many functions that will return a data frame, let's start by looking at the `data.frame` function. We'll first create some vectors and then join them together.

```{r }
set.seed(1234)
State = rep(c("TX", "NY", "CA"), 10)
EarnedPremium = rlnorm(length(State), meanlog = log(50000), sdlog=1)
EarnedPremium = round(EarnedPremium, -3)
Losses = EarnedPremium * runif(length(EarnedPremium), min=0.4, max = 0.9)
State <- rep(c("TX", "NY", "CA"), 10)
EarnedPremium <- rlnorm(length(State), meanlog = log(50000), sdlog=1)
df <- data.frame(State, EarnedPremium, stringsAsFactors=FALSE)
```

We didn't have to create the vectors first. If we wanted, we could pass them in as the result of function calls within the call to `data.frame`.

df = data.frame(State, EarnedPremium, Losses, stringsAsFactors=FALSE)
```{r}
set.seed(1234)
df <- data.frame(State = rep(c("TX", "NY", "CA"), 10)
# Use 30 directly: the State column does not exist yet while the arguments are evaluated
, EarnedPremium = round(rlnorm(30, meanlog = log(50000), sdlog=1), 3)
, stringsAsFactors = FALSE)
```

Note that I've set the `stringsAsFactors` argument to `FALSE`. If I hadn't, the column `State` would be a factor, rather than a character vector. (That was the default behavior before R 4.0.0; from 4.0.0 onward the default is `FALSE`.) See [Data Types: factors](#Factors) for some reasons why we might not want our data to be a factor.
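
To see the difference, we can set the argument explicitly both ways and compare the class of the resulting column. This is a minimal sketch; `states`, `dfFactor` and `dfChar` are names made up for the illustration.

```r
states <- c("TX", "NY", "CA")

# Same input, two settings of stringsAsFactors
dfFactor <- data.frame(State = states, stringsAsFactors = TRUE)
dfChar   <- data.frame(State = states, stringsAsFactors = FALSE)

class(dfFactor$State)   # "factor"
class(dfChar$State)     # "character"
```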

### Basic properties of a data frame

Once created, it's straightforward to get some basic information about the data we have.

```{r }
summary(df)
str(df)
head(df)
tail(df)
```

We can also query metadata about our data frame. Note that, for a data frame, `names` and `colnames` will return the same result. `dim` will give the number of rows and columns, or you can use `nrow` and `ncol` directly. In particular, note what result is returned by the `length` function. If you think about the fact that a data frame is actually a list, you may be able to guess why `length` returns the value it does.

```{r }
names(df)
colnames(df)
length(df)
dim(df)
length(df)
nrow(df)
ncol(df)
```

### Merging

If you've got more than one data frame, it's possible to combine them in several different ways: `rbind`, `cbind` and `merge`.

`rbind` will append rows to the data frame. New rows must have the same number of columns and data types.

```{r results='hide'}
dfA = df[1:10,]
dfB = df[11:20, ]
rbind(dfA, dfB)
```

`cbind` will append columns to the data frame. The new columns must have the same number of rows as the data frame.

```{r }
head(df)
head(df, 2)
tail(df)
dfC = dfA[, 1:2]
cbind(dfA, dfC)
```

`merge` is similar to a __JOIN__ operation in a relational database. If you're used to __VLOOKUP__ in Excel[^havent_used_vlookup], you'll love `merge`. It's possible to use multiple columns (e.g. state and effective date) when specifying how to join. If no columns are specified, `merge` will use whatever column names the two data frames have in common. Below, we merge a set of rate changes to our premium data.

```{r }
dfRateChange = data.frame(State = c("TX", "CA", "NY"), RateChange = c(.05, -.1, .2))
df = merge(df, dfRateChange)
```
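
When the join depends on more than one column, it can be clearer to name the keys with the `by` argument rather than rely on shared column names. The sketch below uses made-up data (`dfPremium`, `dfRates` and the `Year` column are assumptions for illustration, not part of the chapter's data set).

```r
# Hypothetical premium and rate-change tables keyed by State and Year
dfPremium <- data.frame(State = c("TX", "TX", "NY", "NY")
                        , Year = c(2015, 2016, 2015, 2016)
                        , EarnedPremium = c(100, 110, 200, 190)
                        , stringsAsFactors = FALSE)
dfRates <- data.frame(State = c("TX", "TX", "NY", "NY")
                      , Year = c(2015, 2016, 2015, 2016)
                      , RateChange = c(0.05, 0.02, -0.10, 0.00)
                      , stringsAsFactors = FALSE)

# Join on both key columns; rows match only when State AND Year agree
dfJoined <- merge(dfPremium, dfRates, by = c("State", "Year"))
dfJoined
```

By default `merge` keeps only the rows that match in both data frames, much like an inner join.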

[^havent_used_vlookup]: If you've never used VLOOKUP, I can't imagine why you're reading this.

Finally, consider `expand.grid`. This will create a data frame as the Cartesian product (i.e. every combination of elements) of one or more vectors. One use case is checking for missing data elements in another data frame.

```{r}
dfStateClass <- expand.grid(State = c("TX", "CA", "NY")
, Class = c(8810, 1066, 1492))
```
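
As a rough sketch of the missing-data check, we can merge the full grid against the observed data and look for combinations with no match. The data frames `dfAll` and `dfObserved` and the `Premium` column are invented for this example.

```r
# Every State/Class combination we expect to see
dfAll <- expand.grid(State = c("TX", "CA", "NY")
                     , Class = c(8810, 1066)
                     , stringsAsFactors = FALSE)

# Hypothetical observed data: CA and NY have nothing for class 1066
dfObserved <- data.frame(State = c("TX", "CA", "NY", "TX")
                         , Class = c(8810, 8810, 8810, 1066)
                         , Premium = c(10, 20, 30, 40)
                         , stringsAsFactors = FALSE)

# all.x = TRUE keeps every row of the grid; unmatched rows get NA for Premium
dfCheck <- merge(dfAll, dfObserved, all.x = TRUE)
dfMissing <- dfCheck[is.na(dfCheck$Premium), c("State", "Class")]
dfMissing
```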

### Referencing
## Access and Assignment

Very similar to referencing a 2D matrix.
Access and assignment will feel like a weird combination of matrices and lists, though with an emphasis on the mechanics of a 2D matrix. We'll often use the `[]` operator. The first argument will refer to the row and the second will refer to the column. If either argument is left blank, it will refer to every row or every column, respectively.

```{r eval=FALSE}
df[2,3]
df[2]
df[2,]
df[1, 2]
df[2, ]
df[, 2]
df[2, -1]
```
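
The chunk above only reads values. Assignment uses the same indexing on the left-hand side of `<-`. Here is a minimal sketch on a throwaway data frame (`dfDemo` and its values are made up for the illustration).

```r
dfDemo <- data.frame(State = c("TX", "NY", "CA")
                     , EarnedPremium = c(50000, 72000, 31000)
                     , stringsAsFactors = FALSE)

# Overwrite a single cell: row 2 of the EarnedPremium column
dfDemo[2, "EarnedPremium"] <- 75000

# Overwrite every row that meets a condition
dfDemo[dfDemo$State == "CA", "EarnedPremium"] <- 0

# Overwrite a whole column: blank row argument means every row
dfDemo[, "EarnedPremium"] <- dfDemo[, "EarnedPremium"] / 1000
dfDemo
```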

Note the `$` operator to access named columns. A data frame uses the 'name' metadata in the same way as a list.
If we want columns of the data frame, we can approach this in one of two ways. The `$` operator works the same way as it does for a list (because a data frame _is_ a list) to access named columns. We can also pass character strings in as arguments to get columns by name, rather than by position.

```{r eval=FALSE}
df$EarnedPremium
# Columns of a data frame may be treated as vectors
df$EarnedPremium[3]
df[2:4, 1:2]
df[, "EarnedPremium"]
df[, c("EarnedPremium", "State")]
```

### Ordering
If we're only interested in one column, we can use the `[[]]` operator to return it.

```{r }
order(df$EarnedPremium)
df = df[order(df$EarnedPremium), ]
```{r}
head(df[[1]])
head(df[["EarnedPremium"]])
```

### Altering and adding columns
Note the interesting case when we specify one argument but leave the other blank. I'll wrap it in a `head` function to minimize the output.

```{r }
df$LossRatio = df$EarnedPremium / df$Losses
df$LossRatio = 1 / df$LossRatio
```{r}
head(df[2])
```

### Eliminating columns
Again, if you bear in mind that a data frame is a list, this makes a bit more sense. Without the additional comma, R will assume that we're dealing with a list and will respond accordingly. In this case, that means returning the second element of the list. You can extend this by requesting multiple elements of the list.

```{r }
df$LossRatio = NULL
df = df[, 1:2]
```{r}
head(df[1:2])
```

### rbind, cbind
It'll probably be rare that you want this sort of behavior. It's not so rare that you'll want to return a single vector from a data frame, but in these cases, I'd recommend either using the `$` or the `[[]]` operator.
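
The difference between these operators is easiest to see by checking the class of what each one returns. This is a small sketch on made-up data (`dfDemo` is a hypothetical name).

```r
dfDemo <- data.frame(State = c("TX", "NY", "CA")
                     , EarnedPremium = c(50000, 72000, 31000)
                     , stringsAsFactors = FALSE)

class(dfDemo[2])                   # "data.frame": list-style, keeps the container
class(dfDemo[[2]])                 # "numeric": the vector itself
class(dfDemo$EarnedPremium)        # "numeric": same vector, accessed by name
class(dfDemo[, 2])                 # "numeric": matrix-style, one column drops to a vector
class(dfDemo[, 2, drop = FALSE])   # "data.frame": drop = FALSE keeps the container
```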

`rbind` will append rows to the data frame. New rows must have the same number of columns and data types. `cbind` must have the same number of rows as the data frame.

```{r results='hide'}
dfA = df[1:10,]
dfB = df[11:20, ]
rbind(dfA, dfB)
dfC = dfA[, 1:2]
cbind(dfA, dfC)
```

### Merging
### Altering and adding columns

```{r size='tiny'}
dfRateChange = data.frame(State =c("TX", "CA", "NY"), RateChange = c(.05, -.1, .2))
df = merge(df, dfRateChange)
```{r }
df$Losses <- df$EarnedPremium * runif(nrow(df), 0.4, 1.2)
df$LossRatio = df$Losses / df$EarnedPremium
```

Merging is VLOOKUP on steroids. Basically equivalent to a JOIN in SQL.

### Altering column names
### Eliminating columns

```{r }
df$LossRation = with(df, Losses / EarnedPremium)
names(df)
colnames(df)[4] = "Loss Ratio"
colnames(df)
df$LossRatio = NULL
df = df[, 1:2]
```

### Subsetting - The easy way
@@ -148,6 +172,40 @@ dfBigPolicies = df[whichEP, ]

I use each of these three methods routinely. They're all good.

### Ordering

```{r }
order(df$EarnedPremium)
df = df[order(df$EarnedPremium), ]
```
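
Note that `order` returns row positions rather than sorted values, which is why we use its result as a row index. The sketch below, on made-up data (`dfDemo`, `dfDesc` and `dfSorted` are hypothetical names), also shows a descending sort and a sort on more than one column.

```r
dfDemo <- data.frame(State = c("TX", "NY", "CA", "TX", "NY")
                     , EarnedPremium = c(50, 70, 30, 90, 10)
                     , stringsAsFactors = FALSE)

# Positions of the rows, from smallest premium to largest
order(dfDemo$EarnedPremium)

# Largest premium first
dfDesc <- dfDemo[order(dfDemo$EarnedPremium, decreasing = TRUE), ]

# Sort by State, then by descending premium within each State
dfSorted <- dfDemo[order(dfDemo$State, -dfDemo$EarnedPremium), ]
dfSorted
```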

### Altering column names

```{r eval=FALSE}
df$LossRation = with(df, Losses / EarnedPremium)
names(df)
colnames(df)[4] = "Loss Ratio"
colnames(df)
```

### `with` and `attach`

If the name of a data frame is long, continually typing it to access column elements might start to seem tedious. There are any number of texts that will use the `attach` function to alleviate this. This will attach the data frame onto something called the "search path" (which I might have described in the section on packages). What's the search path? Well, despite all evidence to the contrary, R will look high and low every time you refer to something. As soon as it finds a match, it'll proceed with whatever calculation you've asked it to do. Attaching the data frame to the search path means that the data frame will be added to the list of places where R will search, so its columns can be found by name alone.

```{r}
attach(dfA)
# do some stuff
attach(dfB)
```

So why is this a bad idea? If you add a number of similar data frames or packages it can quickly get hard to keep up. The worst case scenario comes when you add two data frames that share column names and you then proceed to carry out analysis.

```{r}
mean(EarnedPremium)
```

Which `EarnedPremium` am I talking about? In all the confusion, I lost track myself. Well, do you feel lucky, punk?
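
The `with` function avoids the ambiguity, because the data frame is named in the same call that uses its columns and nothing is left on the search path afterwards. A minimal sketch follows; the `dfA` and `dfB` here are tiny stand-ins invented for the example, not the data frames built earlier.

```r
# Two hypothetical data frames that share a column name
dfA <- data.frame(EarnedPremium = c(100, 200, 300))
dfB <- data.frame(EarnedPremium = c(1000, 2000, 3000))

# Each call states which data frame supplies EarnedPremium
with(dfA, mean(EarnedPremium))   # 200
with(dfB, mean(EarnedPremium))   # 2000
```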

## Summarizing

```{r }
@@ -180,7 +238,7 @@ dotchart(dfByState$EarnedPremium, dfByState$State, pch=19)

Roughly 90% of your work in R will involve manipulation of data frames. There are truckloads of packages designed to make manipulation of data frames easier. Take your time getting to learn these tools. They're all powerful, but they're all a little different. I'd suggest learning the functions in `base` R first, then moving on to tools like `dplyr` and `data.table`. There's a lot to be gained from understanding the problems those packages were created to solve.

### Reading data
## Reading and writing external data

```{r eval=FALSE}
myData = read.csv("SomeFile.csv")
@@ -199,15 +257,13 @@ wbk = loadWorkbook("myWorkbook.xlsx")
df = readWorksheet(wbk, someSheet)
```
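
Going the other way, `write.csv` will write a data frame to a file. The sketch below is a round trip through a temporary file, so it does not depend on any file in the chunks above; `dfOut`, `dfIn` and `tmpFile` are names made up for the example.

```r
dfOut <- data.frame(State = c("TX", "NY", "CA")
                    , EarnedPremium = c(50000, 72000, 31000)
                    , stringsAsFactors = FALSE)

# Write to a temporary file; row.names = FALSE avoids an extra index column
tmpFile <- tempfile(fileext = ".csv")
write.csv(dfOut, tmpFile, row.names = FALSE)

# Read it back and inspect the result
dfIn <- read.csv(tmpFile, stringsAsFactors = FALSE)
str(dfIn)

# Clean up the temporary file
unlink(tmpFile)
```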

### Reading from the web - 1
### Reading from the web

```{r eval=FALSE}
URL = "http://www.casact.org/research/reserve_data/ppauto_pos.csv"
df = read.csv(URL, stringsAsFactors = FALSE)
```

### Reading from the web - 2

```{r eval=FALSE}
library(XML)
URL = "http://www.pro-football-reference.com/teams/nyj/2012_games.htm"
@@ -232,8 +288,9 @@ df = read.csv("../data-raw/StateData.csv")
View(df)
```

### Exercises
## Exercises

*
* Load the data from "StateData.csv" into a data frame.
* Which state has the most premium?
