DataFrames.Rmd

# Data Frames

This is the big one! All of that stuff about vectors and lists was prologue to this. The data frame is a seminal concept in R. Most statistical and predictive models expect one and they are the most common way to pass data in and out of R. Although critical to understand, data frames are very, very easy to get. What's a data frame? It's a table. That's it. No, really, that's it. 

By the end of this chapter you will know the following:

* What's a data frame and how do I create one?
* How can I access and assign elements to and from a data frame?
* How do I read and write external data?

## What's a data frame?

Underneath the hood, a data frame is actually a mashup of lists and vectors. Every data frame is a list, but it's constrained so that each list element is a vector with the same length. We can see how this exploits some of the fundamental properties of lists and vectors. Because each vector must have the same data type, there's no danger that I'll get character data when I only want dates or integers. At the same time, the list's flexibility means that we can store different data types in one single data construct. 

### Creating a data frame

Although there are many functions that will return a data frame, let's start by looking at the `data.frame` function. We'll first create some vectors and then join them together.

```{r }
set.seed(1234)
State <- rep(c("TX", "NY", "CA"), 10)
EarnedPremium <- rlnorm(length(State), meanlog = log(50000), sdlog=1)

df <- data.frame(State, EarnedPremium, stringsAsFactors=FALSE)
```

We didn't have to create the vectors first. If we wanted, we could pass them in as the result of function calls within the call to `data.frame`.

```{r}
set.seed(1234)
df <- data.frame(State = rep(c("TX", "NY", "CA"), 10)
                 , EarnedPremium = round(rlnorm(length(State), meanlog = log(50000), sdlog=1), 3)
                 , stringsAsFactors = FALSE)
```

Note that I've set the "stringsAsFactors" argument to FALSE. If I hadn't, the column "State" would be a factor, rather than a character. See [Data Types: factors](#Factors) for some reasons why we might not want our data to be a factor.

### Basic properties of a data frame

Once created, it's straightforward to get some basic information about the data we have.

```{r }
summary(df)
str(df)
head(df)
tail(df)
```

We can also query metadata about our data frame. Note that, for a data frame, `names` and `colnames` will return the same result. `dim` will give the number of rows and columns, or you can use `nrow` and `ncol` directly. In particular, note what result is returned by the `length` function. If you think about the fact that a data frame is actually a list, you may be able to guess why `length` returns what the value it does.

```{r }
names(df)
colnames(df)
dim(df)
length(df)
nrow(df)
ncol(df)
```

### Merging

If you've got more than one data frame, it's possible to combine them in several different ways: `rbind`, `cbind` and `merge`. 

`rbind` will append rows to the data frame. New rows must have the same number of columns and data types. 

```{r results='hide'}
dfA = df[1:10,]
dfB = df[11:20, ]
rbind(dfA, dfB)
```

`cbind` must have the same number of rows as the data frame.

```{r }
dfC = dfA[, 1:2]
cbind(dfA, dfC)
```

`merge` is similar to a __JOIN__ operation in a relational database. If you're used to __VLOOKUP__ in Excel[^havent_used_vlookup], you'll love `merge`. It's possible to use multiple columns (e.g. state and effective date) when specifying how to join. If no columns are specified, `merge` will use whatever column names the two data frames have in common. Below, we merge a set of rate changes to our premium data.

```{r }
dfRateChange = data.frame(State = c("TX", "CA", "NY"), RateChange = c(.05, -.1, .2))
df = merge(df, dfRateChange)
```

[^havent_used_vlookup]: If you've never used VLOOKUP, I can't imagine why you're reading this.

### `expand.grid`

Consider `expand.grid`. This will create a data frame as the cartesian product (i.e. every combination of elements) of one or more vectors. Among the use cases are to check for missing data elements in another data frame.

```{r}
dfStateClass <- expand.grid(State = c("TX", "CA", "NY")
                  , Class = c(1776, 1066, 1492))
```

## Access and Assignment

Access and assignment will feel like a weird combination of matrices and lists, though with an emphasis on the mechanics of a 2D matrix. We'll often use the `[]` operator to specify which row and column we'd like. The first arugment will refer to the row and the second will refer to the column. If either argument is left blank, it will refer to every element.

```{r eval=FALSE}
df[1, 2]
df[2, ]
df[, 2]
df[2, -1]
```

If we want columns of the data frame, we can approach this one of a few ways. For a single column, we can use either the `$`  or the `[[]]` operators. These work the same as a list (because a data frame _is_ a list). 

```{r eval=FALSE}
df$EarnedPremium
head(df[[1]])
head(df[["EarnedPremium"]])
```

For multiple columns, we can pass a character vector to get columns by name, rather than by position.

```{r}
df[, "EarnedPremium"]
df[, c("EarnedPremium", "State")]
```

Note the interesting case when we specify one argument but leave the other blank. I'll wrap it in a `head` function to minimize the output.

```{r}
head(df[2])
```

Again, if you bear in mind that a data frame is a list, this should makes sense. Without the additional comma, R will assume that we're dealing with a list and will respond accordingly. In this case, that means returning the second element of the list. You can extend this by requesting multiple elements of the list.

```{r}
head(df[1:2])
```

It'll probably be rare that you want this sort of behavior. It's not so rare that you'll want to return a single vector from a data frame, but in these cases, I'd recommend either using the `$`  or the `[[]]` operator.

### Altering and adding columns

We can add columns by using the same operators as for access[^also_cbind].

[^also_cbind]: We can also use `cbind` to add columns.

```{r }
df$Losses <- df$EarnedPremium * runif(nrow(df), 0.4, 1.2)
df$LossRatio = df$Losses / df$EarnedPremium
```

The `transform` and `within` functions will also create a new column.

```{r}
df <- transform(df, LogLoss = log(Losses))
df <- within(df, {LogLoss = log(Losses)})
```


### Eliminating columns

We can eliminate columns in one of two ways. If we only want to remove one column, we can assign the value NULL to it. 

```{r eval=FALSE}
df$LossRatio <- NULL
```

If we want to eliminate more than one column, we may construct a new data frame which only includes the columns we'd like to keep.

```{r eval = FALSE}
df <- df[, 1:2]
df <- df[, c("State", "EarnedPremium")]
```

If we'd like to remove specific columns, Hadley Wickham showed a nice means to do this by using set differences.

```{r}
df <- df[, setdiff(colnames(df), c("RateChange", "Losses"))]
```

### Subsetting

There are at least three ways to subset. The easiest way would be to use the `subset` function. 

```{r }
dfTX = subset(df, State == "TX")
dfBigPolicies = subset(df, EarnedPremium >= 50000)
```

A slightly harder way would be to use logical access

```{r }
dfTX = df[df$State == "TX", ]
dfBigPolicies = df[df$EarnedPremium >= 50000, ]
```

Finally, I could assign the results of a logical expression to a variable and then use this to subset the data frame.

```{r }
whichState = df$State == "TX"
dfTX = df[whichState, ]

whichEP = df$EarnedPremium >= 50000
dfBigPolicies = df[whichEP, ]
```

I use each of these three methods routinely. They're all good.

### Ordering

For vectors, we can use `sort` to get a sorted vector. For a data frame, we'll need to use `order`. Remember that `order` functions a lot like `which`. It will return indices which may then be used to return specific contents.

```{r }
order(df$EarnedPremium)
df = df[order(df$EarnedPremium), ]
```

### Altering column names

```{r eval=FALSE}
df$LossRation = with(df, Losses / EarnedPremium)
names(df)
colnames(df)[4] = "Loss Ratio"
colnames(df)
```

### `with` and `attach`

If the name of a data frame is long, typing it to access column elements might start to seem tedious. The `attach` function will alleviate this, by attaching the data frame onto something called the "search path" (which I might have described in the section on packages). What's the search path? Well, all evidence to the contrary, R will look high and low every time you refer to something. As soon as it finds a match, it'll proceed with whatever calculation you've asked it to do. Attaching the data frame to the search path means that the column names of the data frame will be added to the list of places where R will search.

```{r}
attach(dfA)
# do some stuff
attach(dfB)
```

There are many references that suggest using `attach`. I don't. I'll actually advise against it. Why is `attach` a bad idea? If you add a number of similar data frames or packages it can quickly get hard to keep up. The worst case scenario comes when you add two data frames that share column names and you then proceed to carry out analysis.

```{r}
mean(EarnedPremium)
```

Which EarnedPremium am I talking about? "Well to tell you the truth in all this excitement I kinda lost track myself." [^dirty_harry] `attach` won't necessarily harm you. But it is a loaded gun.

[^dirty_harry]: Copyright some film company.

## Summarizing

```{r }
sum(df$EarnedPremium)
sum(df$EarnedPremium[df$State == "TX"])

aggregate(df[,-1], list(df$State), sum)
```

### Summarizing visually - 1

```{r size='tiny', fig.height=5}
dfByState = aggregate(df$EarnedPremium, list(df$State), sum)
colnames(dfByState) = c("State", "EarnedPremium")
barplot(dfByState$EarnedPremium, names.arg=dfByState$State, col="blue")
```

### Summarizing visually - 2

```{r size='tiny', fig.height=5}
dotchart(dfByState$EarnedPremium, dfByState$State, pch=19)
```

### Advanced data frame tools
  
* dplyr
* tidyr
* reshape2
* data.table

Roughly 90% of your work in R will involve manipulation of data frames. There are truckloads of packages designed to make manipulation of data frames easier. Take your time getting to learn these tools. They're all powerful, but they're all a little different. I'd suggest learning the functions in `base` R first, then moving on to tools like `dplyr` and `data.table`. There's a lot to be gained from understanding the problems those packages were created to solve.

## Reading and writing external data

```{r eval=FALSE}
myData = read.csv("SomeFile.csv")
```

### Reading from Excel

Actually there are several ways:
* XLConnect
* xlsx
* Excelsi-r

```{r eval=FALSE}
library(XLConnect)
wbk = loadWorkbook("myWorkbook.xlsx")
df = readWorksheet(wbk, someSheet)
```

### Reading from the web 

```{r eval=FALSE}
URL = "http://www.casact.org/research/reserve_data/ppauto_pos.csv"
df = read.csv(URL, stringsAsFactors = FALSE)
```

```{r eval=FALSE}
library(XML)
URL = "http://www.pro-football-reference.com/teams/nyj/2012_games.htm"
games = readHTMLTable(URL, stringsAsFactors = FALSE)
```

### Reading from a database 

```{r eval=FALSE}
library(RODBC)
myChannel = odbcConnect(dsn = "MyDSN_Name")
df = sqlQuery(myChannel, "SELECT stuff FROM myTable")
```

### Read some data

```{r eval=FALSE}
df = read.csv("../data-raw/StateData.csv")
```

```{r eval=FALSE}
View(df)
```

## Exercises

* Create a data frame with 500 rows. Include a column for policy effective date, policy expiration date and claim count. For claim count, consider using the `rpois` function to simulate from a poisson distribution.
* Save this data to a .CSV file and then reload that file.
* Which policy had the most claims? Which policy year?
* Add a column for 

### Answer

```{r }
```