Skip to content

Commit

Permalink
Initial commit
Browse files Browse the repository at this point in the history
  • Loading branch information
Brian Fannin authored and Brian Fannin committed Sep 1, 2016
1 parent a9c087d commit 6862635
Showing 1 changed file with 243 additions and 0 deletions.
243 changes: 243 additions & 0 deletions DataFrames.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,243 @@
# Data Frames

Finally! This

By the end of this chapter you will know:

* What's a data frame and how do I create one?
* How do I read and write external data?

## What's a data frame?

All of that about vectors and lists was prologue to this. The data frame is a seminal concept in R. Most statistical operations expect one and they are the most common way to pass data in and out of R.

Although critical to understand, this is very, very easy to get. What's a data frame? It's a table. That's it. No, really, that's it.

A data frame is a `list` of `vectors`. Each `vector` may have a different data type, but all must be the same length.

### Creating a data frame

```{r }
set.seed(1234)
State = rep(c("TX", "NY", "CA"), 10)
EarnedPremium = rlnorm(length(State), meanlog = log(50000), sdlog=1)
EarnedPremium = round(EarnedPremium, -3)
Losses = EarnedPremium * runif(length(EarnedPremium), min=0.4, max = 0.9)
df = data.frame(State, EarnedPremium, Losses, stringsAsFactors=FALSE)
```

### Basic properties of a data frame

```{r }
summary(df)
str(df)
```

```{r }
names(df)
colnames(df)
length(df)
dim(df)
nrow(df)
ncol(df)
```

```{r }
head(df)
head(df, 2)
tail(df)
```

### Referencing

Very similar to referencing a 2D matrix.

```{r eval=FALSE}
df[2,3]
df[2]
df[2,]
df[2, -1]
```

Note the `$` operator to access named columns. A data frame uses the 'name' metadata in the same way as a list.

```{r eval=FALSE}
df$EarnedPremium
# Columns of a data frame may be treated as vectors
df$EarnedPremium[3]
df[2:4, 1:2]
df[, "EarnedPremium"]
df[, c("EarnedPremium", "State")]
```

### Ordering

```{r }
order(df$EarnedPremium)
df = df[order(df$EarnedPremium), ]
```

### Altering and adding columns

```{r }
df$LossRatio = df$EarnedPremium / df$Losses
df$LossRatio = 1 / df$LossRatio
```

### Eliminating columns

```{r }
df$LossRatio = NULL
df = df[, 1:2]
```

### rbind, cbind

`rbind` will append rows to the data frame. New rows must have the same number of columns and data types. `cbind` must have the same number of rows as the data frame.

```{r results='hide'}
dfA = df[1:10,]
dfB = df[11:20, ]
rbind(dfA, dfB)
dfC = dfA[, 1:2]
cbind(dfA, dfC)
```

### Merging

```{r size='tiny'}
dfRateChange = data.frame(State =c("TX", "CA", "NY"), RateChange = c(.05, -.1, .2))
df = merge(df, dfRateChange)
```

Merging is VLOOKUP on steroids. Basically equivalent to a JOIN in SQL.

### Altering column names

```{r }
df$LossRation = with(df, Losses / EarnedPremium)
names(df)
colnames(df)[4] = "Loss Ratio"
colnames(df)
```

### Subsetting - The easy way

```{r }
dfTX = subset(df, State == "TX")
dfBigPolicies = subset(df, EarnedPremium >= 50000)
```

### Subsetting - The hard(ish) way

```{r }
dfTX = df[df$State == "TX", ]
dfBigPolicies = df[df$EarnedPremium >= 50000, ]
```

### Subsetting - Yet another way

```{r }
whichState = df$State == "TX"
dfTX = df[whichState, ]
whichEP = df$EarnedPremium >= 50000
dfBigPolicies = df[whichEP, ]
```

I use each of these three methods routinely. They're all good.

## Summarizing

```{r }
sum(df$EarnedPremium)
sum(df$EarnedPremium[df$State == "TX"])
aggregate(df[,-1], list(df$State), sum)
```

### Summarizing visually - 1

```{r size='tiny', fig.height=5}
dfByState = aggregate(df$EarnedPremium, list(df$State), sum)
colnames(dfByState) = c("State", "EarnedPremium")
barplot(dfByState$EarnedPremium, names.arg=dfByState$State, col="blue")
```

### Summarizing visually - 2

```{r size='tiny', fig.height=5}
dotchart(dfByState$EarnedPremium, dfByState$State, pch=19)
```

### Advanced data frame tools

* dplyr
* tidyr
* reshape2
* data.table

Roughly 90% of your work in R will involve manipulation of data frames. There are truckloads of packages designed to make manipulation of data frames easier. Take your time getting to learn these tools. They're all powerful, but they're all a little different. I'd suggest learning the functions in `base` R first, then moving on to tools like `dplyr` and `data.table`. There's a lot to be gained from understanding the problems those packages were created to solve.

### Reading data

```{r eval=FALSE}
myData = read.csv("SomeFile.csv")
```

### Reading from Excel

Actually there are several ways:
* XLConnect
* xlsx
* Excelsi-r

```{r eval=FALSE}
library(XLConnect)
wbk = loadWorkbook("myWorkbook.xlsx")
df = readWorksheet(wbk, someSheet)
```

### Reading from the web - 1

```{r eval=FALSE}
URL = "http://www.casact.org/research/reserve_data/ppauto_pos.csv"
df = read.csv(URL, stringsAsFactors = FALSE)
```

### Reading from the web - 2

```{r eval=FALSE}
library(XML)
URL = "http://www.pro-football-reference.com/teams/nyj/2012_games.htm"
games = readHTMLTable(URL, stringsAsFactors = FALSE)
```

### Reading from a database

```{r eval=FALSE}
library(RODBC)
myChannel = odbcConnect(dsn = "MyDSN_Name")
df = sqlQuery(myChannel, "SELECT stuff FROM myTable")
```

### Read some data

```{r eval=FALSE}
df = read.csv("../data-raw/StateData.csv")
```

```{r eval=FALSE}
View(df)
```

### Exercises

* Load the data from "StateData.csv" into a data frame.
* Which state has the most premium?

### Answer

```{r }
```

0 comments on commit 6862635

Please sign in to comment.