diff --git a/DataFrames.Rmd b/DataFrames.Rmd new file mode 100644 index 0000000..a74361f --- /dev/null +++ b/DataFrames.Rmd @@ -0,0 +1,243 @@ +# Data Frames + +Finally! This + +By the end of this chapter you will know: + +* What's a data frame and how do I create one? +* How do I read and write external data? + +## What's a data frame? + +All of that about vectors and lists was prologue to this. The data frame is a seminal concept in R. Most statistical operations expect one and they are the most common way to pass data in and out of R. + +Although critical to understand, this is very, very easy to get. What's a data frame? It's a table. That's it. No, really, that's it. + +A data frame is a `list` of `vectors`. Each `vector` may have a different data type, but all must be the same length. + +### Creating a data frame + +```{r } +set.seed(1234) +State = rep(c("TX", "NY", "CA"), 10) +EarnedPremium = rlnorm(length(State), meanlog = log(50000), sdlog=1) +EarnedPremium = round(EarnedPremium, -3) +Losses = EarnedPremium * runif(length(EarnedPremium), min=0.4, max = 0.9) + +df = data.frame(State, EarnedPremium, Losses, stringsAsFactors=FALSE) +``` + +### Basic properties of a data frame + +```{r } +summary(df) +str(df) +``` + +```{r } +names(df) +colnames(df) +length(df) +dim(df) +nrow(df) +ncol(df) +``` + +```{r } +head(df) +head(df, 2) +tail(df) +``` + +### Referencing + +Very similar to referencing a 2D matrix. + +```{r eval=FALSE} +df[2,3] +df[2] +df[2,] +df[2, -1] +``` + +Note the `$` operator to access named columns. A data frame uses the 'name' metadata in the same way as a list. + +```{r eval=FALSE} +df$EarnedPremium +# Columns of a data frame may be treated as vectors +df$EarnedPremium[3] +df[2:4, 1:2] +df[, "EarnedPremium"] +df[, c("EarnedPremium", "State")] +``` + +### Ordering + +```{r } +order(df$EarnedPremium) +df = df[order(df$EarnedPremium), ] +``` + +### Altering and adding columns + +```{r } +df$LossRatio = df$EarnedPremium / df$Losses +df$LossRatio = 1 / df$LossRatio +``` + +### Eliminating columns + +```{r } +df$LossRatio = NULL +df = df[, 1:2] +``` + +### rbind, cbind + +`rbind` will append rows to the data frame. New rows must have the same number of columns and data types. `cbind` must have the same number of rows as the data frame. + +```{r results='hide'} +dfA = df[1:10,] +dfB = df[11:20, ] +rbind(dfA, dfB) +dfC = dfA[, 1:2] +cbind(dfA, dfC) +``` + +### Merging + +```{r size='tiny'} +dfRateChange = data.frame(State =c("TX", "CA", "NY"), RateChange = c(.05, -.1, .2)) +df = merge(df, dfRateChange) +``` + +Merging is VLOOKUP on steroids. Basically equivalent to a JOIN in SQL. + +### Altering column names + +```{r } +df$LossRation = with(df, Losses / EarnedPremium) +names(df) +colnames(df)[4] = "Loss Ratio" +colnames(df) +``` + +### Subsetting - The easy way + +```{r } +dfTX = subset(df, State == "TX") +dfBigPolicies = subset(df, EarnedPremium >= 50000) +``` + +### Subsetting - The hard(ish) way + +```{r } +dfTX = df[df$State == "TX", ] +dfBigPolicies = df[df$EarnedPremium >= 50000, ] +``` + +### Subsetting - Yet another way + +```{r } +whichState = df$State == "TX" +dfTX = df[whichState, ] + +whichEP = df$EarnedPremium >= 50000 +dfBigPolicies = df[whichEP, ] +``` + +I use each of these three methods routinely. They're all good. + +## Summarizing + +```{r } +sum(df$EarnedPremium) +sum(df$EarnedPremium[df$State == "TX"]) + +aggregate(df[,-1], list(df$State), sum) +``` + +### Summarizing visually - 1 + +```{r size='tiny', fig.height=5} +dfByState = aggregate(df$EarnedPremium, list(df$State), sum) +colnames(dfByState) = c("State", "EarnedPremium") +barplot(dfByState$EarnedPremium, names.arg=dfByState$State, col="blue") +``` + +### Summarizing visually - 2 + +```{r size='tiny', fig.height=5} +dotchart(dfByState$EarnedPremium, dfByState$State, pch=19) +``` + +### Advanced data frame tools + +* dplyr +* tidyr +* reshape2 +* data.table + +Roughly 90% of your work in R will involve manipulation of data frames. There are truckloads of packages designed to make manipulation of data frames easier. Take your time getting to learn these tools. They're all powerful, but they're all a little different. I'd suggest learning the functions in `base` R first, then moving on to tools like `dplyr` and `data.table`. There's a lot to be gained from understanding the problems those packages were created to solve. + +### Reading data + +```{r eval=FALSE} +myData = read.csv("SomeFile.csv") +``` + +### Reading from Excel + +Actually there are several ways: +* XLConnect +* xlsx +* Excelsi-r + +```{r eval=FALSE} +library(XLConnect) +wbk = loadWorkbook("myWorkbook.xlsx") +df = readWorksheet(wbk, someSheet) +``` + +### Reading from the web - 1 + +```{r eval=FALSE} +URL = "http://www.casact.org/research/reserve_data/ppauto_pos.csv" +df = read.csv(URL, stringsAsFactors = FALSE) +``` + +### Reading from the web - 2 + +```{r eval=FALSE} +library(XML) +URL = "http://www.pro-football-reference.com/teams/nyj/2012_games.htm" +games = readHTMLTable(URL, stringsAsFactors = FALSE) +``` + +### Reading from a database + +```{r eval=FALSE} +library(RODBC) +myChannel = odbcConnect(dsn = "MyDSN_Name") +df = sqlQuery(myChannel, "SELECT stuff FROM myTable") +``` + +### Read some data + +```{r eval=FALSE} +df = read.csv("../data-raw/StateData.csv") +``` + +```{r eval=FALSE} +View(df) +``` + +### Exercises + +* Load the data from "StateData.csv" into a data frame. +* Which state has the most premium? + +### Answer + +```{r } +``` \ No newline at end of file