diff --git a/Data.Rmd b/Data.Rmd
deleted file mode 100644
index a74361f..0000000
--- a/Data.Rmd
+++ /dev/null
@@ -1,243 +0,0 @@
-# Data Frames
-
-Finally! This
-
-By the end of this chapter you will know:
-
-* What's a data frame and how do I create one?
-* How do I read and write external data?
-
-## What's a data frame?
-
-All of that about vectors and lists was prologue to this. The data frame is a seminal concept in R. Most statistical operations expect one and they are the most common way to pass data in and out of R.
-
-Although critical to understand, this is very, very easy to get. What's a data frame? It's a table. That's it. No, really, that's it.
-
-A data frame is a `list` of `vectors`. Each `vector` may have a different data type, but all must be the same length.
-
-### Creating a data frame
-
-```{r }
-set.seed(1234)
-State = rep(c("TX", "NY", "CA"), 10)
-EarnedPremium = rlnorm(length(State), meanlog = log(50000), sdlog=1)
-EarnedPremium = round(EarnedPremium, -3)
-Losses = EarnedPremium * runif(length(EarnedPremium), min=0.4, max = 0.9)
-
-df = data.frame(State, EarnedPremium, Losses, stringsAsFactors=FALSE)
-```
-
-### Basic properties of a data frame
-
-```{r }
-summary(df)
-str(df)
-```
-
-```{r }
-names(df)
-colnames(df)
-length(df)
-dim(df)
-nrow(df)
-ncol(df)
-```
-
-```{r }
-head(df)
-head(df, 2)
-tail(df)
-```
-
-### Referencing
-
-Very similar to referencing a 2D matrix.
-
-```{r eval=FALSE}
-df[2,3]
-df[2]
-df[2,]
-df[2, -1]
-```
-
-Note the `$` operator to access named columns. A data frame uses the 'name' metadata in the same way as a list.
-
-```{r eval=FALSE}
-df$EarnedPremium
-# Columns of a data frame may be treated as vectors
-df$EarnedPremium[3]
-df[2:4, 1:2]
-df[, "EarnedPremium"]
-df[, c("EarnedPremium", "State")]
-```
-
-### Ordering
-
-```{r }
-order(df$EarnedPremium)
-df = df[order(df$EarnedPremium), ]
-```
-
-### Altering and adding columns
-
-```{r }
-df$LossRatio = df$EarnedPremium / df$Losses
-df$LossRatio = 1 / df$LossRatio
-```
-
-### Eliminating columns
-
-```{r }
-df$LossRatio = NULL
-df = df[, 1:2]
-```
-
-### rbind, cbind
-
-`rbind` will append rows to the data frame. New rows must have the same number of columns and data types. `cbind` must have the same number of rows as the data frame.
-
-```{r results='hide'}
-dfA = df[1:10,]
-dfB = df[11:20, ]
-rbind(dfA, dfB)
-dfC = dfA[, 1:2]
-cbind(dfA, dfC)
-```
-
-### Merging
-
-```{r size='tiny'}
-dfRateChange = data.frame(State =c("TX", "CA", "NY"), RateChange = c(.05, -.1, .2))
-df = merge(df, dfRateChange)
-```
-
-Merging is VLOOKUP on steroids. Basically equivalent to a JOIN in SQL.
-
-### Altering column names
-
-```{r }
-df$LossRation = with(df, Losses / EarnedPremium)
-names(df)
-colnames(df)[4] = "Loss Ratio"
-colnames(df)
-```
-
-### Subsetting - The easy way
-
-```{r }
-dfTX = subset(df, State == "TX")
-dfBigPolicies = subset(df, EarnedPremium >= 50000)
-```
-
-### Subsetting - The hard(ish) way
-
-```{r }
-dfTX = df[df$State == "TX", ]
-dfBigPolicies = df[df$EarnedPremium >= 50000, ]
-```
-
-### Subsetting - Yet another way
-
-```{r }
-whichState = df$State == "TX"
-dfTX = df[whichState, ]
-
-whichEP = df$EarnedPremium >= 50000
-dfBigPolicies = df[whichEP, ]
-```
-
-I use each of these three methods routinely. They're all good.
-
-## Summarizing
-
-```{r }
-sum(df$EarnedPremium)
-sum(df$EarnedPremium[df$State == "TX"])
-
-aggregate(df[,-1], list(df$State), sum)
-```
-
-### Summarizing visually - 1
-
-```{r size='tiny', fig.height=5}
-dfByState = aggregate(df$EarnedPremium, list(df$State), sum)
-colnames(dfByState) = c("State", "EarnedPremium")
-barplot(dfByState$EarnedPremium, names.arg=dfByState$State, col="blue")
-```
-
-### Summarizing visually - 2
-
-```{r size='tiny', fig.height=5}
-dotchart(dfByState$EarnedPremium, dfByState$State, pch=19)
-```
-
-### Advanced data frame tools
-
-* dplyr
-* tidyr
-* reshape2
-* data.table
-
-Roughly 90% of your work in R will involve manipulation of data frames. There are truckloads of packages designed to make manipulation of data frames easier. Take your time getting to learn these tools. They're all powerful, but they're all a little different. I'd suggest learning the functions in `base` R first, then moving on to tools like `dplyr` and `data.table`. There's a lot to be gained from understanding the problems those packages were created to solve.
-
-### Reading data
-
-```{r eval=FALSE}
-myData = read.csv("SomeFile.csv")
-```
-
-### Reading from Excel
-
-Actually there are several ways:
-* XLConnect
-* xlsx
-* Excelsi-r
-
-```{r eval=FALSE}
-library(XLConnect)
-wbk = loadWorkbook("myWorkbook.xlsx")
-df = readWorksheet(wbk, someSheet)
-```
-
-### Reading from the web - 1
-
-```{r eval=FALSE}
-URL = "http://www.casact.org/research/reserve_data/ppauto_pos.csv"
-df = read.csv(URL, stringsAsFactors = FALSE)
-```
-
-### Reading from the web - 2
-
-```{r eval=FALSE}
-library(XML)
-URL = "http://www.pro-football-reference.com/teams/nyj/2012_games.htm"
-games = readHTMLTable(URL, stringsAsFactors = FALSE)
-```
-
-### Reading from a database
-
-```{r eval=FALSE}
-library(RODBC)
-myChannel = odbcConnect(dsn = "MyDSN_Name")
-df = sqlQuery(myChannel, "SELECT stuff FROM myTable")
-```
-
-### Read some data
-
-```{r eval=FALSE}
-df = read.csv("../data-raw/StateData.csv")
-```
-
-```{r eval=FALSE}
-View(df)
-```
-
-### Exercises
-
-* Load the data from "StateData.csv" into a data frame.
-* Which state has the most premium?
-
-### Answer
-
-```{r }
-```
\ No newline at end of file
diff --git a/DataTypes.Rmd b/DataTypes.Rmd
new file mode 100644
index 0000000..6407061
--- /dev/null
+++ b/DataTypes.Rmd
@@ -0,0 +1,173 @@
+# Data {-}
+
+Yay, data! The good stuff! Well, pretty good, but we'll have to walk through some
+
+# Data types {#chap-data-types}
+
+* What are the different data types?
+* When and how is one type converted to another?
+* How can I tell what sort of data I'm working with?
+
+To a human, the difference between something numeric, like a person's age, and something textual, like their name, isn't a big deal. To a computer, however, this matters a lot. In order to ensure that there is sufficient memory to store the information and to ensure that it may be used in an operation, the computer needs to know what type of data it's working with. In other words: 5 + "Steve" = Huh? We'
+
+## Data types
+
+* logical
+* integer
+* double
+* character
+
+### What is it?
+
+```{r}
+x <- 6
+y <- 6L
+z <- TRUE
+typeof(x)
+typeof(y)
+typeof(z)
+is.logical(x)
+is.double(x)
+```
+
+### Data conversion
+
+Most conversion is implicit. For explicit conversion, use the `as.*` functions.
+
+Implicit conversion alters everything to the most complex form of data present as follows:
+
+logical -> integer -> double -> character
+
+Explicit conversion usually implies truncation and loss of information.
+
+```{r }
+# Implicit conversion
+w <- TRUE
+x <- 4L
+y <- 5.8
+z <- w + x + y
+typeof(z)
+
+# Explicit conversion. Note loss of data.
+as.integer(z)
+```
+
+### Class
+
+A class is an extension of the basic data types. We'll see many examples of these. The class of a basic type will be equal to its type apart from 'double', whose class is 'numeric' for reasons I don't pretend to understand.
+
+```{r }
+class(TRUE)
+class(pi)
+class(4L)
+```
+
+### Mode
+
+There is also a function called 'mode' which looks tempting. Ignore it.
+
+### Dates and times
+
+Dates in R can be tricky. Two basic classes: `Date` and `POSIXt`. The `Date` class does not get more granular than days. The `POSIXt` class can handle seconds, milliseconds, etc.
+
+My recommendation is to stick with the "Date" class. Introducing times means introducing time zones and the possibility of confusion or error. Actuaries rarely need to measure things in minutes.
+
+```{r }
+x <- as.Date('2010-01-01')
+class(x)
+typeof(x)
+```
+
+### More on dates
+
+The default behavior for dates is that they don't follow US conventions.
+
+Don't do this:
+```{r error=TRUE}
+x <- as.Date('06-30-2010')
+```
+
+This one runs without error, but R still reads it as year-month-day and silently returns a date in the year 30:
+```{r }
+x <- as.Date('30-06-2010')
+```
+
+If you want to preserve your sanity, stick with year, month, day.
+```{r }
+x <- as.Date('2010-06-30')
+```
+
+### What day is it?
+
+To get the date and time of the computer, use either `Sys.Date()` or `Sys.time()`. Note that `Sys.time()` will return both the day AND the time as a POSIXct object.
+
+```{r }
+x <- Sys.Date()
+y <- Sys.time()
+```
+
+### More reading on dates
+
+It's worth reading the documentation about dates. Measuring time periods is a common task for actuaries. It's easy to make huge mistakes by getting dates wrong.
+
+The `lubridate` package has some nice convenience functions for setting month and day and reasoning about time periods. It also enables you to deal with time zones, leap days and leap seconds. Probably more than you need.
+
+`mondate` was written by an actuary and supports (among other things) handling time periods in terms of months.
+
+* [Date class](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Dates.html)
+* [lubridate](http://www.jstatsoft.org/v40/i03/paper)
+* [Ripley and Hornik](http://www.r-project.org/doc/Rnews/Rnews_2001-2.pdf)
+* [mondate](https://code.google.com/p/mondate/)
+
+### Factors
+
+Another gotcha. Factors were necessary many years ago when data collection and storage were expensive. A factor is a mapping of a character string to an integer. Particularly when importing data, R often wants to convert character values into a factor. You will often want to convert a factor into a string.
+
+```{r }
+myColors <- c("Red", "Blue", "Green", "Red", "Blue", "Red")
+myFactor <- factor(myColors)
+typeof(myFactor)
+class(myFactor)
+is.character(myFactor)
+is.character(myColors)
+```
+
+### Altering factors
+
+```{r }
+# This probably won't give you what you expect
+myOtherFactor <- c(myFactor, "Orange")
+myOtherFactor
+
+# And this gives a warning and stores NA, because "Orange" is not an existing level
+myFactor[length(myFactor)+1] <- "Orange"
+
+# Must do things in two steps: add the level first, then add the value
+myOtherFactor <- factor(myColors, levels = c(levels(myFactor), "Orange"))
+myOtherFactor[length(myOtherFactor)+1] <- "Orange"
+```
+
+### Avoid factors
+
+Now that you know what they are, you can spend the next few months avoiding factors. When R was created, there were compelling reasons to include factors and they still have some utility. More often than not, though, they're a confusing hindrance.
+
+If characters aren't behaving the way you expect them to, check the variables with `is.factor`. Convert them with `as.character` and you'll be back on the road to happiness.
+
+## Exercises
+
+* Create a logical, integer, double and character variable.
+* Can you create a vector with both logical and character values?
+* What happens when you try to add a logical to an integer? An integer to a double?
+
+### Answers
+```{r }
+myLogical <- TRUE
+myInteger <- 1:4
+myDouble <- 3.14
+myCharacter <- "Hello!"
+
+y <- myLogical + myInteger
+typeof(y)
+y <- myInteger + myDouble
+typeof(y)
+```
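
A possible follow-up for the "More on dates" section of DataTypes.Rmd: passing an explicit `format` to `as.Date` removes the guesswork about field order. This is only a minimal sketch and is not part of the patch. The variable names `usDate` and `euDate` are made up for illustration.

```r
# Hypothetical inputs, not taken from the chapter
usDate <- "06-30-2010"  # month-day-year
euDate <- "30-06-2010"  # day-month-year

# Spell out the field order instead of relying on as.Date's defaults
x <- as.Date(usDate, format = "%m-%d-%Y")
y <- as.Date(euDate, format = "%d-%m-%Y")

# Compare the two parsed dates and print one in year-month-day order
identical(x, y)
format(x, "%Y-%m-%d")
```

Printing the result in year-month-day order keeps the output in line with the chapter's own advice to stick with year, month, day.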