diff --git a/.travis.yml b/.travis.yml new file mode 100644 index 0000000..a03b66c --- /dev/null +++ b/.travis.yml @@ -0,0 +1,10 @@ +language: r +cache: packages + +before_script: + - chmod +x ./_build.sh + - chmod +x ./_deploy.sh + +script: + - ./_build.sh + - ./_deploy.sh diff --git a/DataFrames.Rmd b/DataFrames.Rmd index a74361f..b5eb2c7 100644 --- a/DataFrames.Rmd +++ b/DataFrames.Rmd @@ -1,125 +1,149 @@ # Data Frames -Finally! This +This is the big one! All of that stuff about vectors and lists was prologue to this. The data frame is a seminal concept in R. Most statistical and predictive models expect one and they are the most common way to pass data in and out of R. Although critical to understand, data frames are very, very easy to get. What's a data frame? It's a table. That's it. No, really, that's it. -By the end of this chapter you will know: +By the end of this chapter you will know the following: * What's a data frame and how do I create one? +* How can I access and assign elements to and from a data frame? * How do I read and write external data? ## What's a data frame? -All of that about vectors and lists was prologue to this. The data frame is a seminal concept in R. Most statistical operations expect one and they are the most common way to pass data in and out of R. - -Although critical to understand, this is very, very easy to get. What's a data frame? It's a table. That's it. No, really, that's it. - -A data frame is a `list` of `vectors`. Each `vector` may have a different data type, but all must be the same length. +Underneath the hood, a data frame is actually a mashup of lists and vectors. Every data frame is a list, but it's constrained so that each list element is a vector with the same length. We can see how this exploits some of the fundamental properties of lists and vectors. Because each vector must have the same data type, there's no danger that I'll get character data when I only want dates or integers. 
At the same time, the list's flexibility means that we can store different data types in one single data construct. ### Creating a data frame +Although there are many functions that will return a data frame, let's start by looking at the `data.frame` function. We'll first create some vectors and then join them together. + ```{r } set.seed(1234) -State = rep(c("TX", "NY", "CA"), 10) -EarnedPremium = rlnorm(length(State), meanlog = log(50000), sdlog=1) -EarnedPremium = round(EarnedPremium, -3) -Losses = EarnedPremium * runif(length(EarnedPremium), min=0.4, max = 0.9) +State <- rep(c("TX", "NY", "CA"), 10) +EarnedPremium <- rlnorm(length(State), meanlog = log(50000), sdlog=1) + +df <- data.frame(State, EarnedPremium, stringsAsFactors=FALSE) +``` + +We didn't have to create the vectors first. If we wanted, we could pass them in as the result of function calls within the call to `data.frame`. -df = data.frame(State, EarnedPremium, Losses, stringsAsFactors=FALSE) +```{r} +set.seed(1234) +df <- data.frame(State = rep(c("TX", "NY", "CA"), 10) + , EarnedPremium = round(rlnorm(length(State), meanlog = log(50000), sdlog=1), 3) + , stringsAsFactors = FALSE) ``` +Note that I've set the "stringsAsFactors" argument to FALSE. If I hadn't, the column "State" would be a factor, rather than a character. See [Data Types: factors](#Factors) for some reasons why we might not want our data to be a factor. + ### Basic properties of a data frame +Once created, it's straightforward to get some basic information about the data we have. + ```{r } summary(df) str(df) +head(df) +tail(df) ``` +We can also query metadata about our data frame. Note that, for a data frame, `names` and `colnames` will return the same result. `dim` will give the number of rows and columns, or you can use `nrow` and `ncol` directly. In particular, note what result is returned by the `length` function. 
If you think about the fact that a data frame is actually a list, you may be able to guess why `length` returns the value it does. + ```{r } names(df) colnames(df) -length(df) dim(df) +length(df) nrow(df) ncol(df) ``` +### Merging + +If you've got more than one data frame, it's possible to combine them in several different ways: `rbind`, `cbind` and `merge`. + +`rbind` will append rows to the data frame. New rows must have the same number of columns and data types. + +```{r results='hide'} +dfA = df[1:10,] +dfB = df[11:20, ] +rbind(dfA, dfB) +``` + +`cbind` will append columns. The new columns must have the same number of rows as the data frame. + ```{r } -head(df) -head(df, 2) -tail(df) +dfC = dfA[, 1:2] +cbind(dfA, dfC) +``` + +`merge` is similar to a __JOIN__ operation in a relational database. If you're used to __VLOOKUP__ in Excel[^havent_used_vlookup], you'll love `merge`. It's possible to use multiple columns (e.g. state and effective date) when specifying how to join. If no columns are specified, `merge` will use whatever column names the two data frames have in common. Below, we merge a set of rate changes to our premium data. + +```{r } +dfRateChange = data.frame(State = c("TX", "CA", "NY"), RateChange = c(.05, -.1, .2)) +df = merge(df, dfRateChange) +``` + +[^havent_used_vlookup]: If you've never used VLOOKUP, I can't imagine why you're reading this. + +Finally, consider `expand.grid`. This will create a data frame as the Cartesian product (i.e. every combination of elements) of one or more vectors. One use case is checking for missing data elements in another data frame. + +```{r} +dfStateClass <- expand.grid(State = c("TX", "CA", "NY") + , Class = c(8810, 1066, 1492)) ``` -### Referencing +## Access and Assignment -Very similar to referencing a 2D matrix. +Access and assignment will feel like a weird combination of matrices and lists, though with an emphasis on the mechanics of a 2D matrix. We'll often use the `[]` operator. 
The first argument will refer to the row and the second will refer to the column. If either argument is left blank, it will refer to every row or every column, respectively. ```{r eval=FALSE} -df[2,3] -df[2] -df[2,] +df[1, 2] +df[2, ] +df[, 2] df[2, -1] ``` -Note the `$` operator to access named columns. A data frame uses the 'name' metadata in the same way as a list. +If we want columns of the data frame, we can approach this one of two ways. The `$` operator works the same as a list (because a data frame _is_ a list) to access named columns. We can also pass character strings in as arguments to get columns by name, rather than by position. ```{r eval=FALSE} df$EarnedPremium -# Columns of a data frame may be treated as vectors -df$EarnedPremium[3] -df[2:4, 1:2] df[, "EarnedPremium"] df[, c("EarnedPremium", "State")] ``` -### Ordering +If we're only interested in one column, we can use the `[[]]` operator to return it. -```{r } -order(df$EarnedPremium) -df = df[order(df$EarnedPremium), ] +```{r} +head(df[[1]]) +head(df[["EarnedPremium"]]) ``` -### Altering and adding columns +Note the interesting case when we specify only one argument and omit the comma. I'll wrap it in a `head` function to minimize the output. -```{r } -df$LossRatio = df$EarnedPremium / df$Losses -df$LossRatio = 1 / df$LossRatio +```{r} +head(df[2]) ``` -### Eliminating columns +Again, if you bear in mind that a data frame is a list, this makes a bit more sense. Without the additional comma, R will assume that we're dealing with a list and will respond accordingly. In this case, that means returning the second element of the list, still wrapped in a data frame. You can extend this by requesting multiple elements of the list. -```{r } -df$LossRatio = NULL -df = df[, 1:2] +```{r} +head(df[1:2]) ``` -### rbind, cbind +It'll probably be rare that you want this sort of behavior. It's not so rare that you'll want to return a single vector from a data frame, but in these cases, I'd recommend either using the `$` or the `[[]]` operator. 
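
The difference between these operators is easiest to see by checking what each one returns. The sketch below uses a small made-up data frame (`dfToy` and its values are assumptions for illustration, not part of the chapter's data).

```{r}
# dfToy is a hypothetical data frame used only for illustration
dfToy <- data.frame(State = c("TX", "NY", "CA")
                    , EarnedPremium = c(100, 200, 300)
                    , stringsAsFactors = FALSE)

# Single brackets without a comma treat the data frame as a list
# and return another data frame
class(dfToy[2])

# Double brackets and `$` return the column itself as a vector
class(dfToy[[2]])
class(dfToy$EarnedPremium)

# With a comma, a single column is simplified to a vector by default;
# drop = FALSE keeps it as a data frame
class(dfToy[, 2])
class(dfToy[, 2, drop = FALSE])
```

If a function downstream expects a vector, `[[]]` or `$` is the safer choice. If it expects a data frame, single brackets (or `drop = FALSE`) will keep the structure intact.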
-`rbind` will append rows to the data frame. New rows must have the same number of columns and data types. `cbind` must have the same number of rows as the data frame. - -```{r results='hide'} -dfA = df[1:10,] -dfB = df[11:20, ] -rbind(dfA, dfB) -dfC = dfA[, 1:2] -cbind(dfA, dfC) -``` - -### Merging +### Altering and adding columns -```{r size='tiny'} -dfRateChange = data.frame(State =c("TX", "CA", "NY"), RateChange = c(.05, -.1, .2)) -df = merge(df, dfRateChange) +```{r } +df$Losses <- df$EarnedPremium * runif(nrow(df), 0.4, 1.2) +df$LossRatio = df$Losses / df$EarnedPremium ``` -Merging is VLOOKUP on steroids. Basically equivalent to a JOIN in SQL. - -### Altering column names +### Eliminating columns ```{r } -df$LossRation = with(df, Losses / EarnedPremium) -names(df) -colnames(df)[4] = "Loss Ratio" -colnames(df) +df$LossRatio = NULL +df = df[, 1:2] ``` ### Subsetting - The easy way @@ -148,6 +172,40 @@ dfBigPolicies = df[whichEP, ] I use each of these three methods routinely. They're all good. +### Ordering + +```{r } +order(df$EarnedPremium) +df = df[order(df$EarnedPremium), ] +``` + +### Altering column names + +```{r eval=FALSE} +df$LossRation = with(df, Losses / EarnedPremium) +names(df) +colnames(df)[4] = "Loss Ratio" +colnames(df) +``` + +### `with` and `attach` + +If the name of a data frame is long, continually typing it to access column elements might start to seem tedious. There are any number of texts that will use the `attach` function to alleviate this. This will attach the data frame onto something called the "search path" (which I might have described in the section on packages). What's the search path? Well, all evidence to the contrary, R will look high and low every time you refer to something. As soon as it finds a match, it'll proceed with whatever calculation you've asked it to do. Attaching the data frame to the search path means that the column names of the data frame will be added to the list of places where R will search. 
+ +```{r} +attach(dfA) +# do some stuff +attach(dfB) +``` + +So why is this a bad idea? If you add a number of similar data frames or packages it can quickly get hard to keep up. The worst case scenario comes when you add two data frames that share column names and you then proceed to carry out analysis. + +```{r} +mean(EarnedPremium) +``` + +Which EarnedPremium am I talking about? In all the confusion, I lost track myself. Well, do you feel lucky punk? + ## Summarizing ```{r } @@ -180,7 +238,7 @@ dotchart(dfByState$EarnedPremium, dfByState$State, pch=19) Roughly 90% of your work in R will involve manipulation of data frames. There are truckloads of packages designed to make manipulation of data frames easier. Take your time getting to learn these tools. They're all powerful, but they're all a little different. I'd suggest learning the functions in `base` R first, then moving on to tools like `dplyr` and `data.table`. There's a lot to be gained from understanding the problems those packages were created to solve. -### Reading data +## Reading and writing external data ```{r eval=FALSE} myData = read.csv("SomeFile.csv") @@ -199,15 +257,13 @@ wbk = loadWorkbook("myWorkbook.xlsx") df = readWorksheet(wbk, someSheet) ``` -### Reading from the web - 1 +### Reading from the web ```{r eval=FALSE} URL = "http://www.casact.org/research/reserve_data/ppauto_pos.csv" df = read.csv(URL, stringsAsFactors = FALSE) ``` -### Reading from the web - 2 - ```{r eval=FALSE} library(XML) URL = "http://www.pro-football-reference.com/teams/nyj/2012_games.htm" @@ -232,8 +288,9 @@ df = read.csv("../data-raw/StateData.csv") View(df) ``` -### Exercises +## Exercises +* * Load the data from "StateData.csv" into a data frame. * Which state has the most premium? diff --git a/DataTypes.Rmd b/DataTypes.Rmd index 6407061..60250fe 100644 --- a/DataTypes.Rmd +++ b/DataTypes.Rmd @@ -1,27 +1,32 @@ # Data {-} -Yay, data! The good stuff! Well, pretty good, but we'll have to walk through some +Yay, data! 
The good stuff! Well, pretty good, but we've got a lot to cover. Some of this may not make much sense on a first read. Don't get bogged down, press on and refer to this later if you feel you need to. # Data types {#chap-data-types} +To a human, the difference between something numeric- like a person's age- and something textual - like their name - isn't a big deal. To a computer, however, this matters a lot. In order to ensure that there is sufficient memory to store the information and to ensure that it may be used in an operation, the computer needs to know what type of data it's working with. In other words: 5 + "Steve" = Huh? + +In this chapter, we'll talk through the various primitive data types that R supports. By the end of this chapter, you will be able to answer the following: + * What are the different data types? * When and how is one type converted to another? -* How can I tell what sort of data I'm working with? - -To a human, the difference between something numeric- like a person's age- and something textual - like their name - isn't a big deal. To a computer, however, this matters a lot. In order to ensure that there is sufficient memory to store the information and to ensure that it may be used in an operation, the computer needs to know what type of data it's working with. In other words: 5 + "Steve" = Huh? We' +* How can I work with dates? +* What the heck is a factor? ## Data types +R supports four "primitive" data types as shown below: + * logical * integer * double * character -### What is it? +To know what type of data you're working with, you use the (wait for it) `typeof` function. If you want to test for a specific data type, you can use the suite of `is.` functions. Have a look at the example below. Note that when we want something to be an integer, we type the letter "L" after the number. 
```{r} x <- 6 -y <- 6L +y <- 3L z <- TRUE typeof(x) typeof(y) @@ -30,15 +35,27 @@ is.logical(x) is.double(x) ``` -### Data conversion +## Data conversion -Most conversion is implicit. For explicit conversion, use the `as.*` functions. +It's possible to convert from one type to another. Most of the time, this happens implicitly as part of an operation. R will alter data in order for calculations to take place. For example, let's say that I'm adding together `x` and `y` from the code snippet above. We know that an integer and a real number will add together easily, but the computer needs to convert the integer before the operation can take place. -Implicit conversion alters everything to the most complex form of data present as follows: +```{r} +typeof(x + y) +``` + +Implicit conversion will change data types in the order shown below. Note that all data types for an operation will be converted to the most complex type involved in the calculation. logical -> integer -> double -> character -Explicit conversion usually implies truncation and loss of information. +Note that implicit conversion can't always help us. Let's try the example from the start of this chapter. + +```{r error=TRUE} +5 + 'Steve' +``` + +Here, R is telling us that it doesn't know how to add a number and a word. I don't either. + +For explicit conversion, use the `as.*` functions. When explicit conversion is used to convert a value to a simpler data type - double to integer, say - there will likely be a loss of information. ```{r } # Implicit conversion @@ -52,25 +69,39 @@ typeof(z) as.integer(z) ``` -### Class +In addition to `typeof` there are two other functions which will return basic information about an object. -A class is an extension of the basic data types. We'll see many examples of these. The class of a basic type will be equal to its type apart from 'double', whose class is 'numeric' for reasons I don't pretend to understand. 
+The `mode` of an object will return a value indicating how the object is meant to be stored. This will generally mirror the output produced by `typeof` except that doubles and integers both have a mode of "numeric". This function has never improved my life and it won't be discussed any further. + +The `class` of an object is a very special kind of metadata. (We'll get more into metadata in the next chapter [Vectors](#chap-vectors).) When we get beyond primitive data types, this starts to become important. We'll see two examples in just a moment when we talk about dates and factors. The class of a basic type will be equal to its type apart from 'double', whose class is 'numeric' for reasons I don't pretend to understand. ```{r } class(TRUE) class(pi) class(4L) +class(Sys.Date()) ``` -### Mode +The table below summarizes most of the ways we can sort out what sort of data we're working with. + +```{r echo = FALSE} +dfDataTypeFunctions <- data.frame(Function = c("typeof", "mode", "class", "inherits", "is.") + , Returns = c("The type of the object" + , "Storage mode of the object" + , "The class(es) of the object" + , "Whether the object is a particular class" + , "Whether the object is a particular type")) -There is also a function called 'mode' which looks tempting. Ignore it. +knitr::kable( + dfDataTypeFunctions, booktabs = TRUE, + caption = 'Functions for determining what sort of data we are working with' +) +``` -### Dates and times -Dates in R can be tricky. Two basic classes: `Date` and `POSIXt`. The `Date` class does not get more granular than days. The `POSIXt` class can handle seconds, milliseconds, etc. +## Dates and times -My recommendation is to stick with the "Date" class. Introducing times means introducing time zones and possibility for confusion or error. Actuaries rarely need to measure things in minutes. +Dates in R can be tricky. There are two basic classes: `Date` and `POSIXt`. The `Date` class does not get more granular than days. 
The `POSIXt` class can handle seconds, milliseconds, etc. My recommendation is to stick with the "Date" class. Introducing times means introducing time zones and the possibility for confusion or error. Actuaries rarely need to measure things in minutes. ```{r } x <- as.Date('2010-01-01') @@ -78,27 +109,19 @@ class(x) typeof(x) ``` -### More on dates +By default, dates don't follow US conventions. Much like avoiding the metric system, United Statesians are sticking with a convention that doesn't have a lot of logical support. If you want to preserve your sanity, stick with year, month, day order. -The default behavior for dates is that they don't follow US conventions. - -Don't do this: ```{r error=TRUE} +# Don't do this: x <- as.Date('06-30-2010') -``` -But this is just fine: -```{r } +# But this is just fine: x <- as.Date('30-06-2010') -``` -If you want to preserve your sanity, stick with year, month, day. -```{r } +# Year, month, day is your friend x <- as.Date('2010-06-30') ``` -### What day is it? - To get the date and time of the computer, use the either `Sys.Date()` or `Sys.time()`. Note that `Sys.time()` will return both the day AND the time as a POSIXct object. ```{r } @@ -106,33 +129,46 @@ x <- Sys.Date() y <- Sys.time() ``` -### More reading on dates +It's worth reading the documentation about dates. Measuring time periods is a common task for actuaries. It's easy to make huge mistakes by getting dates wrong. + +The `lubridate` package has some nice convenience functions for setting month and day and reasoning about time periods. It also enables you to deal with time zones, leap days and leap seconds. This is probably more than most folks need, but it's worth looking into. -Worth reading the documentation about dates. Measuring time periods is a common task for actuaries. It's easy to make huge mistakes by getting dates wrong. +The `mondate` package was written by Daniel Murphy (an actuary) and supports handling time periods in terms of months. 
This is a very good thing. You'll quickly learn that the base functions don't like dealing with time periods as measured in months. Why? Because they're all different lengths. It's not clear how to add "one month" to a set of dates. And yet, we very often want to do this. An easy example is adding a set of months to the last day in a month. The close of a quarter is a common task in financial circles. The code below will produce the end of the quarter for a single year. [^mondate_quarter] -The `lubridate` package has some nice convenience functions for setting month and day and reasoning about time periods. It also enables you to deal with time zones, leap days and leap seconds. Probably more than you need. +[^mondate_quarter]: You can also use the `quarter` function to achieve much the same thing. -`mondate` was written by an actuary and supports (among other things) handling time periods in terms of months. +```{r} +library(mondate) +add(mondate("2010-03-31"), c(0, 3, 6, 9), units = "months") +``` + +The items below are all worth reading. + +* Date class: [https://stat.ethz.ch/R-manual/R-devel/library/base/html/Dates.html](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Dates.html) +* lubridate: [http://www.jstatsoft.org/v40/i03/paper](http://www.jstatsoft.org/v40/i03/paper) +* Ripley and Hornik: http://www.r-project.org/doc/Rnews/Rnews_2001-2.pdf +* mondate: (https://code.google.com/p/mondate/) -* [Date class](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Dates.html) -* [lubridate](http://www.jstatsoft.org/v40/i03/paper) -* [Ripley and Hornik](http://www.r-project.org/doc/Rnews/Rnews_2001-2.pdf) -* [mondate](https://code.google.com/p/mondate/) +## Factors -### Factors +Factors are a pretty big gotcha. They were necessary many years ago when data collection and storage were expensive. A factor maps a character string to an integer, so that it takes up less space. 
The code below will illustrate the difference between a factor and a comparable character vector[^getting_to_vectors]. -Another gotcha. Factors were necessary many years ago when data collection and storage were expensive. A factor is a mapping of a character string to an integer. Particularly when importing data, R often wants to convert character values into a factor. You will often want to convert a factor into a string. +[^getting_to_vectors]: We haven't covered vectors yet, but we're getting there. If this code is confusing, just skip this section for now and come back after you've read up on vectors. ```{r } myColors <- c("Red", "Blue", "Green", "Red", "Blue", "Red") myFactor <- factor(myColors) +myColors +myFactor typeof(myFactor) class(myFactor) is.character(myFactor) is.character(myColors) ``` -### Altering factors +Note that when we printed the value of `myFactor` we got the list of colors, but without the quotes around them. We are also told that our object has "Levels". This is important as it defines the set of possible values for the factor. This is rather useful if you have a data set where the permissible values are constrained to a closed set, like gender, education, smoker/non-smoker, etc. + +So, what happens if we want to add a new element to our factor? ```{r } # This probably won't give you what you expect @@ -147,11 +183,15 @@ myOtherFactor <- factor(c(levels(myFactor), "Orange")) myOtherFactor[length(myOtherFactor)+1] <- "Orange" ``` -### Avoid factors +Ugh. In the first instance, R recognizes that it can't append a new item to the factor. So, it converts the values to a string and then appends the string "Orange". But note that the items are string values of integers. That's because the underlying data of a factor _is_ an integer. In the second instance, we first have to change the levels of the factor and then we can append our new data element. 
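
To make the "underlying data is an integer" point concrete, here is a minimal sketch that inspects the integer codes and then appends a value in two different ways. The name `myNewFactor` is an assumption introduced for illustration; the color data repeats the example above.

```{r}
myColors <- c("Red", "Blue", "Green", "Red", "Blue", "Red")
myFactor <- factor(myColors)

# The levels are the sorted unique values
levels(myFactor)

# Underneath, each element is an integer index into the levels
as.integer(myFactor)

# One way to append: extend the levels first, then assign
levels(myFactor) <- c(levels(myFactor), "Orange")
myFactor[length(myFactor) + 1] <- "Orange"

# Another way: append to the character vector and rebuild the factor
myNewFactor <- factor(c(myColors, "Orange"))
```

The second approach is often easier to reason about, since the levels are derived from the data rather than maintained by hand. Note that the two results order their levels differently.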
+ +Often when creating a data frame, R's default behavior is to convert character values into a factor. When we get to creating data frames and importing data, you'll often see us use code like the following: -Now that you know what they are, you can spend the next few months avoiding factors. When R was created, there were compelling reasons to include factors and they still have some utility. More often than not, though, they're a confusing hindrance. +```{r eval=FALSE} +mojo <- read.csv("myFile.csv", stringsAsFactors = FALSE) +``` -If characters aren't behaving the way you expect them to, check the variables with `is.factor`. Convert them with `as.character` and you'll be back on the road to happiness. +Now that you know what they are, you can spend the next few months avoiding factors. When R was created, there were compelling reasons to include factors and they still have some utility. More often than not, though, they're a confusing hindrance. If characters aren't behaving the way you expect them to, check the variables with `class` or `is.factor`. Convert them with `as.character` and you'll be back on the road to happiness. ## Exercises diff --git a/GettingStarted.Rmd b/GettingStarted.Rmd index 26b2287..b9750e2 100644 --- a/GettingStarted.Rmd +++ b/GettingStarted.Rmd @@ -1,11 +1,12 @@ # Getting started -> "R isn't software. It's a community." -> --- John Chambers +This chapter will give you a short tour through R. We'll assume that you've installed R and Rstudio. By the end of this chapter you will be able to: -This chapter will give you a short tour through R. - -* Enter a few basic commands +* enter some basic commands, +* draw some pictures, +* create your first statistical model, +* save and reload a script, +* understand what a package is and how to install them. ## The Operating Environment @@ -25,18 +26,18 @@ R, like S before it, presumed that users would interact with the program from th Throughout this book, I will assume that you're using RStudio. 
You don't have to, but I will strongly recommend it. Why? * Things are easier with RStudio +* RStudio keeps track of all the variables in memory +* Everyone else is using it[^bad_reason]. -RStudio, keeps track of all the variables in memory - -* Everyone else is using it. - -OK, not much of an argument. This is the exact opposite of the logic our parents used to try and discourage us from smoking. However, in this case, it makes sense. When you're talking with other people and trying to reproduce your problem or share your awesome code, they're probably using RStudio. Using the same tool reduces the amount of effort needed to communicate. +[^bad_reason]: OK, not much of an argument. This is the exact opposite of the logic our parents used to try and discourage us from smoking. However, in this case, it makes sense. When you're talking with other people and trying to reproduce your problem or share your awesome code, they're probably using RStudio. Using the same tool reduces the amount of effort needed to communicate. ## Entering Commands -Now that you've got an is environment, you're ready to go. That cursor is blinking and waiting for you to tell it what to do! So what's the first thing you'll accomplish? +Now that you've got an environment, you're ready to go. That cursor is blinking and waiting for you to tell it what to do! So what's the first thing you'll accomplish? + +In RStudio, the console may be reached by pressing CTRL-2 (Command-2 on Mac). -Well, not much. We'll get into more fun stuff in the next chapter, but for now let's play it safe. You can use R a basic calculator, so take a few minutes to enter some basic mathematical expressions. +Well, not much. We'll get into more fun stuff soon, but for now let's play it safe. You can use R as a basic calculator, so take a few minutes to enter some basic mathematical expressions. 
```{r eval=TRUE, echo=TRUE} 1 + 1 @@ -46,47 +47,7 @@ pi 2*pi*4^2 ``` -* I can't find the console - -In RStudio, the console may be reached by pressing CTRL-2 (Command-2 on Mac). - -## Getting help - -```{r eval=FALSE, echo=TRUE, size='tiny'} -?plot - -??cluster -``` - -Within RStudio, the TAB key will autocomplete - -## The working directory - -The source of much frustration when starting out. - -Where am I? - -```{r eval=TRUE, echo=TRUE, size='tiny'} -getwd() -``` - -How do I get somewhere else? - -```{r eval=FALSE, results='hide', size='tiny'} -setwd("~/SomeNewDirectory/SomeSubfolder") -``` - -Try to stick with relative pathnames. This makes work portable. - -### Directory paths - -R prefers *nix style directories, i.e. "/", NOT "\\". Windows prefers "\\". - -"\\" is an "escape" character, used for things like tabs and newline characters. To get a single slash, just type it twice. - -More on file operations in the handout. - -### Source files +## Your first script Typing, editing and debugging at the command line will get tedious quickly. @@ -94,8 +55,6 @@ A source file (file extension .R) contains a sequence of commands. Analogous to the formulae entered in a spreadsheet (but so much more powerful!) -## Your first script - ```{r} N <- 100 B0 <- 5 @@ -129,3 +88,38 @@ source("SomefileName.R") Within RStudio, you may also click the "Source" button in the upper right hand corner. +## Getting help + +```{r eval=FALSE, echo=TRUE, size='tiny'} +?plot + +??cluster +``` + +Within RStudio, the TAB key will autocomplete + +## The working directory + +The source of much frustration when starting out. + +Where am I? + +```{r eval=TRUE, echo=TRUE, size='tiny'} +getwd() +``` + +How do I get somewhere else? + +```{r eval=FALSE, results='hide', size='tiny'} +setwd("~/SomeNewDirectory/SomeSubfolder") +``` + +Try to stick with relative pathnames. This makes work portable. + +### Directory paths + +R prefers Unix style directories[^mac_also_unix]. This means "/", __not__ "\\". 
Windows prefers "\\". All things being equal, this isn't much of a big deal; it's just an arbitrary convention, like deciding that electricity flows from negative to positive rather than the other way around[^ben_franklin_electricity]. R is designed to be deployed on both Windows and Unix, so its internal functions will use whatever convention applies on the target operating system. However, there's a catch: "\\" is an "escape" character, used for things like tabs and newline characters. To get a single backslash, you need to type it twice. + +[^mac_also_unix]: The Mac OS was based on Unix and adopts many of its conventions, among them file path separators. +[^ben_franklin_electricity]: Benjamin Franklin reference here. + diff --git a/LanguageElements.Rmd b/LanguageElements.Rmd index db6ce6c..c571e6f 100644 --- a/LanguageElements.Rmd +++ b/LanguageElements.Rmd @@ -4,17 +4,17 @@ library(pander) # Elements of the Language -There are certain concepts common to virtually all programming languages. Those elements are: variables, functions and operators. This chapter will discuss what those are and how they're implemented in R. By the end of this chapter, you will be able to answer the following: +There are certain concepts common to virtually all programming languages, which tell the computer how to behave. Those elements are: variables, functions and operators. This chapter will discuss what those are and how they're implemented in R. By the end of this chapter, you will be able to answer the following: * What is a variable and how do I create and modify them? * How do functions work? -* +* How can I use logic to control what commands get executed? If you're familiar with other languages like Visual Basic, Python or Java Script, you may be tempted to skip this section. If you do, you'll survive, but I'd suggest giving it a quick read. You may learn something about how R differs from those other languages. 
## Variables -Programming languages work by assigning values to space in you computer's memory. Those values are then available for computation. Because the value of what's stored in memory may change, we call these things "variables". Think of a cell in a spreadsheet. Before we put something in it, it's just an empty box. We can fill it with whatever we like, be it a person's name, their birthdate, their age, whatever. +Programming languages work by assigning values to space in your computer's memory. Those values are then available for computation. Because the value of what's stored in memory may "vary", we call these things "variables". Think of a cell in a spreadsheet. Before we put something in it, it's just an empty box. We can fill it with whatever we like, be it a person's name, their birthdate, their age, whatever. ### Assignment @@ -28,10 +28,6 @@ r + 2 Both "<-" and "=" will work for assignment. -### Data types - -To a human, the difference between something numeric- like a person's age- and something textual - like their name - isn't a big deal. To a computer, however, this matters a lot. In order to ensure that there is sufficient memory to store the information and to ensure that it may be used in an operation, the computer needs to know what type of data it's working with. In other words: 5 + "Steve" = Huh? - ## Operators ### Mathematical Operators @@ -83,7 +79,7 @@ sqrt(exp(sin(pi))) * cos, sin, tan (and many others) * lgamma, gamma, digamma, trigamma -### Comments +## Comments R uses the hash/pound character "#" to indicate comments. @@ -93,7 +89,7 @@ Comment early and often! Comments should describe "why", not "what". -#### Bad comment +### Bad comment ```{r eval=FALSE} # Take the ratio of loss to premium to determine the loss ratio @@ -101,7 +97,7 @@ Comments should describe "why", not "what". 
lossRatio <- Losses / Premium ``` -#### Good comment +### Good comment ```{r eval=FALSE} # Because this is a retrospective view of diff --git a/Lists.Rmd b/Lists.Rmd index 37591c5..320a741 100644 --- a/Lists.Rmd +++ b/Lists.Rmd @@ -4,38 +4,68 @@ library(rblocks) # Lists -In this chapter, we're going to learn about lists. Lists can be a bit confusing the first time you begin to use them. Heaven knows it took me ages to get comfortable with them. However, they're a very powerful way to structure data and, once mastered, will give you all kinds of control over pretty much anything the world can throw at you. If vectors are R's atom, lists are molecules. +In this chapter, we're going to learn about lists. Lists can be a bit confusing the first time you begin to use them. Heaven knows it took me ages to get comfortable with them. However, they're a very powerful way to structure data and, once mastered, will give you all kinds of control over pretty much anything the world can throw at you. If vectors are R's atoms, lists are molecules. -By the end of this chapter, you will know: +By the end of this chapter, you will know the following: * What is a list and how are they created? * What is the difference between a list and vector? * When and how do I use `lapply`? -Lists have data of arbitrary complexity. Any type, any length. Note the new `[[ ]]` double bracket operator. +## Lists Overview + +A list is a bit like a vector in that it is a container for many elements of data. However, unlike a vector, the elements of a list may have different data types. In addition, lists may store data _recursively_. This means that a list may contain a list, which contains another list and so on. Partially for this reason, access and assignment will use a new operator: `[[]]`. Confusing? Sorta. But don't worry, we'll walk through why and how you'll work with lists. 
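To make the recursive idea a bit more concrete, here is a minimal sketch. The names `policy`, `insured`, `limits` and `address` are made up for illustration and aren't used anywhere else in the book.

```r
# A hypothetical list holding a character, a numeric vector and another list
policy <- list(
  insured = "Acme Corp",
  limits  = c(1e6, 2e6),
  address = list(city = "Austin", state = "TX")  # a list inside a list
)

# Single brackets return a list, i.e. the containing type
typeof(policy["limits"])

# Double brackets return the element itself, i.e. the contained type
typeof(policy[["limits"]])

# Nested elements are reached by chaining [[ ]] or $
policy[["address"]][["state"]]
policy$address$state
```

The first `typeof` call should report a list and the second a double, which is the whole difference between the two operators in miniature.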
+ +```{r echo = FALSE} +dfListVectorDifferences <- data.frame(Vectors = c("Ordered", "All elements have the same data type", "One dimension", "Access and assignment via []", "May contain metadata") + , Lists = c("Also ordered", "Elements may be any data type, even other lists", "Doesn't apply in this context" + , "Access and assignment via [[ ]] and $", "May contain metadata")) +knitr::kable( + dfListVectorDifferences, booktabs = TRUE, + caption = 'Key similarities and differences between vectors and lists' +) +``` + +## List construction + +The `list` function will create a list. Have a look at the code below and try it out yourself. ```{r } -x <- list() +x <- list(c("This", "is", "a", "list") + , c(pi, exp(1))) typeof(x) -x[[1]] <- c("Hello", "there", "this", "is", "a", "list") -x[[2]] <- c(pi, exp(1)) summary(x) str(x) ``` -## Lists Overview +Note that, weirdly - and confusingly - you can create a list by using the `vector` function. + +```{r} +emptyList <- vector(mode = "list", length = 2) # a list of two NULL elements +``` + + +```{r } +dim(x) +``` ```{r echo=FALSE} make_block(x) ``` -### [ vs. [[ +## Access and assignment + +Because list elements can be arbitrarily complex, access and assignment get a bit stranger. This + +[ vs. [[ `[` is (almost always) used to set and return an element of the same type as the _containing_ object. `[[` is used to set and return an element of the same type as the _contained_ object. -This is why we use `[[` to set an item in a list. +This is why we use `[[` to set an item in a list. Note that `[[` will drop names[^why_drop_names]. + +[^why_drop_names]: I have no idea why it does this. Don't worry if this doesn't make sense yet. It's difficult for most R programmers. @@ -69,9 +99,11 @@ y$Artist y$Age ``` -### `lapply` +## Summary functions + +Because lists are arbitrary, we can't expect functions like `sum` or `mean` to work. Instead, we use functions like `lapply` to summarize particular list elements. -`lapply` is one of many functions which may be applied to lists. 
Can be difficult at first, but very powerful. Applies the same function to each element of a list. +`lapply` is one of many functions which may be applied to lists. They may not be intuitive at first, but they're very powerful. `lapply` will apply the same function to each element of a list. In the example below, we'll generate some statistics for three different vectors stored in a list. ```{r } myList <- list(firstVector = c(1:10) @@ -82,16 +114,16 @@ lapply(myList, median) lapply(myList, sum) ``` -### Why `lapply`? - -Two reasons: +Why `lapply`? Two reasons: 1. It's expressive. A loop is a lot of code which does little to clarify intent. `lapply` indicates that we want to apply the same function to each element of a list. Think of a formula that exists as a column in a spreadsheet. 2. It's easier to type at an interactive console. In its very early days, `S` was fully interactive. Typing a `for` loop at the console is a tedius and unnecessary task. -### Summary functions +Note that we can also use `lapply` on structures like a vector. + +## Flotsam -Because lists are arbitrary, we can't expect functions like `sum` or `mean` to work. Use `lapply` to summarize particular list elements. +### `split` ## Exercises diff --git a/Setup.Rmd b/Setup.Rmd index 46d5c63..03be9d0 100644 --- a/Setup.Rmd +++ b/Setup.Rmd @@ -14,38 +14,38 @@ By the end of this chapter, you will have done the following: * Install RStudio * Install the `raw` package -## Operating systems +## Installing R + +### Operating systems Although R was developed primarily on Unix-based operating systems, it may be used on many different platforms. As I write these words, there are three major systems in use: Windows, Mac OS and Linux. I've used R and RStudio on all three and the experience is pretty much the same. This is one of the fantastic features of the software. It's meant to be as widely used and portable as possible to maximize its use. 
-There are one or two operating system quirks, but in general I won't need to OS differences again, apart from one preliminary note. When refrring to keystroke combinationsn, I will only refer to the CTRL key. Mac OS users will understand that this key is CMD on their keyboards. +There are one or two operating system quirks, but in general I won't need to refer to OS differences again, apart from one preliminary note. When referring to keystroke combinations, I will only refer to the CTRL key. Mac OS users will understand that this key is CMD on their keyboards. If you're curious, the version of R and system architecture being used to write this book are noted below: -```{r echo = FALSE} +```{r echo = TRUE} sessionInfo()$platform ``` -### Installing R - In each case, what you'll do is download a file from the internet and then follow the standard process you go through to install software on whichever system you're using. For the most part, installation is quick and painless, but there may be limitations placed on you by your IT department. I have a few suggestions which I hope can help overcome any difficulties you might experience. The first place to look for installation is [cran.r-project.org]. From there, you will see links to downloads for Windows, Mac and Linux. Clicking on the appropriate link will take you to the page that's relevant for your operating system. You may see lots of bizarre, arcane language around binaries, source and tarballs. If those words (in this context) mean nothing to you, don't panic. Some folk like to build their own version of R directly from the source code. If you're reading these instructions, you're probably not one of those people. I recommend getting familiar with the CRAN website and reading the documentation there. If you get totally lost, try the links below which should take you directly to the download site for Windows and Mac. (If you're running Linux, I can't imagine you need my help.) 
-* [Windows install](http://cran.revolutionanalytics.com/bin/windows/base/) -* [Mac install](http://cran.revolutionanalytics.com/bin/macosx/) - -https://cran.r-project.org/bin/windows/base/ +* Windows install: [http://cran.revolutionanalytics.com/bin/windows/base/](http://cran.revolutionanalytics.com/bin/windows/base/) +* Mac install: [http://cran.revolutionanalytics.com/bin/macosx/](http://cran.revolutionanalytics.com/bin/macosx/) -It's possible that you'll be asked to identify a "mirror". R is hosted on a number of servers throughout the world. It's all the same R, but distributing it in this way helps to minimized load on servers which host the files. +It's possible that you'll be asked to identify a "mirror". R is hosted on a number of servers throughout the world. It's all the same R, but distributing it in this way helps to minimize load on servers which host the files. There's nothing much to decide here - the internet is pretty fast. ### Bitness -32 vs 64 +You may be asked to decide between 32 and 64 bit R. The numbers refer to the width of an address in memory. Software will look for instructions or data in memory using one of those two memory addressing schemes. This young, ungrateful millennial generation wasn't around when it happened, but I can remember when we moved from 16 to 32 bit software. It was a big deal. This one is less so. You'll probably want to use 64 bit R. Know that there's a chance that you'll run into problems working with other software, particularly the Microsoft Access database or the Java virtual machine. This could result if another program uses 32 bit memory addressing. I'm not going to pretend to be sensitive to all the technical nuances. Pretend that you're from the deep south trying to have a conversation with someone in Scotland. Although you're speaking the same language, it's possible that you're not able to have a conversation. 
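If you're not sure which flavor of R you ended up with, R itself can tell you. What follows is only a quick check, offered as a sketch rather than anything the installer requires: `.Machine$sizeof.pointer` reports the size of a memory address in bytes, so multiplying by 8 gives the bitness. The variable name `bits` is mine.

```r
# Size of a memory address in bytes: 8 on 64 bit R, 4 on 32 bit R
bits <- .Machine$sizeof.pointer * 8
bits

# The architecture R was built for, e.g. "x86_64" on many 64 bit systems
R.version$arch
```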
-### Installing R Studio +If this happens, you may be able to use 32 bit R, presuming you've installed it. I won't walk through how to implement that here. + +## Installing R Studio Installing R is most of the battle. Depending on the sort of person you are, it may even be all of the battle (see the following section on environments). R comes with a fairly spartan user interface, which is sufficient to get work done. However, most folk find that they enjoy using an Integrated Development Environment (IDE). This allows one to work on several source files at the same time, read help, observe console output, see what variables are loaded in memory, etc. There are a few options, but I've not yet found anything better than RStudio. @@ -61,26 +61,30 @@ The first thing to bear in mind is that, despite any appearance to the contrary, With that understood, there are several situations you may find yourself in.: -* You have- or can talk your way into- admin rights to your computer +#### You have- or can talk your way into- admin rights to your computer {-} Lucky you. Also lucky me as this is the happy situation that I enjoy. How do you handle this situation? Don't blow it! Be careful what you download, don't greedily consume bandwidth, server space or any of your company's other scarce resources and be VERY NICE to your IT staff. Acknowledge that you're grateful to have been given such trust and pledge not to do anything to have it removed. -* You don't have admin rights to your computer +#### You don't have admin rights to your computer {-} What to do? Request to be given admin rights. Explain why, in detail and don't be evasive or vague. Trust and mutual respect help. Talk to other folks in your department and get them on board. Present a strong business case for why use of this software will permit you and your department to work more efficiently. Show them this book and underline the parts where I tell folk to be nice and respectful towards IT staff. 
-* IT won't give you admin rights to your computer +#### IT won't give you admin rights to your computer {-} In this case, you may ask them to install it for you. Pool your resources. Talk to other actuaries and analysts in your company. Talk to your boss. -* IT won't install the software. +#### IT won't install the software. {-} Solution? Install the software to a memory stick. Yes, it is often (but not always!) possible to do this. This is is obviously not a preferred option, but it will get you up and running and enable you to attend the workshop. -* Memory sticks are locked down +#### Memory sticks are locked down {-} In this case, your IT department really wants you all to be running terminals. OK. Suggest an install of RStudio Server. This enables R to run on a server with controlled user access. This is quite a lot more work for your IT staff and you'll need to make a strong use case for it. If you're a large organization which has a predictive modelling or analytics area, they'll likely want this software. This won't allow you to use R remotely, so getting the most out of the workshop will be tough. However, there's still one more option. -* The nuclear option +#### The nuclear option {-} + +Your IT staff won't run R on a server, won't give you a laptop with R installed. They're really against this software. I'd like to advise you to get another job, but that's defeatist. This is where we reach the nuclear option, which is to use your own computer. This will drive folks at your company nuts. Now you're transferring data from a secure machine to one which you use for personal e-mail, Facebook, sports, personal finance and other activities that we needn't dwell on here. This is an absolute last resort and the overhead of moving stuff from one device to another will obviate most of the efficiency gains that open source software will provide. 
Here's how to make it work: produce work that is ONLY POSSIBLE using R, or Python or any of the tools which we will discuss. Show a killer visual and then patiently explain to your boss why it can't be done in Excel and why you can't share it with other departments and why it can't be done every quarter. This is a tall order, but it just might get someone's attention. + +## Conclusion -Your IT staff won't run R on a server, won't give you a laptop with R installed. They're really against this software. I'd like to advise you to get another job, but that's defeatist. This is where we reach the nuclear option, which is to use your own computer. This will drive folks at your company nuts. Now you're transferring data from a secure machine to one which you use for personal e-mail, Facebook, sports, personal finance and other activities that we needn't dwell on here. This is an absolute last resort and the overheard of moving stuff from one device to another will obviate most of the efficiency gains that open source software will provide. Here's how to make it work: produce work that is ONLY POSSIBLE using R, or Python or any of the tools which we will discuss. Show a killer visual and then patiently explain to your boss why it can't be done in Excel and why you can't share it with other departments and why it can't be done every quarter. This is a tall order, but it just might get someone's attention. \ No newline at end of file +Beg, borrow or steal, make friends, vanquish enemies, do whatever you need to do and get the software installed. It may be easy and you'll wonder why I'm making such a fuss. I hope that's how it goes down. If it's difficult, just know that it'll all be worth it. There are a host of crazy issues that you may have to resolve that you'd never see with Excel or Outlook or whatever. Hang in there. With every problem that you solve, you're getting closer and closer to data/stats nirvana. 
If you emerge from installation with a few battle scars, wear them like badges of honor. One day, you'll be knocking back a beer with Brian Ripley, Hadley Wickham or Dirk Eddelbuettel and speaking their language. \ No newline at end of file diff --git a/Vectors.Rmd b/Vectors.Rmd index eb50b54..981162b 100644 --- a/Vectors.Rmd +++ b/Vectors.Rmd @@ -1,5 +1,5 @@ -# Vectors +# Vectors {#chap-vectors} In this chapter, we're going to learn about vectors, which are one of the key building blocks of R programming. By the end of this chapter, you will know the following: @@ -38,7 +38,7 @@ To generate 100 new values, I just applied a multiplication operation in a singl So, how will I know a vector when I see one? For starters, it will have the following basic properties: -1. Every element in a vector must be the same basic data type. (Refer to {#chap-data-types} for a review of basic data types.) +1. Every element in a vector must be the same basic data type. (Refer to [Chapter 4: Data Types](#chap-data-types) for a review of basic data types.) 2. Vectors are ordered. 3. Vectors have one dimension. Higher dimensions are possible via matrices and arrays. 4. Vectors may have metadata like names for each element. @@ -51,6 +51,14 @@ You needn't spend too much time sweating over that last paragraph. I'm just gett So how do I create a vector? No real trick here. You'll be constructing vectors whether you want to or not. But let's talk about a few core functions for vector construction and manipulation. These are ones that you're likely to see and use often. +### `vector` + +a + +```{r} +x <- vector(mode = "numeric", length = 10) +``` + ### `seq` The first functions we'll look at are `seq` and its first cousin the `:` operator[^colon_gotcha]. You've already seen `:` where we've used it generate a sequence of integers. `seq` is much more flexible: you can specify the starting point, the ending point, the length of the interval and the number of elements to output. 
You'll need to provide at least three of the parameters and R will figure out everything else. We can think of the four use cases as being associated with which element we opt to leave out. @@ -131,6 +139,18 @@ block_j block_k ``` +### `paste` + +Another gotcha: in many other languages, "concatenation" is an operation which joins multiple strings to generate a single string. However, in R, the `c` function will concatenate multiple vectors (or other objects) into one object. So, how do you concatenate a string? With the `paste` function and its close relative `paste0`. + +```{r} +firstName <- "Brian" +lastName <- "Fannin" +fullName <- paste(lastName, firstName, sep = ", ") +``` + +`paste0` is a shortcut for the common use case where we want to join characters together without anything separating them: + ### `sample` The `sample` function will generate a random sample. This is a great one to use for randomizing a vector. @@ -323,9 +343,13 @@ comment(j) attributes(i) ``` +### `length`, `dim` + +`length` will return the number of elements in a vector. + ## Summarization -Loads of functions take vector input and return scalar output. Translation of a large set of numbers into a few, informative values is one of the cornerstones of statistics. +There are loads of functions which take vector input and return scalar output. Translation of a large set of numbers into a few, informative values is one of the cornerstones of statistics. 
```{r eval=FALSE} x = 1:50 diff --git a/_bookdown.yml b/_bookdown.yml index b00468f..e9ac179 100644 --- a/_bookdown.yml +++ b/_bookdown.yml @@ -8,12 +8,10 @@ rmd_files: [ "GettingStarted.Rmd", "LanguageElements.Rmd", + "DataTypes.Rmd", "Vectors.Rmd", "Lists.Rmd", - "Data.Rmd", - - "BasicVisualization.Rmd", - "LossDistributions.Rmd", + "DataFrames.Rmd", "References.Rmd", diff --git a/images/CRAN_Mirrors.png b/images/CRAN_Mirrors.png new file mode 100644 index 0000000..97b90aa Binary files /dev/null and b/images/CRAN_Mirrors.png differ diff --git a/images/CRAN_download.png b/images/CRAN_download.png new file mode 100644 index 0000000..20cf74d Binary files /dev/null and b/images/CRAN_download.png differ diff --git a/images/CRAN_winDownload.png b/images/CRAN_winDownload.png new file mode 100644 index 0000000..d291d0d Binary files /dev/null and b/images/CRAN_winDownload.png differ diff --git a/index.Rmd b/index.Rmd index f134468..fba6d42 100644 --- a/index.Rmd +++ b/index.Rmd @@ -13,9 +13,24 @@ knit: "bookdown::render_book" cover-image: raw_cover.png --- +```{r echo=FALSE} +knitr::opts_chunk$set( + comment = "#>", + collapse = TRUE, + cache = TRUE, + fig.align="center", + fig.pos="t" +) +``` + + + # Introduction {-} -Cover image + +> "R isn't software. It's a community." +> --- John Chambers + Hello! Very happy to have you here and I hope you find this useful. This book is meant to serve as a companion to any of the R training sessions that I'm involved in. A few quick notes before we proceed. 
diff --git a/packages.bib b/packages.bib index a0f4e9b..d3a4185 100644 --- a/packages.bib +++ b/packages.bib @@ -9,7 +9,7 @@ @Manual{R-base @Manual{R-bookdown, title = {bookdown: Authoring Books with R Markdown}, author = {Yihui Xie}, - note = {R package version 0.0.63}, + note = {R package version 0.1}, url = {https://github.com/rstudio/bookdown}, year = {2016}, } @@ -17,13 +17,34 @@ @Manual{R-knitr title = {knitr: A General-Purpose Package for Dynamic Report Generation in R}, author = {Yihui Xie}, year = {2016}, - note = {R package version 1.12.22}, - url = {http://yihui.name/knitr/}, + note = {R package version 1.14}, + url = {https://CRAN.R-project.org/package=knitr}, +} +@Manual{R-mondate, + title = {mondate: Keep track of dates in terms of months}, + author = {Dan Murphy}, + year = {2013}, + note = {R package version 0.10.01.02}, + url = {https://CRAN.R-project.org/package=mondate}, +} +@Manual{R-pander, + title = {pander: An R Pandoc Writer}, + author = {Gergely Daróczi and Roman Tsegelskyi}, + year = {2015}, + note = {R package version 0.6.0}, + url = {https://CRAN.R-project.org/package=pander}, +} +@Manual{R-rblocks, + title = {rblocks: rBlocks is a port of ipythonblocks to R, to provide a fun and +visual tool to explore data structures and control flow.}, + author = {Ramnath Vaidyanathan}, + year = {2014}, + note = {R package version 0.1.1}, } @Manual{R-rmarkdown, title = {rmarkdown: Dynamic Documents for R}, author = {JJ Allaire and Joe Cheng and Yihui Xie and Jonathan McPherson and Winston Chang and Jeff Allen and Hadley Wickham and Aron Atkins and Rob Hyndman}, year = {2016}, - note = {R package version 0.9.5.9}, + note = {R package version 1.0.9010}, url = {http://rmarkdown.rstudio.com}, } diff --git a/references.bib b/references.bib new file mode 100644 index 0000000..57bf62a --- /dev/null +++ b/references.bib @@ -0,0 +1,8 @@ +@Book{Matloff + , title = {The Art of R Programming} + , author = {Norman Matloff} + , year = {2011} + , publisher = {no 
starch press} + , isbn = {978-1-59327-384-2} +} +