From 74f4a8afff302d0b62519fbdbc0193b49a709df4 Mon Sep 17 00:00:00 2001
From: Brian Fannin
Date: Mon, 11 Apr 2016 22:10:15 -0400
Subject: [PATCH] Initial commit

---
 20_Setup.Rmd                 | 110 +++++++
 25_GettingStarted.Rmd        | 174 +++++++++++
 30_Vectors.Rmd               | 576 +++++++++++++++++++++++++++++++++++
 35_Lists.Rmd                 | 104 +++++++
 40_Data.Rmd                  | 243 +++++++++++++++
 45_BasicVisualization.Rmd    | 110 +++++++
 50_LossDistributions.Rmd     | 250 +++++++++++++++
 60_Simulation.Rmd            | 127 ++++++++
 90_AdvancedVisualization.Rmd | 245 +++++++++++++++
 9 files changed, 1939 insertions(+)
 create mode 100644 20_Setup.Rmd
 create mode 100644 25_GettingStarted.Rmd
 create mode 100644 30_Vectors.Rmd
 create mode 100644 35_Lists.Rmd
 create mode 100644 40_Data.Rmd
 create mode 100644 45_BasicVisualization.Rmd
 create mode 100644 50_LossDistributions.Rmd
 create mode 100644 60_Simulation.Rmd
 create mode 100644 90_AdvancedVisualization.Rmd

diff --git a/20_Setup.Rmd b/20_Setup.Rmd
new file mode 100644
index 0000000..0f59213
--- /dev/null
+++ b/20_Setup.Rmd
@@ -0,0 +1,110 @@

```{r echo=FALSE, results='hide'}
library(pander)
```

# Setup

By the end of this chapter, you should have mastered the following:

* Install R
* Choose (and possibly install) an operating environment
* Enter a few basic commands

## Installation

R may be used on any of the popular operating systems available today. I've used R on a Windows system, a Mac and in a Linux environment. The experience is pretty much the same everywhere, which is one of the fantastic features of the software. In each case, you'll download a file from the internet and then follow the standard process for installing software on whichever system you're using. For the most part, installation is quick and painless, but there may be limitations placed on you by your IT department. I have a few suggestions which I hope can help overcome any difficulties you might experience.

* Installing R

The first place to look for installation is cran.r-project.org. From there, you will see links to downloads for Windows, Mac and Linux. Clicking on the appropriate link will take you to the page that's relevant for your operating system. You may see lots of bizarre, arcane language around binaries, source and tarballs. If those words (in this context) mean nothing to you, don't panic. Some folk like to build their own version of R directly from the source code. If you're reading these instructions, you're probably not one of those people.

I recommend getting familiar with the CRAN website and reading the documentation there. If you get totally lost, try the links below, which should take you directly to the download site for Windows and Mac. (If you're running Linux, I can't imagine you need my help.)

* [Windows install](http://cran.revolutionanalytics.com/bin/windows/base/)
* [Mac install](http://cran.revolutionanalytics.com/bin/macosx/)

It's possible that you'll be asked to identify a "mirror". R is hosted on a number of servers throughout the world. It's all the same R, but distributing it in this way helps to minimize the load on the servers which host the files.

* Installing RStudio

Installing R is most of the battle. Depending on the sort of person you are, it may even be all of the battle (see the following section on environments). R comes with a fairly spartan user interface, which is sufficient to get work done. However, most folk find that they enjoy using an Integrated Development Environment (IDE). This allows one to work on several source files at the same time, read help, observe console output, see what variables are loaded in memory, etc. There are a few options, but I've not yet found anything better than RStudio.

RStudio's main website may be found at [www.rstudio.com](http://www.rstudio.com).
At the time of writing, the download page may be found at [http://www.rstudio.com/products/rstudio/download/](http://www.rstudio.com/products/rstudio/download/). Here you will find links for specific systems. The browser will even attempt to detect what operating system you're using and suggest a link for you. Cool, huh?

### IT

I don't think I've ever met anyone who's made it through a white-collar existence without at least one or two frustrating exchanges with a corporate IT department. If you work for a large (or even small) company, you likely have a staff of folks who keep the network running and handle software requests from every user in the company. To keep your company's network free from malicious attack and safe from well-intentioned but careless or imperfect users, most computers have sensitive areas restricted. This means that if you want to install software, you need an administrator to do it for you.

What this also means is that your IT department might not be as delighted as you are to install open-source software on the company laptop. This might be a problem they're not inclined to solve for you; you may find your interactions with IT folks a bit frustrating, and they may seem as though they're not at all helpful.

The first thing to bear in mind is that, despite any appearance to the contrary, your IT staff is there to help you. Moreover, they're people. They have families who love them, possibly small children who think their moms and dads are awesome, pets who miss them and lives outside of work. They work ridiculous hours to accommodate you and they get far more complaints than they do praise. Be nice to them and you may be surprised how supportive they can be.

With that understood, there are several situations you may find yourself in:

* You have (or can talk your way into) admin rights to your computer

Lucky you. Also lucky me, as this is the happy situation that I enjoy.
How do you handle this situation? Don't blow it! Be careful what you download; don't greedily consume bandwidth, server space or any of your company's other scarce resources; and be VERY NICE to your IT staff. Acknowledge that you're grateful to have been given such trust and pledge not to do anything to have it removed.

* You don't have admin rights to your computer

What to do? Request admin rights. Explain why, in detail, and don't be evasive or vague. Trust and mutual respect help. Talk to other folks in your department and get them on board. Present a strong business case for why use of this software will permit you and your department to work more efficiently. Show them this book and underline the parts where I tell folk to be nice and respectful towards IT staff.

* IT won't give you admin rights to your computer

In this case, you may ask them to install it for you. Pool your resources. Talk to other actuaries and analysts in your company. Talk to your boss.

* IT won't install the software

Solution? Install the software to a memory stick. Yes, it is often (but not always!) possible to do this. This is obviously not a preferred option, but it will get you up and running and enable you to attend the workshop.

* Memory sticks are locked down

In this case, your IT department really wants you all to be running terminals. OK. Suggest an install of RStudio Server. This enables R to run on a server with controlled user access. This is quite a lot more work for your IT staff and you'll need to make a strong business case for it. If you're a large organization which has a predictive modelling or analytics area, they'll likely want this software anyway. This won't allow you to use R away from your company's network, though, so getting the most out of the workshop will be tough. However, there's still one more option.

* The nuclear option

Your IT staff won't run R on a server and won't give you a laptop with R installed. They're really against this software.
I'd like to advise you to get another job, but that's defeatist. This is where we reach the nuclear option, which is to use your own computer. This will drive folks at your company nuts. Now you're transferring data from a secure machine to one which you use for personal e-mail, Facebook, sports, personal finance and other activities that we needn't dwell on here. This is an absolute last resort, and the overhead of moving stuff from one device to another will obviate most of the efficiency gains that open source software will provide. Here's how to make it work: produce work that is ONLY POSSIBLE using R, or Python, or any of the other tools which we will discuss. Show a killer visual and then patiently explain to your boss why it can't be done in Excel, why you can't share it with other departments and why it can't be done every quarter. This is a tall order, but it just might get someone's attention.

## The Operating Environment

Right. So, you've got R installed. Now what? Among the first differences you'll encounter relative to Excel is that you now have several different options when it comes to using R. R is an engine designed to process R commands. Where you store those commands and how you deal with the output is something over which you have a great deal of control. Terrible, frightening control. Here are those options in a nutshell:

* Command-line interface (CLI)
* RGui
* RStudio
* Others

### Command-line interface

R, like S before it, presumed that users would interact with the program from the command line. And, if you invoke the R command from a terminal, that's exactly what you'll get. The image below shows R running from a terminal on my machine.

![R at the command-line](images/R_CommandLine.png)

Throughout this book, I will assume that you're using RStudio. You don't have to, but I strongly recommend it. Why?

* Things are easier with RStudio

RStudio keeps track of all the variables in memory, lets you work on several source files at once and shows you help, plots and console output in one place.

* Everyone else is using it.
OK, not much of an argument. This is the exact opposite of the logic our parents used to try to discourage us from smoking. However, in this case, it makes sense. When you're talking with other people and trying to reproduce your problem or share your awesome code, they're probably using RStudio. Using the same tool reduces the amount of effort needed to communicate.

## Entering Commands

Now that you've got an environment, you're ready to go. That cursor is blinking and waiting for you to tell it what to do! So what's the first thing you'll accomplish?

Well, not much. We'll get into more fun stuff in the next chapter, but for now let's play it safe. You can use R as a basic calculator, so take a few minutes to enter some basic mathematical expressions.

```{r eval=TRUE, echo=TRUE}
1 + 1

pi

2*pi*4^2
```

* I can't find the console

In RStudio, the console may be reached by pressing CTRL-2 (Command-2 on Mac).

diff --git a/25_GettingStarted.Rmd b/25_GettingStarted.Rmd
new file mode 100644
index 0000000..026064d
--- /dev/null
+++ b/25_GettingStarted.Rmd
@@ -0,0 +1,174 @@
# Getting started

### Mathematical Operators

```{r echo=FALSE, results='asis'}
df = data.frame(Operator = c("+", "-", "*", "/", "^")
                , Operation = c("Addition", "Subtraction", "Multiplication"
                                , "Division", "Exponentiation"))
pandoc.table(df)
```

### Logical Operators

```{r echo=FALSE, results='asis'}
df = data.frame(Operator = c("&", "|", "!", "==", "!=", "<", "<=", ">", ">="
                             , "xor()", "&&", "||")
                , Operation = c("and", "or", "not", "equal", "not equal"
                                , "less than", "less than or equal"
                                , "greater than", "greater than or equal"
                                , "exclusive or", "non-vector and", "non-vector or"))
pandoc.table(df)
```

### Assignment

Assignment will create a variable which contains a value. This value may be used later.

```{r}
r <- 4

r + 2
```

Both `<-` and `=` will work for assignment.
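For example, the two statements below store the same value; the comparison at the end confirms it. (`<-` is the more common convention in R code.)

```{r}
x <- 4
y = 4
x == y
```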

### Functions

Functions in R are very similar to functions in a spreadsheet. The function takes in arguments and returns a result.

```{r }
sqrt(4)
```

Functions may be composed by having the output of one serve as the input to another.

$\sqrt{e^{\sin{\pi}}}$

```{r }
sqrt(exp(sin(pi)))
```

### A few mathematical functions

```{r eval=FALSE}
?S3groupGeneric
```

* abs, sign
* floor, ceiling, trunc, round, signif
* sqrt, exp, log
* cos, sin, tan (and many others)
* lgamma, gamma, digamma, trigamma

## Getting help

```{r eval=FALSE, echo=TRUE, size='tiny'}
?plot

??cluster
```

Within RStudio, the TAB key will autocomplete.

## The working directory

The source of much frustration when starting out.

Where am I?

```{r eval=TRUE, echo=TRUE, size='tiny'}
getwd()
```

How do I get somewhere else?

```{r eval=FALSE, results='hide', size='tiny'}
setwd("~/SomeNewDirectory/SomeSubfolder")
```

Try to stick with relative pathnames. This makes work portable.

### Directory paths

R prefers *nix style directories, i.e. "/", NOT "\\". Windows prefers "\\".

"\\" is an "escape" character, used for things like tabs and newline characters. To get a single backslash, just type it twice.

More on file operations in the handout.

### Source files

Typing, editing and debugging at the command line will get tedious quickly.

A source file (file extension .R) contains a sequence of commands.

Analogous to the formulae entered in a spreadsheet (but so much more powerful!)

## Your first script

```{r}
N <- 100
B0 <- 5
B1 <- 1.5

set.seed(1234)

e <- rnorm(N, mean = 0, sd = 1)
X1 <- rep(seq(1,10), 10)

Y <- B0 + B1 * X1 + e

myFit <- lm(Y ~ X1)
```

Save this file.

CTRL-S on Windows/Linux, CMD-S on Mac.

### Executing a script

Either:

1. Open the file and execute the lines one at a time, or

2. Use the "source" function.

```{r eval=FALSE}
source("SomefileName.R")
```

Within RStudio, you may also click the "Source" button in the upper right hand corner.

### Comments

R uses the hash/pound character "#" to indicate comments.

SQL or C++ style multiline comments are not supported.

Comment early and often!

Comments should describe "why", not "what".

#### Bad comment
```{r eval=FALSE}
# Take the ratio of loss to premium to determine the loss ratio

lossRatio <- Losses / Premium
```

#### Good comment
```{r eval=FALSE}
# Because this is a retrospective view of
# profitability, these losses have been
# developed, but not trended to a future
# point in time
lossRatio <- Losses / Premium
```

## Quiz

* What is the area of a cylinder with radius = e and height = pi?
* What arguments are listed for the "plot" function?
* Find the help file for a generalized linear model
* Create a script which calculates the area of a cylinder. From a new script, assign the value 4 to a variable and source the other file. Assign the value 8 to your variable and source again. What happened?
\ No newline at end of file
diff --git a/30_Vectors.Rmd b/30_Vectors.Rmd
new file mode 100644
index 0000000..6a47930
--- /dev/null
+++ b/30_Vectors.Rmd
@@ -0,0 +1,576 @@
# Vectors

In this chapter, we're going to learn about vectors, one of the key building blocks of R programming. By the end of this chapter, you will know:

* What is a vector?
* How are vectors created?
* What are data types and how can I tell what sort of data I'm working with?
* What is metadata?
* How can I summarize a vector?

## What is a vector?

In R, every variable is a vector. Think of a set of contiguous cells in a spreadsheet.

```{r eval=TRUE}
set.seed(1234)
e <- rnorm(100)
X1 <- 1:10
```

Here, `e` is a vector with 100 values. `X1` is the sequence of integers from 1 through 10.

Vectors can grow and shrink automatically. No need to move cells around on a sheet.
No need to copy formulas or change named ranges.

### Vector properties

* Every element in a vector must be the same type.
    * R will change data types if they are not!
    * Different types are possible by using a list or a data frame (later)
* Vectors have one dimension
    * Higher dimensions are possible via matrices and arrays
* Possible to add metadata (like names) via attributes

### Vector construction

Vectors are constructed in one of several ways:

* Returned from a function or operation
    * `seq`, `rep`, `sample`, `rnorm`, etc.
* Concatenation
* Growth by assignment

### `seq`

`seq` is often used to generate a sequence of values. The colon operator `:` is a shortcut for a sequence of integers.

```{r }
pies = seq(from = 0, by = pi, length.out = 5)
i <- 1:5
year = 2000:2004
```

```{r echo=FALSE}
library(rblocks)
i <- 1:5
block = make_block(i, type = 'vector')
block
```

### `rep`

The `rep` function will replicate its input.
```{r }
i = rep(pi, 100)
head(i)
```

### Concatenation

The `c()` function will concatenate values.

```{r results='hide'}
i <- c(1, 2, 3, 4, 5)
j <- c(6, 7, 8, 9, 10)
k <- c(i, j)
l <- c(1:5, 6:10)
```

```{r echo=FALSE}
i = c(1, 2, 3, 4, 5)
j = c(6, 7, 8, 9, 10)
k = c(i, j)
block_i = make_block(i, type = 'vector')
block_j = make_block(j, type = 'vector')
block_i[1:5] = "red"
block_j[1:5] = "blue"
block_k = make_block(c(block_i, block_j), type = 'vector')
block_k[1:5] = "red"
block_k[6:10] <- "blue"

block_i
block_j

block_k
```

### Growth by assignment

Assigning a value beyond a vector's limits will automatically grow the vector. Interim values are assigned `NA`.

```{r }
i <- 1:10
i[30] = pi
i
```

### Vector access - by index

Vectors may be accessed by their numeric indices. Remember, `:` is shorthand to generate a sequence.

```{r }
set.seed(1234)
e <- rnorm(100)
e[1]
e[1:4]
e[c(1,3)]
```

### Vector access - logical access

Vectors may be accessed logically. This may be done by passing in a logical vector or a logical expression.

```{r }
i = 5:9
i[c(TRUE, FALSE, FALSE, FALSE, TRUE)]
i[i > 7]
b = i > 7
b
i[b]
```

### `which`

The `which` function returns the indices that match a logical expression.

```{r }
i <- 11:20
which(i > 12)
i[which(i > 12)]
```

### `sample`

The `sample` function will generate a random sample. Great to use for randomizing a vector.

```{r }
months <- c("January", "February", "March", "April"
            , "May", "June", "July", "August"
            , "September", "October", "November", "December")

set.seed(1234)
mixedMonths <- sample(months)
head(mixedMonths)
```

Get lots of months with the `size` parameter:

```{r }
set.seed(1234)
lotsOfMonths <- sample(months, size = 100, replace = TRUE)
head(lotsOfMonths)
```

### `sample` II

`sample` may also be used within the indexing of the vector itself:

```{r }
set.seed(1234)
moreMonths <- months[sample(1:12, replace=TRUE, size=100)]
head(moreMonths)

# Cleaner with sample.int
set.seed(1234)
evenMoreMonths <- months[sample.int(length(months), size=100, replace=TRUE)]
head(evenMoreMonths)
```

### `order`

The function `order` will return the indices of the vector in order.

```{r }
set.seed(1234)
x <- sample(1:10)
x
order(x)
x[order(x)]
```

### Vector arithmetic

Vectors may be used in arithmetic operations.

```{r eval=FALSE}
N <- 100
B0 <- 5
B1 <- 1.5

set.seed(1234)

e <- rnorm(N, mean = 0, sd = 1)
X1 <- rep(seq(1,10), 10)

Y <- B0 + B1 * X1 + e
```

Y is now a vector with length equal to the longest vector used in the calculation.

Question: `B0` and `B1` are vectors of length 1. `X1` and `e` are vectors of length 100. How are they combined?

### Recycling

R will "recycle" vectors until there are enough elements to perform an operation.
Everything gets as "long" as the longest vector in the operation. For scalar operations on a vector this doesn't involve any drama. Try the following code:

```{r size='tiny'}
vector1 = 1:10
vector2 = 1:5
scalar = 3

print(vector1 + scalar)
print(vector2 + scalar)
print(vector1 + vector2)
```

### Set theory - Part I

The `%in%` operator will return a logical vector indicating whether or not an element of the first set is contained in the second set.

```{r }
x <- 1:10
y <- 5:15
x %in% y
```

### Set theory - Part II

* `union`
* `intersect`
* `setdiff`
* `setequal`
* `is.element`

```{r eval = FALSE}
?union
```

```{r }
x <- 1900:1910
y <- 1905:1915
intersect(x, y)
setdiff(x, y)
setequal(x, y)
is.element(1941, y)
```

### Summarization

Loads of functions take vector input and return scalar output. Translation of a large set of numbers into a few informative values is one of the cornerstones of statistics.

```{r eval=FALSE}
x = 1:50
sum(x)
mean(x)
max(x)
length(x)
var(x)
```

### Vectors

Vectors are like atoms. If you understand vectors (how to create them, how to manipulate them, how to access the elements) you're well on your way to grasping how to handle other objects in R.

Vectors may combine to form molecules but, fundamentally, _everything_ in R is a vector.

## From data

* Data types
* From vectors to matrices and lists

### Data types

* logical
* integer
* double
* character

### What is it?

```{r}
x <- 6
y <- 6L
z <- TRUE
typeof(x)
typeof(y)
typeof(z)
is.logical(x)
is.double(x)
```

### Data conversion

Most conversion is implicit. For explicit conversion, use the `as.*` functions.

Implicit conversion alters everything to the most complex form of data present, as follows:

logical -> integer -> double -> character

Explicit conversion usually implies truncation and loss of information.

```{r }
# Implicit conversion
w <- TRUE
x <- 4L
y <- 5.8
z <- w + x + y
typeof(z)

# Explicit conversion. Note the loss of data.
as.integer(z)
```

### Class

A class is an extension of the basic data types. We'll see many examples of these. The class of a basic type will be equal to its type, apart from 'double', whose class is 'numeric' for reasons I don't pretend to understand.

```{r }
class(TRUE)
class(pi)
class(4L)
```
The type and class of a vector is returned as a scalar. Remember, a vector is a set of elements which all have the same type.
```{r }
class(1:4)
```

### Mode

There is also a function called `mode` which looks tempting. Ignore it.

### Dates and times

Dates in R can be tricky. There are two basic classes: `Date` and `POSIXt`. The `Date` class does not get more granular than days. The `POSIXt` class can handle seconds, milliseconds, etc.

My recommendation is to stick with the `Date` class. Introducing times means introducing time zones and the possibility for confusion or error. Actuaries rarely need to measure things in minutes.

```{r }
x <- as.Date('2010-01-01')
class(x)
typeof(x)
```

### More on dates

The default behavior for dates is that they don't follow US conventions.

Don't do this:
```{r error=TRUE}
x <- as.Date('06-30-2010')
```

And don't be fooled by this. It parses without complaint, but the result is the 20th of June in the year 30, not June 30, 2010:
```{r }
x <- as.Date('30-06-2010')
x
```

If you want to preserve your sanity, stick with year, month, day.
```{r }
x <- as.Date('2010-06-30')
```

### What day is it?

To get the date and time of the computer, use either `Sys.Date()` or `Sys.time()`. Note that `Sys.time()` will return both the day AND the time as a `POSIXct` object.

```{r }
x <- Sys.Date()
y <- Sys.time()
```

### More reading on dates

It's worth reading the documentation about dates. Measuring time periods is a common task for actuaries. It's easy to make huge mistakes by getting dates wrong.
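As one small example of the sort of arithmetic that's easy to get wrong, subtracting one `Date` from another returns a `difftime` object; being explicit about the units gets you a plain number:

```{r }
policyStart <- as.Date('2010-01-01')
policyEnd <- as.Date('2010-12-31')

# Subtraction returns a difftime object
policyEnd - policyStart

# Be explicit about units to get a plain number
as.numeric(policyEnd - policyStart, units = "days")
```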

The `lubridate` package has some nice convenience functions for setting month and day, and for reasoning about time periods. It also enables you to deal with time zones, leap days and leap seconds. Probably more than you need.

`mondate` was written by an actuary and supports (among other things) handling time periods in terms of months.

* [Date class](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Dates.html)
* [lubridate](http://www.jstatsoft.org/v40/i03/paper)
* [Ripley and Hornik](http://www.r-project.org/doc/Rnews/Rnews_2001-2.pdf)
* [mondate](https://code.google.com/p/mondate/)

### Factors

Another gotcha. Factors were necessary many years ago when data collection and storage were expensive. A factor is a mapping of a character string to an integer. Particularly when importing data, R often wants to convert character values into factors. You will often want to convert a factor back into a string.

```{r }
myColors <- c("Red", "Blue", "Green", "Red", "Blue", "Red")
myFactor <- factor(myColors)
typeof(myFactor)
class(myFactor)
is.character(myFactor)
is.character(myColors)
```

### Altering factors

```{r }
# This probably won't give you what you expect
myOtherFactor <- c(myFactor, "Orange")
myOtherFactor

# And this will give you a warning and an NA
myFactor[length(myFactor)+1] <- "Orange"

# Must do things in two steps
myOtherFactor <- factor(c(levels(myFactor), "Orange"))
myOtherFactor[length(myOtherFactor)+1] <- "Orange"
```

### Avoid factors

Now that you know what they are, you can spend the next few months avoiding factors. When R was created, there were compelling reasons to include factors and they still have some utility. More often than not, though, they're a confusing hindrance.

If characters aren't behaving the way you expect them to, check the variables with `is.factor`. Convert them with `as.character` and you'll be back on the road to happiness.
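One more classic trap, in case you need convincing: calling `as.numeric` on a factor returns the underlying integer codes, not the values you see printed. Convert to character first.

```{r }
yearFactor <- factor(c("2010", "2012", "2011"))

# The underlying integer codes, NOT the years
as.numeric(yearFactor)

# Convert to character first to recover the actual values
as.numeric(as.character(yearFactor))
```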

### Questions

* Create a logical, integer, double and character variable.
* Can you create a vector with both logical and character values?
* What happens when you try to add a logical to an integer? An integer to a double?

### Answers
```{r }
myLogical <- TRUE
myInteger <- 1:4
myDouble <- 3.14
myCharacter <- "Hello!"

y <- myLogical + myInteger
typeof(y)
y <- myInteger + myDouble
typeof(y)
```

### From vectors to matrices and lists

A matrix is a vector with higher dimensions.

A list also holds more than a single vector's worth of data, but its elements may have different data types.

### A matrix

Two ways to construct:

1. Use the `matrix` function.
2. Change the dimensions of a `vector`.

```{r }
myVector <- 1:100
myMatrix <- matrix(myVector, nrow=10, ncol=10)

myOtherMatrix <- myVector
dim(myOtherMatrix) <- c(10,10)

identical(myMatrix, myOtherMatrix)
```

###

```{r }
myMatrix <- matrix(nrow=10, ncol=10)
```

```{r echo=FALSE}
library(rblocks)
block_grid(10, 10, type="matrix")
```

###

```{r }
dim(myMatrix) <- c(25, 4)
```

```{r echo=FALSE}
block_grid(25, 4, type="matrix")
```

### Matrix metadata

Possible to add metadata. This is typically a name for the columns or rows.

```{r }
myMatrix <- matrix(nrow=10, ncol=10, data = sample(1:100))
colnames(myMatrix) <- letters[1:10]
head(myMatrix, 3)
rownames(myMatrix) <- tail(letters, 10)
head(myMatrix, 3)
```

### Data access for a matrix

Matrix access is similar to vector access, but with additional dimensions. For two-dimensional matrices, the order is row first, then column.

```{r }
myMatrix[2, ]
myMatrix[, 2]
```

### Data access continued

A single index will return values by indexing along only one dimension.

```{r }
myMatrix[2]
myMatrix[22]
```

### Matrix summary

```{r }
sum(myMatrix)
colSums(myMatrix)
rowSums(myMatrix)
colMeans(myMatrix)
```

### More than two dimensions

Like more than two dimensions? Shine on you crazy diamond.
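If you really do need them, the `array` function generalizes `matrix` to an arbitrary number of dimensions. A quick sketch:

```{r }
myArray <- array(1:24, dim = c(2, 3, 4))
dim(myArray)
myArray[1, 2, 3]  # row 1, column 2, layer 3
```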

## Exercise

Create a vector of length 10, with years starting from 1980.

Create a vector with values from 1972 to 2012 in increments of four (1972, 1976, 1980, etc.)

Construct the following vectors (feel free to use the `VectorQuestion.R` script):
```{r }
FirstName <- c("Richard", "James", "Ronald", "Ronald"
               , "George", "William", "William", "George"
               , "George", "Barack", "Barack")
LastName <- c("Nixon", "Carter", "Reagan", "Reagan"
              , "Bush", "Clinton", "Clinton", "Bush"
              , "Bush", "Obama", "Obama")
ElectionYear <- seq(1972, 2012, 4)
```

* List the last names in alphabetical order.
* List the years in order by first name.
* Create a vector of years when someone named "George" was elected.
* How many Georges were elected before 1996?
* Generate a random sample of 100 presidents.

## Answers

```{r }
LastName[order(LastName)]
ElectionYear[order(FirstName)]
ElectionYear[FirstName == 'George']
myLogical <- (FirstName == 'George') & (ElectionYear < 1996)
length(which(myLogical))
sum(myLogical)

sample(LastName, 100, replace = TRUE)
```
diff --git a/35_Lists.Rmd b/35_Lists.Rmd
new file mode 100644
index 0000000..19a80e6
--- /dev/null
+++ b/35_Lists.Rmd
@@ -0,0 +1,104 @@
# Lists

In this chapter, we're going to learn about lists. Lists can be a bit confusing the first time you begin to use them. Heaven knows it took me ages to get comfortable with them. However, they're a very powerful way to structure data and, once mastered, will give you all kinds of control over pretty much anything the world can throw at you. If vectors are R's atoms, lists are molecules.

By the end of this chapter, you will know:

* What is a list and how are they created?
* What is the difference between a list and a vector?
* When and how do I use `lapply`?

Lists hold data of arbitrary complexity. Any type, any length. Note the new `[[ ]]` double bracket operator.

```{r }
x <- list()
typeof(x)
x[[1]] <- c("Hello", "there", "this", "is", "a", "list")
x[[2]] <- c(pi, exp(1))
summary(x)
str(x)
```

### Lists

```{r echo=FALSE}
make_block(x)
```

### [ vs. [[

`[` is (almost always) used to set and return an element of the same type as the _containing_ object.

`[[` is used to set and return an element of the same type as the _contained_ object.

This is why we use `[[` to set an item in a list.

Don't worry if this doesn't make sense yet. It's difficult for most R programmers.

### Recursive storage

Lists can contain other lists as elements.

```{r }
y <- list()
y[[1]] <- "Lou Reed"
y[[2]] <- 45

x[[3]] <- y
```

```{r echo=FALSE}
make_block(x)
```

### List metadata

Again, list metadata is typically names. However, names become very important for lists. They are handled with the special `$` operator, which permits access to a single element. (A single element of a list can be a vector!)

```{r}
y[[1]] <- c("Lou Reed", "Patti Smith")
y[[2]] <- c(45, 63)

names(y) <- c("Artist", "Age")

y$Artist
y$Age
```

### `lapply`

`lapply` is one of many functions which may be applied to lists. It can be difficult at first, but it's very powerful: it applies the same function to each element of a list.

```{r }
myList <- list(firstVector = c(1:10)
               , secondVector = c(89, 56, 84, 298, 56)
               , thirdVector = c(7,3,5,6,2,4,2))
lapply(myList, mean)
lapply(myList, median)
lapply(myList, sum)
```

### Why `lapply`?

Two reasons:

1. It's expressive. A loop is a lot of code which does little to clarify intent. `lapply` indicates that we want to apply the same function to each element of a list. Think of a formula that exists as a column in a spreadsheet.
2. It's easier to type at an interactive console. In its very early days, `S` was fully interactive. Typing a `for` loop at the console is a tedious and unnecessary task.
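To see the difference, here's the same computation written both ways:

```{r }
myList <- list(a = 1:10, b = c(4, 8, 15, 16, 23, 42))

# The loop: several lines, and we must manage the result ourselves
myMeans <- list()
for (name in names(myList)) {
  myMeans[[name]] <- mean(myList[[name]])
}

# lapply: one line which says exactly what it does
myMeans <- lapply(myList, mean)
```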

### Summary functions

Because list elements are arbitrary, we can't expect functions like `sum` or `mean` to work on a whole list. Use `lapply` to summarize particular list elements.

## Questions

* Create a list with two elements. Have the first element be a vector with 100 numbers. Have the second element be a vector with 100 dates. Give your list the names "Claims" and "AccidentDate".
* What is the average value of a claim?

## Answers

```{r }
myList <- list()
myList$Claims <- rlnorm(100, log(10000))
myList$AccidentDate <- sample(seq.Date(as.Date('2000-01-01'), as.Date('2009-12-31'), length.out = 1000), 100)
mean(myList$Claims)
```
diff --git a/40_Data.Rmd b/40_Data.Rmd
new file mode 100644
index 0000000..8100e1b
--- /dev/null
+++ b/40_Data.Rmd
@@ -0,0 +1,243 @@
# Data Frames

Finally! This is what we've been building towards.

By the end of this chapter you will know:

* What's a data frame and how do I create one?
* How do I read and write external data?

## What's a data frame?

All of that about vectors and lists was prologue to this. The data frame is a seminal concept in R. Most statistical operations expect one and they are the most common way to pass data in and out of R.

Although critical to understand, this is very, very easy to get. What's a data frame? It's a table. That's it. No, really, that's it.

A data frame is a `list` of `vectors`. Each `vector` may have a different data type, but all must be the same length.
+ +### Creating a data frame + +```{r } +set.seed(1234) +State = rep(c("TX", "NY", "CA"), 10) +EarnedPremium = rlnorm(length(State), meanlog = log(50000), sdlog=1) +EarnedPremium = round(EarnedPremium, -3) +Losses = EarnedPremium * runif(length(EarnedPremium), min=0.4, max = 0.9) + +df = data.frame(State, EarnedPremium, Losses, stringsAsFactors=FALSE) +``` + +### Basic properties of a data frame + +```{r } +summary(df) +str(df) +``` + +```{r } +names(df) +colnames(df) +length(df) +dim(df) +nrow(df) +ncol(df) +``` + +```{r } +head(df) +head(df, 2) +tail(df) +``` + +### Referencing + +Very similar to referencing a 2D matrix. + +```{r eval=FALSE} +df[2,3] +df[2] +df[2,] +df[2, -1] +``` + +Note the `$` operator to access named columns. A data frame uses the 'name' metadata in the same way as a list. + +```{r eval=FALSE} +df$EarnedPremium +# Columns of a data frame may be treated as vectors +df$EarnedPremium[3] +df[2:4, 1:2] +df[, "EarnedPremium"] +df[, c("EarnedPremium", "State")] +``` + +### Ordering + +```{r } +order(df$EarnedPremium) +df = df[order(df$EarnedPremium), ] +``` + +### Altering and adding columns + +```{r } +df$LossRatio = df$EarnedPremium / df$Losses +df$LossRatio = 1 / df$LossRatio +``` + +### Eliminating columns + +```{r } +df$LossRatio = NULL +df = df[, 1:2] +``` + +### rbind, cbind + +`rbind` will append rows to the data frame. New rows must have the same number of columns and data types. `cbind` must have the same number of rows as the data frame. + +```{r results='hide'} +dfA = df[1:10,] +dfB = df[11:20, ] +rbind(dfA, dfB) +dfC = dfA[, 1:2] +cbind(dfA, dfC) +``` + +### Merging + +```{r size='tiny'} +dfRateChange = data.frame(State =c("TX", "CA", "NY"), RateChange = c(.05, -.1, .2)) +df = merge(df, dfRateChange) +``` + +Merging is VLOOKUP on steroids. Basically equivalent to a JOIN in SQL. 
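By default, `merge` joins on whatever column names the two data frames share and keeps only the rows which match. The `all.x` and `all.y` arguments give the outer-join variants. A quick sketch (not run):

```{r eval=FALSE}
# keep every row of df, with NA where dfRateChange has no matching State
merge(df, dfRateChange, all.x = TRUE)

# name the join column explicitly
merge(df, dfRateChange, by = "State")
```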
+ +### Altering column names + +```{r } +df$LossRation = with(df, Losses / EarnedPremium) +names(df) +colnames(df)[4] = "Loss Ratio" +colnames(df) +``` + +### Subsetting - The easy way + +```{r } +dfTX = subset(df, State == "TX") +dfBigPolicies = subset(df, EarnedPremium >= 50000) +``` + +### Subsetting - The hard(ish) way + +```{r } +dfTX = df[df$State == "TX", ] +dfBigPolicies = df[df$EarnedPremium >= 50000, ] +``` + +### Subsetting - Yet another way + +```{r } +whichState = df$State == "TX" +dfTX = df[whichState, ] + +whichEP = df$EarnedPremium >= 50000 +dfBigPolicies = df[whichEP, ] +``` + +I use each of these three methods routinely. They're all good. + +## Summarizing + +```{r } +sum(df$EarnedPremium) +sum(df$EarnedPremium[df$State == "TX"]) + +aggregate(df[,-1], list(df$State), sum) +``` + +### Summarizing visually - 1 + +```{r size='tiny', fig.height=5} +dfByState = aggregate(df$EarnedPremium, list(df$State), sum) +colnames(dfByState) = c("State", "EarnedPremium") +barplot(dfByState$EarnedPremium, names.arg=dfByState$State, col="blue") +``` + +### Summarizing visually - 2 + +```{r size='tiny', fig.height=5} +dotchart(dfByState$EarnedPremium, dfByState$State, pch=19) +``` + +### Advanced data frame tools + +* dplyr +* tidyr +* reshape2 +* data.table + +Roughly 90% of your work in R will involve manipulation of data frames. There are truckloads of packages designed to make manipulation of data frames easier. Take your time getting to learn these tools. They're all powerful, but they're all a little different. I'd suggest learning the functions in `base` R first, then moving on to tools like `dplyr` and `data.table`. There's a lot to be gained from understanding the problems those packages were created to solve. 
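As a small taste, here's what the premium summary by state might look like in `dplyr`. This is a sketch which assumes the package is installed; it isn't run here:

```{r eval=FALSE}
library(dplyr)
df %>%
  group_by(State) %>%
  summarise(EarnedPremium = sum(EarnedPremium),
            Losses = sum(Losses))
```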
### Reading data

```{r eval=FALSE}
myData = read.csv("SomeFile.csv")
```

### Reading from Excel

There are several ways:

* XLConnect
* xlsx
* Excelsi-r

```{r eval=FALSE}
library(XLConnect)
wbk = loadWorkbook("myWorkbook.xlsx")
df = readWorksheet(wbk, "SomeSheet")
```

### Reading from the web - 1

```{r eval=FALSE}
URL = "http://www.casact.org/research/reserve_data/ppauto_pos.csv"
df = read.csv(URL, stringsAsFactors = FALSE)
```

### Reading from the web - 2

```{r eval=FALSE}
library(XML)
URL = "http://www.pro-football-reference.com/teams/nyj/2012_games.htm"
games = readHTMLTable(URL, stringsAsFactors = FALSE)
```

### Reading from a database

```{r eval=FALSE}
library(RODBC)
myChannel = odbcConnect(dsn = "MyDSN_Name")
df = sqlQuery(myChannel, "SELECT stuff FROM myTable")
```

### Read some data

```{r eval=FALSE}
df = read.csv("../data-raw/StateData.csv")
```

```{r eval=FALSE}
View(df)
```

### Questions

* Load the data from "StateData.csv" into a data frame.
* Which state has the most premium?

### Answer

```{r }
```
\ No newline at end of file
diff --git a/45_BasicVisualization.Rmd b/45_BasicVisualization.Rmd
new file mode 100644
index 0000000..0095a87
--- /dev/null
+++ b/45_BasicVisualization.Rmd
@@ -0,0 +1,110 @@
# Basic Visualization

It's impossible to overstate the importance of visualization in data analysis. Visualization:

* Helps us explore data
* Suggests a model
* Assesses the validity of a model and its parameters
* Is vital for a non-technical audience

### Visualization in R

There are 4 plotting engines (at least):

* base plotting system
* lattice
* ggplot2
* rCharts

We'll look at the base plotting system now and ggplot2 after lunch.

### Common geometric objects

* scatter
* line
* hist
* density
* boxplot
* barplot
* dotplot

`plot` is the most basic graphics command. There are several dozen options that you can set.
Spend a lot of time reading the documentation and experimenting. + +Open your first script. + +### A basic scatter plot + +```{r } +source("./scripts/BasicScript.R") +plot(X1, Y, pch=19) +``` + +### Add lines + +The functions 'lines' and 'points' will add (wait for it) lines and points to a pre-existing plot. +```{r } +plot(X1, Y, pch=19) +lines(X1, yHat) +``` + +### Histogram + +```{r fig.height=4.5} +hist(e) +``` + +### Density plot + +```{r fig.height=4.5} +plot(density(e)) +``` + +### Boxplot + +```{r fig.height=4.5} +boxplot(e, pch=19) +``` + +### Plotting a formula +```{r fig.height=4.5} +plot(Y ~ X1, pch=19) +``` + +### Emphasizing outliers +```{r fig.height=4.5} +colors = ifelse(abs(e) > 1.0, "red", "black") +plot(Y ~ X1, pch=19, col=colors) +``` + +### Other ways to emphasize outliers +```{r fig.height=4.5} +plot(Y ~ X1, pch=19) +lines(X1, yHat, lwd=2) +lines(X1, yHat+1, lty="dotted", lwd=0.5) +lines(X1, yHat-1, lty="dotted", lwd=0.5) +``` + +## Exercise + +* Load the COTOR2 data from the raw package. 
* Create a histogram, kernel density plot and box plot for the claims data.

## Answer

```{r }
library(raw)
data(COTOR2)
hist(COTOR2$Claim)
boxplot(COTOR2$Claim)
plot(density(COTOR2$Claim))
plot(density(log(COTOR2$Claim)))
hist(log(COTOR2$Claim))
hist(COTOR2$Claim, breaks=80)
```

## Resources

* [Nathan Yau - FlowingData.com](http://flowingdata.com/)
* [Stephen Few - PerceptualEdge.com](http://www.perceptualedge.com/)
* [Edward Tufte - edwardtufte.com](http://www.edwardtufte.com/tufte/)
* [junkcharts.typepad.com](http://junkcharts.typepad.com/)
\ No newline at end of file
diff --git a/50_LossDistributions.Rmd b/50_LossDistributions.Rmd
new file mode 100644
index 0000000..a65d886
--- /dev/null
+++ b/50_LossDistributions.Rmd
@@ -0,0 +1,250 @@
# Loss Distributions

By the end of this chapter, you will know the following:

* Simulation with `base` functions
* How to perform basic visualization of loss data
* How to fit a loss distribution
* How to assess goodness of fit

### Packages we'll use

* `MASS` (MASS = Modern Applied Statistics in S)
    * `fitdistr` will fit a parametric distribution to loss data
* `actuar`
    * `emm` calculates empirical moments
    * `lev` calculates limited expected values
    * `coverage` modifies a loss distribution for coverage elements
    * Contains many more distributions than are found in `base` R such as Burr, Pareto, etc. Basically, anything in "Loss Models" is likely to be found here.
    * Contains the dental claims data from "Loss Models"
* Direct optimization
    * The `optim` function

### Statistical distributions in R

Function names consist of a one-letter prefix followed by the distribution name. The prefix is one of:

* d - probability density
* p - cumulative distribution function
* q - quantiles
* r - random number generator

### Examples

```{r }
mu <- 10000
CV <- 0.30
sd <- mu * CV
x <- seq(mu - sd*3, mu + sd * 3, length.out = 20)
p <- seq(.05, .95, by = .05)

dnorm(x, mu, sd)
pnorm(x, mu, sd)
qnorm(p, mu, sd)
rnorm(10, mu, sd)

dlnorm(x, log(mu), log(sd))
plnorm(x, log(mu), log(sd))

plot(function(x) {dnorm(x, 10, 4)}, 0, 20)
```

### Generate some loss data

```{r }
set.seed(8910)
years <- 2001:2010
frequency <- 1000

N <- rpois(length(years), frequency)

sevShape <- 2
sevScale <- 1000
severity <- rgamma(sum(N), sevShape, scale = sevScale)

summary(severity)
```

### Histograms

```{r}
hist(severity)
hist(severity, breaks = 50)

hist(log(severity), breaks = 50)
```

### Density

The kernel density is effectively a smoothed histogram.
```{r}
plot(density(severity))

plot(density(log(severity)))
```

### `fitdistr`

```{r }
library(MASS)

fitGamma <- fitdistr(severity, "gamma")
fitLognormal <- fitdistr(severity, "lognormal")
fitWeibull <- fitdistr(severity, "Weibull")

fitGamma
fitLognormal
fitWeibull
```

### q-q plot

```{r }
probabilities = (1:(sum(N)))/(sum(N)+1)

weibullQ <- qweibull(probabilities, coef(fitWeibull)[1], coef(fitWeibull)[2])
lnQ <- qlnorm(probabilities, coef(fitLognormal)[1], coef(fitLognormal)[2])
gammaQ <- qgamma(probabilities, coef(fitGamma)[1], coef(fitGamma)[2])

sortedSeverity <- sort(severity)
oldPar <- par(mfrow = c(1,3))
plot(sort(weibullQ), sortedSeverity, xlab = 'Theoretical Quantiles', ylab = 'Sample Quantiles', pch=19, main = "Weibull Fit")
abline(0,1)

plot(sort(lnQ), sortedSeverity, xlab = 'Theoretical Quantiles', ylab = 'Sample Quantiles', pch=19, main = "Lognormal Fit")
abline(0,1)

plot(sort(gammaQ), sortedSeverity, xlab = 'Theoretical Quantiles', ylab = 'Sample Quantiles', pch=19, main = "Gamma Fit")
abline(0,1)

par(oldPar)
```

### Compare fit to histogram

```{r }
sampleLogMean <- fitLognormal$estimate[1]
sampleLogSd <- fitLognormal$estimate[2]

sampleShape <- fitGamma$estimate[1]
sampleRate <- fitGamma$estimate[2]

sampleShapeW <- fitWeibull$estimate[1]
sampleScaleW <- fitWeibull$estimate[2]

x <- seq(0, max(severity), length.out=500)
yLN <- dlnorm(x, sampleLogMean, sampleLogSd)
yGamma <- dgamma(x, sampleShape, sampleRate)
yWeibull <- dweibull(x, sampleShapeW, sampleScaleW)

hist(severity, freq=FALSE, ylim=range(yLN, yGamma))

lines(x, yLN, col="blue")
lines(x, yGamma, col="red")
lines(x, yWeibull, col="green")
```

### Kolmogorov-Smirnov

The Kolmogorov-Smirnov test measures the distance between the empirical (sample) distribution and a candidate loss distribution. It's more formal than a q-q plot.
```{r}
sampleCumul <- seq(1, length(severity)) / length(severity)
stepSample <- stepfun(sortedSeverity, c(0, sampleCumul), f = 0)
yGamma <- pgamma(sortedSeverity, sampleShape, sampleRate)
yWeibull <- pweibull(sortedSeverity, sampleShapeW, sampleScaleW)
yLN <- plnorm(sortedSeverity, sampleLogMean, sampleLogSd)

plot(stepSample, col="black", main = "K-S Gamma")
lines(sortedSeverity, yGamma, col = "blue")

plot(stepSample, col="black", main = "K-S Weibull")
lines(sortedSeverity, yWeibull, col = "blue")

plot(stepSample, col="black", main = "K-S Lognormal")
lines(sortedSeverity, yLN, col = "blue")
```

### More K-S

A low value for D indicates that the selected curve is fairly close to our data. The p-value is the probability of observing a value of D at least this large if the data really did come from the candidate distribution.

```{r }
testGamma <- ks.test(severity, "pgamma", sampleShape, sampleRate)
testLN <- ks.test(severity, "plnorm", sampleLogMean, sampleLogSd)
testWeibull <- ks.test(severity, "pweibull", sampleShapeW, sampleScaleW)

testGamma
testLN
testWeibull
```

### Direct optimization

The `optim` function will optimize a function. It works much like the Solver add-in in Excel. `optim` takes a function as an argument, so let's create a function.

```{r}
quadraticFun <- function(a, b, c){
  function(x) a*x^2 + b*x + c
}

myQuad <- quadraticFun(a=4, b=-3, c=3)
plot(myQuad, -10, 10)
```

### Direct optimization

Our initial guess is 8. A good initial guess will speed up convergence.

```{r }
myResult <- optim(8, myQuad)
myResult
```

### Direct optimization

The default is to minimize. Set the parameter `fnscale` to something negative to convert to a maximization problem.
```{r }
myOtherQuad <- quadraticFun(-6, 20, -5)
plot(myOtherQuad, -10, 10)
myResult <- optim(8, myOtherQuad)
myResult <- optim(8, myOtherQuad, control = list(fnscale=-1))
```

### Direct optimization

Direct optimization allows us to create another objective function to maximize, or work with loss distributions for which there isn't yet support in a package like `actuar`. It may be used for general-purpose optimization problems, e.g. maximizing the rate of return for various capital allocation methods.

Note that optimization is a general, solved problem. Things like the simplex method already have package solutions in R. You don't need to reinvent the wheel!

### Questions

* Plot a lognormal distribution with a mean of $10,000 and a CV of 30%.
* For that distribution, what is the probability of seeing a claim greater than $100,000?
* Generate 100 and 1,000 observations from that distribution.
* Draw a histogram for each sample.
* What are the mean, standard deviation and CV of each sample?
* Convince yourself that the sample data were not produced by a Weibull distribution.
* Assuming that the number of claims is Poisson distributed with an expected value of 200, estimate the aggregate loss distribution.
* What is the cost of a $50,000 xs $50,000 layer of reinsurance?
### Answers

```{r }
severity <- 10000
CV <- .3
sigma <- sqrt(log(1 + CV^2))
mu <- log(severity) - sigma^2/2
plot(function(x) dlnorm(x, mu, sigma), 0, 30000, ylab="LN f(x)")
# probability of a claim greater than $100,000
plnorm(100000, mu, sigma, lower.tail = FALSE)
```
\ No newline at end of file
diff --git a/60_Simulation.Rmd b/60_Simulation.Rmd
new file mode 100644
index 0000000..8e9627f
--- /dev/null
+++ b/60_Simulation.Rmd
@@ -0,0 +1,127 @@
# Simulation

* Probability distributions
* Random samples

### Probability distributions

All probability distributions have four basic functions:

* d dist - Density function
* p dist - Cumulative distribution
* q dist - Quantiles
* r dist - Random number generation

### Density function

```{r }
plot(function(x) dnorm(x), -3, 3, ylab="Normal f(x)")
```

### A couple more

```{r }
oldpar = par(mfrow = c(2,1))
plot(function(x) dexp(x), 0, 10, ylab="Exp f(x)")
plot(function(x) dlnorm(x), 0, 10, ylab="LN f(x)")
par(oldpar)
```

### Distribution function

```{r }
oldpar = par(mfrow = c(3,1))
plot(function(x) pnorm(x), -3, 3, ylab="Normal F(x)")
plot(function(x) pexp(x), 0, 10, ylab="Exp F(x)")
plot(function(x) plnorm(x), 0, 10, ylab="LN F(x)")
par(oldpar)
```

### Density and distribution

```{r }
oldpar = par(mfrow = c(1,2))
plot(function(x) dlnorm(x), 0, 10, ylab="f(x)")
plot(function(x) plnorm(x), 0, 10, ylab="F(x)")
par(oldpar)
```

### Random number generation

```{r eval=TRUE}
oldpar = par(mfrow = c(3,1))
hist(rnorm(200))
hist(rexp(200))
hist(rlnorm(200))
par(oldpar)
```

### More random number generation

```{r }
oldpar = par(mfrow = c(3,1))
set.seed(1234)
hist(rnorm(200, mean=0, sd=1), xlim=c(-10, 10), breaks=10)
hist(rnorm(200, mean=0, sd=4), xlim=c(-10, 10), breaks=10)
hist(rnorm(200, mean=5, sd=2), xlim=c(-10, 10), breaks=10)
par(oldpar)
```

### `sample` revisited

Generate a random sample of any discrete set of values.

```{r }
set.seed(1234)
sample(1:100, 10)
```

Use the `prob` argument to weight the probabilities.
```{r }
sample(1:3, size = 10, prob=c(1,1,100), replace=TRUE)
```

To randomize the order of a vector, keep the default `replace = FALSE` and let `size` default to the full length of the vector.

```{r }
set.seed(1234)
letters[sample(length(letters))]
```

### Compound loss distribution

```{r }
class <- c("A", "B", "C", "D")
freq <- 1 / 1e5
exposure <- 1e6 * c(35, 40, 55, 20)
meanSeverity <- c(8, 7, 12, 10)
set.seed(1234)
numClaims <- rpois(length(exposure), exposure * freq)
dfClass <- data.frame(class, exposure, meanSeverity = exp(meanSeverity), numClaims)

# one call to rlnorm per class, each with its own meanlog
severity <- Map(function(n, m) rlnorm(n, meanlog = m), numClaims, meanSeverity)
class <- rep(class, numClaims)
dfClaims <- data.frame(class, severity = unlist(severity))
```

## Questions

* Draw a lognormal distribution with a mean of $10,000 and a CV of 30%.
* For that distribution, what is the probability of seeing a claim greater than $100,000?
* Generate 100 and 1,000 observations from that distribution.
* Draw a histogram for each sample.
* What are the mean, standard deviation and CV of each sample?

## Answers

```{r }
severity <- 10000
CV <- .3
sigma <- sqrt(log(1 + CV^2))
mu <- log(severity) - sigma^2/2
plot(function(x) dlnorm(x, mu, sigma), 0, 30000, ylab="LN f(x)")
```

```{r }
set.seed(1234)
claims = rlnorm(100, meanlog = mu, sdlog = sigma)
hist(claims, breaks = 40)
```
\ No newline at end of file
diff --git a/90_AdvancedVisualization.Rmd b/90_AdvancedVisualization.Rmd
new file mode 100644
index 0000000..b90f469
--- /dev/null
+++ b/90_AdvancedVisualization.Rmd
@@ -0,0 +1,245 @@
# Advanced Visualization

* ggplot2
* Maps

### ggplot2

ggplot2 was developed by Hadley Wickham and is based on the "grammar of graphics".

It's particularly well suited for multi-dimensional, multivariate analysis.

It requires 3 things:

1. Data
2. Mapping
3. 
Geometric layers

### Data

```{r }
set.seed(1234)
N <- 30
df <- data.frame(Cohort = c(rep("F", N), rep("M", N))
                 , weight = c(rnorm(N, 150, 10), rnorm(N, 172, 10))
                 , height = c(rnorm(N, 64, 10), rnorm(N, 70, 10))
)

library(ggplot2)
basePlot <- ggplot(df)
```

Notice that we assigned the result of the function call to an object called `basePlot`. This means we don't automatically get output. Take a bit of time to have a look at what's contained in the `basePlot` object.

### Mapping

Mappings bind data to visual elements. These are added with the `ggplot2::aes` function. We add the mapping to the plot object with the addition operator.

```{r }
basePlot <- basePlot + aes(x = height, y = weight)
```

We can map other aesthetic elements, as well. We'll also map the color of the graphic elements to the `Cohort` data element.

```{r}
basePlot <- basePlot + aes(color = Cohort)
```

### Adding layers

We're almost there. Although we have data and we've mapped to elements of a coordinate system, we haven't specified what the visual elements should be. The `geom_*` family of functions adds geometric shapes.

```{r }
basePlot <- basePlot + geom_point()
basePlot
```

### Typical geoms

* geom_line
* geom_point
* geom_bar
* geom_histogram
* geom_density
* geom_boxplot

Lots of others for things like area plots, step functions, dotplots, violin plots, errorbars, etc.

### One step

Typically we don't do this in steps.

```{r }
plt <- ggplot(df, aes(x = height, y = weight, color = Cohort)) + geom_point()
plt
```

### Nothing wrong with adding two layers

```{r }
ggplot(df, aes(x = height, y = weight, color = Cohort)) + geom_point() + geom_rug()
ggplot(df, aes(x = height, y = weight, color = Cohort)) + geom_point() + geom_density2d()
```

### Facets

Facets split the data into groups and draw a different plot for each group.
```{r}
plt <- ggplot(df, aes(x = height, y = weight, color = Cohort)) + geom_point() + facet_wrap(~ Cohort)
plt
```

### Statistical transformations

We can also add all manner of statistical smoothers.

```{r}
plt <- ggplot(df, aes(x = height, y = weight, color = Cohort)) + geom_point() + geom_smooth(method = "lm")
plt
```

See `? stat_smooth` for more on smoothers.

### Scales

Scales control how things render on the plot. We must scale our numbers to the plotting device (typically a section of a computer screen). We can also map things like color and axes to data values and control formatting.

```{r}
plt + scale_x_continuous(labels = scales::comma)
plt + scale_x_continuous(labels = scales::unit_format(unit = "in", nsmall = 2))
```

This is a _very_ detailed topic, particularly when we start talking about color. Your specific problem will likely require a bit of research and experimentation. StackOverflow.com is your friend.

### Other visual elements

Non-data elements are things like labels. Here's a sample of a few:

* `xlab()`, `ylab()` -> `plt + ylab("This is my y label") + xlab("Here is an x label")`
* `ggtitle()` -> `plt + ggtitle("Title of my plot")`
* `labs()` -> `plt + labs(x = "x-axis title", title = "My title")`
* `theme_bw()`, `theme_minimal()` -> `plt + theme_bw()`
* The `theme()` function gives complete control over all non-data related visual elements
* Check out the `ggthemes` package

We won't cover this here.

## Questions

1. Create a scatter plot for policy year and number of claims.
2. Color each point based on region.
3. Add a linear smoother. Which region is showing the greatest increase in claims?
4. Form the policy frequency by taking the ratio of claims to policies. Plot this.

Extra credit:

1. Use the state data to create a time series of the number of claims. Facet by region.
+ +## Answers + +```{r} +library(raw) +data("RegionExperience") +plt1 <- ggplot(RegionExperience, aes(x = PolicyYear, y = NumClaims)) + geom_point() +plt1 + +plt2 <- plt1 + aes(color = Region) +plt2 + +plt3 <- plt2 + stat_smooth(method = "lm") +plt3 + +RegionExperience$Frequency <- with(RegionExperience, NumClaims / NumPolicies) + +plt4 <- ggplot(RegionExperience, aes(x = PolicyYear, y = Frequency, color = Region)) + geom_point() + geom_line() + stat_smooth(method = lm) +plt4 +``` + +```{r} +data("StateExperience") +pltExtra <- ggplot(StateExperience, aes(x = PolicyYear, y = NumClaims, color = Postal)) + geom_point() + geom_line() +pltExtra + facet_wrap(~ Region) +``` + +## Maps! + +```{r message = FALSE, warning = FALSE} +library(maps) +map('state') +``` + +### Hurricanes + +```{r } +library(raw) +data(Hurricane) + +dfKatrina = subset(Hurricane, Name == 'KATRINA') +dfKatrina = dfKatrina[dfKatrina$Year == max(dfKatrina$Year), ] + +dfHugo = subset(Hurricane, Name == 'HUGO') +dfHugo = dfHugo[dfHugo$Year == max(dfHugo$Year), ] + +dfDonna = Hurricane[Hurricane$Name == 'DONNA', ] +dfDonna = dfDonna[dfDonna$Year == max(dfDonna$Year), ] +``` + +```{r } +map('state') +points(dfKatrina$Longitude, dfKatrina$Latitude, pch=19, col = 'red') +points(dfHugo$Longitude, dfHugo$Latitude, pch = 19, col = 'blue') +points(dfDonna$Longitude, dfDonna$Latitude, pch = 19, col = 'green') +``` + +### Choropleths the easy way + +```{r warning=FALSE} +library(choroplethr) +library(choroplethrMaps) + +mapData <- subset(StateExperience, PolicyYear == 2010) +mapData$value <- mapData$NumClaims +mapData$region <- mapData$Fullname +state_choropleth(mapData) +``` + +### Choropleths the harder way + +```{r} +states_map <- map_data("state") + +plt <- ggplot(subset(StateExperience, PolicyYear == 2010), aes(map_id = Fullname)) +plt <- plt + geom_map(aes(fill = NumClaims), color = "black", map = states_map) +plt <- plt + expand_limits(x = states_map$long, y = states_map$lat) +plt <- plt + coord_map() +plt 
+``` + +### Choropleths the really hard way + +```{r } +plt <- ggplot(StateExperience, aes(map_id = Fullname)) +plt <- plt + geom_map(aes(fill = NumClaims), color = "black", map = states_map) +plt <- plt + expand_limits(x = states_map$long, y = states_map$lat) +plt <- plt + facet_wrap( ~ PolicyYear) +plt <- plt + coord_map() +plt + +``` + + +## Summary + +* `ggplot2` is difficult at first, but will repay your investment. +* Works very well with grouped data to color/facet points. +* Fine-tuning things like axis labels can be a headache, but will get easier. Yes, Excel makes it easier to add data labels and change colors. `ggplot2` makes it easier to work with data. +* Maps tend to be a killer viz. They use the same point and area logic as any other graphic. + +## Reference + +* http://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf +* R Graphics Cookbook by Winston Chang +* http://vita.had.co.nz/papers/layered-grammar.html \ No newline at end of file