-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathDataFrames.Rmd
335 lines (232 loc) · 11.5 KB
/
DataFrames.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
# Data Frames
This is the big one! All of that stuff about vectors and lists was prologue to this. The data frame is a seminal concept in R. Most statistical and predictive models expect one and they are the most common way to pass data in and out of R. Although critical to understand, data frames are very, very easy to get. What's a data frame? It's a table. That's it. No, really, that's it.
By the end of this chapter you will know the following:
* What's a data frame and how do I create one?
* How can I access and assign elements to and from a data frame?
* How do I read and write external data?
## What's a data frame?
Underneath the hood, a data frame is actually a mashup of lists and vectors. Every data frame is a list, but it's constrained so that each list element is a vector with the same length. We can see how this exploits some of the fundamental properties of lists and vectors. Because each vector must have the same data type, there's no danger that I'll get character data when I only want dates or integers. At the same time, the list's flexibility means that we can store different data types in one single data construct.
### Creating a data frame
Although there are many functions that will return a data frame, let's start by looking at the `data.frame` function. We'll first create some vectors and then join them together.
```{r }
set.seed(1234)
State <- rep(c("TX", "NY", "CA"), 10)
EarnedPremium <- rlnorm(length(State), meanlog = log(50000), sdlog=1)
df <- data.frame(State, EarnedPremium, stringsAsFactors=FALSE)
```
We didn't have to create the vectors first. If we wanted, we could pass them in as the result of function calls within the call to `data.frame`.
```{r}
set.seed(1234)
df <- data.frame(State = rep(c("TX", "NY", "CA"), 10)
, EarnedPremium = round(rlnorm(length(State), meanlog = log(50000), sdlog=1), 3)
, stringsAsFactors = FALSE)
```
Note that I've set the "stringsAsFactors" argument to FALSE. If I hadn't, the column "State" would be a factor, rather than a character. See [Data Types: factors](#Factors) for some reasons why we might not want our data to be a factor.
### Basic properties of a data frame
Once created, it's straightforward to get some basic information about the data we have.
```{r }
summary(df)
str(df)
head(df)
tail(df)
```
We can also query metadata about our data frame. Note that, for a data frame, `names` and `colnames` will return the same result. `dim` will give the number of rows and columns, or you can use `nrow` and `ncol` directly. In particular, note what result is returned by the `length` function. If you think about the fact that a data frame is actually a list, you may be able to guess why `length` returns what the value it does.
```{r }
names(df)
colnames(df)
dim(df)
length(df)
nrow(df)
ncol(df)
```
### Merging
If you've got more than one data frame, it's possible to combine them in several different ways: `rbind`, `cbind` and `merge`.
`rbind` will append rows to the data frame. New rows must have the same number of columns and data types.
```{r results='hide'}
dfA = df[1:10,]
dfB = df[11:20, ]
rbind(dfA, dfB)
```
`cbind` must have the same number of rows as the data frame.
```{r }
dfC = dfA[, 1:2]
cbind(dfA, dfC)
```
`merge` is similar to a __JOIN__ operation in a relational database. If you're used to __VLOOKUP__ in Excel[^havent_used_vlookup], you'll love `merge`. It's possible to use multiple columns (e.g. state and effective date) when specifying how to join. If no columns are specified, `merge` will use whatever column names the two data frames have in common. Below, we merge a set of rate changes to our premium data.
```{r }
dfRateChange = data.frame(State = c("TX", "CA", "NY"), RateChange = c(.05, -.1, .2))
df = merge(df, dfRateChange)
```
[^havent_used_vlookup]: If you've never used VLOOKUP, I can't imagine why you're reading this.
### `expand.grid`
Consider `expand.grid`. This will create a data frame as the cartesian product (i.e. every combination of elements) of one or more vectors. Among the use cases are to check for missing data elements in another data frame.
```{r}
dfStateClass <- expand.grid(State = c("TX", "CA", "NY")
, Class = c(1776, 1066, 1492))
```
## Access and Assignment
Access and assignment will feel like a weird combination of matrices and lists, though with an emphasis on the mechanics of a 2D matrix. We'll often use the `[]` operator to specify which row and column we'd like. The first arugment will refer to the row and the second will refer to the column. If either argument is left blank, it will refer to every element.
```{r eval=FALSE}
df[1, 2]
df[2, ]
df[, 2]
df[2, -1]
```
If we want columns of the data frame, we can approach this one of a few ways. For a single column, we can use either the `$` or the `[[]]` operators. These work the same as a list (because a data frame _is_ a list).
```{r eval=FALSE}
df$EarnedPremium
head(df[[1]])
head(df[["EarnedPremium"]])
```
For multiple columns, we can pass a character vector to get columns by name, rather than by position.
```{r}
df[, "EarnedPremium"]
df[, c("EarnedPremium", "State")]
```
Note the interesting case when we specify one argument but leave the other blank. I'll wrap it in a `head` function to minimize the output.
```{r}
head(df[2])
```
Again, if you bear in mind that a data frame is a list, this should makes sense. Without the additional comma, R will assume that we're dealing with a list and will respond accordingly. In this case, that means returning the second element of the list. You can extend this by requesting multiple elements of the list.
```{r}
head(df[1:2])
```
It'll probably be rare that you want this sort of behavior. It's not so rare that you'll want to return a single vector from a data frame, but in these cases, I'd recommend either using the `$` or the `[[]]` operator.
### Altering and adding columns
We can add columns by using the same operators as for access[^also_cbind].
[^also_cbind]: We can also use `cbind` to add columns.
```{r }
df$Losses <- df$EarnedPremium * runif(nrow(df), 0.4, 1.2)
df$LossRatio = df$Losses / df$EarnedPremium
```
The `transform` and `within` functions will also create a new column.
```{r}
df <- transform(df, LogLoss = log(Losses))
df <- within(df, {LogLoss = log(Losses)})
```
### Eliminating columns
We can eliminate columns in one of two ways. If we only want to remove one column, we can assign the value NULL to it.
```{r eval=FALSE}
df$LossRatio <- NULL
```
If we want to eliminate more than one column, we may construct a new data frame which only includes the columns we'd like to keep.
```{r eval = FALSE}
df <- df[, 1:2]
df <- df[, c("State", "EarnedPremium")]
```
If we'd like to remove specific columns, Hadley Wickham showed a nice means to do this by using set differences.
```{r}
df <- df[, setdiff(colnames(df), c("RateChange", "Losses"))]
```
### Subsetting
There are at least three ways to subset. The easiest way would be to use the `subset` function.
```{r }
dfTX = subset(df, State == "TX")
dfBigPolicies = subset(df, EarnedPremium >= 50000)
```
A slightly harder way would be to use logical access
```{r }
dfTX = df[df$State == "TX", ]
dfBigPolicies = df[df$EarnedPremium >= 50000, ]
```
Finally, I could assign the results of a logical expression to a variable and then use this to subset the data frame.
```{r }
whichState = df$State == "TX"
dfTX = df[whichState, ]
whichEP = df$EarnedPremium >= 50000
dfBigPolicies = df[whichEP, ]
```
I use each of these three methods routinely. They're all good.
### Ordering
For vectors, we can use `sort` to get a sorted vector. For a data frame, we'll need to use `order`. Remember that `order` functions a lot like `which`. It will return indices which may then be used to return specific contents.
```{r }
order(df$EarnedPremium)
df = df[order(df$EarnedPremium), ]
```
### Altering column names
```{r eval=FALSE}
df$LossRation = with(df, Losses / EarnedPremium)
names(df)
colnames(df)[4] = "Loss Ratio"
colnames(df)
```
### `with` and `attach`
If the name of a data frame is long, typing it to access column elements might start to seem tedious. The `attach` function will alleviate this, by attaching the data frame onto something called the "search path" (which I might have described in the section on packages). What's the search path? Well, all evidence to the contrary, R will look high and low every time you refer to something. As soon as it finds a match, it'll proceed with whatever calculation you've asked it to do. Attaching the data frame to the search path means that the column names of the data frame will be added to the list of places where R will search.
```{r}
attach(dfA)
# do some stuff
attach(dfB)
```
There are many references that suggest using `attach`. I don't. I'll actually advise against it. Why is `attach` a bad idea? If you add a number of similar data frames or packages it can quickly get hard to keep up. The worst case scenario comes when you add two data frames that share column names and you then proceed to carry out analysis.
```{r}
mean(EarnedPremium)
```
Which EarnedPremium am I talking about? "Well to tell you the truth in all this excitement I kinda lost track myself." [^dirty_harry] `attach` won't necessarily harm you. But it is a loaded gun.
[^dirty_harry]: Copyright some film company.
## Summarizing
```{r }
sum(df$EarnedPremium)
sum(df$EarnedPremium[df$State == "TX"])
aggregate(df[,-1], list(df$State), sum)
```
### Summarizing visually - 1
```{r size='tiny', fig.height=5}
dfByState = aggregate(df$EarnedPremium, list(df$State), sum)
colnames(dfByState) = c("State", "EarnedPremium")
barplot(dfByState$EarnedPremium, names.arg=dfByState$State, col="blue")
```
### Summarizing visually - 2
```{r size='tiny', fig.height=5}
dotchart(dfByState$EarnedPremium, dfByState$State, pch=19)
```
### Advanced data frame tools
* dplyr
* tidyr
* reshape2
* data.table
Roughly 90% of your work in R will involve manipulation of data frames. There are truckloads of packages designed to make manipulation of data frames easier. Take your time getting to learn these tools. They're all powerful, but they're all a little different. I'd suggest learning the functions in `base` R first, then moving on to tools like `dplyr` and `data.table`. There's a lot to be gained from understanding the problems those packages were created to solve.
## Reading and writing external data
```{r eval=FALSE}
myData = read.csv("SomeFile.csv")
```
### Reading from Excel
Actually there are several ways:
* XLConnect
* xlsx
* Excelsi-r
```{r eval=FALSE}
library(XLConnect)
wbk = loadWorkbook("myWorkbook.xlsx")
df = readWorksheet(wbk, someSheet)
```
### Reading from the web
```{r eval=FALSE}
URL = "http://www.casact.org/research/reserve_data/ppauto_pos.csv"
df = read.csv(URL, stringsAsFactors = FALSE)
```
```{r eval=FALSE}
library(XML)
URL = "http://www.pro-football-reference.com/teams/nyj/2012_games.htm"
games = readHTMLTable(URL, stringsAsFactors = FALSE)
```
### Reading from a database
```{r eval=FALSE}
library(RODBC)
myChannel = odbcConnect(dsn = "MyDSN_Name")
df = sqlQuery(myChannel, "SELECT stuff FROM myTable")
```
### Read some data
```{r eval=FALSE}
df = read.csv("../data-raw/StateData.csv")
```
```{r eval=FALSE}
View(df)
```
## Exercises
* Create a data frame with 500 rows. Include a column for policy effective date, policy expiration date and claim count. For claim count, consider using the `rpois` function to simulate from a poisson distribution.
* Save this data to a .CSV file and then reload that file.
* Which policy had the most claims? Which policy year?
* Add a column for
### Answer
```{r }
```