index.Rmd

---
title       : Data Manipulation with R
author      : Kaushik Roy Chowdhury
framework   : deckjs
deckjs      :
  theme     : swiss
highlighter : highlight.js  # {highlight.js, prettify, highlight}
hitheme     : googlecode           # 
widgets     : [mathjax]            # {mathjax, quiz, bootstrap}
mode        : standalone # {standalone, draft}
knit        : slidify::knit2slides
---

<style>
code {
fontsize: 12pt;
}
pre {
font-size: -1em;
}
</style>

# Import data into R


---


## Importing data into R
Conforming to its' philosophy of freedom (of choice), reading data into R can be performed in various ways. 

### Reading data with Base R functions
* Widely used functions for reading data into R
* Available with base R distribution
* No packages required
* Can be slow and take up surprising amount of memory when reading large files
* `read.table()` and family

```{r setup1, include = FALSE}
library(knitr)
opts_chunk$set(comment = "", cache = FALSE, message=FALSE, tidy = FALSE)
```
__Usage__: `read.table(file, header = FALSE, sep = "", colClasses = NA, stringsAsFactors = TRUE, ...)`

---


## Is there a fast and efficient way to read-in data?

- `data.table()` package provides an alternative.
- `fread()` _fast file reader function_ is a fast and efficient way to read in data into R


__Usage__: `fread(input, ...)` where `...` takes in same arguments as that of `read.table`.

---


## Timing `read.table()` and `fread()` with a 20MB .csv file

The file `flights.csv` can be downloaded from [here](http://bit.ly/1L4IFxB)

```{r timing}
# install packages if not present
#install.packages(c("data.table", "rbenchmark")) 

# load install packages
library(data.table); library(rbenchmark)

# file saved in windows default directory (~ = C:/Users/.../Documents)
read_base <- function(x) raw <- read.csv("~/flights.csv")
read_DT   <- function(x) rawDT <<- fread("~/flights.csv")

# reading a 20MB .csv file
benchmark(read_base(), read_DT(), replications = 1,
  columns = c("test", "elapsed")) 
```

`fread()` is almost 5x as fast as `read.csv()`

---


## A look at the data

Having read the `flights.csv` data into R using `fread()` function, here are the first few rows

```{r look, echo=FALSE}
head(rawDT)
```

The data contains flights information of all planes that departed NYC (i.e. JFK, LGA or EWR airports) in 2013.

__Data Source__: [RITA, Bureau of transportation statistics](http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236)

---


# Data step of `data.table()`


---


## Deep-dive into `data.table()` package

`data.table()` provides __faster__ and __efficient__ manipulation of data stored in RAM

Perform the following operations:
  - Filtering relevant data/rows
  - Creating new variables, modify/delete columns
  - Group by operations
  - Summarize data
  - Ordered joins
  - Transpose data

__General form:__
> `DT[i, j, by]`: Take DT, subset rows using `i`, calculate `j` grouped by `by`


---

# Filter/Subset

---


## Filtering data 


`DT[i, j, by]`: Take DT, subset rows using `i`,<span style="opacity: 0.15;"> calculate `j`, grouped by `by` </span>

Rows can be filtered using column names satisfying conditions

- Select all flights departing from JFK airport and has a departure delay of more than 30 minutes

```{r subset2i}
head(rawDT[origin == "JFK" & dep_delay > 30])
```


---

# Manipulating columns (adding/updating)

---

## Create new variables (standalone)

`DT[i, j, by]`: Take DT, <span style="opacity: 0.2;"> subset rows using `i`, </span> calculate `j` <span style="opacity: 0.2;"> grouped by `by` </span>

New variables can be created in the `j` argument of data table operation

- Calculate air speed for each flight (= distance/air_time)

```{r calc_speed}
head(rawDT[, .(air_speed = distance/air_time)])
```

- `.()` is an alias for `list()` to perform multiple operations separated by ','
- If `.()` is not used, the result is a vector, else the result is a `data.table`

---

## Adding new variables to `data.table`

To add the `air_speed` variable in the `rawDT data.table`, use `:=` operator


```{r calc_speed1}
head(rawDT[, air_speed := distance/air_time])
```

---

## Adding multiple new variables to `data.table`

- To add multiple variables, `air_speed` and `total_delay (=dep_delay + arr_delay)` use a chained operation
- Chaining operations improves readability and avoids intermediate assignments

```{r calc_speed2}
# in rawDT[, create var1][, create var2][print rows 1:3]
rawDT[, air_speed := distance/air_time][, total_delay := dep_delay + arr_delay][1:3]
```


---

# Group by operations

---

## Grouped Operations

`DT[i, j, by]`: Take DT, <span style="opacity: 0.2;"> subset rows using `i`, calculate `j` </span> grouped by `by`

- Calculate the average air speed for each carrier
- This can be achieved by calculating the air speed variable and take an average across all flights __grouped by__ carrier
- Make `na.rm = TRUE` which removes the `NA` (missing values) from the data when calculating the `mean`

```{r calc_speed3}
# calculate average air speed by carrier and print rows 1 to 5
rawDT[, .(avg_air_speed = mean(distance/air_time, na.rm = TRUE)), by = carrier][1:5]
```


---

# Summarize

---


## Data summary

Summarize data using necessary arguments of `i`, `j` and `by` of `data.table`

__Examples:__
- Calculate daily count of all flights departing from JFK airport (`origin == "JFK`)
- For all flights flying out of JFK airport (`origin == "JFK`), find the carrier  with maximum average departure delay 
- In which month of the year does flights departing from JFK airport has the maximum departure delay?


---


### Daily count of flights departing from JFK airport

```{r smry1}
head(rawDT[origin == "JFK", .N, by = day])
```


---


### Carrier with maximum average departure delay for flights departing from JFK airport

```{r smry2}
head(rawDT[origin == "JFK", .(avg_dep_delay = mean(dep_delay, na.rm = TRUE)), by = carrier][order(-avg_dep_delay)])
```


---


### Month with maximum average departure delay for flights departing from JFK airport


```{r smry3}
smryDT <- rawDT[origin == "JFK", .(avg_dep_delay = mean(dep_delay, na.rm = TRUE)), by = month]

# setorder works much faster than base::order from previous example
setorder(smryDT, -avg_dep_delay)
head(smryDT)
```


---

# Join/Merge

---


## Ordered Joins

Joins in `data.table` are performed using `merge.data.table()` function. However, `data.tables` need to be sorted by `key(s)` which are established using `setkey()` for an existing `data.table` or `key` argument while creating one

- Create two `data.tables` with same `key` value to join

```{r joins}
dt1 <- data.table(A = letters[1:10], X = 1:10, key = "A"); head(dt1)
dt2 <- data.table(A = letters[5:14], Y = 1:10, key = "A"); head(dt2)
```


---


### Left Outer Join

```{r loj}
# left outer join
merge.data.frame(x = dt1, y = dt2, all.x=TRUE)
```

---

### Right Inner Join

```{r rij}
# right inner join
merge.data.frame(x = dt2, y = dt1, all.x=FALSE)
```

---

### Full Outer Join

```{r foj}
# full outer join
merge.data.frame(x = dt1, y = dt2, all = TRUE)
```

---

# Transposing data

---

## Reshaping data `melt`

`data.tables` can be reshaped using the `melt` and `dcast` functions:

- __`melt`__: Wide-to-long (melting)

__Usage__: `melt(data, id.vars, measure.vars, variable.name = "variable", value.name = "value", ...)` 

where,

`data` A `data.table` to melt

`id.vars` vector of id variables; if missing, all non-id columns are assigned

`measure.vars` vector of measure variables; if missing, all non-id columns are assigned

`variable.name` name for the measured variable names column

`value.name` name for the molten data values column

`...` advanced argument for `melt` functions

---

### Example

Create the data to melt

```{r melt1}
library(reshape2)
DT <- data.table(
      i1 = c(1:3, NA), 
      i2 = c(5, 6, 7, 8), 
      f1 = c("A", "C", "D", "Q"), 
      c1 = c("XY", "FE", "AA", "GG"))
DT
```


---


### Melt the data

```{r melt2}
(DT.melt1 <- melt(DT, id = c("i1", "i2"), measure = c("f1", "c1")))

#rename variable and value columns
(DT.melt2 <- melt(DT, id = c("i1", "i2"), measure = c("f1", "c1"), variable.name = "Factors", value.name = "data_value"))
```


---

## Reshaping data `dcast`

- __`dcast.data.table`__: Long-to-wide (casting)

__Usage__: `dcast.data.table(data, formula, fun.aggregate = NULL, ...)` 

where,

`data` A molten data.table

`formula` A formula of the form LHS ~ RHS to cast, eg: var1 + var2 ~ var3. The first varies slowest, and the last fastest.  "..." represents all other variables not used in the formula and "." represents no variable

`fun.aggregate` Aggregation function needed if variables do not identify a single observation for each output cell

`...` other advanced arguments

---

### `dcast` the molten `data.table` 

```{r dcast1}
(DT.dcast <- dcast.data.table(DT.melt2, i1+i2~Factors))
```