forked from rdpeng/RepData_PeerAssessment1
PA1_template.Rmd
---
title: "Reproducible Research: Peer Assessment 1"
output:
  html_document:
    keep_md: true
---
## Loading and preprocessing the data
```{r warning = F}
#kill scientific notation
options(scipen = 999)
#unzip into the working directory, where read.csv looks for the file
if (!file.exists('activity.csv')) {
  unzip('activity.zip')
}
df <- read.csv('activity.csv')
#convert the intervals into date-time objects
#this adds today's date to every time, which is incorrect, but will be ignored
#converting into full date-time objects removes jumps on the x-axis e.g. 855 to 900
#UTC matches the default time zone of date_format(), so axis labels are not shifted
df$interval <- as.POSIXct(strptime(sprintf('%04d', df$interval), '%H%M', tz = 'UTC'))
#lubridate for date-time evaluation
#loading packages near their first call to explain their use
library(lubridate)
df$date <- ymd(df$date)
```
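To see why the conversion removes gaps, here is a small sketch with made-up interval codes (not taken from the dataset). Treated as plain numbers, 855 and 900 are 45 apart, although they are only 5 minutes apart as clock times. The `tz = 'UTC'` argument is an assumption added here to keep the example independent of the local time zone.

```r
# hypothetical interval codes in HHMM form, as stored in the interval column
codes <- c(0, 5, 855, 900, 2355)

# pad to four digits so that 5 becomes '0005' and parses as 00:05
padded <- sprintf('%04d', codes)

# parse as clock times; the date part defaults to today and is ignored
times <- as.POSIXct(strptime(padded, '%H%M', tz = 'UTC'))

# gap between 855 and 900 as numbers versus as times
numeric.gap <- codes[4] - codes[3]
minute.gap <- as.numeric(difftime(times[4], times[3], units = 'mins'))

print(padded)
print(numeric.gap)
print(minute.gap)
```

On a time axis the points are spaced by `minute.gap`, so the line no longer jumps at the top of each hour.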
## What is mean total number of steps taken per day?
```{r warning = F, message = F}
#chain operations with dplyr
library(dplyr)
df.daily <- df %>%
  group_by(date) %>%
  summarize(steps.total = sum(steps))
steps.mean <- mean(df.daily$steps.total, na.rm = T)
steps.median <- median(df.daily$steps.total, na.rm = T)
library(ggplot2)
g.daily <- ggplot(df.daily, aes(date, steps.total)) +
  #geom_col draws one bar per day, same as geom_histogram(stat = 'identity')
  geom_col() +
  xlab('Date') + ylab('Steps') +
  theme(axis.text.x = element_text(angle = 300, hjust = 0)) +
  #because the mean and median are so close, overlay dashed blue on top of solid red to see both
  geom_hline(yintercept = steps.mean, linetype = 'solid', color = 'red') +
  geom_hline(yintercept = steps.median, linetype = 'dashed', color = 'blue')
g.daily
```
Daily totals fall between 0 and roughly 22,000 steps. The mean total of steps taken per day is `r steps.mean` (red line) and the median total is `r steps.median` (blue line).
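One detail of the calculation above: `sum(steps)` without `na.rm` returns `NA` for any day that contains a missing value, and `mean(..., na.rm = T)` then leaves those days out. A minimal base-R sketch with invented numbers shows how this differs from summing with `na.rm = TRUE`, which would count such days as zero-step days and pull the mean down.

```r
# toy data: three days, two intervals each; day d2 is entirely missing
toy <- data.frame(
  date  = rep(c('d1', 'd2', 'd3'), each = 2),
  steps = c(100, 200, NA, NA, 300, 500)
)

# daily totals keeping NA: the missing day stays NA
totals.keep <- tapply(toy$steps, toy$date, sum)

# daily totals dropping NA: the missing day becomes 0
totals.drop <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

mean.keep <- mean(totals.keep, na.rm = TRUE)  # averages d1 and d3 only
mean.drop <- mean(totals.drop)                # averages d1, d2 (as 0) and d3

print(totals.keep)
print(mean.keep)
print(mean.drop)
```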
## What is the average daily activity pattern?
```{r}
df.intvl <- df %>%
  group_by(interval) %>%
  summarize(steps.mean = mean(steps, na.rm = T))
interval.max <- df.intvl[which.max(df.intvl$steps.mean),]
#The question asks for a plot; the base call below works, but I prefer ggplot, so that is what I present.
#plot(df.intvl, type = 'l')
#scales package for better control of the x-axis labels
library(scales)
g.intvl <- ggplot(df.intvl, aes(interval, steps.mean)) +
  geom_line() +
  scale_x_datetime(breaks = date_breaks("2 hours"),
                   labels = date_format("%H:%M")) +
  xlab('Time') + ylab('Steps') +
  #build the label from the data instead of hard-coding the time
  annotate('text', x = interval.max[[1]], y = interval.max[[2]] + 3,
           label = paste('Maximum steps:', round(interval.max[[2]], 1),
                         'at', format(interval.max[[1]], '%H:%M')))
g.intvl
```
The interval with the maximum number of steps is 8:35 AM (going to work). There are also local maxima at 12:05 PM (lunch), 3:55 PM (leaving work?), and 6:50 PM (going out for the evening?).
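The peak interval can be located and labelled from the data itself. The sketch below uses invented interval means (the values and the `pattern` data frame are hypothetical) to show how `which.max` picks the row and how `format` turns its time into a label.

```r
# invented interval means keyed by clock time; UTC keeps the example zone-independent
pattern <- data.frame(
  interval   = as.POSIXct(c('2012-10-01 08:30', '2012-10-01 08:35', '2012-10-01 08:40'),
                          tz = 'UTC'),
  steps.mean = c(10, 42.5, 17)
)

# which.max returns the index of the first maximum
peak <- pattern[which.max(pattern$steps.mean), ]

# build the label from the selected row
peak.label <- paste('Maximum steps:', round(peak$steps.mean, 1),
                    'at', format(peak$interval, '%H:%M'))
print(peak.label)
```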
## Imputing missing values
```{r}
df.nas <- sum(is.na(df))
```
There are `r df.nas` intervals with missing values in the dataset.
```{r}
#create a new data frame with NAs replaced by interval averages
df.replace <- df %>%
  group_by(interval) %>%
  mutate(steps = ifelse(is.na(steps),
                        #I've seen as.integer recommended here, but I think fractional steps are acceptable.
                        mean(steps, na.rm = TRUE),
                        steps)) %>%
  #drop the grouping so later steps start from an ungrouped data frame
  ungroup()
```
Setting every missing value to the mean of its interval fills in the days that previously had no total, which changes the daily totals plotted above.
```{r}
#same code as above, applied to the new data frame
df.replace.daily <- df.replace %>%
  group_by(date) %>%
  summarize(steps.total = sum(steps))
#use the imputed totals here, not df.daily
steps.replace.mean <- mean(df.replace.daily$steps.total, na.rm = T)
steps.replace.median <- median(df.replace.daily$steps.total, na.rm = T)
g.replace.daily <- ggplot(df.replace.daily, aes(date, steps.total)) +
  geom_col() +
  xlab('Date') + ylab('Steps') +
  theme(axis.text.x = element_text(angle = 300, hjust = 0)) +
  geom_hline(yintercept = steps.replace.mean, linetype = 'solid', color = 'red') +
  geom_hline(yintercept = steps.replace.median, linetype = 'dashed', color = 'blue')
g.replace.daily
```
The mean total of steps taken per day stays at `r steps.replace.mean` (red line): each imputed day receives the sum of the interval means, which equals the mean daily total when missing values cover whole days. The median total is now `r steps.replace.median` (blue line).
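The effect of interval-mean imputation on the mean daily total can be checked on a toy example. The sketch below uses base R and invented numbers, and assumes that missing values cover whole days; with partially missing days the two means would not necessarily match.

```r
# toy data: three days, two intervals each; day d2 is entirely missing
toy <- data.frame(
  date     = rep(c('d1', 'd2', 'd3'), each = 2),
  interval = rep(c(0, 5), times = 3),
  steps    = c(100, 200, NA, NA, 300, 500)
)

# mean daily total before imputation, skipping the missing day
before <- mean(tapply(toy$steps, toy$date, sum), na.rm = TRUE)

# replace each NA with the mean of its interval (ave applies FUN within each group)
toy$filled <- ave(toy$steps, toy$interval,
                  FUN = function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))

# mean daily total after imputation
after <- mean(tapply(toy$filled, toy$date, sum))

print(before)
print(after)
```

The filled day gets the total of the interval means, so it sits exactly at the previous mean and leaves it unchanged.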
## Are there differences in activity patterns between weekdays and weekends?
To find out, we first label each data point as falling on a weekday or a weekend, and then aggregate the mean steps by interval within each group.
```{r}
#wday() accepts Date objects directly, so the conversion to POSIXct is not needed
#classify by weekday number instead of label text, which can vary with locale and lubridate version
#week_start = 1 numbers Monday as 1, making Saturday 6 and Sunday 7
df.replace$Day <- factor(ifelse(wday(df.replace$date, week_start = 1) >= 6,
                                'Weekend', 'Weekday'),
                         levels = c('Weekend', 'Weekday'))
df.replace.intvl <- df.replace %>%
  group_by(Day, interval) %>%
  summarize(steps = mean(steps))
head(df.replace.intvl)
```
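Classifying days by weekday number rather than by label text avoids any dependence on locale. A base-R sketch of that idea, offered as an alternative rather than as the method used in this report, relies on the `%u` format code, where Monday is 1 and Sunday is 7. The `dates` vector is a hypothetical sample.

```r
# a few dates from autumn 2012; 2012-10-06 and 2012-10-07 were a Saturday and a Sunday
dates <- as.Date(c('2012-10-05', '2012-10-06', '2012-10-07', '2012-10-08'))

# %u gives the ISO weekday number as text: '1' (Monday) through '7' (Sunday)
iso.day <- as.integer(format(dates, '%u'))

# 6 and 7 are Saturday and Sunday
day.type <- factor(ifelse(iso.day >= 6, 'Weekend', 'Weekday'),
                   levels = c('Weekend', 'Weekday'))

print(data.frame(dates, iso.day, day.type))
```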
```{r}
#Instructions ask for a lattice plot, but I find the side-by-side panels hard to read. The code below does work though.
#library(lattice)
#xyplot(steps ~ interval | Day, df.replace.intvl, type = 'l')
g.replace.intvl <- ggplot(df.replace.intvl, aes(x = interval, y = steps, col = Day)) +
  geom_line() +
  scale_x_datetime(breaks = date_breaks("2 hours"),
                   labels = date_format("%H:%M")) +
  xlab('Time') + ylab('Steps') +
  #I think this makes it harder to read, but the question really wants it.
  facet_grid(. ~ Day)
g.replace.intvl
```
In some respects, weekend activity mirrors weekday activity: both show local maxima around 8:30 AM, 12:00 PM, and 4:00 PM. Weekday activity, however, is dominated by the morning spike, while weekend activity is spread more evenly throughout the day.