The purpose of this project is to demonstrate the ability to collect, work with, and clean a data set. Please refer to the CodeBook.md for more details about input data sets and their variables.
The script run_analysis.R is logically divided into 6 steps (Step 0 to Step 5).
Below is a detailed description of how each step works.
Sets the working directory
Note: to avoid errors, change the path to the UCI HAR Dataset folder to match its location on your computer.
setwd("C:\\Users\\ekaraseva\\Desktop\\data\\UCI HAR Dataset")
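As a minimal sketch of a more portable setup, the path could be kept in a variable and checked before the script continues. The helper check_dataset_dir and the variable dataset_dir are hypothetical names used for illustration; they are not part of run_analysis.R, and the folder location is an assumption.

```r
# Hypothetical helper (not part of run_analysis.R): returns the required
# files that are missing from the dataset folder.
check_dataset_dir <- function(dataset_dir) {
  required <- c(
    "activity_labels.txt", "features.txt",
    file.path("train", c("subject_train.txt", "X_train.txt", "y_train.txt")),
    file.path("test", c("subject_test.txt", "X_test.txt", "y_test.txt"))
  )
  required[!file.exists(file.path(dataset_dir, required))]
}

# Assumed location; adjust to wherever the data was unzipped.
dataset_dir <- file.path("data", "UCI HAR Dataset")
missing_files <- check_dataset_dir(dataset_dir)
if (length(missing_files) > 0) {
  message("Missing files: ", paste(missing_files, collapse = ", "))
} else {
  setwd(dataset_dir)
}
```

Checking up front gives a single readable message instead of a read.table error on the first missing file.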
Reads files required for analysis
activity_labels<-read.table("activity_labels.txt")
features<-read.table("features.txt")
subject_train<-read.table("./train/subject_train.txt")
X_train<-read.table("./train/X_train.txt")
y_train<-read.table("./train/y_train.txt")
subject_test<-read.table("./test/subject_test.txt")
X_test<-read.table("./test/X_test.txt")
y_test<-read.table("./test/y_test.txt")
- 'train/X_train.txt' (training set) and 'train/y_train.txt' (training labels) are merged in order to identify the type of activity tracked.
- 'test/X_test.txt' (test set) and 'test/y_test.txt' (test labels) are merged in order to identify the type of activity tracked.
- The two data sets above are then bound together to create one data set, test_and_train_df.
Renames variables in test and train data sets according to the list of all features
names(X_test)<-features[,2]
names(X_train)<-features[,2]
names(subject_test)<-"subject"
names(subject_train)<-"subject"
names(y_test)<-"activity"
names(y_train)<-"activity"
Binds data sets with the labels
test_df<-cbind.data.frame(subject_test, y_test, X_test)
train_df<-cbind.data.frame(subject_train, y_train, X_train)
test_and_train_df<-rbind.data.frame(test_df, train_df)
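The binding step can be illustrated on toy data. In this sketch the values and the feature names f1 and f2 are made up; only the shape of the operation mirrors the script: cbind places subject, activity, and measurements side by side, and rbind stacks the test rows on top of the training rows.

```r
# Toy stand-ins for the real files (values are made up for illustration).
subject_test  <- data.frame(subject = c(2, 4))
y_test        <- data.frame(activity = c(5, 1))
X_test        <- data.frame(f1 = c(0.1, 0.2), f2 = c(0.3, 0.4))
subject_train <- data.frame(subject = c(1, 3, 5))
y_train       <- data.frame(activity = c(2, 2, 6))
X_train       <- data.frame(f1 = c(0.5, 0.6, 0.7), f2 = c(0.8, 0.9, 1.0))

test_df  <- cbind(subject_test, y_test, X_test)      # columns side by side
train_df <- cbind(subject_train, y_train, X_train)
toy_df   <- rbind(test_df, train_df)                 # rows stacked

dim(toy_df)    # 5 rows, 4 columns
names(toy_df)  # "subject" "activity" "f1" "f2"
```

rbind requires both data frames to have the same column names, which is why the renaming step runs before the binding step.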
Regular expressions are used within the grep function to identify the required variables, i.e. the ones that contain mean() and std() observations. The returned vector is applied to test_and_train_df to filter out the other variables. The resulting data frame is named filtered_full_df1.
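Locates the numbers of the columns that contain mean or standard deviation measurements by matching mean() and std() in the column names. The pattern matches mean and std as whole words, so names such as meanFreq() are not selected. The subject and activity columns are kept as well.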
required_columns_vector<-grep("subject|activity|\\bmean()\\b|\\bstd()\\b", names(test_and_train_df),ignore.case = TRUE)
Creates a filtered data frame that contains only the required columns/measurements
filtered_full_df1<-test_and_train_df[,required_columns_vector]
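To see which kinds of names the pattern keeps, it can be tried on a handful of feature names. The sample below is a small hand-picked subset, not the full feature list. The second pattern, with the parentheses escaped, is an assumption about an equivalent and more explicit way to write the match; it is not what the script uses.

```r
sample_names <- c("subject", "activity",
                  "tBodyAcc-mean()-X", "tBodyAcc-std()-Y",
                  "fBodyAcc-meanFreq()-X", "angle(tBodyAccMean,gravity)")

# Pattern used in run_analysis.R: the unescaped "()" is an empty group,
# so the selection relies on the word boundaries around mean and std.
pattern <- "subject|activity|\\bmean()\\b|\\bstd()\\b"
kept <- grep(pattern, sample_names, ignore.case = TRUE, value = TRUE)

# Alternative sketch with literal parentheses (not in the original script).
pattern_escaped <- "subject|activity|mean\\(\\)|std\\(\\)"
kept_escaped <- grep(pattern_escaped, sample_names, value = TRUE)

kept
identical(kept, kept_escaped)
```

On this sample both patterns keep subject, activity, tBodyAcc-mean()-X, and tBodyAcc-std()-Y, and drop the meanFreq() and angle(...) names.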
filtered_full_df1 is merged with activity_labels in order to match the activity numbers with activity names. The names are kept and the numbers are dropped, which creates another intermediate data frame called human_activities_full.
library(dplyr)
human_activities_full <- filtered_full_df1 %>%
merge(activity_labels,by.x="activity",by.y="V1",all=TRUE) %>%
select(2,69,3:68) %>%
rename(activity=V2) %>%
arrange(subject,activity)
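The label lookup can be sketched on toy data. This version uses base R only, as a dependency-free stand-in for the dplyr pipeline above. The data frame toy and the column x are made up, and the three labels shown are only the first three of the six in activity_labels.txt.

```r
# First three activity labels, in the same V1/V2 layout read.table produces.
activity_labels <- data.frame(
  V1 = 1:3,
  V2 = c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS"),
  stringsAsFactors = FALSE
)

# Made-up observations with numeric activity codes.
toy <- data.frame(subject = c(2, 1, 1), activity = c(3, 1, 2), x = c(0.3, 0.1, 0.2))

# merge() puts the key column first and appends V2 last.
merged <- merge(toy, activity_labels, by.x = "activity", by.y = "V1", all = TRUE)

# Keep the name, drop the number, then restore a stable row order.
labelled <- merged[, c("subject", "V2", "x")]
names(labelled)[2] <- "activity"
labelled <- labelled[order(labelled$subject, labelled$activity), ]
labelled
```

Because merge() reorders rows by the key, the final sort by subject and activity is what makes the output order predictable. With all = TRUE, a label that has no matching observations would appear as a row of NA values.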
A series of edits is applied to the variable names in human_activities_full. These include expanding abbreviated terms, removing special symbols such as "-" and "()", and converting all names to lower case.
NewVarNames <- gsub("-","",names(human_activities_full))
NewVarNames <- gsub("^t","timesignalof",NewVarNames)
NewVarNames <- gsub("^f","frequencysignalof",NewVarNames)
NewVarNames <- gsub("[()]","",NewVarNames)
NewVarNames <- gsub("mean","average",NewVarNames)
NewVarNames <- gsub("std","standarddeviation",NewVarNames)
NewVarNames <- gsub("Body","bodymotion",NewVarNames)
NewVarNames <- gsub("Gravity","gravitationalmotion",NewVarNames)
NewVarNames <- gsub("Acc","fromaccelerometer",NewVarNames)
NewVarNames <- gsub("Gyro","fromgyroscope",NewVarNames)
NewVarNames <- gsub("Mag","magnitude",NewVarNames)
NewVarNames <- tolower (NewVarNames)
names(human_activities_full)<-NewVarNames
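To make the effect of the substitutions concrete, the same chain can be wrapped in a function and applied to two original feature names. The function name tidy_name is hypothetical and used only for this illustration; the substitutions themselves are the ones listed above, in the same order.

```r
# Hypothetical wrapper around the gsub chain from run_analysis.R.
tidy_name <- function(x) {
  x <- gsub("-", "", x)
  x <- gsub("^t", "timesignalof", x)
  x <- gsub("^f", "frequencysignalof", x)
  x <- gsub("[()]", "", x)
  x <- gsub("mean", "average", x)
  x <- gsub("std", "standarddeviation", x)
  x <- gsub("Body", "bodymotion", x)
  x <- gsub("Gravity", "gravitationalmotion", x)
  x <- gsub("Acc", "fromaccelerometer", x)
  x <- gsub("Gyro", "fromgyroscope", x)
  x <- gsub("Mag", "magnitude", x)
  tolower(x)
}

tidy_name(c("tBodyAcc-mean()-X", "fBodyBodyGyroMag-std()"))
# "timesignalofbodymotionfromaccelerometeraveragex"
# "frequencysignalofbodymotionbodymotionfromgyroscopemagnitudestandarddeviation"
```

The second name shows that features containing "BodyBody" in the source data end up with "bodymotion" repeated.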
From the data set in step 4, creates a second, independent tidy data set with the average of each variable for each activity and each subject
human_activities_full is grouped by subject and activity, and every measurement column is summarized with the mean function. The resulting, final data frame is called human_activities_summary.
human_activities_summary<- human_activities_full %>%
group_by(subject, activity) %>%
summarise_all(mean, na.rm = TRUE) %>%
as.data.frame()
print(human_activities_summary)
#or
write.table(human_activities_summary,file="run_analysis.txt",row.names = FALSE)
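The grouping and export steps can be checked on toy data. This sketch swaps the dplyr pipeline for base R's aggregate so that it runs without extra packages. The data frame toy, the column x, and the temporary output file are made up for illustration.

```r
# Made-up observations: two subjects, two activities, one measurement.
toy <- data.frame(
  subject  = c(1, 1, 2, 2),
  activity = c("WALKING", "WALKING", "WALKING", "LAYING"),
  x        = c(1, 3, 10, 20),
  stringsAsFactors = FALSE
)

# One row per subject/activity pair, holding the mean of x.
toy_summary <- aggregate(x ~ subject + activity, data = toy, FUN = mean)
toy_summary <- toy_summary[order(toy_summary$subject, toy_summary$activity), ]

# Write the result the same way the script does, then read it back.
out_file <- tempfile(fileext = ".txt")
write.table(toy_summary, file = out_file, row.names = FALSE)
read_back <- read.table(out_file, header = TRUE)
read_back
```

Because the file is written with a header row and without row names, it should be read back with header = TRUE, as in the last lines of the sketch.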