Course Project for Getting & Cleaning Data based on the Human Activity Recognition Using Smartphones Dataset
This CodeBook describes the variables, the data, and any transformations or work performed to clean up
the source data and create a tidy dataset as per the requirements of the course project.
The experiments were carried out with a group of 30 volunteers within an age bracket of 19-48 years.
Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING)
while wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, 3-axial
linear acceleration and 3-axial angular velocity were captured at a constant rate of 50Hz.
The experiments were video-recorded to label the data manually. The obtained dataset was randomly partitioned
into two sets, where 70% of the volunteers were selected for generating the training data and 30% the test data.
Human Activity Recognition Using Smartphones Dataset
Data for analysis is downloaded from the below URL
- README.txt: Details of all the files in the downloaded folder.
- features_info.txt: Shows information about the variables used on the feature vector.
- features.txt: List of all features, i.e. the list of all measurement variables.
- activity_labels.txt: Lists each activity Id with its corresponding activity name.
- train/X_train.txt: Training set.
- train/y_train.txt: Training activity Id labels.
- train/subject_train.txt: Each row identifies the subject who performed the activity for each window sample. Its range is from 1 to 30.
- test/X_test.txt: Test set.
- test/y_test.txt: Test activity Id labels.
- test/subject_test.txt: Each row identifies the subject who performed the activity for each window sample. Its range is from 1 to 30.
The following files are available for the train and test data. Their descriptions are equivalent.
- train/Inertial Signals/total_acc_x_train.txt: The acceleration signal from the smartphone accelerometer X axis in standard gravity units g. Every row shows a 128 element vector. The same description applies for the total_acc_y_train.txt and total_acc_z_train.txt files for the Y and Z axis.
- train/Inertial Signals/body_acc_x_train.txt: The body acceleration signal obtained by subtracting the gravity from the total acceleration.
- train/Inertial Signals/body_gyro_x_train.txt: The angular velocity vector measured by the gyroscope for each window sample. The units are radians/second.

Note: The files in train/Inertial Signals and test/Inertial Signals are not used in this analysis.
Common Files
- features.txt: 561 rows of 2 variables (feature identifier and feature name)
- activity_labels.txt: 6 rows of 2 variables (activity identifier and activity name)

Test Dataset
- xTest.txt: 2947 rows of 561 measurement variables. These are the measurement variables listed in features.txt
- yTest.txt: 2947 rows of 1 variable (activity identifier)
- subjectTest.txt: 2947 rows of 1 variable (subject identifier)

Training Dataset
- xTrain.txt: 7352 rows of 561 measurement variables. These are the measurement variables listed in features.txt
- yTrain.txt: 7352 rows of 1 variable (activity identifier)
- subjectTrain.txt: 7352 rows of 1 variable (subject identifier)
Variable Names | subjectId | activityId | (variable names from features.txt) |
---|---|---|---|
Data | subjectTest.txt | yTest.txt | xTest.txt |
Data | subjectTrain.txt | yTrain.txt | xTrain.txt |
The run_analysis.R script has the following requirements to perform transformations on the UCI HAR Dataset:
- Merges the training and the test sets to create one data set.
- Extracts only the measurements on the mean and standard deviation for each measurement.
- Uses descriptive activity names to name the activities in the data set
- Appropriately labels the data set with descriptive activity names.
- Creates a second, independent tidy data set with the average of each variable for each activity and each subject.

The script carries out these requirements in the following steps:
- Downloads the dataset from the URL mentioned above and unzips it to create the UCI HAR Dataset folder.
- Imports the "test" and "train" datasets, creates data frames from them, and then merges the training and the test sets to create one data frame.
- Extracts a subset of the data with only the measurements on the mean "mean()" and standard deviation "std()" for each measurement. This excludes meanFreq() measurements and angle measurements where the term mean appears, resulting in 66 measurement variables.
- Updates the variable names in the data frame to improve readability.
- Appropriately labels the data set with descriptive activity names in place of activity Ids.
- Reshapes the dataset to create a data frame with the average of each measurement variable for each activity and each subject.
- Writes the new tidy data frame to a text file to create the required tidy data set file of 180 observations and 68 columns (2 columns for activityName and subjectId and 66 columns for measurement variables).

- Download the dataset from the URL mentioned above and unzip it to create the UCI HAR Dataset folder.
- Import the test and train datasets, create data frames from them, and then merge the training and the test sets to create one data frame.

All files listed above are imported to create data frames, and the column variable names are updated as follows:
data.frame | Variable Names |
---|---|
featureVariables | "varId", "varName" |
activityLabels | "activityId", "activityName" |
xTest | featureVariables$varName |
yTest | "activityId" |
subjectTest | "subjectId" |
xTrain | featureVariables$varName |
yTrain | "activityId" |
subjectTrain | "subjectId" |

- subjectTest, yTest, and xTest were column-bound using the cbind function to create the testData data frame. This added "subjectId" and "activityId" to the dataset, making it a 563-column data.frame with 2947 rows.
- subjectTrain, yTrain, and xTrain were column-bound using the cbind function to create the trainData data.frame. This added "subjectId" and "activityId" to the dataset, making it a 563-column data.frame with 7352 rows.
- The testData and trainData data.frames were row-bound using the rbind function to create the final aggregated data.frame AggregateData with 10299 rows and 563 columns.

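The import and merge steps described above can be sketched on toy data as follows. This is a minimal illustration, not the script itself: the temporary folder, the three-feature stand-in files, and the helper readPartition are assumptions made for the example, while the data frame and column names follow the table above.

```r
# Toy stand-ins for the dataset files, written to a temporary folder so the
# sketch runs without the real download. Three features instead of 561.
dataDir <- file.path(tempdir(), "harSketch")
dir.create(file.path(dataDir, "test"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(dataDir, "train"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("1 tBodyAcc-mean()-X", "2 tBodyAcc-std()-X", "3 tBodyAcc-mad()-X"),
           file.path(dataDir, "features.txt"))
writeLines(c("0.25 -0.97 -0.98", "0.28 -0.99 -0.99"),
           file.path(dataDir, "test", "X_test.txt"))
writeLines(c("5", "5"), file.path(dataDir, "test", "y_test.txt"))
writeLines(c("2", "2"), file.path(dataDir, "test", "subject_test.txt"))
writeLines(c("0.27 -0.98 -0.97", "0.26 -0.96 -0.95", "0.29 -0.94 -0.93"),
           file.path(dataDir, "train", "X_train.txt"))
writeLines(c("1", "4", "6"), file.path(dataDir, "train", "y_train.txt"))
writeLines(c("1", "1", "3"), file.path(dataDir, "train", "subject_train.txt"))

featureVariables <- read.table(file.path(dataDir, "features.txt"),
                               col.names = c("varId", "varName"),
                               stringsAsFactors = FALSE)

# Hypothetical helper: read one partition ("test" or "train") and column-bind
# subject ids, activity ids and measurements, in that order.
readPartition <- function(partition) {
  partitionFile <- function(prefix) {
    file.path(dataDir, partition, paste0(prefix, "_", partition, ".txt"))
  }
  subjects <- read.table(partitionFile("subject"), col.names = "subjectId")
  activities <- read.table(partitionFile("y"), col.names = "activityId")
  measurements <- read.table(partitionFile("X"))
  # Assign names after reading so that "-" and "()" are kept verbatim.
  names(measurements) <- featureVariables$varName
  cbind(subjects, activities, measurements)
}

testData <- readPartition("test")
trainData <- readPartition("train")

# Row-bind the test rows first, then the training rows.
AggregateData <- rbind(testData, trainData)
dim(AggregateData)
```

Assigning the feature names with names() after reading, rather than passing them to read.table through col.names, is a choice made in this sketch so that the original names such as "tBodyAcc-mean()-X" are preserved unchanged.
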
The code below shows a few details of AggregateData
> head(names(AggregateData),10)
[1] "subjectId" "activityId" "tBodyAcc-mean()-X" "tBodyAcc-mean()-Y"
[5] "tBodyAcc-mean()-Z" "tBodyAcc-std()-X" "tBodyAcc-std()-Y" "tBodyAcc-std()-Z"
[9] "tBodyAcc-mad()-X" "tBodyAcc-mad()-Y"
>
> head(AggregateData[1:5])
subjectId activityId tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z
1 2 5 0.2571778 -0.02328523 -0.01465376
2 2 5 0.2860267 -0.01316336 -0.11908252
3 2 5 0.2754848 -0.02605042 -0.11815167
4 2 5 0.2702982 -0.03261387 -0.11752018
5 2 5 0.2748330 -0.02784779 -0.12952716
6 2 5 0.2792199 -0.01862040 -0.11390197
>
- Extract a subset of data with only the measurements on the mean mean() and standard deviation std() for each measurement
- grep functions are used to search for occurrences of mean mean() and standard deviation std() in the AggregateData variable names using escape characters.
- Using escape characters to search exactly for mean() and std() helps to exclude meanFreq() measurements and angle measurements where the term mean appears.
- The resulting selection has only 66 measurement variables.

The code below shows the search using the grepl function on the feature names in featureVariables, which are the measurement column names of AggregateData.
> head(featureVariables[grepl("mean\\(\\)|std\\(\\)", featureVariables$varName), ], 10)
varId varName
1 1 tBodyAcc-mean()-X
2 2 tBodyAcc-mean()-Y
3 3 tBodyAcc-mean()-Z
4 4 tBodyAcc-std()-X
5 5 tBodyAcc-std()-Y
6 6 tBodyAcc-std()-Z
41 41 tGravityAcc-mean()-X
42 42 tGravityAcc-mean()-Y
43 43 tGravityAcc-mean()-Z
44 44 tGravityAcc-std()-X
>
> tail(featureVariables[grepl("mean\\(\\)|std\\(\\)", featureVariables$varName), ])
varId varName
516 516 fBodyBodyAccJerkMag-mean()
517 517 fBodyBodyAccJerkMag-std()
529 529 fBodyBodyGyroMag-mean()
530 530 fBodyBodyGyroMag-std()
542 542 fBodyBodyGyroJerkMag-mean()
543 543 fBodyBodyGyroJerkMag-std()
>
- To extract a subset of only the measurements on the mean mean() and standard deviation std(), grep is used to create an index of matched column numbers.
- This index of column numbers is used to create a subset of the data based on the mean() and std() column indices, also including the first 2 columns, which hold the subjectId and activityId values.

The code below shows the index of occurrences of mean() and std()
> head(grep("mean\\(\\)|std\\(\\)", names(AggregateData)), 10)
[1] 3 4 5 6 7 8 43 44 45 46
>
This stage creates a data.frame subsetAggregateData of 10299 observations and 68 variables.
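
The column selection can be sketched as follows. This is an illustrative example on a toy data frame with seven hand-picked column names, not the full 563-column dataset; the variable meanStdIndex is a name assumed for the example.

```r
# Toy column names covering the cases the pattern must keep or reject.
columnNames <- c("subjectId", "activityId",
                 "tBodyAcc-mean()-X", "tBodyAcc-std()-X", "tBodyAcc-mad()-X",
                 "fBodyAcc-meanFreq()-X", "angle(tBodyAccMean,gravity)")
AggregateData <- as.data.frame(matrix(0, nrow = 2, ncol = length(columnNames)))
names(AggregateData) <- columnNames

# Escaping the parentheses matches the literal text "mean()" or "std()",
# so "meanFreq()" and "angle(...Mean...)" are not selected.
meanStdIndex <- grep("mean\\(\\)|std\\(\\)", names(AggregateData))

# Keep the two id columns plus the matched measurement columns.
subsetAggregateData <- AggregateData[, c(1, 2, meanStdIndex)]
names(subsetAggregateData)
```

Here only the mean() and std() columns are matched, so the subset keeps four of the seven toy columns.
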
- Update the variable names in the data frame to improve readability
- camelCase has been used for the R objects created and for the variable names in the data frame to improve readability.
- The gsub function is used to remove instances of "-" and "()" from the variable names in the extracted subsetAggregateData data.frame.
- Instances of mean are replaced by Mean and std by Std to obtain proper camelCase in the variable names, e.g. "tBodyAcc-mean()-X" is changed to "tBodyAccMeanX".

Usage of the gsub function:
gsub("-", "", names(subsetAggregateData))
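
The full renaming can be sketched as a chain of gsub calls. The helper name cleanVariableNames and the order of the substitutions are assumptions made for this example; the script may arrange them differently.

```r
# Hypothetical helper that applies the substitutions described above.
cleanVariableNames <- function(variableNames) {
  variableNames <- gsub("-", "", variableNames)        # remove hyphens
  variableNames <- gsub("\\(\\)", "", variableNames)   # remove literal "()"
  variableNames <- gsub("mean", "Mean", variableNames) # camelCase for mean
  variableNames <- gsub("std", "Std", variableNames)   # camelCase for std
  variableNames
}

cleanedNames <- cleanVariableNames(c("subjectId", "activityId",
                                     "tBodyAcc-mean()-X",
                                     "fBodyBodyGyroJerkMag-std()"))
cleanedNames
```

The id columns pass through unchanged because they contain none of the replaced patterns.
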
The code below shows the variable names before and after the cleanup.
> head(names(AggregateData))
[1] "subjectId" "activityId" "tBodyAcc-mean()-X" "tBodyAcc-mean()-Y"
[5] "tBodyAcc-mean()-Z" "tBodyAcc-std()-X"
>
> head(names(subsetAggregateData))
[1] "subjectId" "activityId" "tBodyAccMeanX" "tBodyAccMeanY" "tBodyAccMeanZ" "tBodyAccStdX"
>
- Appropriately label the data set with descriptive activity names in place of activity Ids
- The activityLabels and subsetAggregateData data frames are merged using the merge function to add a new column and create a new data.frame subFinalData with the corresponding "activityName" for each "activityId" in each row of the dataset.

The code snippet below shows the added activityName in the dataset
> head(subFinalData[1:5])
activityId activityName subjectId tBodyAccMeanX tBodyAccMeanY
1 1 WALKING 26 0.2314146 -0.017722438
2 1 WALKING 29 0.3312213 -0.018502366
3 1 WALKING 29 0.3755700 -0.024728610
4 1 WALKING 29 0.2332297 -0.034451457
5 1 WALKING 29 0.2362494 -0.014396940
6 1 WALKING 29 0.2645428 0.002484389
>
- The "activityId" column is no longer needed once activityName has been mapped to activityId in the dataset, so it is dropped to create the final data.frame called finalData.
- This data frame has 10299 observations and 68 columns: 2 columns for "activityName" and "subjectId" and the remaining 66 for the measurement variables with measurements on mean() and std().
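
The labeling step can be sketched as follows on a three-row toy subset. The values are made up for illustration, and dropping the column by name is one possible way to remove activityId, assumed here rather than taken from the script.

```r
# Toy copy of the activity lookup table (id to name).
activityLabels <- data.frame(
  activityId = 1:6,
  activityName = c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
                   "SITTING", "STANDING", "LAYING"),
  stringsAsFactors = FALSE
)

# Toy measurements: three rows, one measurement column.
subsetAggregateData <- data.frame(subjectId = c(2, 2, 1),
                                  activityId = c(5, 4, 1),
                                  tBodyAccMeanX = c(0.25, 0.28, 0.22))

# merge() joins on the shared "activityId" column and sorts rows by that key,
# which is why the row order differs from the input order.
subFinalData <- merge(activityLabels, subsetAggregateData, by = "activityId")

# Drop the id column now that every row carries its activity name.
finalData <- subFinalData[, names(subFinalData) != "activityId"]
finalData
```
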
- Reshape the dataset to create a data frame with the average of each measurement variable for each activity and each subject
- Write the new tidy data frame to a text file to create the required tidy data set
- The reshape2 package is leveraged for reshaping the dataset: library(reshape2)
- The melt function of the reshape2 package is used to reshape the data based on the id variables "activityName" and "subjectId" against all measurement variables to create the finalDataMelt data frame. melt takes wide-format data and melts it into long-format data.
- The finalDataMelt data frame has 679734 observations (10299 rows x 66 measurement variables) of 4 variables.

The code below shows the data transformation done by the melt function to create the finalDataMelt data frame
> head(finalDataMelt)
activityName subjectId variable value
1 WALKING 26 tBodyAccMeanX 0.2314146
2 WALKING 29 tBodyAccMeanX 0.3312213
3 WALKING 29 tBodyAccMeanX 0.3755700
4 WALKING 29 tBodyAccMeanX 0.2332297
5 WALKING 29 tBodyAccMeanX 0.2362494
6 WALKING 29 tBodyAccMeanX 0.2645428
>
- Lastly, the dcast function of the reshape2 package is used to create a new tidy data.frame called avgSujectActivities.
- This dataset provides the average of each measurement variable for each activity and each subject.
- dcast takes long-format data and casts it into wide-format data.
- The result of the dcast operation is the tidy data frame avgSujectActivities.
- This data frame has 180 observations/rows and 68 columns/variables (2 columns for activityName and subjectId and 66 columns for measurement variables).
- Each measurement variable column [3 to 68] holds the average value for each combination of subjectId and activityName.
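
The melt and dcast steps can be sketched together on a toy finalData with four rows and two measurement columns. The values are invented for illustration, and the sketch assumes the reshape2 package is installed.

```r
library(reshape2)

# Toy wide-format data: subject 1 has two WALKING windows to average.
finalData <- data.frame(
  activityName = c("WALKING", "WALKING", "SITTING", "SITTING"),
  subjectId = c(1, 1, 1, 2),
  tBodyAccMeanX = c(0.2, 0.4, 0.1, 0.3),
  tBodyAccStdX = c(-0.9, -0.7, -0.5, -0.3)
)

# Wide to long: one row per (activityName, subjectId, variable, value).
finalDataMelt <- melt(finalData, id.vars = c("activityName", "subjectId"))

# Long to wide, averaging the values within each activity/subject pair.
avgSujectActivities <- dcast(finalDataMelt,
                             activityName + subjectId ~ variable,
                             mean)
avgSujectActivities
```

The melted frame has 4 rows x 2 measurement variables = 8 rows, mirroring the 10299 x 66 = 679734 count above, and the cast frame has one row per distinct activity/subject pair (3 in this toy case, 180 in the real data).
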

The code below shows a few details of the tidy data frame
> head(avgSujectActivities[1:7])
activityName subjectId tBodyAccMeanX tBodyAccMeanY tBodyAccMeanZ tBodyAccStdX tBodyAccStdY
1 LAYING 1 0.2215982 -0.04051395 -0.1132036 -0.9280565 -0.8368274
2 LAYING 2 0.2813734 -0.01815874 -0.1072456 -0.9740595 -0.9802774
3 LAYING 3 0.2755169 -0.01895568 -0.1013005 -0.9827766 -0.9620575
4 LAYING 4 0.2635592 -0.01500318 -0.1106882 -0.9541937 -0.9417140
5 LAYING 5 0.2783343 -0.01830421 -0.1079376 -0.9659345 -0.9692956
6 LAYING 6 0.2486565 -0.01025292 -0.1331196 -0.9340494 -0.9246448
>
- The avgSujectActivities data frame is written to a file using the write.table function with the "\t" separator to create the avgSujectActivities.txt file.
- By default, column names are kept in the file. Row names have to be explicitly excluded using the row.names=FALSE argument in the write.table function.
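
The write step can be sketched as follows, using two rows taken from the output above. Writing to a temporary folder and reading the file back are additions made for this example so the result can be checked; the script itself writes avgSujectActivities.txt to its working location.

```r
# Two rows copied from the tidy data frame shown earlier.
avgSujectActivities <- data.frame(
  activityName = c("LAYING", "LAYING"),
  subjectId = c(1, 2),
  tBodyAccMeanX = c(0.2215982, 0.2813734)
)

# Temporary location assumed for the sketch.
outFile <- file.path(tempdir(), "avgSujectActivities.txt")

# Tab-separated output; column names are written by default, and
# row.names = FALSE suppresses the extra row-name column.
write.table(avgSujectActivities, file = outFile, sep = "\t", row.names = FALSE)

# Read the file back to confirm the header and the values survive.
readBack <- read.table(outFile, header = TRUE, sep = "\t")
readBack
```
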

Variable Name | Details |
---|---|
activityName | Factor with 6 levels: WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING |
subjectId | Identifies the subject who performed the activity. Its range is from 1 to 30 |
tBodyAccMeanX | Average of Mean Value of time domain Body Acceleration in X direction |
tBodyAccMeanY | Average of Mean Value of time domain Body Acceleration in Y direction |
tBodyAccMeanZ | Average of Mean Value of time domain Body Acceleration in Z direction |
tBodyAccStdX | Average of Standard Deviation of time domain Body Acceleration in X direction |
tBodyAccStdY | Average of Standard Deviation of time domain Body Acceleration in Y direction |
tBodyAccStdZ | Average of Standard Deviation of time domain Body Acceleration in Z direction |
tGravityAccMeanX | Average of Mean Value of time domain Gravity Acceleration in X direction |
tGravityAccMeanY | Average of Mean Value of time domain Gravity Acceleration in Y direction |
tGravityAccMeanZ | Average of Mean Value of time domain Gravity Acceleration in Z direction |
tGravityAccStdX | Average of Standard Deviation of time domain Gravity Acceleration in X direction |
tGravityAccStdY | Average of Standard Deviation of time domain Gravity Acceleration in Y direction |
tGravityAccStdZ | Average of Standard Deviation of time domain Gravity Acceleration in Z direction |
tBodyAccJerkMeanX | Average of Mean Value of time domain Body Acceleration Jerk in X direction |
tBodyAccJerkMeanY | Average of Mean Value of time domain Body Acceleration Jerk in Y direction |
tBodyAccJerkMeanZ | Average of Mean Value of time domain Body Acceleration Jerk in Z direction |
tBodyAccJerkStdX | Average of Standard Deviation of time domain Body Acceleration Jerk in X direction |
tBodyAccJerkStdY | Average of Standard Deviation of time domain Body Acceleration Jerk in Y direction |
tBodyAccJerkStdZ | Average of Standard Deviation of time domain Body Acceleration Jerk in Z direction |
tBodyGyroMeanX | Average of Mean Value of time domain Body Gyro in X direction |
tBodyGyroMeanY | Average of Mean Value of time domain Body Gyro in Y direction |
tBodyGyroMeanZ | Average of Mean Value of time domain Body Gyro in Z direction |
tBodyGyroStdX | Average of Standard Deviation of time domain Body Gyro in X direction |
tBodyGyroStdY | Average of Standard Deviation of time domain Body Gyro in Y direction |
tBodyGyroStdZ | Average of Standard Deviation of time domain Body Gyro in Z direction |
tBodyGyroJerkMeanX | Average of Mean Value of time domain Body Gyro Jerk signal in X direction |
tBodyGyroJerkMeanY | Average of Mean Value of time domain Body Gyro Jerk signal in Y direction |
tBodyGyroJerkMeanZ | Average of Mean Value of time domain Body Gyro Jerk signal in Z direction |
tBodyGyroJerkStdX | Average of Standard Deviation of time domain Body Gyro Jerk signal in X direction |
tBodyGyroJerkStdY | Average of Standard Deviation of time domain Body Gyro Jerk signal in Y direction |
tBodyGyroJerkStdZ | Average of Standard Deviation of time domain Body Gyro Jerk signal in Z direction |
tBodyAccMagMean | Average of Mean Value of time domain Body Acceleration magnitude |
tBodyAccMagStd | Average of Standard Deviation of time domain Body Acceleration magnitude |
tGravityAccMagMean | Average of Mean Value of time domain Gravity Acceleration magnitude |
tGravityAccMagStd | Average of Standard Deviation of time domain Gravity Acceleration magnitude |
tBodyAccJerkMagMean | Average of Mean Value of time domain Body Acceleration Jerk magnitude |
tBodyAccJerkMagStd | Average of Standard Deviation of time domain Body Acceleration Jerk magnitude |
tBodyGyroMagMean | Average of Mean Value of time domain Body Gyro magnitude |
tBodyGyroMagStd | Average of Standard Deviation of time domain Body Gyro magnitude |
tBodyGyroJerkMagMean | Average of Mean Value of time domain Body Gyro Jerk magnitude |
tBodyGyroJerkMagStd | Average of Standard Deviation of time domain Body Gyro Jerk magnitude |
fBodyAccMeanX | Average of Mean Value of frequency domain Body Acceleration in X direction |
fBodyAccMeanY | Average of Mean Value of frequency domain Body Acceleration in Y direction |
fBodyAccMeanZ | Average of Mean Value of frequency domain Body Acceleration in Z direction |
fBodyAccStdX | Average of Standard Deviation of frequency domain Body Acceleration in X direction |
fBodyAccStdY | Average of Standard Deviation of frequency domain Body Acceleration in Y direction |
fBodyAccStdZ | Average of Standard Deviation of frequency domain Body Acceleration in Z direction |
fBodyAccJerkMeanX | Average of Mean Value of frequency domain Body Acceleration Jerk in X direction |
fBodyAccJerkMeanY | Average of Mean Value of frequency domain Body Acceleration Jerk in Y direction |
fBodyAccJerkMeanZ | Average of Mean Value of frequency domain Body Acceleration Jerk in Z direction |
fBodyAccJerkStdX | Average of Standard Deviation of frequency domain Body Acceleration Jerk in X direction |
fBodyAccJerkStdY | Average of Standard Deviation of frequency domain Body Acceleration Jerk in Y direction |
fBodyAccJerkStdZ | Average of Standard Deviation of frequency domain Body Acceleration Jerk in Z direction |
fBodyGyroMeanX | Average of Mean Value of frequency domain Body Gyro in X direction |
fBodyGyroMeanY | Average of Mean Value of frequency domain Body Gyro in Y direction |
fBodyGyroMeanZ | Average of Mean Value of frequency domain Body Gyro in Z direction |
fBodyGyroStdX | Average of Standard Deviation of frequency domain Body Gyro in X direction |
fBodyGyroStdY | Average of Standard Deviation of frequency domain Body Gyro in Y direction |
fBodyGyroStdZ | Average of Standard Deviation of frequency domain Body Gyro in Z direction |
fBodyAccMagMean | Average of Mean Value of frequency domain Body Acceleration magnitude |
fBodyAccMagStd | Average of Standard Deviation of frequency domain Body Acceleration magnitude |
fBodyBodyAccJerkMagMean | Average of Mean Value of frequency domain Body Acceleration Jerk magnitude |
fBodyBodyAccJerkMagStd | Average of Standard Deviation of frequency domain Body Acceleration Jerk magnitude |
fBodyBodyGyroMagMean | Average of Mean Value of frequency domain Body Gyro magnitude |
fBodyBodyGyroMagStd | Average of Standard Deviation of frequency domain Body Gyro magnitude |
fBodyBodyGyroJerkMagMean | Average of Mean Value of frequency domain Body Gyro Jerk magnitude |
fBodyBodyGyroJerkMagStd | Average of Standard Deviation of frequency domain Body Gyro Jerk magnitude |