This report presents the process and results of building a model for human activity recognition (hereafter HAR).
It is the capstone project of the HarvardX Data Science course pursued by the author.
The main goal of this project is to use machine learning techniques to predict human activity from data extracted from on-body sensors. We will compare our results to those obtained by the original contributors [1].
We will also examine whether all 4 sensors are really necessary or whether fewer sensors suffice to predict the activity.
The dataset used in this project is the HAR Dataset for benchmarking [1].
The dataset includes measurements of inertial sensors attached to several people while doing normal activities during the day. It also includes data related to the person such as weight, height, etc.
A more detailed description of the dataset can be found in [1].
In this section we will prepare the data to work with and explore some important characteristics of the dataset.
The created har dataset has the following structure:
Name | Type | Description |
---|---|---|
user | Factor | user name, w/ 4 levels "debora", "jose_carlos", … |
gender | Factor | w/ 2 levels "Man", "Woman" |
age | int | user age in years |
how_tall_in_meters | num | height in meters |
weight | int | weight in kg |
body_mass_index | num | body mass index (kg/m²) |
x1 | int | sensor 1 axis x in m/s2 |
y1 | int | sensor 1 axis y in m/s2 |
z1 | int | sensor 1 axis z in m/s2 |
x2 | int | sensor 2 axis x in m/s2 |
y2 | int | sensor 2 axis y in m/s2 |
z2 | int | sensor 2 axis z in m/s2 |
x3 | int | sensor 3 axis x in m/s2 |
y3 | int | sensor 3 axis y in m/s2 |
z3 | int | sensor 3 axis z in m/s2 |
x4 | int | sensor 4 axis x in m/s2 |
y4 | int | sensor 4 axis y in m/s2 |
z4 | int | sensor 4 axis z in m/s2 |
class | Factor | activity type, w/ 5 levels |
See the different activities listed in the dataset:
levels(har$class)
## [1] "sitting" "sittingdown" "standing" "standingup" "walking"
These classes are associated with the values obtained from the sensors (1 to 4), expressed in m/s2; x, y, z are the values obtained for each axis.
In order to understand how the values were taken, we will view the sampled data over time for the sensor 1 values (x, y, z). This is the sensor located at the waist of the subjects. The data was collected during 8 hours, during which the monitored subject performed the different activities.
Note the time unit is not specified in the dataset; we have included a non-scaled time index to create the chart.
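A minimal sketch of how such a chart can be built; the row-order time index and the plotting choices are assumptions, not the report's exact code:

library(dplyr)
library(tidyr)
library(ggplot2)

har %>%
  mutate(t = row_number()) %>%                          # non-scaled time index
  select(t, x1, y1, z1) %>%
  pivot_longer(-t, names_to = "axis", values_to = "value") %>%
  ggplot(aes(t, value, color = axis)) +
  geom_line(alpha = 0.5) +
  labs(x = "time (non-scaled index)", y = "acceleration (m/s2)")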
Some additional checks have been done to verify the consistency of the data.
# Are there any NA or infinite values in any column?
nas <- apply(har, MARGIN = 2, function(x) any(is.na(x) | is.infinite(x)))
if (!any(nas)) {
  cat("Great, no NAs or Infinite value in the dataset")
} else {
  cat("Attention, some NA or Infinite value was found in the dataset")
}
## Great, no NAs or Infinite value in the dataset
# proportion of classes
har %>% group_by(class) %>% summarize(n = n(), p = n()/nrow(.)) %>% arrange(desc(p))
# check proportion of classes and users
har %>% group_by(class, user) %>% summarize(n = n(), p = n()/nrow(.)) %>% arrange(desc(p))
Now we are sure there are no NA values in our dataset, and that all users (4) have been measured in all the different activities (5). Note the prevalence of some user/activity combinations; this can also be observed in the time-based chart above.
We have split the original har dataset into several sets for training, tuning and validation.
Dataset | Observations | Proportions | Description |
---|---|---|---|
har | 165,633 | 1 | Original |
har_val | 16,565 | 1/10 of har | Set for final validation |
har_set | 149,068 | 9/10 of har | Set for model analysis |
har_set_train | 134,158 | 9/10 of har_set | Set for model training |
har_set_test | 14,910 | 1/10 of har_set | Set for model testing while optimizing |
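A sketch of how splits like these can be created with caret's createDataPartition; the seed and the exact call are assumptions:

library(caret)
set.seed(1)  # assumed; the report may use a different seed
val_idx <- createDataPartition(har$class, times = 1, p = 0.1, list = FALSE)
har_val <- har[val_idx, ]    # 1/10 of har, held out for final validation
har_set <- har[-val_idx, ]   # 9/10 of har, used for model analysis

test_idx <- createDataPartition(har_set$class, times = 1, p = 0.1, list = FALSE)
har_set_test  <- har_set[test_idx, ]   # 1/10 of har_set, for testing while optimizing
har_set_train <- har_set[-test_idx, ]  # 9/10 of har_set, for training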
Since the outcome of the dataset is a multi-categorical variable (class: sitting, sittingdown, standing, standingup, walking) with 5 possible values, we have focused on classification algorithms that support such variables. The algorithms chosen are decision trees and their derivative, random forests.
We will only focus on features provided by the sensors (x1..z4).
The results are measured by the accuracy when classifying the sensor observations.
As a final result we will provide the confusion matrix so it can be compared to the results in [1].
Using functions from the caret package, we have created the decision tree model.
fit_part <- train(class ~ x1 + y1 + z1 + x2 + y2 + z2 + x3 + y3 + z3 + x4 + y4 + z4,
                  method = "rpart",
                  tuneGrid = data.frame(cp = seq(0.0, 0.1, len = 10)),
                  data = har_set_train)
plot(fit_part)
Since the tree created for maximum accuracy is quite complex and dense (690 splits), we omit it here. Below we show a simpler tree as an example.
When predicting on har_set_test we obtain the following accuracy:
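A minimal sketch of how this accuracy can be computed, reusing fit_part and har_set_test from above:

# predict on the held-out test split and extract the overall accuracy
pred_part <- predict(fit_part, newdata = har_set_test)
confusionMatrix(pred_part, har_set_test$class)$overall["Accuracy"]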
In order to view a real example of a tree, we will create a model with few branches by pruning the original tree, as sketched below.
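One way to do this with rpart's prune(); the cp = 0.01 value is taken from the results table at the end of the report:

library(rpart)
# prune the fitted tree back to a readable size (10 splits at cp = 0.01)
fit_pruned <- prune(fit_part$finalModel, cp = 0.01)
plot(fit_pruned, margin = 0.1)
text(fit_pruned, cex = 0.7)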
Note that accuracy is lower with a higher complexity parameter (CP), as observed in the chart above. A higher CP means that fewer branches compose the tree.
A more optimized algorithm based on classification trees is the random forest. This algorithm builds many trees from randomly selected features, and the outcome is decided by majority vote across all trees.
set.seed(14)
mtry <- seq(1, 12, 1)  # number of variables randomly tried at each split
# tuneGrid is a caret argument (randomForest itself would ignore it),
# so we tune mtry through caret::train with method = "rf"
fit_forest <- train(class ~ x1 + y1 + z1 + x2 + y2 + z2 + x3 + y3 + z3 + x4 + y4 + z4,
                    method = "rf",
                    data = har_set_train,
                    ntree = 100,
                    do.trace = 10,
                    tuneGrid = data.frame(mtry = mtry))
plot(fit_forest)
We choose 50 as the number of trees to grow and 3 as the number of features randomly selected at each split (the parameter called mtry).
fit_forest$finalModel$mtry
## [1] 3
Some train control is needed to speed up the model training. As we can see, the accuracy improves by about 2% with respect to the single-tree algorithm.
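The report does not show the exact control used; one hypothetical way to cut the run time is to replace caret's default bootstrap resampling with a small cross-validation:

# assumed settings: 5-fold cross-validation instead of the default bootstrap
ctrl <- trainControl(method = "cv", number = 5)
# then pass trControl = ctrl in the train() call above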
Let's analyze how important each feature is in the model.
varImp(fit_forest)
## rf variable importance
##
## Overall
## z1 100.00
## z2 89.65
## y2 84.98
## y3 72.84
## x4 47.95
## y1 44.35
## z4 43.09
## z3 29.86
## x2 28.78
## x3 13.41
## y4 6.83
## x1 0.00
According to the table above, the importance of the sensors is ordered as follows (from most to least): S1 (waist), S3 (right ankle), S2 (left thigh), S4 (right upper arm).
Let's observe how the model would perform if we select only the most important sensors, as sketched below.
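A hypothetical refit dropping sensor 4 (the least important), with ntree and mtry fixed to the values chosen above:

# refit the random forest without x4, y4, z4 and check test accuracy
fit_forest3 <- train(class ~ x1 + y1 + z1 + x2 + y2 + z2 + x3 + y3 + z3,
                     method = "rf",
                     data = har_set_train,
                     ntree = 50,
                     tuneGrid = data.frame(mtry = 3))
confusionMatrix(predict(fit_forest3, har_set_test),
                har_set_test$class)$overall["Accuracy"]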
The result with 2 sensors is too poor, but if we use 3 sensors the results get close to the values obtained with all 4 sensors.
Find below the confusion matrix obtained from the prediction on the har_val set using the random forest algorithm over all 4 sensor values. Note this set is composed of 16,565 observations.
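A sketch of how this matrix can be produced, assuming the fitted fit_forest and the har_val split from above:

# predict on the validation set and extract the confusion table
pred_val <- predict(fit_forest, newdata = har_val)
confusionMatrix(pred_val, har_val$class)$table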
## Reference
## Prediction sitting sittingdown standing standingup walking
## sitting 5059 1 0 0 0
## sittingdown 2 1165 0 9 4
## standing 0 2 4720 10 6
## standingup 3 6 3 1205 4
## walking 0 9 14 18 4325
We can observe that accuracy is lower for the classes with fewer observations.
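caret's confusionMatrix also exposes per-class statistics that make this explicit; a minimal sketch reusing pred_val from the snippet above:

# sensitivity (recall) per activity class
cm <- confusionMatrix(pred_val, har_val$class)
round(cm$byClass[, "Sensitivity"], 4)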
The accuracy results obtained on the validation set (har_val) across the different algorithms are shown below:
METHOD | TUNING | ACCURACY |
---|---|---|
1 Classification tree (rpart) | CP = 0 (690 splits) | 0.97609 |
2 Classification tree (rpart) | CP = 0.01 (10 splits) | 0.84618 |
3 Classification tree (Random Forest) | Trees = 50, mtry = 3 | 0.99451 |
Note that all 4 sensors are used for these predictions.
Regarding the importance of the sensors, we have observed that sensor 1 (placed on the waist) is the most important, while sensor 4 (placed on the right upper arm) provides the least important information. By omitting this sensor we could still obtain an accuracy close to 0.991.
Future work: the features related to the user are not included in the described models, and the user effect remains to be analyzed. The test bench used data extracted from the same subjects; what would happen if we tested our model on values from different users, or from different sensors?
Read more: http://groupware.les.inf.puc-rio.br/har#sbia_paper_section