ML_project/Test_PreProcess.Rmd at master · MariekeDirk/ML_project · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
---
title: "Testing the preProcess function"
author: "E Kleingeld"
output: html_document
---


This script is used to test how the preProcess function from the caret package transforms a set of predictors when you apply several different methods

### Read in the predictors

First we read in a set of predictors
```{r}
# Empty environment
rm(list=ls())

# Install packages if needed
#install.packages("gridExtra")

# Read in the predictors
load("C:/Users/Eva/Documents/Stage/ML_project/Predictors_Test_6h.Rda")


```
This is what the data looks like:

```{r echo=FALSE}
head(Predictors_Test)

```

Here we test how many NA values there are in the file.
As the code below shows, there are NA values in the air temperature (TL) and dew point temperature (TD) columns.
**Note: Go back to file where you build Predictors_Test to see which stations contain NA's for TL and TD. Also test WHEN you have these NA values**

```{r}
# For the continuous variables, test how many NA values there are per variable
colSums(is.na(Predictors_Test[, 1:6]))

```


Next, we build a few functions in advance of the analysis. These functions wil plot histogram/point plots of each of the predictors and save each plot in a list.

**IMPORTANT NOTE: The print statement in both functions is currently switched off. If you want to print the histograms/point plots of the distributions of the predictors after each transformation: switch it on. This will mean that this file takes longer to render (but not much longer)**


```{r}
library(ggplot2)

# Custom function to plot the distribution of the predictors as a histogram
Plot_histograms <- function(The_Data, The_Main){

  List_hist <- list()
  Cont_vars <- colnames(The_Data)

  for (i in seq_along(Cont_vars)){
    List_hist[[i]] <- ggplot(data.frame(x=The_Data[ , i]), aes(x))+ geom_histogram()+ xlab(Cont_vars[i]) + ggtitle(The_Main)
    #print(List_hist[[i]])
  }
    return(List_hist)
}

# Custom function to plot the distribution of the predictors as a point plot with y = predictor and x = index number
Plot_Points <- function(The_Data, The_Main){

  List_point <- list()
  Cont_vars <- colnames(The_Data)

  for (i in seq_along(Cont_vars)){
    List_point[[i]] <- ggplot(data.frame(x=seq_along(The_Data[ , i]), y = The_Data[ , i]), aes(x, y))+ geom_point()+ ylab(Cont_vars[i]) + xlab("row index") + ggtitle(The_Main)
    #print(List_point[[i]])
  }
    return(List_point)
}


```


### Analyse data before applying the preProcess function

First, plot the correlation between all of the continuous valriables:
```{r}
library(corrplot)

corTest <- cor(Predictors_Test[, 1:6], use = "complete.obs")
corrplot(corTest, method = "number")

```

Next, plot the raw data

```{r}
Predictors_Unchanged_hist <- Plot_histograms(Predictors_Test[1:6], "Unchanged")
Predictors_Unchanged_point <- Plot_Points(Predictors_Test[1:6], "Unchanged")

````

### Test removal of zero variance and BoxCox

Below we apply the preProcess function so that all zero variance predictors are removed and so that a BoxCox transform is applied. The BoxCox transform should unskew the data. This is necessary because a PCA fails when you apply it on skewed data. The most skewed variable is altitude.

After transforming with BoxCox, altitude is no longer skewed. What is surprising is that TL and TD are not transformed in the same way: TL becomes very small whilst TD becomes very large. ** why does this happen? **

After the BoxCox transform the correlations between the predictors is identical to that before the transform.

```{r}
library(caret)
xTrans <- preProcess(Predictors_Test[, 1:6], method = c("zv", "BoxCox"),
                     na.remove = TRUE)

Predictors_BoxCox <- predict(xTrans, Predictors_Test[ ,1:6])

# Plot each of the BoxCox transformed predictors
Predictors_BoxCox_hist <- Plot_histograms(Predictors_BoxCox, "BoxCox")
Predictors_BoxCox_point <- Plot_Points(Predictors_BoxCox, "BoxCox")


# Test the correlations between the BoxCox transformed predictors
corBoxCox <- cor(Predictors_BoxCox, use = "complete.obs")
corrplot(corBoxCox, method = "number")

```

### Test removal of zero variance and BoxCox + centering and scaling
When we center the data the mean of the predictor's data is subtracted from the predictor values.
When we scale the data the predictor values are divided by the standard deviation.

When you center and scale predictors the predictor values vary over a smaller range. For some ML algorithms, such as neural networks or support vector machines, this is especially useful because they will run faster on scaled/centered data.


After the BoxCox transform and centering/scaling the correlations between the predictors is identical to that before the transform.


```{r}
xTrans <- preProcess(Predictors_Test[, 1:6], method = c("zv", "BoxCox",  "center", "scale"),
                     na.remove = TRUE)

Predictors_centerScale <- predict(xTrans, Predictors_Test[ ,1:6])

# Plot each of the transformed variables
Predictors_CS_hist <- Plot_histograms(Predictors_centerScale, "BC + CS")
Predictors_CS_point <- Plot_Points(Predictors_centerScale, "BC + CS")


# Test the correlations between the BoxCox + centered/scaled transformed predictors
corBoxCox_CS <- cor(Predictors_centerScale, use = "complete.obs")
corrplot(corBoxCox_CS, method = "number")
```

### Test removal of zero variance and BoxCox + centering and scaling + PCA

By default, the pca applied here will retain 95% of the variance
To set another value see:
[link](http://stackoverflow.com/questions/32557524/principal-component-analysis-with-caret)

To examine whether the pca was performed correctly we again plot the predictors after this transformation. However, we also analyse the variable loadings (: how much did the 'old' predictor contribute to the principal component). Furthermore, a scree plot is produced which shows how much of the total variance is explained by the principal components. To build the scree plot I adapted a method from: [link](https://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/)
Important to note!!! By applying the method to the PC's that you get after transformation you can only see the percentage variance explained for the PC's that came out of the preProcessing prediction. In other words: Only the PC's that together explain 95% of the variance are shown. Unfortunately, you cannot extract the rest of the PC's when you use caret to perform a PCA.

We also plot a scatterplot matrix of each of the PC's. This takes a while to plot (approximatly 1 minute). You can see that there are outliers in the PC's. Outliers can have a negative impact on the reliability of your calculated PC's [link](http://www.math.umn.edu/~lerman/Meetings/SIAM2012_Sujay.pdf) **Note: Do we need to solve this? (Probably yes, but...)**

```{r}

xTrans <- preProcess(Predictors_Test[, 1:6], method = c("zv", "BoxCox",  "center", "scale", "pca"),
                     na.remove = TRUE)

Predictors_PCA <- predict(xTrans, Predictors_Test[ ,1:6])

# Basic information on the transformation:
print(xTrans)

# The rotation column stores the variable loadings
# Each principal component after the PCA is a linear combination of the original predictors.
# The coefficient for each predictor is called loading.
# A variable loading close to 0 indicates that a predictor did not contribute much to the principal component
xTrans$rotation

# Scree plot of the explained variance

# Get the standard deviation per PC from the transformation object
PC_stdev <- apply(Predictors_PCA, 2, sd, na.rm = TRUE)

# Get the variance per PC by taking the square of the standard deviation per PC
PC_var <- PC_stdev^2

# Calculate proportion of total variance explained per PC by dividing PC_var by the total variance explained
PC_prop_var <- PC_var/(sum(PC_var))

# scree plot
plot(y = PC_prop_var, x =1:5, type = "b", xlab = "Principal Components", ylab = "Proportion of variance explained")


# Plot the PC's versus each other in a scatterplot matrix
pairs(Predictors_PCA)


# Plot each of the transformed variables
Predictors_PCA_hist <- Plot_histograms(Predictors_PCA, "BC +CS +PCA")
Predictors_PCA_point <- Plot_Points(Predictors_PCA, "BC +CS +PCA")

# Test the correlations between the BoxCox + centered/scaled + PCA transformed predictors
corBoxCox_CS_PCA <- cor(Predictors_PCA, use = "complete.obs")
corrplot(corBoxCox_CS_PCA, method = "number")


```

### Side-by-side comparison

Next, we want to plot the unchanged data and all of the subsequent transformations side-by-side so that you can see the effect of each of the transformations.

Before transformation, after BoxCox and after centering/scaling we have 6 predictors. After PCA we have 5 principal components.This means that in the last step (:nr 6) the principal component (PC) 5 is plotted.

```{r warning=FALSE, message=FALSE}
library(gridExtra)

for (i in seq_along(Predictors_Unchanged_point)) {
    print(i)
    Plot_1_point <- Predictors_Unchanged_point[[i]]
    Plot_2_point <- Predictors_BoxCox_point[[i]]
    Plot_3_point <- Predictors_CS_point[[i]]
    if(i < 6){Plot_4_point <- Predictors_PCA_point[[i]]}
    grid.arrange(Plot_1_point, Plot_2_point, Plot_3_point, Plot_4_point, ncol = 2)

    Plot_1_hist <- Predictors_Unchanged_hist[[i]]
    Plot_2_hist <- Predictors_BoxCox_hist[[i]]
    Plot_3_hist <- Predictors_CS_hist[[i]]
    if(i < 6){Plot_4_hist <- Predictors_PCA_hist[[i]]}
    grid.arrange(Plot_1_hist, Plot_2_hist, Plot_3_hist, Plot_4_hist, ncol = 2)

}
# Important to note:
# The geom_point and stat_bin remove 767 rows due to missing values
# Bin size is automatically set to 30

# Test missing values in each of the transformed predictor sets:
colSums(is.na(Predictors_BoxCox))
colSums(is.na(Predictors_centerScale))
colSums(is.na(Predictors_PCA))

```

```{r}
# Test grid arrange:

# Plot_1 <-  Predictors_PCA_point[[1]]
# Plot_2 <-  Predictors_PCA_point[[2]]
# Plot_3 <-  Predictors_PCA_point[[3]]
# Plot_4 <-  Predictors_PCA_point[[4]]
#
# grid.arrange(Plot_1, Plot_2, Plot_3, Plot_4, ncol = 4)

```