Basic statistics - Regression analysis

1. Objectives

In this last part, we want to learn how to build a regression model in order to predict a quantitative variable from other quantitative variables. The important steps are:

  • learning the model
  • testing the model to evaluate its performance and to check that the assumptions of linearity hold
  • predicting values for new data points

We will again use the diabetes data set and build (1) simple linear regression models with a single explanatory variable, and (2) multiple regression models in which we use several variables. We will focus on predicting the cholesterol level.

2. Load the data and first analysis

We will limit the dataset to the numerical variables.
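
The code chunks are hidden in this report; a minimal sketch of this step could look as follows (the file name diabetes.csv and the loading step are assumptions, the column selection matches the table below):

# load the diabetes dataset (file name / path assumed)
diab <- read.csv("diabetes.csv")

# keep only the numerical variables of interest
num.vars <- c("chol", "stab.glu", "hdl", "glyhb", "age", "height",
              "weight", "bp.1s", "bp.1d", "waist", "hip")
dat <- diab[, num.vars]
head(dat)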

  chol stab.glu hdl glyhb age height weight bp.1s bp.1d waist hip
1  203       82  56  4.31  46     62    121   118    59    29  38
2  165       97  24  4.44  29     64    218   112    68    46  48
3  228       92  37  4.64  58     61    256   190    92    49  57
4   78       93  12  4.63  67     67    119   110    50    33  38
5  249       90  28  7.72  64     68    183   138    80    44  41
6  248       94  69  4.81  34     71    190   132    86    36  42

We can clean the data by removing the observations with missing (NA) values.
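
A possible sketch of this step:

# remove all patients with at least one missing value
dat <- na.omit(dat)
dim(dat)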

[1] 377  11

Let us first repeat the analysis we did in the first week to identify the correlation structure of the dataset. This will be of importance when we include multiple variables into a regression model, as we have seen that correlation among the explanatory variables can be detrimental.
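
A sketch of this correlation analysis (the use of the pheatmap package for the heatmap is an assumption; any heatmap function will do):

# pairwise Pearson correlations between all numerical variables
cor.mat <- cor(dat)
round(cor.mat, 2)

# visualize the correlation structure as a heatmap
library(pheatmap)
pheatmap(cor.mat)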

What are the strongest correlations? Do they make sense?

Apart from displaying the raw correlation values, we can test whether any of these are statistically significant, i.e. whether they are significantly positive (one-sided test), negative (one-sided test), or non-zero (two-sided test).

For example, there seems to be a positive correlation between stabilized glucose (stab.glu) and hip circumference.
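
The correlation value and the corresponding test (the call matches the output shown below):

# raw correlation between stabilized glucose and hip circumference
cor(dat$stab.glu, dat$hip)

# two-sided test: is the correlation different from zero?
cor.test(dat$stab.glu, dat$hip)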

[1] 0.1339642

    Pearson's product-moment correlation

data:  dat$stab.glu and dat$hip
t = 2.6178, df = 375, p-value = 0.009208
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.03341458 0.23182918
sample estimates:
      cor 
0.1339642 

Read the report carefully and make sure you understand this output. Check other pairs! Can you determine exactly which correlation level leads to a p-value lower than 0.05 with n=377 patients? (Check the formula in the slides…)

3. Univariate linear regression

What is the most promising variable to predict cholesterol level?

We will use glycosylated hemoglobin (glyhb) as a predictor of the cholesterol level:
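
A sketch of the corresponding call (the object name fit1 is an arbitrary choice; the formula matches the Call shown below):

# simple linear regression: cholesterol as a function of glycosylated hemoglobin
fit1 <- lm(chol ~ glyhb, data = dat)
summary(fit1)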


Call:
lm(formula = chol ~ glyhb, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-124.467  -28.089   -4.574   23.899  187.647 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  177.171      6.031   29.38  < 2e-16 ***
glyhb          5.463      1.002    5.45 9.15e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 43.03 on 375 degrees of freedom
Multiple R-squared:  0.07339,   Adjusted R-squared:  0.07092 
F-statistic:  29.7 on 1 and 375 DF,  p-value: 9.15e-08

For a simple linear regression (with only one explanatory variable), the p-value of the t-test for the slope is identical to the p-value of the F-test for the global model. This will no longer be the case when including several explanatory variables!

By the way, have we checked that a linear regression makes sense in this case? Remember that we have to check that (both checks are sketched below):

  • the residuals are normally distributed
  • there is no correlation between the residuals and the explanatory variable
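
A sketch of both checks, using the univariate model fit1 from above (interpreting the single p-value printed below as the result of a Shapiro-Wilk normality test on the residuals is an assumption):

# check the distribution of the residuals
hist(residuals(fit1), breaks = 30)
qqnorm(residuals(fit1)); qqline(residuals(fit1))

# formal normality test of the residuals
shapiro.test(residuals(fit1))$p.value

# check that the residuals are not correlated with the explanatory variable
plot(dat$glyhb, residuals(fit1))
cor(dat$glyhb, residuals(fit1))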

[1] 4.90302e-18

What is your overall opinion about the validity of the regression model here?

We can now use the model to predict the cholesterol values and compare them to the real cholesterol values, since this information is available in the dataset.
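
For example (a sketch, plotting the predictions of fit1 against the observed values):

# predicted cholesterol values from the univariate model
pred1 <- predict(fit1)

# compare predicted and observed values
plot(dat$chol, pred1,
     xlab = "observed cholesterol", ylab = "predicted cholesterol")
abline(0, 1, col = "red")   # perfect predictions would lie on this line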

Not really super convincing, right? Let’s put more information into the model!

4. Multiple regression model

Let us include all the information to try to predict the cholesterol level: that is, we use all available variables!
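
The model with all explanatory variables can be fitted with the '.' shortcut in the formula (the object name fit.all is an arbitrary choice; the call matches the output below):

# regression of cholesterol on all other variables of the data frame
fit.all <- lm(chol ~ ., data = dat)
summary(fit.all)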


Call:
lm(formula = chol ~ ., data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-91.063 -27.966  -6.429  25.625 196.987 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 157.46827   59.38866   2.651 0.008362 ** 
stab.glu     -0.04609    0.05905  -0.781 0.435565    
hdl           0.59386    0.12971   4.578 6.43e-06 ***
glyhb         5.50595    1.45980   3.772 0.000189 ***
age           0.43562    0.16380   2.659 0.008170 ** 
height       -1.05791    0.69719  -1.517 0.130029    
weight        0.14380    0.14132   1.018 0.309560    
bp.1s        -0.02447    0.13605  -0.180 0.857376    
bp.1d         0.42121    0.20952   2.010 0.045131 *  
waist         0.40867    0.85288   0.479 0.632111    
hip          -0.67331    0.88311  -0.762 0.446293    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40.91 on 366 degrees of freedom
Multiple R-squared:  0.1825,    Adjusted R-squared:  0.1602 
F-statistic: 8.171 on 10 and 366 DF,  p-value: 5.831e-12

Do you note something unexpected in this report?

We can see that the inclusion of several explanatory variables improves the regression. Check the \(R^2\) values for example!

We can see that the variables weight, waist and hip do not reach the significance level. Two explanations are possible:

  1. either these variables are indeed non-informative regarding the prediction of the cholesterol level
  2. or the mutual correlation between these 3 variables interferes with the model.

We can remove waist and hip, for example, and then redo the regression:
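
Sketch (object name fit.red assumed; the formula matches the Call below):

# same model without the waist and hip variables
fit.red <- lm(chol ~ stab.glu + hdl + glyhb + age + height + weight +
                bp.1s + bp.1d, data = dat)
summary(fit.red)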


Call:
lm(formula = chol ~ stab.glu + hdl + glyhb + age + height + weight + 
    bp.1s + bp.1d, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-90.992 -28.772  -6.665  25.338 196.300 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 134.14448   39.83769   3.367 0.000839 ***
stab.glu     -0.04492    0.05889  -0.763 0.446088    
hdl           0.58714    0.12900   4.551 7.25e-06 ***
glyhb         5.52089    1.45314   3.799 0.000170 ***
age           0.45938    0.15629   2.939 0.003498 ** 
height       -0.82648    0.56024  -1.475 0.141006    
weight        0.10953    0.05794   1.891 0.059471 .  
bp.1s        -0.03132    0.13491  -0.232 0.816579    
bp.1d         0.42848    0.20878   2.052 0.040849 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40.84 on 368 degrees of freedom
Multiple R-squared:  0.1811,    Adjusted R-squared:  0.1633 
F-statistic: 10.17 on 8 and 368 DF,  p-value: 7.475e-13

We can see that the weight variable does indeed seem to contribute to the prediction of the cholesterol level; removing the strongly correlated variables has increased its significance, and it now almost reaches significance at the 5% level!

Check the result of the F-test to compare both models!
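
A sketch of this comparison with anova(), assuming the two models are stored as fit.red and fit.all:

# F-test: does adding waist and hip significantly improve the fit?
anova(fit.red, fit.all)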

We can now check whether the predictions are better than with the univariate model.
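
For example, by again plotting predicted against observed values (sketch):

# predictions of the multiple regression model
pred.red <- predict(fit.red)
plot(dat$chol, pred.red,
     xlab = "observed cholesterol", ylab = "predicted cholesterol")
abline(0, 1, col = "red")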

Better? I would say so… To quantify the accuracy, we can compute the so-called root mean squared error (RMSE):

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^n (x_i-\hat{x}_i)^2} \]
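
A small helper function and its value on the full dataset (the printed value below is consistent with the predictions of the reduced model fit.red):

# root mean squared error between observed and predicted values
rmse <- function(x, x.hat) sqrt(mean((x - x.hat)^2))

rmse(dat$chol, predict(fit.red))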

[1] 40.34489

Of course, this is cheating, right? We are predicting the values on exactly the same data we used to learn the model. In real machine learning, we need to perform cross-validation, i.e. learn the model on one part of the data (training set) and validate it on another set (test set).

Let us randomly split the dataset into a training set and a test set.
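
A possible split (the seed and the exact split are assumptions; the 191 residual degrees of freedom in the summary below suggest a training set of 200 patients):

set.seed(123)                          # for reproducibility (seed assumed)
idx.train <- sample(nrow(dat), 200)    # 200 patients for training
dat.train <- dat[idx.train, ]
dat.test  <- dat[-idx.train, ]         # remaining 177 patients for testing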

We now learn a new model on the training dataset:
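
Sketch (same formula as before, fitted on the training set only; object name fit.train assumed):

fit.train <- lm(chol ~ stab.glu + hdl + glyhb + age + height + weight +
                  bp.1s + bp.1d, data = dat.train)
summary(fit.train)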


Call:
lm(formula = chol ~ stab.glu + hdl + glyhb + age + height + weight + 
    bp.1s + bp.1d, data = dat.train)

Residuals:
    Min      1Q  Median      3Q     Max 
-94.393 -27.084  -7.834  23.873 194.699 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) 142.50126   57.84878   2.463  0.01465 * 
stab.glu     -0.08642    0.09194  -0.940  0.34843   
hdl           0.60728    0.21907   2.772  0.00612 **
glyhb         6.13502    2.24097   2.738  0.00677 **
age           0.64350    0.23248   2.768  0.00620 **
height       -0.72302    0.77128  -0.937  0.34972   
weight        0.04252    0.08098   0.525  0.60016   
bp.1s        -0.25648    0.18783  -1.366  0.17370   
bp.1d         0.69748    0.32414   2.152  0.03267 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 42.67 on 191 degrees of freedom
Multiple R-squared:  0.1553,    Adjusted R-squared:  0.1199 
F-statistic:  4.39 on 8 and 191 DF,  p-value: 6.872e-05

We can compute the RMSE for the training dataset:
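
Sketch, using the rmse() helper defined above:

# RMSE on the data the model was trained on
rmse(dat.train$chol, predict(fit.train))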

[1] 41.69499

Let us use that model to predict cholesterol values for the left-out test set

and compute the RMSE:
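
Sketch:

# predict cholesterol for the held-out patients
pred.test <- predict(fit.train, newdata = dat.test)

# RMSE on the independent test set
rmse(dat.test$chol, pred.test)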

[1] 39.74716

In general, the RMSE is higher on the test dataset, since the test data were not used to build the regression model (although in this particular split it happens to be slightly lower). It is nevertheless a more realistic estimate of the validity of the model, as it indicates how well the model extends to novel, independent data!

An important topic here is feature selection, i.e. finding the optimal and minimal set of explanatory variables that allows a good prediction of the output variable.

Can you imagine a strategy to reduce the set of explanatory variables to an optimal subset?

5. Using PCA to determine independent variables

To solve the problem of correlated variables, we can compute principal components on all explanatory variables.
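
A sketch of the PCA on the 10 explanatory variables (the standard deviations below are on the original scale, so the variables were apparently not scaled):

# PCA on all explanatory variables (everything except chol, the first column)
pca <- prcomp(dat[, -1])
summary(pca)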

Importance of components:
                           PC1     PC2     PC3      PC4      PC5     PC6
Standard deviation     55.5885 40.0626 25.6268 16.31710 14.39464 8.76905
Proportion of Variance  0.5206  0.2704  0.1106  0.04486  0.03491 0.01296
Cumulative Proportion   0.5206  0.7911  0.9017  0.94655  0.98146 0.99442
                           PC7     PC8     PC9    PC10
Standard deviation     4.54647 2.50033 2.02915 1.44278
Proportion of Variance 0.00348 0.00105 0.00069 0.00035
Cumulative Proportion  0.99790 0.99896 0.99965 1.00000

We can display the definition of the principal components in terms of the original variables:
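
The loadings are stored in pca$rotation; for example, for the first principal component (sketch):

# contribution of each original variable to PC1
barplot(pca$rotation[, 1], las = 2, main = "PC1 loadings")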

Now perform the multiple linear regression using the PCs as explanatory variables rather than the original variables.

Check that the PCs are not correlated to each other any more by reproducing the correlation analysis we did at the beginning and display the correlation heatmap.

Let us now run the linear regression with all PCs:
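
Sketch (the call matches the output below; object name fit.pca assumed):

# regression of cholesterol on all principal components
fit.pca <- lm(dat$chol ~ pca$x)
summary(fit.pca)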


Call:
lm(formula = dat$chol ~ pca$x)

Residuals:
    Min      1Q  Median      3Q     Max 
-91.063 -27.966  -6.429  25.625 196.987 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 207.74005    2.10700  98.595  < 2e-16 ***
pca$xPC1      0.14808    0.03795   3.902 0.000114 ***
pca$xPC2     -0.02815    0.05266  -0.535 0.593235    
pca$xPC3     -0.38695    0.08233  -4.700 3.69e-06 ***
pca$xPC4     -0.51130    0.12930  -3.954 9.21e-05 ***
pca$xPC5      0.39542    0.14657   2.698 0.007303 ** 
pca$xPC6     -0.48629    0.24060  -2.021 0.043988 *  
pca$xPC7     -0.59378    0.46405  -1.280 0.201510    
pca$xPC8      0.07809    0.84381   0.093 0.926315    
pca$xPC9      1.22151    1.03975   1.175 0.240834    
pca$xPC10     5.48824    1.46232   3.753 0.000203 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40.91 on 366 degrees of freedom
Multiple R-squared:  0.1825,    Adjusted R-squared:  0.1602 
F-statistic: 8.171 on 10 and 366 DF,  p-value: 5.831e-12

Produce the barplots to display the definition of the significant principal components in this analysis

Reproduce this analysis for the variable glyhb

Ashwini Kumar Sharma, Carl Herrmann

2019-12-20