Exploratory Data Analysis

1. Recap of Part I

A. You learnt about different types of data like –

  • Continuous
  • Discrete
  • Nominal
  • Ordinal, etc.

B. You learnt different ways in which data can be plotted –

  • Histograms
  • Density plots
  • Box plots
  • Heatmaps, etc.

C. You learnt about how the distribution/spread of data can be measured –

  • Mean
  • Median
  • Quantiles

D. Finally, you learnt how the dependence or independence between two variables can be measured and plotted –

  • Correlation
  • Covariance
  • Scatter plots

2. Objectives

In this practical session you will learn how to implement the ideas taught in the first lesson using the R programming language. Throughout this course we will use a common diabetes dataset to practically implement the concepts taught during the lectures.

Diabetes is a collection of metabolic disorders in which blood glucose levels rise drastically due to defective insulin secretion and/or insulin resistance. There are many risk factors associated with diabetes, such as obesity and old age. Diabetes in turn greatly increases the risk of cardiovascular disease. For more information see here. Our dataset has various pathophysiological measurements of >400 individuals. Clinical parameters like blood glucose levels, cholesterol levels, age, body size, weight and blood pressure have been measured. We will use this dataset to explore the statistical properties of each variable (like its distribution, mean value etc.) and to investigate relations between variables (e.g., is blood glucose level correlated with obesity?).

3. Let's get started - Load the data

The diabetes dataset, which we will be using throughout this practical, will be downloaded from an online repository. We will load it into R and have a sneak peek at what it looks like.

dat = read.delim("https://tinyurl.com/y4fark9g", stringsAsFactors = FALSE)
head(dat, 10)  # look at the first 10 lines of the table
    id chol stab.glu hdl ratio glyhb   location age gender height weight
1 1000  203       82  56   3.6  4.31 Buckingham  46 female     62    121
2 1001  165       97  24   6.9  4.44 Buckingham  29 female     64    218
3 1002  228       92  37   6.2  4.64 Buckingham  58 female     61    256
4 1003   78       93  12   6.5  4.63 Buckingham  67   male     67    119
5 1005  249       90  28   8.9  7.72 Buckingham  64   male     68    183
6 1008  248       94  69   3.6  4.81 Buckingham  34   male     71    190
   frame bp.1s bp.1d bp.2s bp.2d waist hip time.ppn
1 medium   118    59    NA    NA    29  38      720
2  large   112    68    NA    NA    46  48      360
3  large   190    92   185    92    49  57      180
4  large   110    50    NA    NA    33  38      480
5 medium   138    80    NA    NA    44  41      300
6  large   132    86    NA    NA    36  42      195
 [ reached 'max' / getOption("max.print") -- omitted 4 rows ]
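Besides head(), base R's str() prints a compact per-column overview (column name, storage type and the first few values). The sketch below runs it on a small made-up data frame mirroring two columns of the diabetes data; with the real data loaded, str(dat) works the same way:

```r
# Toy data frame mimicking two columns of the diabetes data (values made up)
toy <- data.frame(id = c(1000, 1001), chol = c(203, 165),
                  gender = c("female", "female"),
                  stringsAsFactors = FALSE)
str(toy)  # one line per column: name, storage type, first values
```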

Let's explore and understand our dataset a bit more

1. What is the dimension of our dataset (i.e. how many rows and columns are there in our data)?

# Dimension
dim(dat)
[1] 403  19
# Number of columns
ncol(dat)

# Number of rows
nrow(dat)

2. What are the column names of our dataset?

colnames(dat)  #similarly rownames for rows
 [1] "id"       "chol"     "stab.glu" "hdl"      "ratio"    "glyhb"   
 [7] "location" "age"      "gender"   "height"   "weight"   "frame"   
[13] "bp.1s"    "bp.1d"    "bp.2s"    "bp.2d"    "waist"    "hip"     
[19] "time.ppn"

You are probably wondering what these column names mean. For a description of these variables, look here

3. How do we extract the minimum and maximum age of the patients in our dataset?

min(dat$age)
max(dat$age)
range(dat$age)
# Try to find out the same for height and weight

4. What does the overall summary of our entire dataset look like?

summary(dat)  #Can you explain what you see?
       id             chol          stab.glu          hdl        
 Min.   : 1000   Min.   : 78.0   Min.   : 48.0   Min.   : 12.00  
 1st Qu.: 4792   1st Qu.:179.0   1st Qu.: 81.0   1st Qu.: 38.00  
 Median :15766   Median :204.0   Median : 89.0   Median : 46.00  
 Mean   :15978   Mean   :207.8   Mean   :106.7   Mean   : 50.45  
 3rd Qu.:20336   3rd Qu.:230.0   3rd Qu.:106.0   3rd Qu.: 59.00  
 Max.   :41756   Max.   :443.0   Max.   :385.0   Max.   :120.00  
     ratio            glyhb         location              age       
 Min.   : 1.500   Min.   : 2.68   Length:403         Min.   :19.00  
 1st Qu.: 3.200   1st Qu.: 4.38   Class :character   1st Qu.:34.00  
 Median : 4.200   Median : 4.84   Mode  :character   Median :45.00  
 Mean   : 4.522   Mean   : 5.59                      Mean   :46.85  
 3rd Qu.: 5.400   3rd Qu.: 5.60                      3rd Qu.:60.00  
 Max.   :19.300   Max.   :16.11                      Max.   :92.00  
    gender              height          weight         frame          
 Length:403         Min.   :52.00   Min.   : 99.0   Length:403        
 Class :character   1st Qu.:63.00   1st Qu.:151.0   Class :character  
 Mode  :character   Median :66.00   Median :172.5   Mode  :character  
                    Mean   :66.02   Mean   :177.6                     
                    3rd Qu.:69.00   3rd Qu.:200.0                     
                    Max.   :76.00   Max.   :325.0                     
     bp.1s           bp.1d            bp.2s           bp.2d       
 Min.   : 90.0   Min.   : 48.00   Min.   :110.0   Min.   : 60.00  
 1st Qu.:121.2   1st Qu.: 75.00   1st Qu.:138.0   1st Qu.: 84.00  
 Median :136.0   Median : 82.00   Median :149.0   Median : 92.00  
 Mean   :136.9   Mean   : 83.32   Mean   :152.4   Mean   : 92.52  
 3rd Qu.:146.8   3rd Qu.: 90.00   3rd Qu.:161.0   3rd Qu.:100.00  
 Max.   :250.0   Max.   :124.00   Max.   :238.0   Max.   :124.00  
     waist           hip           time.ppn     
 Min.   :26.0   Min.   :30.00   Min.   :   5.0  
 1st Qu.:33.0   1st Qu.:39.00   1st Qu.:  90.0  
 Median :37.0   Median :42.00   Median : 240.0  
 Mean   :37.9   Mean   :43.04   Mean   : 341.2  
 3rd Qu.:41.0   3rd Qu.:46.00   3rd Qu.: 517.5  
 Max.   :56.0   Max.   :64.00   Max.   :1560.0  
 [ reached getOption("max.print") -- omitted 1 row ]

4. Data cleanup

Very often, the first thing one needs to do in any data science project is to clean up the raw data and transform it into a format that is readily understood and easy to use for all downstream analyses. This process usually involves –

  • Removing empty value rows/columns
  • Removing unused or unnecessary rows/columns
  • Reordering the data matrix
  • Keeping columns uniformly numeric (age, weight etc) or string (names, places etc) or logical (TRUE/FALSE, 1/0)
  • Handling strange, data-specific caveats, like removing stray characters such as commas, periods or semicolons from numbers
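To illustrate the last point: numbers exported from spreadsheets often arrive as character strings with thousands separators. A minimal sketch of fixing that with gsub (the vector below is made up, not part of the diabetes data):

```r
raw <- c("1,250", "980", "2,075")                      # hypothetical raw values
clean <- as.numeric(gsub(",", "", raw, fixed = TRUE))  # strip commas, then convert
clean  # 1250  980 2075
```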

Let's do some cleanup of our own diabetes data –

  1. We will make the id column the row names for the dataset.
  2. We will remove the bp.2s and bp.2d columns as they contain mostly missing values (see summary above)
  3. We will also remove the column time.ppn which will not be required in our analysis
  4. We will reorder the columns of the data such that all the qualitative and quantitative values are separated. Among the quantitative values we will keep related variables together
rownames(dat) = dat$id
dat = dat[, -which(colnames(dat) %in% c("id", "bp.2s", "bp.2d", "time.ppn"))]
dat = dat[, c(8, 6, 11, 9, 10, 14, 15, 2, 5, 1, 3, 4, 12, 13)]
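One caveat: numeric column indices like c(8, 6, 11, ...) silently pick the wrong columns if the column order of the input file ever changes. A more robust, equivalent pattern selects columns by name; the toy sketch below shows the idea, and on the real data you would list the actual names from colnames(dat):

```r
toy <- data.frame(b = 1:2, a = 3:4, c = 5:6)  # toy stand-in for dat
wanted <- c("a", "c", "b")                    # desired column order, by name
toy <- toy[, wanted]
colnames(toy)  # "a" "c" "b"
```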

Now let's look at our cleaned data

summary(dat)
    gender            location            frame               height     
 Length:403         Length:403         Length:403         Min.   :52.00  
 Class :character   Class :character   Class :character   1st Qu.:63.00  
 Mode  :character   Mode  :character   Mode  :character   Median :66.00  
                                                          Mean   :66.02  
                                                          3rd Qu.:69.00  
                                                          Max.   :76.00  
                                                          NA's   :5      
     weight          waist           hip           stab.glu    
 Min.   : 99.0   Min.   :26.0   Min.   :30.00   Min.   : 48.0  
 1st Qu.:151.0   1st Qu.:33.0   1st Qu.:39.00   1st Qu.: 81.0  
 Median :172.5   Median :37.0   Median :42.00   Median : 89.0  
 Mean   :177.6   Mean   :37.9   Mean   :43.04   Mean   :106.7  
 3rd Qu.:200.0   3rd Qu.:41.0   3rd Qu.:46.00   3rd Qu.:106.0  
 Max.   :325.0   Max.   :56.0   Max.   :64.00   Max.   :385.0  
 NA's   :1       NA's   :2      NA's   :2                      
     glyhb            chol            hdl             ratio       
 Min.   : 2.68   Min.   : 78.0   Min.   : 12.00   Min.   : 1.500  
 1st Qu.: 4.38   1st Qu.:179.0   1st Qu.: 38.00   1st Qu.: 3.200  
 Median : 4.84   Median :204.0   Median : 46.00   Median : 4.200  
 Mean   : 5.59   Mean   :207.8   Mean   : 50.45   Mean   : 4.522  
 3rd Qu.: 5.60   3rd Qu.:230.0   3rd Qu.: 59.00   3rd Qu.: 5.400  
 Max.   :16.11   Max.   :443.0   Max.   :120.00   Max.   :19.300  
 NA's   :13      NA's   :1       NA's   :1        NA's   :1       
     bp.1s           bp.1d       
 Min.   : 90.0   Min.   : 48.00  
 1st Qu.:121.2   1st Qu.: 75.00  
 Median :136.0   Median : 82.00  
 Mean   :136.9   Mean   : 83.32  
 3rd Qu.:146.8   3rd Qu.: 90.00  
 Max.   :250.0   Max.   :124.00  
 NA's   :5       NA's   :5       

The ordering and selection of columns looks right; however, it seems that certain rows have missing values (the glyhb column, for example, has 13 NA values). Let's remove all rows with any missing value. Also, let's make the columns location and gender nominal and the column frame ordinal.

Before we do so, let’s have a look at the is.na function

is.na(dat$glyhb)
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
 [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [56] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
 [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[111] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
 [ reached getOption("max.print") -- omitted 283 entries ]

Do you see how it works? Let’s go one step further:

sum(is.na(dat$glyhb))
[1] 13

Now we can apply the same principle on rows instead of columns, and repeat that for all rows of the data frame:

rmv.rows = apply(dat, 1, function(x) {
    sum(is.na(x))
})  # Go through each row and sum up all missing values
rmv.rows[1:20]  # The number of missing values in the first 20 rows
1000 1001 1002 1003 1005 1008 1011 1015 1016 1022 1024 1029 1030 1031 1035 
   0    0    0    0    0    0    0    2    0    0    0    0    0    2    0 
1036 1037 1041 1045 1250 
   0    0    0    0    0 
which(rmv.rows > 0)  # The rows with at least 1 missing value
 1015  1031  1281  1314  1326  2754  2774  2780  2784  2794  4760  4805 
    8    14    28    38    44    51    60    64    65    70    87   109 
 4808  4813  4826  4827 13500 15012 15542 15757 15805 15813 15827 16016 
  110   111   117   118   153   162   193   196   216   218   225   232 
20258 20279 20762 20787 20790 21284 21321 21357 41023 41253 41501 41756 
  274   283   318   327   328   333   337   349   381   394   397   403 
dat = dat[-which(rmv.rows > 0), ]  # Removing any row with 1 or more missing values
rm(rmv.rows)
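The apply/which idiom above is worth understanding, but base R also offers a shortcut: complete.cases() returns TRUE for every row without an NA, so dat[complete.cases(dat), ] would remove the same rows in one call. A toy sketch:

```r
toy <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA),
                  stringsAsFactors = FALSE)
keep <- complete.cases(toy)  # TRUE only for rows with no missing value
keep         # TRUE FALSE FALSE
toy[keep, ]  # only the first row survives
```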

dat$location = factor(dat$location)  #making data nominal
dat$gender = factor(dat$gender)  #making data nominal
dat$frame = factor(dat$frame, levels = c("small", "medium", "large"))  #making data ordinal

Now our cleaned data has no missing values, the columns are cleanly ordered, and each column is in the right format.

summary(dat)
    gender          location      frame         height     
 female:214   Buckingham:176   small : 98   Min.   :52.00  
 male  :153   Louisa    :191   medium:173   1st Qu.:63.00  
                               large : 96   Median :66.00  
                                            Mean   :66.05  
                                            3rd Qu.:69.00  
                                            Max.   :76.00  
     weight          waist            hip           stab.glu    
 Min.   : 99.0   Min.   :26.00   Min.   :30.00   Min.   : 48.0  
 1st Qu.:151.0   1st Qu.:33.00   1st Qu.:39.00   1st Qu.: 81.0  
 Median :174.0   Median :37.00   Median :42.00   Median : 90.0  
 Mean   :178.1   Mean   :37.93   Mean   :43.04   Mean   :107.3  
 3rd Qu.:200.0   3rd Qu.:41.50   3rd Qu.:46.00   3rd Qu.:108.0  
 Max.   :325.0   Max.   :56.00   Max.   :64.00   Max.   :385.0  
     glyhb             chol            hdl             ratio       
 Min.   : 2.680   Min.   : 78.0   Min.   : 12.00   Min.   : 1.500  
 1st Qu.: 4.390   1st Qu.:179.0   1st Qu.: 38.00   1st Qu.: 3.200  
 Median : 4.860   Median :204.0   Median : 46.00   Median : 4.200  
 Mean   : 5.602   Mean   :207.5   Mean   : 50.28   Mean   : 4.536  
 3rd Qu.: 5.630   3rd Qu.:229.0   3rd Qu.: 59.00   3rd Qu.: 5.400  
 Max.   :16.110   Max.   :443.0   Max.   :120.00   Max.   :19.300  
     bp.1s           bp.1d      
 Min.   : 90.0   Min.   : 48.0  
 1st Qu.:121.5   1st Qu.: 75.0  
 Median :136.0   Median : 82.0  
 Mean   :137.1   Mean   : 83.4  
 3rd Qu.:148.0   3rd Qu.: 92.0  
 Max.   :250.0   Max.   :124.0  

Can you identify which type of data (continuous, discrete etc.) each column above represents, and why?

5. Visualizing data distribution

Histograms

summary(dat$stab.glu)  # What are these values? replace summary by quantile and see what happens
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   48.0    81.0    90.0   107.3   108.0   385.0 
hist(dat$stab.glu, xlab = "Stabilized Glucose concentration in blood", main = "")
abline(v = summary(dat$stab.glu)[2:5], col = c("blue", "red", "black", "orange"), 
    lty = 2)

Add the parameter breaks = 50 in the above lines of code and see what happens. Try different values for breaks like 10, 20, 75, 100 and try to interpret the differences. Is this a good or bad thing about histograms in general? Type help("hist") to see more details about the histogram function in R. Also try plotting histograms and summaries for other continuous numeric data in our diabetes dataset.
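To see the effect of breaks without spoiling the exercise on the diabetes data, the sketch below draws one simulated sample (made up, via rnorm) with two different bin counts side by side. Note that breaks is only a suggestion: hist picks "pretty" bin boundaries close to the requested number.

```r
set.seed(1)                        # reproducible toy sample
x <- rnorm(400, mean = 100, sd = 15)
op <- par(mfrow = c(1, 2))         # two plots side by side
hist(x, breaks = 10, main = "breaks = 10", xlab = "x")
hist(x, breaks = 75, main = "breaks = 75", xlab = "x")
par(op)                            # restore the plotting layout
```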

Density plots

plot(density(dat$stab.glu), xlab = "Stabilized Glucose concentration in blood", 
    main = "")
abline(v = summary(dat$stab.glu)[2:5], col = c("blue", "red", "black", "orange"), 
    lty = 2)

Using the function abline, plot a vertical line highlighting the glucose concentration of 300 units.

Boxplots

boxplot(dat$stab.glu, xlab = "Stabilized Glucose concentration in blood", horizontal = TRUE)
abline(v = summary(dat$stab.glu)[2:5], col = c("blue", "red", "black", "orange"), 
    lty = 2)

Using the abline function, draw lines passing through the whiskers of the boxplot.

Integrate histograms, density plots, violin plots and boxplots

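The code for this combined figure is omitted here; below is a minimal sketch of one way to stack a density-overlaid histogram above a boxplot on a shared 0–400 axis. Violin plots would need an extra package such as vioplot, so they are left out of this base-R sketch; x is a made-up stand-in for dat$stab.glu.

```r
set.seed(1)
x <- rnorm(400, mean = 107, sd = 30)              # stand-in for dat$stab.glu
op <- par(mfrow = c(2, 1), mar = c(3, 3, 1, 1))
hist(x, breaks = 30, freq = FALSE, main = "", xlim = c(0, 400),
     xlab = "Stabilized Glucose concentration in blood")
lines(density(x), col = "red", lwd = 2)           # density curve over the histogram
boxplot(x, horizontal = TRUE, ylim = c(0, 400))   # boxplot on the same value range
par(op)
```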

Which do you think is a better plot for visualizing data distribution? What are the advantages and disadvantages of each of these plots?

6. Measuring the centrality in data

We have already seen that the summary and quantile functions in R can compute the mean, median and quantiles of any given data. Let us now use the functions in R that explicitly measure these values.

Mean

mean(dat$stab.glu)
[1] 107.3297

Median

median(dat$stab.glu)
[1] 90

Calculate the mean and median of other continuous numeric data in the diabetes dataset and measure the difference between them. (a) Why is there a difference between the mean and median? (b) Why do you think there are larger differences for some and almost no difference for others?
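A compact way to attack this exercise is to compute the mean-median gap for many columns at once with sapply. The toy data frame below (made up) shows why the gap tracks skewness; on the real data, you could apply the same call to the numeric columns, e.g. dat[, 4:ncol(dat)]:

```r
toy <- data.frame(symmetric = c(1, 2, 3, 4, 5),   # mean equals median
                  skewed    = c(1, 1, 2, 2, 44))  # one extreme value drags the mean up
sapply(toy, function(x) mean(x) - median(x))
# symmetric: 0, skewed: 8 -- the gap grows with skewness
```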

Quantiles

quantile(dat$stab.glu)
  0%  25%  50%  75% 100% 
  48   81   90  108  385 

By default, the quantile function gives you the minimum, the 25th, 50th and 75th quantiles, and the maximum value. How would you calculate, say, the 35th, 47th or 72nd quantile? See ?quantile for hints.

7. Association between variables

A common step in any data analysis project is to find associations between the variables present in the dataset. Such associations help us decipher the underlying structure in the data. For instance, in our diabetes dataset we would expect a high correlation between free blood glucose levels and glycosylated hemoglobin levels, or between waist and hip sizes. On the one hand, prior knowledge tells us what to expect, and if this expectation does not hold true it may hint at some issue with the data. On the other hand, novel associations help us discover new knowledge. One of the most common ways of measuring associations is correlation.

Let us start by producing a scatter plot between some pairs of variables:

plot(dat$stab.glu, dat$glyhb, pch = 20, xlab = "Stabilized glucose", ylab = "Glycosylated hemoglobin")

Do you suspect that the two variables have a relationship?

Draw scatter plots for other pairs of numerical variables!

We now can compute the correlation of the two variables; remember that we can compute the Pearson correlation or the Spearman correlation:

## compute the Pearson correlation
cor(dat$stab.glu, dat$glyhb, method = "pearson")
[1] 0.7411355
## compute the Spearman correlation
cor(dat$stab.glu, dat$glyhb, method = "spearman")
[1] 0.5214866

The Spearman correlation seems much lower! To understand why, we can do a scatter plot between the ranks of the two variables:

plot(rank(dat$stab.glu), rank(dat$glyhb), pch = 20, xlab = "Ranks Stabilized glucose", 
    ylab = "Ranks Glycosylated hemoglobin")

When looking at the ranks, the relationship is no longer so obvious. However, these correlations still represent an association between the variables!
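This rank plot also hints at the definition itself: the Spearman correlation of two vectors is simply the Pearson correlation of their ranks (with ties averaged). A quick self-contained check on made-up numbers:

```r
x <- c(2, 9, 4, 7, 1)
y <- c(5, 30, 8, 20, 3)  # monotone but non-linear in x
sp <- cor(x, y, method = "spearman")
pr <- cor(rank(x), rank(y), method = "pearson")  # Pearson on the ranks
all.equal(sp, pr)  # TRUE; here both equal 1 because y is monotone in x
```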

Pairwise correlations

Let's calculate and visualize the correlations among all the variables in our diabetes dataset

cor.mat = cor(dat[, 4:ncol(dat)], method = "pearson")  # First three columns are discrete data
round(cor.mat, 2)  # all pairwise correlations
         height weight waist   hip stab.glu glyhb  chol   hdl ratio bp.1s
height     1.00   0.24  0.04 -0.12     0.08  0.05 -0.06 -0.07  0.07 -0.04
weight     0.24   1.00  0.85  0.83     0.19  0.17  0.08 -0.28  0.28  0.10
waist      0.04   0.85  1.00  0.83     0.23  0.25  0.14 -0.28  0.32  0.21
hip       -0.12   0.83  0.83  1.00     0.15  0.15  0.10 -0.22  0.21  0.15
stab.glu   0.08   0.19  0.23  0.15     1.00  0.74  0.16 -0.18  0.30  0.15
glyhb      0.05   0.17  0.25  0.15     0.74  1.00  0.27 -0.17  0.35  0.20
chol      -0.06   0.08  0.14  0.10     0.16  0.27  1.00  0.17  0.48  0.20
hdl       -0.07  -0.28 -0.28 -0.22    -0.18 -0.17  0.17  1.00 -0.69  0.03
ratio      0.07   0.28  0.32  0.21     0.30  0.35  0.48 -0.69  1.00  0.11
bp.1s     -0.04   0.10  0.21  0.15     0.15  0.20  0.20  0.03  0.11  1.00
         bp.1d
height    0.04
weight    0.18
waist     0.18
hip       0.16
stab.glu  0.02
glyhb     0.05
chol      0.16
hdl       0.07
ratio     0.03
bp.1s     0.62
 [ reached getOption("max.print") -- omitted 1 row ]
par(mar = c(3, 3, 0.5, 0.5), mgp = c(1.5, 0.5, 0), las = 2)
pairs(dat[, 4:ncol(dat)], pch = 20, cex = 0.5, col = "grey")
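The recap above also listed heatmaps, and a correlation matrix is a natural candidate for one. The sketch below uses base R's heatmap() on a small made-up matrix; on the real data you would pass cor.mat instead. symm = TRUE keeps the symmetric layout, and rows/columns get reordered by hierarchical clustering.

```r
set.seed(1)
m <- matrix(rnorm(50), ncol = 5,
            dimnames = list(NULL, paste0("v", 1:5)))  # toy data: 10 obs, 5 variables
cm <- cor(m)                                 # 5 x 5 correlation matrix
heatmap(cm, symm = TRUE, margins = c(5, 5))  # shade cells by correlation value
```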

Change method = "spearman" while calculating the correlations and observe the differences. Which correlations did you expect to see, and which were novel? Can you explain the relation you observe between hdl, chol and ratio? Remember that ratio = chol/hdl, see here

Ashwini Kumar Sharma, Carl Herrmann

2019-10-25