Exploratory Data Analysis
1. Recap of Part I
A. You learnt about different types of data like –
- Continuous
- Discrete
- Nominal
- Ordinal, etc.
B. You learnt different ways in which data can be plotted –
- Histograms
- Density plots
- Box plots
- Heatmaps, etc.
C. You learnt how the distribution/spread of data can be summarised –
- Mean
- Median
- Quantiles
D. Finally, you learnt how the dependence or independence between two variables can be measured and plotted –
- Correlation
- Covariance
- Scatter plots
2. Objectives
In this practical session you will learn how to implement the ideas taught in the first lesson using the R programming language. Throughout this course we will use a common diabetes dataset to put the concepts from the lectures into practice.
Diabetes is a collection of metabolic disorders in which blood glucose levels increase drastically due to defective insulin secretion and/or insulin resistance. There are many risk factors associated with diabetes, like obesity and old age. Diabetes in turn greatly increases the risk of cardiovascular disease. For more information see here. Our dataset contains various pathophysiological measurements of >400 individuals. Clinical parameters like blood glucose levels, cholesterol levels, age, body size, weight and blood pressure have been measured. We will use this dataset to explore the statistical properties of each variable (like its distribution and mean value) and to ask questions about the relations between variables (e.g. is blood glucose level correlated with obesity?).
3. Let's get started - Load the data
The diabetes dataset, which we will be using throughout this practical, will be downloaded from an online repository. We will load it into R and take a sneak peek at what it looks like.
dat = read.delim("https://tinyurl.com/y4fark9g", stringsAsFactors = FALSE)
head(dat, 10) # look at the first 10 lines of the table
id chol stab.glu hdl ratio glyhb location age gender height weight
1 1000 203 82 56 3.6 4.31 Buckingham 46 female 62 121
2 1001 165 97 24 6.9 4.44 Buckingham 29 female 64 218
3 1002 228 92 37 6.2 4.64 Buckingham 58 female 61 256
4 1003 78 93 12 6.5 4.63 Buckingham 67 male 67 119
5 1005 249 90 28 8.9 7.72 Buckingham 64 male 68 183
6 1008 248 94 69 3.6 4.81 Buckingham 34 male 71 190
frame bp.1s bp.1d bp.2s bp.2d waist hip time.ppn
1 medium 118 59 NA NA 29 38 720
2 large 112 68 NA NA 46 48 360
3 large 190 92 185 92 49 57 180
4 large 110 50 NA NA 33 38 480
5 medium 138 80 NA NA 44 41 300
6 large 132 86 NA NA 36 42 195
[ reached 'max' / getOption("max.print") -- omitted 4 rows ]
Let's explore and understand our dataset further:
1. What is the dimension of our dataset (i.e. how many rows/columns are there in our data)?
# Dimension
dim(dat)
[1] 403 19
# Number of columns
ncol(dat)
# Number of rows
nrow(dat)
2. What are the column names of our dataset?
colnames(dat) #similarly rownames for rows
[1] "id" "chol" "stab.glu" "hdl" "ratio" "glyhb"
[7] "location" "age" "gender" "height" "weight" "frame"
[13] "bp.1s" "bp.1d" "bp.2s" "bp.2d" "waist" "hip"
[19] "time.ppn"
You may be wondering what these column names mean. For a more detailed description of these variables, look here
3. How do we extract the minimum and maximum age of patients in our dataset?
min(dat$age)
max(dat$age)
range(dat$age)
# Try to find out the same for height and weight
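One thing to watch out for: height and weight contain a few missing values (as we will see in the cleanup section below), so these functions return NA for them unless you ask R to ignore missing values. A sketch:
range(dat$height, na.rm = TRUE) # na.rm = TRUE ignores the missing values
range(dat$weight, na.rm = TRUE)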
4. What does the overall summary of our entire dataset look like?
summary(dat) #Can you explain what you see?
id chol stab.glu hdl
Min. : 1000 Min. : 78.0 Min. : 48.0 Min. : 12.00
1st Qu.: 4792 1st Qu.:179.0 1st Qu.: 81.0 1st Qu.: 38.00
Median :15766 Median :204.0 Median : 89.0 Median : 46.00
Mean :15978 Mean :207.8 Mean :106.7 Mean : 50.45
3rd Qu.:20336 3rd Qu.:230.0 3rd Qu.:106.0 3rd Qu.: 59.00
Max. :41756 Max. :443.0 Max. :385.0 Max. :120.00
ratio glyhb location age
Min. : 1.500 Min. : 2.68 Length:403 Min. :19.00
1st Qu.: 3.200 1st Qu.: 4.38 Class :character 1st Qu.:34.00
Median : 4.200 Median : 4.84 Mode :character Median :45.00
Mean : 4.522 Mean : 5.59 Mean :46.85
3rd Qu.: 5.400 3rd Qu.: 5.60 3rd Qu.:60.00
Max. :19.300 Max. :16.11 Max. :92.00
gender height weight frame
Length:403 Min. :52.00 Min. : 99.0 Length:403
Class :character 1st Qu.:63.00 1st Qu.:151.0 Class :character
Mode :character Median :66.00 Median :172.5 Mode :character
Mean :66.02 Mean :177.6
3rd Qu.:69.00 3rd Qu.:200.0
Max. :76.00 Max. :325.0
bp.1s bp.1d bp.2s bp.2d
Min. : 90.0 Min. : 48.00 Min. :110.0 Min. : 60.00
1st Qu.:121.2 1st Qu.: 75.00 1st Qu.:138.0 1st Qu.: 84.00
Median :136.0 Median : 82.00 Median :149.0 Median : 92.00
Mean :136.9 Mean : 83.32 Mean :152.4 Mean : 92.52
3rd Qu.:146.8 3rd Qu.: 90.00 3rd Qu.:161.0 3rd Qu.:100.00
Max. :250.0 Max. :124.00 Max. :238.0 Max. :124.00
waist hip time.ppn
Min. :26.0 Min. :30.00 Min. : 5.0
1st Qu.:33.0 1st Qu.:39.00 1st Qu.: 90.0
Median :37.0 Median :42.00 Median : 240.0
Mean :37.9 Mean :43.04 Mean : 341.2
3rd Qu.:41.0 3rd Qu.:46.00 3rd Qu.: 517.5
Max. :56.0 Max. :64.00 Max. :1560.0
[ reached getOption("max.print") -- omitted 1 row ]
4. Data cleanup
Very often, the first thing one needs to do in any data science project is to clean up the raw data and transform it into a format that is readily understood and easy to use for all downstream analyses. This process usually involves –
- Removing empty value rows/columns
- Removing unused or unnecessary rows/columns
- Reordering the data matrix
- Keeping columns uniformly numeric (age, weight etc) or string (names, places etc) or logical (TRUE/FALSE, 1/0)
- Handling strange data-specific caveats, like removing stray characters (",", "." or ";") from numbers (see the sketch after this list)
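For instance, stray separators can be stripped with gsub before converting to numbers. A minimal sketch, using hypothetical messy values rather than our dataset:
messy = c("1,200", "560;", "85") # hypothetical raw entries with stray characters
as.numeric(gsub("[,;]", "", messy)) # strip commas/semicolons, then convert: 1200 560 85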
Let's do some cleanup of our own diabetes data –
- We will make the id column the row names of the dataset.
- We will remove the bp.2s and bp.2d columns, as they have mostly missing values (see summary above).
- We will also remove the time.ppn column, which will not be required in our analysis.
- We will reorder the columns of the data such that all the qualitative and quantitative values are separated. Among the quantitative values, we will keep related variables together.
rownames(dat) = dat$id # use the patient ids as row names
dat = dat[, -which(colnames(dat) %in% c("id", "bp.2s", "bp.2d", "time.ppn"))] # drop unneeded columns
dat = dat[, c(8, 6, 11, 9, 10, 14, 15, 2, 5, 1, 3, 4, 12, 13)] # qualitative columns first, then related quantitative ones together (note this also drops the age column)
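The numeric indices above depend on the current column order, which makes them fragile; an equivalent selection by column name is easier to read and verify. A sketch of the same reordering:
qual = c("gender", "location", "frame")
quant = c("height", "weight", "waist", "hip", "stab.glu", "glyhb", "chol", "hdl", "ratio", "bp.1s", "bp.1d")
dat = dat[, c(qual, quant)] # like the index version, this leaves out the age column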
Now let's look at our cleaned data
summary(dat)
gender location frame height
Length:403 Length:403 Length:403 Min. :52.00
Class :character Class :character Class :character 1st Qu.:63.00
Mode :character Mode :character Mode :character Median :66.00
Mean :66.02
3rd Qu.:69.00
Max. :76.00
NA's :5
weight waist hip stab.glu
Min. : 99.0 Min. :26.0 Min. :30.00 Min. : 48.0
1st Qu.:151.0 1st Qu.:33.0 1st Qu.:39.00 1st Qu.: 81.0
Median :172.5 Median :37.0 Median :42.00 Median : 89.0
Mean :177.6 Mean :37.9 Mean :43.04 Mean :106.7
3rd Qu.:200.0 3rd Qu.:41.0 3rd Qu.:46.00 3rd Qu.:106.0
Max. :325.0 Max. :56.0 Max. :64.00 Max. :385.0
NA's :1 NA's :2 NA's :2
glyhb chol hdl ratio
Min. : 2.68 Min. : 78.0 Min. : 12.00 Min. : 1.500
1st Qu.: 4.38 1st Qu.:179.0 1st Qu.: 38.00 1st Qu.: 3.200
Median : 4.84 Median :204.0 Median : 46.00 Median : 4.200
Mean : 5.59 Mean :207.8 Mean : 50.45 Mean : 4.522
3rd Qu.: 5.60 3rd Qu.:230.0 3rd Qu.: 59.00 3rd Qu.: 5.400
Max. :16.11 Max. :443.0 Max. :120.00 Max. :19.300
NA's :13 NA's :1 NA's :1 NA's :1
bp.1s bp.1d
Min. : 90.0 Min. : 48.00
1st Qu.:121.2 1st Qu.: 75.00
Median :136.0 Median : 82.00
Mean :136.9 Mean : 83.32
3rd Qu.:146.8 3rd Qu.: 90.00
Max. :250.0 Max. :124.00
NA's :5 NA's :5
The ordering and selection of columns looks right; however, it seems that certain rows have missing values (e.g. the glyhb column has 13 NA values). Let's remove all rows with any missing value. Let's also make the columns location and gender nominal, and the column frame ordinal.
Before we do so, let's have a look at the is.na function:
is.na(dat$glyhb)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[56] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
[67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[111] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
[ reached getOption("max.print") -- omitted 283 entries ]
Understand how it works? Let’s go one step further:
sum(is.na(dat$glyhb))
[1] 13
Now we can apply the same principle on rows instead of columns, and repeat that for all rows of the data frame:
rmv.rows = apply(dat, 1, function(x) {
sum(is.na(x))
}) # Go through each row and sum up all missing values
rmv.rows[1:20] # The number of missing values in the first 20 rows
1000 1001 1002 1003 1005 1008 1011 1015 1016 1022 1024 1029 1030 1031 1035
0 0 0 0 0 0 0 2 0 0 0 0 0 2 0
1036 1037 1041 1045 1250
0 0 0 0 0
which(rmv.rows > 0) # The rows with at least 1 missing value
1015 1031 1281 1314 1326 2754 2774 2780 2784 2794 4760 4805
8 14 28 38 44 51 60 64 65 70 87 109
4808 4813 4826 4827 13500 15012 15542 15757 15805 15813 15827 16016
110 111 117 118 153 162 193 196 216 218 225 232
20258 20279 20762 20787 20790 21284 21321 21357 41023 41253 41501 41756
274 283 318 327 328 333 337 349 381 394 397 403
dat = dat[-which(rmv.rows > 0), ] # Removing any row with 1 or more missing values
rm(rmv.rows)
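As an aside, the same filtering could have been done in one step with base R's complete.cases function, which flags the rows that contain no missing values. A sketch, equivalent to the apply approach above:
dat = dat[complete.cases(dat), ] # keep only rows without any NA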
dat$location = factor(dat$location) #making data nominal
dat$gender = factor(dat$gender) #making data nominal
dat$frame = factor(dat$frame, levels = c("small", "medium", "large")) #making data ordinal
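A small caveat: factor with a levels argument fixes the display order of the categories but does not create a formally ordered factor. If you want R to treat frame as truly ordinal (so that comparisons like small < large work), pass ordered = TRUE as well; a sketch:
dat$frame = factor(dat$frame, levels = c("small", "medium", "large"), ordered = TRUE) # ordinal factor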
Now our cleaned data has no missing values, the columns are cleanly ordered, and each column is in the right format:
summary(dat)
gender location frame height
female:214 Buckingham:176 small : 98 Min. :52.00
male :153 Louisa :191 medium:173 1st Qu.:63.00
large : 96 Median :66.00
Mean :66.05
3rd Qu.:69.00
Max. :76.00
weight waist hip stab.glu
Min. : 99.0 Min. :26.00 Min. :30.00 Min. : 48.0
1st Qu.:151.0 1st Qu.:33.00 1st Qu.:39.00 1st Qu.: 81.0
Median :174.0 Median :37.00 Median :42.00 Median : 90.0
Mean :178.1 Mean :37.93 Mean :43.04 Mean :107.3
3rd Qu.:200.0 3rd Qu.:41.50 3rd Qu.:46.00 3rd Qu.:108.0
Max. :325.0 Max. :56.00 Max. :64.00 Max. :385.0
glyhb chol hdl ratio
Min. : 2.680 Min. : 78.0 Min. : 12.00 Min. : 1.500
1st Qu.: 4.390 1st Qu.:179.0 1st Qu.: 38.00 1st Qu.: 3.200
Median : 4.860 Median :204.0 Median : 46.00 Median : 4.200
Mean : 5.602 Mean :207.5 Mean : 50.28 Mean : 4.536
3rd Qu.: 5.630 3rd Qu.:229.0 3rd Qu.: 59.00 3rd Qu.: 5.400
Max. :16.110 Max. :443.0 Max. :120.00 Max. :19.300
bp.1s bp.1d
Min. : 90.0 Min. : 48.0
1st Qu.:121.5 1st Qu.: 75.0
Median :136.0 Median : 82.0
Mean :137.1 Mean : 83.4
3rd Qu.:148.0 3rd Qu.: 92.0
Max. :250.0 Max. :124.0
Can you identify which types of data (continuous, discrete etc) each column above represents and why?
5. Visualizing data distribution
Histograms
summary(dat$stab.glu) # What are these values? Replace summary with quantile and see what happens
Min. 1st Qu. Median Mean 3rd Qu. Max.
48.0 81.0 90.0 107.3 108.0 385.0
hist(dat$stab.glu, xlab = "Stabilized Glucose concentration in blood", main = "")
abline(v = summary(dat$stab.glu)[2:5], col = c("blue", "red", "black", "orange"),
lty = 2)
Add the parameter breaks = 50 to the above lines of code and see what happens. Try different values for breaks, like 10, 20, 75 or 100, and try to interpret the differences. Is this a good or a bad thing about histograms in general? Type help("hist") to see more details about the histogram function in R. Also try plotting histograms and summaries for the other continuous numeric variables in our diabetes dataset.
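For instance, a version with finer binning might look like this:
hist(dat$stab.glu, breaks = 50, xlab = "Stabilized Glucose concentration in blood", main = "")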
Density plots
plot(density(dat$stab.glu), xlab = "Stabilized Glucose concentration in blood",
main = "")
abline(v = summary(dat$stab.glu)[2:5], col = c("blue", "red", "black", "orange"),
lty = 2)
Using the abline function, plot a vertical line highlighting the glucose concentration of 300 units.
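One possible solution, sketched:
abline(v = 300, col = "purple") # vertical line at a glucose concentration of 300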
Boxplots
boxplot(dat$stab.glu, xlab = "Stabilized Glucose concentration in blood", horizontal = T)
abline(v = summary(dat$stab.glu)[2:5], col = c("blue", "red", "black", "orange"),
lty = 2)
Using the abline function, draw lines passing through the whiskers of the boxplot.
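As a hint, boxplot.stats returns the whisker ends among its statistics, so one possible solution is:
bs = boxplot.stats(dat$stab.glu)$stats # lower whisker, lower hinge, median, upper hinge, upper whisker
abline(v = bs[c(1, 5)], col = "darkgreen", lty = 3) # lines through the two whiskers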
Integrate histograms, density plots, violin plots and boxplots
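A sketch of how such an integrated view could be built; the violin plot part assumes the vioplot package is installed (install.packages("vioplot")):
# histogram with the density curve overlaid
hist(dat$stab.glu, breaks = 30, freq = FALSE, xlab = "Stabilized Glucose concentration in blood", main = "")
lines(density(dat$stab.glu), col = "red", lwd = 2)
# violin plot: a box plot combined with a mirrored density plot
library(vioplot)
vioplot(dat$stab.glu, horizontal = TRUE)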
Which do you think is a better plot for visualizing data distribution? What are the advantages and disadvantages of each of these plots?
6. Measuring the centrality in data
We have already seen that the summary and quantile functions in R can compute the mean, median and quantiles of any given data. Let us now use the functions in R that measure these values explicitly.
Mean
mean(dat$stab.glu)
[1] 107.3297
Median
median(dat$stab.glu)
[1] 90
Calculate the mean and median of the other continuous numeric variables in the diabetes dataset and measure the difference between them. (a) Why is there a difference between the mean and the median? (b) Why do you think there are larger differences for some variables and almost none for others?
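A compact way to compare the two across all numeric columns at once, as a sketch:
num = sapply(dat, is.numeric) # identify the numeric columns
round(sapply(dat[, num], mean) - sapply(dat[, num], median), 2) # mean minus median, per column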
Quantiles
quantile(dat$stab.glu)
0% 25% 50% 75% 100%
48 81 90 108 385
By default, the quantile function gives you the min, 25th, 50th and 75th quantiles and the max values. How would you calculate, say, the 35th, 47th or 72nd quantile? See ?quantile for hints.
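As a hint, quantile accepts a probs argument with arbitrary probabilities between 0 and 1:
quantile(dat$stab.glu, probs = c(0.35, 0.47, 0.72)) # 35th, 47th and 72nd quantiles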
7. Association between variables
A common step in any data analysis project is to find associations between the variables present in the dataset. Such associations help us decipher the underlying structure in the data. For instance, in our diabetes dataset we would expect a high correlation between free blood glucose levels and glycosylated hemoglobin levels, or between waist and hip sizes. On the one hand, prior knowledge helps us know what to expect, and if this expectation does not hold true it may hint at issues with the data. On the other hand, novel associations help us discover new knowledge. One of the most common ways of measuring association is correlation.
Let us start by producing a scatter plot between some pairs of variables:
plot(dat$stab.glu, dat$glyhb, pch = 20, xlab = "Stabilized glucose", ylab = "Glycosylated hemoglobin")
Do you suspect that the two variables have a relationship?
Make scatter plots for other pairs of numerical variables!
We now can compute the correlation of the two variables; remember that we can compute the Pearson correlation or the Spearman correlation:
## compute the Pearson correlation
cor(dat$stab.glu, dat$glyhb, method = "pearson")
[1] 0.7411355
## compute the Spearman correlation
cor(dat$stab.glu, dat$glyhb, method = "spearman")
[1] 0.5214866
The Spearman correlation seems much lower! To understand why, we can do a scatter plot between the ranks of the two variables:
plot(rank(dat$stab.glu), rank(dat$glyhb), pch = 20, xlab = "Ranks Stabilized glucose",
ylab = "Ranks Glycosylated hemoglobin")
When looking at the ranks, the relationship is no longer so obvious. However, these correlations still represent an association between the variables!
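In fact, the Spearman correlation is by definition the Pearson correlation computed on the ranks, which you can verify yourself:
cor(rank(dat$stab.glu), rank(dat$glyhb), method = "pearson") # matches the Spearman value above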
Pairwise correlations
Let's calculate and visualize the correlations among all the variables in our diabetes dataset
cor.mat = cor(dat[, 4:ncol(dat)], method = "pearson") # The first three columns are categorical data
round(cor.mat, 2) # all pairwise correlations
height weight waist hip stab.glu glyhb chol hdl ratio bp.1s
height 1.00 0.24 0.04 -0.12 0.08 0.05 -0.06 -0.07 0.07 -0.04
weight 0.24 1.00 0.85 0.83 0.19 0.17 0.08 -0.28 0.28 0.10
waist 0.04 0.85 1.00 0.83 0.23 0.25 0.14 -0.28 0.32 0.21
hip -0.12 0.83 0.83 1.00 0.15 0.15 0.10 -0.22 0.21 0.15
stab.glu 0.08 0.19 0.23 0.15 1.00 0.74 0.16 -0.18 0.30 0.15
glyhb 0.05 0.17 0.25 0.15 0.74 1.00 0.27 -0.17 0.35 0.20
chol -0.06 0.08 0.14 0.10 0.16 0.27 1.00 0.17 0.48 0.20
hdl -0.07 -0.28 -0.28 -0.22 -0.18 -0.17 0.17 1.00 -0.69 0.03
ratio 0.07 0.28 0.32 0.21 0.30 0.35 0.48 -0.69 1.00 0.11
bp.1s -0.04 0.10 0.21 0.15 0.15 0.20 0.20 0.03 0.11 1.00
bp.1d
height 0.04
weight 0.18
waist 0.18
hip 0.16
stab.glu 0.02
glyhb 0.05
chol 0.16
hdl 0.07
ratio 0.03
bp.1s 0.62
[ reached getOption("max.print") -- omitted 1 row ]
par(mar = c(3, 3, 0.5, 0.5), mgp = c(1.5, 0.5, 0), las = 2)
pairs(dat[, 4:ncol(dat)], pch = 20, cex = 0.5, col = "grey")
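Since heatmaps were covered in Part I, the correlation matrix itself can also be drawn as one using base R's heatmap function; a minimal sketch:
heatmap(cor.mat, symm = TRUE, margins = c(6, 6)) # rows and columns are clustered by similarity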
Change method = "spearman" when calculating the correlations and observe the differences. Which correlations did you expect to see, and which were novel? Can you explain the relation you observe between hdl, chol and ratio? Remember that ratio = chol/hdl; see here