Unsupervised learning
1. Recap of Part II
A. Unsupervised learning –
- We use unsupervised learning to learn de novo patterns in our data.
- We use unsupervised learning when we don’t have a response variable to predict or lack a priori knowledge about our data.
- Simply put, unsupervised learning is used to group similar observations together in any given dataset.
- One of the very first steps in understanding data is measuring distances between observations in the data.
- There are different ways in which distances can be measured - Euclidean, correlation, Manhattan, etc.
- Hierarchical clustering is one of the methods used for unsupervised learning.
- A clustering algorithm tries to group together observations with small distances between them.
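These distance measures can be compared directly in R. A minimal sketch on a made-up matrix (the values here are invented purely for illustration):

```r
# A toy matrix: 3 observations (rows) x 4 features (columns)
m = rbind(a = c(1, 2, 3, 4),
          b = c(2, 3, 4, 5),
          c = c(8, 6, 4, 2))

dist(m, method = "euclidean")   # straight-line distance between rows
dist(m, method = "manhattan")   # sum of absolute differences
# Correlation-based "distance": 1 - correlation between observations
as.dist(1 - cor(t(m)))          # a and b are perfectly correlated -> distance 0
```

Note that a and b differ only by a constant shift, so their Euclidean and Manhattan distances are non-zero while their correlation distance is 0 - the choice of metric changes what "similar" means.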
B. Dimensionality reduction –
- High p, low n problem: having many more variables/features than observations/samples (p >> n).
- Imagine a gene expression matrix with p = ~20,000 genes and n = 500 samples; this is high-dimensional data.
- Now you can visualize the gene expression profiles between any two samples using a 2D scatter plot, with expression values of one sample in the x-axis and the expression values of the other sample in the y-axis. You can even visualize the expression profile of a 3rd sample by plotting it in the z-axis (3D scatter plot).
- How would you visualize this data for >3 samples?
- Dimensionality reduction algorithms try to reduce the number of features while still retaining as much information as possible about the original data. For example, imagine that out of our p = ~20,000 genes, 10,000 show no variation across samples (like housekeeping or essential genes). Now let's say there are 50 genes, each of which is highly correlated with 200 other genes but not with each other. The information in this large 20,000 x 500 matrix could then be reduced to a 50 x 500 matrix because:
- removing the 10,000 non-varying genes won't reduce the information content of the dataset, and
- each of the 50 genes sufficiently captures the information of the set of 200 genes with which it is correlated. Thus, retaining only these 50 genes is sufficient to capture the information of the entire dataset.
- One of the most common methods for dimensionality reduction is principal component analysis (PCA).
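The filtering intuition above can be sketched in R on a small invented matrix: rows with no variation carry no information and can be dropped without losing anything:

```r
# Toy "expression" matrix: 5 genes x 4 samples (made-up values)
expr = rbind(gene1 = c(5, 5, 5, 5),    # no variation -> uninformative
             gene2 = c(1, 4, 2, 8),
             gene3 = c(2, 8, 4, 16),   # perfectly correlated with gene2
             gene4 = c(5, 5, 5, 5),    # no variation
             gene5 = c(7, 1, 6, 2))

# Drop genes with zero variance across samples
vars = apply(expr, 1, var)
expr.filtered = expr[vars > 0, ]
dim(expr.filtered)  # 3 genes remain
```

In the same spirit, gene3 is redundant given gene2 (they are perfectly correlated), so a further reduction to 2 informative genes would lose almost nothing.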
2. Objectives
In this practical session you will learn how to implement the ideas taught in part II of this lecture series using the R programming language. Throughout this course we will use a common gene expression dataset for acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML) to practically implement the concepts taught during this part of the lecture. This dataset has gene expression values from 72 samples (ALL = 47 and AML = 25) for 7129 genes. The samples have been annotated as follows:
- Samples: These are the patient sample numbers/IDs.
- ALL.AML: Type of tumor, whether the patient had acute myeloid leukemia (AML) or acute lymphoblastic leukemia (ALL).
- BM.PB: Whether the sample was taken from bone marrow (BM) or from peripheral blood (PB).
- T.B.cell: ALL arises from two different types of lymphocytes (T-cell and B-cell). This specifies which for the ALL patients; it is NA for the AML samples.
3. Let's get started - Load data
Load the required data into RStudio and explore it.
We will start by reading the gene expression data. The columns are the samples and the rows are the genes.
all.aml.exp = read.delim("https://www.dropbox.com/s/ulckmlj5k1edywx/ALLcancerdata.txt?dl=1",
header = T, check.names = F)
dim(all.aml.exp)
[1] 7129 72
all.aml.exp[1:10, 1:10]
39 40 42 47 48 49 41 43 44 45
AFFX-BioB-5_at -342 -87 22 -243 -130 -256 -62 86 -146 -187
AFFX-BioB-M_at -200 -248 -153 -218 -177 -249 -23 -36 -74 -187
AFFX-BioB-3_at 41 262 17 -163 -28 -410 -7 -141 170 312
AFFX-BioC-5_at 328 295 276 182 266 24 142 252 174 142
AFFX-BioC-3_at -224 -226 -211 -289 -170 -535 -233 -201 -32 114
AFFX-BioDn-5_at -427 -493 -250 -268 -326 -810 -284 -384 -318 -148
AFFX-BioDn-3_at -656 367 55 -285 -222 709 -167 -420 8 -184
AFFX-CreX-5_at -292 -452 -141 -172 -93 -316 -97 -197 -152 -133
AFFX-CreX-3_at 137 194 0 52 10 27 -12 -60 -148 12
AFFX-BioB-5_st -144 162 500 -134 159 14 -70 -468 17 97
WARNING: If you have problems loading the data, please download this file, store it on your disk, and open it with the following command:
# all.aml.exp = readRDS(xxxx) # xxxx should be replaced with the path to the
# downloaded file!
Use the ideas you learnt in the last exploratory data analysis session to
- (a) calculate the min, max and median expression level for each sample and gene
- (b) plot a histogram/density plot of the median expression values from all samples and genes
- (c) plot a histogram/density plot of the entire dataset. What do you think about its shape?
- HINT - use the apply function
# For all genes
min.val.gene = apply(all.aml.exp, 1, min)
max.val.gene = apply(all.aml.exp, 1, max)
median.val.gene = apply(all.aml.exp, 1, median)
# For all samples
min.val.samples = apply(all.aml.exp, 2, min)
max.val.samples = apply(all.aml.exp, 2, max)
median.val.samples = apply(all.aml.exp, 2, median)
hist(median.val.gene, breaks = "FD")
hist(median.val.samples, breaks = "FD")
plot(density(as.matrix(all.aml.exp)))
You can check what the breaks = "FD" option means by checking this article.
Next we will load the clinical annotation file for this gene expression data and explore it
all.aml.anno = read.delim("https://www.dropbox.com/s/ejgf6mu5ca9uhv2/ALLannotation.txt?dl=1",
header = T)
head(all.aml.anno)
Samples ALL.AML BM.PB T.B.cell Gender
39 39 ALL BM B-cell F
40 40 ALL BM B-cell F
42 42 ALL BM B-cell F
47 47 ALL BM B-cell M
48 48 ALL BM B-cell F
49 49 ALL BM B-cell M
summary(all.aml.anno)
Samples ALL.AML BM.PB T.B.cell Gender
Min. : 1.00 ALL:47 BM:62 B-cell:38 F :23
1st Qu.:18.75 AML:25 PB:10 T-cell: 9 M :26
Median :36.50 NA's :25 NA's:23
Mean :36.50
3rd Qu.:54.25
Max. :72.00
# a sanity check to make sure that the order of patients in the annotation
# and expression matrix is the same
sum(colnames(all.aml.exp) == rownames(all.aml.anno))
[1] 72
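If this sanity check had failed because the two files were in different orders, the annotation could be reordered with match. A sketch on invented sample IDs (not part of the course data):

```r
# Invented example: annotation rows out of order relative to expression columns
exp.samples = c("39", "40", "42")   # column order of the expression matrix
anno = data.frame(Samples = c(42, 39, 40),
                  ALL.AML = c("ALL", "ALL", "AML"))

# match() gives, for each expression column, the matching row index in the annotation
anno.ordered = anno[match(exp.samples, anno$Samples), ]
anno.ordered$Samples  # now 39 40 42, same order as the expression matrix
```

Checking sample order before any joint analysis of expression and annotation avoids silently mislabeling patients.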
4. Data transformation
You see that the distribution of the data is extremely squeezed due to outliers with very high or low values. We will need to make the data more homogeneous, such that our downstream analysis is not affected by these very large magnitude numbers.
We will carry out the following data processing steps. Some of these steps use rather arbitrary values, which come from visually inspecting the data!
- Thresholding: fix the lowest and highest values to 100 and 16,000 respectively
- Filtering: remove genes with
- max / min ≤ 5 or
- (max − min) ≤ 500, where max and min are the maximum and minimum intensities for a particular gene across mRNA samples
- Homogenization: base–10 logarithmic transformation of the entire dataset
- Scaling: standardize the data so that each sample has mean = 0 and variance = 1, computed across genes.
Before we start modifying the data, we will store the original data frame in a variable, so that in case of problems we can revert to the initial data!
all.aml.exp.original = all.aml.exp
Thresholding
all.aml.exp[all.aml.exp < 100] = 100
all.aml.exp[all.aml.exp > 16000] = 16000
min(all.aml.exp)
[1] 100
max(all.aml.exp)
[1] 16000
Filtering
min.val.gene = apply(all.aml.exp, 1, min)
max.val.gene = apply(all.aml.exp, 1, max)
genes.remove = which(max.val.gene/min.val.gene <= 5 | max.val.gene - min.val.gene <=
500)
length(genes.remove)
[1] 3558
all.aml.exp = all.aml.exp[-genes.remove, ]
dim(all.aml.exp)
[1] 3571 72
Homogenization and Scaling
all.aml.exp = log10(all.aml.exp)
all.aml.exp = scale(all.aml.exp)
plot(density(as.matrix(all.aml.exp)))
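As a reminder of what scale does by default: it standardizes each column of the matrix, which in our case means each sample. A toy check on made-up numbers:

```r
# What scale() does, on a toy matrix: each COLUMN ends up with mean 0, sd 1
m = matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)
ms = scale(m)
colMeans(ms)        # ~0 for every column
apply(ms, 2, sd)    # 1 for every column
```

If you instead wanted to standardize each gene (row), you would transpose first, scale, and transpose back.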
As you see from the annotation file above, the dataset has samples from ALL and AML patients. Can you split the gene expression matrix into AML and ALL sets and individually find the gene/probe with the highest median expression in each of the two cancer types?
Why do you think the distribution above looks bi-modal?
all.patients = which(colnames(all.aml.exp) %in% all.aml.anno$Samples[all.aml.anno$ALL.AML == "ALL"])
all.dat = all.aml.exp[, all.patients] # Similarly do for AML
med.exp.all = apply(all.dat, 1, median)
med.exp.all[med.exp.all == max(med.exp.all)]
U06155_s_at
3.41037
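For the AML half of the exercise, the same subsetting pattern applies. Here is the logic on a tiny invented matrix with made-up group labels:

```r
# Toy data: 3 genes x 4 samples, with invented group labels
expr = cbind(s1 = c(1, 5, 2), s2 = c(2, 6, 1), s3 = c(9, 3, 2), s4 = c(8, 4, 1))
rownames(expr) = c("geneA", "geneB", "geneC")
anno = data.frame(Samples = c("s1", "s2", "s3", "s4"),
                  group   = c("ALL", "ALL", "AML", "AML"))

# Subset the columns belonging to one group, then find the row with the highest median
aml.cols = which(colnames(expr) %in% anno$Samples[anno$group == "AML"])
aml.dat = expr[, aml.cols]
med.exp = apply(aml.dat, 1, median)
med.exp[med.exp == max(med.exp)]  # geneA: median of 9 and 8 is 8.5
```

Replacing the toy objects with all.aml.exp and all.aml.anno (and "AML" in the group filter) gives the answer for the real data.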
5. k-means clustering
Another widely used method for grouping observations is k-means clustering. Now we will cluster our dataset using k-means and explore the underlying structure of the data.
Using the most variable, thus informative genes
We first reduce our dataset to the most variable genes, which are expected to carry most of the information about the samples:
## compute variance over all rows of the data.frame (hence genes)
topVar = apply(all.aml.exp, 1, var)
summary(topVar)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0383 0.1460 0.2233 0.3093 0.3498 3.4163
## filtering to keep genes which have a variance greater than the 75th percentile
all.aml.exp.topVar = all.aml.exp[topVar > quantile(topVar, probs = 0.75), ]
dim(all.aml.exp.topVar)
[1] 893 72
performing k-means
We use the function kmeans in R. You can check the options and usage in the help panel on the right. The parameter centers indicates how many clusters are requested.
km = kmeans(x = t(all.aml.exp.topVar), centers = 2, nstart = 10)
km$cluster
39 40 42 47 48 49 41 43 44 45 46 70 71 72 68 69 67 55 56 59 52 53 51 50 54
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 1 1 1 1
57 58 60 61 65 66 63 64 62 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1 1 1 1 1 2 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
17 18 19 20 21 22 23 24 25 26 27 34 35 36 37 38 28 29 30 31 32 33
2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1
str(km)
List of 9
$ cluster : Named int [1:72] 2 2 2 2 2 2 2 2 2 2 ...
..- attr(*, "names")= chr [1:72] "39" "40" "42" "47" ...
$ centers : num [1:2, 1:893] 0.00225 0.38336 -0.50474 0.10282 -0.41109 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "1" "2"
.. ..$ : chr [1:893] "AFFX-HUMRGE/M10098_5_at" "AFFX-HUMRGE/M10098_M_at" "AFFX-HUMRGE/M10098_3_at" "AFFX-HUMGAPDH/M33197_M_at" ...
$ totss : num 42439
$ withinss : num [1:2] 12243 23980
$ tot.withinss: num 36223
$ betweenss : num 6216
$ size : int [1:2] 25 47
$ iter : int 1
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
Just type km in your console and check all results generated. Play around with the centers parameter. See cluster assignments by typing table(km$cluster).
quality of the clustering
We can judge the quality of the clustering by computing the intra-cluster distances, i.e. the sum of the squared distances between each object and the center of the cluster it belongs to. This is called the within sum of squares (WSS). The better the clustering, the smaller the WSS should be. However, it also automatically decreases with increasing k.
What would be WSS if we request a number of clusters equal to the number of data points?
You can check what the WSS is for a particular clustering by typing
km$tot.withinss
[1] 36223.07
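To answer the question above empirically: if we request as many clusters as data points, every point becomes its own center and the WSS collapses to zero. A toy demonstration on simulated points:

```r
set.seed(42)
pts = matrix(rnorm(20), nrow = 10)  # 10 simulated points in 2 dimensions
km.n = kmeans(pts, centers = 10)    # one cluster per point
km.n$tot.withinss                   # 0: every point sits on its own center
```

This is why WSS alone cannot tell us the best k; it always improves as k grows, which motivates looking for an elbow or using the silhouette method below.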
Run k-means for k=2 to k=7 clusters, and for each k check the WSS value. How does WSS evolve with increasing k?
We can also run a little loop to do this:
wss = sapply(2:7, function(k) {
kmeans(x = t(all.aml.exp.topVar), centers = k)$tot.withinss
})
plot(2:7, wss, type = "b", pch = 19, xlab = "Number of clusters K", ylab = "Total within-clusters sum of squares")
Do you see an obvious “elbow” or “kink” in the curve?
Another criterion for the quality of the clustering is the silhouette method (check the slides of the lecture!).
To run the silhouette method, we need to compute the pairwise distances between all objects (i.e. patients) in the data matrix. This is done with the dist function, which can take different metrics (euclidean, …).
## compute the patient-patient distance matrix (this is why we transpose)
D = dist(t(all.aml.exp.topVar))
We now compute the silhouette for a specific kmeans clustering:
library(cluster)
km = kmeans(x = t(all.aml.exp.topVar), centers = 2, nstart = 10)
s = silhouette(km$cluster, D)
plot(s)
Check the average silhouette values at the bottom; repeat this for other values of k.
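The silhouette comparison across several k values can also be collected in a loop. A self-contained sketch on simulated two-group data; for the exercise, swap in t(all.aml.exp.topVar) and the distance matrix D:

```r
library(cluster)
set.seed(1)
# Simulated data with two well-separated groups of 20 points each
x = rbind(matrix(rnorm(40, mean = 0), ncol = 2),
          matrix(rnorm(40, mean = 5), ncol = 2))
D = dist(x)

# Average silhouette width for k = 2..5
avg.sil = sapply(2:5, function(k) {
  km = kmeans(x, centers = k, nstart = 10)
  mean(silhouette(km$cluster, D)[, "sil_width"])
})
names(avg.sil) = 2:5
avg.sil  # the true k = 2 should give the highest average silhouette
```

Unlike WSS, the average silhouette width does not automatically improve with k, so its maximum is a usable criterion for picking the number of clusters.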
6. Hierarchical clustering
Clustering is a method by which we group together similar observations while separating out the dissimilar ones. We will cluster our samples from the cancer dataset to see which samples cluster together or separately.
Using the most variable, thus informative genes
topVar = apply(all.aml.exp, 1, var)
summary(topVar)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0383 0.1460 0.2233 0.3093 0.3498 3.4163
all.aml.exp.topVar = all.aml.exp[topVar > quantile(topVar, probs = 0.75), ]
dim(all.aml.exp.topVar)
[1] 893 72
rm(topVar)
Variability of genes among samples can also be measured using standard deviation or median absolute deviation.
Creating a correlation based distance matrix
cor.mat = cor(all.aml.exp.topVar, method = "spearman")
cor.dist = as.dist(1 - cor.mat)
cor.hc = hclust(cor.dist, method = "ward.D2")
cor.hc = as.dendrogram(cor.hc)
Plotting the dendrogram along with clinical annotations
library(dendextend) # we will need this package to draw dendrograms
plot(cor.hc, las = 2, cex.lab = 0.7) # Its nice, but what does it mean ??
# Creating annotation bars to show along with dendrogram to improve our
# interpretation
cb1 <- ifelse(all.aml.anno$ALL.AML == "ALL", "forestgreen", "firebrick")
cb2 <- ifelse(all.aml.anno$BM.PB == "BM", "grey", "pink")
cb3 <- ifelse(all.aml.anno$T.B.cell == "B-cell", "cyan", "violet")
cb4 <- ifelse(all.aml.anno$Gender == "F", "grey", "black")
cb <- cbind(`ALL/AML` = cb1, `BM/PB` = cb2, `T/B cell` = cb3, Gender = cb4)
plot(cor.hc, las = 2, cex.lab = 0.7)
colored_bars(colors = cb, dend = cor.hc)
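To turn a dendrogram into discrete cluster labels that can be cross-tabulated against annotations, cutree is useful. A sketch on simulated data with known groups; for the real data, the same idea applies to the hclust object, tabulated against all.aml.anno$ALL.AML:

```r
set.seed(1)
# Simulated data: 2 groups of 10 "samples" in 5 dimensions
x = rbind(matrix(rnorm(50, mean = 0), ncol = 5),
          matrix(rnorm(50, mean = 4), ncol = 5))
truth = rep(c("grpA", "grpB"), each = 10)

hc = hclust(dist(x), method = "ward.D2")
clusters = cutree(hc, k = 2)   # cut the tree into 2 clusters
table(clusters, truth)         # cross-tabulate clusters vs known labels
```

A cross-table like this makes it easy to see whether the clusters line up with a clinical variable or are mixing the groups.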
How would you interpret this dendrogram? Do the clusters you observe make any sense? What are the parameters by which the samples cluster together? How many meaningful clusters can you observe? Do you see any relation between the distribution of the data and your clusters?
Play around with the parameters of the clustering and see how the results differ. Change your variability measure, your distance metric (say euclidean), or the linkage method (say average). HINT - use ?dist and ?hclust for the different available methods.
Why does the dendrogram of average linkage look so different from that of complete linkage?
7. Dimensionality reduction - PCA
We will now use principal component analysis to explore the same dataset, and identify directions (i.e. principal components) with maximal variance.
pca = prcomp(t(all.aml.exp.topVar), center = F, scale. = F) # We set these as false as we have already scaled our data
print(pca)
Standard deviations (1, .., p=72):
[1] 21.858584 10.172947 7.163008 6.434863 5.827320 5.233852 4.891237
[8] 4.436312 4.046577 3.696212 3.513138 3.335838 3.271605 3.219171
[15] 3.034724 2.916060 2.857842 2.803584 2.755637 2.671489 2.561401
[22] 2.503069 2.441615 2.415896 2.368323 2.282909 2.259064 2.216639
[29] 2.187112 2.145885 2.119765 2.104553 2.070477 2.049643 1.980785
[36] 1.949340 1.931123 1.920901 1.908356 1.872003 1.861793 1.853526
[43] 1.826156 1.775465 1.761265 1.732545 1.728901 1.687898 1.651187
[50] 1.641938 1.635463 1.627482 1.590959 1.559534 1.531345 1.522194
[57] 1.511371 1.484712 1.472253 1.445925 1.432564 1.429161 1.404952
[64] 1.385672 1.354657 1.343017 1.302371 1.270059 1.235583 1.216153
[71] 1.184771 1.122984
Rotation (n x k) = (893 x 72):
                                  PC1          PC2          PC3  ...
AFFX-HUMRGE/M10098_5_at -1.107890e-02 0.0022593773 1.155177e-01  ...
 [ reached getOption("max.print") -- omitted 892 rows and the remaining PC columns ]
How many principal components do you obtain? Compare this to the dimensions of the matrix using the dim function!
Principal components are ranked by the amount of variance that they explain. This can be visualized using a scree plot, indicating how much variance each PC explains:
plot(pca, type = "l") # First two components explain most of the variability in the data
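The exact proportion of variance per component can be computed from pca$sdev, since each PC's variance is sdev squared. Shown here on a small simulated matrix; the same lines work on the pca object above:

```r
set.seed(1)
x = matrix(rnorm(200), nrow = 20)   # 20 simulated observations x 10 features
pca.toy = prcomp(x)

# Proportion of total variance explained by each PC
var.explained = pca.toy$sdev^2 / sum(pca.toy$sdev^2)
round(100 * var.explained, 1)       # percentage per PC, in decreasing order
cumsum(var.explained)               # cumulative proportion, reaching 1
```

The cumulative sum is handy for questions like "how many PCs are needed to retain 90% of the variance?".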
We can now display the data points (i.e. patients) in the first two principal components. In addition, we can color the dots according to certain clinical parameters:
## Color patients according to ALL / AML
plot(pca$x[, 1], pca$x[, 2], col = cb[, 1], pch = 19, xlab = "PC1", ylab = "PC2")
Color the patients according to the distinction between bone marrow samples and peripheral blood (second column of the cb matrix).
Redo the plot by including some other PCs (e.g. PC1 vs. PC3).
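A sketch of a PC1 vs. PC3 plot on simulated data (grp is an invented two-group coloring; for the exercise, use pca$x and the cb colors from above):

```r
set.seed(1)
x = matrix(rnorm(300), nrow = 30)        # 30 simulated samples x 10 features
grp = rep(c("red", "blue"), each = 15)   # invented two-group coloring
pca.toy = prcomp(x)

# Any pair of PCs can be plotted the same way; here PC1 vs. PC3
plot(pca.toy$x[, 1], pca.toy$x[, 3], col = grp, pch = 19,
     xlab = "PC1", ylab = "PC3")
```

Lower-ranked PCs sometimes separate sample groups that PC1 and PC2 do not, so it is worth scanning a few combinations.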