Unsupervised learning

1. Recap of Part II

A. Unsupervised learning –

  • We use unsupervised learning to discover de novo patterns in our data.
  • We use unsupervised learning when we don’t have a response variable to predict, or lack a priori knowledge about our data.
  • Simply put, unsupervised learning is used to group similar observations together in any given dataset.
  • One of the very first steps in understanding data is measuring distances between observations in the data.
  • Distances can be measured in different ways - Euclidean, correlation-based, Manhattan, etc.
  • Hierarchical clustering is one of the methods used for unsupervised learning.
  • A clustering algorithm tries to group together observations with small distances between them.
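The distance measures mentioned above can be tried out directly in R; a minimal sketch on toy vectors (made-up values, not the course dataset):

```r
## Toy sketch of the distance measures mentioned above (hypothetical values)
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 8)

## dist() computes distances between the rows of a matrix
m <- rbind(x, y)
dist(m, method = "euclidean")   # sqrt(sum((x - y)^2)) = sqrt(30)
dist(m, method = "manhattan")   # sum(abs(x - y)) = 10

## correlation-based distance: 1 - cor; x and y are perfectly
## correlated here, so their correlation distance is 0
1 - cor(x, y)
```

A clustering algorithm working on a correlation distance would thus consider x and y identical, even though their Euclidean distance is large.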

B. Dimensionality reduction –

  • The “high p, low n” problem: having many more variables/features than observations/samples (p >> n).
  • Imagine a gene expression matrix with p = ~20,000 genes and n = 500 samples; this is high-dimensional data.
  • You can visualize the gene expression profiles of any two samples using a 2D scatter plot, with the expression values of one sample on the x-axis and those of the other sample on the y-axis. You can even visualize the expression profile of a 3rd sample by plotting it on the z-axis (3D scatter plot).
  • How would you visualize this data for >3 samples?
  • Dimensionality reduction algorithms essentially try to reduce the number of features while still retaining as much information as possible about the original data. For example, imagine that out of our p = ~20,000 genes, 10,000 genes show no variation across samples (like housekeeping or essential genes). Now let’s say there are 50 genes, each of which is highly correlated with 200 other genes but not with each other. Then the information in this large 20,000 x 500 matrix could be reduced to a 50 x 500 matrix, because (i) removing the 10,000 non-varying genes won’t reduce the information content of the dataset, and (ii) each of the 50 genes sufficiently captures the information of the 200-gene set with which it is correlated. Thus, retaining only these 50 genes is sufficient to capture the information of the entire dataset.
  • One of the most common methods for dimensionality reduction is principal component analysis (PCA).
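The redundancy argument above can be sketched in a few lines of R (a toy matrix with made-up values, not the real expression data): constant features can be dropped without losing information.

```r
## Toy sketch: dropping non-varying features loses no information
set.seed(1)
mat <- rbind(gene1 = rnorm(5),     # varies across the 5 samples
             gene2 = rep(3, 5),    # constant -> carries no information
             gene3 = rnorm(5))
vars <- apply(mat, 1, var)         # variance of each gene (row)
mat.reduced <- mat[vars > 0, ]     # keep only the varying genes
dim(mat.reduced)                   # 2 genes x 5 samples remain
```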

2. Objectives

In this practical session you will learn how to implement the ideas taught in part II of this lecture series using the R programming language. Throughout this course we will use a common gene expression dataset for acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML) to practically implement the concepts taught during this part of the lecture. This dataset contains gene expression values from 72 samples (ALL = 47 and AML = 25) for 7129 genes. The samples have been annotated as follows:

  • Samples: These are the patient sample numbers/IDs.
  • ALL.AML: Type of tumor, whether the patient had acute myeloid leukemia (AML) or acute lymphoblastic leukemia (ALL).
  • BM.PB: Whether the sample was taken from bone marrow (BM) or from peripheral blood (PB).
  • T.B.cell: ALL arises from two different types of lymphocytes (T-cell and B-cell). This specifies which for the ALL patients; it is NA for the AML samples.

3. Let's get started - Load data

Load the required data into RStudio and explore it.

We will start by reading the gene expression data. The columns are the samples and the rows are the genes.

all.aml.exp = read.delim("https://www.dropbox.com/s/ulckmlj5k1edywx/ALLcancerdata.txt?dl=1", 
    header = T, check.names = F)
dim(all.aml.exp)
[1] 7129   72
all.aml.exp[1:10, 1:10]
                  39   40   42   47   48   49   41   43   44   45
AFFX-BioB-5_at  -342  -87   22 -243 -130 -256  -62   86 -146 -187
AFFX-BioB-M_at  -200 -248 -153 -218 -177 -249  -23  -36  -74 -187
AFFX-BioB-3_at    41  262   17 -163  -28 -410   -7 -141  170  312
AFFX-BioC-5_at   328  295  276  182  266   24  142  252  174  142
AFFX-BioC-3_at  -224 -226 -211 -289 -170 -535 -233 -201  -32  114
AFFX-BioDn-5_at -427 -493 -250 -268 -326 -810 -284 -384 -318 -148
AFFX-BioDn-3_at -656  367   55 -285 -222  709 -167 -420    8 -184
AFFX-CreX-5_at  -292 -452 -141 -172  -93 -316  -97 -197 -152 -133
AFFX-CreX-3_at   137  194    0   52   10   27  -12  -60 -148   12
AFFX-BioB-5_st  -144  162  500 -134  159   14  -70 -468   17   97

WARNING: If you have problems loading the data, please download this file, store it on your disk, and open it with the following command:

# all.aml.exp = readRDS(xxxx) # xxxx should be replaced with the path to the
# downloaded file!

Use the ideas you learnt in the last exploratory data analysis session to

  • (a) calculate the min, max and median expression level for each sample and each gene
  • (b) plot a histogram/density plot of the median expression values across all samples and genes
  • (c) plot a histogram/density plot of the entire dataset. What do you think about its shape?
  • HINT - use the apply function

# For all genes
min.val.gene = apply(all.aml.exp, 1, min)
max.val.gene = apply(all.aml.exp, 1, max)
median.val.gene = apply(all.aml.exp, 1, median)

# For all samples
min.val.samples = apply(all.aml.exp, 2, min)
max.val.samples = apply(all.aml.exp, 2, max)
median.val.samples = apply(all.aml.exp, 2, median)

hist(median.val.gene, breaks = "FD")

hist(median.val.samples, breaks = "FD")

plot(density(as.matrix(all.aml.exp)))

You can find out what the breaks = "FD" option means in this article.
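In short, breaks = "FD" applies the Freedman–Diaconis rule, which derives the histogram bin width from the interquartile range; a sketch of the underlying computation on toy data:

```r
## Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)
set.seed(42)
x <- rnorm(1000)
bin.width <- 2 * IQR(x) / length(x)^(1/3)
n.bins <- ceiling(diff(range(x)) / bin.width)
n.bins                                         # roughly the bin count "FD" uses
length(hist(x, breaks = "FD", plot = FALSE)$counts)
```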

Next we will load the clinical annotation file for this gene expression data and explore it

all.aml.anno = read.delim("https://www.dropbox.com/s/ejgf6mu5ca9uhv2/ALLannotation.txt?dl=1", 
    header = T)
head(all.aml.anno)
   Samples ALL.AML BM.PB T.B.cell Gender
39      39     ALL    BM   B-cell      F
40      40     ALL    BM   B-cell      F
42      42     ALL    BM   B-cell      F
47      47     ALL    BM   B-cell      M
48      48     ALL    BM   B-cell      F
49      49     ALL    BM   B-cell      M
summary(all.aml.anno)
    Samples      ALL.AML  BM.PB     T.B.cell   Gender  
 Min.   : 1.00   ALL:47   BM:62   B-cell:38   F   :23  
 1st Qu.:18.75   AML:25   PB:10   T-cell: 9   M   :26  
 Median :36.50                    NA's  :25   NA's:23  
 Mean   :36.50                                         
 3rd Qu.:54.25                                         
 Max.   :72.00                                         
# a sanity check to make sure that the order of patients in the annotation
# and expression matrix is the same
sum(colnames(all.aml.exp) == rownames(all.aml.anno))
[1] 72
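A stricter variant of this sanity check stops with an error on any mismatch, instead of relying on comparing a count by eye; a sketch on toy vectors:

```r
## Toy sketch: stopifnot(identical(...)) fails loudly on any mismatch,
## whereas sum(a == b) must be compared to length(a) manually
a <- c("39", "40", "42")
b <- c("39", "40", "42")
stopifnot(identical(a, b))   # silent if equal, error otherwise
```

For the real data the equivalent check would be stopifnot(identical(colnames(all.aml.exp), rownames(all.aml.anno))).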

4. Data transformation

You see that the distribution of the data is extremely squeezed due to outliers with very high or low values. We will need to make the data more homogeneous, such that our downstream analysis is not affected by these very large magnitude numbers.

We will carry out the following data processing steps. Some of these steps use rather arbitrary values, which come from visually inspecting the data!

  1. Thresholding: fix the lowest and highest values to 100 and 16,000 respectively
  2. Filtering: remove genes with
    • max / min ≤ 5 or
    • (max − min) ≤ 500, where max and min are the maximum and minimum intensities for a particular gene across the mRNA samples
  3. Homogenization: base–10 logarithmic transformation of the entire dataset
  4. Scaling: standardize the data so that, for each sample, the expression values across genes have mean = 0 and variance = 1.

Before we start modifying the data, we will store the original data frame in a variable, so that in case of problems we can revert to the initial data!

all.aml.exp.original = all.aml.exp

Thresholding

all.aml.exp[all.aml.exp < 100] = 100
all.aml.exp[all.aml.exp > 16000] = 16000
min(all.aml.exp)
[1] 100
max(all.aml.exp)
[1] 16000

Filtering

min.val.gene = apply(all.aml.exp, 1, min)
max.val.gene = apply(all.aml.exp, 1, max)
genes.remove = which(max.val.gene/min.val.gene <= 5 | max.val.gene - min.val.gene <= 
    500)
length(genes.remove)
[1] 3558
all.aml.exp = all.aml.exp[-genes.remove, ]
dim(all.aml.exp)
[1] 3571   72

Homogenization and Scaling

all.aml.exp = log10(all.aml.exp)
all.aml.exp = scale(all.aml.exp)
plot(density(as.matrix(all.aml.exp)))

As you can see from the annotation file above, the dataset has samples from ALL and AML patients. Can you split the gene expression matrix into AML and ALL sets and individually find the gene/probe with the highest median expression in the two cancer types?

Why do you think the distribution above looks bi-modal?

all.patients = which(colnames(all.aml.exp) %in% all.aml.anno$Samples[all.aml.anno$ALL.AML == 
    "ALL"])
# 
all.dat = all.aml.exp[, all.patients]  #Similarly do for AML
med.exp.all = apply(all.dat, 1, median)
med.exp.all[med.exp.all == max(med.exp.all)]
U06155_s_at 
    3.41037 

5. k-means clustering

Another widely used method for grouping observations is k-means clustering. We will now cluster our dataset using k-means and explore the underlying structure of the data.

Using the most variable, thus informative genes

We first reduce our dataset to the most variable genes, which are expected to carry most of the information about the samples:

## compute variance over all rows of the data.frame (hence genes)
topVar = apply(all.aml.exp, 1, var)
summary(topVar)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0383  0.1460  0.2233  0.3093  0.3498  3.4163 
## filtering to keep genes whose variance is greater than the 75th
## percentile
all.aml.exp.topVar = all.aml.exp[topVar > quantile(topVar, probs = 0.75), ]
dim(all.aml.exp.topVar)
[1] 893  72

performing k-means

We use the function kmeans in R. You can check the options and usage in the help panel on the right. The parameter centers indicates how many clusters are requested.

km = kmeans(x = t(all.aml.exp.topVar), centers = 2, nstart = 10)

km$cluster
39 40 42 47 48 49 41 43 44 45 46 70 71 72 68 69 67 55 56 59 52 53 51 50 54 
 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  1  2  2  2  1  1  1  1  1 
57 58 60 61 65 66 63 64 62  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 
 1  1  1  1  1  2  1  1  1  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2 
17 18 19 20 21 22 23 24 25 26 27 34 35 36 37 38 28 29 30 31 32 33 
 2  2  2  2  2  2  2  2  2  2  2  1  1  1  1  1  1  1  1  1  1  1 
str(km)
List of 9
 $ cluster     : Named int [1:72] 2 2 2 2 2 2 2 2 2 2 ...
  ..- attr(*, "names")= chr [1:72] "39" "40" "42" "47" ...
 $ centers     : num [1:2, 1:893] 0.00225 0.38336 -0.50474 0.10282 -0.41109 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "1" "2"
  .. ..$ : chr [1:893] "AFFX-HUMRGE/M10098_5_at" "AFFX-HUMRGE/M10098_M_at" "AFFX-HUMRGE/M10098_3_at" "AFFX-HUMGAPDH/M33197_M_at" ...
 $ totss       : num 42439
 $ withinss    : num [1:2] 12243 23980
 $ tot.withinss: num 36223
 $ betweenss   : num 6216
 $ size        : int [1:2] 25 47
 $ iter        : int 1
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"

Just type km in your console and check all the results generated. Play around with the centers parameter. See the cluster assignments by typing table(km$cluster).

quality of the clustering

We can judge the quality of the clustering by computing the intra-cluster distances, i.e. the sum of squared distances between each object and the centroid of its cluster. This is called the within sum of squares (WSS). The better the clustering, the smaller the WSS should be. However, it also automatically decreases with increasing k.

What would be WSS if we request a number of clusters equal to the number of data points?

You can check what the WSS is for a particular clustering by typing

km$tot.withinss
[1] 36223.07
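As a toy check of the limiting case (synthetic points, not the course data): when the number of centers equals the number of data points, each point becomes its own cluster and the WSS drops to exactly 0, which is why WSS alone cannot select the best k.

```r
## Toy illustration: WSS is 0 when k equals the number of points
set.seed(1)
toy <- matrix(rnorm(20), nrow = 10)      # 10 points in 2 dimensions
km.full <- kmeans(toy, centers = 10)     # one cluster per point
km.full$tot.withinss                     # 0: every point sits on its own center
```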

Run k-means for k = 2 to k = 7 clusters, and for each k check the WSS value. How does WSS evolve with increasing k?

We can also run a little loop to do this:

wss = sapply(2:7, function(k) {
    kmeans(x = t(all.aml.exp.topVar), centers = k)$tot.withinss
})
plot(2:7, wss, type = "b", pch = 19, xlab = "Number of clusters K", ylab = "Total within-clusters sum of squares")

Do you see an obvious “elbow” or “kink” in the curve?

Another criterion for the quality of the clustering is the silhouette method (check the slides of the lecture!).

To run the silhouette method, we need to compute the pairwise distances between all objects (i.e. patients) in the data matrix. This is done with the dist function, which can use different metrics (Euclidean, …).

## compute the patient-patient distance matrix (this is why we transpose)
D = dist(t(all.aml.exp.topVar))

We now compute the silhouette for a specific kmeans clustering:

library(cluster)
km = kmeans(x = t(all.aml.exp.topVar), centers = 2, nstart = 10)
s = silhouette(km$cluster, D)
plot(s)

Check the average silhouette values at the bottom; repeat this for other values of k.

6. Hierarchical clustering

Clustering is a method by which we group together similar observations while separating out the dissimilar ones. We will cluster our samples from the cancer dataset to see which samples cluster together or separately.

Using the most variable, thus informative genes

topVar = apply(all.aml.exp, 1, var)
summary(topVar)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0383  0.1460  0.2233  0.3093  0.3498  3.4163 
all.aml.exp.topVar = all.aml.exp[topVar > quantile(topVar, probs = 0.75), ]
dim(all.aml.exp.topVar)
[1] 893  72
rm(topVar)

Variability of genes among samples can also be measured using standard deviation or median absolute deviation.
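As a sketch of those alternatives on a toy matrix (made-up values), the same apply pattern works with sd and mad in place of var:

```r
## Toy sketch: per-gene variability with sd and mad instead of var
set.seed(1)
toy <- matrix(rnorm(30), nrow = 3,
              dimnames = list(paste0("gene", 1:3), NULL))
apply(toy, 1, sd)    # standard deviation of each gene
apply(toy, 1, mad)   # median absolute deviation: robust to outliers
```

The top-variable filter above could then use either measure together with quantile(), exactly as done with the variance.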

Creating a correlation based distance matrix

cor.mat = cor(all.aml.exp.topVar, method = "spearman")
cor.dist = as.dist(1 - cor.mat)
cor.hc = hclust(cor.dist, method = "ward.D2")
cor.hc = as.dendrogram(cor.hc)

Plotting the dendrogram along with clinical annotations

library(dendextend)  # we will need this package to draw dendrograms

plot(cor.hc, las = 2, cex.lab = 0.7)  # It's nice, but what does it mean?

# Creating annotation bars to show along with dendrogram to improve our
# interpretation
cb1 <- ifelse(all.aml.anno$ALL.AML == "ALL", "forestgreen", "firebrick")
cb2 <- ifelse(all.aml.anno$BM.PB == "BM", "grey", "pink")
cb3 <- ifelse(all.aml.anno$T.B.cell == "B-cell", "cyan", "violet")
cb4 <- ifelse(all.aml.anno$Gender == "F", "grey", "black")
cb <- cbind(`ALL/AML` = cb1, `BM/PB` = cb2, `T/B cell` = cb3, Gender = cb4)

plot(cor.hc, las = 2, cex.lab = 0.7)
colored_bars(colors = cb, dend = cor.hc)

How would you interpret this dendrogram? Do the clusters you observe make any sense? What are the parameters by which the samples cluster together? How many meaningful clusters can you observe? Do you see any relation between the distribution of the data and your clusters?

Play around with the parameters of the clustering and see how the results differ. Change your variability measure, your distance metric (say, Euclidean), or the linkage method (say, average). HINT - use ?dist and ?hclust for the different available methods.

Why does the dendrogram obtained with average linkage look so different from the one obtained with complete linkage?
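As a hint: average linkage merges clusters based on the mean of all pairwise distances, while complete linkage uses the maximum, so a few distant points stretch complete-linkage merge heights much more. A toy side-by-side comparison on synthetic data:

```r
## Toy comparison of linkage methods on the same distance matrix
set.seed(1)
toy <- matrix(rnorm(20), nrow = 10)
d <- dist(toy)
hc.avg  <- hclust(d, method = "average")
hc.comp <- hclust(d, method = "complete")
par(mfrow = c(1, 2))
plot(hc.avg,  main = "average linkage")
plot(hc.comp, main = "complete linkage")
## compare the heights at which the final merges happen
c(average = max(hc.avg$height), complete = max(hc.comp$height))
```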

7. Dimensionality reduction - PCA

We will now use principal component analysis to explore the same dataset, and identify directions (i.e. principal components) with maximal variance.

pca = prcomp(t(all.aml.exp.topVar), center = F, scale. = F)  # We set these as false as we have already scaled our data
print(pca)
Standard deviations (1, .., p=72):
 [1] 21.858584 10.172947  7.163008  6.434863  5.827320  5.233852  4.891237
 [8]  4.436312  4.046577  3.696212  3.513138  3.335838  3.271605  3.219171
[15]  3.034724  2.916060  2.857842  2.803584  2.755637  2.671489  2.561401
[22]  2.503069  2.441615  2.415896  2.368323  2.282909  2.259064  2.216639
[29]  2.187112  2.145885  2.119765  2.104553  2.070477  2.049643  1.980785
[36]  1.949340  1.931123  1.920901  1.908356  1.872003  1.861793  1.853526
[43]  1.826156  1.775465  1.761265  1.732545  1.728901  1.687898  1.651187
[50]  1.641938  1.635463  1.627482  1.590959  1.559534  1.531345  1.522194
[57]  1.511371  1.484712  1.472253  1.445925  1.432564  1.429161  1.404952
[64]  1.385672  1.354657  1.343017  1.302371  1.270059  1.235583  1.216153
[71]  1.184771  1.122984

Rotation (n x k) = (893 x 72):
                                    PC1           PC2           PC3
AFFX-HUMRGE/M10098_5_at   -1.107890e-02  0.0022593773  1.155177e-01
 [ output truncated -- loadings for PC4-PC72 appear in the console ]
 [ reached getOption("max.print") -- omitted 892 rows ]

How many principal components do you obtain? Compare this to the dimension of the matrix using the dim function!

Principal components are ranked by the amount of variance that they explain. This can be visualized using a scree plot, indicating how much variance each PC explains:

plot(pca, type = "l")  # The first two components explain most of the variability in the data
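The fractions behind such a scree plot can also be computed directly from the standard deviations that prcomp returns; a sketch on a toy matrix with synthetic values:

```r
## Toy sketch: variance explained per PC from pca$sdev
set.seed(1)
toy <- matrix(rnorm(200), nrow = 20)   # 20 observations, 10 variables
pca.toy <- prcomp(toy)
var.explained <- pca.toy$sdev^2 / sum(pca.toy$sdev^2)
round(var.explained, 3)                # fraction of variance per PC
cumsum(var.explained)                  # cumulative fraction (ends at 1)
```

summary(pca.toy) reports the same proportions in its "Proportion of Variance" row.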

We can now display the data points (i.e. patients) in the first two principal components. In addition, we can color the dots according to certain clinical parameters:

## Color patients according to ALL / AML
plot(pca$x[, 1], pca$x[, 2], col = cb[, 1], pch = 19, xlab = "PC1", ylab = "PC2")

Color the patients according to the distinction between bone marrow samples and peripheral blood (second column of the cb matrix)

Redo the plot by including some other PCs (e.g. PC1 vs. PC3).

Ashwini Kumar Sharma, Carl Herrmann

2019-11-15