Basic statistics - Hypothesis testing
1. Objectives
In this section, we want to learn how to use and interpret hypothesis tests. We will mainly focus on test on mean values, and will later (if time permits…) continue with tests on proportions.
2. Load the data and first analysis
We will work again on the ALL/AML expression dataset that we worked on for the exploratory data analysis. We cleaned up the dataset to remove extreme values, and to include gene symbols for better interpretation of the results.
all.aml = read.delim('http://bioinfo.ipmb.uni-heidelberg.de/crg/datascience3fs/practicals/data/all.aml.cleaned.csv',header=TRUE)
all.aml.anno = read.delim("http://bioinfo.ipmb.uni-heidelberg.de/crg/datascience3fs/practicals/data/all.aml.anno.cleaned.csv", header=TRUE)
#
# we convert all.aml into a data matrix rather than a data.frame
all.aml = data.matrix(all.aml)
We first check how the values are distributed for a gene across all the samples (= patients). For that, we take a random gene, and plot the histogram of the values across the samples. You should run this code several times, as at each run, another random gene is chosen!
# sample a random line from the expression table
i = sample(1:nrow(all.aml),1)
#
# plot the histogram
hist(all.aml[i,],main=rownames(all.aml)[i],col='lightblue',breaks=20)
We can now verify the normality of the distribution using a QQ-plot
# sample a random line from the expression table
i = sample(1:nrow(all.aml),1)
#
# plot the histogram
qqnorm(all.aml[i,],main=rownames(all.aml)[i]); qqline(all.aml[i,])
If the data is normaly distributed, the dots representing the quantiles should lie on the straight line. Here again, run the code several times, and check the qq-plot for several random genes!
Search for the line corresponding to the gene TP53 using the
grep
function (check the help!) inside the vector of row namesrownames(all.aml)
and check the normality of the distribution for this important oncogene.
The QQ-plot is a usefull tool to visualy inspect the normality; however, it does not give a quantitative measure of the normality!! We will see later on how we can determine in a more quantitative manner the normality.
3. first statistical test
We want to use a statistical test to check if, for a given gene,
- there is a significant difference in the mean expression of this gene between two groups (for example ALL vs AML, 2-sample test); or
- this gene has an expression which is significantly above a certain value (1-sample test)
This is exactly what mean tests such as the t-test or the Wilcoxon tests are designed for!
two sample t-test
We will now compare the expression of certain genes between ALL and AML patients. For a gene X, the question is :
Is there a significance difference in expression of gene X between ALL and AML patients?
Formulate the \(H_0\) hypothesis in this case!
Take the gene EIF4B as an example, an let us check if there is a significant different in gene expression between the 2 groups.
# get the column number of the ALL patients
i.all = grep('ALL', all.aml.anno$ALL.AML)
#
# find the line number of the gene SF1
i.gene = grep('CCND3',rownames(all.aml))
#
# now let us plot the expression in the 2 groups
boxplot(list(ALL=all.aml[i.gene,i.all],AML=all.aml[i.gene,-i.all]),main=rownames(all.aml)[i.gene])
Does that look like a big expression difference? Redo this analysis for the gene CCND3 and OS9. What is your impression?
Now let us perform a statistical test to check if the H0 hypothesis for each of these 3 genes is valid or not.
- Compute the t-value for a 2-sample t.test for each of the genes:
# find the line number of the gene EIF4B
i.gene = grep('EIF4B',rownames(all.aml))
#
# now let us plot the expression in the 2 groups
t.obs=t.test(all.aml[i.gene,i.all],all.aml[i.gene,-i.all])$statistic
t.obs
t
0.9454213
Is this a large value (in absolute)? What value would be typically get if there was no difference (that is, if \(H_0\) is true)?
Building the H0 distribution
Let’s compute the t-value for EIF4B for 2 randomly chosen groups within the cohort. The rationale here: there is not reason why between 2 random groups, there should be a significant difference, right?
We create 2 random groups of patients, each one having the size of the ALL / AML group
# sample random columns
i.randomALL = sample(1:ncol(all.aml),length(i.all))
#
# compute t-test between the 2 random groups
t.h0=t.test(all.aml[i.gene,i.randomALL],
all.aml[i.gene,-i.randomALL])$statistic
t.h0
t
1.559719
repeat this several times to get various t-values over various random groups
We can repeat this 10000 times over 10000 random splits of the cohorts
t.values = sapply(1:10000,function(i) {
#random columns
i.randomALL = sample(1:ncol(all.aml),length(i.all))
#
# compute t-test between the 2 random groups
t.test(all.aml[i.gene,i.randomALL],all.aml[i.gene,-i.randomALL])$statistic
})
#
# plot histogram
hist(t.values);abline(v=t.obs,lwd=3,col='red')
This distribution approximates the \(H_0\) distribution of the test statistics.
What does the distribution look like?? Compute the mean and standard deviation of the t.values; plot a qq-plot
[1] 0.0562028
[1] 1.031468
This is a standard normal distribution (actually, a t-distribution with degrees of freedom the number of patients -2)!
We can overlay the empirical \(H_0\) distribution that we have just determined with theoretical \(H_0\) distribution, which is a t-distribution with 70 degrees of freedom.
x = seq(-3,3,by=0.2)
y = dt(x,df=ncol(all.aml)-2)
hist(t.values,freq=FALSE);lines(x,y,type='l',col= 'red',lwd=3)
Pretty close, no?
Computing the p-value
We can place the value of the observed t-statistics between the real ALL and AML subgroups on this histogram:
#i.gene = grep('CCND3',rownames(all.aml))
#
# now let us plot the expression in the 2 groups
#t.obs=t.test(all.aml[i.gene,i.all],all.aml[i.gene,-i.all])$statistic
##
hist(t.values,xlim=c(-10,10));abline(v=t.obs,lty=3,lwd=4,col='red')
Since we are performing a 2-sided t-test here (remember the \(H_0\) hypothesis formulation!), we have to check what the probability is to get more extreme t-values under \(H_0\).
Check in how many of these 10000 random splits we get a t-value that is more extreme than the observed, hence : lower that -abs(t.obs) or larger than abs(t.obs). Normalize this number to the 10000 random splits to obtain the p-value!!
[1] 0.3651
Instead of using the empirical \(H_0\) distribution, we can use the theoretical distribution (t-distribution):
df = ncol(all.aml)-2
##
## this is the one-sided p-value for the upper tail test
pt(t.obs, df =df, lower.tail=FALSE)
t
0.1738479
Finally, we can compare these computed p-value with the p-value returned by the function t.test
:
Welch Two Sample t-test
data: all.aml[i.gene, i.all] and all.aml[i.gene, -i.all]
t = 0.94542, df = 52.374, p-value = 0.3488
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-334.3810 930.3572
sample estimates:
mean of x mean of y
2767.468 2469.480
Do you consider this to be significant?
Compute the p-value for a one-sided upper-tail test. Formulate the question and the \(H_0\) hypothesis.
Redo this whole analysis for CCND3!
4. What if the data is not normally distributed?
At the beginning, we checked if the distribution of the data in the samples corresponds to a normal distribution.
Check again the distribution of the expression values for the gene FOSB
This looks everything but normal! In that case, we cannot apply the t.test, but need to apply a non-parametric test called the Wilcoxon test (check the lecture notes!). This test is performed not on the values (like the t-test) but on the ranks of these values (remember the difference between the Pearson and the Spearman correlation!)
Wilcoxon rank sum test with continuity correction
data: exp.all and exp.aml
W = 451, p-value = 0.1077
alternative hypothesis: true location shift is not equal to 0
Compare the obtained p-value with the p-value obtained if we would have used the t-test:
Welch Two Sample t-test
data: exp.all and exp.aml
t = -2.3055, df = 32.06, p-value = 0.02776
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2878.6785 -178.1539
sample estimates:
mean of x mean of y
1809.064 3337.480
The p-values are very different!! So is the difference of expression between ALL and AML patients for this gene significant or not taking \(\alpha=0.05\)?
Here, we cannot trust the t-test due to the non-normality of the data! Hence, the correct p-value is the one from the Wilcoxon test.
Check the normality of the expression of the RING1 gene; compare the two test for the RING1 gene
# expression of GRM4 over all samples
expression = all.aml['RING1',]
qqnorm(expression);qqline(expression)
#
#
exp.all = all.aml['RING1',i.all]
exp.aml = all.aml['RING1',-i.all]
wilcox.test(exp.all,exp.aml);t.test(exp.all,exp.aml)
Wilcoxon rank sum test with continuity correction
data: exp.all and exp.aml
W = 517.5, p-value = 0.411
alternative hypothesis: true location shift is not equal to 0
Welch Two Sample t-test
data: exp.all and exp.aml
t = -0.67503, df = 43.919, p-value = 0.5032
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-349.9799 174.3646
sample estimates:
mean of x mean of y
1372.872 1460.680
5. testing proportions
The proportion test (Fisher Extact Test or chi2 test) are used to investigate the relationship between 2 categorical variables, starting from a contingency table. We will use a dataset with clinical informations about breast cancer patients.
Check which variables in this dataset are categorical/ordinal/numerical. We can now check if there is a significant relationship between some variables. For example, we can verify if the choice of treatment with tamoxifen (variable hormon
) is related to the pre-/post-monopausal status (variable meno
)
First, we can build the contingency table table for these 2 variables:
had tamoxifen no tamoxifen
Postmenopausal 187 209
premenopausal 59 231
Compute the odds-ratio (OR) for these two variables. How would the odds-ratio look like if you would transpose the matrix?
Now we can run the Fisher Exact Test (FET), or the chi2-test.
Fisher's Exact Test for Count Data
data: tab
p-value = 2.475e-13
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
2.444410 5.049548
sample estimates:
odds ratio
3.496639
Check if your computation of the odds-ratio is right! Compute also the two 1-sided test.
Now we want to verify the impact of age on the grade of the tumor. We categorize the patients in under and over 40 year groups, and perform a chi-squared test:
1 2 3
FALSE 6 41 26
TRUE 75 403 135
Pearson's Chi-squared test
data: tab
X-squared = 6.9515, df = 2, p-value = 0.03094
6. More advanced…
Find a list of marker genes that allow to distinguish ALL from AML. Take into account the problem of multiple testing (see lecture notes!) and correct for that.