Basic statistics - Hypothesis testing

1. Objectives
2. Load the data and first analysis
3. first statistical test
4. What if the data is not normally distributed?
5. testing proportions
6. More advanced…

1. Objectives

In this section, we want to learn how to use and interpret hypothesis tests. We will mainly focus on test on mean values, and will later (if time permits…) continue with tests on proportions.

2. Load the data and first analysis

We will work again on the ALL/AML expression dataset that we worked on for the exploratory data analysis. We cleaned up the dataset to remove extreme values, and to include gene symbols for better interpretation of the results.

all.aml = read.delim('http://bioinfo.ipmb.uni-heidelberg.de/crg/datascience3fs/practicals/data/all.aml.cleaned.csv',header=TRUE)
all.aml.anno = read.delim("http://bioinfo.ipmb.uni-heidelberg.de/crg/datascience3fs/practicals/data/all.aml.anno.cleaned.csv", header=TRUE)
#
# we convert all.aml into a data matrix rather than a data.frame
all.aml = data.matrix(all.aml)

We first check how the values are distributed for a gene across all the samples (= patients). For that, we take a random gene, and plot the histogram of the values across the samples. You should run this code several times, as at each run, another random gene is chosen!

# sample a random line from the expression table
i = sample(1:nrow(all.aml),1)
#
# plot the histogram
hist(all.aml[i,],main=rownames(all.aml)[i],col='lightblue',breaks=20)

We can now verify the normality of the distribution using a QQ-plot

# sample a random line from the expression table
i = sample(1:nrow(all.aml),1)
#
# plot the histogram
qqnorm(all.aml[i,],main=rownames(all.aml)[i]); qqline(all.aml[i,])

If the data is normaly distributed, the dots representing the quantiles should lie on the straight line. Here again, run the code several times, and check the qq-plot for several random genes!

Search for the line corresponding to the gene TP53 using the grep function (check the help!) inside the vector of row names rownames(all.aml) and check the normality of the distribution for this important oncogene.

The QQ-plot is a usefull tool to visualy inspect the normality; however, it does not give a quantitative measure of the normality!! We will see later on how we can determine in a more quantitative manner the normality.

3. first statistical test

We want to use a statistical test to check if, for a given gene,

there is a significant difference in the mean expression of this gene between two groups (for example ALL vs AML, 2-sample test); or
this gene has an expression which is significantly above a certain value (1-sample test)

This is exactly what mean tests such as the t-test or the Wilcoxon tests are designed for!

two sample t-test

We will now compare the expression of certain genes between ALL and AML patients. For a gene X, the question is :

Is there a significance difference in expression of gene X between ALL and AML patients?

Formulate the \(H_0\) hypothesis in this case!

Take the gene EIF4B as an example, an let us check if there is a significant different in gene expression between the 2 groups.

# get the column number of the ALL patients
i.all = grep('ALL', all.aml.anno$ALL.AML)
#
# find the line number of the gene SF1
i.gene = grep('CCND3',rownames(all.aml))
#
# now let us plot the expression in the 2 groups
boxplot(list(ALL=all.aml[i.gene,i.all],AML=all.aml[i.gene,-i.all]),main=rownames(all.aml)[i.gene])

Does that look like a big expression difference? Redo this analysis for the gene CCND3 and OS9. What is your impression?

Now let us perform a statistical test to check if the H0 hypothesis for each of these 3 genes is valid or not.

Compute the t-value for a 2-sample t.test for each of the genes:

# find the line number of the gene EIF4B
i.gene = grep('EIF4B',rownames(all.aml))
#
# now let us plot the expression in the 2 groups
t.obs=t.test(all.aml[i.gene,i.all],all.aml[i.gene,-i.all])$statistic
t.obs

        t 
0.9454213

Is this a large value (in absolute)? What value would be typically get if there was no difference (that is, if \(H_0\) is true)?

Building the H0 distribution

Let’s compute the t-value for EIF4B for 2 randomly chosen groups within the cohort. The rationale here: there is not reason why between 2 random groups, there should be a significant difference, right?

We create 2 random groups of patients, each one having the size of the ALL / AML group

# sample random columns 
i.randomALL  = sample(1:ncol(all.aml),length(i.all))
#
# compute t-test between the 2 random groups
t.h0=t.test(all.aml[i.gene,i.randomALL],
            all.aml[i.gene,-i.randomALL])$statistic
t.h0

       t 
1.559719

repeat this several times to get various t-values over various random groups

We can repeat this 10000 times over 10000 random splits of the cohorts

t.values = sapply(1:10000,function(i) {
  #random columns 
  i.randomALL  = sample(1:ncol(all.aml),length(i.all))
  #
  #   compute t-test between the 2 random groups
  t.test(all.aml[i.gene,i.randomALL],all.aml[i.gene,-i.randomALL])$statistic
})
#
# plot histogram
hist(t.values);abline(v=t.obs,lwd=3,col='red')

This distribution approximates the \(H_0\) distribution of the test statistics.

What does the distribution look like?? Compute the mean and standard deviation of the t.values; plot a qq-plot

mean(t.values)

[1] 0.0562028

sd(t.values)

[1] 1.031468

qqnorm(t.values);qqline(t.values)

This is a standard normal distribution (actually, a t-distribution with degrees of freedom the number of patients -2)!

We can overlay the empirical \(H_0\) distribution that we have just determined with theoretical \(H_0\) distribution, which is a t-distribution with 70 degrees of freedom.

x = seq(-3,3,by=0.2)
y = dt(x,df=ncol(all.aml)-2)
hist(t.values,freq=FALSE);lines(x,y,type='l',col= 'red',lwd=3)

Pretty close, no?

Computing the p-value

We can place the value of the observed t-statistics between the real ALL and AML subgroups on this histogram:

#i.gene = grep('CCND3',rownames(all.aml))
#
# now let us plot the expression in the 2 groups
#t.obs=t.test(all.aml[i.gene,i.all],all.aml[i.gene,-i.all])$statistic
##
hist(t.values,xlim=c(-10,10));abline(v=t.obs,lty=3,lwd=4,col='red')

Since we are performing a 2-sided t-test here (remember the \(H_0\) hypothesis formulation!), we have to check what the probability is to get more extreme t-values under \(H_0\).

Check in how many of these 10000 random splits we get a t-value that is more extreme than the observed, hence : lower that -abs(t.obs) or larger than abs(t.obs). Normalize this number to the 10000 random splits to obtain the p-value!!

p.value = sum(t.values < -abs(t.obs) | t.values > abs(t.obs))/length(t.values)
p.value

[1] 0.3651

Instead of using the empirical \(H_0\) distribution, we can use the theoretical distribution (t-distribution):

df = ncol(all.aml)-2
##
## this is the one-sided p-value for the upper tail test
pt(t.obs, df =df, lower.tail=FALSE)

        t 
0.1738479

##
## this is the two-sided p-value
p.2sided = 2*pt(t.obs, df =df, lower.tail=FALSE)

Finally, we can compare these computed p-value with the p-value returned by the function t.test:

t.test(all.aml[i.gene,i.all],all.aml[i.gene,-i.all])


    Welch Two Sample t-test

data:  all.aml[i.gene, i.all] and all.aml[i.gene, -i.all]
t = 0.94542, df = 52.374, p-value = 0.3488
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -334.3810  930.3572
sample estimates:
mean of x mean of y 
 2767.468  2469.480

Do you consider this to be significant?

Compute the p-value for a one-sided upper-tail test. Formulate the question and the \(H_0\) hypothesis.

Redo this whole analysis for CCND3!

4. What if the data is not normally distributed?

At the beginning, we checked if the distribution of the data in the samples corresponds to a normal distribution.

Check again the distribution of the expression values for the gene FOSB

expression = all.aml['FOSB',]
hist(expression,breaks = 30);qqnorm(expression)

This looks everything but normal! In that case, we cannot apply the t.test, but need to apply a non-parametric test called the Wilcoxon test (check the lecture notes!). This test is performed not on the values (like the t-test) but on the ranks of these values (remember the difference between the Pearson and the Spearman correlation!)

exp.all = all.aml['FOSB',i.all]
exp.aml = all.aml['FOSB',-i.all]
wilcox.test(exp.all,exp.aml)


    Wilcoxon rank sum test with continuity correction

data:  exp.all and exp.aml
W = 451, p-value = 0.1077
alternative hypothesis: true location shift is not equal to 0

Compare the obtained p-value with the p-value obtained if we would have used the t-test:

exp.all = all.aml['FOSB',i.all]
exp.aml = all.aml['FOSB',-i.all]
t.test(exp.all,exp.aml)


    Welch Two Sample t-test

data:  exp.all and exp.aml
t = -2.3055, df = 32.06, p-value = 0.02776
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2878.6785  -178.1539
sample estimates:
mean of x mean of y 
 1809.064  3337.480

The p-values are very different!! So is the difference of expression between ALL and AML patients for this gene significant or not taking \(\alpha=0.05\)?

Here, we cannot trust the t-test due to the non-normality of the data! Hence, the correct p-value is the one from the Wilcoxon test.

Check the normality of the expression of the RING1 gene; compare the two test for the RING1 gene

# expression of GRM4 over all samples
expression = all.aml['RING1',]
qqnorm(expression);qqline(expression)

#
#
exp.all = all.aml['RING1',i.all]
exp.aml = all.aml['RING1',-i.all]
wilcox.test(exp.all,exp.aml);t.test(exp.all,exp.aml)


    Wilcoxon rank sum test with continuity correction

data:  exp.all and exp.aml
W = 517.5, p-value = 0.411
alternative hypothesis: true location shift is not equal to 0


    Welch Two Sample t-test

data:  exp.all and exp.aml
t = -0.67503, df = 43.919, p-value = 0.5032
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -349.9799  174.3646
sample estimates:
mean of x mean of y 
 1372.872  1460.680

5. testing proportions

The proportion test (Fisher Extact Test or chi2 test) are used to investigate the relationship between 2 categorical variables, starting from a contingency table. We will use a dataset with clinical informations about breast cancer patients.

dat = read.delim('http://bioinfo.ipmb.uni-heidelberg.de/crg/datascience3fs/practicals/data/gbsg_ba_ca.dat', stringsAsFactors = FALSE)

Check which variables in this dataset are categorical/ordinal/numerical. We can now check if there is a significant relationship between some variables. For example, we can verify if the choice of treatment with tamoxifen (variable hormon) is related to the pre-/post-monopausal status (variable meno)

First, we can build the contingency table table for these 2 variables:

## build contingency table
table(dat$meno,dat$hormon)

                
                 had tamoxifen no tamoxifen
  Postmenopausal           187          209
  premenopausal             59          231

Compute the odds-ratio (OR) for these two variables. How would the odds-ratio look like if you would transpose the matrix?

Now we can run the Fisher Exact Test (FET), or the chi2-test.

## build contingency table
tab = table(dat$meno,dat$hormon)
#
## run the FET
fisher.test(tab)


    Fisher's Exact Test for Count Data

data:  tab
p-value = 2.475e-13
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 2.444410 5.049548
sample estimates:
odds ratio 
  3.496639

Check if your computation of the odds-ratio is right! Compute also the two 1-sided test.

Now we want to verify the impact of age on the grade of the tumor. We categorize the patients in under and over 40 year groups, and perform a chi-squared test:

## contingency table
tab = table(dat$age>40,dat$grade)
tab

       
          1   2   3
  FALSE   6  41  26
  TRUE   75 403 135

##
chisq.test(tab)


    Pearson's Chi-squared test

data:  tab
X-squared = 6.9515, df = 2, p-value = 0.03094

6. More advanced…

Find a list of marker genes that allow to distinguish ALL from AML. Take into account the problem of multiple testing (see lecture notes!) and correct for that.