Basic statistics - Probability Distributions
#1. Introduction and Objectives
In the previous lectures you learnt about the major types of probability distributions, such as the Normal, Binomial, Negative Binomial and Student's t. In this tutorial you will experiment with these distributions in practice, make calculations and plot your results. You will also infer relations among these distributions and, hopefully, develop a feeling for when each of them is relevant.
Distributions drawn with sparse (vertical) lines represent discrete outcomes (for example, the roll of a die always gives an integer value from 1 to 6). Distributions drawn with dense (continuous) lines represent continuous outcomes, i.e. real numbers (for example, the heights of people living in Heidelberg).
#2. Let's get started
R has built-in functions for almost all common probability distributions (see below).
All of these functions follow the same naming scheme: the root name of the distribution prefixed by one of p, d, q or r. For example, for the Normal distribution the four variants pnorm, dnorm, qnorm and rnorm are available.
Function type (prefix) | Meaning
---|---
p- | (probability) gives the cumulative distribution function (cdf)
d- | (density) gives the probability density function (pdf)
q- | (quantile) gives the inverse cumulative distribution function
r- | (random) returns randomly generated numbers for the distribution
Here you can see all the available standard probability distributions and their respective function variants. In this tutorial we will focus on the first five. Click on each distribution name to see its details. Almost all of the functions follow the same syntax; if you know how to execute one, you know all of them. Play around with the examples given in the help section. For example, type `example(pnorm)` to see examples related to the Normal distribution.
Distribution | Functions | | |
---|---|---|---|---
Normal | pnorm | qnorm | dnorm | rnorm
Poisson | ppois | qpois | dpois | rpois
Binomial | pbinom | qbinom | dbinom | rbinom
Negative Binomial | pnbinom | qnbinom | dnbinom | rnbinom
Student t | pt | qt | dt | rt
Chi-Square | pchisq | qchisq | dchisq | rchisq
Beta | pbeta | qbeta | dbeta | rbeta
Cauchy | pcauchy | qcauchy | dcauchy | rcauchy
Exponential | pexp | qexp | dexp | rexp
F | pf | qf | df | rf
Gamma | pgamma | qgamma | dgamma | rgamma
Geometric | pgeom | qgeom | dgeom | rgeom
Hypergeometric | phyper | qhyper | dhyper | rhyper
Logistic | plogis | qlogis | dlogis | rlogis
Log Normal | plnorm | qlnorm | dlnorm | rlnorm
Studentized Range | ptukey | qtukey | | 
Uniform | punif | qunif | dunif | runif
Weibull | pweibull | qweibull | dweibull | rweibull
Wilcoxon Rank Sum Statistic | pwilcox | qwilcox | dwilcox | rwilcox
Wilcoxon Signed Rank Statistic | psignrank | qsignrank | dsignrank | rsignrank
#3. Normal/Gaussian distribution
The normal or the Gaussian distribution is given as -
\[P(x) = \frac{1}{{\sigma \sqrt {2\pi } }} \cdot e ^ \frac{-(x- \mu)^2}{{2\sigma ^2 }} \] where \(\mu\) is the mean of the distribution and \(\sigma\) is the standard deviation.
The standard normal distribution is a special case of normal distribution where the values for \(\sigma = 1\) and \(\mu = 0\). Thus, the above equation for the Normal distribution simplifies to -
\[P(x) = \frac{1}{{\sqrt {2\pi } }} \cdot e ^ \frac{-x^2}{2} \] Now for any \(x\) we can easily solve this equation since \(\pi\) and \(e\) are known constants.
Let's generate three normal distributions with different means and visualize them together with histograms of random draws -
```r
## use the dnorm function to compute the density of each distribution
x = seq(-10, 30, by = .1)
d1 = dnorm(x, mean = 0, sd = 1)
d2 = dnorm(x, mean = 10, sd = 1)
d3 = dnorm(x, mean = 20, sd = 1)
# compare with histograms built from 1000 random numbers drawn from each distribution
r1 = rnorm(n = 1000, mean = 0, sd = 1) # standard normal distribution
r2 = rnorm(n = 1000, mean = 10, sd = 1)
r3 = rnorm(n = 1000, mean = 20, sd = 1)
hist(r1, breaks = seq(-10, 30, by = .5), probability = TRUE, ylim = c(0, .5), main = "", xlab = "")
hist(r2, breaks = seq(-10, 30, by = .5), probability = TRUE, add = TRUE)
hist(r3, breaks = seq(-10, 30, by = .5), probability = TRUE, add = TRUE)
lines(x, d1, col = "red", lwd = 4)
lines(x, d2, col = "blue", lwd = 4)
lines(x, d3, col = "green", lwd = 4)
```
Play with the mean and sd parameters and visualize the distributions (plain lines) as well as the corresponding histograms. What happens when you remove the argument `probability=TRUE` from the histogram function?
Now we will use the Normal distribution to make predictions about the gene expression of TP53 in lung cancer. TP53 is the most commonly mutated gene across multiple cancer types, especially lung cancers. We will read a table containing measurements of TP53 expression levels in 586 patients.
```r
tp53.exp = read.table("https://www.dropbox.com/s/rwopdr8ycmdg8bd/TP53_expression_LungAdeno.txt?dl=1", header = TRUE, sep = "\t")[, 1:2]
summary(tp53.exp)
```
```
           Samples    TP53_expression
 TCGA-05-4244-01:  1   Min.   :  92.6
 TCGA-05-4245-01:  1   1st Qu.: 911.8
 TCGA-05-4249-01:  1   Median :1313.3
 TCGA-05-4250-01:  1   Mean   :1380.8
 TCGA-05-4382-01:  1   3rd Qu.:1778.1
 TCGA-05-4384-01:  1   Max.   :4703.9
 (Other)        :580   NA's   :69
```
We will remove all the missing values and calculate the mean and standard deviation for the TP53 gene expression.
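The commands for this step are not shown above; a minimal sketch (the read.table line repeats the download so the snippet stands alone, and the variable name `exp.values` is my addition):

```r
# re-read the table, then drop the missing expression values
tp53.exp = read.table("https://www.dropbox.com/s/rwopdr8ycmdg8bd/TP53_expression_LungAdeno.txt?dl=1",
                      header = TRUE, sep = "\t")[, 1:2]
exp.values = na.omit(tp53.exp$TP53_expression)
mean(exp.values)
sd(exp.values)
```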
```
[1] 1380.822
[1] 719.5934
```
Let's see how well a normal distribution with \(\mu = 1380.822\) and \(\sigma = 719.5934\) approximates the real distribution of TP53 expression. We assume that the population mean and standard deviation are similar to those calculated above, since we cannot measure the expression of TP53 in each and every lung cancer patient in the world.
```r
# na.rm = TRUE is needed because the expression column contains missing values
plot(density(tp53.exp$TP53_expression, na.rm = TRUE), main = "", xlim = c(-1500, 6000), lwd = 3)
# make a normal distribution with the above parameters
set.seed(123)
d.pred = rnorm(n = 517, mean = 1380.822, sd = 719.5934) # 517 = number of non-missing samples
lines(density(d.pred), col = "red", lwd = 3)
```
Using a normal distribution with \(\mu = 1380.822\) and \(\sigma = 719.5934\), we will ask the following questions -
– (Q1) What is the probability of observing the expression of TP53 to be less than 1000?
– (Q2) What is the probability of observing the expression of TP53 to be greater than 1000?
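The calls behind these answers are not shown above; a minimal sketch using `pnorm` with the fitted parameters (the second question can be answered either by subtraction or with `lower.tail = FALSE`):

```r
# (Q1) P(expression < 1000) under Normal(mean = 1380.822, sd = 719.5934)
pnorm(q = 1000, mean = 1380.822, sd = 719.5934)
# (Q2) P(expression > 1000), two equivalent ways
1 - pnorm(q = 1000, mean = 1380.822, sd = 719.5934)
pnorm(q = 1000, mean = 1380.822, sd = 719.5934, lower.tail = FALSE)
```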
```
[1] 0.298327
[1] 0.701673
[1] 0.701673
```
Let's check how well these predictions match the real data.
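One way to perform this check (a sketch: the proportion of measured values below/above 1000; the read.table line repeats the download so the snippet stands alone):

```r
tp53.exp = read.table("https://www.dropbox.com/s/rwopdr8ycmdg8bd/TP53_expression_LungAdeno.txt?dl=1",
                      header = TRUE, sep = "\t")[, 1:2]
exp.values = na.omit(tp53.exp$TP53_expression)
mean(exp.values < 1000) # empirical fraction of patients with expression < 1000
mean(exp.values > 1000) # empirical fraction of patients with expression > 1000
```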
```
[1] 0.2978723
[1] 0.7021277
```
I would say those predictions are pretty darn good!! Now let's try to break this model. Re-execute the code above with different \(q\) values, e.g. q=100, q=500, q=4000, q=4500. At what values do you think the model will not perform well? HINT: look at the tails of the distribution.
Again using a normal distribution with \(\mu = 1380.822\) and \(\sigma = 719.5934\), what if we ask these kinds of questions -
– (Q1) What is the TP53 expression at the 10th percentile?
– (Q2) What is the TP53 expression at the 90th percentile?
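Again the commands are elided; a sketch using `qnorm`, the inverse cumulative distribution function:

```r
# expression at the 10th and 90th percentile of Normal(mean = 1380.822, sd = 719.5934)
qnorm(p = 0.10, mean = 1380.822, sd = 719.5934)
qnorm(p = 0.90, mean = 1380.822, sd = 719.5934)
```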
```
[1] 458.626
[1] 2303.018
```
Let's check how well these predictions match our real data.
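The empirical counterpart of `qnorm` is the `quantile` function on the measured values (a sketch; the read.table line repeats the download so the snippet stands alone):

```r
tp53.exp = read.table("https://www.dropbox.com/s/rwopdr8ycmdg8bd/TP53_expression_LungAdeno.txt?dl=1",
                      header = TRUE, sep = "\t")[, 1:2]
# na.rm = TRUE because the expression column contains missing values
quantile(tp53.exp$TP53_expression, probs = c(0.10, 0.90), na.rm = TRUE)
```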
```
      10%       90%
 478.2844 2240.1913
```
Again the predictions are pretty good. Re-execute the code above with p=0.25, p=0.5, p=0.75 etc. and check how good the predictions are.
Now let's plot the sample quantiles against the theoretical quantiles to check the similarity between the two. This is called a quantile-quantile plot, or Q-Q plot.
```r
q = seq(0, 1, 0.01) # creating a vector of quantiles
# find the values corresponding to these quantiles in the real data
# (na.rm = TRUE because the expression column contains missing values)
q.observed = quantile(tp53.exp$TP53_expression, probs = q, na.rm = TRUE)
# find the values corresponding to these quantiles in the theoretical normal distribution
q.theoretical = qnorm(p = q, mean = 1380.822, sd = 719.5934)
# plot the observed against the theoretical quantiles
plot(x = q.theoretical, y = q.observed)
abline(0, 1)
```
#4. Poisson distribution
A Poisson distribution can be defined as - \[P(x) = \frac{{e^{ - \lambda } \lambda ^x }}{{x!}}\]
where \(\lambda\) is the mean, which also equals the variance, and \(x\) is the number of events observed.
An experiment follows a Poisson distribution if -
- The outcomes are of only two types (binary) - success or failure, yes or no, present or absent etc
- The mean number of successes \(\lambda\) that occur in a specified interval is known.
- The probability of success is proportional to the size of the interval.
- The probability of success in a very small interval approaches zero.
- Intervals can be - length, area, volume, time, etc.
A real example of an experiment following Poisson distribution is gene mutation
- The outcomes are of only two types (binary) - mutated vs not mutated
- The average number of mutations \(\lambda\) that occur in a specified interval (gene length) is known. Humans have a mutation rate of \(10^{-8} mutations/bp/generation\) (see here)
- The probability of mutation is proportional to the length of the gene. Longer genes accumulate more mutations.
- The probability of mutation in a very small genomic region is close to zero
Let's perform the following calculation -
- Human mutation rate = \(1 \times 10^{-8} mutations/bp/replication\)
- Human genome size = \(3 \times 10^{9}\) bp
- Thus, the number of mutations accumulated after one generation = \(10^{-8} \cdot 3 \times 10^{9} = 30\)
Now what if we ask the question: what is the probability of observing 10 or fewer mutations, using a Poisson model built with the parameter \(\lambda\) calculated above?
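The corresponding call is not shown; a sketch using `ppois`, the cumulative distribution function of the Poisson distribution:

```r
# P(X <= 10) for a Poisson distribution with lambda = 30
ppois(q = 10, lambda = 30)
```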
```
[1] 2.234878e-05
```
Plot the probabilities for \(q=0:60\) and see how they change.
A different and more intuitive way of visualizing the data is to plot the probability densities, i.e. the probability of each exact mutation count. We do the same as before using the Poisson distribution.
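The plotting code is not shown here; a minimal sketch using `dpois` (type = "h" draws the sparse vertical lines appropriate for a discrete distribution):

```r
# probability of observing exactly x mutations, for x = 0..60
x = 0:60
plot(x, dpois(x, lambda = 30), type = "h", lwd = 2,
     xlab = "Number of mutations", ylab = "Probability")
```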
We clearly see that it is much more likely to get 30 mutations than any other number.
Can you make the above prediction using the normal distribution instead? HINT: \(\lambda = mean = variance\); see the equation above. Derive the standard deviation from the variance. What difference do you see? What is your interpretation?
```r
pd = ppois(q = 0:60, lambda = 30)
nd = pnorm(q = 0:60, mean = 30, sd = sqrt(30)) # since var = sd^2
plot(pd, col = "red", type = "l", xlab = "Average Mutations", ylab = "Probability")
lines(nd, col = "grey")
```
#5. Binomial distribution
A binomial distribution can be defined as -
\[P(x) = \frac{n!}{x!(n-x)!}\cdot p^x \cdot (1-p)^{n-x}\]
Where \(x\) is the number of successes out of \(n\) experiments and \(p\) is the probability of success.
- \(mean = n \cdot p\)
- \(variance = np \cdot (1 - p)\)
- \(sd = \sqrt{np \cdot (1 - p)}\)
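These moment formulas can be checked numerically against the probability mass function `dbinom` (an illustrative check, not part of the original text; n and p are chosen to match the coin example below):

```r
# verify mean = n*p and variance = n*p*(1-p) for n = 100, p = 0.5
n = 100; p = 0.5; x = 0:n
m = sum(x * dbinom(x, size = n, prob = p))           # should equal n*p = 50
v = sum((x - m)^2 * dbinom(x, size = n, prob = p))   # should equal n*p*(1-p) = 25
```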
The design of the experiment is as follows -
- The experiments are repeated and are independent of one another
- Each experiment has just two outcomes
- The probability of success is constant and does not change with experiments
A real example would be coin toss -
- Flip a coin 100 times. Each of those flips is independent; one flip does not affect the result of another flip.
- Outcomes are limited to heads or tails.
- The probability of success is constant - 0.5 on every trial.
We can simulate the above coin-flipping experiment as below and clearly see that the number of successes peaks at 50.
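The simulation itself is not shown in the original; one possible sketch using `rbinom` (the seed and variable names are my additions):

```r
# 10000 experiments, each flipping a fair coin 100 times; count the heads
set.seed(42)
heads = rbinom(n = 10000, size = 100, prob = 0.5)
hist(heads, breaks = 30, main = "", xlab = "Number of heads in 100 flips")
```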
How would you simulate the effect of an unfair coin, say one which lands heads 2/3 of the time?
What can you do using the other function variants `pbinom`, `qbinom` and `rbinom`?
#6. References

1. http://www.stat.umn.edu/geyer/old/5101/rlook.html
2. https://www.huber.embl.de/msmb/Chap-Generative.html
3. https://stattrek.com