Confidence interval

The confidence interval (CI) is an interval computed from a sample that contains the (unknown) expectation value of a distribution with a given confidence level, here 95%. This means that if we draw 100 independent samples of this random variable and compute the interval for each of them, the true expectation value \(\mu\) will lie inside the interval for about 95 of them.

Let us try a simulation: we consider a random variable distributed according to a normal distribution \(\mathcal{N}(\mu,\sigma)\). Here, we know the true value of the expectation value. We want to get an estimate for \(\mu\), and check if the confidence interval contains the true expectation value.

## these are the parameters of the normal distribution
m <- 175 # this is the true value of the expectation value
s <- 20 # this is the true value of the standard deviation

We now consider 100 samples of \(N=5\) realizations of the random variable. For each sample we compute the mean \(m_N\) and the sample standard deviation \(s_N\) over these \(N=5\) realizations, determine the confidence interval, and check how often the expectation value \(\mu\) is inside the confidence interval. Remember that the CI is given by \[ \left[m_N-t_{95,N-1}\frac{s_N}{\sqrt{N}},\; m_N+t_{95,N-1}\frac{s_N}{\sqrt{N}}\right] \] where \(t_{95,N-1}\) is the critical value of the \(t\)-distribution with \(N-1\) degrees of freedom. Since \(\sigma\) is unknown in practice, it is replaced by its estimate \(s_N\), which is why the \(t\)-distribution is used.
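Before running the full simulation, the formula can be checked on a single sample. The following is a minimal sketch, assuming the same parameters as above (mean 175, standard deviation 20); the seed and the names `ci_manual` and `ci_ttest` are chosen here for illustration only. It compares the interval computed by hand with the one returned by the base R function `t.test()`.

```r
set.seed(42)                          # arbitrary seed, for reproducibility only
N <- 5
x <- rnorm(N, mean = 175, sd = 20)    # one sample of size N

m_N <- mean(x)                        # sample mean
s_N <- sd(x)                          # sample standard deviation
tc  <- qt(0.975, df = N - 1)          # critical value of the t-distribution

# CI computed directly from the formula
ci_manual <- c(m_N - tc * s_N / sqrt(N), m_N + tc * s_N / sqrt(N))

# CI as reported by the one-sample t-test
ci_ttest <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

print(ci_manual)
print(ci_ttest)
```

Both intervals should agree up to numerical precision.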

set.seed(123) # fix the random seed for reproducibility
# size of the sample
N <- 5
df <- N-1 #degrees of freedom of the t-distribution
#
# we now draw 100 times samples of size N=5
X <- sapply(1:100,function(i) {rnorm(N,mean=m,sd=s)})
# we compute the mean
Xm <- apply(X,2,mean)
# and the sample standard deviation
Xsd <- apply(X,2,sd) 
#
#
tc <- qt(0.975,df) # this is the critical value for the t-distribution
Xl <- Xm-tc*Xsd/sqrt(N) # lower bound of the CI
Xh <- Xm+tc*Xsd/sqrt(N) # upper bound of the CI

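col <- c('red','blue')
# TRUE if the true value m lies outside the CI, FALSE if it lies inside
outside <- Xl > m | Xh < m
plot(Xm,ylim=c(100,250),pch=20,ylab="",main=paste("Mean values and confidence intervals, N =",N))
abline(h=m,lty=3)
# draw each CI as a vertical bar: red (index 1) if m is inside, blue (index 2) if outside
invisible(lapply(seq_along(Xl),function(i) {points(c(i,i),c(Xl[i],Xh[i]),type="l",col=col[outside[i]+1],lwd=2)}))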

Here, the red/blue bars represent the confidence intervals, the black dots the means of the samples, and the dotted line at m the true expectation value. Whenever the true expectation value is within the CI, the bar is red; if not, the bar is blue. How often is the true expectation value outside the CI? Count the blue bars!

This happens 4 times out of 100, which fits well with the expected 5%.
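Instead of counting bars by eye, the misses can be counted in code, and with many more than 100 samples the fraction should settle close to 5%. The following is a sketch under the same assumptions as the simulation above; `miss_rate` is a hypothetical helper introduced here, and the seed and the number of repetitions are arbitrary choices.

```r
# Hypothetical helper (not part of the original script): draw n_rep samples
# of size N from a normal distribution with mean m and standard deviation s,
# and return the fraction of 95% CIs that do not contain m.
miss_rate <- function(N, n_rep, m = 175, s = 20) {
  X   <- replicate(n_rep, rnorm(N, mean = m, sd = s))  # N x n_rep matrix
  Xm  <- colMeans(X)                                   # sample means
  Xsd <- apply(X, 2, sd)                               # sample standard deviations
  tc  <- qt(0.975, df = N - 1)                         # critical t value
  Xl  <- Xm - tc * Xsd / sqrt(N)                       # lower bounds
  Xh  <- Xm + tc * Xsd / sqrt(N)                       # upper bounds
  mean(Xl > m | Xh < m)                                # fraction of misses
}

set.seed(1)                       # arbitrary seed, for reproducibility only
rate <- miss_rate(N = 5, n_rep = 10000)
print(rate)                       # should be close to 0.05
```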

Changing the sample size

We redo this simulation, but now with samples of size \(N=50\) (again 100 times).

##
set.seed(321)

N <- 50
df <- N-1
X <- sapply(1:100,function(i) {rnorm(N,m,s)})
Xm <- apply(X,2,mean) 
Xsd <- apply(X,2,sd) 
tc <- qt(c(0.975),df)
Xl <- Xm-tc*Xsd/sqrt(N)
Xh <- Xm+tc*Xsd/sqrt(N)

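col <- c('red','blue')
# TRUE if the true value m lies outside the CI, FALSE if it lies inside
outside <- Xl > m | Xh < m

plot(Xm,ylim=c(100,250),pch=20,ylab="",main=paste("Mean values and confidence intervals, N =",N))
abline(h=m,lty=3)
# draw each CI as a vertical bar: red (index 1) if m is inside, blue (index 2) if outside
invisible(lapply(seq_along(Xl),function(i) {points(c(i,i),c(Xl[i],Xh[i]),type="l",col=col[outside[i]+1],lwd=2)}))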

What do we observe? The CIs are narrower, which reflects the fact that we are estimating the mean from a larger sample and hence more accurately: the width of the interval shrinks roughly like \(1/\sqrt{N}\). The black dots are also closer to the true value (dotted line).
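The narrowing can be quantified with a short calculation. The sketch below uses the true standard deviation (20) in place of the sample standard deviation, so it only approximates the typical half-width of the interval; `half_width` is a hypothetical helper introduced here for illustration.

```r
s <- 20   # true standard deviation, used as a stand-in for the sample value

# Hypothetical helper: approximate half-width of the 95% CI for sample size N
half_width <- function(N, s) qt(0.975, df = N - 1) * s / sqrt(N)

hw5  <- half_width(5, s)    # sample size used in the first simulation
hw50 <- half_width(50, s)   # sample size used in the second simulation

print(c(N5 = hw5, N50 = hw50))
print(hw5 / hw50)           # factor by which the interval shrinks
```

The half-width drops from roughly 24.8 to roughly 5.7. The gain comes both from the factor \(1/\sqrt{N}\) and from the smaller critical value of the \(t\)-distribution with more degrees of freedom.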

How often is the true expectation value outside the CI? Again, count the blue bars…

This happens 3 times.
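Neither count is exactly 5, and this is expected: each of the 100 independent intervals misses the true value with probability 0.05, so the number of misses follows a binomial distribution. The following sketch shows which counts are plausible under that model; the variable names are chosen here for illustration.

```r
n_ci   <- 100    # number of confidence intervals in each simulation
p_miss <- 0.05   # probability that a single 95% CI misses the true value

expected_misses <- n_ci * p_miss                      # mean of the binomial
prob_3_or_4 <- sum(dbinom(3:4, size = n_ci, prob = p_miss))

# central range containing about 95% of the possible miss counts
plausible <- qbinom(c(0.025, 0.975), size = n_ci, prob = p_miss)

print(expected_misses)
print(prob_3_or_4)
print(plausible)
```

Counts of 3 or 4 misses lie well inside the plausible range, so both simulations are consistent with a 95% confidence level.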


MoBi Data Analysis - WS1920

Carl Herrmann
