Introduction

The goal of these sessions is to investigate how the prediction of transcription factor binding sites (TFBS) can be improved by using additional information (epigenetic signals, sequence conservation, chromatin accessibility) on top of the sequence-only information provided by a sequence profile (PWM).
This additional information could be, for example, histone modification data (ChIP-seq), chromatin accessibility measured e.g. by DNase I hypersensitivity assays, DNA methylation, etc.
As a gold standard to decide whether a predicted TFBS is a true or a false hit, we will use ChIP-seq datasets for the transcription factor of interest: every predicted TFBS overlapping a ChIP peak will be considered a true positive, while those outside ChIP peaks will be considered false positives.

In the following, we will
  1. Predict potential binding sites using the PWM alone;
  2. Evaluate the rate of false positives/true positives using as a control ChIP-seq datasets for the TF of interest;
  3. Determine the signal of the different additional features around true/false binding sites, to see if there is a difference;
  4. Build a model incorporating additional features using logistic regression;
  5. Validate the model using independent datasets.
  6. Perform variable selection to obtain the simplest, yet accurate model.
Datasets: for each TF, there is an R .rds file available containing a data frame: each row corresponds to a potential TFBS on chromosome 21, predicted using a position weight matrix; each column corresponds to a feature; the numerical entries give the score of the potential TFBS for the corresponding feature.
  • accessibility : DNase I hypersensitivity data
  • DNAme : DNA methylation data
  • phylop : phylogenetic conservation of the bases
  • H3kxxx: corresponding histone mark around the site
  • LLR : log-likelihood ratio (=score) of the predicted site (based only on sequence information!)
In addition, the last column (tfbs) indicates whether the site is a true positive (i.e. the predicted site overlaps with a ChIP-seq peak) or not (i.e. no overlap).

To use the data, download the file (e.g. STAT3.rds) onto your computer and load it as:
> data = readRDS('/path/to/your/file/STAT3.rds')
> head(data)
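As a quick sanity check (a small sketch; it only assumes the columns described above, in particular the 0/1 tfbs column and the LLR score), you can inspect the size of the table and the class balance:
> dim(data)              # number of candidate sites and number of features
> table(data$tfbs)       # class balance: false positives (0) vs. true positives (1)
> summary(data$LLR)      # range of the PWM log-likelihood scores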

Here are the links to the specific datasets; select one TF and click on the link to download the .rds file:
  1. CTCF
  2. EBF1
  3. JUNB

1. Evaluate the Precision/Recall based on sequence information

Goal: We want to evaluate the baseline performance of a prediction model that is based only on a position weight matrix (PWM), i.e. only on sequence information.
We have scanned the sequence of human chromosome 21 with PWMs for a number of transcription factors. The output file for each TF contains one line per predicted TFBS, together with the log-likelihood score of the match (column LLR) and an indication of whether the predicted TFBS can be considered a true positive (i.e. overlaps with a ChIP-seq peak for this TF) or a false positive (i.e. does NOT overlap with a ChIP-seq peak) (column tfbs).

Hence, we have two sets of TFBS:
  1. one positive set which intersects true ChIP-seq peaks (with a 1 in the column tfbs)
  2. one negative set which does not (with a 0 in the column tfbs)
Goal:
  1. by filtering on the LLR column, select all TFBS with an LLR above a certain threshold (try 50, 55, ...);
  2. based on the tfbs column, compute the precision;
  3. plot the precision as a function of the LLR cutoff (it should contain at least 6 points, i.e. 6 different thresholds on the LLR value!); a minimal sketch follows this list.
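A minimal sketch for steps 1-3, assuming the LLR and tfbs columns described above; the thresholds are only illustrative:
> thresholds = seq(50, 75, by=5)                           # example LLR cutoffs
> precision = sapply(thresholds, function(t) {
+     hits = data[data$LLR >= t, ]                         # predicted sites above the cutoff
+     sum(hits$tfbs == 1) / nrow(hits)                     # TP / (TP + FP)
+ })
> plot(thresholds, precision, type="b", xlab="LLR cutoff", ylab="Precision")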
What is the performance of the sequence-based prediction? Does the curve show the expected behavior?
How would you evaluate the recall? Here is a list of files with ChIP-seq peaks for each TF; can you use these to estimate an upper bound on the recall? (A sketch follows the list below.)
  1. CEBPB peaks
  2. CTCF peaks
  3. EBF1 peaks
  4. JUNB peaks
  5. PU.1 peaks
  6. YY1 peaks
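A very rough sketch for the recall upper bound, assuming (hypothetically) that the downloaded peak file is a BED-like text file with one peak per line and the chromosome name in the first column; since the candidate sites cover chromosome 21 only, the peaks are restricted accordingly:
> peaks = read.table("CTCF_peaks.bed")              # hypothetical file name and format
> peaks.chr21 = peaks[peaks[, 1] == "chr21", ]      # assumes UCSC-style chromosome names
> # each positive candidate site recovers at most one peak, so this ratio bounds the recall:
> sum(data$tfbs == 1) / nrow(peaks.chr21)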

2. Scoring TFBS with additional features

We will now explore the additional genomic features for all the potential TFBS; in particular, we will look for correlations between the features.

a. boxplot

We will explore the difference between positive and negative sites by looking at boxplots of each feature for positive vs. negative TFBS:
  1. make a boxplot showing the distribution of the signal for positive and negative sites, for each feature;
  2. run a Wilcoxon test to check for significant differences (a sketch for both steps follows below).
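A minimal sketch for both steps, assuming all columns except tfbs are numeric features (the plotting grid may need to be adapted to the number of features):
> features = setdiff(colnames(data), "tfbs")
> par(mfrow=c(3, 5))                                   # adjust the grid to the number of features
> for (f in features) boxplot(data[[f]] ~ data$tfbs, main=f, xlab="tfbs", ylab=f)
> pvals = sapply(features, function(f) wilcox.test(data[[f]] ~ data$tfbs)$p.value)
> sort(pvals)                                          # smallest p-values = clearest differences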

b. correlation structure

Compute the correlation between the features using the cor function in R, and plot the correlation matrix as a heatmap
> library(pheatmap)
> cor.ft = cor(data[,-16],method="spearman")
> diag(cor.ft) = NA
> pheatmap(cor.ft)

Which features are most correlated to each other? Does this make sense?
We need to keep in mind the correlations between the variables, as they might impact the results of the regression analysis later. Correlated variables are usually harmful in regression approaches, as they reduce the accuracy of the estimated regression coefficients of the correlated variables. One solution would be to use ridge regression or the LASSO from the glmnet package.
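To make the strongest correlations explicit, you can list the feature pairs above some cutoff (a sketch reusing the cor.ft matrix computed above; the 0.7 cutoff is an arbitrary illustrative choice):
> high.cor = which(abs(cor.ft) > 0.7, arr.ind=TRUE)
> high.cor = high.cor[high.cor[, 1] < high.cor[, 2], , drop=FALSE]    # keep each pair once
> data.frame(feature1 = rownames(cor.ft)[high.cor[, 1]],
+            feature2 = colnames(cor.ft)[high.cor[, 2]],
+            rho = cor.ft[high.cor])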

3. Building a logistic regression model

We now come to the core of the machine learning analysis and train a logistic regression to optimally distinguish the positive from the negative TFBS predictions. This will reveal which variables are best suited to discriminate true from false positives.
We will use the glm function in R, using the binomial model.

a. first fit of model

Training the model
We will now apply the glm function in R, assuming that our binary output variable follows a binomial distribution with parameter p. All available variables will be included in the model.
Run the glm function by specifying the binomial distribution as family:
> lrfit <- glm(tfbs ~ ., data=data,family="binomial")
> summary(lrfit)

Check the output of the summary function (a sketch for inspecting the fit follows the questions below):
  • what are the most relevant variables?
  • what does the AIC represent?
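A small sketch for inspecting the fit (using the lrfit object from above): the coefficient table can be sorted by p-value, and the AIC can be extracted directly (AIC = 2k - 2*log-likelihood, with k the number of fitted parameters):
> coefs = coef(summary(lrfit))               # estimate, std. error, z value and p-value per variable
> coefs[order(coefs[, "Pr(>|z|)"]), ]        # most significant variables first
> AIC(lrfit)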
Evaluate the performance
Now that we have a model, we can test how well it performs by using it as a predictor:

> pred = predict(lrfit,newdata=data,type="response")

This yields the predicted probabilities.

  1. Make a boxplot showing the predicted probabilities for the true (tfbs=1) vs. false (tfbs=0) sites. Do the predictions make sense?
  2. Set a prediction threshold (e.g. 0.1), make a contingency matrix (tfbs 0/1 vs. above/below the prediction threshold) and compute the accuracy = (TP+TN)/(TP+TN+FP+FN); repeat this at various thresholds!
  3. By applying different thresholds on the predicted probabilities and comparing with the true status (0/1), compute the precision. Try to integrate this new result into a graph together with the precision computed at the beginning using only sequence information. Did we improve?
  4. What does the cross-entropy look like? (A sketch covering items 2-4 follows below.)
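A minimal sketch covering items 2-4, using the pred vector from above; the 0.1 threshold is only one example:
> thr = 0.1
> pred.class = as.integer(pred >= thr)
> conf = table(truth = data$tfbs, predicted = pred.class)       # contingency matrix
> accuracy = sum(diag(conf)) / sum(conf)
> precision = sum(pred.class == 1 & data$tfbs == 1) / sum(pred.class == 1)
> eps = 1e-15                                                   # avoid log(0)
> cross.entropy = -mean(data$tfbs * log(pred + eps) + (1 - data$tfbs) * log(1 - pred + eps))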

b. cross validation to evaluate the performance

Of course we have cheated here: we are evaluating the performance of the model on the same data that was used to train it, thus biasing the evaluation. The proper way to evaluate the performance of a machine-learning model is cross-validation: splitting the data into a training and a test set, learning the model on the training set and applying it to the test set, thus evaluating the performance on data not seen during training.
Creating a partition
We will create a training dataset with 75% of the data, and keep the remaining 25% for the testing dataset; we need to make sure to sample 75% of the positive and the negative samples!
    > i.pos = which(data$tfbs==1)
    > i.pos.train = sample(i.pos,floor(0.75*length(i.pos)))
    > i.neg = which(data$tfbs==0)
    > i.neg.train = sample(i.neg,floor(0.75*length(i.neg)))
    > i.train = c(i.pos.train,i.neg.train)
    > train.set = data[i.train,]
    > test.set = data[-i.train,]
    
Train the model:
Run the glm function to fit the model on the training set:
> lrFit = glm(tfbs~ .,data=train.set,family="binomial")

Applying the model:
Remember that we defined a training and a test set? Now that we have a fitted model, we can apply it to the left-out test set.
  1. run the following function
    > pred <- predict(lrFit,newdata=test.set,type="response")
  2. Determine the precision here (use one particular threshold; a sketch follows below).
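For example, with a single illustrative threshold of 0.1 on the predicted probabilities:
> pred.class = as.integer(pred >= 0.1)
> sum(pred.class == 1 & test.set$tfbs == 1) / sum(pred.class == 1)    # precision on the test set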
Did we improve compared to the sequence-only performance?

4. Variable selection

We now have a model that includes all variables. We can try to simplify the model by removing "uninteresting" variables that do not contribute to its performance.
How would you select variables that can be safely removed?

a. subset selection

We can apply the built-in function step, which iteratively eliminates variables:
> lrfit <- glm(tfbs ~ ., data=train.set,family="binomial")
> lr.red <- step(lrfit)

  1. What is happening here? What criterion is used to eliminate variables?
  2. Check the final model stored in lr.red: which variables does it contain? Is this expected?
  3. Apply this reduced model to the left-out test.set (a sketch follows this list).
  4. Did the performance in terms of precision or recall get worse?
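A sketch for items 3-4, reusing test.set and the example threshold from the cross-validation step:
> pred.red = predict(lr.red, newdata=test.set, type="response")
> red.class = as.integer(pred.red >= 0.1)                                  # same example threshold
> sum(red.class == 1 & test.set$tfbs == 1) / sum(red.class == 1)           # precision
> sum(red.class == 1 & test.set$tfbs == 1) / sum(test.set$tfbs == 1)       # recall within the candidate set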

b. shrinkage methods

The glmnet package implements ridge regression and LASSO shrinkage methods. If it is not installed, install the package with
> install.packages('glmnet')
Check the tutorial for this package!
Ridge regression
Start by performing a ridge regression using the following code:
> library(glmnet)
> var = data[,-16]
> out=as.integer(data[,16])
> fit.ridge = glmnet(data.matrix(var),out,alpha=0,family="binomial")
Now you can plot the coefficients as the regularization parameter lambda is varied:
> plot(fit.ridge,label=TRUE,lwd=2)
The numbers on the lines correspond to the column numbers of the corresponding variables.
Can you interpret the plot? Is it consistent with the previous stepwise variable selection?
LASSO
The LASSO implements a different regularization compared to ridge regression (L1 regularization), which has the effect of setting a number of coefficients exactly to zero, hence performing a selection of relevant variables:
> library(glmnet)
> var = data[,-16]
> out=as.integer(data[,16])
> fit.lasso = glmnet(data.matrix(var),out,alpha=1,family="binomial")
Now you can plot the coefficients as the regularization parameter lambda is varied:
> plot(fit.lasso,label=TRUE,lwd=2)
Is the selection of variables consistent with the backward procedure?
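To pick a concrete value of lambda and read off which variables the LASSO keeps, cross-validation with cv.glmnet can be used (a sketch; lambda.min is the value minimizing the cross-validated deviance):
> cv.fit = cv.glmnet(data.matrix(var), out, alpha=1, family="binomial")
> plot(cv.fit)                                  # cross-validated deviance as a function of log(lambda)
> coef(cv.fit, s="lambda.min")                  # non-zero coefficients = variables kept by the LASSO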