The goal of these sessions is to investigate how we can improve the prediction of transcription factor binding sites (TFBS) by using additional information beyond the binding profile itself, which is given as a position weight matrix (PWM).
This additional information could be, for example, histone modification datasets (ChIP-seq), chromatin accessibility measured e.g. by DNase I hypersensitivity assays, DNA methylation, ...
As a gold standard to determine whether a predicted TFBS is a true or false hit, we will consider ChIP-seq datasets for the transcription factor of interest: every predicted TFBS overlapping with a ChIP peak will be considered a true positive, while those outside ChIP peaks will be false positives.

In the following, we will
  1. Predict potential binding sites using the PWM alone;
  2. Evaluate the rate of false positives/true positives using as a control ChIP-seq datasets for the TF of interest;
  3. Determine the profile of the different additional features around true/false binding sites;
  4. Build a model incorporating additional features using logistic regression;
  5. Validate the model using independent datasets.
Datasets: the required files are located in /bfg/data/courses/bioinfo2/, with subfolders containing all the data needed:
  • histone: bigwig files for the histone modifications
  • dnase: DNase I hypersensitivity data
  • methyl: DNA methylation bigwig
  • phylop: phylogenetic conservation of the bases
  • sequence: fasta sequence of chromosome 21
The position weight matrices and the ChIP-seq peaks for the transcription factors can be downloaded here. Make a local directory bioinfo2 with subdirectories results/ and data/, create a symbolic link to the data folder, and extract the ChIP-seq peaks and the PWM files.
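The setup step can be sketched as follows; the link name shared is our own choice (the course material does not prescribe one), so adapt it to your taste:

```shell
# Create the suggested working layout and link the shared course data.
# "shared" is a hypothetical link name, not prescribed above.
mkdir -p bioinfo2/results bioinfo2/data
ln -s /bfg/data/courses/bioinfo2 bioinfo2/data/shared
```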

(1) Identifying potential TFBS using PWMs

Goal: Given a position weight matrix for a transcription factor of interest, we want to predict potential TFBS using only the information contained in the PWM. For reasons of computational simplicity, we will restrict the analysis to chromosome 21.

A. Scanning a sequence with a PWM

  1. Go to the RSAT website and select matrix-scan-quick in the left menu.
  2. Upload the fasta sequence of chr21 in the sequence section.
  3. Select a PWM in the pwm folder (file ending with .tab) and upload the corresponding file in the Matrix section.
  4. In the dropdown menus, select "sequence origin": start and "Return": sites + pval.
  5. Set the threshold on the p-values to "pval<=": 1e-3.
  6. Finally, at the bottom of the page, select "Output": email.
  7. When the results are ready, download the result file to your local account and store it as chr21.tfbs.pval.bed (extract columns 1, 5 and 6 — keeping the weight-score column as well if you want to filter on it in step B.4 — and remove the header lines starting with ";").
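The post-processing can be sketched as below. The toy input only mimics the layout of a matrix-scan result (the real file has more columns, including the score); the column numbers follow the instructions above:

```shell
# Build a toy matrix-scan-like result: ";"-prefixed headers, then
# tab-separated rows where columns 5 and 6 hold start and end.
printf '; matrix-scan header line\n' > matscan.txt
printf 'chr21\tsite\tD\t8.2\t100\t110\n' >> matscan.txt
printf 'chr21\tsite\tR\t7.1\t500\t510\n' >> matscan.txt
# Drop the comment lines and keep columns 1, 5 and 6 as a bed file.
grep -v '^;' matscan.txt | cut -f1,5,6 > chr21.tfbs.pval.bed
cat chr21.tfbs.pval.bed
```

If you plan to filter on the PWM score later, extend the cut to include the score column as well (its position depends on the matrix-scan output format).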

B. Evaluate the true positives

  1. In the folder tf_peaks, find the ChIP dataset corresponding to the chosen PWM, and copy it to your folder.
  2. Use bedtools intersect to determine which of the predicted TFBS overlap with the ChIP peaks; for that, use the commands:
    $ /bfg/software/bedtools/2.16.2/bedtools intersect  -a chr21.tfbs.pval.bed -b [file of ChIP-seq peaks as bed file] | cut -f1-3 | awk '{print $0"\tpos_"NR}' > TP.tfbs.bed
    $ /bfg/software/bedtools/2.16.2/bedtools intersect -v -a chr21.tfbs.pval.bed -b [file of ChIP-seq peaks as bed file] | cut -f1-3 | awk '{print $0"\tneg_"NR}' > FP.tfbs.bed
    We now have two sets of TFBS: a positive set which intersects true ChIP-seq peaks (TP.tfbs.bed), and a negative set (FP.tfbs.bed) which does not.
  3. From the previous analysis determine the precision and recall:
    # True positives : number of TFBS which fall into a peak:
    $ wc -l TP.tfbs.bed
    # False positives : number of TFBS which do NOT fall into a chip peak:
    $ wc -l FP.tfbs.bed
    # False negatives: number of chip peaks which do NOT contain a TFBS:
    $ /bfg/software/bedtools/2.16.2/bedtools intersect -v -a [chip peaks] -b [TFBS] | wc -l
  4. By filtering on the PWM score (last column of the chr21.tfbs.pval.bed file), determine how these two quantities behave as a function of the score (set thresholds at 6.5, 7, 7.5, etc.); plot the precision/recall curve for the various threshold levels.
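Recall the definitions: precision = TP/(TP+FP), the fraction of predicted sites that fall into a peak, and recall = TP/(TP+FN), the fraction of peaks recovered by a prediction. A minimal sketch of the computation, with placeholder counts (substitute the results of the wc -l and bedtools commands above):

```shell
# Placeholder counts; replace with the real TP/FP/FN numbers.
TP=120; FP=480; FN=30
awk -v tp="$TP" -v fp="$FP" -v fn="$FN" 'BEGIN {
    printf "precision=%.3f recall=%.3f\n", tp/(tp+fp), tp/(tp+fn)
}'
```

Repeating this for each score threshold gives the points of the precision/recall curve.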

(2) Exploring additional features

Goal: Use additional regulatory features to improve the TFBS prediction.

A. profiles around TFBS

We consider various features that might have a potential impact on the binding of a transcription factor:
  • histone modifications
  • DNA methylation
  • phylogenetic conservation
For all these features, we want to determine their profile around true binding sites (i.e. PWM matches intersecting with ChIP peaks) and false positives.

  • Prepare a single bed file (e.g. TPFP.bed) containing both the positive and the negative TFBS determined previously; the name column (pos_N / neg_N) distinguishes the two classes.
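A toy sketch of this step (the coordinates are placeholders; in practice TP.tfbs.bed and FP.tfbs.bed come from the bedtools commands in section (1)B, and their 4th column already encodes the class):

```shell
# Toy positive/negative TFBS files with the pos_/neg_ name column.
printf 'chr21\t100\t110\tpos_1\n' > TP.tfbs.bed
printf 'chr21\t900\t910\tneg_1\n' > FP.tfbs.bed
# One concatenated region file is enough for the deepTools commands.
cat TP.tfbs.bed FP.tfbs.bed > TPFP.bed
```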
  • Use deepTools to plot the profile of the various datasets around the true-positive/false-positive TFBS:
    $ /bfg/software/anaconda2/bin/computeMatrix reference-point  -S [bigWig file] -R [predicted TFBS as bed file] -a 500 -b 500 -out [name of out file] --referencePoint center 
  • Now plot the profiles in the region +/- 500 bp around the TFBS by running:
    $ /bfg/software/anaconda2/bin/plotProfile -m [out file from computeMatrix] --refPointLabel "TFBS" -out [name of output pdf file]
  • Compare the profiles around positive/negative TFBS for the various features (histone marks, DNA methylation, DNase hypersensitivity, ...).
  • If the plots look unexpected, widen the region using the -a xxx -b yyy options of computeMatrix.
To process all bigwig files more efficiently, you can use a bash loop that repeats the same two commands for every file:
for i in ../data/methyl/*; do
  k=$(basename "$i")
  echo "$k"
  /bfg/software/anaconda2/bin/computeMatrix reference-point -S "$i" -R TPFP.bed -a 2000 -b 2000 -out "$k".out --referencePoint center
  /bfg/software/anaconda2/bin/plotProfile -m "$k".out --refPointLabel "TFBS" -out "$k".pdf
done

(3) Scoring TFBS with additional features

We will now score the TFBS (positives and negatives) using all the selected features, as a starting point for training the machine learning algorithm.
For each of the datasets explored in the previous section, score the TFBS (positives and negatives) using the bigWigAverageOverBed tool from kentUtils:
$ /bfg/software/kentUtils/bigWigAverageOverBed [bigwig file] [bed file] [output file]
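As for the profile plots, this is best done in a loop over all bigwig files. The sketch below only prints the commands (a dry run with placeholder file names); replace the list with the real files, e.g. ../data/*/*, and pipe the output to sh to execute:

```shell
# Dry run: print one scoring command per bigWig file.
# h3k4me3.bw / dnase.bw are placeholder names for illustration.
for bw in h3k4me3.bw dnase.bw; do
    k=$(basename "$bw" .bw)
    echo "/bfg/software/kentUtils/bigWigAverageOverBed $bw TPFP.bed $k.scores.tab"
done
```

Note that bigWigAverageOverBed requires a name in column 4 of the bed file, which the pos_/neg_ labels provide; each output row corresponds to one bed name, and the mean-signal column is a convenient per-TFBS score (check the tool's usage message for the exact column layout of your version).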

Using R, organize all the score files into one single data.frame, in which you have
  1. one row per TFBS
  2. one column per feature
Build one data frame for the positive TFBS (pos.scores) and a separate one for the negative TFBS (neg.scores)!
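Purely to illustrate the target layout, here is a toy command-line version (in R, reading each score file with read.table and combining the score columns with cbind achieves the same, and is the recommended route here):

```shell
# Toy per-feature score tables: name<TAB>score (placeholder values).
printf 'pos_1\t0.20\nneg_1\t0.05\n' > h3k4me3.scores.tab
printf 'pos_1\t0.90\nneg_1\t0.10\n' > dnase.scores.tab
# One row per TFBS, one column per feature:
cut -f2 dnase.scores.tab > dnase.col
paste h3k4me3.scores.tab dnase.col > all.scores.tab
cat all.scores.tab
```

Whichever tool you use, make sure all score files list the TFBS in the same order (sort them by name first if needed), since this kind of column-wise merge matches rows by position, not by name.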

Optional: Compute the correlation between the features using the cor function in R, and plot the correlation matrix as a heatmap
> library(gplots)
> cor.pos <- cor(pos.scores,method="spearman")
> heatmap.2(cor.pos)
Which features are most correlated to each other? Does this make sense?
We need to keep the correlations between the variables in mind, as they might affect the results of the regression analysis later. Correlated variables are usually harmful in regression approaches, as they reduce the accuracy of the estimated regression coefficients for the correlated variables.

(4) Building a logistic regression model

We now come to the core of the machine learning analysis, and train a logistic regression to optimally distinguish the positive from the negative TFBS predictions. This will reveal which variables are best suited to discriminate true from false positives.
We will use the glm function in R, using the binomial model.

A. first fit of model

Preparing the data
  1. Merge the pos.scores and neg.scores data.frames into one single data.frame (all.scores) using the rbind function;
  2. Add an additional column (class) to this merged data.frame indicating which rows represent positive/negative TFBS (make sure that this column is encoded as a factor with the as.factor() function!)
Training the model
We will now apply the glm function in R, assuming that our binary output variable follows a binomial distribution with parameter p. All available variables will be included in the model.
Run the glm function by specifying the binomial distribution as family:
> lrfit <- glm(class ~ ., data=all.scores,family="binomial")
> summary(lrfit)
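For reference, with family="binomial" and the default logit link, glm fits the logistic model: the log-odds of a TFBS being positive are a linear combination of the features.

```latex
% p = P(class = positive | features x_j); the beta_j are the fitted
% coefficients reported by summary(lrfit).
\[
  \log\frac{p}{1-p} = \beta_0 + \sum_j \beta_j x_j
  \qquad\Longleftrightarrow\qquad
  p = \frac{1}{1 + e^{-(\beta_0 + \sum_j \beta_j x_j)}}
\]
```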

Check the output of the summary function:
  • What are the most relevant variables?
  • What does the AIC represent?
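As a hint for the AIC question, the criterion balances goodness of fit against model complexity:

```latex
% Akaike information criterion for a model with k parameters and
% maximized likelihood \hat{L}; lower is better, and the 2k term
% penalizes needlessly complex models.
\[
  \mathrm{AIC} = 2k - 2\log\hat{L}
\]
```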
Evaluate the performance
Now that we have a model, we can test how well it performs by using it as a predictor:

> pred <- predict(lrfit,newdata=all.scores,type="response")

This yields the predicted probabilities. By applying different thresholds to the probabilities, compute the precision and recall. Did we improve?
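A toy sketch of the thresholding, on a table of predicted probability and true class (placeholder values; in practice, export pred together with the class column, e.g. via write.table, or compute the counts directly in R):

```shell
# Placeholder predictions: probability<TAB>true class.
printf '0.9\tpositive\n0.8\tpositive\n0.6\tnegative\n' > pred.tsv
printf '0.2\tpositive\n0.1\tnegative\n' >> pred.tsv
# Precision/recall at a 0.5 cutoff: calls with p >= 0.5 are positives.
awk -F'\t' '{
    if ($1 >= 0.5 && $2 == "positive") tp++
    else if ($1 >= 0.5)                fp++
    else if ($2 == "positive")         fn++
} END { printf "precision=%.2f recall=%.2f\n", tp/(tp+fp), tp/(tp+fn) }' pred.tsv
```

Varying the cutoff (0.1, 0.2, ...) yields the precision/recall curve to compare against the PWM-only predictions.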

B. cross validation to evaluate the performance

Of course we have cheated here: we evaluated the performance of the model on the very data used to train it, thus biasing the evaluation. The proper way to evaluate a machine-learning model is cross-validation: split the data into a training and a test set, learn the model on the training set, and apply it to the test set, thus evaluating the performance on an independent dataset.
For this, the caret package implements convenient functions for such cross-validations; below we perform a simple hold-out split with base R.

Creating a partition
We will create a training dataset with 75% of the data, and keep the remaining 25% for the test dataset; make sure to sample 75% of the positives and 75% of the negatives (a stratified split)!
    > i.pos = which(all.scores$class=='positive')
    > i.pos.train = sample(i.pos,floor(0.75*length(i.pos)))
    > i.neg = which(all.scores$class=='negative')
    > i.neg.train = sample(i.neg,floor(0.75*length(i.neg)))
    > i.train = c(i.pos.train,i.neg.train)
    > train.set = all.scores[i.train,]
    > test.set = all.scores[-i.train,]
Fitting the model on the training set:
  1. Run the glm function on the training set only:
    > lrFit <- glm(class ~ ., data=train.set, family="binomial")
Applying the model:
Remember that we defined a training and a test set? Now that we have a fitted model, we can apply it to the left-out test dataset...
  1. run the following function
    > pred <- predict(lrFit,newdata=test.set,type="response")
  2. Determine the precision and recall here.

C. selecting variables

We now have a model which includes all variables. We can try to simplify it by removing "uninteresting" variables that do not contribute to the performance of the model.
How would you select variables that can be safely removed?
We can apply the built-in function step, which iteratively eliminates variables:
> lrfit <- glm(class ~ ., data=train.set, family="binomial")
> lrfit.red <- step(lrfit)

  1. What is happening here? What criterion is used to eliminate variables?
  2. Check the final model returned by step: what variables does it contain? Is this expected?
  3. Apply this reduced model to the left-out test set (test.set).
  4. Did the performance in terms of precision or recall get worse?