## Introduction

The goal of these sessions is to investigate **how we can improve the prediction of binding sites for transcription factors** by using additional information, besides the knowledge of the binding profile, given as a position weight matrix (PWM).

This additional information could be, for example:

- **histone modification** datasets (ChIP-seq),
- **chromatin opening**, given e.g. by DNase I hypersensitivity assays,
- **DNA methylation**, ...

As a gold standard to determine whether a predicted TFBS is a true or false hit, we will consider ChIP-seq datasets for the transcription factor of interest: every predicted TFBS overlapping with a ChIP peak will be considered a **true positive**, while those outside ChIP peaks will be **false positives**.

In the following, we will:

- Predict potential binding sites using the PWM alone;
- Evaluate the rate of false positives/true positives, using ChIP-seq datasets for the TF of interest as a control;
- Determine the profile of the different additional features around true/false binding sites;
- Build a model incorporating additional features using **logistic regression**;
- Validate the model using independent datasets.

**Datasets:** the files which are needed are located in **/bfg/data/courses/bioinfo2/**, with subfolders containing all the data needed:

- histone: bigwig files for the histone modifications
- dnase: DNase I hypersensitivity data
- methyl: DNA methylation bigwig
- phylop: phylogenetic conservation of the bases
- sequence: fasta sequence of chromosome 21

**Make a local directory**

Create a local directory `bioinfo2` with subdirectories `results/` and `data/`, make a symbolic link to the data folder, and extract the ChIP-seq peaks and the PWM files.

## (1) Identifying potential TFBS using PWMs

**Goal:** Given a position weight matrix for a transcription factor of interest, we want to predict potential TFBS using only the information in the PWM. For reasons of computational simplicity, we will restrict the analysis to chromosome 21.

### A. Scanning a sequence with a PWM

- Go to the RSAT web site and select `matrix-scan-quick` in the left menu
- Upload the fasta sequence of chr21 in the `sequence` section
- Select a PWM in the `pwm` folder (file ending with .tab) and upload the corresponding file in the `Matrix` section
- In the dropdown menus, select `"sequence origin": start` and `"Return": sites + pval`
- Set the threshold on the p-values to `"pval<=": 1e-3`
- Finally, at the bottom of the page, select `"Output": email`
- When the results are ready, download the result file to your local account and store it as `chr21.tfbs.pval.bed` (extract columns 1, 5 and 6; remove the first lines starting with ";")
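That post-processing step can be sketched in shell. The column positions (1 = sequence id, 5 = start, 6 = end) follow the instructions above; the mock input line only mimics the matrix-scan output so that the example is runnable, and may differ from your actual result file:

```shell
# Mock matrix-scan output: ";"-comment lines, then tab-separated hits.
printf '; matrix-scan result\n; parameters ...\n' > scan.raw
printf 'chr21\tsite\tMYPWM\tD\t100\t110\tACGTACGTAC\t8.2\t5e-4\n' >> scan.raw

# Drop the ";" comment lines and keep columns 1, 5 and 6 (chrom, start, end).
# If you also want the PWM score for the filtering step later, keep the
# weight column as well (assumed here to be column 8: cut -f1,5,6,8).
grep -v '^;' scan.raw | cut -f1,5,6 > chr21.tfbs.pval.bed
cat chr21.tfbs.pval.bed
```

Check the header of your own result file before trusting the column numbers.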

### B. Evaluate the true positives

- In the folder `tf_peaks`, find the ChIP dataset corresponding to the chosen PWM, and copy it to your folder.
- Use `bedtools intersect` to determine which of the predicted TFBS overlap the ChIP peaks:

```
$ /bfg/software/bedtools/2.16.2/bedtools intersect -a chr21.tfbs.pval.bed -b [file of ChIP-seq peaks as bed file] | cut -f1-3 | awk '{print $0"\tpos_"NR}' > TP.tfbs.bed
$ /bfg/software/bedtools/2.16.2/bedtools intersect -v -a chr21.tfbs.pval.bed -b [file of ChIP-seq peaks as bed file] | cut -f1-3 | awk '{print $0"\tneg_"NR}' > FP.tfbs.bed
```

We now have two sets of TFBS: a **positive set** (TP.tfbs.bed), which intersects true ChIP-seq peaks, and a **negative set** (FP.tfbs.bed), which does not.

- From the previous analysis, determine the *precision* and *recall*:

```
# True positives: number of TFBS which fall into a ChIP peak
$ wc -l TP.tfbs.bed
# False positives: number of TFBS which do NOT fall into a ChIP peak
$ wc -l FP.tfbs.bed
# False negatives: number of ChIP peaks which do NOT contain a TFBS
$ /bfg/software/bedtools/2.16.2/bedtools intersect -v -a [chip peaks] -b [TFBS] | wc -l
```
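From those three counts, precision = TP/(TP+FP) and recall = TP/(TP+FN) follow directly; a small awk sketch (the counts below are made-up placeholders, substitute your `wc -l` and `bedtools` results):

```shell
# Placeholder counts -- replace with the outputs of the commands above.
tp=120   # predicted TFBS inside a ChIP peak
fp=480   # predicted TFBS outside any ChIP peak
fn=30    # ChIP peaks without any predicted TFBS

awk -v tp=$tp -v fp=$fp -v fn=$fn 'BEGIN{
  printf "precision = %.3f\n", tp/(tp+fp)   # TP / (TP + FP)
  printf "recall    = %.3f\n", tp/(tp+fn)   # TP / (TP + FN)
}'
```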

- By filtering on the PWM score (last column of the `chr21.tfbs.pval.bed` file), determine how these two quantities behave as a function of the score (set thresholds at 6.5, 7, 7.5, etc.); plot the precision/recall curve for the various threshold levels.
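The thresholding itself is a one-line awk filter; below is a self-contained sketch with three mock sites (the file name and scores are placeholders). On real data you would re-run the `bedtools intersect` counts on each filtered file to get one precision/recall point per threshold:

```shell
# Mock predictions with the PWM score in the last column.
printf 'chr21\t100\t110\t6.8\nchr21\t200\t210\t7.6\nchr21\t300\t310\t8.1\n' > mock.tfbs.bed

for t in 6.5 7 7.5; do
  # keep only sites whose score (last field) reaches the threshold
  awk -v t=$t '$NF >= t' mock.tfbs.bed > filt.bed
  n=$(wc -l < filt.bed | tr -d ' ')
  echo "threshold $t: $n sites kept"
done
```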

## (2) Exploring additional features

**Goal:** Use additional regulatory features to improve the TFBS prediction.

### A. profiles around TFBS

We consider various features that might have a potential impact on the binding of a transcription factor:

- histone modifications
- DNA methylation
- DNase I hypersensitivity
- phylogenetic conservation

We want to determine their **profile** around true binding sites (i.e. PWM matches intersecting with ChIP peaks) and around false positives.

- Prepare a bed file containing the positive and negative TFBS determined previously from the ChIP-seq datasets, with the following format:
```
...
...
...
#positive_tfbs
...
...
...
#negative_tfbs
```

- Use deepTools to plot the profile of the various datasets around the true positive / false positive TFBS:

```
$ /bfg/software/anaconda2/bin/computeMatrix reference-point -S [bigWig file] -R [predicted TFBS as bed file] -a 500 -b 500 -out [name of out file] --referencePoint center
```

- Now plot the profiles in the region +/- 500 bp around the TFBS by running:

```
$ /bfg/software/anaconda2/bin/plotProfile -m [out file from computeMatrix] --refPointLabel "TFBS" -out [name of output pdf file]
```

- Compare the profiles around positive/negative TFBS for the various features (histone marks, DNA methylation, DNase hypersensitivity, ...)
- If the plots look unexpected, you can increase the region using the `-a xxx -b yyy` options in `computeMatrix`
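The combined file (TPFP.bed, as used by `computeMatrix` in this section) holds both groups, each followed by a `#label` separator line that deepTools uses to split the profile plot. A sketch of assembling it (the two input files are the TP/FP beds from part 1, mocked here so the example runs):

```shell
# Mock TP/FP files -- replace with the real TP.tfbs.bed / FP.tfbs.bed.
printf 'chr21\t100\t110\tpos_1\n' > TP.tfbs.bed
printf 'chr21\t500\t510\tneg_1\n' > FP.tfbs.bed

# The regions of a group come first, then a '#label' line closing it.
cat TP.tfbs.bed       >  TPFP.bed
echo '#positive_tfbs' >> TPFP.bed
cat FP.tfbs.bed       >> TPFP.bed
echo '#negative_tfbs' >> TPFP.bed
cat TPFP.bed
```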

For example, to generate the profiles for all methylation datasets in one loop:

```
for i in ../data/methyl/*; do
  k=`basename $i .gm12878.chr21.bw`
  echo $k
  /bfg/software/anaconda2/bin/computeMatrix reference-point -S $i -R TPFP.bed -a 2000 -b 2000 -out $k.out --referencePoint center
  /bfg/software/anaconda2/bin/plotProfile -m $k.out --refPointLabel "TFBS" -out $k.pdf
done
```

## (3) Scoring TFBS with additional features

We will now score the TFBS (positives and negatives) using all the selected features, as a starting point for training the machine
learning algorithm.

For each of the datasets explored in the previous section, score the TFBS (positives and negatives) using the kentUtils tool `bigWigAverageOverBed`:

```
$ /bfg/software/kentUtils/bigWigAverageOverBed [bigwig file] [bed file] [output file]
```

Using R, organize all the score files into one single data.frame, in which you have

- one row per TFBS
- one column per feature
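The intended table layout can be sketched in shell before switching to R. `bigWigAverageOverBed` writes six tab-separated columns (name, size, covered, sum, mean0, mean); the file names and values below are mock placeholders, and the sketch assumes every score file lists the TFBS in the same order:

```shell
# Mock score files in bigWigAverageOverBed's output format
# (name, size, covered, sum, mean0, mean) -- values are made up.
printf 'pos_1\t10\t10\t5.0\t0.5\t0.5\n' > feature1.tab
printf 'pos_1\t10\t10\t8.0\t0.8\t0.8\n' > feature2.tab

# One row per TFBS, one column per feature: keep the region name
# plus the mean score (column 6) of each file, side by side.
cut -f1,6 feature1.tab > names_f1.tmp
cut -f6 feature2.tab   > f2.tmp
paste names_f1.tmp f2.tmp > pos.scores.tsv
cat pos.scores.tsv
```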

**Build one data frame for the positive TFBS (pos.scores) and a separate one for the negative TFBS (neg.scores)!**

**Optional**: Compute the correlation between the features using the `cor` function in R, and plot the correlation matrix as a heatmap:

```
> library(gplots)
> cor.pos <- cor(pos.scores, method="spearman")
> heatmap.2(cor.pos)
```

Which features are most correlated with each other? Does this make sense?

We need to keep in mind the correlations between the variables, as they might impact the results of the regression analysis later. Correlated variables are usually harmful in regression approaches, as they reduce the **accuracy** of the estimated regression coefficients of the correlated variables.

## (4) Building a logistic regression model

We now come to the core of the machine learning analysis, and train a logistic regression to optimally distinguish the positive from the negative TFBS predictions. This will reveal which variables are best suited to discriminate true from false positives.

We will use the `glm` function in R with the `binomial` family.

### A. first fit of model

**Preparing the data**

- Merge the `pos.scores` and `neg.scores` data.frames into one single data.frame **(all.scores)** using the `rbind` function;
- Add an additional column (`class`) to this merged data.frame indicating which rows represent positive/negative TFBS *(make sure that this additional column is encoded as a factor with the `as.factor()` function!)*

**Training the model**

We will now apply the `glm` function in R, assuming that our binary output variable is distributed according to a binomial distribution with parameter *p*. All available variables will be included in the model.

Run the `glm` function, specifying the `binomial` distribution as family:

```
> lrfit <- glm(class ~ ., data=all.scores, family="binomial")
> summary(lrfit)
```

Check the output of the `summary` function:

- What are the most relevant variables?
- What does the AIC represent?

**Evaluate the performance**

Now that we have a model, we can test how well it performs by using it as a **predictor**:

```
> pred <- predict(lrfit, newdata=all.scores, type="response")
```

This yields the predicted probabilities. By applying different thresholds on the probabilities, compute the **precision** and **recall**. Did we improve?

### B. cross validation to evaluate the performance

Of course we have cheated here: we evaluated the performance of the model on the data that was used to train it, thus biasing the evaluation. The proper way to evaluate the performance of a machine-learning model is cross-validation: splitting the data into a training and a testing set, learning the model on the training set and applying it to the test set, thus evaluating the performance on an independent dataset.

For this, we will use the **caret** package, which implements convenient functions to perform this kind of cross-validation.

**Creating a partition**

We will create a **training dataset** with 75% of the data, and keep the remaining 25% as the **testing dataset**; we need to make sure to sample 75% of the positive and of the negative samples!

```
> i.pos = which(all.scores$class=='positive')
> i.pos.train = sample(i.pos, floor(0.75*length(i.pos)))
> i.neg = which(all.scores$class=='negative')
> i.neg.train = sample(i.neg, floor(0.75*length(i.neg)))
> i.train = c(i.pos.train, i.neg.train)
> train.set = all.scores[i.train,]
> test.set = all.scores[-i.train,]
```

**Setting up the cross validation:**

- Run the `glm` function to fit the model on the training set:

```
> lrFit <- glm(class ~ ., data=train.set, family="binomial")
```

**Applying the model:**

Remember that we had defined a `training` and a `test` set? Now that we have an optimal model, we can apply it to the left-out `test` dataset.

- Run the following function:

```
> pred <- predict(lrFit, newdata=test.set, type="response")
```

- Determine the precision and recall here.

### C. selecting variables

We now have a good model, which includes all variables. We can try to simplify the model by reducing it, removing "uninteresting" variables that do not contribute to its performance.

How would you select variables that can be safely removed ?

We can apply the built-in function `step`, which recursively eliminates variables:

```
> lrfit <- glm(class ~ ., data=train.set, family="binomial")
> lr.red <- step(lrfit)
```

- What is happening here? What criterion is used to eliminate variables?
- Check the final model stored in `lr.red`: which variables does it contain? Is this expected?
- Apply this reduced model to the left-out `test.set`.
- Did the performance in terms of `precision` or `recall` get worse?