=============================================
Part 2 : Going further using bedtools and awk
=============================================

.. sidebar:: Goals of this session

	     * Learn how to deal with genomic coordinates
	     * Work with the genomic tool bedtools
	     * learn how to use awk


-------------------------
Genomic coordinates
-------------------------


Working with genomic data frequently implies to deal with biological sequences. As these sequences (genes, exons, ...) are related to genomes, handling genomic coordinates is typically required. A basic task in genomic data analysis is to compare different sets of genomic features (transcripts, promoter regions, polymorphisms, conserved elements,...). One question could be for instance "which SNPs (Single Nucleotide Polymorphisms) are associated with a disease of interest and fall into exonic regions". Such questions require dedicated tools to ease data analysis.

Bedtools has been developed to compare sets of genomic features and is now becoming a standard Linux tool for people working in the field. Bedtools largely relies on the `BED file format <https://genome.ucsc.edu/FAQ/FAQformat.html#format1>`_ (although it may also operate with GFF/GTF, VCF, and SAM/BAM files).

The Bed file format is a very simple way to store information related to genomic features. A typical file in BED format will contain the following columns (the 3 first columns are mandatory):

    * chrom - The name of the chromosome (e.g. chr3, chrY, chr2...).
    * chromStart - The starting position of the feature in the chromosome.
    * chromEnd - The ending position of the feature in the chromosome.
    * name - A name for the feature (e.g. gene name...).
    * score - A score between 0 and 1000.
    * strand - Defines the strand - either '+' or '-'.


----

---------------------------------------------------------
Which fraction of the human genome is covered by exons ?
---------------------------------------------------------

In the section below we will try to answer the following question: "Which fraction of the genome is covered by exons ?".

Downloading genomic coordinates of exons.
"""""""""""""""""""""""""""""""""""""""""

* Download the following :download:`file <../documents/hg38_exons.bed>`  which contains the coordinates of the exons of the RefSeq transcripts version hg38.
* Using the ``mv`` command, move the Downloaded file into the directory **Bioinfo_WS1718/data**.


  .. container:: exo

		 * Look at the file using ``less`` to see the format. Is this a bed format ?
		 * How many exons do we have in total ?
		 * Tricky one : can you find again the transcript which has the largest number of exons ?


.. container:: toggle

    .. container:: header

        **Solution** (click to open)


    .. code-block:: bash
       :linenos:

	  # this is a genuine bed format
	  wc -l hg38_exons.bed
	  cut -f4 hg38_exons.bed | cut -d ";" -f1 | uniq -c | sort -nr
	  # first command extracts the column with the transcript name
	  # second extracts from this only the string which is left of the ";"
	  # third counts how often this transcript name occurs
	  # 4th sorts the results.


----


Merging overlapping regions
"""""""""""""""""""""""""""

We will use a "genomic toolbox" called **Bedtools**, which offers a large number of utilities to manage genomic datasets.

..
	.. note:: To use the bedtools suite, you need to run the following command first::

	  	$ export PATH=/bfg/software/bin:$PATH


To answer our simple question, we must first keep in mind that several exons may overlap, due to various phenomena (alternative splicing, multiple promoters or terminators, mutually overlapping genes). To avoid counting several times the same region, our first task will thus be to merge these overlapping regions. The **bedtools merge** command from the **Bedtools** suite combines overlapping features in an interval file into a single feature which spans all of the combined features. The image below illustrate this.


.. warning:: Beware: bedtools merge requires the genomic coordinates to be sorted (see below).

.. image:: ../documents/mg.jpg
	   :scale: 50 %
	   :align: center


.. container:: exo

    * Get some help about the ``bedtools sort`` command using the -h (help) argument.
    * Use the ``bedtools sort`` command to sort exons by coordinates and store the results in **hg38_exons_sorted.bed**.
    * Get some help about the ``bedtools merge`` command using the -h (help) argument.
    * Use ``bedtools merge`` with **hg38_exons_sorted.bed** as input to combine overlapping exons into single features and store the results into a file named **mergedExons.bed**.
    * Have a look at the first lines with the ``head`` command.
    * The length of one genomic feature can simply be obtained by computing column_3-column_2. Use a awk command to compute the sum of the length of all features *(check the pdf of the lecture on bash and awk* :download:`Vorlesung Bash <../documents/Vorlesung_Teil_Linux.pdf>`).
    * Compute the total length of the genome using :download:`this file <../documents/chromInfo.txt>`.
    * Now, what is the fraction of the genome that is covered by exons or genes ?


.. container:: toggle

   .. container:: header

		        **Solution** (click to open)

   .. code-block:: bash
       :linenos:
			 
       bedtools sort -i hg38_exons.bed > hg38_exons_sorted.bed
       bedtools merge -i hg38_exons_sorted.bed > hg38_exons_merged.bed

       # total length of all exons
       awk 'BEGIN {l=0} {l=l+$3-$2+1} END {print l}' 	hg38_exons_merged.bed

       # total length of the genome
       awk 'BEGIN {l=0} {l=l+$2} END {print l}' 	chromInfo.txt

       # then simply divide the first number by the second using a calculator ...