================================= Part 1 : First steps using Bash ================================= .. sidebar:: Goals of this session * using basic bash commands * getting help * learn how to work with and manipulate large text files * :download:`Bash cheat sheet <../documents/linux_cheat_sheet.pdf>` * :download:`Vorlesung Bash <../documents/Vorlesung_Teil_Linux.pdf>` ------------ Why Linux ? ------------ Bioinformatics is highly dependent on Linux-based computers and software. First, with the advent of new technologies (e.g. sequencing), analysis tend to rely more and more on powerful computing clusters. Second, Linux/UNIX systems come with pre-installed versions of most popular programing languages and thus tend to become the natural platform for newly developed software(some programs are exclusively developed for Linux-based systems). Thus, mastering the UNIX/Linux environment and interface is a critical step required to perform lots of bioinformatics analysis. In the present session we will aim at computing some basic statistics on the human genome (number of transcripts, number of genes, number of genes per chromosomes, number of exons,...) To this aim we will use information produced by The Reference Sequence Database (RefSeq). This project provides a comprehensive but not redundant list of known transcripts. Refseq transcript coordinates will be obtained from UCSC genome browser. -------------- First steps -------------- We will first experiment with some basic bash commands to navigate around folders. .. note:: Remember that you can always get help on a command my typing ``man`` followed by the name of the command : ``$ man ls`` Exercise 1: creating folders """""""""""""""""""""""""""" .. container:: exo 1. Navigate to your home directory using ``cd`` 2. Create a subdirectory **Bioinfo_WS1718** using ``mkdir`` 3. In that folder, create a subfolder bash_Teil1 4. Navigate back to your home directory 5. From there, try creating *directly* the subdirectory **Bioinfo_WS1718/bash_Teil2/Daten** .. code-block:: bash $ mkdir Bioinfo_WS1718/bash_Teil2/Daten .. error:: Do you get an error ? Do you understand the problem ? Then check the help page of mkdir using ``$ man mkdir`` .. container:: exo 6. Finally, create a directory **Bioinfo_WS1718/data** in which we will put all datasets we will work with. ---- -------------------- Working with files -------------------- Bash is very powerfull when it comes to handling large text files; remember that you can only handle ASCII encoded files, not binary files, for which you need special tools. We will use pager tools like ``less``, ``head`` or ``tail`` to inspect the content of a file prior to processing it. Exercice 2 : fetching an annotation file """""""""""""""""""""""""""""""""""""""" We will fetch an annotation file for genes on the Human genome: .. container:: exo * Using your Web Browser, go to the `UCSC web site `_ . * In the middle of the top menu, select Downloads. * Select the Human genome. * In the "Dec. 2013 (hg38, GRCh38)" section, select "Annotation database". This will connect you to the UCSC FTP web site. After a rather long explanatory text, you will find an even longer list of available files. * Find the file **refGene.txt.gz**. Right-click on the link and select *"Copy link adress"* to get the URL. * In a terminal, go to the *data* directory that we created in the first exercice. * Still in the terminal use the ``wget`` command followed by the URL to retrieve the refGene.txt.gz file. * Use the ``ls`` command to ensure that a file was created. You should see one file: refGene.txt.gz. * Get some help on the ``gunzip`` command. What is it for ? * Uncompress the file using the ``gunzip`` command. * Use the ``ls`` command to ensure that the file was uncompressed (the file should have been renamed without the .gz extension). * Rename (``mv``) the refGene.txt file to refGene_hg38.txt. * Use the ``ls`` command with -l argument to check that the file was renamed. ---- The file we just downloaded contains the full annotation of the human genome version hg38 (the latest version). Each row in this file represents a transcript, associated to some **genomic annotations** (position, chromosome, name,...). Exercice 3 : inspecting the annotation file """"""""""""""""""""""""""""""""""""""""""" .. container:: exo * Count the number of lines of the file refGene_hg38.txt using the ``wc`` command. How many transcripts does the human genome contain ? * Use the ``head`` command with -n arguments to see the first 5 lines of the file refGene_hg38.txt. * Use the ``tail`` command with -n arguments to see the last 5 lines of the file refGene_hg38.txt. * Have a look at the whole file using ``less`` (use up/down arrow keys and type **q** to quit) * Test the difference between using ``less`` and ``less -S`` * Use the ``head`` command then the ``tail`` command (use a pipe) to select the 10th line of the file refGene_hg38.txt. * try to understand the different fields of the file: can you find the coordinates ? The number of exons ? .. note:: You can check your intepretation of the different columns by looking at a particular transcript in the `UCSC genome browser `_. Then go to *genomes* in the top bar, select the hg38 version of the genome, and type the name of transcript (*NM_xxxx*) in the search bar to visualize the transcript. ---- .. warning:: **Very important**: most text files we will manipulate contain data as columns; it is **crucial** to know how the columns are separated: tabulation, space, comma, semi-colon, ... Otherwise, you will not be able to extract individual columns. This is an important source of errors and trouble ! ---- --------------------------------------- Some statistics on transcript and genes --------------------------------------- Our knowledge of the human genome is evolving. New genes are still being discovered. In the file refGene_hg38.txt, each line corresponds to a transcript whose name/identifier is provided in the 2nd column. Note that a gene can correspond to one or more transcript. Indeed, some genes have several alternative promoters, or can be spliced differentially. The number of lines we counted in the previous exercise indicates the number of transcripts identified in the version 38 of the human genome (hg38). The symbol/identifier of the gene associated to this transcript is annotated in the 13th column of the RefGenes file. What about our current knowledge of human transcripts and genes ? Exercice 4: statistics on genes and transcripts """"""""""""""""""""""""""""""""""""""""""""""" .. container:: exo * What is the number of **distinct** genes in the version 38 of the human genome ? The result can be found by computing the number of non redundant gene symbols in the 13th column (use the ``cut`` command). * Coding genes have an id starting with *NM_xxx*, while non-coding genes have ids *NR_xxxx*; how many coding / non-coding transcripts do we have ? (use the ``grep`` command) * Which transcript has the largest number of exons ? (use the ```sort`` command for this) * What is the number of **transcripts** on each chromosome ? (take only the "regular" chromosomes) * What is the number of **genes** on each chromosome (use a combination of ``cut``, ``sort``, ``uniq`` and ``wc`` ...)? .. container:: toggle .. container:: header **Solution** .. code-block:: bash :linenos: cut -f13 refGene_hg38.txt | sort -u | wc -l # count the number of unique genes grep NM_ refGene_hg38.txt | sort |uniq | wc -l # coding transcripts grep NR_ refGene_hg38.txt | sort | uniq | wc -l # non-coding transcripts sort -n -k 9 | tail -1 # number of exons is in column 9; this commands take very long, because the file is huge. We will see a better solution for this later... cut -f3 refGene_hg38.txt | sort | uniq -c # counts the number of lines for each chromosome cut -f3,13 refGene_hg38.txt | sort | uniq | cut -f1 | uniq -c # master of pipe !!! ---- Exercice 5 : writing into files using loops """"""""""""""""""""""""""""""""""""""""""" .. container:: exo * Go to your folder ``Bioinfo_WS1718/bash_Teil1`` * Write the transcripts for each chromosome 1 to 10 into separate files called ``transcripts_hg38_chrXX.txt`` where XX represents the number of the chromosome. (*use the fact that you can retrieve the previous command using the up-arrow on the keyboard!!*) .. note:: Repeating repetitive commands like the previous one is tedious and stupid ... In bash, we can use a **loop**-structure to make bash do the repetition for us ... The structure of the command is the following .. code-block:: bash $ for i in {1..10}; do grep -w chr$i refGene_hg38.tx > transcripts_hg38_chr$i.txt; done Here, ``i`` represents a **variable**, which will take successively all values from 1 to 10; for each iteration of the loop, ``$i`` will therefore be replaced by the values 1,2,...10. The code between ``for ...;`` and ``done`` will be executed at each iteration. .. container:: exo * Check with ``ls`` that you have created files called ``transcripts_hg38_chXX.txt`` * Now delete all files ``transcripts_hg38_chXX.txt`` files that you have created previously using the ``rm transcripts_hg38_*`` command. * Check with ``ls`` that these files have indeed been deleted. * Now execute the loop command in the **Note** box. * Check with ``ls`` that the files have bee created; verify with ``less`` that they contain the correct information * Modify the command to produce this file for all chromosomes from 1 to 22. * How could you also add the chromosomes X and Y ?