Part 1 : First steps using Bash

Why Linux ?

Bioinformatics is highly dependent on Linux-based computers and software. First, with the advent of new technologies (e.g. sequencing), analysis tend to rely more and more on powerful computing clusters. Second, Linux/UNIX systems come with pre-installed versions of most popular programing languages and thus tend to become the natural platform for newly developed software(some programs are exclusively developed for Linux-based systems). Thus, mastering the UNIX/Linux environment and interface is a critical step required to perform lots of bioinformatics analysis.

In the present session we will aim at computing some basic statistics on the human genome (number of transcripts, number of genes, number of genes per chromosomes, number of exons,...)

To this aim we will use information produced by The Reference Sequence Database (RefSeq). This project provides a comprehensive but not redundant list of known transcripts. Refseq transcript coordinates will be obtained from UCSC genome browser.

First steps

We will first experiment with some basic bash commands to navigate around folders.

Note

Remember that you can always get help on a command my typing man followed by the name of the command : $ man ls

Exercise 1: creating folders

  1. Navigate to your home directory using cd
  2. Create a subdirectory Bioinfo_WS1718 using mkdir
  3. In that folder, create a subfolder bash_Teil1
  4. Navigate back to your home directory
  5. From there, try creating directly the subdirectory Bioinfo_WS1718/bash_Teil2/Daten
$ mkdir Bioinfo_WS1718/bash_Teil2/Daten

Error

Do you get an error ? Do you understand the problem ? Then check the help page of mkdir using $ man mkdir

  1. Finally, create a directory Bioinfo_WS1718/data in which we will put all datasets we will work with.

Working with files

Bash is very powerfull when it comes to handling large text files; remember that you can only handle ASCII encoded files, not binary files, for which you need special tools. We will use pager tools like less, head or tail to inspect the content of a file prior to processing it.

Exercice 2 : fetching an annotation file

We will fetch an annotation file for genes on the Human genome:

  • Using your Web Browser, go to the UCSC web site .
  • In the middle of the top menu, select Downloads.
  • Select the Human genome.
  • In the “Dec. 2013 (hg38, GRCh38)” section, select “Annotation database”. This will connect you to the UCSC FTP web site. After a rather long explanatory text, you will find an even longer list of available files.
  • Find the file refGene.txt.gz. Right-click on the link and select “Copy link adress” to get the URL.
  • In a terminal, go to the data directory that we created in the first exercice.
  • Still in the terminal use the wget command followed by the URL to retrieve the refGene.txt.gz file.
  • Use the ls command to ensure that a file was created. You should see one file: refGene.txt.gz.
  • Get some help on the gunzip command. What is it for ?
  • Uncompress the file using the gunzip command.
  • Use the ls command to ensure that the file was uncompressed (the file should have been renamed without the .gz extension).
  • Rename (mv) the refGene.txt file to refGene_hg38.txt.
  • Use the ls command with -l argument to check that the file was renamed.

The file we just downloaded contains the full annotation of the human genome version hg38 (the latest version). Each row in this file represents a transcript, associated to some genomic annotations (position, chromosome, name,...).

Exercice 3 : inspecting the annotation file

  • Count the number of lines of the file refGene_hg38.txt using the wc command. How many transcripts does the human genome contain ?
  • Use the head command with -n arguments to see the first 5 lines of the file refGene_hg38.txt.
  • Use the tail command with -n arguments to see the last 5 lines of the file refGene_hg38.txt.
  • Have a look at the whole file using less (use up/down arrow keys and type q to quit)
  • Test the difference between using less and less -S
  • Use the head command then the tail command (use a pipe) to select the 10th line of the file refGene_hg38.txt.
  • try to understand the different fields of the file: can you find the coordinates ? The number of exons ?

Note

You can check your intepretation of the different columns by looking at a particular transcript in the UCSC genome browser. Then go to genomes in the top bar, select the hg38 version of the genome, and type the name of transcript (NM_xxxx) in the search bar to visualize the transcript.


Warning

Very important: most text files we will manipulate contain data as columns; it is crucial to know how the columns are separated: tabulation, space, comma, semi-colon, ... Otherwise, you will not be able to extract individual columns. This is an important source of errors and trouble !


Some statistics on transcript and genes

Our knowledge of the human genome is evolving. New genes are still being discovered. In the file refGene_hg38.txt, each line corresponds to a transcript whose name/identifier is provided in the 2nd column. Note that a gene can correspond to one or more transcript. Indeed, some genes have several alternative promoters, or can be spliced differentially. The number of lines we counted in the previous exercise indicates the number of transcripts identified in the version 38 of the human genome (hg38).

The symbol/identifier of the gene associated to this transcript is annotated in the 13th column of the RefGenes file. What about our current knowledge of human transcripts and genes ?

Exercice 4: statistics on genes and transcripts

  • What is the number of distinct genes in the version 38 of the human genome ? The result can be found by computing the number of non redundant gene symbols in the 13th column (use the cut command).
  • Coding genes have an id starting with NM_xxx, while non-coding genes have ids NR_xxxx; how many coding / non-coding transcripts do we have ? (use the grep command)
  • Which transcript has the largest number of exons ? (use the `sort command for this)
  • What is the number of transcripts on each chromosome ? (take only the “regular” chromosomes)
  • What is the number of genes on each chromosome (use a combination of cut, sort, uniq and wc ...)?
Solution
1
2
3
4
5
6
cut -f13 refGene_hg38.txt | sort -u | wc -l # count the number of unique genes
        grep NM_ refGene_hg38.txt | sort |uniq | wc -l           # coding transcripts
                grep NR_ refGene_hg38.txt | sort | uniq | wc -l           # non-coding transcripts
                sort -n -k 9 | tail -1                      # number of exons is in column 9; this commands take very long, because the file is huge. We will see a better solution for this later...
                cut -f3 refGene_hg38.txt | sort | uniq -c  # counts the number of lines for each chromosome
        cut -f3,13 refGene_hg38.txt | sort | uniq | cut -f1 | uniq -c # master of pipe !!!

Exercice 5 : writing into files using loops

  • Go to your folder Bioinfo_WS1718/bash_Teil1
  • Write the transcripts for each chromosome 1 to 10 into separate files called transcripts_hg38_chrXX.txt where XX represents the number of the chromosome. (use the fact that you can retrieve the previous command using the up-arrow on the keyboard!!)

Note

Repeating repetitive commands like the previous one is tedious and stupid ... In bash, we can use a loop-structure to make bash do the repetition for us ... The structure of the command is the following

$ for i in {1..10}; do grep -w chr$i refGene_hg38.tx > transcripts_hg38_chr$i.txt; done

Here, i represents a variable, which will take successively all values from 1 to 10; for each iteration of the loop, $i will therefore be replaced by the values 1,2,...10. The code between for ...; and done will be executed at each iteration.

  • Check with ls that you have created files called transcripts_hg38_chXX.txt
  • Now delete all files transcripts_hg38_chXX.txt files that you have created previously using the rm transcripts_hg38_* command.
  • Check with ls that these files have indeed been deleted.
  • Now execute the loop command in the Note box.
  • Check with ls that the files have bee created; verify with less that they contain the correct information
  • Modify the command to produce this file for all chromosomes from 1 to 22.
  • How could you also add the chromosomes X and Y ?