Part 3 : First steps in Python¶

Why Python ?¶

The Python programming language was developed by Guido van Rossum. The first version was released in 1991. Python works under various systems (windows, Mac OS X, Linux,...), and is an open-source software, meaning it can be freely shared. The language is close to the “natural” language, as opposed to other programming languages (C,..) which can be hard to read and understand. Python’s motto : “There should be one — and preferably only one — obvious way to do”

Python is easy to learn, very powerful, with a large number of libraries which allow to adress many different types of tasks:

developing scripts
command line application
web applications
graphical interfaces
interaction with databases
numerical computations
...

Open an interactive session¶

open a terminal
check the version of python using python --version
launch the command
```
$ ipython
```

The IPython software is a python interpreter (there exist others), which is similar to the terminal, which is a bash-interpreter. It will understand and execute commands, which are interactively executed. When we want to write longer scripts, these commands will be written to a python script, a text file containing python commands (ending with .py).

No serious programming course can start without the obligatory hello world script... (sigh...!).

# we ask the python interpreter to print the string "Hello world"
print "Hello World!"

Variables and data structure¶

Variables can be thought of as boxes in the memory of the computer, which can contain values (numbers, logical values, character strings,...). Depending of the language, variables can be typed or not. In python, variables are typed, which means that python will handle these variables differently depending whether they are numerical, decimal, character strings,etc...

Numerical variables¶

Copy paste the following instructions one by one into the ipython interpreter, and try to understand what is happening:

# Note: text which is right of the dash sign are comments
# and will not be interpreted by ipython

# give the variable a the value 2
a = 2

# Check the content of the variable a
a

# What is the type of this variable ?
type(a) # Result: int (integer). The variable contains an integer number

# Let's give a the value 2.0
a = 2.0

# What is the type ?
type(a) # Result : float

# Give b the value 4
b = 4

# Value of b ?
b

# c takes the result of the operation a + b
c = a + b

# What is the value and type of c ?
c
type(c)

# Print all values together
print a,b,c

So far so good: c contains the sum of a and b, as expected ... Why is c a float, and not an integer ?

What happens to c if we change the value of b ?

b = 10
print a,b,c # can you explain the result ?

The symbol “=” is misleading : in math, it means that what is left and right of this sign are equal; in python, it means that we affect the the variable on the left the result of the right-hand side. Since we have not re-affected a new value to c, it still has the value of the first affectation.

Operations on variables¶

We can use standard operations on variables:

a = 4.0
type(a) # check the type of the variable
a/3

a = 4
type(a)
a/3 # oups... what is going on here ?

Note

This is a difference between Python version 2 and Python version 3. In version 2 (the one we are using), the division for integers and floats is different: the division of an integer yields an integer, the division of a float yields a float. In python 3, there is an automatic conversion (called cast) of integers to floats for the division. If we what this behavior also in python 2, we can use a trick (Beware, there are double underscores before and after the “future” !!):

from __future__ import division

Check again :

a = 4
type(a)
a/3

from __future__ import division
a/3 # better now ?

String variables¶

Assigning a value to a variable

   # like for numeric variables we can assign a string value to a variable

   file_name = "refGene"
   type(file_name)

   # check content
   file_name

Note that the value of the variables is printed between quotes, to indicate that this is a string variable. Check the difference with this:

   print file_name

   file_name = refGene  # why do we get an error ??

In the last command, we have asked python to affect to the variable file_name the value of the variable refGene. Since this variable does not exist, we get an error message. This is an important point of the syntax: to specify a character string, we need to enclose in with quotes (single or double).

Concatenation

We can easily concatenate strings using the “+” sign (which has a different behavior between strings and numerical variables).

   # we assign the value '.txt' to a variable called extension
   extension = ".txt"

   # What is the content of extension ?
   print extension

   # What is its type ?
   type(extension)

   # we can concatenate string variables using the "+" sign
   full_name = file_name + extension

   # Beware that you can only concatenate strings
   a = 2.0
   type(a)
   type(file_name)
   file_name + a # Do you understand the error message ?

   # in order to concatenate the file name with a, we need to transform
   # a into a string variable using the str(...) function
   file_name + str(a) # yep !

   # str(..) converts a variable into a character string
   str(27)

We can repeat strings several times using the “*” operator:

   file_name * 10

   a = 3.0
   type(a)

   file_name * a      # why do we get an error here ?
   file_name * int(a) # yep ...

Like the str(...) function, we have used the int(...) function to convert (cast) a float into an integer.

We can use the ``print ... `` function to print several variables:

   print c
   print full_name

   # the print command can handle a list of several variables
   x = 'alfred'
   y = 'sarah'
   z = 'josh'
   print x,y,z

   # The list can contain variables of different type
   print c, file_name, c, 2, 2.0

For longer text, we will use docstrings:

   sentence="""This is a very
   long sentence that spans
   several lines
   """

   print sentence

Note about type conversions (or casts)

We have seen two examples of functions which can convert data types: str(...) and int(...). Here are some examples of data type conversion between the different types of variables:

   ####################################
   ### Cast from an integer
   ####################################
   a = 2
   float(a)  # yields 2.0
   str(a)    # yields '2'. Check the quotes ...

   ####################################
   ### Cast from a float
   ####################################
   a=2.9
   int(a)            # yields 2
   str(a)            # yields '2.9' (again, quotes...)

   #####################################################
   ### Cast from a string
   #####################################################
   int("2")        # Yields 2.
   float("2.9")    # Yields 2.9.
   float("1.3e-3") # Yields 0.0013.

   # Casts to numerical values requires the string to be compatible with a number, obviously...
   int("A")     # Error
   float("A")   #  Error

   int("2.9")   # Why do we get an error ?
   float("2.9") # Why not here ?
   int(float("2.9")) # yep !

Logical (or boolean) variables¶

Boolean variables can only take 2 values, TRUE or FALSE.

   d = True
   c = False

   # We can use logical operators
   d
   not d
   not c
   d and c
   d or c
   d and (not c)

Variables and methods¶

Python is object-oriented language; this means that variables have associated methods. We can execute various operations on a variable by using the corresponding method. These methods are specific to a type of variables !

   dna = "ATACTAGAATTCGATAGGTGGTAAAACCCG"

   # what type of variable ?
   type(dna)

   # Get some help on this type of variables !
   help(str)


   # Check the various methods available !

   ## Type 'q' to exit the help

   # Note that methods which start and end with __
   # ("__method__") should be used as indicated
   #
   # x.__len__() <==> len(x) means that we will use
   # len(dna)
   # but not dna.__len__()


   # All other methods (without the "__") can be used
   # by putting the name of the method as a suffix to the variable name

   # Example: convert the string to lower case:
   dna.lower()

Exercice 1 : Basic string

Check the help on the str type; try to answer the following question:

What is the size of the DNA sequence stored in dna ?

Does the sequence end with CCG ?

Does the sequence contain a restriction site for EcoR1 “GAATTC” ? At which position ?

Convert the sequence to lower case

Compute the complement of the DNA sequence (Check the replace method to replace each nucleotide by its complement...)

More advanced : Can you figure out how to compute the reverse complement ? (Google is your friend...)

  len(dna)                    # use the len() function to find the length of the string
  dna.endswith('CCG')         # returns True or False
  dna.find('GAATTC')          # returns the position of the pattern or -1 if not found
  dna.index('GAATTC')          # returns the position of the pattern or error if not found
  dna.lower()
  # Computing the reverse complement :
  #
  # 'simple solution'
  dna = dna.upper()
  dna_replaced = dna.replace('A','t')
  dna_replaced = dna_replaced.replace('C','g')
  dna_replaced = dna_replaced.replace('G','c')
  dna_replaced = dna_replaced.replace('T','a')
  dna_replaced = dna_replaced.upper()
  dna_rc = dna_replaced[::-1].upper()
  # more advanced solution
  from string import maketrans
  my_trans = maketrans('ACGT','TGCA')
  dna_complement = dna.translate(my_trans)
  dna_rc = dna_complement[::-1]
  # a string is a list of characters; we can use
  # the indexing tricks for a list on a string
  # an increment of -1 reverses a list

Exercice 2 : CG content

Here is another DNA sequence:

dna = "ttcacctatgaatggactgtccccaaagaagtaggacccactaatgcagatcctgtgtgtctagctaagatgtattattctgctgtggatcccactaaagatatattcactgggcttattgggccaatgaaaatatgcaagaaaggaagtttacatgcaaatgggagacagaaagatgtagacaaggaattctatttgtttcctacagtatttgatgagaatgagagtttactcctggaagataatattagaatgtttacaactgcacctgatcaggtggataaggaagatgaagactttcaggaatctaataaaatgcactccatgaatggattcatgtatgggaatcagccgggtctcactatgtgcaaaggagattcggtcgtgtggtacttattcagcgccggaaatgaggccgatgtacatggaatatacttttcaggaaacacatatctgtggagaggagaacggagagacacagcaaacctcttccctcaaacaagtcttacgctccacatgtggcctgacacagaggggacttttaatgttgaatgccttacaactgatcattacacaggcggcatgaagcaaaaatatactgtgaaccaatgcaggcggcagtctgaggattccaccttctacctgggagagaggacatactatatcgcagcagtggaggtggaatgggattattccccacaaagggagtgggattaggagctgcatcatttacaagagcagaatgtttcaaatgcatttttagataagggagagttttacataggctcaaagtacaagaaagttgtgtatcggcagtatactgatagcacattccgtgttccagtggagagaaaagctgaagaagaacatctgggaattctaggtccacaacttcatgcagatgttggagacaaagtcaaaattatctttaaaaacatggccacaaggccctactcaatacatgcccatggggtacaaacagagagttctacagttactccaacattaccaggtaaactctcacttacgtatggaaaatcccagaaagatctggagctggaacagaggattctgcttgtattccatgggcttattattcaactgtggatcaagttaaggacctctacagtggattaattggccccctgattgtttgtcgaagaccttacttgaaagtattcaatcccagaaggaagctggaatttgcccttctgtttctagtttttgatgagaatgaatcttggtacttagatgacaacatcaaaacatactctgatcaccccgagaaagtaaacaaagatgatgaggaattcatagaaagcaataaaatgcatgctattaatggaagaatgtttggaaacct"

How many occurences of ‘gagg’ occur in the sequence?
What is the starting position of the first occurrence of ‘atta’? [report the actual base pair position as a human would understand it]
- For the more advanced : can you print the position of all the occurences of ‘gagg’ (not just the first one) ?
How long is the sequence?
What is the GC content of the sequence? The GC content is the percentage of bases that are either G or C (as a percentage of total base pairs) Print the result as “The GC content of this sequence is XX.XX%” where XX.XX is the actual GC content.

dna.count('gagg')
dna.index('atta')+1 # remember that the index is 0-based !
len(dna)
gc_content = (dna.count('c')+dna.count('g'))/len(dna)
print "The CG content is " + str(gc_content*100) + " percent "

Lists and tuples¶

Lists¶

A list is an array that contains several values:

   ############################
   ### Create an empty list ###
   ############################
   my_list = list() # or my_list = []
   len(my_list) # contains 0 elements
   my_list


   ################################################################################################
   ### Create a list containing pre-defined elements (here gene names)
   ################################################################################################
   gene_list = ["CD3E", "CD4", "CD8", "CD5", "ZAP70", "ITK", "LAT","VAV1","CD3G","TCRA"]
   gene_list

   ##########################################################
   ### Extract a subset from a list
   ##########################################################

   # Beware ! Lists start at position 0
   gene_list[0]
   gene_list[1]

   # Start from the end
   gene_list[-1]
   gene_list[-2]

   # Extract a 'slice' from this list
   # gene_list[start:end] # from start to end-1
   gene_list[0:2]
   gene_list[0:3]
   gene_list[0:4]
   gene_list[0:5]

   # Extract the last elements from a list
   # gene_list[start:] # from start to the end
   gene_list[1:]
   gene_list[5:]

   # Extract the first elements
   # gene_list[:end] # from the first to position end-1
   gene_list[:4]

   # Specify a particular increment
   # gene_list[start:end:p] # from start to end-1 in steps p
   gene_list[0:10:2]
   gene_list[0::2] # same as gene_list[0:10:2]
   gene_list[::2] # same as gene_list[0::2]
   gene_list[::-1] # list in reverse order

   ########################################
   ### Modify elements of a list        ###
   ########################################
   gene_list[5:] = ['CD52','SATB1']
   gene_list
   len(gene_list)
   # Note: we just modified the length of the list by
   # replacing the last 5 elements of the list by
   # a list containing 2 elements

   ####################################################
   ### a string is implicitely considered as a list ###
   ### whose elements cannot be modified            ###
   ####################################################

   dna = "ATACTAGAATTCGATAGGTGGTAAAACCCG"
   dna[3:]
   dna[::-1] # what do you get with an increment of -1 ?
   dna[25:]= 'TTTTT' # why do you get an error message ?

Like any other variable, lists have associated methods:

   ########################################
   #### Get help about the class 'list' ###
   ########################################
   help(list)
   # q to quit

   ########################################
   ### Some standard methods           ####
   ########################################
   len(gene_list)                            # Number of elements
   gene_list + gene_list             # concatenate two lists
   'LAT' in gene_list                        # Is this element contained in the list ?
   del gene_list[5]                          # delete the 6th element
   gene_list*3                               # concatenate 3 times
   gene_list.append('BCL2')                  # add BCL2 to the list
   gene_list
   gene_list.count('CD3E')           # occurences of  'CD3E'
   gene = gene_list.pop(2)           # delete and return the 3rd element of the list
   gene
   gene_list.sort()                          # sort the list
   gene_list.sort(reverse=True)
   gene_list.insert(2,'P53')         # insert an element before a position
   gene_list.remove('P53')           # delete an element from the list

Exercice 3 : Counting birds ...

The number of birds banded at a series of sampling sites has been counted by your field crew and entered into the following list. Counts are entered in order and sites are numbered starting at one. Cut and paste the list into your assignment and then answer the following questions by printing them to the screen. Some Python functions that will come in hand include len(), max(), min(), and sum():

number_of_birds = [28, 32, 1, 0, 10, 22, 30, 19, 145, 27, 36,
              25, 9, 38, 21, 12, 122, 87, 36, 3, 0, 5, 55,
              62, 98, 32, 900, 33, 14, 39, 56, 81, 29, 38,
              1, 0, 143, 37, 98, 77, 92, 83, 34, 98, 40,
              45, 51, 17, 22, 37, 48, 38, 91, 73, 54, 46,
              102, 273, 600, 10, 11]

How many sites are there? Hint: the function len() works on lists as well as strings.
How many birds were counted at site 42? Remember, the number of the site and the number of its position may not be exactly the same.
How many birds were counted at the last site? Have the computer choose the last site automatically in some way, not by manually entering its position.
What is the total number of birds counted across all of the sites?
How often did we count zero birds ?
What is the smallest number of birds counted?
What is the largest number of birds counted?
What is the average number of birds seen at a site? You will need to combine two built-in functions. Print the result using the print function using an informative text !

   len(number_of_birds)
   number_of_birds[41] # beware that the first element has index 0 !
   number_of_birds[-1] # simplest solution
   number_of_birds[len(number_of_birds)-1] # a bit more complicated...
   sum(number_of_birds)
   min(number_of_birds)
   max(number_of_birds)
   average = sum(number_of_birds)/len(number_of_birds)

Tuples¶

Tuples are very similar to lists, but their values cannot be modified. While lists are declared with ‘[]’, tuples are declared using ‘()’:

myTuple=('a','b')
myTuple[0]
myTuple[0]='c' # error.

Dictionnaries¶

In lists, each element can only be referenced by its index; dictionnaries extend this, by associting to each element of the dictionnary (‘value’) a ‘key’ : key -> value . IN other languages, this data structure is called hash or associative array.

  #############################################
  ### Create an empty dictionnary           ###
  #############################################
  codon2AA = {}

  ############################################################
  ### Create a dictionnary with keys an values
  ############################################################
  # some codons...
  codon2AA =  {"UUU":"PHE", "UCU":"SER", "CUU":"LEU","AAA":"LYS"}


  ############################################################
  ### Get a value using a key
  ############################################################
  codon2AA['CUU']

  ## if the key is not defined, you get an error
  codon2AA['ABSENT']

  ##########################################
  ### add a key->value pair
  ##########################################
  codon2AA['ACA'] = "THR"
  codon2AA['AUG'] = "START"
  codon2AA

  ##########################################
  ### delete an element
  ##########################################
  ## del deletes a key-> value pair
  del codon2AA["AUG"]
  codon2AA


  ##########################################
  ### Test if a key exists          ###
  ##########################################
  'ACA' in codon2AA

Exercice 4 : Counting dinucleotides ...

Define a dna sequence:

dna = "ttcacctatgaatggactgtccccaaagaagtaggacccactaatgcagatcctgtgtgtctagctaagatgtattattctgctgtggatcccactaaagatatattcactgggcttattgggccaatgaaaatatgcaagaaaggaagtttacatgcaaatgggagacagaaagatgtagacaaggaattctatttgtttcctacagtatttgatgagaatgagagtttactcctggaagataatattagaatgtttacaactgcacctgatcaggtggataaggaagatgaagactttcaggaatctaataaaatgcactccatgaatggattcatgtatgggaatcagccgggtctcactatgtgcaaaggagattcggtcgtgtggtacttattcagcgccggaaatgaggccgatgtacatggaatatacttttcaggaaacacatatctgtggagaggagaacggagagacacagcaaacctcttccctcaaacaagtcttacgctccacatgtggcctgacacagaggggacttttaatgttgaatgccttacaactgatcattacacaggcggcatgaagcaaaaatatactgtgaaccaatgcaggcggcagtctgaggattccaccttctacctgggagagaggacatactatatcgcagcagtggaggtggaatgggattattccccacaaagggagtgggattaggagctgcatcatttacaagagcagaatgtttcaaatgcatttttagataagggagagttttacataggctcaaagtacaagaaagttgtgtatcggcagtatactgatagcacattccgtgttccagtggagagaaaagctgaagaagaacatctgggaattctaggtccacaacttcatgcagatgttggagacaaagtcaaaattatctttaaaaacatggccacaaggccctactcaatacatgcccatggggtacaaacagagagttctacagttactccaacattaccaggtaaactctcacttacgtatggaaaatcccagaaagatctggagctggaacagaggattctgcttgtattccatgggcttattattcaactgtggatcaagttaaggacctctacagtggattaattggccccctgattgtttgtcgaagaccttacttgaaagtattcaatcccagaaggaagctggaatttgcccttctgtttctagtttttgatgagaatgaatcttggtacttagatgacaacatcaaaacatactctgatcaccccgagaaagtaaacaaagatgatgaggaattcatagaaagcaataaaatgcatgctattaatggaagaatgtttggaaacct"

Create a dictionnary that contains the frequencies of all dinucleotides in this sequence, and print the content of the dictionnary
For the more advanced: can you find a way to sort the output by increasing number of occurences (first the dinucleotides with the lowest number, then increasing) ?