From Comaiwiki

Revision as of 07:44, 14 June 2009 by Root


Learning Python

This page is an invitation to learn Python and apply it to bioinformatics. It draws on my limited experience in this area to show that, even if you have never been exposed to any programming, it is possible to learn enough Python to write programs that analyze sequence data and produce results. In addition, programming is fun, and it beats Sudoku and crossword puzzles as a constructive brain teaser and pastime. --Luca Comai

What is Python?

Python is a modern programming language that is easy to write and use. It resembles Perl, another very common language. Python code is easy to read because it uses indentation to separate blocks of code and to convey their hierarchy.
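As a small illustration of how indentation conveys structure, here is a minimal sketch. The function name gc_fraction and the example sequence are made up for this page. Each deeper level of indentation marks a block that belongs to the line above it.

```python
def gc_fraction(sequence):
    """Return the fraction of G and C bases in a DNA sequence."""
    # Everything indented under "def" is the body of the function.
    if len(sequence) == 0:
        # This line runs only when the "if" condition is true.
        return 0.0
    gc_count = 0
    for base in sequence.upper():
        # Indented under "for": repeated once per base.
        if base in "GC":
            # Indented under "if": runs only for G or C.
            gc_count += 1
    # Back at the function's own level: runs once, after the loop ends.
    return float(gc_count) / len(sequence)


print(gc_fraction("ATGCGC"))
```

There are no braces or "end" keywords; moving a line left or right changes which block it belongs to.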

How I learned

I first tried Perl, another programming language, and became frustrated by my inability to understand the syntax. In hindsight I am not sure why, because Perl is really not difficult. This just goes to show that learning to program may seem harder than it is. I dropped the effort. A few months later, Victor Missirian, a bioinformatician who works with me, showed me the Python tutorial written by the inventor of Python, Guido van Rossum. It was very simple, and so I decided to try again. I downloaded the Python package, bought a couple of books, and never looked back. I started by writing a program that would inventory and report all the restriction fragments in a genome. Having a specific objective helped me focus and kept me motivated. Since then I have written dozens of little programs to do all kinds of things, such as parsing Illumina sequencing files, grading my class, performing in silico comparative genomic hybridization, and so on. Now, do not get me wrong: I am really a beginner, and there is a lot that I do not know and will most likely never learn. But this is really the good news. You do not need a degree in computer science to have fun and be productive.

How to learn

Get Python

Apple computers come with a version of Python installed. It is useful, however, to download a Python package from the official Python website and install it. I have version 2.5.2, but you may want to install 2.6. Version 3.0 is also available, but many programs written for 2.5 or lower will not work in 3.0 without considerable editing, so I would stick to 2.6 for the time being. The installer will place on your computer the Python program, an Integrated Development Environment (IDE) called IDLE, and plenty of documentation. Launch IDLE and start programming, or get more help on IDLE.
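If you are not sure which version you ended up with, Python can report it itself. The sketch below uses only the standard sys module and is written so that it should run unchanged on both the 2.x and 3.x series. The helper name is_python3 is made up for this page.

```python
import sys

# sys.version_info holds the interpreter version as numbers,
# for example (2, 5, 2, ...) for Python 2.5.2.
major, minor = sys.version_info[0], sys.version_info[1]
print("Running Python %d.%d" % (major, minor))


def is_python3():
    """Return True when running on the 3.x series (hypothetical helper)."""
    return sys.version_info[0] >= 3


if is_python3():
    print("Programs written for Python 2.5 or lower may need editing here.")
```

You can type the same lines at the IDLE prompt one at a time.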

Start practicing

With Python open in IDLE or in the Terminal (on Apple computers), follow the tutorial that came with the Python installation, or use the provided link. I have found two books very useful. The first is Learning Python by Mark Lutz. The second is Python Cookbook by Alex Martelli.

Examples of programs

Bin counter

Sequencing on the Illumina GAII produces sequence reads that can be mapped to a reference genome, for example with the Illumina program Eland. An Eland output line looks like this:


Eland reports the chromosome and the position on the reference sequence where the read maps. For example, above a read is mapped on the chloroplast chromosome at position 123252. You can use read position and frequency as a measure of input DNA composition. To do this, you need to count the reads per unit length of genome, or per "bin". Depending on coverage, you may want to choose a bin size anywhere between a single nucleotide and several thousand nucleotides. The program outputs a file with the chromosome, the bin, and the number of reads assigned to that bin. The program file is bin_counter_2.2.txt.
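The counting step described above can be sketched in a few lines. This is not the bin_counter_2.2 program itself, only a minimal illustration of the idea. It assumes the Eland lines have already been parsed into (chromosome, position) pairs and that positions are 1-based. The function names, the bin size of 1000, and all example hits other than chloroplast 123252 are made up.

```python
from collections import defaultdict


def count_reads_per_bin(hits, bin_size=1000):
    """Count mapped reads per (chromosome, bin index) pair.

    hits: iterable of (chromosome, position) pairs; positions are
          assumed to be 1-based.
    bin_size: width of each bin in nucleotides.
    """
    counts = defaultdict(int)
    for chrom, pos in hits:
        # Positions 1..bin_size go to bin 0, the next bin_size to bin 1, etc.
        bin_index = (pos - 1) // bin_size
        counts[(chrom, bin_index)] += 1
    return dict(counts)


def format_counts(counts, bin_size=1000):
    """Return tab-separated lines: chromosome, bin start position, reads."""
    lines = []
    for (chrom, bin_index), n_reads in sorted(counts.items()):
        bin_start = bin_index * bin_size + 1
        lines.append("%s\t%d\t%d" % (chrom, bin_start, n_reads))
    return lines


# Hypothetical hits; only the first pair comes from the text above.
example_hits = [
    ("chloroplast", 123252),
    ("chloroplast", 123900),
    ("chloroplast", 125001),
    ("chr1", 10),
]
example_counts = count_reads_per_bin(example_hits, bin_size=1000)
for line in format_counts(example_counts, bin_size=1000):
    print(line)
```

Here each bin is labeled by its first position. Labeling by bin index or by midpoint would work just as well, as long as the choice is consistent across samples.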
