Biopython/C2/Blast/English

From Script | Spoken-Tutorial
Jump to: navigation, search
Visual Cue
Narration
Slide Number 1

Title Slide

Welcome to this tutorial on BLAST using Biopython tools.
Slide Number 2

Learning Objectives

In this tutorial, we will learn to
  • Run BLAST for the query sequence using Biopython tools
  • And parse the BLAST output for further analysis.
Slide Number 3

Pre-requisites

To follow this tutorial you should be familiar with,
  • Undergraduate Biochemistry or Bioinformatics
  • And basic Python programming

Refer to the Python tutorials at the given link.

Slide Number 4

System Requirement

To record this tutorial, I am using,
  • Ubuntu OS version. 14.10
  • Python version 2.7.8
  • Ipython interpretor version 2.3.0
  • Biopython version 1.64
  • And a working Internet connection
Slide Number 5

About BLAST


BLAST is the acronym for Basic Local Alignment Search Tool.


It is an algorithm for comparing sequence information.

Slide Number 6

About BLAST

The program
  • compares nucleotide or protein sequences to sequences in databases
  • and calculates the statistical significance of matches.
Slide Number 7

About BLAST

There are two different ways to run BLAST':
  • Local BLAST on your machine.
  • Run BLAST over Internet through NCBI servers.
Slide Number 8

About BLAST

Running BLAST in Biopython has two steps.


First, run BLAST for your query sequence and get some output.


Second, parse the BLAST output for further analysis.

Cursor on Slide Number 8 We will open the terminal and run BLAST for a nucleotide sequence.


Open the terminal by pressing Ctrl, Alt and T keys simultaneously.

Cursor on the terminal.

Type, ipython and press enter.

At the prompt type ipython and press Enter.
At the prompt type:

>>> from Bio.Blast import NCBIWWW


Press enter.

In this tutorial I will demonstrate how to run BLAST over internet using NCBI BLAST service.


Type the following at the prompt:


Import NCBIWWW module from Bio.Blast package.


Press Enter.

Type,


>>> result = NCBIWWW.qblast("blastn", "nt", "186429")

Next to run the BLAST over internet:


Type the following at the prompt,


We will use qblast function in the NCBIWWW module.

Cursor on the terminal.


Highlight the arguments.


>>> result = NCBIWWW.qblast("blastn", "nt", "186429")

The qblast function takes three arguments:


The first argument is the blast program to use for the search.


Second specifies the databases to search against.


The third argument is a your query sequence.

Cursor on the terminal.


The input for the query sequence can be in the form of GI number or a FASTA file.

Or it can also be a sequence record object.

Highlight the query. For this demonstration, I am using the GI number for a nucleotide sequence.


The GI number is for a nucleotide sequence of insulin.

Slide Number 9

qblast function.

The qblast function also takes a number of other option arguments.


These arguments are analogous to the different parameters you can set on the BLAST web page.


The qblast function will return the BLAST results in xml format.

Cursor on the terminal. Back to the terminal.


We have to use the appropriate Blast program,


depending on whether our query is a nucleotide or protein sequence.


Since our query is a nucleotide, we will use blastn program.


And nt refers to the nucleotide database.

Slide Number 10


NCBI BLAST website

Details about this are available at the NCBI BLAST webpage.


http://blast.ncbi.nlm.nih.gov/Blast.cgi

Cursor on the terminal. The blast output is stored in the variable result in the form of an xml file.

Press Enter.

Cursor on the terminal. Depending upon the speed of your Internet, it may take a few minutes to complete the BLAST search.
Type,


>>> save_file = open("blast.xml", "w")

>>> save_file.write(result.read())

>>> save_file.close()

>>> result.close()


Press enter

It is important to save the xml file on the disk before processing further.


Type the following lines to save the xml file.


These lines of code will save the search result as blast.xml in the home folder.

Navigate to your home folder and locate the file.


Click on the file and check the contents of the file.

Open the text file.


from Bio.Blast import NCBIWWW

fasta_string = open("insulin.fasta").read()

result = NCBIWWW.qblast("blastn", "nt", fasta_string)

save_file = open("blast.xml", "w")

save_file.write(result.read())

save_file.close()

result.close()

result= open("blast.xml")

Use the code in this text file if you want to use a FASTA file as a query.



Open the text file.


from Bio.Blast import NCBIWWW

from Bio import SeqIO

record = SeqIO.read("insulin.fasta", format="fasta")

result = NCBIWWW.qblast("blastn", "nt", record.seq)

save_file = open("blast.xml", "w")

save_file.write(result.read())

save_file.close()

result.close()

result= open("blast.xml")


Here is the code if you want to use sequence record object from a FASTA file as a query.
Type,

>>>result = open("my_blast.xml")


press Enter.


Back to the terminal.


The next step is to parse the file to extract data.


First step in parsing is to open the xml file for input.


Type the following at the prompt:


Press Enter.

>>> from Bio.Blast import NCBIXML


press enter.

Next import the module NCBIXML from Bio.Blast package.


Press Enter.

Type,


>>>records = NCBIXM.parse(result)

>>>blast_record = records.next()

press enter

Type the following lines to parse the Blast output.


A BLAST record contains all the information you want to extract from the BLAST output.

Cursor on the terminal. Let us print out some information about all hits in our blast report greater than a particular threshold.


Type the following


For a match to be significant, expect score should be less than 0.01.

Type the following code.


for alignment in blast_record.alignments:

for hsp in alignment.hsps:

if hsp.expect <0.01:

print('****Alignment****')

print('sequence:', alignment.title)

print('length:', alignment.length)

print('score:', hsp.score)

print('gaps:', hsp.gaps)

print('e value:', hsp.expect)

print(hsp.query[0:90] + '...')

print(hsp.match[0:90] + '...')

print(hsp.sbjct[0:90] + '…')


Press Enter key twice.

For each hsp that is high scoring pair, we get the title, length, gaps and expect value.


We also print strings containing the query (query), the aligned database sequence (sbjct) and string specifying the match and mismatch positions (match).


Press Enter key twice to get output.

Highlight the output. Observe the output:

For each alignment we have length, score, gaps, evalue and strings.

Cursor on the terminal. You can extract the required information using other functions available in Bio.Blast package.


We have come to the end of this tutorial.

Slide Number 11

Summary


Let's summarize,

In this tutorial we have learnt to,

  • Run BLAST for the query nucleotide sequence using GI number.
  • And parse the BLAST output file using Bio.Blast.Record module.
Slide Number 12

Assignment


(use blastp)

result = NCBIWWW.qblast("blastp", "nr", 386828)

non-redundant protein database (nr).

For the assignment,


Run BLAST

Search for a protein sequence of your choice.


Save the output file and parse the data contained in the file.


Your completed assignment should have the following lines of code :

As shown in this text file.


Since our query is protein sequence, here we have used blastp program.


And nr non-redundant protein database for the BLAST search.

Slide Number 13

Acknowledgement

The video at the following link summarizes the Spoken Tutorial project.

Please download and watch it.

Slide Number 14 The Spoken Tutorial Project Team conducts workshops and gives certificates for those who pass an online test.

For more details, please write to us.

Slide number 15 Spoken Tutorial Project is funded by NMEICT, MHRD, Government of India.

More information on this Mission is available at the link shown.

Slide number 15 This is Snehalatha from IIT Bombay signing off. Thank you for joining.

Contributors and Content Editors

Nancyvarkey, Snehalathak