Biopython/C2/Parsing-Data/English

From Script | Spoken-Tutorial
Revision as of 10:49, 2 September 2015 by Snehalathak (Talk | contribs)

Visual Cue
Narration
Slide Number 1

Title Slide

Hello everyone.

Welcome to this tutorial on Parsing Data.

Slide Number 2

Learning Objectives

In this tutorial, we will learn to:
  • Download FASTA and GenBank files from the NCBI database website.
  • Parse data files using functions in the Sequence Input/Output (SeqIO) module.
Slide Number 3

Pre-requisites

To follow this tutorial you should be familiar with,
  • Undergraduate Biochemistry or Bioinformatics
  • And basic Python programming

Refer to the Python tutorials at the given link.

Slide Number 4

System Requirement

To record this tutorial I am using
  • Ubuntu OS version 14.10
  • Python version 2.7.8
  • IPython interpreter version 2.3.0
  • Biopython 1.64
  • And Mozilla Firefox browser 35.0
Slide Number 5

Data files

Scientific data in biology is generally stored in text files such as FASTA, GenBank, EMBL, Swiss-Prot, etc.


Data files can be downloaded from the database websites.

Open the website link given below in any web browser.

http://www.ncbi.nlm.nih.gov/nucleotide

Cursor on the web-page.

Download FASTA files

A web-page opens.

Let us download FASTA and GenBank files for the human insulin gene.

In the search box, type human insulin.

Click on the search button.

Scroll down the page.


Click on check box.

The web-page shows many files for human insulin gene.

For demonstration, I will select 4 files with the name “Homo sapiens Insulin mRNA”.

I will choose files that have less than 500 base pairs.

Click on the check box to select the file to download.

Bring the cursor to the “Send to” option. (Located on the top right hand corner.)


Click on the selection button.

Bring the cursor to the “Send to” option, located at the top right corner of the page.


Click on the small selection button with a down arrow present next to the “Send to” button.

Under the heading “Choose file destination” Click on “File” option.

Click on “format” drop down list box.

Choose “fasta” and click on “Create file” option.

Under the heading “Choose destination” Click on “File” option.

You can save this file in any file format listed under “format” drop down list box.

Choose “FASTA” from the given options.

Then click on “Create file” option.

A dialog box appears on screen. Click on “Save file” option. A dialog box appears on the screen.

Select “Open with” and click on OK.

Cursor on the text editor.

Cursor on the first line.

Scroll down.

The file opens in a text editor.


The file shows 4 records, since we had selected four files to download.

The first line in each record is an identifier line.

It starts with a “greater than” (>) symbol.

This is followed by a sequence.
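
The record layout described above can be sketched in plain Python. This is only an illustration of the file structure, not how Biopython reads FASTA files; the function name split_fasta and the sample records are made up for this example and are not NCBI data.

```python
# Minimal sketch: split FASTA text into records using the ">" identifier lines.
# The sample text is hypothetical and only stands in for a downloaded file.
fasta_text = """>example1 hypothetical record one
ATGGCCCTGTGG
ATGCGC
>example2 hypothetical record two
GGCTTCTTCTAC
"""


def split_fasta(text):
    """Return a list of (identifier_line, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new identifier line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            # Sequence lines of one record are joined into a single string.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


records = split_fasta(fasta_text)
for header, sequence in records:
    print(header, len(sequence))
```

Each identifier line starts one record, and the sequence may span several lines until the next “>” symbol.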

Save the file in your home folder as “sequence.fasta”.

Close the text editor.

Cursor on the web-page. Follow the same steps as above to download the same files in GenBank format.

Select the file format as GenBank.

Create file, open with a text editor.

Scroll down. Notice that the sequence file in GenBank format has more features than a FASTA file.

Save the file as sequence.gb in your home folder.

Close the text editor.

Click on the check boxes.

Select Human insulin gene complete cds,

click on the check box.

For demonstration purposes, we need a FASTA file with a single record.

For this, clear the earlier selection by again clicking on the check boxes.

Now select the file “Human insulin gene complete cds”.

Click on the check box.

Save the file as insulin.fasta. Follow the same steps shown earlier to save the file in the home folder.

Cursor on the text editor.

Close the text editor.

Biological data stored in these files can be extracted and modified using Biopython libraries.

Close the text editor.

Slide Number 6

Parsing

Extracting data from data files is called parsing.

Most file formats can be parsed using functions available in the SeqIO module.

The most commonly used functions of the SeqIO module are parse, read, write and convert.

Slide number 6

Open the terminal using ctrl, alt and t keys.

Open the terminal by pressing ctrl, alt and t keys simultaneously.
Type ipython at the prompt.

$ ipython

Start IPython by typing ipython at the prompt.


Press enter.

Cursor on the terminal. Next, import the SeqIO module from the Bio package.
Type,

>>> from Bio import SeqIO

Press enter

At the prompt type,


from Bio import SeqIO

Press enter

(Open the file in text editor and scroll down) We will start with the most important function “parse”.


For demonstration, I will use the FASTA file with many records, which we downloaded earlier from the database.

Type,

for seq_record in SeqIO.parse("sequence.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))

Highlight all the lines.

Press enter key twice to get the output.

For simple FASTA parsing, type the following at the prompt.

Here we are using the parse function to read contents of sequence.fasta file.

For the output, we print the record id, the sequence present in the record and the length of the sequence.

Also notice that the parse function reads sequence data as sequence record (SeqRecord) objects.

It is generally used with a for loop.

It accepts two arguments: the first is the name of the file to read.

The second specifies the file format.
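
As a minimal sketch of these two arguments, the example below passes an in-memory handle instead of a file name, so that it runs without any downloaded file. The records are made up for illustration, and it assumes Biopython is installed. The parse function returns an iterator, so list() is used here only to collect the records.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical FASTA text standing in for sequence.fasta.
fasta_text = (
    ">example1 first hypothetical record\n"
    "ATGGCC\n"
    ">example2 second hypothetical record\n"
    "GGCTTCTTC\n"
)

# First argument: a file name or an open handle. Second argument: the format.
handle = StringIO(fasta_text)
records = list(SeqIO.parse(handle, "fasta"))

# Each item is a sequence record object with an id and a sequence.
for seq_record in records:
    print(seq_record.id, len(seq_record))
```

For a real file, the handle can be replaced by the file name, as in the code typed at the prompt.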

Press enter key twice to get the output.

Highlight the first line. For every record in the file, the output shows the record id, followed by the sequence and the length of the sequence.

Notice that the FASTA format does not specify the alphabet.

So the output does not identify the sequence as a DNA sequence.

Type,


for seq_record in SeqIO.parse("sequence.gb", "genbank"):


The same steps can be repeated for parsing GenBank file.


For demonstration, we will use the GenBank file which we downloaded earlier from the database.


Press up arrow key to get the lines of code which we had used earlier.


Change the file name to sequence.gb


Change the file format to genbank.


The rest of the code remains same.

from Bio import SeqIO

for seq_record in SeqIO.parse("sequence.gb", "genbank"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))

Press enter key twice to get the output.


Here too the output shows the record id, sequence and the length of the sequence for all the records in the file.


Notice that the GenBank format specifies the sequence as a DNA sequence.

Cursor on the terminal. Similarly, Swiss-Prot and EMBL files can be parsed using the same code as above.
Cursor on the terminal. If your file contains a single record, type the following lines to parse it.
Type,


>>> from Bio import SeqIO

>>> record = SeqIO.read("insulin.fasta", "fasta")

>>> record

Press enter

Here we will use the previously saved FASTA file with a single record, that is insulin.fasta as an example.


Notice that we have used the read function instead of the parse function.


Press enter.

Cursor on the terminal. The output shows the contents of the file insulin.fasta.


It shows the sequence as a sequence record object.


And other attributes such as GI, accession number and description.
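
The difference between read and parse can be sketched as follows, assuming Biopython is installed. The records are made up for illustration and are read from in-memory handles rather than from insulin.fasta. The read function expects exactly one record and raises a ValueError otherwise.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical FASTA text: one record, and the same text with a second record.
one_record = ">example1 hypothetical single record\nATGGCCCTG\n"
two_records = one_record + ">example2 hypothetical second record\nGGCTTC\n"

# read() returns a single sequence record object.
record = SeqIO.read(StringIO(one_record), "fasta")
print(record.id)
print(record.description)

# With more than one record, read() raises ValueError; use parse() instead.
try:
    SeqIO.read(StringIO(two_records), "fasta")
    read_failed = False
except ValueError as error:
    read_failed = True
    print("read() refused the input:", error)
```

For a FASTA record, the id is the first word of the identifier line and the description is the whole line without the “>” symbol.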

At the prompt type,

>>> record.seq


press enter

We can also view the individual attributes of this record as follows.


At the prompt type, record dot seq


press enter

Cursor on the terminal. The output shows the sequence present in the file.
type,


>>> record.id


press enter


To view the identifiers for this record, type, record dot id.


press enter

The output shows the GI number and accession number etc.

You can use the functions described above to parse the data files of your choice.

Slide Number 9

Summary

Now let's summarize.

In this tutorial we have learnt to:


Download FASTA and GenBank files from the NCBI database website.


And use the parse and read functions from the SeqIO module to extract data such as record ids, descriptions and sequences from FASTA and GenBank files.

Slide Number 10

Assignment


Now for the assignment,


Download FASTA files for nucleotide sequences of your choice from the NCBI database.


Convert the sequences in the file to their reverse complements.

Type at the prompt,


>>> from Bio import SeqIO
>>> for record in SeqIO.parse("sequence.fasta", "fasta"):
...     print(record.id)
...     print(record.seq.reverse_complement())

Your completed assignment should have the following lines of code.


Use the parse function to load nucleotide sequences from the FASTA file.


Next, print the reverse complements using the Seq object's built-in reverse_complement method.
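
To see what a reverse complement is, here is a minimal plain-Python sketch. It is not the Biopython implementation: the helper name reverse_complement and the lookup table are written for this example and handle only the unambiguous bases A, C, G and T, whereas the Seq method also handles ambiguity codes.

```python
# Lookup table: each base maps to its complement (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Complement every base, then reverse the resulting string."""
    return sequence.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))  # GCAT
```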

Slide Number 11

Acknowledgement

The video at the following link summarizes the Spoken Tutorial project.

Please download and watch it.

Slide Number 12

The Spoken Tutorial Project Team:

Conducts workshops and gives certificates to those who pass an on-line test.

For more details, please write to us.

Slide Number 13

The Spoken Tutorial Project is funded by NMEICT, MHRD, Government of India.

More information on this Mission is available at the link shown.

This is Snehalatha from IIT Bombay signing off. Thank you for joining.

Contributors and Content Editors

Snehalathak