Biopython/C2/Parsing-Data/English
|
|
---|---|
Slide Number 1
Title Slide |
Hello everyone.
Welcome to this tutorial on Parsing Data. |
Slide Number 2
Learning Objectives |
In this tutorial, we will learn to,
|
Slide Number 3
Pre-requisites |
To follow this tutorial you should be familiar with,
Refer to the Python tutorials at the given link. |
Slide Number 4
System Requirement |
To record this tutorial I am using
|
Slide Number 5
Data files |
Scientific data in biology is generally stored in text files such as FASTA, GenBank, EMBL, Swiss-Prot, etc.
Open the website link given below in any web browser. |
Cursor on the web-page.
Download FASTA files |
A web-page opens.
Let us download FASTA and GenBank files for human insulin gene. In the search box type, human insulin. Click on search button. |
Scroll down the page.
|
The web-page shows many files for human insulin gene.
For demonstration, I will select 4 files with the name “Homo sapiens Insulin mRNA”. I will choose files that have less than 500 base pairs. Click on the check box to select the file to download. |
Bring the cursor to the “Send to” option. (Located on the top right hand corner.)
|
Bring the cursor to the “Send to” option, located at the top right corner of the page.
|
Under the heading “Choose file destination” Click on “File” option.
Click on “format” drop down list box. Choose “fasta” and click on “Create file” option. |
Under the heading “Choose destination” Click on “File” option.
You can save this file in any file format listed under “format” drop down list box. Choose “FASTA” from the given options. Then click on “Create file” option. |
A dialog box appears on screen. Click on “Save file” option. | A dialog box appears on the screen.
Select “Open with” and click on OK. |
Cursor on the text editor.
Cursor on the first line. Scroll down. |
The file opens in a text editor.
The first line in each record is an identifier line. It starts with a “greater than” (>) symbol. This is followed by a sequence. Save the file in your home folder as “sequence.fasta”. Close the text editor. |
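The record layout described above can be sketched in plain Python. This is only an illustration of the format, not how Biopython reads files. The function name parse_fasta and the two short records are made up for this sketch. They are not the downloaded insulin records.

```python
def parse_fasta(text):
    """Split FASTA text into (identifier, sequence) pairs."""
    records = []
    identifier, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new identifier line closes the previous record, if any.
            if identifier is not None:
                records.append((identifier, "".join(chunks)))
            identifier, chunks = line[1:], []
        else:
            # Sequence lines are joined into one string per record.
            chunks.append(line)
    if identifier is not None:
        records.append((identifier, "".join(chunks)))
    return records


# Made-up example data (assumption), with the sequence split over two lines.
example = (
    ">record_1 first example\n"
    "ATGGCC\n"
    "CTGTGG\n"
    ">record_2 second example\n"
    "ATGCGT\n"
)

fasta_records = parse_fasta(example)
for identifier, sequence in fasta_records:
    print(identifier, len(sequence))
```

Each identifier line starts a new record, and every line after it, up to the next “greater than” symbol, belongs to that record's sequence.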
Cursor on the web-page. | Follow the same steps as above to download the same files in GenBank format.
Select the file format as GenBank. Create the file and open it with a text editor. |
Scroll down. | Notice that the sequence file in GenBank format has more features than a FASTA file.
Save the file as sequence.gb in your home folder. Close the text editor. |
Click on the check boxes.
Select Human insulin gene complete cds, click on the check box. |
For demonstration purposes, we need a FASTA file with a single record.
For this, clear the earlier selection by again clicking on the check boxes. Now select the file “Human insulin gene complete cds”. Click on the check box. |
Save the file as insulin.fasta. | Follow the same steps shown earlier to save the file in the home folder.
Save the file as insulin.fasta. |
Cursor on the text editor.
Close the text editor. |
Biological data stored in these files can be extracted and modified using Biopython libraries.
Close the text editor. |
Slide Number 6
Parsing |
Extracting data from data files is called parsing.
Most file formats can be parsed using functions available in the SeqIO module. The most commonly used functions of the SeqIO module are parse, read, write and convert. |
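This tutorial demonstrates parse and read. As a rough sketch of the other two functions, the lines below use write and convert on in-memory text instead of files. The two records are made up for illustration. The "tab" output format is chosen here only because it is easy to inspect. It is not used elsewhere in this tutorial.

```python
from io import StringIO

from Bio import SeqIO

# Made-up FASTA text (assumption), standing in for a downloaded file.
fasta_text = (
    ">example_1 made-up record\n"
    "ATGGCCCTGTGG\n"
    ">example_2 made-up record\n"
    "ATGCGTAA\n"
)

# parse gives the records; write sends them to an output handle
# and returns the number of records written.
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))
out_handle = StringIO()
written = SeqIO.write(records, out_handle, "fasta")
print(written)

# convert combines parse and write in a single call.
tab_handle = StringIO()
converted = SeqIO.convert(StringIO(fasta_text), "fasta", tab_handle, "tab")
print(tab_handle.getvalue())
```

With real files, the StringIO handles can be replaced by file names such as "sequence.fasta".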
Slide number 6
Open the terminal using ctrl, alt and t keys. |
Open the terminal by pressing ctrl, alt and t keys simultaneously. |
Type ipython at the prompt.
ipython |
Start IPython by typing ipython at the prompt.
|
Cursor on the terminal. | Next, import the SeqIO module from the Bio package. |
Type,
>>> from Bio import SeqIO
Press enter |
At the prompt type,
Press enter |
(Open the file in text editor and scroll down) | We will start with the most important function, “parse”.
For demonstration, we will use the FASTA file which we had downloaded earlier from the database. |
Type,
for seq_record in SeqIO.parse("sequence.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))
Highlight all the lines. Press enter key twice to get the output. |
For simple FASTA parsing, type the following at the prompt.
Here we are using the parse function to read the contents of the sequence.fasta file. For the output, we print the record id, the sequence present in the record and the length of the sequence. Notice that the parse function reads sequence data as sequence record (SeqRecord) objects. It is generally used with a for loop. It takes two arguments: the first is the name of the file to read, and the second specifies the file format. Press the enter key twice to get the output. |
Highlight the first line. | The output shows the identifier line, followed by the sequence contained in the file.
It also shows the length of the sequence, for all the records in the file. Notice that the FASTA format does not specify the alphabet. So, the output does not specify it as a DNA sequence. |
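As a small sketch of what parse returns, the lines below read two made-up records from in-memory text instead of sequence.fasta. The record names and sequences are assumptions for illustration. With the downloaded file, the ids and lengths will differ.

```python
from io import StringIO

from Bio import SeqIO

# Made-up FASTA text (assumption), standing in for sequence.fasta.
fasta_text = (
    ">example_1 made-up record\n"
    "ATGGCCCTGTGG\n"
    ">example_2 made-up record\n"
    "ATGCGTAA\n"
)

# parse yields one sequence record at a time; list() collects them all.
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

ids = [record.id for record in records]
lengths = [len(record) for record in records]
print(ids)
print(lengths)

# The full identifier line is kept in the description attribute.
print(records[0].description)
```

Collecting the records into a list is convenient for small files. For large files, the for loop shown in the tutorial avoids holding every record in memory.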
Type,
|
The same steps can be repeated for parsing GenBank file.
|
from Bio import SeqIO
for seq_record in SeqIO.parse("sequence.gb", "genbank"):
    print(seq_record.id)
    print(seq_record.seq)
    print(len(seq_record)) |
Press enter key twice to get the output.
|
Cursor on the terminal. | Similarly, Swiss-Prot and EMBL files can be parsed using the same code as above. |
Cursor on the terminal. | If your file contains a single record then type the following lines for parsing. |
Type,
>>> record = SeqIO.read("insulin.fasta", "fasta")
>>> record
Press enter |
Here, as an example, we will use the previously saved FASTA file with a single record, that is, insulin.fasta.
|
Cursor on the terminal. | The output shows the contents of the file insulin.fasta.
|
At the prompt type,
>>> record.seq
|
We can also view the individual attributes of this record as follows.
|
Cursor on the terminal. | The output shows the sequence present in the file. |
Type,
|
To view the identifiers for this record, type, record dot id.
The output shows the GI number and accession number etc. You can use the function described above to parse the data files of your choice. |
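The read function expects exactly one record. The sketch below, which uses made-up in-memory records instead of insulin.fasta, shows read returning a single record and raising a ValueError when the input holds more than one record. The names single, multiple and read_failed are chosen only for this illustration.

```python
from io import StringIO

from Bio import SeqIO

# Made-up FASTA text (assumption): one input with a single record,
# and one input with two records.
single = ">example_1 made-up record\nATGGCCCTGTGG\n"
multiple = single + ">example_2 made-up record\nATGCGTAA\n"

# read returns the record itself, not an iterator.
record = SeqIO.read(StringIO(single), "fasta")
print(record.id, len(record.seq))

# With more than one record, read raises ValueError; use parse instead.
try:
    SeqIO.read(StringIO(multiple), "fasta")
    read_failed = False
except ValueError as error:
    read_failed = True
    print("read refused the input:", error)
```

So, use read for files like insulin.fasta and parse for files like sequence.fasta.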
Slide Number 9
Summary |
Now let's summarize,
In this tutorial we have learnt to,
|
Slide Number 10
Assignment
|
Now for the assignment,
|
Type at the prompt,
>>> for record in SeqIO.parse("sequence.fasta", "fasta"):
...     print(record.id)
...     print(record.seq.reverse_complement()) |
Your completed assignment should have the following lines of code.
|
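To see what reverse_complement does on its own, here is a minimal sketch with a short made-up DNA sequence. The sequence is an assumption for illustration and is not taken from the downloaded files.

```python
from Bio.Seq import Seq

# Made-up six-base DNA sequence (assumption).
dna = Seq("ATGCCG")

# The complement of ATGCCG is TACGGC; reading it backwards gives CGGCAT.
rc = dna.reverse_complement()
print(rc)
```

The assignment code applies the same method to the seq attribute of every record in sequence.fasta.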
Slide Number 11
Acknowledgement |
The video at the following link summarizes the Spoken Tutorial project.
Please download and watch it. |
Slide Number 12 | The Spoken Tutorial Project Team:
Conducts workshops and gives certificates to those who pass an on-line test. For more details, please write to us. |
Slide number 13 | The Spoken Tutorial Project is funded by NMEICT, MHRD, Government of India.
More information on this Mission is available at the link shown. |
This is Snehalatha from IIT Bombay signing off. Thank you for joining. |