Biopython/C2/Parsing-Data/English-timed
Revision as of 23:30, 2 August 2016 by Sandhya.np14 (Talk | contribs)
Time | Narration |
---|---|
00:01 | Hello everyone. |
00:02 | Welcome to this tutorial on Parsing Data. |
00:06 | In this tutorial, we will learn to download FASTA and GenBank files from the NCBI database website |
00:14 | and to parse data files using functions in the Sequence Input/Output (SeqIO) module. |
00:19 | To follow this tutorial, you should be familiar with undergraduate biochemistry or bioinformatics |
00:26 | and basic Python programming. |
00:30 | Refer to the Python tutorials at the given link. |
00:34 | To record this tutorial, I am using: * Ubuntu OS version 14.10 |
00:40 | * Python version 2.7.8 |
00:44 | * IPython interpreter version 2.3.0 |
00:48 | * Biopython 1.64 and * Mozilla Firefox browser 35.0. |
00:56 | Scientific data in biology is generally stored in text file formats such as FASTA, GenBank, EMBL and Swiss-Prot. |
01:07 | Data files can be downloaded from the database websites. |
01:12 | Open the website link given below in any web browser. |
01:17 | A web-page opens. |
01:19 | Let us download FASTA and GenBank files for the human insulin gene. |
01:25 | In the search box, type: human insulin and click on the Search button. |
01:31 | The web-page shows many files for the human insulin gene. |
01:35 | For demonstration, I will select 4 files with the name “Homo sapiens Insulin mRNA”. |
01:43 | I will choose files that have less than 500 base pairs. |
01:48 | Click on the check-box next to each file to select it for download. |
01:56 | Bring the cursor to the “Send to” option, located at the top right corner of the page. |
02:02 | Click on the small selection button with a down arrow, present next to the “Send to” button. |
02:09 | Under the heading “Choose destination”, click on the File option. |
02:13 | You can save this file in any of the file formats listed in the Format drop-down list box. |
02:21 | Choose FASTA from the given options. |
02:25 | Then click on the Create file option. |
02:29 | A dialog-box appears on the screen. |
02:32 | Select Open with and click on OK. |
02:36 | A file opens in a text editor. |
02:39 | The file shows 4 records, since we had selected four files to download. |
02:46 | The first line in each record is an identifier line. |
02:50 | It starts with a “greater than (>)” symbol. |
02:53 | This is followed by a sequence. |
02:56 | Save the file in your home folder as “sequence.fasta”. |
03:01 | Close the text editor. |
03:03 | Follow the same steps as above, to download the files in GenBank format |
03:08 | for the same files selected earlier. |
03:12 | Select the file format as GenBank. |
03:16 | Create the file and open it with a text editor. |
03:21 | Notice that the sequence file in GenBank format has more features than a FASTA file. |
03:27 | Save the file as sequence.gb in your home folder. Close the text editor. |
03:34 | For demonstration purposes, we need a FASTA file with a single record. |
03:39 | For this, clear the earlier selection by again clicking on the check boxes. |
03:48 | Now, select the file “Human insulin gene complete cds”. |
03:54 | Click on the check box. |
03:57 | And follow the same steps shown earlier to save the file in the home folder. |
04:01 | Save the file as insulin.fasta. |
04:08 | Biological data stored in these files can be extracted and modified using Biopython libraries. |
04:16 | Close the text editor. |
04:19 | Extracting data from data files is called parsing. |
04:23 | Most file formats can be parsed using functions available in the SeqIO module. |
04:30 | The most commonly used functions of the SeqIO module are: parse, read, write and convert. |
04:38 | Open the terminal by pressing the Ctrl, Alt and T keys simultaneously. |
04:44 | Start IPython by typing ipython at the prompt. Press Enter. |
04:51 | Next, import the SeqIO module from the Bio package. |
04:56 | At the prompt, type: from Bio import SeqIO. Press Enter. |
05:04 | We will start with the most important function “parse”. |
05:07 | For demonstration, I will use a FASTA file that has many records, which we had downloaded earlier from the database. |
05:17 | For simple FASTA parsing, type the following at the prompt. |
05:22 | Here we are using the parse function to read contents of sequence.fasta file. |
05:30 | For the output, print the record id, the sequence present in the record and the length of the sequence. |
05:41 | Also notice that the parse function is used to read sequence data as sequence record (SeqRecord) objects. |
05:48 | It is generally used with a for loop. |
05:52 | It takes two arguments: the first is the name of the file to read the data from. |
05:59 | The second specifies the file format. |
06:02 | Press the Enter key twice to get the output. |
06:07 | The output shows the identifier line, followed by the sequence contained in the file and the length of the sequence, for all the records in the file. |
06:21 | Notice that the FASTA format does not specify the alphabet. |
06:26 | So, the output does not specify it as a DNA sequence. |
06:31 | The same steps can be repeated for parsing a GenBank file. |
06:36 | For demonstration, we will use the GenBank file which we downloaded earlier from the database. |
06:43 | Press the up arrow key to get the lines of code which we had used earlier. |
06:49 | Change the file name to sequence.gb |
06:53 | Change the file format to genbank. |
06:56 | The rest of the code remains the same. |
06:58 | Press the Enter key twice to get the output. |
07:03 | Here too the output shows the record id, sequence and the length of the sequence for all the records in the file. |
07:12 | Notice that the GenBank format specifies the sequence as a DNA sequence. |
07:19 | Similarly, Swiss-Prot and EMBL files can be parsed using the same code as above. |
07:27 | If your file contains a single record then type the following lines for parsing. |
07:34 | Here we will use the previously saved FASTA file with a single record, that is insulin.fasta as an example. |
07:43 | Notice that we have used the read function instead of the parse function. Press Enter. |
07:50 | The output shows the contents of the file insulin.fasta. |
07:55 | It shows the sequence as a sequence record object, |
07:59 | along with other attributes such as GI, accession number and description. |
08:06 | We can also view the individual attributes of this record as follows. |
08:11 | At the prompt, type: record dot seq. Press Enter. |
08:18 | The output shows the sequence present in the file. |
08:22 | To view the identifiers for this record, type, record dot id. Press Enter. |
08:29 | The output shows the GI number, the accession number, etc. |
08:34 | You can use the functions described above to parse the data files of your choice. |
08:40 | Now let's summarize. |
08:42 | In this tutorial, we have learnt: |
08:55 | * To extract data such as record ids, description and sequences from FASTA and GenBank files. |
09:03 | Now, for the assignment: |
09:06 | Download FASTA files for nucleotide sequences of your choice from the NCBI database. |
09:13 | Convert the sequences in the file to their reverse complements. |
09:17 | Your completed assignment should have the following lines of code. |
09:22 | Use the parse function to load nucleotide sequences from the FASTA file. |
09:28 | Next, print reverse complements using the Sequence object’s built-in reverse complement method. |
09:37 | The video at the following link summarizes the Spoken Tutorial project. |
09:42 | Please download and watch it. |
09:44 | The Spoken Tutorial Project team conducts workshops and gives certificates to those who pass an on-line test. |
09:51 | For more details, please write to us. |
09:55 | The Spoken Tutorial Project is funded by NMEICT, MHRD, Government of India. |
10:01 | More information on this mission is available at the link shown. |
10:06 | This is Snehalatha from IIT Bombay, signing off. Thank you for joining. |
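
The narration at 05:17 and 07:27 refers to code typed at the IPython prompt that does not appear in this transcript. Below is a minimal sketch of what the parse and read steps might look like. It assumes Biopython is installed and uses the Python 3 style print() function rather than the Python 2.7 syntax of the recording. The FASTA records are made-up placeholders written to a temporary folder, not the NCBI insulin files, and the names workdir, multi_path, single_path and lengths are illustrative only.

```python
import os
import tempfile

from Bio import SeqIO

# Made-up placeholder records (assumption): the tutorial itself uses
# files downloaded from NCBI and saved in the home folder.
MULTI_FASTA = """>demo1 hypothetical record one
ATGGCCCTGTGGATGCGC
>demo2 hypothetical record two
CTCCTGCCCCTGCTG
"""
SINGLE_FASTA = """>demo3 hypothetical single record
ATGGCCCTGTGG
"""

# Write the sample files to a temporary folder so the sketch is self-contained.
workdir = tempfile.mkdtemp()
multi_path = os.path.join(workdir, "sequence.fasta")
single_path = os.path.join(workdir, "insulin.fasta")
with open(multi_path, "w") as handle:
    handle.write(MULTI_FASTA)
with open(single_path, "w") as handle:
    handle.write(SINGLE_FASTA)

# parse: loop over every record in a file that holds many records.
# First argument is the file name, second is the file format.
lengths = {}
for record in SeqIO.parse(multi_path, "fasta"):
    print(record.id)
    print(record.seq)
    print(len(record.seq))
    lengths[record.id] = len(record.seq)

# read: for a file that holds exactly one record.
record = SeqIO.read(single_path, "fasta")
print(record.id)           # identifier taken from the ">" line
print(record.description)  # full identifier line without ">"
print(record.seq)          # the sequence itself
```

For the GenBank file, the narration changes only the file name to sequence.gb and the format string to "genbank". Note that SeqIO.read raises a ValueError if the file does not contain exactly one record, which is why parse is used for multi-record files.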