Biopython/C2/Parsing-Data/English-timed

From Script | Spoken-Tutorial
Revision as of 12:14, 6 June 2016 by PoojaMoolya (Talk | contribs)

Time
Narration
00:01 Hello everyone.
00:02 Welcome to this tutorial on Parsing Data.
00:06 In this tutorial, we will learn to download FASTA and GenBank files from the NCBI database website
00:14 and parse data files using functions in the Sequence Input/Output module.
00:19 To follow this tutorial, you should be familiar with undergraduate Biochemistry or Bioinformatics
00:26 and basic Python programming.
00:30 Refer to the Python tutorials at the given link.
00:34 To record this tutorial, I am using Ubuntu OS version 14.10,
00:40 Python version 2.7.8,
00:44 IPython interpreter version 2.3.0,
00:48 Biopython 1.64 and Mozilla Firefox browser 35.0.
00:56 Scientific data in biology is generally stored in text files such as FASTA, GenBank, EMBL, Swiss-Prot, etc.
01:07 Data files can be downloaded from the database websites.
01:12 Open the website link given below in any web browser.
01:17 A web-page opens.
01:19 Let us download FASTA and GenBank files for the human insulin gene.
01:25 In the search box, type human insulin and click on the Search button.
01:31 The web-page shows many files for the human insulin gene.
01:35 For demonstration, I will select 4 files with the name “Homo sapiens Insulin mRNA”.
01:43 I will choose files that have less than 500 base pairs.
01:48 Click on the check box to select the file to download.
01:56 Bring the cursor to the “Send to” option, located at the top right corner of the page.
02:02 Click on the small selection button with a down arrow present next to the “Send to” button.
02:09 Under the heading “Choose destination”, click on the “File” option.
02:13 You can save this file in any file format listed under the “Format” drop-down list box.
02:21 Choose “FASTA” from the given options.
02:25 Then click on “Create file” option.
02:29 A dialog box appears on the screen.
02:32 Select “Open with” and click on OK.
02:36 A file opens in a text editor.
02:39 The file shows 4 records, since we had selected four files to download.
02:46 The first line in each record is an identifier line.
02:50 It starts with a “greater than” (>) symbol.
02:53 This is followed by a sequence.
02:56 Save the file in your home folder as “sequence.fasta”.
03:01 Close the text editor.
03:03 Follow the same steps as above to download the files in GenBank format
03:08 for the same files selected earlier.
03:12 Select the file format as GenBank.
03:16 Create the file and open it with a text editor.
03:21 Notice that the sequence file in GenBank format has more features than a FASTA file.
03:27 Save the file as sequence.gb in your home folder. Close the text editor.
03:34 For demonstration purposes, we need a FASTA file with a single record.
03:39 For this, clear the earlier selection by again clicking on the check boxes.
03:48 Now select the file “Human insulin gene complete cds”.
03:54 Click on the check box.
03:57 And follow the same steps shown earlier to save the file in the home folder.
04:01 Save the file as insulin.fasta.
04:08 Biological data stored in these files can be extracted and modified using Biopython libraries.
04:16 Close the text editor.
04:19 Extracting data from data files is called parsing.
04:23 Most file formats can be parsed using functions available in the SeqIO module.
04:30 The most commonly used functions of the SeqIO module are parse, read, write and convert.
04:38 Open the terminal by pressing the Ctrl, Alt and T keys simultaneously.
04:44 Start IPython by typing ipython at the prompt. Press Enter.
04:51 Next, import the SeqIO module from the Bio package.
04:56 At the prompt, type from Bio import SeqIO. Press Enter.
05:04 We will start with the most important function, “parse”.
05:07 For demonstration, I will use a FASTA file that has many records, which we had downloaded earlier from the database.
05:17 For simple FASTA parsing, type the following at the prompt.
05:22 Here we are using the parse function to read the contents of the sequence.fasta file.
05:30 For the output, print the record id, the sequence present in the record and also the length of the sequence.
05:41 Also notice that the parse function is used to read sequence data as Sequence record objects.
05:48 It is generally used with a for loop.
05:52 It takes two main arguments; the first one is the name of the file to read the data from.
05:59 The second specifies the file format.
06:02 Press the Enter key twice to get the output.
06:07 The output shows the identifier line, followed by the sequence contained in the file, also the length of the sequence for all the records in the file.
06:21 Notice that the FASTA format does not specify the alphabet.
06:26 So, the output does not specify it as a DNA sequence.
06:31 The same steps can be repeated for parsing a GenBank file.
06:36 For demonstration, we will use the GenBank file which we downloaded earlier from the database.
06:43 Press the up arrow key to get the lines of code which we had used earlier.
06:49 Change the file name to sequence.gb.
06:53 Change the file format to genbank.
06:56 The rest of the code remains the same.
06:58 Press the Enter key twice to get the output.
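The parsing loop described above can be sketched as follows. This is a minimal sketch, not the exact code shown in the recording: it assumes Python 3 (so `print` is a function, unlike the Python 2.7 used in the tutorial), and it reads a small made-up two-record FASTA text from memory so that it runs without the downloaded file. The record names `demo1` and `demo2` and their sequences are hypothetical.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical two-record FASTA text standing in for sequence.fasta.
fasta_text = """>demo1 hypothetical record one
ATGGCCCTGTGGATGCGC
>demo2 hypothetical record two
ATGAAATTTGGG
"""

summaries = []
# With the downloaded file, pass its name instead:
#     SeqIO.parse("sequence.fasta", "fasta")
for record in SeqIO.parse(StringIO(fasta_text), "fasta"):
    print(record.id)      # identifier taken from the ">" line
    print(record.seq)     # the sequence itself
    print(len(record))    # length of the sequence
    summaries.append((record.id, str(record.seq), len(record)))
```

Each pass through the loop handles one Sequence record object, so the same code works for a file with any number of records.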
07:03 Here too the output shows the record id, sequence and the length of the sequence for all the records in the file.
07:12 Notice that the GenBank format specifies the sequence as DNA sequence.
07:19 Similarly, Swiss-Prot and EMBL files can be parsed using the same code as above.
07:27 If your file contains a single record, then type the following lines for parsing.
07:34 Here we will use the previously saved FASTA file with a single record, that is insulin.fasta, as an example.
07:43 Notice that we have used the read function instead of the parse function. Press Enter.
07:50 The output shows the contents of the file insulin.fasta.
07:55 It shows the sequence as a sequence record object
07:59 and other attributes such as GI, accession number and description.
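The GenBank case can be sketched in the same way; only the format string changes. This is a sketch under stated assumptions: Python 3 and Biopython 1.78 or later, where the molecule type is reported in `record.annotations["molecule_type"]` rather than through the alphabet shown in the Biopython 1.64 output. To avoid depending on the downloaded sequence.gb, the example first writes a hypothetical record (`DEMO0001`) to GenBank text in memory and then parses it back.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical record, written out as GenBank text so no download is needed.
demo = SeqRecord(
    Seq("ATGGCCCTGTGGATGCGC"),
    id="DEMO0001.1",
    name="DEMO0001",
    description="hypothetical demo record",
    annotations={"molecule_type": "DNA"},  # required by the GenBank writer
)
buffer = StringIO()
SeqIO.write(demo, buffer, "genbank")
buffer.seek(0)

lengths = []
molecule_types = []
# With the downloaded file, pass its name instead:
#     SeqIO.parse("sequence.gb", "genbank")
for record in SeqIO.parse(buffer, "genbank"):
    print(record.id)
    print(record.seq)
    print(len(record))
    lengths.append(len(record))
    # GenBank records state the molecule type; FASTA records do not.
    molecule_types.append(record.annotations.get("molecule_type"))
```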
08:06 We can also view the individual attributes of this record as follows.
08:11 At the prompt, type record dot seq. Press Enter.
08:18 The output shows the sequence present in the file.
08:22 To view the identifiers for this record, type record dot id. Press Enter.
08:29 The output shows the GI number, accession number, etc.
08:34 You can use the functions described above to parse the data files of your choice.
08:40 Now let's summarize.
08:42 In this tutorial, we have learnt to download FASTA and GenBank files from the NCBI database website, and to use the parse and read functions from the SeqIO module
08:55 to extract data such as record ids, descriptions and sequences from FASTA and GenBank files.
09:03 Now for the assignment.
09:06 Download FASTA files for nucleotide sequences of your choice from the NCBI database.
09:13 Convert the sequences in the file to their reverse complements.
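Reading a single-record file and inspecting its attributes might look like the sketch below. It assumes Python 3 and uses a made-up one-record FASTA text in place of insulin.fasta; the identifier `demo_insulin` and its sequence are hypothetical.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical single-record FASTA text standing in for insulin.fasta.
fasta_text = """>demo_insulin hypothetical single record
ATGGCCCTGTGGATGCGCCTC
"""

# With the downloaded file, pass its name instead:
#     record = SeqIO.read("insulin.fasta", "fasta")
record = SeqIO.read(StringIO(fasta_text), "fasta")

print(record)              # the whole sequence record object
print(record.seq)          # only the sequence
print(record.id)           # only the identifier
print(record.description)  # the full text of the identifier line
```

The read function expects exactly one record and raises a ValueError if the file holds none or more than one, which is why parse is used for multi-record files.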
09:17 Your completed assignment should have the following lines of code.
09:22 Use the parse function to load nucleotide sequences from the FASTA file.
09:28 Next, print the reverse complements using the Sequence object’s built-in reverse complement method.
09:37 The video at the following link summarizes the Spoken Tutorial project.
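One possible shape for the assignment is sketched below. It is not the code shown on the slide; it assumes Python 3 and uses two short made-up sequences (`demo1`, `demo2`) in place of a downloaded FASTA file.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical nucleotide records standing in for a downloaded FASTA file.
fasta_text = """>demo1
ATGGCC
>demo2
AAACCC
"""

reverse_complements = []
# With a downloaded file, pass its name instead of the StringIO handle.
for record in SeqIO.parse(StringIO(fasta_text), "fasta"):
    rc = record.seq.reverse_complement()  # built-in method of the Seq object
    print(record.id, rc)
    reverse_complements.append(str(rc))
```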
09:42 Please download and watch it.
09:44 The Spoken Tutorial Project Team conducts workshops and gives certificates to those who pass an online test.
09:51 For more details, please write to us.
09:55 The Spoken Tutorial Project is funded by NMEICT, MHRD, Government of India.
10:01 More information on this Mission is available at the link shown.
10:06 This is Snehalatha from IIT Bombay signing off. Thank you for joining.

Contributors and Content Editors

PoojaMoolya, Sandhya.np14