Difference between revisions of "Biopython/C2/Parsing-Data/English-timed"

From Script | Spoken-Tutorial
(Created page with "{| Border=1 ! <center>Time</center> ! <center>Narration</center> |- | 00:01 | Hello everyone. |- | 00:02 | Welcome to this tutorial on '''Parsing Data.''' |- | 00:06 | In t...")
 
 
(3 intermediate revisions by 2 users not shown)
Line 5: Line 5:
 
|-
 
|-
 
| 00:01
 
| 00:01
| Hello everyone.
+
| Hello everyone.Welcome to this tutorial on '''Parsing Data.'''
 
+
|-
+
| 00:02
+
| Welcome to this tutorial on '''Parsing Data.'''
+
  
 
|-
 
|-
 
| 00:06
 
| 00:06
| In this tutorial, we will learn to, Download '''FASTA''' and '''GenBank''' files from '''NCBI''' database website.
+
| In this tutorial, we will learn to download '''FASTA''' and '''GenBank''' files from '''NCBI''' database website.
  
 
|-
 
|-
 
| 00:14
 
| 00:14
| And '''Parse''' data files using functions in '''Sequence Input/Output''' module.
+
| And, '''Parse''' data files using '''function'''s in '''Sequence Input/Output''' module.
  
 
|-
 
|-
 
| 00:19
 
| 00:19
| To follow this tutorial you should be familiar with, Undergraduate Biochemistry or Bioinformatics
+
| To follow this tutorial, you should be familiar with undergraduate biochemistry or bioinformatics
  
 
|-
 
|-
 
| 00:26
 
| 00:26
| And basic '''Python''' programming  
+
| and basic '''Python''' programming.
  
 
|-
 
|-
Line 33: Line 29:
 
|-
 
|-
 
| 00:34
 
| 00:34
| To record this tutorial I am using '''Ubuntu''' OS version. 14.10
+
| To record this tutorial, I am using: * '''Ubuntu OS''' version 14.10
  
 
|-
 
|-
Line 45: Line 41:
 
|-
 
|-
 
| 00:48
 
| 00:48
| '''Biopython''' 1.64 and '''Mozilla Firefox '''browser 35.0
+
|'''Biopython''' version 1.64 and * '''Mozilla Firefox '''browser 35.0.
  
 
|-
 
|-
 
| 00:56
 
| 00:56
| Scientific data in biology is generally stored in text files such as '''FASTA''', '''GenBank''', '''EMBL''', '''Swiss-Prot''' etc
+
| Scientific data in biology is generally stored in text files such as '''FASTA''', '''GenBank''', '''EMBL''', '''Swiss-Prot''' etc.
  
 
|-
 
|-
 
| 01:07
 
| 01:07
|Data files can be download from the database websites.
+
|Data files can be downloaded from the database websites.
  
 
|-
 
|-
 
| 01:12
 
| 01:12
|Open the website link given below in any web browser.
+
|Open the website link given below, in any web browser.
  
 
|-
 
|-
Line 65: Line 61:
 
|-
 
|-
 
| 01:19
 
| 01:19
|Let us download '''FASTA''' and '''GenBank''' files for human '''insulin''' gene.
+
|Let us download '''FASTA''' and '''GenBank''' files for human '''insulin gene'''.
  
 
|-
 
|-
 
| 01:25
 
| 01:25
|In the search box type, '''human insulin''' click on search button.
+
|In the search box, type: "human insulin", click on '''Search''' button.
  
 
|-
 
|-
 
| 01:31
 
| 01:31
| The web-page shows many files for human insulin gene.
+
| The web-page shows many files for human '''insulin gene'''.
  
 
|-
 
|-
 
| 01:35
 
| 01:35
|For demonstration, I will select 4 files with the name “'''Homo sapiens Insulin mRNA”. '''
+
|For demonstration, I will select 4 files with the name “Homo sapiens Insulin mRNA”.  
  
 
|-
 
|-
 
| 01:43
 
| 01:43
|I will choose files that have less than 500 base pairs.
+
|I will choose files that have less than 500 '''base''' pairs.
  
 
|-
 
|-
 
| 01:48
 
| 01:48
|Click on the check box to select the file to download.
+
|Click on the check-box to select the file, to download.
  
 
|-
 
|-
Line 93: Line 89:
 
|-
 
|-
 
| 02:02
 
| 02:02
|Click on the small selection button with a down arrow present next to the “'''Send to'''” button.
+
|Click on the small selection button with a down arrow, present next to the “'''Send to'''” button.
  
 
|-
 
|-
 
| 02:09
 
| 02:09
| Under the heading “'''Choose destination'''” Click on '''File'''option.
+
| Under the heading “'''Choose destination'''”, click on '''File''' option.
 
|-
 
|-
 
| 02:13
 
| 02:13
|You can save this file in any file format listed under '''format'''drop down list box.
+
|You can '''save''' this file in any file format, listed under '''format''' drop-down list box.
  
 
|-
 
|-
 
| 02:21
 
| 02:21
|Choose '''FASTA'''from the given options.
+
|Choose '''FASTA''' from the given options.
  
 
|-
 
|-
 
| 02:25
 
| 02:25
|Then click on '''Create file'''option.  
+
|Then click on '''Create file''' option.  
  
 
|-
 
|-
 
| 02:29
 
| 02:29
| A dialog box appears on the screen.  
+
| A dialog-box appears on the screen.  
  
 
|-
 
|-
 
|02:32  
 
|02:32  
|Select '''Open with'''click on '''OK .'''
+
|Select '''Open with''', click on '''OK.'''
  
 
|-
 
|-
 
| 02:36
 
| 02:36
| A file opens in a text editor.
+
| A file opens in a '''text editor'''.
  
 
|-
 
|-
Line 128: Line 124:
 
|-
 
|-
 
| 02:46
 
| 02:46
|The first line in each record is an '''identifier''' line,
+
|The first line in each record is an '''identifier''' line.
  
 
|-
 
|-
 
| 02:50
 
| 02:50
|It starts with a “greater than (>) symbol”.
+
|It starts with a “greater than (>)” symbol.
  
 
|-
 
|-
Line 140: Line 136:
 
|-
 
|-
 
| 02:56
 
| 02:56
|Save the file in your home folder as “'''sequence.fasta'''”.
+
|'''Save''' the file in your '''home''' folder as “sequence.fasta'”.
  
 
|-
 
|-
Line 148: Line 144:
 
|-
 
|-
 
| 03:03
 
| 03:03
| Follow the same steps as above to download the files in '''GenBank''' format
+
| Follow the same steps as above, to download the files in '''GenBank''' format
  
 
|-
 
|-
 
| 03:08
 
| 03:08
|For the same files selected earlier.
+
|for the same files selected earlier.
  
 
|-
 
|-
 
| 03:12
 
| 03:12
|Select the file format as '''GenBank.'''
+
|Select the '''file format''' as '''GenBank.'''
  
 
|-
 
|-
 
| 03:16
 
| 03:16
|'''Create file''', open with a text editor.  
+
|Create a file. Open with a text editor.  
  
 
|-
 
|-
Line 168: Line 164:
 
|-
 
|-
 
| 03:27
 
| 03:27
|Save the file as '''sequence.gb '''in your home folder'''.'''Close the text editor'''.'''
+
|'''Save''' the file as "sequence.gb" in your '''home''' folder. Close the text editor.
  
 
|-
 
|-
 
| 03:34
 
| 03:34
| For demonstration purpose we need a FASTA file with a single record.
+
| For demonstration purpose, we need a FASTA file with a single '''record'''.
  
 
|-
 
|-
Line 180: Line 176:
 
|-
 
|-
 
| 03:48
 
| 03:48
|Now select the file “'''Human insulin gene complete cds'''”.
+
|Now, select the file “'''Human insulin gene complete cds'''”.
  
 
|-
 
|-
 
| 03:54
 
| 03:54
|Click on the check box.
+
|Click on the check-box.
  
 
|-
 
|-
 
| 03:57
 
| 03:57
| And Follow the same steps shown earlier to save the file in the home folder.
+
| And follow the same steps shown earlier to '''save''' the file in the '''home''' folder.
  
 
|-
 
|-
 
| 04:01
 
| 04:01
|Save the file as '''insulin.fasta.'''
+
|'''Save''' the file as "insulin.fasta".
  
 
|-
 
|-
Line 200: Line 196:
 
|-
 
|-
 
| 04:16
 
| 04:16
|Close the text editor.
+
|Close the text-editor.
  
 
|-
 
|-
Line 208: Line 204:
 
|-
 
|-
 
| 04:23
 
| 04:23
|Most file formats can be parsed using functions available in '''SeqIO''' module.
+
|Most file formats can be parsed using '''function'''s available in '''SeqIO''' module.
  
 
|-
 
|-
 
| 04:30
 
| 04:30
|Most commonly used functions of '''SeqIO''' module are, '''parse''', '''read''', '''write''', and '''convert'''.
+
|Most commonly used functions of '''SeqIO''' module are: '''parse, read, write''' and '''convert'''.
  
 
|-
 
|-
 
| 04:38
 
| 04:38
| Open the terminal by pressing '''ctrl''', '''alt''' and '''t''' keys simultaneously.
+
| Open the terminal by pressing '''Ctrl, Alt''' and '''t''' keys simultaneously.
  
 
|-
 
|-
 
| 04:44
 
| 04:44
| Start '''Ipython''' by typing '''ipython''' at the prompt. Press enter.
+
| Start '''Ipython''' by typing "ipython" at the prompt. Press '''Enter'''.
  
 
|-
 
|-
 
| 04:51
 
| 04:51
| Next import '''SeqIO''' module from '''Bio''' package.
+
| Next, '''import''' "SeqIO" module from '''Bio''' package.
  
 
|-
 
|-
 
| 04:56
 
| 04:56
| At the prompt type,'''from Bio import SeqIO'''. Press enter
+
| At the prompt, type: '''from Bio import SeqIO'''. Press '''Enter'''.
  
 
|-
 
|-
Line 236: Line 232:
 
|-
 
|-
 
|05:07
 
|05:07
|For demonstration, I will use a '''FASTA''' file that has many records, which we had downloaded earlier from the database.
+
|For demonstration, I will use a '''FASTA''' file that has many '''record'''s which we had downloaded earlier from the database.
  
 
|-
 
|-
Line 244: Line 240:
 
|-
 
|-
 
| 05:22
 
| 05:22
|Here we are using the '''parse''' function to read contents of '''sequence.fasta''' file.
+
|Here, we are using the '''parse''' function to read the contents of the '''sequence.fasta''' file.
  
 
|-
 
|-
 
| 05:30
 
| 05:30
|For the output print, '''record id''', sequence present in the record and also the length of the sequence.
+
|For the output, print '''record id''', sequence present in the record and also the length of the sequence.
 
|-
 
|-
 
| 05:41
 
| 05:41
Line 267: Line 263:
 
|-
 
|-
 
| 06:02
 
| 06:02
|Press enter key twice to get the output.
+
|Press '''Enter''' key twice to get the output.
  
 
|-
 
|-
Line 279: Line 275:
 
|-
 
|-
 
| 06:26
 
| 06:26
|So, the output does not specifies it as as a '''DNA sequence'''.
+
|So, the output does not specifies it as a '''DNA sequence'''.
  
 
|-
 
|-
Line 287: Line 283:
 
|-
 
|-
 
| 06:36
 
| 06:36
|For Demonstration we will use the '''GenBank''' file which we have download earlier from the database.
+
|For Demonstration we will use the '''GenBank''' file which we have downloaded earlier from the database.
  
 
|-
 
|-
 
| 06:43
 
| 06:43
|Press up arrow key to get the lines of code which we had used earlier.
+
|Press up-arrow key to get the lines of code which we had used earlier.
  
 
|-
 
|-
 
| 06:49
 
| 06:49
|Change the file name to '''sequence.gb '''
+
|Change the file name to '''sequence.gb '''.
 
|-
 
|-
 
| 06:53
 
| 06:53
Line 306: Line 302:
 
|-
 
|-
 
| 06:58
 
| 06:58
| Press enter key twice to get the output.
+
| Press '''Enter''' key twice to get the output.
  
 
|-
 
|-
 
| 07:03
 
| 07:03
|Here too the output shows the '''record id''', sequence and the length of the sequence for all the records in the file.
+
|Here too the output shows the '''record id''', '''sequence''' and the length of the sequence for all the records in the file.
  
 
|-
 
|-
Line 318: Line 314:
 
|-
 
|-
 
| 07:19
 
| 07:19
| Similarly '''Swiss-prot''' and '''EMBL''' files can be parsed using same code as above.
+
| Similarly, '''Swiss-prot''' and '''EMBL''' files can be parsed using the same code as above.
  
 
|-
 
|-
Line 326: Line 322:
 
|-
 
|-
 
| 07:34
 
| 07:34
| Here we will use the previously saved FASTA file with a single record, that is '''insulin.fasta '''as an example'''.'''
+
| Here, we will use the previously saved '''FASTA''' file with a single record, that is, '''insulin.fasta '''as an example.
  
 
|-
 
|-
 
| 07:43
 
| 07:43
|Notice that we have used '''read''' function instead of parse function. Press Enter
+
|Notice that we have used '''read''' function instead of '''parse''' function. Press '''Enter'''.
  
 
|-
 
|-
Line 338: Line 334:
 
|-
 
|-
 
| 07:55
 
| 07:55
|It shows the sequence as sequence record object.
+
|It shows the sequence as '''sequence record object'''.
  
 
|-
 
|-
Line 350: Line 346:
 
|-
 
|-
 
| 08:11
 
| 08:11
|At the prompt type, '''record dot seq '''. Press enter
+
|At the prompt, type: '''record dot seq'''. Press '''Enter'''.
  
 
|-
 
|-
Line 358: Line 354:
 
|-
 
|-
 
| 08:22
 
| 08:22
| To view the identifiers for this record, type, '''record dot id.''' Press enter
+
| To view the identifiers for this record, type: '''record dot id.''' Press '''Enter'''.
  
 
|-
 
|-
 
| 08:29
 
| 08:29
|The output shows the GI number and accession number etc.
+
|The output shows the '''GI''' number and accession number etc.
  
 
|-
 
|-
 
| 08:34
 
| 08:34
|You can use the function described above to parse the data files of your choice.
+
|You can use the function described above to '''parse''' the data files of your choice.
  
 
|-
 
|-
 
| 08:40
 
| 08:40
| Now Let's summarize,
+
| Now, let's summarize.
  
 
|-
 
|-
 
| 08:42
 
| 08:42
|In this tutorial we have learnt, to Download '''FASTA''' and '''GenBank''' files from '''NCBI''' database website, and use '''parse''' and '''read''' functions from the '''SeqIO''' module:
+
|In this tutorial, we have learnt:to download '''FASTA''' and '''GenBank''' files from '''NCBI''' database website and use '''parse''' and '''read''' functions from the '''SeqIO''' module
  
 
|-
 
|-
 
| 08:55
 
| 08:55
|To extract data such as record ids, description and sequences, from '''FASTA''' and '''GenBank''' files.
+
| to extract data such as '''record id'''s, description and sequences from '''FASTA''' and '''GenBank''' files.
  
 
|-
 
|-
 
| 09:03
 
| 09:03
| Now for the assignment,
+
| Now, for the assignment-
  
 
|-
 
|-
Line 390: Line 386:
 
|-
 
|-
 
| 09:13
 
| 09:13
|Convert the file of sequences to their reverse complements.
+
|Convert the file of sequences to their '''reverse complement'''s.
  
 
|-
 
|-
Line 398: Line 394:
 
|-
 
|-
 
| 09:22
 
| 09:22
|Use '''parse''' function to load nucleotide sequences from the '''FASTA''' file.
+
|Use '''parse''' function to '''load''' nucleotide sequences from the '''FASTA''' file.
  
 
|-
 
|-
 
| 09:28
 
| 09:28
|Next print reverse complements using the Sequence object’s built in reverse complement method.
+
|Next, print reverse complements using the Sequence object’s built in '''reverse complement''' method.
  
 
|-
 
|-
 
| 09:37
 
| 09:37
| Video at the following link, summarizes the spoken-tutorial project.
+
| Video at the following link summarizes the spoken-tutorial project.
  
 
|-
 
|-
Line 414: Line 410:
 
|-
 
|-
 
| 09:44
 
| 09:44
| The Spoken Tutorial Project Team Conducts workshops and gives certificates to those who pass an on-line test.  
+
| The Spoken Tutorial Project team conducts workshops and gives certificates to those who pass an on-line test.  
  
 
|-
 
|-
Line 426: Line 422:
 
|-
 
|-
 
| 10:01
 
| 10:01
|More information on this Mission is available at the link shown.  
+
|More information on this mission is available at the link shown.  
  
 
|-
 
|-
 
| 10:06
 
| 10:06
|This is Snehalatha from IIT Bombay signing off. Thank you for joining.  
+
|This is Snehalatha from '''IIT Bombay''', signing off. Thank you for joining.  
  
 
|}
 
|}

Latest revision as of 17:26, 10 March 2017

Time
Narration
00:01 Hello everyone. Welcome to this tutorial on Parsing Data.
00:06 In this tutorial, we will learn to download FASTA and GenBank files from the NCBI database website.
00:14 And parse data files using functions in the Sequence Input/Output module.
00:19 To follow this tutorial, you should be familiar with undergraduate biochemistry or bioinformatics
00:26 and basic Python programming.
00:30 Refer to the Python tutorials at the given link.
00:34 To record this tutorial, I am using: Ubuntu OS version 14.10
00:40 Python version 2.7.8
00:44 IPython interpreter version 2.3.0
00:48 Biopython version 1.64 and Mozilla Firefox browser 35.0.
00:56 Scientific data in biology is generally stored in text files such as FASTA, GenBank, EMBL, Swiss-Prot etc.
01:07 Data files can be downloaded from the database websites.
01:12 Open the website link given below, in any web browser.
01:17 A web-page opens.
01:19 Let us download FASTA and GenBank files for human insulin gene.
01:25 In the search box, type "human insulin" and click on the Search button.
01:31 The web-page shows many files for human insulin gene.
01:35 For demonstration, I will select 4 files with the name “Homo sapiens Insulin mRNA”.
01:43 I will choose files that have fewer than 500 base pairs.
01:48 Click on the check-box next to each file to select it for download.
01:56 Bring the cursor to the “Send to” option, located at the top right corner of the page.
02:02 Click on the small selection button with a down arrow, present next to the “Send to” button.
02:09 Under the heading “Choose destination”, click on the File option.
02:13 You can save this file in any of the file formats listed under the Format drop-down list box.
02:21 Choose FASTA from the given options.
02:25 Then click on the Create file option.
02:29 A dialog-box appears on the screen.
02:32 Select Open with and click on OK.
02:36 A file opens in a text editor.
02:39 The file shows 4 records, since we had selected four files to download.
02:46 The first line in each record is an identifier line.
02:50 It starts with a “greater than (>)” symbol.
02:53 This is followed by a sequence.
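The record layout described above can be illustrated with a minimal sketch in plain Python. The two records below are made up for illustration and are not the downloaded insulin records, and `split_fasta` is a hypothetical helper name, not part of Biopython. Later in the tutorial this job is done by the SeqIO module.

```python
# Hypothetical two-record FASTA text; identifiers and sequences are made up.
fasta_text = """>example_1 first made-up record
ATGGCCCTGTGG
ATGCGCCTC
>example_2 second made-up record
TTGACCGGA
"""


def split_fasta(text):
    """Return a list of (identifier_line, sequence) pairs from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new identifier line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped, so they are joined together.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


for header, sequence in split_fasta(fasta_text):
    print(header, len(sequence))
```

Each ">" line starts a new record, and every line up to the next ">" belongs to that record's sequence.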
02:56 Save the file in your home folder as “sequence.fasta”.
03:01 Close the text editor.
03:03 Follow the same steps as above, to download the files in GenBank format
03:08 for the same files selected earlier.
03:12 Select the file format as GenBank.
03:16 Create a file. Open with a text editor.
03:21 Notice that the sequence file in GenBank format has more features than a FASTA file.
03:27 Save the file as "sequence.gb" in your home folder. Close the text editor.
03:34 For demonstration purposes, we need a FASTA file with a single record.
03:39 For this, clear the earlier selection by again clicking on the check boxes.
03:48 Now, select the file “Human insulin gene complete cds”.
03:54 Click on the check-box.
03:57 And follow the same steps shown earlier to save the file in the home folder.
04:01 Save the file as "insulin.fasta".
04:08 Biological data stored in these files can be extracted and modified using Biopython libraries.
04:16 Close the text-editor.
04:19 Extracting data from data files is called parsing.
04:23 Most file formats can be parsed using functions available in the SeqIO module.
04:30 The most commonly used functions of the SeqIO module are: parse, read, write and convert.
04:38 Open the terminal by pressing Ctrl, Alt and t keys simultaneously.
04:44 Start IPython by typing "ipython" at the prompt. Press Enter.
04:51 Next, import the SeqIO module from the Bio package.
04:56 At the prompt, type: from Bio import SeqIO. Press Enter.
05:04 We will start with the most important function “parse”.
05:07 For demonstration, I will use a FASTA file that has many records which we had downloaded earlier from the database.
05:17 For simple FASTA parsing, type the following at the prompt.
05:22 Here, we are using the parse function to read the contents of the sequence.fasta file.
05:30 For the output, print record id, sequence present in the record and also the length of the sequence.
05:41 Also notice that the parse function is used to read sequence data as Sequence record objects.
05:48 It is generally used with a for loop.
05:52 It accepts two arguments: the first is the name of the file to read the data from.
05:59 The second specifies the file format.
06:02 Press Enter key twice to get the output.
06:07 The output shows the identifier line, followed by the sequence contained in the file, also the length of the sequence for all the records in the file.
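The code typed at the prompt is not reproduced in this transcript. The following is a minimal sketch of what such a loop could look like, assuming Biopython is installed. To keep it self-contained, it reads made-up FASTA text from an in-memory handle instead of the downloaded sequence.fasta file. The record names and sequences are illustrative only.

```python
from io import StringIO

from Bio import SeqIO

# Made-up FASTA data standing in for the downloaded sequence.fasta file.
fasta_handle = StringIO(
    ">demo1 made-up record one\nATGGCCCTGTGGATG\n"
    ">demo2 made-up record two\nTTGACCGGA\n"
)

summaries = []
# parse() takes the file (a name or an open handle) and the format name,
# and yields one sequence record object per record in the file.
for record in SeqIO.parse(fasta_handle, "fasta"):
    print(record.id)    # identifier taken from the ">" line
    print(record.seq)   # sequence present in the record
    print(len(record))  # length of the sequence
    summaries.append((record.id, str(record.seq), len(record)))
```

With the files saved in this tutorial, the handle would be replaced by the file name "sequence.fasta". For the GenBank file, the arguments would be "sequence.gb" and "genbank".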
06:21 Notice that the FASTA format does not specify the alphabet.
06:26 So, the output does not specify it as a DNA sequence.
06:31 The same steps can be repeated for parsing a GenBank file.
06:36 For demonstration, we will use the GenBank file which we downloaded earlier from the database.
06:43 Press up-arrow key to get the lines of code which we had used earlier.
06:49 Change the file name to sequence.gb.
06:53 Change the file format to genbank.
06:56 The rest of the code remains the same.
06:58 Press Enter key twice to get the output.
07:03 Here too, the output shows the record id, sequence and the length of the sequence for all the records in the file.
07:12 Notice that the GenBank format specifies the sequence as a DNA sequence.
07:19 Similarly, Swiss-Prot and EMBL files can be parsed using the same code as above.
07:27 If your file contains a single record, then type the following lines for parsing.
07:34 Here, we will use the previously saved FASTA file with a single record, that is, insulin.fasta as an example.
07:43 Notice that we have used the read function instead of the parse function. Press Enter.
07:50 The output shows the contents of the file insulin.fasta.
07:55 It shows the sequence as a sequence record object.
07:59 And other attributes such as GI, accession number and description.
08:06 We can also view the individual attributes of this record as follows.
08:11 At the prompt, type: record dot seq. Press Enter.
08:18 The output shows the sequence present in the file.
08:22 To view the identifiers for this record, type: record dot id. Press Enter.
08:29 The output shows the GI number and accession number etc.
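The single-record steps above can be sketched as follows, assuming Biopython is installed. The record is made up and read from an in-memory handle so that the example runs without the downloaded insulin.fasta file. Its identifier is illustrative and does not contain a real GI or accession number.

```python
from io import StringIO

from Bio import SeqIO

# Made-up single-record FASTA data standing in for insulin.fasta.
single_handle = StringIO(
    ">demo_insulin made-up single record\nATGGCCCTGTGGATGCGC\n"
)

# read() returns exactly one sequence record object.
record = SeqIO.read(single_handle, "fasta")
print(record)      # whole record: id, description, sequence
print(record.seq)  # sequence only
print(record.id)   # identifier only

# read() raises ValueError when the file holds more than one record.
multi_record_rejected = False
try:
    SeqIO.read(StringIO(">a\nAT\n>b\nGC\n"), "fasta")
except ValueError as error:
    multi_record_rejected = True
    print("read() refused a multi-record file:", error)
```

For files with several records, such as sequence.fasta, the parse function is the one to use.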
08:34 You can use the functions described above to parse the data files of your choice.
08:40 Now, let's summarize.
08:42 In this tutorial, we have learnt to download FASTA and GenBank files from the NCBI database website and to use the parse and read functions from the SeqIO module
08:55 to extract data such as record ids, description and sequences from FASTA and GenBank files.
09:03 Now, for the assignment-
09:06 Download FASTA files for nucleotide sequences of your choice from the NCBI database.
09:13 Convert the file of sequences to their reverse complements.
09:17 Your completed assignment should have the following lines of code.
09:22 Use parse function to load nucleotide sequences from the FASTA file.
09:28 Next, print reverse complements using the Sequence object’s built-in reverse complement method.
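The assignment code shown on screen is not part of this transcript. One possible sketch, assuming Biopython is installed, is given below. It uses made-up sequences in an in-memory handle. For the assignment itself, the handle would be replaced by the name of the FASTA file you downloaded.

```python
from io import StringIO

from Bio import SeqIO

# Made-up nucleotide records standing in for a downloaded FASTA file.
handle = StringIO(">demo1\nATGGCC\n>demo2\nTTGACA\n")

reverse_complements = {}
# Load each record with parse(), then call the Seq object's
# reverse_complement() method on its sequence.
for record in SeqIO.parse(handle, "fasta"):
    rc = record.seq.reverse_complement()
    print(record.id, rc)
    reverse_complements[record.id] = str(rc)
```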
09:37 The video at the following link summarizes the Spoken Tutorial project.
09:42 Please download and watch it.
09:44 The Spoken Tutorial Project team conducts workshops and gives certificates to those who pass an online test.
09:51 For more details, please write to us.
09:55 The Spoken Tutorial Project is funded by NMEICT, MHRD, Government of India.
10:01 More information on this mission is available at the link shown.
10:06 This is Snehalatha from IIT Bombay, signing off. Thank you for joining.

Contributors and Content Editors

PoojaMoolya, Sandhya.np14