Difference between revisions of "Biopython/C2/Parsing-Data/English-timed"

Latest revision as of 17:26, 10 March 2017

Time	Narration
00:01	Hello everyone. Welcome to this tutorial on Parsing Data.
00:06	In this tutorial, we will learn to download FASTA and GenBank files from the NCBI database website
00:14	and parse data files using functions in the Sequence Input/Output module.
00:19	To follow this tutorial, you should be familiar with undergraduate biochemistry or bioinformatics
00:26	and basic Python programming.
00:30	Refer to the Python tutorials at the given link.
00:34	To record this tutorial, I am using Ubuntu OS version 14.10,
00:40	Python version 2.7.8,
00:44	IPython interpreter version 2.3.0,
00:48	Biopython version 1.64 and Mozilla Firefox browser 35.0.
00:56	Scientific data in biology is generally stored in text files such as FASTA, GenBank, EMBL, Swiss-Prot etc.
01:07	Data files can be downloaded from the database websites.
01:12	Open the website link given below in any web browser.
01:17	A web-page opens.
01:19	Let us download FASTA and GenBank files for the human insulin gene.
01:25	In the search box, type "human insulin" and click on the Search button.
01:31	The web-page shows many files for the human insulin gene.
01:35	For demonstration, I will select 4 files with the name “Homo sapiens Insulin mRNA”.
01:43	I will choose files that have fewer than 500 base pairs.
01:48	Click on the check-box next to a file to select it for download.
01:56	Bring the cursor to the “Send to” option, located at the top right corner of the page.
02:02	Click on the small selection button with a down arrow next to the “Send to” button.
02:09	Under the heading “Choose destination”, click on the File option.
02:13	You can save this file in any of the formats listed in the Format drop-down list box.
02:21	Choose FASTA from the given options.
02:25	Then click on the Create file option.
02:29	A dialog-box appears on the screen.
02:32	Select Open with and click on OK.
02:36	A file opens in a text editor.
02:39	The file shows 4 records, since we had selected four files to download.
02:46	The first line in each record is an identifier line.
02:50	It starts with a “greater than (>)” symbol.
02:53	This is followed by a sequence.
02:56	Save the file in your home folder as “sequence.fasta”.
03:01	Close the text editor.
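The record layout described at 02:46 to 02:53 (an identifier line starting with ">", followed by the sequence) can be illustrated without any library. The following is a minimal pure-Python sketch written for Python 3; it is not how Biopython reads FASTA files, and the function name read_fasta_records and the two sample records are made up for illustration.

```python
def read_fasta_records(lines):
    """Yield (identifier_line, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                # A new identifier line closes the previous record.
                yield header, "".join(chunks)
            header = line[1:]  # drop the leading ">"
            chunks = []
        elif header is not None:
            chunks.append(line)  # a sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Made-up sample data in FASTA layout (not real NCBI records).
sample = """>example_1 hypothetical record one
ATGGCC
CTGTGG
>example_2 hypothetical record two
ATGCGT
"""

records = list(read_fasta_records(sample.splitlines()))
for header, sequence in records:
    print(header, sequence, len(sequence))
```

Each identifier line starts a new record, and the sequence lines that follow are joined until the next ">" is seen.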
03:03	Follow the same steps as above to download the files in GenBank format
03:08	for the same files selected earlier.
03:12	Select the file format as GenBank.
03:16	Create the file and open it with a text editor.
03:21	Notice that the sequence file in GenBank format has more features than a FASTA file.
03:27	Save the file as "sequence.gb" in your home folder. Close the text editor.
03:34	For demonstration purposes, we need a FASTA file with a single record.
03:39	For this, clear the earlier selection by clicking on the check-boxes again.
03:48	Now, select the file “Human insulin gene complete cds”.
03:54	Click on the check-box.
03:57	And follow the same steps shown earlier to save the file in the home folder.
04:01	Save the file as "insulin.fasta".
04:08	Biological data stored in these files can be extracted and modified using Biopython libraries.
04:16	Close the text-editor.
04:19	Extracting data from data files is called parsing.
04:23	Most file formats can be parsed using functions available in the SeqIO module.
04:30	The most commonly used functions of the SeqIO module are: parse, read, write and convert.
04:38	Open the terminal by pressing the Ctrl, Alt and T keys simultaneously.
04:44	Start IPython by typing "ipython" at the prompt. Press Enter.
04:51	Next, import the SeqIO module from the Bio package.
04:56	At the prompt, type: from Bio import SeqIO. Press Enter.
05:04	We will start with the most important function, “parse”.
05:07	For demonstration, I will use a FASTA file that has many records, which we downloaded earlier from the database.
05:17	For simple FASTA parsing, type the following at the prompt.
05:22	Here, we are using the parse function to read the contents of the sequence.fasta file.
05:30	For the output, we print the record id, the sequence present in the record and also the length of the sequence.
05:41	Also notice that the parse function reads sequence data as sequence record (SeqRecord) objects.
05:48	It is generally used with a for loop.
05:52	It takes two arguments: the first is the name of the file to read the data from.
05:59	The second specifies the file format.
06:02	Press the Enter key twice to get the output.
06:07	For every record in the file, the output shows the record id, followed by the sequence and the length of the sequence.
06:21	Notice that the FASTA format does not specify the alphabet.
06:26	So, the output does not specify it as a DNA sequence.
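The code typed at the prompt is not captured in this transcript. Going by the narration from 05:17 to 06:07, it is a for loop over SeqIO.parse that prints each record's id, sequence and length. The sketch below follows that description under some assumptions: SeqIO.parse and the id and seq attributes are part of Biopython, while the helper names summarize and summarize_file are hypothetical and exist only so the loop can be reused.

```python
def summarize(records):
    """Return (id, sequence text, length) for each sequence record."""
    return [(record.id, str(record.seq), len(record.seq)) for record in records]


def summarize_file(path, file_format):
    """Parse a multi-record file and summarize every record in it."""
    # Imported here so that summarize() can be tried without Biopython installed.
    from Bio import SeqIO

    # First argument: the file name. Second argument: the format name,
    # which SeqIO expects in lower case (for example "fasta").
    return summarize(SeqIO.parse(path, file_format))


# Usage, assuming sequence.fasta was saved as described earlier:
# for record_id, sequence, length in summarize_file("sequence.fasta", "fasta"):
#     print(record_id)
#     print(sequence)
#     print(length)
```

Collecting the values in a list rather than printing inside the loop is a choice made for this sketch; at the IPython prompt, printing directly inside the for loop gives the output described in the narration.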
06:31	The same steps can be repeated for parsing a GenBank file.
06:36	For demonstration, we will use the GenBank file which we downloaded earlier from the database.
06:43	Press the up-arrow key to get the lines of code which we used earlier.
06:49	Change the file name to sequence.gb.
06:53	Change the file format to genbank.
06:56	The rest of the code remains the same.
06:58	Press the Enter key twice to get the output.
07:03	Here too, the output shows the record id, sequence and the length of the sequence for all the records in the file.
07:12	Notice that the GenBank format specifies the sequence as a DNA sequence.
07:19	Similarly, Swiss-Prot and EMBL files can be parsed using the same code as above.
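Since only the file name and the format string change between formats, a small lookup can pick the format string from the file extension. This is a hypothetical convenience, not something Biopython provides; the table FORMAT_BY_EXTENSION and the function guess_format are invented for this sketch. The format names "fasta", "genbank" and "embl" are ones SeqIO accepts; Swiss-Prot text files use the name "swiss" and are left out of the table because they have no single conventional extension.

```python
import os

# Hypothetical mapping from file extension to the SeqIO format name.
FORMAT_BY_EXTENSION = {
    ".fasta": "fasta",
    ".fa": "fasta",
    ".gb": "genbank",
    ".gbk": "genbank",
    ".embl": "embl",
}


def guess_format(filename):
    """Return the SeqIO format name suggested by the file extension."""
    extension = os.path.splitext(filename)[1].lower()
    file_format = FORMAT_BY_EXTENSION.get(extension)
    if file_format is None:
        raise ValueError("No known format for extension %r" % extension)
    return file_format


print(guess_format("sequence.fasta"))
print(guess_format("sequence.gb"))
```

The returned string can then be passed as the second argument to the parse function.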
07:27	If your file contains a single record, type the following lines for parsing.
07:34	Here, we will use the previously saved FASTA file with a single record, that is, insulin.fasta as an example.
07:43	Notice that we have used the read function instead of the parse function. Press Enter.
07:50	The output shows the contents of the file insulin.fasta.
07:55	It shows the sequence as a sequence record object.
07:59	It also shows other attributes such as the GI number, accession number and description.
08:06	We can also view the individual attributes of this record as follows.
08:11	At the prompt, type: record dot seq. Press Enter.
08:18	The output shows the sequence present in the file.
08:22	To view the identifiers for this record, type: record dot id. Press Enter.
08:29	The output shows the GI number and accession number etc.
08:34	You can use the functions described above to parse the data files of your choice.
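The single-record commands narrated from 07:27 to 08:29 are also missing from this transcript. A rough sketch, under the assumption that they call SeqIO.read and then inspect record.seq and record.id, is shown below. SeqIO.read and the id, description and seq attributes belong to Biopython; the helper names read_single_record and describe_record are hypothetical.

```python
def describe_record(record):
    """Collect the attributes of one sequence record into a dictionary."""
    return {
        "id": record.id,
        "description": record.description,
        "sequence": str(record.seq),
        "length": len(record.seq),
    }


def read_single_record(path, file_format):
    """Load a file that holds exactly one record."""
    # Imported here so that describe_record() can be tried without Biopython.
    from Bio import SeqIO

    # SeqIO.read raises ValueError if the file has no record or more than one.
    return SeqIO.read(path, file_format)


# Usage, assuming insulin.fasta was saved as described earlier:
# record = read_single_record("insulin.fasta", "fasta")
# print(record.seq)
# print(record.id)
```

Using read for a one-record file avoids writing a loop, and the error it raises on a multi-record file is a useful check that the right file was chosen.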
08:40	Now, let's summarize.
08:42	In this tutorial, we have learnt to download FASTA and GenBank files from the NCBI database website, and to use the parse and read functions from the SeqIO module
08:55	to extract data such as record ids, descriptions and sequences from FASTA and GenBank files.
09:03	Now, for the assignment:
09:06	Download FASTA files for nucleotide sequences of your choice from the NCBI database.
09:13	Convert the file of sequences to their reverse complements.
09:17	Your completed assignment should have the following lines of code.
09:22	Use the parse function to load nucleotide sequences from the FASTA file.
09:28	Next, print reverse complements using the Sequence object’s built-in reverse complement method.
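The assignment code itself is not part of this transcript. One possible shape for it, assuming the records are loaded with SeqIO.parse and that each record's seq offers Biopython's reverse_complement() method, is sketched below. The helper names reverse_complements and reverse_complements_from_fasta are hypothetical.

```python
def reverse_complements(records):
    """Return (id, reverse-complement text) for each sequence record."""
    return [
        (record.id, str(record.seq.reverse_complement()))
        for record in records
    ]


def reverse_complements_from_fasta(path):
    """Load nucleotide records from a FASTA file and reverse-complement them."""
    # Imported here so that reverse_complements() can be tried without Biopython.
    from Bio import SeqIO

    return reverse_complements(SeqIO.parse(path, "fasta"))


# Usage, assuming a FASTA file of nucleotide sequences named sequence.fasta:
# for record_id, sequence in reverse_complements_from_fasta("sequence.fasta"):
#     print(record_id)
#     print(sequence)
```

The reverse complement is the complementary strand read in the opposite direction, so "AACG" becomes "CGTT".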
09:37	The video at the following link summarizes the Spoken Tutorial project.
09:42	Please download and watch it.
09:44	The Spoken Tutorial Project team conducts workshops and gives certificates to those who pass an on-line test.
09:51	For more details, please write to us.
09:55	The Spoken Tutorial Project is funded by NMEICT, MHRD, Government of India.
10:01	More information on this mission is available at the link shown.
10:06	This is Snehalatha from IIT Bombay, signing off. Thank you for joining.

Contributors and Content Editors

PoojaMoolya, Sandhya.np14

@@ Line 5: / Line 5: @@
 |-
 | 00:01
-| Hello everyone.
+| Hello everyone.Welcome to this tutorial on '''Parsing Data.'''
-|-
-| 00:02
-| Welcome to this tutorial on '''Parsing Data.'''
 |-
 | 00:06
-| In this tutorial, we will learn to, Download '''FASTA''' and '''GenBank''' files from '''NCBI''' database website.
+| In this tutorial, we will learn to download '''FASTA''' and '''GenBank''' files from '''NCBI''' database website.
 |-
 | 00:14
-| And '''Parse''' data files using functions in '''Sequence Input/Output''' module.
+| And, '''Parse''' data files using '''function'''s in '''Sequence Input/Output''' module.
 |-
 | 00:19
-| To follow this tutorial you should be familiar with, Undergraduate Biochemistry or Bioinformatics
+| To follow this tutorial, you should be familiar with undergraduate biochemistry or bioinformatics
 |-
 | 00:26
-| And basic '''Python''' programming
+| and basic '''Python''' programming.
 |-
@@ Line 33: / Line 29: @@
 |-
 | 00:34
-| To record this tutorial I am using '''Ubuntu''' OS version. 14.10
+| To record this tutorial, I am using: * '''Ubuntu OS''' version 14.10
 |-
@@ Line 45: / Line 41: @@
 |-
 | 00:48
-| '''Biopython''' 1.64 and '''Mozilla Firefox '''browser 35.0
+|'''Biopython''' version 1.64 and * '''Mozilla Firefox '''browser 35.0.
 |-
 | 00:56
-| Scientific data in biology is generally stored in text files such as '''FASTA''', '''GenBank''', '''EMBL''', '''Swiss-Prot''' etc
+| Scientific data in biology is generally stored in text files such as '''FASTA''', '''GenBank''', '''EMBL''', '''Swiss-Prot''' etc.
 |-
 | 01:07
-|Data files can be download from the database websites.
+|Data files can be downloaded from the database websites.
 |-
 | 01:12
-|Open the website link given below in any web browser.
+|Open the website link given below, in any web browser.
 |-
@@ Line 65: / Line 61: @@
 |-
 | 01:19
-|Let us download '''FASTA''' and '''GenBank''' files for human '''insulin''' gene.
+|Let us download '''FASTA''' and '''GenBank''' files for human '''insulin gene'''.
 |-
 | 01:25
-|In the search box type, '''human insulin''' click on search button.
+|In the search box, type: "human insulin", click on '''Search''' button.
 |-
 | 01:31
-| The web-page shows many files for human insulin gene.
+| The web-page shows many files for human '''insulin gene'''.
 |-
 | 01:35
-|For demonstration, I will select 4 files with the name “'''Homo sapiens Insulin mRNA”. '''
+|For demonstration, I will select 4 files with the name “Homo sapiens Insulin mRNA”.
 |-
 | 01:43
-|I will choose files that have less than 500 base pairs.
+|I will choose files that have less than 500 '''base''' pairs.
 |-
 | 01:48
-|Click on the check box to select the file to download.
+|Click on the check-box to select the file, to download.
 |-
@@ Line 93: / Line 89: @@
 |-
 | 02:02
-|Click on the small selection button with a down arrow present next to the “'''Send to'''” button.
+|Click on the small selection button with a down arrow, present next to the “'''Send to'''” button.
 |-
 | 02:09
-| Under the heading “'''Choose destination'''” Click on “'''File'''” option.
+| Under the heading “'''Choose destination'''”, click on '''File''' option.
 |-
 | 02:13
-|You can save this file in any file format listed under “'''format'''” drop down list box.
+|You can '''save''' this file in any file format, listed under '''format''' drop-down list box.
 |-
 | 02:21
-|Choose “'''FASTA'''” from the given options.
+|Choose '''FASTA''' from the given options.
 |-
 | 02:25
-|Then click on “'''Create file'''” option.
+|Then click on '''Create file''' option.
 |-
 | 02:29
-| A dialog box appears on the screen.
+| A dialog-box appears on the screen.
 |-
 |02:32
-|Select “'''Open with'''” click on '''OK .'''
+|Select '''Open with''', click on '''OK.'''
 |-
 | 02:36
-| A file opens in a text editor.
+| A file opens in a '''text editor'''.
 |-
@@ Line 128: / Line 124: @@
 |-
 | 02:46
-|The first line in each record is an '''identifier''' line,
+|The first line in each record is an '''identifier''' line.
 |-
 | 02:50
-|It starts with a “greater than (>) symbol”.
+|It starts with a “greater than (>)” symbol.
 |-
@@ Line 140: / Line 136: @@
 |-
 | 02:56
-|Save the file in your home folder as “'''sequence.fasta'''”.
+|'''Save''' the file in your '''home''' folder as “sequence.fasta'”.
 |-
@@ Line 148: / Line 144: @@
 |-
 | 03:03
-| Follow the same steps as above to download the files in '''GenBank''' format
+| Follow the same steps as above, to download the files in '''GenBank''' format
 |-
 | 03:08
-|For the same files selected earlier.
+|for the same files selected earlier.
 |-
 | 03:12
-|Select the file format as '''GenBank.'''
+|Select the '''file format''' as '''GenBank.'''
 |-
 | 03:16
-|'''Create file''', open with a text editor.
+|Create a file. Open with a text editor.
 |-
@@ Line 168: / Line 164: @@
 |-
 | 03:27
-|Save the file as '''sequence.gb '''in your home folder'''.'''Close the text editor'''.'''
+|'''Save''' the file as "sequence.gb" in your '''home''' folder. Close the text editor.
 |-
 | 03:34
-| For demonstration purpose we need a FASTA file with a single record.
+| For demonstration purpose, we need a FASTA file with a single '''record'''.
 |-
@@ Line 180: / Line 176: @@
 |-
 | 03:48
-|Now select the file “'''Human insulin gene complete cds'''”.
+|Now, select the file “'''Human insulin gene complete cds'''”.
 |-
 | 03:54
-|Click on the check box.
+|Click on the check-box.
 |-
 | 03:57
-| And Follow the same steps shown earlier to save the file in the home folder.
+| And follow the same steps shown earlier to '''save''' the file in the '''home''' folder.
 |-
 | 04:01
-|Save the file as '''insulin.fasta.'''
+|'''Save''' the file as "insulin.fasta".
 |-
@@ Line 200: / Line 196: @@
 |-
 | 04:16
-|Close the text editor.
+|Close the text-editor.
 |-
@@ Line 208: / Line 204: @@
 |-
 | 04:23
-|Most file formats can be parsed using functions available in '''SeqIO''' module.
+|Most file formats can be parsed using '''function'''s available in '''SeqIO''' module.
 |-
 | 04:30
-|Most commonly used functions of '''SeqIO''' module are, '''parse''', '''read''', '''write''', and '''convert'''.
+|Most commonly used functions of '''SeqIO''' module are: '''parse, read, write''' and '''convert'''.
 |-
 | 04:38
-| Open the terminal by pressing '''ctrl''', '''alt''' and '''t''' keys simultaneously.
+| Open the terminal by pressing '''Ctrl, Alt''' and '''t''' keys simultaneously.
 |-
 | 04:44
-| Start '''Ipython''' by typing '''ipython''' at the prompt. Press enter.
+| Start '''Ipython''' by typing "ipython" at the prompt. Press '''Enter'''.
 |-
 | 04:51
-| Next import '''SeqIO''' module from '''Bio''' package.
+| Next, '''import''' "SeqIO" module from '''Bio''' package.
 |-
 | 04:56
-| At the prompt type,'''from Bio import SeqIO'''. Press enter
+| At the prompt, type: '''from Bio import SeqIO'''. Press '''Enter'''.
 |-
@@ Line 236: / Line 232: @@
 |-
 |05:07
-|For demonstration, I will use a '''FASTA''' file that has many records, which we had downloaded earlier from the database.
+|For demonstration, I will use a '''FASTA''' file that has many '''record'''s which we had downloaded earlier from the database.
 |-
@@ Line 244: / Line 240: @@
 |-
 | 05:22
-|Here we are using the '''parse''' function to read contents of '''sequence.fasta''' file.
+|Here, we are using the '''parse''' function to read the contents of the '''sequence.fasta''' file.
 |-
 | 05:30
-|For the output print, '''record id''', sequence present in the record and also the length of the sequence.
+|For the output, print '''record id''', sequence present in the record and also the length of the sequence.
 |-
 | 05:41
@@ Line 267: / Line 263: @@
 |-
 | 06:02
-|Press enter key twice to get the output.
+|Press '''Enter''' key twice to get the output.
 |-
@@ Line 279: / Line 275: @@
 |-
 | 06:26
-|So, the output does not specifies it as as a '''DNA sequence'''.
+|So, the output does not specifies it as a '''DNA sequence'''.
 |-
@@ Line 287: / Line 283: @@
 |-
 | 06:36
-|For Demonstration we will use the '''GenBank''' file which we have download earlier from the database.
+|For Demonstration we will use the '''GenBank''' file which we have downloaded earlier from the database.
 |-
 | 06:43
-|Press up arrow key to get the lines of code which we had used earlier.
+|Press up-arrow key to get the lines of code which we had used earlier.
 |-
 | 06:49
-|Change the file name to '''sequence.gb '''
+|Change the file name to '''sequence.gb '''.
 |-
 | 06:53
@@ Line 306: / Line 302: @@
 |-
 | 06:58
-| Press enter key twice to get the output.
+| Press '''Enter''' key twice to get the output.
 |-
 | 07:03
-|Here too the output shows the '''record id''', sequence and the length of the sequence for all the records in the file.
+|Here too the output shows the '''record id''', '''sequence''' and the length of the sequence for all the records in the file.
 |-
@@ Line 318: / Line 314: @@
 |-
 | 07:19
-| Similarly '''Swiss-prot''' and '''EMBL''' files can be parsed using same code as above.
+| Similarly, '''Swiss-prot''' and '''EMBL''' files can be parsed using the same code as above.
 |-
@@ Line 326: / Line 322: @@
 |-
 | 07:34
-| Here we will use the previously saved FASTA file with a single record, that is '''insulin.fasta '''as an example'''.'''
+| Here, we will use the previously saved '''FASTA''' file with a single record, that is, '''insulin.fasta '''as an example.
 |-
 | 07:43
-|Notice that we have used '''read''' function instead of parse function. Press Enter
+|Notice that we have used '''read''' function instead of '''parse''' function. Press '''Enter'''.
 |-
@@ Line 338: / Line 334: @@
 |-
 | 07:55
-|It shows the sequence as sequence record object.
+|It shows the sequence as '''sequence record object'''.
 |-
@@ Line 350: / Line 346: @@
 |-
 | 08:11
-|At the prompt type, '''record dot seq '''. Press enter
+|At the prompt, type: '''record dot seq'''. Press '''Enter'''.
 |-
@@ Line 358: / Line 354: @@
 |-
 | 08:22
-| To view the identifiers for this record, type, '''record dot id.''' Press enter
+| To view the identifiers for this record, type: '''record dot id.''' Press '''Enter'''.
 |-
 | 08:29
-|The output shows the GI number and accession number etc.
+|The output shows the '''GI''' number and accession number etc.
 |-
 | 08:34
-|You can use the function described above to parse the data files of your choice.
+|You can use the function described above to '''parse''' the data files of your choice.
 |-
 | 08:40
-| Now Let's summarize,
+| Now, let's summarize.
 |-
 | 08:42
-|In this tutorial we have learnt, to Download '''FASTA''' and '''GenBank''' files from '''NCBI''' database website, and use '''parse''' and '''read''' functions from the '''SeqIO''' module:
+|In this tutorial, we have learnt:to download '''FASTA''' and '''GenBank''' files from '''NCBI''' database website and use '''parse''' and '''read''' functions from the '''SeqIO''' module
 |-
 | 08:55
-|To extract data such as record ids, description and sequences, from '''FASTA''' and '''GenBank''' files.
+| to extract data such as '''record id'''s, description and sequences from '''FASTA''' and '''GenBank''' files.
 |-
 | 09:03
-| Now for the assignment,
+| Now, for the assignment-
 |-
@@ Line 390: / Line 386: @@
 |-
 | 09:13
-|Convert the file of sequences to their reverse complements.
+|Convert the file of sequences to their '''reverse complement'''s.
 |-
@@ Line 398: / Line 394: @@
 |-
 | 09:22
-|Use '''parse''' function to load nucleotide sequences from the '''FASTA''' file.
+|Use '''parse''' function to '''load''' nucleotide sequences from the '''FASTA''' file.
 |-
 | 09:28
-|Next print reverse complements using the Sequence object’s built in reverse complement method.
+|Next, print reverse complements using the Sequence object’s built in '''reverse complement''' method.
 |-
 | 09:37
-| Video at the following link, summarizes the spoken-tutorial project.
+| Video at the following link summarizes the spoken-tutorial project.
 |-
@@ Line 414: / Line 410: @@
 |-
 | 09:44
-| The Spoken Tutorial Project Team Conducts workshops and gives certificates to those who pass an on-line test.
+| The Spoken Tutorial Project team conducts workshops and gives certificates to those who pass an on-line test.
 |-
@@ Line 426: / Line 422: @@
 |-
 | 10:01
-|More information on this Mission is available at the link shown.
+|More information on this mission is available at the link shown.
 |-
 | 10:06
-|This is Snehalatha from IIT Bombay signing off. Thank you for joining.
+|This is Snehalatha from '''IIT Bombay''', signing off. Thank you for joining.
 |}