Difference between revisions of "Biopython/C2/Parsing-Data/English-timed"
From Script | Spoken-Tutorial
PoojaMoolya (Talk | contribs) (Created page with "{| Border=1 ! <center>Time</center> ! <center>Narration</center> |- | 00:01 | Hello everyone. |- | 00:02 | Welcome to this tutorial on '''Parsing Data.''' |- | 00:06 | In t...") |
PoojaMoolya (Talk | contribs) |
||
(3 intermediate revisions by 2 users not shown) | |||
Line 5: | Line 5: | ||
|- | |- | ||
| 00:01 | | 00:01 | ||
− | | Hello everyone. | + | | Hello everyone.Welcome to this tutorial on '''Parsing Data.''' |
− | + | ||
− | + | ||
− | + | ||
− | + | ||
|- | |- | ||
| 00:06 | | 00:06 | ||
− | | In this tutorial, we will learn to | + | | In this tutorial, we will learn to download '''FASTA''' and '''GenBank''' files from '''NCBI''' database website. |
|- | |- | ||
| 00:14 | | 00:14 | ||
− | | And '''Parse''' data files using | + | | And, '''Parse''' data files using '''function'''s in '''Sequence Input/Output''' module. |
|- | |- | ||
| 00:19 | | 00:19 | ||
− | | To follow this tutorial you should be familiar with | + | | To follow this tutorial, you should be familiar with undergraduate biochemistry or bioinformatics |
|- | |- | ||
| 00:26 | | 00:26 | ||
− | | | + | | and basic '''Python''' programming. |
|- | |- | ||
Line 33: | Line 29: | ||
|- | |- | ||
| 00:34 | | 00:34 | ||
− | | To record this tutorial I am using '''Ubuntu''' | + | | To record this tutorial, I am using: * '''Ubuntu OS''' version 14.10 |
|- | |- | ||
Line 45: | Line 41: | ||
|- | |- | ||
| 00:48 | | 00:48 | ||
− | | '''Biopython''' 1.64 and '''Mozilla Firefox '''browser 35.0 | + | |'''Biopython''' version 1.64 and * '''Mozilla Firefox '''browser 35.0. |
|- | |- | ||
| 00:56 | | 00:56 | ||
− | | Scientific data in biology is generally stored in text files such as '''FASTA''', '''GenBank''', '''EMBL''', '''Swiss-Prot''' etc | + | | Scientific data in biology is generally stored in text files such as '''FASTA''', '''GenBank''', '''EMBL''', '''Swiss-Prot''' etc. |
|- | |- | ||
| 01:07 | | 01:07 | ||
− | |Data files can be | + | |Data files can be downloaded from the database websites. |
|- | |- | ||
| 01:12 | | 01:12 | ||
− | |Open the website link given below in any web browser. | + | |Open the website link given below, in any web browser. |
|- | |- | ||
Line 65: | Line 61: | ||
|- | |- | ||
| 01:19 | | 01:19 | ||
− | |Let us download '''FASTA''' and '''GenBank''' files for human '''insulin''' | + | |Let us download '''FASTA''' and '''GenBank''' files for human '''insulin gene'''. |
|- | |- | ||
| 01:25 | | 01:25 | ||
− | |In the search box type, ''' | + | |In the search box, type: "human insulin", click on '''Search''' button. |
|- | |- | ||
| 01:31 | | 01:31 | ||
− | | The web-page shows many files for human insulin gene. | + | | The web-page shows many files for human '''insulin gene'''. |
|- | |- | ||
| 01:35 | | 01:35 | ||
− | |For demonstration, I will select 4 files with the name | + | |For demonstration, I will select 4 files with the name “Homo sapiens Insulin mRNA”. |
|- | |- | ||
| 01:43 | | 01:43 | ||
− | |I will choose files that have less than 500 base pairs. | + | |I will choose files that have less than 500 '''base''' pairs. |
|- | |- | ||
| 01:48 | | 01:48 | ||
− | |Click on the check box to select the file to download. | + | |Click on the check-box to select the file, to download. |
|- | |- | ||
Line 93: | Line 89: | ||
|- | |- | ||
| 02:02 | | 02:02 | ||
− | |Click on the small selection button with a down arrow present next to the “'''Send to'''” button. | + | |Click on the small selection button with a down arrow, present next to the “'''Send to'''” button. |
|- | |- | ||
| 02:09 | | 02:09 | ||
− | | Under the heading “'''Choose destination'''” | + | | Under the heading “'''Choose destination'''”, click on '''File''' option. |
|- | |- | ||
| 02:13 | | 02:13 | ||
− | |You can save this file in any file format listed under | + | |You can '''save''' this file in any file format, listed under '''format''' drop-down list box. |
|- | |- | ||
| 02:21 | | 02:21 | ||
− | |Choose | + | |Choose '''FASTA''' from the given options. |
|- | |- | ||
| 02:25 | | 02:25 | ||
− | |Then click on | + | |Then click on '''Create file''' option. |
|- | |- | ||
| 02:29 | | 02:29 | ||
− | | A dialog box appears on the screen. | + | | A dialog-box appears on the screen. |
|- | |- | ||
|02:32 | |02:32 | ||
− | |Select | + | |Select '''Open with''', click on '''OK.''' |
|- | |- | ||
| 02:36 | | 02:36 | ||
− | | A file opens in a text editor. | + | | A file opens in a '''text editor'''. |
|- | |- | ||
Line 128: | Line 124: | ||
|- | |- | ||
| 02:46 | | 02:46 | ||
− | |The first line in each record is an '''identifier''' line | + | |The first line in each record is an '''identifier''' line. |
|- | |- | ||
| 02:50 | | 02:50 | ||
− | |It starts with a “greater than (>) | + | |It starts with a “greater than (>)” symbol. |
|- | |- | ||
Line 140: | Line 136: | ||
|- | |- | ||
| 02:56 | | 02:56 | ||
− | |Save the file in your home | + | |'''Save''' the file in your '''home''' folder as “sequence.fasta'”. |
|- | |- | ||
Line 148: | Line 144: | ||
|- | |- | ||
| 03:03 | | 03:03 | ||
− | | Follow the same steps as above to download the files in '''GenBank''' format | + | | Follow the same steps as above, to download the files in '''GenBank''' format |
|- | |- | ||
| 03:08 | | 03:08 | ||
− | | | + | |for the same files selected earlier. |
|- | |- | ||
| 03:12 | | 03:12 | ||
− | |Select the file format as '''GenBank.''' | + | |Select the '''file format''' as '''GenBank.''' |
|- | |- | ||
| 03:16 | | 03:16 | ||
− | | | + | |Create a file. Open with a text editor. |
|- | |- | ||
Line 168: | Line 164: | ||
|- | |- | ||
| 03:27 | | 03:27 | ||
− | |Save the file as | + | |'''Save''' the file as "sequence.gb" in your '''home''' folder. Close the text editor. |
|- | |- | ||
| 03:34 | | 03:34 | ||
− | | For demonstration purpose we need a FASTA file with a single record. | + | | For demonstration purpose, we need a FASTA file with a single '''record'''. |
|- | |- | ||
Line 180: | Line 176: | ||
|- | |- | ||
| 03:48 | | 03:48 | ||
− | |Now select the file “'''Human insulin gene complete cds'''”. | + | |Now, select the file “'''Human insulin gene complete cds'''”. |
|- | |- | ||
| 03:54 | | 03:54 | ||
− | |Click on the check box. | + | |Click on the check-box. |
|- | |- | ||
| 03:57 | | 03:57 | ||
− | | And | + | | And follow the same steps shown earlier to '''save''' the file in the '''home''' folder. |
|- | |- | ||
| 04:01 | | 04:01 | ||
− | |Save | + | |'''Save''' the file as "insulin.fasta". |
|- | |- | ||
Line 200: | Line 196: | ||
|- | |- | ||
| 04:16 | | 04:16 | ||
− | |Close the text editor. | + | |Close the text-editor. |
|- | |- | ||
Line 208: | Line 204: | ||
|- | |- | ||
| 04:23 | | 04:23 | ||
− | |Most file formats can be parsed using | + | |Most file formats can be parsed using '''function'''s available in '''SeqIO''' module. |
|- | |- | ||
| 04:30 | | 04:30 | ||
− | |Most commonly used functions of '''SeqIO''' module are | + | |Most commonly used functions of '''SeqIO''' module are: '''parse, read, write''' and '''convert'''. |
|- | |- | ||
| 04:38 | | 04:38 | ||
− | | Open the terminal by pressing ''' | + | | Open the terminal by pressing '''Ctrl, Alt''' and '''t''' keys simultaneously. |
|- | |- | ||
| 04:44 | | 04:44 | ||
− | | Start '''Ipython''' by typing | + | | Start '''Ipython''' by typing "ipython" at the prompt. Press '''Enter'''. |
|- | |- | ||
| 04:51 | | 04:51 | ||
− | | Next | + | | Next, '''import''' "SeqIO" module from '''Bio''' package. |
|- | |- | ||
| 04:56 | | 04:56 | ||
− | | At the prompt | + | | At the prompt, type: '''from Bio import SeqIO'''. Press '''Enter'''. |
|- | |- | ||
Line 236: | Line 232: | ||
|- | |- | ||
|05:07 | |05:07 | ||
− | |For demonstration, I will use a '''FASTA''' file that has many | + | |For demonstration, I will use a '''FASTA''' file that has many '''record'''s which we had downloaded earlier from the database. |
|- | |- | ||
Line 244: | Line 240: | ||
|- | |- | ||
| 05:22 | | 05:22 | ||
− | |Here we are using the '''parse''' function to read contents of '''sequence.fasta''' file. | + | |Here, we are using the '''parse''' function to read the contents of the '''sequence.fasta''' file. |
|- | |- | ||
| 05:30 | | 05:30 | ||
− | |For the output | + | |For the output, print '''record id''', sequence present in the record and also the length of the sequence. |
|- | |- | ||
| 05:41 | | 05:41 | ||
Line 267: | Line 263: | ||
|- | |- | ||
| 06:02 | | 06:02 | ||
− | |Press | + | |Press '''Enter''' key twice to get the output. |
|- | |- | ||
Line 279: | Line 275: | ||
|- | |- | ||
| 06:26 | | 06:26 | ||
− | |So, the output does not specifies it | + | |So, the output does not specifies it as a '''DNA sequence'''. |
|- | |- | ||
Line 287: | Line 283: | ||
|- | |- | ||
| 06:36 | | 06:36 | ||
− | |For Demonstration we will use the '''GenBank''' file which we have | + | |For Demonstration we will use the '''GenBank''' file which we have downloaded earlier from the database. |
|- | |- | ||
| 06:43 | | 06:43 | ||
− | |Press up arrow key to get the lines of code which we had used earlier. | + | |Press up-arrow key to get the lines of code which we had used earlier. |
|- | |- | ||
| 06:49 | | 06:49 | ||
− | |Change the file name to '''sequence.gb ''' | + | |Change the file name to '''sequence.gb '''. |
|- | |- | ||
| 06:53 | | 06:53 | ||
Line 306: | Line 302: | ||
|- | |- | ||
| 06:58 | | 06:58 | ||
− | | Press | + | | Press '''Enter''' key twice to get the output. |
|- | |- | ||
| 07:03 | | 07:03 | ||
− | |Here too the output shows the '''record id''', sequence and the length of the sequence for all the records in the file. | + | |Here too the output shows the '''record id''', '''sequence''' and the length of the sequence for all the records in the file. |
|- | |- | ||
Line 318: | Line 314: | ||
|- | |- | ||
| 07:19 | | 07:19 | ||
− | | Similarly '''Swiss-prot''' and '''EMBL''' files can be parsed using same code as above. | + | | Similarly, '''Swiss-prot''' and '''EMBL''' files can be parsed using the same code as above. |
|- | |- | ||
Line 326: | Line 322: | ||
|- | |- | ||
| 07:34 | | 07:34 | ||
− | | Here we will use the previously saved FASTA file with a single record, that is '''insulin.fasta '''as an example | + | | Here, we will use the previously saved '''FASTA''' file with a single record, that is, '''insulin.fasta '''as an example. |
|- | |- | ||
| 07:43 | | 07:43 | ||
− | |Notice that we have used '''read''' function instead of parse function. Press Enter | + | |Notice that we have used '''read''' function instead of '''parse''' function. Press '''Enter'''. |
|- | |- | ||
Line 338: | Line 334: | ||
|- | |- | ||
| 07:55 | | 07:55 | ||
− | |It shows the sequence as sequence record object. | + | |It shows the sequence as '''sequence record object'''. |
|- | |- | ||
Line 350: | Line 346: | ||
|- | |- | ||
| 08:11 | | 08:11 | ||
− | |At the prompt | + | |At the prompt, type: '''record dot seq'''. Press '''Enter'''. |
|- | |- | ||
Line 358: | Line 354: | ||
|- | |- | ||
| 08:22 | | 08:22 | ||
− | | To view the identifiers for this record, type | + | | To view the identifiers for this record, type: '''record dot id.''' Press '''Enter'''. |
|- | |- | ||
| 08:29 | | 08:29 | ||
− | |The output shows the GI number and accession number etc. | + | |The output shows the '''GI''' number and accession number etc. |
|- | |- | ||
| 08:34 | | 08:34 | ||
− | |You can use the function described above to parse the data files of your choice. | + | |You can use the function described above to '''parse''' the data files of your choice. |
|- | |- | ||
| 08:40 | | 08:40 | ||
− | | Now | + | | Now, let's summarize. |
|- | |- | ||
| 08:42 | | 08:42 | ||
− | |In this tutorial we have learnt | + | |In this tutorial, we have learnt:to download '''FASTA''' and '''GenBank''' files from '''NCBI''' database website and use '''parse''' and '''read''' functions from the '''SeqIO''' module |
|- | |- | ||
| 08:55 | | 08:55 | ||
− | | | + | | to extract data such as '''record id'''s, description and sequences from '''FASTA''' and '''GenBank''' files. |
|- | |- | ||
| 09:03 | | 09:03 | ||
− | | Now for the assignment | + | | Now, for the assignment- |
|- | |- | ||
Line 390: | Line 386: | ||
|- | |- | ||
| 09:13 | | 09:13 | ||
− | |Convert the file of sequences to their reverse | + | |Convert the file of sequences to their '''reverse complement'''s. |
|- | |- | ||
Line 398: | Line 394: | ||
|- | |- | ||
| 09:22 | | 09:22 | ||
− | |Use '''parse''' function to load nucleotide sequences from the '''FASTA''' file. | + | |Use '''parse''' function to '''load''' nucleotide sequences from the '''FASTA''' file. |
|- | |- | ||
| 09:28 | | 09:28 | ||
− | |Next print reverse complements using the Sequence object’s built in reverse complement method. | + | |Next, print reverse complements using the Sequence object’s built in '''reverse complement''' method. |
|- | |- | ||
| 09:37 | | 09:37 | ||
− | | Video at the following link | + | | Video at the following link summarizes the spoken-tutorial project. |
|- | |- | ||
Line 414: | Line 410: | ||
|- | |- | ||
| 09:44 | | 09:44 | ||
− | | The Spoken Tutorial Project | + | | The Spoken Tutorial Project team conducts workshops and gives certificates to those who pass an on-line test. |
|- | |- | ||
Line 426: | Line 422: | ||
|- | |- | ||
| 10:01 | | 10:01 | ||
− | |More information on this | + | |More information on this mission is available at the link shown. |
|- | |- | ||
| 10:06 | | 10:06 | ||
− | |This is Snehalatha from IIT Bombay signing off. Thank you for joining. | + | |This is Snehalatha from '''IIT Bombay''', signing off. Thank you for joining. |
|} | |} |
Latest revision as of 17:26, 10 March 2017
Time | Narration |
---|---|
00:01 | Hello everyone. Welcome to this tutorial on Parsing Data. |
00:06 | In this tutorial, we will learn to download FASTA and GenBank files from the NCBI database website |
00:14 | and parse data files using functions in the Sequence Input/Output module. |
00:19 | To follow this tutorial, you should be familiar with undergraduate biochemistry or bioinformatics |
00:26 | and basic Python programming. |
00:30 | Refer to the Python tutorials at the given link. |
00:34 | To record this tutorial, I am using Ubuntu OS version 14.10, |
00:40 | Python version 2.7.8, |
00:44 | IPython interpreter version 2.3.0, |
00:48 | Biopython version 1.64 and Mozilla Firefox browser 35.0. |
00:56 | Scientific data in biology is generally stored in text files such as FASTA, GenBank, EMBL, Swiss-Prot etc. |
01:07 | Data files can be downloaded from the database websites. |
01:12 | Open the website link given below in any web browser. |
01:17 | A web-page opens. |
01:19 | Let us download FASTA and GenBank files for the human insulin gene. |
01:25 | In the search box, type "human insulin" and click on the Search button. |
01:31 | The web-page shows many files for the human insulin gene. |
01:35 | For demonstration, I will select 4 files with the name “Homo sapiens Insulin mRNA”. |
01:43 | I will choose files that have fewer than 500 base pairs. |
01:48 | Click on the check-box to select a file for download. |
01:56 | Bring the cursor to the “Send to” option, located at the top right corner of the page. |
02:02 | Click on the small selection button with a down arrow, present next to the “Send to” button. |
02:09 | Under the heading “Choose destination”, click on the File option. |
02:13 | You can save this file in any of the file formats listed in the format drop-down list box. |
02:21 | Choose FASTA from the given options. |
02:25 | Then click on the Create file option. |
02:29 | A dialog-box appears on the screen. |
02:32 | Select Open with and click on OK. |
02:36 | A file opens in a text editor. |
02:39 | The file shows four records, since we selected four files to download. |
02:46 | The first line in each record is an identifier line. |
02:50 | It starts with a “greater than (>)” symbol. |
02:53 | This is followed by a sequence. |
02:56 | Save the file in your home folder as “sequence.fasta”. |
03:01 | Close the text editor. |
03:03 | Follow the same steps as above, to download the files in GenBank format |
03:08 | for the same files selected earlier. |
03:12 | Select the file format as GenBank. |
03:16 | Create the file and open it with a text editor. |
03:21 | Notice that the sequence file in GenBank format has more features than a FASTA file. |
03:27 | Save the file as "sequence.gb" in your home folder. Close the text editor. |
03:34 | For demonstration purposes, we need a FASTA file with a single record. |
03:39 | For this, clear the earlier selection by again clicking on the check boxes. |
03:48 | Now, select the file “Human insulin gene complete cds”. |
03:54 | Click on the check-box. |
03:57 | And follow the same steps shown earlier to save the file in the home folder. |
04:01 | Save the file as "insulin.fasta". |
04:08 | Biological data stored in these files can be extracted and modified using Biopython libraries. |
04:16 | Close the text-editor. |
04:19 | Extracting data from data files is called parsing. |
04:23 | Most file formats can be parsed using functions available in the SeqIO module. |
04:30 | The most commonly used functions of the SeqIO module are parse, read, write and convert. |
04:38 | Open the terminal by pressing Ctrl, Alt and t keys simultaneously. |
04:44 | Start IPython by typing "ipython" at the prompt. Press Enter. |
04:51 | Next, import the SeqIO module from the Bio package. |
04:56 | At the prompt, type: from Bio import SeqIO. Press Enter. |
05:04 | We will start with the most important function, “parse”. |
05:07 | For demonstration, I will use the FASTA file with many records, which we downloaded earlier from the database. |
05:17 | For simple FASTA parsing, type the following at the prompt. |
05:22 | Here, we are using the parse function to read the contents of the sequence.fasta file. |
05:30 | For the output, we print the record id, the sequence present in the record and the length of the sequence. |
05:41 | Also notice that the parse function is used to read sequence data as Sequence record objects. |
05:48 | It is generally used with a for loop. |
05:52 | It accepts two arguments: the first is the name of the file to read the data from. |
05:59 | The second specifies the file format. |
06:02 | Press the Enter key twice to get the output. |
06:07 | The output shows the identifier, followed by the sequence and the length of the sequence, for every record in the file. |
06:21 | Notice that the FASTA format does not specify the alphabet. |
06:26 | So, the output does not specify it as a DNA sequence. |
06:31 | The same steps can be repeated to parse a GenBank file. |
06:36 | For demonstration, we will use the GenBank file which we downloaded earlier from the database. |
06:43 | Press the up-arrow key to recall the lines of code which we used earlier. |
06:49 | Change the file name to sequence.gb. |
06:53 | Change the file format to genbank. |
06:56 | The rest of the code remains the same. |
06:58 | Press the Enter key twice to get the output. |
07:03 | Here too, the output shows the record id, the sequence and the length of the sequence for all the records in the file. |
07:12 | Notice that the GenBank format specifies the sequence as a DNA sequence. |
07:19 | Similarly, Swiss-Prot and EMBL files can be parsed using the same code as above. |
07:27 | If your file contains a single record, then type the following lines for parsing. |
07:34 | Here, we will use the previously saved FASTA file with a single record, that is, insulin.fasta as an example. |
07:43 | Notice that we have used the read function instead of the parse function. Press Enter. |
07:50 | The output shows the contents of the file insulin.fasta. |
07:55 | It shows the sequence as a sequence record object, |
07:59 | along with other attributes such as GI, accession number and description. |
08:06 | We can also view the individual attributes of this record as follows. |
08:11 | At the prompt, type: record dot seq. Press Enter. |
08:18 | The output shows the sequence present in the file. |
08:22 | To view the identifiers for this record, type: record dot id. Press Enter. |
08:29 | The output shows the GI number and accession number etc. |
08:34 | You can use the functions described above to parse the data files of your choice. |
08:40 | Now, let's summarize. |
08:42 | In this tutorial, we have learnt to download FASTA and GenBank files from the NCBI database website, and to use the parse and read functions from the SeqIO module |
08:55 | to extract data such as record ids, description and sequences from FASTA and GenBank files. |
09:03 | Now, for the assignment. |
09:06 | Download FASTA files for nucleotide sequences of your choice from the NCBI database. |
09:13 | Convert the file of sequences to their reverse complements. |
09:17 | Your completed assignment should have the following lines of code. |
09:22 | Use the parse function to load nucleotide sequences from the FASTA file. |
09:28 | Next, print reverse complements using the Sequence object’s built-in reverse complement method. |
09:37 | The video at the following link summarizes the Spoken Tutorial project. |
09:42 | Please download and watch it. |
09:44 | The Spoken Tutorial Project team conducts workshops and gives certificates to those who pass an on-line test. |
09:51 | For more details, please write to us. |
09:55 | The Spoken Tutorial Project is funded by NMEICT, MHRD, Government of India. |
10:01 | More information on this mission is available at the link shown. |
10:06 | This is Snehalatha from IIT Bombay, signing off. Thank you for joining. |
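
The parsing walkthrough in this transcript describes a FASTA record as an identifier line starting with ">" followed by a sequence, and prints the record id, the sequence and its length for every record. The sketch below illustrates that idea in plain Python so it can be run without Biopython. It is not Biopython's implementation: the helper name `parse_fasta` and the two sample records are hypothetical, made up for illustration only. In the tutorial itself this step is done with the parse function from SeqIO, given a file name and a file format.

```python
import io


def parse_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted text handle.

    Hypothetical helper for illustration; the tutorial uses SeqIO.parse.
    """
    identifier = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new identifier line closes the previous record, if any.
            if identifier is not None:
                yield identifier, "".join(chunks)
            # Keep only the first word of the header as the record id.
            words = line[1:].split()
            identifier = words[0] if words else ""
            chunks = []
        else:
            # Sequence lines of one record are joined into a single string.
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


# Made-up sample data standing in for the downloaded sequence.fasta file.
sample = io.StringIO(
    ">seq1 example record one\n"
    "ATGGCC\n"
    "CTGTGG\n"
    ">seq2 example record two\n"
    "ATGAAA\n"
)

records = list(parse_fasta(sample))
for identifier, sequence in records:
    print(identifier)
    print(sequence)
    print(len(sequence))
```

The loop at the end mirrors the three values printed in the tutorial: the record id, the sequence and the sequence length, once per record.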
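
For the assignment, the reverse complement step can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the Sequence object's built-in method that the assignment asks for: the function name `reverse_complement` and the sample sequence are hypothetical, and only the unambiguous DNA letters A, C, G and T (upper or lower case) are handled.

```python
# Complement table for unambiguous DNA letters (assumption: no ambiguity codes).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA string."""
    # Complement every base, then reverse the resulting string.
    return sequence.translate(COMPLEMENT)[::-1]


# Made-up sample sequence for illustration.
example = "ATGGCC"
result = reverse_complement(example)
print(result)
```

With Biopython, the assignment would instead call `record.seq.reverse_complement()` on each record returned by the parse function.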