Difference between revisions of "Biopython/C2/Parsing-Data/English-timed"
Contributors: PoojaMoolya, Sandhya.np14
Revision as of 23:30, 2 August 2016
Time | Narration |
---|---|
00:01 | Hello everyone. |
00:02 | Welcome to this tutorial on Parsing Data. |
00:06 | In this tutorial, we will learn to download FASTA and GenBank files from the NCBI database website |
00:14 | and to parse data files using functions in the Sequence Input/Output (SeqIO) module. |
00:19 | To follow this tutorial, you should be familiar with undergraduate biochemistry or bioinformatics |
00:26 | and basic Python programming. |
00:30 | Refer to the Python tutorials at the given link. |
00:34 | To record this tutorial, I am using Ubuntu OS version 14.10, |
00:40 | Python version 2.7.8, |
00:44 | IPython interpreter version 2.3.0, |
00:48 | Biopython 1.64 and Mozilla Firefox browser 35.0. |
00:56 | Scientific data in biology is generally stored in text file formats such as FASTA, GenBank, EMBL and Swiss-Prot. |
01:07 | Data files can be downloaded from the database websites. |
01:12 | Open the website link given below in any web browser. |
01:17 | A web-page opens. |
01:19 | Let us download FASTA and GenBank files for the human insulin gene. |
01:25 | In the search box, type: human insulin, then click on the Search button. |
01:31 | The web-page shows many files for the human insulin gene. |
01:35 | For demonstration, I will select 4 files with the name “Homo sapiens Insulin mRNA”. |
01:43 | I will choose files that have fewer than 500 base pairs. |
01:48 | Click on the check box next to each file to select it for download. |
01:56 | Bring the cursor to the “Send to” option, located at the top right corner of the page. |
02:02 | Click on the small selection button with a down arrow, present next to the “Send to” button. |
02:09 | Under the heading “Choose destination”, click on the File option. |
02:13 | You can save this file in any of the file formats listed in the Format drop-down list. |
02:21 | Choose FASTA from the given options. |
02:25 | Then click on the Create file option. |
02:29 | A dialog box appears on the screen. |
02:32 | Select Open with and click on OK. |
02:36 | A file opens in a text editor. |
02:39 | The file shows 4 records, since we had selected four files to download. |
02:46 | The first line in each record is an identifier line. |
02:50 | It starts with a “greater than (>)” symbol. |
02:53 | This is followed by a sequence. |
02:56 | Save the file in your home folder as “sequence.fasta”. |
03:01 | Close the text editor. |
03:03 | Follow the same steps as above, to download the files in GenBank format |
03:08 | for the same files selected earlier. |
03:12 | Select the file format as GenBank. |
03:16 | Create the file and open it with a text editor. |
03:21 | Notice that the sequence file in GenBank format has more features than a FASTA file. |
03:27 | Save the file as sequence.gb in your home folder. Close the text editor. |
03:34 | For demonstration purposes, we need a FASTA file with a single record. |
03:39 | For this, clear the earlier selection by again clicking on the check boxes. |
03:48 | Now, select the file “Human insulin gene complete cds”. |
03:54 | Click on the check box. |
03:57 | And follow the same steps shown earlier to save the file in the home folder. |
04:01 | Save the file as insulin.fasta. |
04:08 | Biological data stored in these files can be extracted and modified using Biopython libraries. |
04:16 | Close the text editor. |
04:19 | Extracting data from data files is called parsing. |
04:23 | Most file formats can be parsed using functions available in the SeqIO module. |
04:30 | The most commonly used functions of the SeqIO module are: parse, read, write and convert. |
04:38 | Open the terminal by pressing Ctrl, Alt and t keys simultaneously. |
04:44 | Start IPython by typing ipython at the prompt. Press Enter. |
04:51 | Next, import the SeqIO module from the Bio package. |
04:56 | At the prompt, type: from Bio import SeqIO. Press Enter. |
05:04 | We will start with the most important function “parse”. |
05:07 | For demonstration, I will use a FASTA file that has many records, which we had downloaded earlier from the database. |
05:17 | For simple FASTA parsing, type the following at the prompt. |
05:22 | Here we are using the parse function to read the contents of the sequence.fasta file. |
05:30 | For the output, print the record id, the sequence present in the record and the length of the sequence. |
05:41 | Also notice that the parse function reads sequence data as sequence record (SeqRecord) objects. |
05:48 | It is generally used with a for loop. |
05:52 | It accepts two arguments: the first is the name of the file to read the data from. |
05:59 | The second specifies the file format. |
06:02 | Press the Enter key twice to get the output. |
06:07 | For every record in the file, the output shows the identifier, followed by the sequence contained in the record and the length of the sequence. |
06:21 | Notice that the FASTA format does not specify the alphabet. |
06:26 | So, the output does not specify it as a DNA sequence. |
06:31 | The same steps can be repeated for parsing a GenBank file. |
06:36 | For demonstration, we will use the GenBank file which we had downloaded earlier from the database. |
06:43 | Press the up arrow key to get the lines of code which we had used earlier. |
06:49 | Change the file name to sequence.gb. |
06:53 | Change the file format to genbank. |
06:56 | The rest of the code remains the same. |
06:58 | Press the Enter key twice to get the output. |
07:03 | Here too, the output shows the record id, sequence and the length of the sequence for all the records in the file. |
07:12 | Notice that the GenBank format specifies the sequence as a DNA sequence. |
07:19 | Similarly, Swiss-Prot and EMBL files can be parsed using the same code as above, by changing the file name and the file format. |
07:27 | If your file contains a single record, then type the following lines for parsing. |
07:34 | Here we will use the previously saved FASTA file with a single record, that is insulin.fasta as an example. |
07:43 | Notice that we have used the read function instead of the parse function. Press Enter. |
07:50 | The output shows the contents of the file insulin.fasta. |
07:55 | It shows the sequence as a sequence record object. |
07:59 | And other attributes such as GI, accession number and description. |
08:06 | We can also view the individual attributes of this record as follows. |
08:11 | At the prompt, type: record dot seq. Press Enter. |
08:18 | The output shows the sequence present in the file. |
08:22 | To view the identifiers for this record, type: record dot id. Press Enter. |
08:29 | The output shows the GI number, the accession number and so on. |
08:34 | You can use the functions described above to parse the data files of your choice. |
08:40 | Now let's summarize. |
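08:42 | In this tutorial, we have learnt to download FASTA and GenBank files from the NCBI database website and to use the parse and read functions from the SeqIO module |
08:55 | to extract data such as record ids, descriptions and sequences from FASTA and GenBank files. |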
09:03 | Now, for the assignment. |
09:06 | Download FASTA files for nucleotide sequences of your choice from the NCBI database. |
09:13 | Convert the sequences in the file to their reverse complements. |
09:17 | Your completed assignment should have the following lines of code. |
09:22 | Use the parse function to load nucleotide sequences from the FASTA file. |
09:28 | Next, print reverse complements using the Sequence object’s built-in reverse complement method. |
09:37 | The video at the following link summarizes the Spoken Tutorial project. |
09:42 | Please download and watch it. |
09:44 | The Spoken Tutorial Project team conducts workshops and gives certificates to those who pass an online test. |
09:51 | For more details, please write to us. |
09:55 | The Spoken Tutorial Project is funded by NMEICT, MHRD, Government of India. |
10:01 | More information on this mission is available at the link shown. |
10:06 | This is Snehalatha from IIT Bombay, signing off. Thank you for joining. |
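
The FASTA layout described at 02:46 to 02:53 (an identifier line starting with ">" followed by the sequence) can be illustrated with a minimal sketch that uses only the Python standard library. The two records below are made up for illustration, and `split_fasta` is a hypothetical helper, not a Biopython function.

```python
# Hypothetical two-record FASTA text; identifiers and sequences are made up.
fasta_text = """>demo1 hypothetical record one
ATGGCCCTGTGG
ATGCGC
>demo2 hypothetical record two
TTGACC
"""


def split_fasta(text):
    """Return a list of (identifier_line, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new identifier line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            # Sequence lines of one record are joined into a single string.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


for header, sequence in split_fasta(fasta_text):
    print(header, len(sequence))
```

In practice the SeqIO functions covered in this tutorial do this work; the sketch only makes the record structure visible.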
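
The parse step narrated at 05:17 to 06:07 might look like the sketch below. The code typed in the recording is not part of this transcript, so this is an assumption based on the narration: loop over `SeqIO.parse` and print each record's id, sequence and length. It is written for Python 3 and a recent Biopython rather than the Python 2.7.8 setup used in the recording. `collect_records` and the in-memory demo records are hypothetical stand-ins for the downloaded sequence.fasta file.

```python
from io import StringIO

from Bio import SeqIO  # requires Biopython to be installed

# Hypothetical stand-in for sequence.fasta; the records are made up.
demo_fasta = StringIO(
    ">demo1 hypothetical record one\n"
    "ATGGCCCTGTGG\n"
    ">demo2 hypothetical record two\n"
    "TTGACCGA\n"
)


def collect_records(source, file_format):
    """Parse every record and return (id, sequence, length) tuples."""
    rows = []
    # First argument: file name or open handle. Second: the file format.
    for record in SeqIO.parse(source, file_format):
        rows.append((record.id, str(record.seq), len(record.seq)))
    return rows


rows = collect_records(demo_fasta, "fasta")
for record_id, sequence, length in rows:
    print(record_id)
    print(sequence)
    print(length)
```

With the files saved in this tutorial, the equivalent calls would be `collect_records("sequence.fasta", "fasta")` and, for the GenBank step at 06:31, `collect_records("sequence.gb", "genbank")`.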
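
For the single-record case narrated at 07:27 to 08:29, a hedged sketch using `SeqIO.read` follows. The record is made up and stands in for insulin.fasta, and the code assumes Python 3 with a recent Biopython. It also shows that `read` raises `ValueError` when the input holds more than one record, which is why `parse` is used for multi-record files.

```python
from io import StringIO

from Bio import SeqIO  # requires Biopython to be installed

# Hypothetical single-record FASTA text standing in for insulin.fasta.
single = ">demo1 hypothetical single record\nATGGCCCTGTGG\n"

record = SeqIO.read(StringIO(single), "fasta")
print(record)              # the whole sequence record object
print(record.seq)          # the sequence only
print(record.id)           # the identifier (first word of the identifier line)
print(record.description)  # the full identifier line without ">"

# read() expects exactly one record; two records raise ValueError.
two = single + ">demo2 hypothetical second record\nTTGACC\n"
try:
    SeqIO.read(StringIO(two), "fasta")
    read_failed = False
except ValueError:
    read_failed = True
print(read_failed)
```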
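
One possible shape for the assignment described at 09:06 to 09:28 is sketched below. The assignment code shown in the recording is not included in this transcript, so this is only an assumption that follows the narration: load the records with `parse`, then call the sequence object's `reverse_complement` method. The demo records are made up, and the code assumes Python 3 with a recent Biopython.

```python
from io import StringIO

from Bio import SeqIO  # requires Biopython to be installed

# Hypothetical nucleotide records standing in for a downloaded FASTA file.
demo_fasta = StringIO(
    ">demo1 hypothetical record\n"
    "ATGC\n"
    ">demo2 hypothetical record\n"
    "AAGT\n"
)

reverse_complements = {}
for record in SeqIO.parse(demo_fasta, "fasta"):
    # reverse_complement() returns a new sequence; the record is unchanged.
    reverse_complements[record.id] = str(record.seq.reverse_complement())
    print(record.id, reverse_complements[record.id])
```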