Difference between revisions of "Biopython/C2/Parsing-Data/English-timed"
From Script | Spoken-Tutorial
Latest revision as of 17:26, 10 March 2017
Time | Narration |
---|---|
00:01 | Hello everyone. Welcome to this tutorial on Parsing Data. |
00:06 | In this tutorial, we will learn to download FASTA and GenBank files from NCBI database website. |
00:14 | And parse data files using functions in the Sequence Input/Output module. |
00:19 | To follow this tutorial, you should be familiar with undergraduate biochemistry or bioinformatics |
00:26 | and basic Python programming. |
00:30 | Refer to the Python tutorials at the given link. |
00:34 | To record this tutorial, I am using: Ubuntu OS version 14.10 |
00:40 | Python version 2.7.8 |
00:44 | IPython interpreter version 2.3.0 |
00:48 | Biopython version 1.64 and Mozilla Firefox browser 35.0. |
00:56 | Scientific data in biology is generally stored in text files such as FASTA, GenBank, EMBL, Swiss-Prot etc. |
01:07 | Data files can be downloaded from the database websites. |
01:12 | Open the website link given below, in any web browser. |
01:17 | A web-page opens. |
01:19 | Let us download FASTA and GenBank files for the human insulin gene. |
01:25 | In the search box, type "human insulin" and click on the Search button. |
01:31 | The web-page shows many files for human insulin gene. |
01:35 | For demonstration, I will select 4 files with the name “Homo sapiens Insulin mRNA”. |
01:43 | I will choose files that have less than 500 base pairs. |
01:48 | Click on the check-box to select the file, to download. |
01:56 | Bring the cursor to the “Send to” option, located at the top right corner of the page. |
02:02 | Click on the small selection button with a down arrow, present next to the “Send to” button. |
02:09 | Under the heading “Choose destination”, click on File option. |
02:13 | You can save this file in any file format, listed under format drop-down list box. |
02:21 | Choose FASTA from the given options. |
02:25 | Then click on Create file option. |
02:29 | A dialog-box appears on the screen. |
02:32 | Select Open with, click on OK. |
02:36 | A file opens in a text editor. |
02:39 | The file shows 4 records, since we had selected four files to download. |
02:46 | The first line in each record is an identifier line. |
02:50 | It starts with a “greater than (>)” symbol. |
02:53 | This is followed by a sequence. |
02:56 | Save the file in your home folder as “sequence.fasta”. |
03:01 | Close the text editor. |
03:03 | Follow the same steps as above, to download the files in GenBank format |
03:08 | for the same files selected earlier. |
03:12 | Select the file format as GenBank. |
03:16 | Create a file. Open with a text editor. |
03:21 | Notice that the sequence file in GenBank format has more features than a FASTA file. |
03:27 | Save the file as "sequence.gb" in your home folder. Close the text editor. |
03:34 | For demonstration purpose, we need a FASTA file with a single record. |
03:39 | For this, clear the earlier selection by again clicking on the check boxes. |
03:48 | Now, select the file “Human insulin gene complete cds”. |
03:54 | Click on the check-box. |
03:57 | And follow the same steps shown earlier to save the file in the home folder. |
04:01 | Save the file as "insulin.fasta". |
04:08 | Biological data stored in these files can be extracted and modified using Biopython libraries. |
04:16 | Close the text-editor. |
04:19 | Extracting data from data files is called parsing. |
04:23 | Most file formats can be parsed using functions available in SeqIO module. |
04:30 | Most commonly used functions of SeqIO module are: parse, read, write and convert. |
04:38 | Open the terminal by pressing Ctrl, Alt and t keys simultaneously. |
04:44 | Start Ipython by typing "ipython" at the prompt. Press Enter. |
04:51 | Next, import "SeqIO" module from Bio package. |
04:56 | At the prompt, type: from Bio import SeqIO. Press Enter. |
05:04 | We will start with the most important function “parse”. |
05:07 | For demonstration, I will use a FASTA file that has many records which we had downloaded earlier from the database. |
05:17 | For simple FASTA parsing, type the following at the prompt. |
05:22 | Here, we are using the parse function to read the contents of the sequence.fasta file. |
05:30 | For the output, print record id, sequence present in the record and also the length of the sequence. |
05:41 | Also notice that the parse function is used to read sequence data as Sequence record objects. |
05:48 | It is generally used with a for loop. |
05:52 | It accepts two arguments: the first is the name of the file to read the data from. |
05:59 | The second specifies the file format. |
06:02 | Press Enter key twice to get the output. |
06:07 | The output shows the identifier line, followed by the sequence contained in the file, also the length of the sequence for all the records in the file. |
06:21 | Notice that the FASTA format does not specify the alphabet. |
06:26 | So, the output does not specify it as a DNA sequence. |
06:31 | The same steps can be repeated for parsing GenBank file. |
06:36 | For demonstration, we will use the GenBank file which we downloaded earlier from the database. |
06:43 | Press up-arrow key to get the lines of code which we had used earlier. |
06:49 | Change the file name to sequence.gb. |
06:53 | Change the file format to genbank. |
06:56 | The rest of the code remains the same. |
06:58 | Press Enter key twice to get the output. |
07:03 | Here too the output shows the record id, sequence and the length of the sequence for all the records in the file. |
07:12 | Notice that the GenBank format specifies the sequence as DNA sequence. |
07:19 | Similarly, Swiss-Prot and EMBL files can be parsed using the same code as above. |
07:27 | If your file contains a single record then type the following lines for parsing. |
07:34 | Here, we will use the previously saved FASTA file with a single record, that is, insulin.fasta as an example. |
07:43 | Notice that we have used read function instead of parse function. Press Enter. |
07:50 | The output shows the contents for the file insulin.fasta. |
07:55 | It shows the sequence as a sequence record object. |
07:59 | And other attributes such as GI, accession number and description. |
08:06 | We can also view the individual attributes of this record as follows. |
08:11 | At the prompt, type: record dot seq. Press Enter. |
08:18 | The output shows the sequence present in the file. |
08:22 | To view the identifiers for this record, type: record dot id. Press Enter. |
08:29 | The output shows the GI number and accession number etc. |
08:34 | You can use the function described above to parse the data files of your choice. |
08:40 | Now, let's summarize. |
08:42 | In this tutorial, we have learnt to download FASTA and GenBank files from the NCBI database website and to use the parse and read functions from the SeqIO module |
08:55 | to extract data such as record ids, description and sequences from FASTA and GenBank files. |
09:03 | Now, for the assignment: |
09:06 | Download FASTA files for nucleotide sequence of your choice from NCBI database. |
09:13 | Convert the file of sequences to their reverse complements. |
09:17 | Your completed assignment should have the following lines of code. |
09:22 | Use parse function to load nucleotide sequences from the FASTA file. |
09:28 | Next, print reverse complements using the Sequence object’s built-in reverse complement method. |
09:37 | Video at the following link summarizes the spoken-tutorial project. |
09:42 | Please download and watch it. |
09:44 | The Spoken Tutorial Project team conducts workshops and gives certificates to those who pass an on-line test. |
09:51 | For more details, please write to us. |
09:55 | The Spoken Tutorial Project is funded by NMEICT, MHRD, Government of India. |
10:01 | More information on this mission is available at the link shown. |
10:06 | This is Snehalatha from IIT Bombay, signing off. Thank you for joining. |
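
The commands typed at the prompt are shown on screen in the video and are not reproduced in this transcript. The following is a minimal sketch of what the parse, read and reverse complement steps could look like, not the narrator's exact code. It assumes a current Biopython installation and Python 3 print syntax (the tutorial itself was recorded with Python 2.7.8 and Biopython 1.64). The record names and sequences are made up for illustration, and in-memory text handles stand in for the files sequence.fasta and insulin.fasta; the helper names `summarize` and `reverse_complements` are hypothetical.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical two-record FASTA text, standing in for sequence.fasta.
FASTA_TEXT = """>demo1 made-up record one
ATGCCC
>demo2 made-up record two
GGGTTTAAAC
"""


def summarize(handle, file_format):
    """Return (id, sequence, length) for every record, using SeqIO.parse."""
    return [
        (record.id, str(record.seq), len(record.seq))
        for record in SeqIO.parse(handle, file_format)
    ]


def reverse_complements(handle, file_format):
    """Return (id, reverse complement) for every record (the assignment)."""
    return [
        (record.id, str(record.seq.reverse_complement()))
        for record in SeqIO.parse(handle, file_format)
    ]


# parse: loop over a file that holds many records.
for rec_id, seq, length in summarize(StringIO(FASTA_TEXT), "fasta"):
    print(rec_id, seq, length)

# read: for a file that holds exactly one record.
single = SeqIO.read(StringIO(">demo1 made-up record one\nATGCCC\n"), "fasta")
print(single.id)
print(single.seq)

# Assignment: print the reverse complement of each record.
for rec_id, rc in reverse_complements(StringIO(FASTA_TEXT), "fasta"):
    print(rec_id, rc)
```

With real files, the first argument would be a file name such as "sequence.fasta", and the second argument would change to "genbank" for sequence.gb. SeqIO.read raises an error when the file holds zero or more than one record, which is why parse is used for the multi-record file.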