Biopython/C2/Parsing-Data/English-timed

| Time | Narration |
|---|---|
| 00:01 | Hello everyone. Welcome to this tutorial on Parsing Data. |
| 00:06 | In this tutorial, we will learn to download FASTA and GenBank files from the NCBI database website |
| 00:14 | and parse data files using functions in the Sequence Input/Output (SeqIO) module. |
| 00:19 | To follow this tutorial, you should be familiar with undergraduate biochemistry or bioinformatics |
| 00:26 | and basic Python programming. |
| 00:30 | Refer to the Python tutorials at the given link. |
| 00:34 | To record this tutorial, I am using: Ubuntu OS version 14.10 |
| 00:40 | Python version 2.7.8 |
| 00:44 | IPython interpreter version 2.3.0 |
| 00:48 | Biopython version 1.64 and Mozilla Firefox browser 35.0. |
| 00:56 | Scientific data in biology is generally stored in text files such as FASTA, GenBank, EMBL, Swiss-Prot etc. |
| 01:07 | Data files can be downloaded from the database websites. |
| 01:12 | Open the website link given below, in any web browser. |
| 01:17 | A web-page opens. |
| 01:19 | Let us download FASTA and GenBank files for the human insulin gene. |
| 01:25 | In the search box, type "human insulin" and click on the Search button. |
| 01:31 | The web-page shows many files for the human insulin gene. |
| 01:35 | For demonstration, I will select 4 files with the name “Homo sapiens Insulin mRNA”. |
| 01:43 | I will choose files that have less than 500 base pairs. |
| 01:48 | Click on the check-box to select the file, to download. |
| 01:56 | Bring the cursor to the “Send to” option, located at the top right corner of the page. |
| 02:02 | Click on the small selection button with a down arrow, present next to the “Send to” button. |
| 02:09 | Under the heading “Choose destination”, click on the File option. |
| 02:13 | You can save this file in any of the file formats listed in the Format drop-down list box. |
| 02:21 | Choose FASTA from the given options. |
| 02:25 | Then click on the Create file option. |
| 02:29 | A dialog-box appears on the screen. |
| 02:32 | Select Open with and click on OK. |
| 02:36 | A file opens in a text editor. |
| 02:39 | The file shows 4 records, since we had selected four files to download. |
| 02:46 | The first line in each record is an identifier line. |
| 02:50 | It starts with a “greater than (>)” symbol. |
| 02:53 | This is followed by a sequence. |
| 02:56 | Save the file in your home folder as “sequence.fasta”. |
| 03:01 | Close the text editor. |
| 03:03 | Follow the same steps as above, to download the files in GenBank format |
| 03:08 | for the same files selected earlier. |
| 03:12 | Select the file format as GenBank. |
| 03:16 | Create a file. Open with a text editor. |
| 03:21 | Notice that the sequence file in GenBank format has more features than a FASTA file. |
| 03:27 | Save the file as "sequence.gb" in your home folder. Close the text editor. |
| 03:34 | For demonstration purposes, we need a FASTA file with a single record. |
| 03:39 | For this, clear the earlier selection by again clicking on the check boxes. |
| 03:48 | Now, select the file “Human insulin gene complete cds”. |
| 03:54 | Click on the check-box. |
| 03:57 | And follow the same steps shown earlier to save the file in the home folder. |
| 04:01 | Save the file as "insulin.fasta". |
| 04:08 | Biological data stored in these files can be extracted and modified using Biopython libraries. |
| 04:16 | Close the text-editor. |
| 04:19 | Extracting data from data files is called parsing. |
| 04:23 | Most file formats can be parsed using functions available in the SeqIO module. |
| 04:30 | The most commonly used functions of the SeqIO module are: parse, read, write and convert. |
| 04:38 | Open the terminal by pressing Ctrl, Alt and t keys simultaneously. |
| 04:44 | Start IPython by typing "ipython" at the prompt. Press Enter. |
| 04:51 | Next, import the "SeqIO" module from the Bio package. |
| 04:56 | At the prompt, type: from Bio import SeqIO. Press Enter. |
| 05:04 | We will start with the most important function “parse”. |
| 05:07 | For demonstration, I will use a FASTA file that has many records which we had downloaded earlier from the database. |
| 05:17 | For simple FASTA parsing, type the following at the prompt. |
| 05:22 | Here, we are using the parse function to read the contents of the sequence.fasta file. |
| 05:30 | For the output, print the record id, the sequence present in the record and also the length of the sequence. |
| 05:41 | Also notice that the parse function is used to read sequence data as Sequence record objects. |
| 05:48 | It is generally used with a for loop. |
| 05:52 | It takes two arguments: the first is the name of the file to read the data from. |
| 05:59 | The second specifies the file format. |
| 06:02 | Press Enter key twice to get the output. |
| 06:07 | The output shows the identifier line, followed by the sequence contained in the file, also the length of the sequence for all the records in the file. |
| 06:21 | Notice that the FASTA format does not specify the alphabet. |
| 06:26 | So, the output does not specify it as a DNA sequence. |
| 06:31 | The same steps can be repeated for parsing GenBank file. |
| 06:36 | For demonstration, we will use the GenBank file which we downloaded earlier from the database. |
| 06:43 | Press up-arrow key to get the lines of code which we had used earlier. |
| 06:49 | Change the file name to sequence.gb . |
| 06:53 | Change the file format to genbank. |
| 06:56 | The rest of the code remains the same. |
| 06:58 | Press Enter key twice to get the output. |
| 07:03 | Here too the output shows the record id, sequence and the length of the sequence for all the records in the file. |
| 07:12 | Notice that the GenBank format specifies the sequence as DNA sequence. |
| 07:19 | Similarly, Swiss-Prot and EMBL files can be parsed using the same code as above. |
| 07:27 | If your file contains a single record then type the following lines for parsing. |
| 07:34 | Here, we will use the previously saved FASTA file with a single record, that is, insulin.fasta as an example. |
| 07:43 | Notice that we have used read function instead of parse function. Press Enter. |
| 07:50 | The output shows the contents for the file insulin.fasta. |
| 07:55 | It shows the sequence as a sequence record object. |
| 07:59 | And other attributes such as GI, accession number and description. |
| 08:06 | We can also view the individual attributes of this record as follows. |
| 08:11 | At the prompt, type: record dot seq. Press Enter. |
| 08:18 | The output shows the sequence present in the file. |
| 08:22 | To view the identifiers for this record, type: record dot id. Press Enter. |
| 08:29 | The output shows the GI number and accession number etc. |
| 08:34 | You can use the function described above to parse the data files of your choice. |
| 08:40 | Now, let's summarize. |
| 08:42 | In this tutorial, we have learnt to download FASTA and GenBank files from the NCBI database website and to use the parse and read functions from the SeqIO module |
| 08:55 | to extract data such as record ids, description and sequences from FASTA and GenBank files. |
| 09:03 | Now, for the assignment: |
| 09:06 | Download FASTA files for a nucleotide sequence of your choice from the NCBI database. |
| 09:13 | Convert the file of sequences to their reverse complements. |
| 09:17 | Your completed assignment should have the following lines of code. |
| 09:22 | Use parse function to load nucleotide sequences from the FASTA file. |
| 09:28 | Next, print reverse complements using the Sequence object’s built-in reverse complement method. |
| 09:37 | The video at the following link summarizes the Spoken Tutorial project. |
| 09:42 | Please download and watch it. |
| 09:44 | The Spoken Tutorial Project team conducts workshops and gives certificates to those who pass an on-line test. |
| 09:51 | For more details, please write to us. |
| 09:55 | The Spoken Tutorial Project is funded by NMEICT, MHRD, Government of India. |
| 10:01 | More information on this mission is available at the link shown. |
| 10:06 | This is Snehalatha from IIT Bombay, signing off. Thank you for joining. |
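
The code typed at the prompt at 05:17 and 07:27 appears only on screen and is not part of this transcript. The sketch below is one possible reconstruction of those two steps, not the exact code from the video. It assumes Python 3 and a recent Biopython release rather than the versions used in the recording. The FASTA text, the record names `demo1` to `demo3` and the helper `summarize` are hypothetical, used here so the example runs without downloading any files.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical two-record FASTA text standing in for "sequence.fasta".
FASTA_TEXT = """>demo1 hypothetical record one
ATGGCCCTGTGGATGCGC
>demo2 hypothetical record two
CTCCTGCCCCTGCTG
"""


def summarize(handle, file_format="fasta"):
    """Return (id, sequence, length) for every record read from the handle."""
    # SeqIO.parse yields one sequence record object per record,
    # which is why it is normally used inside a for loop.
    return [(rec.id, str(rec.seq), len(rec))
            for rec in SeqIO.parse(handle, file_format)]


records = summarize(StringIO(FASTA_TEXT))
for rec_id, sequence, length in records:
    print(rec_id, sequence, length)

# SeqIO.read is meant for input that holds exactly one record;
# it raises ValueError when it finds none or more than one.
single = SeqIO.read(
    StringIO(">demo3 hypothetical single record\nATGGCC\n"), "fasta")
print(single.id)
print(single.seq)
print(single.description)
```

With the files saved in the tutorial, the in-memory handle can be replaced by a file name, for example `SeqIO.parse("sequence.fasta", "fasta")`, `SeqIO.parse("sequence.gb", "genbank")` or `SeqIO.read("insulin.fasta", "fasta")`.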