Biopython/C2/Writing-Sequence-Files/English-timed
| Time | Narration |
| 00:01 | Hello everyone. Welcome to this tutorial on Writing Sequence Files. |
| 00:07 | In this tutorial, we will learn how to create Sequence Record objects, |
| 00:13 | write sequence files, |
| 00:15 | convert between file formats |
| 00:19 | and sort records in a file by length. |
| 00:23 | To follow this tutorial, you should be familiar with |
| 00:27 | undergraduate Biochemistry or Bioinformatics |
| 00:31 | and basic Python programming. |
| 00:34 | Refer to the Python tutorials at the given link. |
| 00:38 | To record this tutorial, I am using Ubuntu OS version 14.10, |
| 00:45 | Python version 2.7.8, |
| 00:48 | IPython interpreter version 2.3.0 and Biopython version 1.64. |
| 00:55 | We have earlier learnt about the parse and read functions, which read the contents of a file. |
| 01:03 | In this tutorial, we will learn how to use the write function to write sequences to a file. |
| 01:09 | We will also use the convert function for inter-conversion between various file formats. |
| 01:16 | Let me now demonstrate how to use the write function. |
| 01:20 | Here is a text file with a protein sequence. |
| 01:24 | The sequence shown here is insulin protein. |
| 01:28 | The file also has information such as GI accession number and also description. |
| 01:36 | We will now create a file for this sequence in FASTA format. |
| 01:41 | The first step is to create a sequence record object. |
| 01:45 | More information about Sequence Record Objects: |
| 01:49 | It is the basic data type for the sequence input/output interface. |
| 01:55 | In a sequence record object, a sequence is associated with higher-level features such as identifiers and descriptions. |
| 02:04 | Open the terminal by pressing the Ctrl, Alt and T keys simultaneously. |
| 02:10 | At the prompt, type: ipython, press Enter. |
| 02:15 | At the prompt, type the following lines: |
| 02:18 | from Bio dot Seq module import Seq class. |
| 02:24 | from Bio dot SeqRecord module import Sequence Record class |
| 02:31 | Next, from Bio dot Alphabet module import generic protein class. |
| 02:38 | Next, I will save the sequence record object in a variable record1. |
| 02:45 | Copy the sequence, id and description from the text file and paste it in the respective lines on the terminal. |
| 02:56 | Press Enter. |
| 02:58 | To view the output, type: record1. |
| 03:02 | Press Enter. |
| 03:04 | The output shows the insulin protein sequence as a sequence record object. |
| 03:10 | It shows the sequence along with its id and description. |
| 03:13 | We will use the write function to write the above sequence record object to a FASTA file. |
| 03:21 | Import SeqIO module from Bio package. |
| 03:26 | Next, type the command with the write function to write the sequence record object to a FASTA file. |
| 03:40 | The write function takes 3 arguments. |
| 03:44 | The first one is the variable storing the sequence record object. |
| 03:49 | The second is the file name to write the FASTA file. |
| 03:54 | The third is the file format to write. Press Enter. |
| 03:58 | The output shows 1, that is, one sequence record object has been written to a FASTA file. |
| 04:07 | The file in FASTA format is saved in the home folder as "example.fasta". |
| 04:13 | Let me warn you, the output will overwrite any pre-existing file of the same name. |
| 04:18 | To view the file, navigate to the file in the home folder. |
| 04:24 | Open this file in a text editor. |
| 04:27 | The protein sequence is now in FASTA format. |
| 04:31 | Close the text editor. |
| 04:33 | Many bioinformatics tools take different input file formats. |
| 04:38 | So, sometimes there is a need to inter-convert between sequence file formats. |
| 04:44 | We can do file conversions using the convert function in the SeqIO module. |
| 04:50 | For demonstration, I will convert a GenBank file to a FASTA file. |
| 04:55 | I have a GenBank file in my home folder. |
| 04:59 | Let me open this in a text editor. |
| 05:02 | The file contains the HIV genome in GenBank format. |
| 05:07 | The first part of this GenBank file has descriptions of all the genes in the genome. |
| 05:14 | It is followed by the complete genome sequence. |
| 05:18 | Close the text editor. Type the following lines on the terminal. |
| 05:23 | Here, the convert function converts the complete genome sequence present in the GenBank file to a FASTA file. Press Enter. |
| 05:33 | The new file in FASTA format is now saved as HIV.fasta in the home folder. |
| 05:39 | Navigate to the file and open it in the text editor. |
| 05:46 | Close the text editor. |
| 05:49 | Even though we can convert file formats easily using the convert function, it has limitations. |
| 05:56 | Writing some formats requires information which other file formats don’t contain. |
| 06:02 | For example, we can convert a GenBank file to a FASTA file, but we can't do the reverse. |
| 06:09 | Similarly, we can turn a FASTQ file into a FASTA file but can’t do the reverse. |
| 06:15 | For more information regarding the convert function, type the help command. |
| 06:21 | Press Enter. |
| 06:24 | Press 'q' on the keyboard to get back to the prompt. |
| 06:28 | We can also extract individual genes from the HIV genome in GenBank format. |
| 06:35 | These individual genes can be saved in FASTA or any other format. |
| 06:41 | For this, type the following code at the prompt. |
| 06:47 | This code will write all the individual CDS gene sequences, along with their ids and gene names, to a file. |
| 06:56 | The file is saved as “HIV_geneseq.fasta” in your home folder. Press Enter. |
| 07:07 | Using Biopython tools, we can sort the records in a file by length. |
| 07:12 | Here, I have opened a FASTA file “hemoglobin.fasta” which has six records. |
| 07:19 | Each record is of a different length. |
| 07:23 | Type the following lines to arrange the longest record first. |
| 07:27 | The new file with the sorted sequences will be saved as "sorted_hemoglobin.fasta" in your home folder. |
| 07:38 | To arrange the shortest records first, reverse the arguments in the records.sort command line. |
| 07:45 | Let's summarize. In this tutorial, we have learnt to create Sequence Record objects, |
| 07:51 | write sequence files using the write function of the Sequence Input/Output module, |
| 07:58 | convert between sequence file formats using the convert function |
| 08:03 | and sort records in a file by length. |
| 08:07 | For the assignment: |
| 08:09 | Extract the gene "HIV1gp3" at positions 4587 to 5165 from the genomic sequence of HIV. |
| 08:21 | The file “HIV.gb” is included in the code files of this tutorial. |
| 08:28 | Your completed assignment will have the following code. |
| 08:43 | The video at the following link summarizes the Spoken Tutorial project. |
| 08:48 | Please download and watch it. The Spoken Tutorial Project team conducts workshops and gives certificates to those who pass an online test. |
| 08:57 | For more details, please write to us. |
| 09:00 | The Spoken Tutorial Project is funded by NMEICT, MHRD, Government of India. |
| 09:06 | More information on this mission is available at the link shown. |
| 09:10 | This is Snehalatha from IIT Bombay, signing off. Thank you for joining. |
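
The main steps narrated above (creating sequence record objects, writing them with the write function, converting with the convert function, and sorting records by length) can be sketched as follows. This is a minimal sketch, not the code typed in the recording. The three peptide strings, the ids, the descriptions and the temporary folder are hypothetical placeholders, and the "tab" output format stands in for the GenBank-to-FASTA conversion because no GenBank file is assumed to be available here. The sketch also leaves out the Bio.Alphabet import mentioned in the narration, on the assumption of a recent Biopython release in which that module is no longer available, and it sorts with key=len rather than a comparison function so that it also runs on Python 3.

```python
import os
import tempfile

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical placeholder records; these are NOT the insulin or
# hemoglobin sequences used in the recording.
records = [
    SeqRecord(Seq("MALWMRLLPL"), id="demo1",
              description="short placeholder peptide"),
    SeqRecord(Seq("MVLSPADKTNVKAAWGKV"), id="demo2",
              description="long placeholder peptide"),
    SeqRecord(Seq("MVHLTPEEKSAV"), id="demo3",
              description="medium placeholder peptide"),
]

# Work in a temporary folder so that no existing file is overwritten.
workdir = tempfile.mkdtemp()
fasta_path = os.path.join(workdir, "example.fasta")
tab_path = os.path.join(workdir, "example.tab")
sorted_path = os.path.join(workdir, "sorted_example.fasta")

# write(records, file name, format) returns the number of records written.
written = SeqIO.write(records, fasta_path, "fasta")

# convert(input file, input format, output file, output format)
# returns the number of records converted.
converted = SeqIO.convert(fasta_path, "fasta", tab_path, "tab")

# Read the records back, sort them longest first, and write a new file.
loaded = list(SeqIO.parse(fasta_path, "fasta"))
loaded.sort(key=len, reverse=True)
SeqIO.write(loaded, sorted_path, "fasta")

# Collect the ids in their new order to check the sort.
sorted_ids = [rec.id for rec in SeqIO.parse(sorted_path, "fasta")]
print(written, converted, sorted_ids)
```

Dropping reverse=True from the sort call would place the shortest record first instead.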