Biopython/C2/Writing-Sequence-Files/English-timed
Revision as of 13:19, 10 March 2017 by PoojaMoolya (Talk | contribs)
|
|
---|---|
00:01 | Hello everyone. Welcome to this tutorial on Writing Sequence Files. |
00:07 | In this tutorial, we will learn how to: Create Sequence Record Objects |
00:13 | Write sequence files |
00:15 | Convert between file formats |
00:19 | And, sort records in a file by length. |
00:23 | To follow this tutorial, you should be familiar with |
00:27 | undergraduate Biochemistry or Bioinformatics |
00:31 | and basic Python programming. |
00:34 | Refer to the Python tutorials at the given link. |
00:38 | To record this tutorial, I am using: Ubuntu OS version 14.10 |
00:45 | Python version 2.7.8 |
00:48 | IPython interpreter version 2.3.0 and Biopython version 1.64. |
00:55 | Earlier, we learnt to use the parse and read functions to read the contents of a file. |
01:03 | In this tutorial, we will learn how to use the write function to write sequences to a file. |
01:09 | And, use the convert function to inter-convert between various file formats. |
01:16 | Let me now demonstrate how to use write function. |
01:20 | Here is a text file with a protein sequence. |
01:24 | The sequence shown here is insulin protein. |
01:28 | The file also has information such as the GI accession number and a description. |
01:36 | We will now create a file for this sequence in FASTA format. |
01:41 | The first step is to create a sequence record object. |
01:45 | More information about Sequence Record Objects: |
01:49 | It is the basic data type for the sequence input/output interface. |
01:55 | In a sequence record object, a sequence is associated with higher level features such as identifiers and descriptions. |
02:04 | Open the terminal by pressing Ctrl, Alt and t keys simultaneously. |
02:10 | At the prompt, type: ipython, press Enter. |
02:15 | At the prompt, type the following lines: |
02:18 | From the Bio dot Seq module, import the Seq class. |
02:24 | From the Bio dot SeqRecord module, import the SeqRecord class. |
02:31 | Next, from the Bio dot Alphabet module, import the generic_protein alphabet. |
02:38 | Next, I will save the sequence record object in a variable record1. |
02:45 | Copy the sequence, id and description from the text file and paste it in the respective lines on the terminal. |
02:56 | Press Enter. |
02:58 | To view the output, type: record1. |
03:02 | Press Enter. |
03:04 | The output shows the insulin protein sequence as sequence record object. |
03:10 | It shows the sequence along with id and description. |
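The steps above can be sketched as follows. This is a minimal illustration, not the exact code typed in the recording: the sequence, id and description are placeholders rather than the insulin entry from the text file. It also omits the generic_protein argument, because the Bio.Alphabet module used with Biopython 1.64 was removed in Biopython 1.78, and the sketch assumes a recent Biopython on Python 3.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Placeholder values: replace them with the sequence, id and description
# copied from the text file used in the tutorial.
record1 = SeqRecord(
    Seq("MKTAYIAKQR"),
    id="demo_0001",
    description="placeholder protein sequence",
)

# Printing the record shows the id, the description and the sequence.
print(record1)
```

With Biopython 1.64, as used in the recording, the Seq call would additionally take generic_protein as its second argument.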
03:13 | We will use the write function to save the above sequence record object as a FASTA file. |
03:21 | Import the SeqIO module from the Bio package. |
03:26 | Next, type the command that uses the write function to save the sequence record object as a FASTA file. |
03:40 | The write function takes 3 arguments. |
03:44 | The first one is the variable storing the sequence record object. |
03:49 | The second is the name of the FASTA file to write. |
03:54 | The third is the file format to write. Press Enter. |
03:58 | The output shows 1, that is, we have written one sequence record object to a FASTA file. |
04:07 | The file in FASTA format is saved in the home folder as "example.fasta". |
04:13 | Let me warn you, the output will overwrite any pre-existing file of the same name. |
04:18 | To view the file, navigate to the file in the home folder. |
04:24 | Open this file in a text editor. |
04:27 | The protein sequence is now in FASTA format. |
04:31 | Close the text editor. |
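A minimal sketch of the write step follows. The record holds placeholder values, and the file is written to a temporary folder instead of the home folder so that nothing existing is overwritten; both choices are assumptions made for illustration.

```python
import os
import tempfile

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Placeholder record standing in for the insulin record of the tutorial.
record1 = SeqRecord(
    Seq("MKTAYIAKQR"),
    id="demo_0001",
    description="placeholder protein sequence",
)

# Write into a temporary folder so that no existing file is overwritten.
out_path = os.path.join(tempfile.mkdtemp(), "example.fasta")

# Arguments: the record, the file name, the file format.
count = SeqIO.write(record1, out_path, "fasta")
print(count)  # number of records written

# Read the file back to inspect the FASTA text.
with open(out_path) as handle:
    fasta_text = handle.read()
print(fasta_text)
```

The first line of the file starts with ">" followed by the id and description, and the sequence follows on the next line.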
04:33 | Many bioinformatics tools take different input file formats. |
04:38 | So, sometimes there is a need to inter-convert between sequence file formats. |
04:44 | We can do file conversions using the convert function in the SeqIO module. |
04:50 | For demonstration, I will convert a GenBank file to a FASTA file. |
04:55 | I have a GenBank file in my home folder. |
04:59 | Let me open this in a text editor. |
05:02 | The file contains the HIV genome in GenBank format. |
05:07 | This GenBank file has descriptions of all the genes in the genome, in the first part of the file. |
05:14 | It is followed by a complete genome sequence. |
05:18 | Close the text editor. |
05:19 | Type the following lines on the terminal. |
05:23 | Here the convert function converts the complete genome sequence present in the GenBank file to a FASTA file. Press Enter. |
05:33 | The new file in the FASTA format is now saved as HIV.fasta in the home folder. |
05:39 | Navigate to the file and open in the text editor. |
05:46 | Close the text editor. |
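The conversion can be sketched as below. To keep the example self-contained, it first writes a tiny stand-in GenBank file with made-up data in a temporary folder; in the tutorial the input is the HIV GenBank file in the home folder. Setting the molecule_type annotation assumes Biopython 1.78 or later, where the GenBank writer needs it.

```python
import os
import tempfile

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

work_dir = tempfile.mkdtemp()
gb_path = os.path.join(work_dir, "demo.gb")
fasta_path = os.path.join(work_dir, "demo.fasta")

# Stand-in GenBank file with hypothetical data (not the HIV genome).
demo = SeqRecord(
    Seq("ATGGCCATTGTAATGGGCCGCTGA"),
    id="DEMO0001.1",
    name="DEMO0001",
    description="placeholder DNA record",
)
demo.annotations["molecule_type"] = "DNA"  # needed by the GenBank writer
SeqIO.write(demo, gb_path, "genbank")

# Arguments: input file, input format, output file, output format.
count = SeqIO.convert(gb_path, "genbank", fasta_path, "fasta")
print(count)  # number of records converted

converted = SeqIO.read(fasta_path, "fasta")
print(converted.id, len(converted))
```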
05:49 | Even though we can convert file formats easily using the convert function, it has limitations. |
05:56 | Writing some formats requires information which other file formats don’t contain. |
06:02 | For example: We can convert a GenBank file to a FASTA file, but we can't do the reverse. |
06:09 | Similarly, we can turn a FASTQ file into a FASTA file but can’t do the reverse. |
06:15 | For more information regarding the convert function, type the help command. |
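The FASTQ case can be tried with a short sketch. The single read below is made-up data, and the file names are hypothetical. The forward conversion works, while the reverse one is expected to fail because a FASTA file carries no quality scores for the FASTQ writer to use.

```python
import os
import tempfile

from Bio import SeqIO

work_dir = tempfile.mkdtemp()
fastq_path = os.path.join(work_dir, "reads.fastq")
fasta_path = os.path.join(work_dir, "reads.fasta")
back_path = os.path.join(work_dir, "reads_back.fastq")

# One made-up read: header, bases, separator, quality string.
with open(fastq_path, "w") as handle:
    handle.write("@read1\nACGTACGT\n+\nIIIIIIII\n")

# FASTQ to FASTA: the quality scores are simply dropped.
count = SeqIO.convert(fastq_path, "fastq", fasta_path, "fasta")
print(count)

with open(fasta_path) as handle:
    fasta_text = handle.read()

# FASTA to FASTQ: there are no quality scores to write.
try:
    SeqIO.convert(fasta_path, "fasta", back_path, "fastq")
    reverse_ok = True
except ValueError as err:
    reverse_ok = False
    print("Reverse conversion failed:", err)
```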
06:21 | Press Enter. |
06:24 | Press 'q' on the keyboard to get back to the prompt. |
06:28 | We can also extract individual genes from the HIV genome in GenBank format. |
06:35 | These individual genes can be saved in FASTA or any other formats. |
06:41 | For this, type the following code at the prompt. |
06:47 | This code will write all the individual CDS gene sequences, along with their ids and gene names, to a file. |
06:56 | The file is saved as “HIV_geneseq.fasta” in your home folder. Press Enter. |
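One way to extract CDS features is sketched below. It is not necessarily the code shown in the recording. For a self-contained example, the genome is a small in-memory record with two made-up CDS features named geneA and geneB; with the tutorial's file, the record would instead be read from the GenBank file with the read function.

```python
import os
import tempfile

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Hypothetical stand-in for the genome record read from a GenBank file.
genome = SeqRecord(
    Seq("AAATGGCCTAACCCATGAAATGATT"),
    id="DEMO0001.1",
    name="DEMO0001",
    description="placeholder genome",
)
genome.features = [
    SeqFeature(FeatureLocation(2, 11, strand=1), type="CDS",
               qualifiers={"gene": ["geneA"]}),
    SeqFeature(FeatureLocation(14, 23, strand=1), type="CDS",
               qualifiers={"gene": ["geneB"]}),
]

gene_records = []
for feature in genome.features:
    if feature.type != "CDS":
        continue  # skip everything that is not a coding sequence
    gene_name = feature.qualifiers.get("gene", ["unknown"])[0]
    gene_seq = feature.extract(genome.seq)  # sequence covered by the feature
    gene_records.append(
        SeqRecord(gene_seq, id=genome.id, description=gene_name)
    )

out_path = os.path.join(tempfile.mkdtemp(), "demo_geneseq.fasta")
count = SeqIO.write(gene_records, out_path, "fasta")
print(count)  # number of gene records written
```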
07:07 | Using Biopython tools, we can sort the records in a file by length. |
07:12 | Here, I have opened a FASTA file “hemoglobin.fasta” which has six records. |
07:19 | Each record is of a different length. |
07:23 | Type the following lines to arrange the longest record first. |
07:27 | The new file with the sorted sequences will be saved as "sorted_hemoglobin.fasta" in your home folder. |
07:38 | For short records first, reverse the arguments in the records.sort command line. |
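Sorting by length can be sketched as follows. The three records are made-up stand-ins for the six records of "hemoglobin.fasta", and the sketch uses the key and reverse arguments of the list sort method, which work on Python 3; the command typed in the recording on Python 2.7 may look different.

```python
import os
import tempfile

from Bio import SeqIO

work_dir = tempfile.mkdtemp()
in_path = os.path.join(work_dir, "demo_hemoglobin.fasta")
out_path = os.path.join(work_dir, "sorted_demo_hemoglobin.fasta")

# Three made-up records of different lengths.
with open(in_path, "w") as handle:
    handle.write(">hb1\nMVLS\n>hb2\nMVLSPADKTN\n>hb3\nMVLSPAD\n")

# Load all records into a list so that they can be sorted in place.
records = list(SeqIO.parse(in_path, "fasta"))

# Longest record first; len() of a record is the length of its sequence.
records.sort(key=len, reverse=True)

count = SeqIO.write(records, out_path, "fasta")
print([record.id for record in records])
```

In this form, leaving out reverse=True puts the shortest record first.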
07:45 | Let's summarize. In this tutorial, we have learnt to: Create Sequence Record Objects |
07:51 | Write sequence files using the write function of the Sequence Input/Output module. |
07:58 | Convert between sequence file formats using the convert function. |
08:03 | And, sort records in a file by length. |
08:07 | For the assignment: |
08:09 | Extract the gene "HIV1gp3" at positions 4587 to 5165 from the genomic sequence of HIV. |
08:21 | The file “HIV.gb” is included in the code files of this tutorial. |
08:28 | Your completed assignment will have the following code. |
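A point worth checking in the assignment is the numbering: positions in a GenBank file count from 1 and include the end position, while Python slices count from 0 and exclude the end. The sketch below shows the adjustment on a toy sequence; slice_by_position is a hypothetical helper written for this illustration, not a Biopython function, and the code shown in the recording may differ.

```python
from Bio.Seq import Seq


def slice_by_position(seq, start, end):
    """Return seq from 1-based position start through end, inclusive."""
    return seq[start - 1:end]


# Toy sequence, not the HIV genome.
toy = Seq("AACCGGTTAC")
fragment = slice_by_position(toy, 3, 6)
print(fragment)  # CCGG
```

Applied to positions 4587 to 5165, the same adjustment gives a fragment of 579 bases.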
08:43 | The video at the following link summarizes the Spoken Tutorial project. |
08:48 | Please download and watch it. The Spoken Tutorial Project team conducts workshops and gives certificates for those who pass an online test. |
08:57 | For more details, please write to us. |
09:00 | The Spoken Tutorial Project is funded by NMEICT, MHRD, Government of India. |
09:06 | More information on this mission is available at the link shown. |
09:10 | This is Snehalatha from IIT Bombay, signing off. Thank you for joining. |