Biopython/C2/Writing-Sequence-Files/English-timed

From Script | Spoken-Tutorial
Revision as of 23:55, 2 August 2016 by Sandhya.np14 (Talk | contribs)

Time
Narration
00:01 Hello everyone.
00:02 Welcome to this tutorial on Writing Sequence Files.
00:07 In this tutorial, we will learn: * How to create Sequence Record Objects
00:13 * Write sequence files
00:15 * Convert between file formats
00:19 * And, sort records in a file by length.
00:23 To follow this tutorial, you should be familiar with
00:27 undergraduate Biochemistry or Bioinformatics
00:31 and basic Python programming.
00:34 Refer to the Python tutorials at the given link.
00:38 To record this tutorial, I am using: * Ubuntu OS version 14.10
00:45 * Python version 2.7.8
00:48 * IPython interpreter version 2.3.0 and * Biopython version 1.64.
00:55 We have earlier learnt about parse and read functions to read contents of a file.
01:03 In this tutorial, we will learn how to use write function to write sequences to a file.
01:09 And, use the convert function for inter-conversion between various file formats.
01:16 Let me now demonstrate how to use write function.
01:20 Here is a text file with a protein sequence.
01:24 The sequence shown here is insulin protein.
01:28 The file also has information such as GI accession number and also description.
01:36 We will now create a file for this sequence in FASTA format.
01:41 The first step is to create sequence record object.
01:45 More information about Sequence Record Objects:
01:49 It is the basic data type for the sequence input/output interface.
01:55 In sequence record object, a sequence is associated with higher level features such as identifiers and descriptions.
02:04 Open the terminal by pressing Ctrl, Alt and t keys simultaneously.
02:10 At the prompt, type: ipython, press Enter.
02:15 At the prompt, type the following lines:
02:18 from Bio dot Seq module import Seq class.
02:24 from Bio dot SeqRecord module import SeqRecord class
02:31 Next from Bio dot Alphabet module import generic_protein alphabet
02:38 Next, I will save the sequence record object in a variable record1.
02:45 Copy the sequence, id and description from the text file and paste in the respective lines on the terminal.
02:56 Press Enter.
02:58 To view the output, type: record1.
03:02 Press Enter.
03:04 The output shows the insulin protein sequence as sequence record object.
03:10 It shows the sequence along with id and description.
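The lines typed on the terminal are not part of this narration. A minimal sketch of this step could look like the code below. The sequence fragment, the id "example_id" and the description are placeholders, not the values from the text file. The alphabet argument exists only in older Biopython releases such as the 1.64 used here, so the sketch falls back to a plain Seq when Bio.Alphabet is not available.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Placeholder values: the real sequence, id and description come from the text file.
sequence_text = "MALWMRLLPLLALLALWGPDPAAA"  # short illustrative fragment only

try:
    # Older Biopython releases (such as 1.64) accept an alphabet argument.
    from Bio.Alphabet import generic_protein
    sequence = Seq(sequence_text, generic_protein)
except ImportError:
    # Bio.Alphabet is not importable in newer releases; Seq takes only the string.
    sequence = Seq(sequence_text)

# Associate the sequence with an identifier and a description.
record1 = SeqRecord(
    sequence,
    id="example_id",  # hypothetical id
    description="insulin (placeholder description)",
)

print(record1)
```

Typing record1 at the IPython prompt, as done in the tutorial, displays the same object.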
03:13 We will use write function to convert the above sequence record object to a FASTA file.
03:21 Import SeqIO module from Bio package.
03:26 Next, type the command line with a write function to convert the sequence object to FASTA file.
03:40 The write function takes 3 arguments.
03:44 The first one is the variable storing the sequence record object.
03:49 The second is the file name to write the FASTA file.
03:54 The third is the file format to write. Press Enter.
03:58 The output shows “1”, that is, we have converted one sequence record object to a FASTA file.
04:07 The file in FASTA format is saved in the home folder as "example.fasta".
04:13 Let me warn you,
04:14 the output will over-write any pre existing file of the same name.
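The write step can be sketched as follows, assuming a placeholder record in place of the insulin record. The sketch writes to a scratch directory, which is a choice made here so that no existing "example.fasta" is over-written. The tutorial itself writes to the home folder.

```python
import os
import tempfile

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Placeholder record; id and description are hypothetical.
record1 = SeqRecord(
    Seq("MALWMRLLPLLALLALWGPDPAAA"),
    id="example_id",
    description="placeholder description",
)

# Scratch directory, so that no pre-existing example.fasta is over-written.
out_path = os.path.join(tempfile.mkdtemp(), "example.fasta")

# Arguments: the record (or records), the output file name, the output format.
count = SeqIO.write(record1, out_path, "fasta")
print(count)  # number of records written

# Show the contents of the new FASTA file.
with open(out_path) as handle:
    print(handle.read())
```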
04:18 To view the file, navigate to the file in the home folder.
04:24 Open this file in a text editor.
04:27 The protein sequence is now in FASTA format.
04:31 Close the text editor.
04:33 Many bioinformatics tools take different input file formats.
04:38 So, sometimes there is a need to inter-convert between sequence file formats.
04:44 We can do file conversions using convert function in SeqIO module.
04:50 For demonstration, I will convert a GenBank file to a FASTA file.
04:55 I have a GenBank file in my home folder.
04:59 Let me open this in a text editor.
05:02 The file contains HIV genome in GenBank format.
05:07 This GenBank file has descriptions of all the genes in the genome, in the first part of the file.
05:14 It is followed by a complete genome sequence.
05:18 Close the text editor.
05:19 Type the following lines on the terminal.
05:23 Here the convert function converts the complete genome sequence present in the GenBank file to FASTA file. Press Enter.
05:33 The new file in FASTA format is now saved as HIV.fasta in the home folder.
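A minimal sketch of the conversion, assuming a recent Biopython release (1.78 or later). Since the HIV GenBank file cannot be reproduced here, the sketch first builds a small toy GenBank file. The names "toy.gb" and "toy.fasta" and the toy sequence are hypothetical stand-ins for the tutorial's files.

```python
import os
import tempfile

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

workdir = tempfile.mkdtemp()
gb_path = os.path.join(workdir, "toy.gb")        # stand-in for the GenBank file
fasta_path = os.path.join(workdir, "toy.fasta")  # stand-in for HIV.fasta

# Build a toy GenBank file to convert (hypothetical data).
toy = SeqRecord(
    Seq("ATGGCCAAATAAATGCCCGGGTAA"),
    id="TOY0001.1",
    name="TOY0001",
    description="Toy genome for illustration",
)
toy.annotations["molecule_type"] = "DNA"  # required by the GenBank writer in recent releases
SeqIO.write(toy, gb_path, "genbank")

# Arguments: input file, input format, output file, output format.
count = SeqIO.convert(gb_path, "genbank", fasta_path, "fasta")
print(count)  # number of records converted
```

With the tutorial's files, the same call would take the GenBank file name and "HIV.fasta" as the two file arguments.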
05:39 Navigate to the file and open in the text editor.
05:46 Close the text editor.
05:49 Even though we can convert the file formats easily using convert function, it has limitations.
05:56 Writing some formats requires information which other file formats don’t contain.
06:02 For example: We can convert a GenBank file to a FASTA file, but we can't do the reverse.
06:09 Similarly, we can turn a FASTQ file into a FASTA file but can’t do the reverse.
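This limitation can be seen in a small sketch with a toy FASTQ file (hypothetical read name and data). FASTQ to FASTA works because the quality scores are simply dropped. The reverse fails because a FASTA file has no quality scores to write.

```python
import os
import tempfile

from Bio import SeqIO

workdir = tempfile.mkdtemp()
fastq_path = os.path.join(workdir, "toy.fastq")
fasta_path = os.path.join(workdir, "toy.fasta")
back_path = os.path.join(workdir, "back.fastq")

# A single toy read with one quality character per base.
with open(fastq_path, "w") as handle:
    handle.write("@read1\nACGTACGT\n+\nIIIIIIII\n")

# FASTQ -> FASTA: the quality scores are discarded.
converted = SeqIO.convert(fastq_path, "fastq", fasta_path, "fasta")
print(converted)

# FASTA -> FASTQ: fails, because the quality scores are missing.
try:
    SeqIO.convert(fasta_path, "fasta", back_path, "fastq")
    reverse_ok = True
except ValueError as error:
    reverse_ok = False
    print("Cannot write FASTQ:", error)
```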
06:15 For more information regarding convert function, type the help command.
06:21 Press Enter.
06:24 Press 'q' on the keyboard to get back to the prompt.
06:28 We can also extract individual genes from the HIV genome in GenBank format.
06:35 These individual genes can be saved in FASTA or any other formats.
06:41 For this, type the following code at the prompt.
06:47 This code will write all individual CDS gene sequences, their ids and the names of the genes to a file.
06:56 The file is saved as “HIV_geneseq.fasta” in your home folder. Press Enter.
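The code typed here is not part of this narration. One possible sketch of the idea is given below, assuming a recent Biopython release (1.78 or later) and a toy GenBank file with two CDS features in place of the HIV genome. The gene names "toyA" and "toyB" and the file names are hypothetical.

```python
import os
import tempfile

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

workdir = tempfile.mkdtemp()
gb_path = os.path.join(workdir, "toy.gb")
genes_path = os.path.join(workdir, "toy_geneseq.fasta")

# Build a toy GenBank file with two CDS features (stand-in for the HIV genome).
toy = SeqRecord(
    Seq("ATGGCCAAATAAATGCCCGGGTAA"),
    id="TOY0001.1",
    name="TOY0001",
    description="Toy genome",
)
toy.annotations["molecule_type"] = "DNA"
toy.features = [
    SeqFeature(FeatureLocation(0, 12, strand=1), type="CDS",
               qualifiers={"gene": ["toyA"]}),
    SeqFeature(FeatureLocation(12, 24, strand=1), type="CDS",
               qualifiers={"gene": ["toyB"]}),
]
SeqIO.write(toy, gb_path, "genbank")

# Read the genome and collect one record per CDS feature.
genome = SeqIO.read(gb_path, "genbank")
gene_records = []
for feature in genome.features:
    if feature.type == "CDS":
        gene_name = feature.qualifiers["gene"][0]
        gene_seq = feature.extract(genome.seq)  # sequence covered by the feature
        gene_records.append(
            SeqRecord(gene_seq, id=genome.id, description=gene_name)
        )

# Write all gene records to one FASTA file.
count = SeqIO.write(gene_records, genes_path, "fasta")
print(count)
```

Each FASTA header then carries the genome id followed by the gene name.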
07:07 Using Biopython tools, we can sort the records in a file by length.
07:12 Here, I have opened a FASTA file “hemoglobin.fasta” which has six records.
07:19 Each record is of a different length.
07:23 Type the following lines to arrange the longest record first.
07:27 The new file with the sorted sequences will be saved as "sorted_hemoglobin.fasta" in your home folder.
07:38 To place the shortest records first, reverse the arguments in the records.sort command line.
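A minimal sketch of sorting by length, using a toy FASTA file with three records in place of "hemoglobin.fasta". This sketch sorts with a key function, which is an assumption and may differ from the command shown in the tutorial. The file names and record names are hypothetical.

```python
import os
import tempfile

from Bio import SeqIO

workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "toy_hemoglobin.fasta")
out_path = os.path.join(workdir, "sorted_toy_hemoglobin.fasta")

# Toy input: three records of different lengths.
with open(in_path, "w") as handle:
    handle.write(">short\nMVLS\n>long\nMVLSPADKTNVK\n>medium\nMVLSPADK\n")

# Load all records into a list so that they can be sorted in memory.
records = list(SeqIO.parse(in_path, "fasta"))

# len() of a record is the length of its sequence; reverse=True puts the longest first.
records.sort(key=len, reverse=True)

SeqIO.write(records, out_path, "fasta")

sorted_ids = [record.id for record in SeqIO.parse(out_path, "fasta")]
print(sorted_ids)
```

In this key-based form, leaving out reverse=True places the shortest record first.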
07:45 Let's summarize.
07:46 In this tutorial, we have learnt: * To create Sequence Record Objects.
07:51 * Write sequence files using write function of Sequence Input/Output module.
07:58 * Convert between sequence file formats using convert function.
08:03 * And, sort records in a file by length.
08:07 For the assignment:
08:09 Extract the gene "HIV1gp3" at positions 4587 to 5165 from the genomic sequence of HIV.
08:21 The file “HIV.gb” is included in code files of this tutorial.
08:28 Your completed assignment will have the following code.
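The assignment code shown on screen is not part of this narration. One way to approach it is to slice the record, as sketched below on a toy sequence. The helper extract_region is hypothetical, and the sketch assumes that the positions are 1-based and inclusive, as in GenBank feature tables.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord


def extract_region(record, start, end):
    """Return the part of record covering 1-based inclusive positions start..end."""
    # Python slices are 0-based and exclude the end index, hence start - 1.
    return record[start - 1:end]


# Toy record standing in for the HIV genome.
toy = SeqRecord(Seq("AAACCCGGGTTT"), id="toy")

region = extract_region(toy, 4, 9)
print(region.seq)
```

Under the same assumption, extract_region(genome, 4587, 5165) on the record read from “HIV.gb” would give a region of 579 bases.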
08:43 The video at the following link summarizes the Spoken Tutorial project.
08:48 Please download and watch it.
08:49 The Spoken Tutorial Project team conducts workshops and gives certificates for those who pass an online test.
08:57 For more details, please write to us.
09:00 The Spoken Tutorial Project is funded by NMEICT, MHRD, Government of India.
09:06 More information on this mission is available at the link shown.
09:10 This is Snehalatha from IIT Bombay, signing off. Thank you for joining.

Contributors and Content Editors

PoojaMoolya, Pratik kamble, Priyacst, Sandhya.np14