Biopython/C2/Writing-Sequence-Files/English

Visual Cue	Narration
Slide Number 1 Title Slide	Hello everyone. Welcome to this tutorial on Writing Sequence Files.
Slide Number 2 Learning Objectives	In this tutorial, we will learn how to, Create Sequence Record Objects. Write sequences files. Convert between file formats. And sort records in a file by length.
Slide Number 3 Pre-requisites	To follow this tutorial you should be familiar with, Undergraduate Biochemistry or Bioinformatics And basic Python programming Refer to the Python tutorials at the given link.
Slide Number 4 System Requirement	To record this tutorial I am using Ubuntu OS version. 14.10 Python version 2.7.8 Ipython interpretor version 2.3.0 And Biopython 1.64
Slide Number 5 SeqIO functions	We have earlier learnt about parse and read functions to read contents of a file. In this tutorial we will learn how to use write function to write sequences to a file. And use Convert function for interconversion between various file formats. Let me now demonstrate how to use write function.
Navigate to the file, “example-insulin”. Open text file “example-insulin” in a text editor. Highlight sequence, GI accession number.	Here is a text file with a protein sequence. The sequence shown here is insulin protein. The file also has information such as GI accession number and also description. We will now create a file for this sequence in FASTA format. The first step is to create sequence record object.
Slide Number 6 Sequence Record Objects	More information about Sequence Record Objects: It is the basic data type for the sequence input/output interface. In sequence record object, a sequence is associated with higher level features: such as identifiers and descriptions.
Press ctrl, alt ant t keys simultaneously on the keyboard.	Open the terminal by pressing ctrl, alt and t keys simultaneously .
Type, from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord from Bio.Alphabet import generic_protein	At the prompt type ipython, press enter. At the prompt type the following lines: From Bio dot Seq module import Seq class. From Bio dot SeqRecord module import Sequence Record class From Bio dot Alphabet module import generic protein class
record1 = SeqRecord(Seq(“MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG” \ + “GPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN”, generic_protein), id= “gi\|386828\|gb\|AAA59172.1”, description= “insulin [Homo sapiens]”)	Next I will save the sequence record object in a variable record1. Copy the sequence, id and description from the text file and paste in the respective lines on the terminal. Press enter.
Type, record1. Press enter. Highlight the output.	To view the output, type, record1. Press enter. The output shows the insulin protein sequence as sequence record object. It shows the sequence along with id and description.
Type, from Bio import SeqIO SeqIO.write(record1, "example.fasta", "fasta") Press enter	We will use write function to convert the above sequence object to a FASTA file. Import SeqIO module from Bio package. Next type the command line with a write function to convert the sequence object to FASTA file. write function takes 3 arguments, first one is the variable storing the sequence record object. Second is the file name to write the FASTA file. The Third is the file format to write. Press enter.
Highlight the output. Cursor on the terminal.	Output shows “one”, that is we have converted one sequence record object to a FASTA file. The file in FASTA format is saved in the home folder as "example.fasta". Let me warn you: The output will over-write any pre-existing file of the same name.
Navigate to the home folder and click on the file, “my_example.fasta”.	To view the file Navigate to the file in the home folder. Open this file in a text editor. The protein sequence is now in FASTA format. Close the text editor.
Cursor on the terminal.	Many bioinformatics tools take different input file formats. So sometimes there is a need to interconvert between sequence file formats. We can do file conversions using convert function in SeqIO module.
Navigate to home folder and click on “HIV.gb” . Open HIV.gb using text editor. Scroll down the page. Close the text editor.	For demonstration I will convert a GenBank file to a FASTA file. I have a GenBank file in my home folder. Let me open this in a text editor. This file contains HIV genome in GenBank format. This GenBank file has descriptions of all the genes in the genome in the first part of the file. It is followed by a complete genome sequence. Close the text editor.
Type the following lines on the terminal. >>>from Bio import SeqIO SeqIO.convert("HIV.gb", "genbank", "HIV.fasta", "fasta") press enter	Type the following lines on the terminal. Here the convert function converts the complete genome sequence present in the GenBank file to FASTA file. Press enter The new file in FASTA format is now saved as HIV.fasta in the home folder.
Navigate to the file and open my_example-2.fasta. Close the text editor.	Navigate to the file and open in the text editor. Close the text editor.
Slide Number 7 Limitations of convert function.	Even though we can convert the file formats easily using convert function, it has limitations. Writing some formats requires information which other file formats don’t contain. For example: We can convert a GenBank file to a FASTA file, we can't do the reverse. Similarly we can turn a FASTQ file into a FASTA file, but can’t do the reverse.
Cursor on the terminal. Type >>> from Bio import SeqIO >>> help(SeqIO.convert)	For more information regarding convert function, type the help command. Press enter.
Press “q” on the key board.	Press “q” on the key board to get back to the prompt.
Cursor on the terminal.	We can also extract individual genes from the HIV genome in GenBank format. These individual genes can be saved in FASTA or any other formats.
Type the following at the prompt: from Bio import SeqIO f = open('HIV_gene.fasta', 'w') for genome in SeqIO.parse('HIV.gb','genbank'): for gene in genome.features: if gene.type == "CDS": gene_seq = gene.extract(genome.seq) gi = str(gene.qualifiers['db_xref']).split(":")[1].split("'")[0] f.write(">GeneId %s %s\n%s\n" % (gi, gene.qualifiers['product'], gene_seq)) f.close() Press enter	For this type the following code at the prompt. This code will write all individual CDS gene sequences , their ids and name of the gene in a file. This file is saved as “HIV_geneseq.fasta” on your home folder.
Navigate to home folder and open “hemoglobin.fasta”.	Using Biopython tools we can sort the records in a file by length. Here I have opened a FASTA file “hemoglobin.fasta” which has six records. Each record is of a different length.
At the prompt type, from Bio import SeqIO records = list(SeqIO.parse("hemoglobin.fasta","fasta")) records.sort(cmp=lambda x,y: cmp(len(y),len(x))) SeqIO.write(records, "sorted_hemoglobin.fasta", "fasta")	Type the following lines to arrange the longest record first. The new file with the sorted sequences will be saved as "sorted_hemoglobin.fasta" in your home folder.
Cursor on the terminal.	For Short records first, reverse the arguments in the records.sort command line.
Slide Number 8 Summary	Lets Summarize, In this tutorial we have learnt to, Create Sequence Record Objects. Write sequence files using write function of Sequence Input/Output module( Bio.SeqIO. Convert between sequence file formats using convert function. And sort records in a file by length.
Slide Number 9 Assignment from Bio import SeqIO record = SeqIO.read(“HIV.gb”, “genbank”) record sub_record = record [4587:5165] # GI = 19172951, ID 155459, “HIV1gp3” SeqIO.write (sub_record, “sub_record-2.fasta”, “fasta”)	For Assignment: Extract the gene "HIV1gp3" at positions 4587 to 5165 from the genomic sequence of HIV. The file “HIV.gb” is included in code files of this tutorial. Your completed assignment should have the following code.
Slide Number 10 Acknowledgement	The video at the following link summarizes the Spoken Tutorial project. Please download and watch it.
Slide Number 11	The Spoken Tutorial Project Team conducts workshops and gives certificates for those who pass an online test. For more details, please write to us.
Slide number 12	Spoken Tutorial Project is funded by NMEICT, MHRD, Government of India. More information on this Mission is available at the link shown.
Slide number 12	This is Snehalatha from IIT Bombay signing off. Thank you for joining.

Contributors and Content Editors

PoojaMoolya, Snehalathak

Biopython/C2/Writing-Sequence-Files/English

Contributors and Content Editors

Navigation menu

Personal tools

Namespaces

Variants

Views

Actions

Search

Navigation

Tools