Biopython/C2/Writing-Sequence-Files/English
|
|
---|---|
Slide Number 1
Title Slide |
Hello everyone.
Welcome to this tutorial on Writing Sequence Files. |
Slide Number 2
Learning Objectives |
In this tutorial, we will learn how to,
|
Slide Number 3
Pre-requisites |
To follow this tutorial you should be familiar with,
Refer to the Python tutorials at the given link. |
Slide Number 4
System Requirement |
To record this tutorial I am using
|
Slide Number 5
SeqIO functions |
We have earlier learnt about
parse and read functions to read contents of a file.
|
Navigate to the file, “example-insulin”.
|
Here is a text file with a protein sequence.
The sequence shown here is that of the insulin protein.
|
Slide Number 6
Sequence Record Objects |
More information about Sequence Record Objects:
such as identifiers and descriptions. |
Press ctrl, alt and t keys simultaneously on the keyboard. | Open the terminal by pressing ctrl, alt and t keys simultaneously. |
Type,
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein |
At the prompt type ipython, press enter.
From the Bio dot Seq module, import the Seq class.
From the Bio dot SeqRecord module, import the Sequence Record class.
From the Bio dot Alphabet module, import the generic protein alphabet.
|
record1 = SeqRecord(Seq("MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG"
                        + "GPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN", generic_protein),
                    description="insulin [Homo sapiens]")
|
Next I will save the sequence record object in a variable record1.
|
Type, record1.
Press enter.
|
To view the output, type, record1.
Press enter.
It shows the sequence along with id and description. |
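A note for readers on newer installations: the Bio.Alphabet module was removed in Biopython 1.78, so the generic_protein import used in this tutorial fails on those releases. The following is a minimal sketch of the same record without it, assuming a recent Biopython release. The id value is a hypothetical placeholder and is not taken from the tutorial.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# The insulin sequence from the tutorial; adjacent string literals are joined.
insulin = Seq(
    "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG"
    "GPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
)

record1 = SeqRecord(
    insulin,
    id="insulin_example",  # hypothetical identifier, chosen for illustration
    description="insulin [Homo sapiens]",
    # Newer releases record the molecule type as an annotation
    # instead of attaching an alphabet to the sequence.
    annotations={"molecule_type": "protein"},
)

print(record1)
```

Printing the record shows the id, the description and the sequence, as in the tutorial.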
Type,
|
We will use the write function to write the above sequence record object to a FASTA file.
|
Highlight the output.
|
The output shows “one”, meaning that one sequence record object has been written to the FASTA file.
The write function overwrites any pre-existing file of the same name. |
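The write step described above can be sketched as follows. This is an illustration under assumptions, not the exact command from the recording: the record id is a hypothetical placeholder, and the file is written to a temporary folder rather than the home folder so the sketch does not overwrite anything.

```python
import os
import tempfile

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

record1 = SeqRecord(
    Seq(
        "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG"
        "GPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
    ),
    id="insulin_example",  # hypothetical identifier
    description="insulin [Homo sapiens]",
)

# Write to a temporary folder; the tutorial writes to the home folder instead.
out_path = os.path.join(tempfile.mkdtemp(), "my_example.fasta")

# SeqIO.write returns the number of records written.
count = SeqIO.write(record1, out_path, "fasta")
print(count)

# Read the file back to check the FASTA header line.
with open(out_path) as handle:
    fasta_text = handle.read()
print(fasta_text)
```

The first line of the file holds the id and description after a ">" sign, and the sequence follows on the next lines.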
Navigate to the home folder and click on the file, “my_example.fasta”.
|
To view the file
|
Cursor on the terminal. | Many bioinformatics tools take different input file formats.
|
Navigate to home folder and click on “HIV.gb” .
|
For demonstration I will convert a GenBank file to a FASTA file.
|
Type the following lines on the terminal.
SeqIO.convert("HIV.gb", "genbank", "HIV.fasta", "fasta")
|
Type the following lines on the terminal.
Press enter
|
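The convert function behaves like a parse followed by a write, and it returns the number of records converted. A small sketch of that equivalence is given here, assuming in-memory text in place of the HIV.gb file, which is not reproduced in this script. The two toy records and the "tab" output format are chosen only for illustration.

```python
from io import StringIO

from Bio import SeqIO

# Two toy FASTA records standing in for a real input file.
fasta_text = (
    ">seq1 first toy record\nATGCATGC\n"
    ">seq2 second toy record\nGGGTTTAA\n"
)

# One-step conversion; file handles are accepted in place of file names.
converted = StringIO()
n_converted = SeqIO.convert(StringIO(fasta_text), "fasta", converted, "tab")

# The same result in two steps: parse the input, then write the records.
written = StringIO()
n_written = SeqIO.write(SeqIO.parse(StringIO(fasta_text), "fasta"), written, "tab")

print(n_converted, n_written)
print(converted.getvalue())
```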
Navigate to the file and open my_example-2.fasta.
Close the text editor. |
Navigate to the file and open in the text editor.
|
Slide Number 7
Limitations of convert function. |
Even though we can convert file formats easily using the convert function, it has limitations.
|
Cursor on the terminal.
Type
>>> from Bio import SeqIO
>>> help(SeqIO.convert) |
For more information regarding convert function, type the help command.
|
Press “q” on the keyboard. | Press “q” on the keyboard to get back to the prompt. |
Cursor on the terminal.
|
We can also extract individual genes from the HIV genome in GenBank format.
|
Type the following at the prompt:
f = open('HIV_gene.fasta', 'w')
for genome in SeqIO.parse('HIV.gb','genbank'):
    for gene in genome.features:
        if gene.type == "CDS":
            gene_seq = gene.extract(genome.seq)
            gi = str(gene.qualifiers['db_xref']).split(":")[1].split("'")[0]
            f.write(">GeneId %s %s\n%s\n" % (gi, gene.qualifiers['product'], gene_seq))
f.close()
|
For this type the following code at the prompt.
|
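The gene extraction loop above can also be written as a small function. This is a sketch, not the tutorial's own code: the function name write_cds_fasta and the toy genome are hypothetical, and the toy record stands in for HIV.gb. Since qualifier values are lists, the sketch takes the first element so the header does not contain brackets and quotes.

```python
from io import StringIO

from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord


def write_cds_fasta(genomes, handle):
    """Write every CDS feature of the given records to handle in FASTA layout."""
    count = 0
    for genome in genomes:
        for feature in genome.features:
            if feature.type != "CDS":
                continue
            gene_seq = feature.extract(genome.seq)
            # Qualifier values are lists; fall back to a placeholder if absent.
            xref = feature.qualifiers.get("db_xref", ["unknown"])[0]
            gene_id = xref.split(":", 1)[-1]
            product = feature.qualifiers.get("product", ["unknown product"])[0]
            handle.write(">GeneId %s %s\n%s\n" % (gene_id, product, gene_seq))
            count += 1
    return count


# Toy genome with one non-CDS feature and one CDS feature (hypothetical data).
toy_genome = SeqRecord(Seq("AAATGGCCTAAGGG"), id="toy_genome")
toy_genome.features.append(SeqFeature(FeatureLocation(0, 14), type="source"))
toy_genome.features.append(
    SeqFeature(
        FeatureLocation(2, 11),
        type="CDS",
        qualifiers={"db_xref": ["GeneID:12345"], "product": ["toy protein"]},
    )
)

out = StringIO()
n_genes = write_cds_fasta([toy_genome], out)
print(out.getvalue())
```

With a real file, the first argument could be SeqIO.parse('HIV.gb', 'genbank') and the second a file opened in a with statement, which closes the file even if an error occurs.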
Navigate to home folder and open “hemoglobin.fasta”. | Using Biopython tools we can sort the records in a file by length.
|
At the prompt type,
records = list(SeqIO.parse("hemoglobin.fasta","fasta"))
records.sort(cmp=lambda x,y: cmp(len(y),len(x)))
SeqIO.write(records, "sorted_hemoglobin.fasta", "fasta")
|
Type the following lines to arrange the longest record first.
|
Cursor on the terminal. | To place the shortest records first, swap the arguments x and y in the cmp call of the records.sort line. |
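The cmp argument of sort and the built-in cmp function exist only in Python 2; both were removed in Python 3. A sketch of the same sorting for Python 3 is given here, assuming three toy records held in memory in place of hemoglobin.fasta. The record names and sequences are made up for illustration.

```python
from io import StringIO

from Bio import SeqIO

# Toy FASTA text standing in for hemoglobin.fasta.
fasta_text = ">short\nMVL\n>long\nMVLSPADKTN\n>medium\nMVLSPA\n"
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

# len() of a sequence record is the length of its sequence.
longest_first = sorted(records, key=len, reverse=True)
shortest_first = sorted(records, key=len)

out = StringIO()
SeqIO.write(longest_first, out, "fasta")
print(out.getvalue())
```

Passing key=len sorts by sequence length, and reverse=True places the longest record first.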
Slide Number 8
Summary |
Let's summarize.
In this tutorial we have learnt to,
|
Slide Number 9
Assignment
record = SeqIO.read("HIV.gb", "genbank")
record
sub_record = record[4587:5165]  # GI = 19172951, ID 155459, "HIV1gp3"
SeqIO.write(sub_record, "sub_record-2.fasta", "fasta") |
For Assignment:
Extract the gene "HIV1gp3" at positions 4587 to 5165 from the genomic sequence of HIV. The file “HIV.gb” is included in code files of this tutorial.
|
Slide Number 10
Acknowledgement |
The video at the following link summarizes the Spoken Tutorial project.
Please download and watch it. |
Slide Number 11 | The Spoken Tutorial Project Team conducts workshops and gives certificates for those who pass an online test.
For more details, please write to us. |
Slide number 12 | Spoken Tutorial Project is funded by NMEICT, MHRD, Government of India.
More information on this Mission is available at the link shown. |
Slide number 12 | This is Snehalatha from IIT Bombay signing off. Thank you for joining. |