Difference between revisions of "Biopython/C2/Writing-Sequence-Files/English"
Snehalathak (Talk | contribs) |
PoojaMoolya (Talk | contribs) |
||
(One intermediate revision by the same user not shown) | |||
Line 34: | Line 34: | ||
− | * Undergraduate Biochemistry or Bioinformatics | + | * Undergraduate '''Biochemistry''' or '''Bioinformatics''' |
− | * And basic Python programming | + | * And basic '''Python''' programming |
− | Refer to the Python tutorials at the given link. | + | Refer to the '''Python''' tutorials at the given link. |
|- | |- | ||
Line 45: | Line 45: | ||
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| To record this tutorial I am using | | style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| To record this tutorial I am using | ||
− | * Ubuntu OS version. 14.10 | + | * '''Ubuntu OS''' version. 14.10 |
− | * Python version 2.7.8 | + | * '''Python''' version 2.7.8 |
− | * Ipython interpretor version 2.3.0 | + | * '''Ipython interpretor''' version 2.3.0 |
− | * And Biopython 1.64 | + | * And '''Biopython''' 1.64 |
Line 82: | Line 82: | ||
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Here is a text file with a protein sequence. | | style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Here is a text file with a protein sequence. | ||
− | The sequence shown here is insulin protein. | + | The sequence shown here is '''insulin protein.''' |
− | The file also has information such as GI accession number and also description. | + | The file also has information such as '''GI''' accession number and also description. |
− | We will now create a file for this sequence in FASTA format. | + | We will now create a file for this sequence in '''FASTA''' format. |
Line 120: | Line 120: | ||
'''from Bio.Alphabet import generic_protein''' | '''from Bio.Alphabet import generic_protein''' | ||
− | | style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| At the prompt type ipython, press enter. | + | | style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| At the prompt type '''ipython''', press enter. |
Line 126: | Line 126: | ||
− | + | '''from Bio dot Seq module import Seq class'''. | |
− | + | '''from Bio dot SeqRecord module import Sequence Record class''' | |
− | + | '''from Bio dot Alphabet module import generic protein class''' | |
Line 171: | Line 171: | ||
− | Press | + | Press '''Enter.''' |
|- | |- | ||
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:none;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Type, '''record1'''. | | style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:none;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Type, '''record1'''. | ||
− | Press | + | Press '''enter'''. |
Line 182: | Line 182: | ||
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| To view the output, type, '''record1'''. | | style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| To view the output, type, '''record1'''. | ||
− | Press enter. | + | Press '''enter.''' |
− | The output shows the insulin protein sequence as sequence record object. | + | The output shows the '''insulin protein''' sequence as '''sequence record '''object. |
It shows the sequence along with id and description. | It shows the sequence along with id and description. | ||
Line 193: | Line 193: | ||
− | + | from Bio import SeqIO''' | |
Line 200: | Line 200: | ||
Press enter | Press enter | ||
− | | style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| We will use write function to convert the above sequence object to a FASTA file. | + | | style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| We will use write function to convert the above sequence object to a '''FASTA''' file. |
− | Import SeqIO module from Bio package. | + | '''Import SeqIO module''' from '''Bio package.''' |
− | Next type the command line with a '''write''' function to convert the sequence object to FASTA file. | + | Next type the command line with a '''write''' function to convert the sequence object to '''FASTA''' file. |
Line 212: | Line 212: | ||
− | Second is the file name to write the FASTA file. | + | Second is the file name to write the '''FASTA''' file. |
− | The Third is the file format to write. | + | The Third is the file format to write. |
− | Press enter. | + | Press '''enter.''' |
|- | |- | ||
Line 225: | Line 225: | ||
Cursor on the terminal. | Cursor on the terminal. | ||
− | | style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Output shows “one”, that is we have converted one sequence record object to a FASTA file. | + | | style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Output shows “one”, that is we have converted one sequence record object to a '''FASTA''' file. |
− | The file in FASTA format is saved in the home folder as '''"example.fasta".''' | + | The file in '''FASTA''' format is saved in the home folder as '''"example.fasta".''' |
Line 243: | Line 243: | ||
− | Navigate to the file in the home folder. | + | Navigate to the file in the '''home''' folder. |
− | Open this file in a text editor. | + | Open this file in a '''text editor.''' |
− | The protein sequence is now in FASTA format. | + | The protein sequence is now in '''FASTA''' format. |
− | Close the text editor. | + | Close the '''text editor.''' |
|- | |- | ||
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:none;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Cursor on the terminal. | | style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:none;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Cursor on the terminal. | ||
− | | style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Many bioinformatics tools take different input file formats. | + | | style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Many '''bioinformatics''' tools take different input file formats. |
Line 262: | Line 262: | ||
− | We can do file conversions using''' convert '''function in SeqIO module. | + | We can do file conversions using''' convert '''function in '''SeqIO module.''' |
|- | |- | ||
Line 275: | Line 275: | ||
Close the text editor. | Close the text editor. | ||
− | | style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| For demonstration I will convert a GenBank file to a FASTA file. | + | | style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| For demonstration I will convert a '''GenBank''' file to a '''FASTA''' file. |
− | I have a GenBank file in my home folder. | + | I have a '''GenBank''' file in my home folder. |
Line 284: | Line 284: | ||
− | This file contains HIV genome in GenBank format. | + | This file contains '''HIV genome''' in '''GenBank''' format. |
− | This GenBank file has descriptions of all the genes in the genome in the first part of the file. | + | This '''GenBank''' file has descriptions of all the '''genes''' in the '''genome''' in the first part of the file. |
− | It is followed by a complete genome sequence. | + | It is followed by a complete '''genome''' sequence. |
Line 311: | Line 311: | ||
− | Here the '''convert''' function converts the complete genome sequence present in the GenBank file to FASTA file. | + | Here the '''convert''' function converts the complete '''genome''' sequence present in the '''GenBank''' file to '''FASTA''' file. |
− | Press enter | + | Press '''enter''' |
− | The new file in FASTA format is now saved as HIV.fasta in the home folder. | + | The new file in '''FASTA''' format is now saved as '''HIV.fasta''' in the home folder. |
|- | |- | ||
Line 337: | Line 337: | ||
− | For example: We can convert a GenBank file to a FASTA file, we can't do the reverse. | + | For example: We can convert a '''GenBank''' file to a '''FASTA''' file, we can't do the reverse. |
− | Similarly we can turn a FASTQ file into a FASTA file, but can’t do the reverse. | + | Similarly we can turn a '''FASTQ''' file into a '''FASTA''' file, but can’t do the reverse. |
|- | |- | ||
Line 350: | Line 350: | ||
'''>>> help(SeqIO.convert)''' | '''>>> help(SeqIO.convert)''' | ||
− | | style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| For more information regarding convert function, type the help command. | + | | style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| For more information regarding convert function, type the '''help''' command. |
− | Press enter. | + | Press '''enter'''. |
|- | |- | ||
Line 364: | Line 364: | ||
− | | style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| We can also extract individual genes from the HIV genome in GenBank format. | + | | style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| We can also extract individual '''genes''' from the '''HIV genome''' in '''GenBank''' format. |
− | These individual genes can be saved in FASTA or any other formats. | + | These individual '''genes''' can be saved in '''FASTA''' or any other formats. |
|- | |- | ||
Line 402: | Line 402: | ||
− | This file is saved as “'''HIV_geneseq.fasta'''” on your home folder. | + | This file is saved as “'''HIV_geneseq.fasta'''” on your '''home''' folder.Press '''enter''' |
|- | |- | ||
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:none;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Navigate to home folder and open “hemoglobin.fasta”. | | style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:none;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Navigate to home folder and open “hemoglobin.fasta”. | ||
− | | style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Using Biopython tools we can sort the records in a file by length. | + | | style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Using '''Biopython''' tools we can sort the records in a file by length. |
Line 431: | Line 431: | ||
− | The new file with the sorted sequences will be saved as''' "sorted_hemoglobin.fasta" '''in your | + | The new file with the sorted sequences will be saved as''' "sorted_hemoglobin.fasta" '''in your '''home''' folder |
Line 438: | Line 438: | ||
|- | |- | ||
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:none;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Cursor on the terminal. | | style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:none;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Cursor on the terminal. | ||
− | | style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| For Short records first, reverse the arguments in the records.sort command line. | + | | style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| For Short records first, reverse the arguments in the '''records.sort''' command line. |
|- | |- | ||
Line 475: | Line 475: | ||
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| For Assignment: | | style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| For Assignment: | ||
− | Extract the gene "'''HIV1gp3'''" at positions 4587 to 5165 from the genomic sequence of HIV. | + | Extract the gene "'''HIV1gp3'''" at positions 4587 to 5165 from the '''genomic''' sequence of HIV. |
The file “'''HIV.gb'''” is included in code files of this tutorial. | The file “'''HIV.gb'''” is included in code files of this tutorial. |
Latest revision as of 15:34, 28 September 2015
|
|
---|---|
Slide Number 1
Title Slide |
Hello everyone.
|
Slide Number 2
Learning Objectives |
In this tutorial, we will learn how to,
|
Slide Number 3
Pre-requisites |
To follow this tutorial you should be familiar with,
Refer to the Python tutorials at the given link. |
Slide Number 4
System Requirement |
To record this tutorial I am using
|
Slide Number 5
SeqIO functions |
We have earlier learnt about
parse and read functions to read contents of a file.
|
Navigate to the file, “example-insulin”.
|
Here is a text file with a protein sequence.
The sequence shown here is insulin protein.
|
Slide Number 6
Sequence Record Objects |
More information about Sequence Record Objects:
such as identifiers and descriptions. |
Press ctrl, alt and t keys simultaneously on the keyboard. | Open the terminal by pressing ctrl, alt and t keys simultaneously. |
Type,
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein |
At the prompt type ipython, press enter.
from Bio dot Seq module import Seq class.
from Bio dot SeqRecord module import Sequence Record class.
from Bio dot Alphabet module import generic protein class.
|
Type,
+ “GPGAGSLQPLALEG
description= “insulin [Homo sapiens]”)
|
Next I will save the sequence record object in a variable record1.
|
Type, record1.
Press enter.
|
To view the output, type, record1.
Press enter.
It shows the sequence along with id and description. |
Type,
|
We will use write function to convert the above sequence object to a FASTA file.
|
Highlight the output.
|
The output shows “one”; that is, we have converted one sequence record object to a FASTA file.
The output will over-write any pre-existing file of the same name. |
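The steps above can be sketched end to end as follows. This is a minimal illustration, not the tutorial's exact session: the peptide, the id and the description are hypothetical placeholders, and the record is written to an in-memory buffer instead of a file in the home folder. It also assumes a recent Biopython release; the Bio.Alphabet module used in this tutorial (Biopython 1.64) was removed in Biopython 1.78, so the sketch builds the record without an alphabet argument.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical placeholder data, used only for illustration.
record = SeqRecord(
    Seq("MKTAYIAKQR"),
    id="example_id",
    description="illustrative peptide",
)

# SeqIO.write takes the record(s), an output file name or handle,
# and the format name. It returns the number of records written.
buffer = StringIO()
count = SeqIO.write(record, buffer, "fasta")
fasta_text = buffer.getvalue()

print(count)
print(fasta_text)
```

Passing a file name such as "example.fasta" in place of the buffer writes to disk, which is what the tutorial does.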
Navigate to the home folder and click on the file, “my_example.fasta”.
|
To view the file
|
Cursor on the terminal. | Many bioinformatics tools take different input file formats.
|
Navigate to home folder and click on “HIV.gb”.
|
For demonstration I will convert a GenBank file to a FASTA file.
|
Type the following lines on the terminal.
SeqIO.convert("HIV.gb", "genbank", "HIV.fasta", "fasta")
|
Type the following lines on the terminal.
Press enter
|
Navigate to the file and open my_example-2.fasta.
Close the text editor. |
Navigate to the file and open in the text editor.
|
Slide Number 7
Limitations of convert function. |
Even though we can convert file formats easily using the convert function, it has limitations.
|
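One way to see why the reverse conversion is not possible is to look at what gets dropped. The sketch below is plain Python, not Biopython's implementation, and the two reads are hypothetical. It turns simple four-line FASTQ records into FASTA by keeping the identifier and the sequence and discarding the quality line. Once the quality characters are gone, nothing in the FASTA text allows them to be rebuilt.

```python
# Hypothetical two-record FASTQ text. Each record has four lines:
# @identifier, sequence, '+', and one quality character per base.
fastq_text = """@read1
ACGTACGT
+
IIIIHHHH
@read2
GGCCTTAA
+
FFFFEEEE
"""


def fastq_to_fasta(text):
    """Convert simple four-line-per-record FASTQ text to FASTA text."""
    lines = text.strip().splitlines()
    fasta_lines = []
    for i in range(0, len(lines), 4):
        header, sequence = lines[i], lines[i + 1]
        fasta_lines.append(">" + header[1:])  # '@read1' becomes '>read1'
        fasta_lines.append(sequence)          # quality line is discarded
    return "\n".join(fasta_lines) + "\n"


fasta_text = fastq_to_fasta(fastq_text)
print(fasta_text)
```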
Cursor on the terminal.
Type
>>> from Bio import SeqIO
>>> help(SeqIO.convert) |
For more information regarding the convert function, type the help command.
|
Press “q” on the keyboard. | Press “q” on the keyboard to get back to the prompt. |
Cursor on the terminal.
|
We can also extract individual genes from the HIV genome in GenBank format.
|
Type the following at the prompt:
f = open('HIV_gene.fasta', 'w')
for genome in SeqIO.parse('HIV.gb', 'genbank'):
    for gene in genome.features:
        if gene.type == "CDS":
            gene_seq = gene.extract(genome.seq)
            gi = str(gene.qualifiers['db_xref']).split(":")[1].split("'")[0]
            f.write(">GeneId %s %s\n%s\n" % (gi, gene.qualifiers['product'], gene_seq))
f.close()
|
For this type the following code at the prompt.
|
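The line that builds gi in the code above is compact, so here is a small plain-Python sketch of what it does. The db_xref list is a hypothetical example of a qualifier value (a parsed GenBank feature stores each qualifier as a list of strings); the real entries and their order in “HIV.gb” may differ.

```python
# Hypothetical qualifier value, shaped like a parsed db_xref qualifier.
db_xref = ["GI:19172951", "GeneID:155459"]

# The tutorial's expression: turn the list into a string, take the text
# after the first ':', then cut it at the next quote character.
gi_original = str(db_xref).split(":")[1].split("'")[0]

# An alternative that keeps every cross-reference: split each entry once
# into a (database, identifier) pair.
xrefs = dict(entry.split(":", 1) for entry in db_xref)

print(gi_original)
print(xrefs)
```

The first form returns only the identifier of the first entry in the list. The dictionary form makes it explicit which database each identifier belongs to.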
Navigate to home folder and open “hemoglobin.fasta”. | Using Biopython tools we can sort the records in a file by length.
|
At the prompt type,
records = list(SeqIO.parse("hemoglobin.fasta","fasta"))
records.sort(cmp=lambda x,y: cmp(len(y),len(x)))
SeqIO.write(records, "sorted_hemoglobin.fasta", "fasta")
|
Type the following lines to arrange the longest record first.
|
Cursor on the terminal. | To place the shortest records first, reverse the arguments in the records.sort command line. |
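The cmp argument used above exists only in Python 2, which is the version this tutorial was recorded with. Python 3 removed it, so a key-based sort is needed there. The sketch below shows both orderings. FakeRecord is a hypothetical stand-in so that the example runs without any sequence file; a Biopython SeqRecord also supports len(), so the same two sort calls should apply to the list returned by SeqIO.parse.

```python
class FakeRecord:
    """Hypothetical stand-in for a sequence record; only len() matters here."""

    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence

    def __len__(self):
        return len(self.sequence)


records = [
    FakeRecord("b", "MVLSPADK"),
    FakeRecord("a", "MV"),
    FakeRecord("c", "MVHLT"),
]

# Longest record first.
records.sort(key=len, reverse=True)
longest_first = [r.name for r in records]

# Shortest record first.
records.sort(key=len)
shortest_first = [r.name for r in records]

print(longest_first)
print(shortest_first)
```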
Slide Number 8
Summary |
Let's summarize,
In this tutorial we have learnt to,
|
Slide Number 9
Assignment
record = SeqIO.read("HIV.gb", "genbank")
record
sub_record = record[4587:5165]  # GI = 19172951, ID 155459, "HIV1gp3"
SeqIO.write(sub_record, "sub_record-2.fasta", "fasta") |
For Assignment:
Extract the gene "HIV1gp3" at positions 4587 to 5165 from the genomic sequence of HIV. The file “HIV.gb” is included in code files of this tutorial.
|
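A note on coordinates when slicing. Python slices count from 0 and exclude the end index, while positions listed in a GenBank file count from 1 and include the end. The sketch below assumes that the positions 4587 to 5165 quoted in the assignment are such 1-based, inclusive positions; that is an assumption, not something the tutorial states. The genome string is a hypothetical stand-in, and slice_one_based is an illustrative helper, not a Biopython function.

```python
# Hypothetical stand-in for a genome: 6000 bases of repeated "ACGT".
genome = "ACGT" * 1500


def slice_one_based(sequence, start, end):
    """Return bases start..end, where both are 1-based and inclusive."""
    return sequence[start - 1:end]


gene = slice_one_based(genome, 4587, 5165)

# 1-based inclusive positions 4587..5165 cover 5165 - 4587 + 1 bases.
print(len(gene))

# Slicing with the positions used directly drops the first base.
print(len(genome[4587:5165]))
```

Under that assumption, the slice for the assignment would start at index 4586 instead of 4587.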
Slide Number 10
Acknowledgement |
The video at the following link summarizes the Spoken Tutorial project.
Please download and watch it. |
Slide Number 11 | The Spoken Tutorial Project Team conducts workshops and gives certificates to those who pass an online test.
For more details, please write to us. |
Slide number 12 | Spoken Tutorial Project is funded by NMEICT, MHRD, Government of India.
More information on this Mission is available at the link shown. |
Slide number 12 | This is Snehalatha from IIT Bombay signing off. Thank you for joining. |