Difference between revisions of "Biopython/C2/Writing-Sequence-Files/English"

From Script | Spoken-Tutorial
Jump to: navigation, search
Line 82: Line 82:
 
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Here is a text file with a protein sequence.  
 
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Here is a text file with a protein sequence.  
  
The sequence shown here is insulin protein.  
+
The sequence shown here is '''insulin protein.'''
  
  
The file also has information such as GI accession number and also description.
+
The file also has information such as '''GI''' accession number and also description.
  
  
We will now create a file for this sequence in FASTA format.  
+
We will now create a file for this sequence in '''FASTA''' format.  
  
  
Line 120: Line 120:
  
 
'''from Bio.Alphabet import generic_protein'''
 
'''from Bio.Alphabet import generic_protein'''
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| At the prompt type ipython, press enter.
+
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| At the prompt type '''ipython''', press enter.
  
  
Line 126: Line 126:
  
  
From '''Bio dot Seq module''' import '''Seq class'''.
+
'''from Bio dot Seq module import Seq class'''.
  
From '''Bio dot SeqRecord module''' import '''Sequence Record class'''
+
'''from Bio dot SeqRecord module import Sequence Record class'''
  
From B'''io dot Alphabet module''' import '''generic protein class'''
+
'''from Bio dot Alphabet module import generic protein class'''
  
  
Line 171: Line 171:
  
  
Press enter.
+
Press '''Enter.'''
  
 
|-
 
|-
 
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:none;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Type, '''record1'''.
 
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:none;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Type, '''record1'''.
  
Press enter'''.'''
+
Press '''enter'''.
  
  
Line 182: Line 182:
 
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| To view the output, type, '''record1'''.
 
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| To view the output, type, '''record1'''.
  
Press enter.
+
Press '''enter.'''
  
  
The output shows the insulin protein sequence as sequence record object.
+
The output shows the '''insulin protein''' sequence as '''sequence record '''object.
  
 
It shows the sequence along with id and description.
 
It shows the sequence along with id and description.
Line 193: Line 193:
  
  
f'''rom Bio import SeqIO'''
+
from Bio import SeqIO'''
  
  
Line 200: Line 200:
  
 
Press enter
 
Press enter
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| We will use write function to convert the above sequence object to a FASTA file.
+
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| We will use write function to convert the above sequence object to a '''FASTA''' file.
  
  
Import SeqIO module from Bio package.
+
'''Import SeqIO module''' from '''Bio package.'''
  
  
Next type the command line with a '''write''' function to convert the sequence object to FASTA file.
+
Next type the command line with a '''write''' function to convert the sequence object to '''FASTA''' file.
  
  
Line 212: Line 212:
  
  
Second is the file name to write the FASTA file.
+
Second is the file name to write the '''FASTA''' file.
  
  
The Third is the file format to write.''' '''
+
The Third is the file format to write.
  
  
Press enter.
+
Press '''enter.'''
  
 
|-
 
|-
Line 225: Line 225:
  
 
Cursor on the terminal.
 
Cursor on the terminal.
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Output shows “one”, that is we have converted one sequence record object to a FASTA file.
+
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Output shows “one”, that is we have converted one sequence record object to a '''FASTA''' file.
  
  
The file in FASTA format is saved in the home folder as '''"example.fasta".'''
+
The file in '''FASTA''' format is saved in the home folder as '''"example.fasta".'''
  
  
Line 243: Line 243:
  
  
Navigate to the file in the home folder.
+
Navigate to the file in the '''home''' folder.
  
  
Open this file in a text editor.
+
Open this file in a '''text editor.'''
  
  
The protein sequence is now in FASTA format.
+
The protein sequence is now in '''FASTA''' format.
  
  
Close the text editor.
+
Close the '''text editor.'''
  
 
|-
 
|-
 
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:none;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Cursor on the terminal.
 
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:none;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Cursor on the terminal.
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Many bioinformatics tools take different input file formats.
+
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Many '''bioinformatics''' tools take different input file formats.
  
  
Line 262: Line 262:
  
  
We can do file conversions using''' convert '''function in SeqIO module.
+
We can do file conversions using''' convert '''function in '''SeqIO module.'''
  
 
|-
 
|-
Line 275: Line 275:
  
 
Close the text editor.
 
Close the text editor.
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| For demonstration I will convert a GenBank file to a FASTA file.
+
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| For demonstration I will convert a '''GenBank''' file to a '''FASTA''' file.
  
  
I have a GenBank file in my home folder.
+
I have a '''GenBank''' file in my home folder.
  
  
Line 284: Line 284:
  
  
This file contains HIV genome in GenBank format.
+
This file contains '''HIV genome''' in '''GenBank''' format.
  
  
This GenBank file has descriptions of all the genes in the genome in the first part of the file.  
+
This '''GenBank''' file has descriptions of all the '''genes''' in the '''genome''' in the first part of the file.  
  
  
It is followed by a complete genome sequence.
+
It is followed by a complete '''genome''' sequence.
  
  
Line 311: Line 311:
  
  
Here the '''convert''' function converts the complete genome sequence present in the GenBank file to FASTA file.
+
Here the '''convert''' function converts the complete '''genome''' sequence present in the '''GenBank''' file to '''FASTA''' file.
  
Press enter
+
Press '''enter'''
  
  
The new file in FASTA format is now saved as HIV.fasta in the home folder.
+
The new file in '''FASTA''' format is now saved as '''HIV.fasta''' in the home folder.
  
 
|-
 
|-
Line 337: Line 337:
  
  
For example: We can convert a GenBank file to a FASTA file, we can't do the reverse.
+
For example: We can convert a '''GenBank''' file to a '''FASTA''' file, we can't do the reverse.
  
  
Similarly we can turn a FASTQ file into a FASTA file, but can’t do the reverse.
+
Similarly we can turn a '''FASTQ''' file into a '''FASTA''' file, but can’t do the reverse.
  
 
|-
 
|-
Line 350: Line 350:
  
 
'''>>> help(SeqIO.convert)'''
 
'''>>> help(SeqIO.convert)'''
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| For more information regarding convert function, type the help command.
+
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| For more information regarding convert function, type the '''help''' command.
  
  
Press enter.
+
Press '''enter'''.
  
 
|-
 
|-
Line 364: Line 364:
  
  
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| We can also extract individual genes from the HIV genome in GenBank format.
+
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| We can also extract individual '''genes''' from the '''HIV genome''' in '''GenBank''' format.
  
  
These individual genes can be saved in FASTA or any other formats.
+
These individual '''genes''' can be saved in '''FASTA''' or any other formats.
  
 
|-
 
|-
Line 402: Line 402:
  
  
This file is saved as “'''HIV_geneseq.fasta'''” on your home folder.
+
This file is saved as “'''HIV_geneseq.fasta'''” on your '''home''' folder.Press '''enter'''
  
 
|-
 
|-
 
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:none;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Navigate to home folder and open “hemoglobin.fasta”.
 
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:none;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Navigate to home folder and open “hemoglobin.fasta”.
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Using Biopython tools we can sort the records in a file by length.
+
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Using '''Biopython''' tools we can sort the records in a file by length.
  
  
Line 431: Line 431:
  
  
The new file with the sorted sequences will be saved as''' "sorted_hemoglobin.fasta" '''in your home folder'''.'''
+
The new file with the sorted sequences will be saved as''' "sorted_hemoglobin.fasta" '''in your '''home''' folder
  
  
Line 438: Line 438:
 
|-
 
|-
 
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:none;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Cursor on the terminal.
 
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:none;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| Cursor on the terminal.
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| For Short records first, reverse the arguments in the records.sort command line.
+
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| For Short records first, reverse the arguments in the '''records.sort''' command line.
  
 
|-
 
|-
Line 475: Line 475:
 
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| For Assignment:  
 
| style="background-color:#ffffff;border-top:none;border-bottom:1pt solid #000001;border-left:1pt solid #000001;border-right:1pt solid #000001;padding-top:0.097cm;padding-bottom:0.097cm;padding-left:0.062cm;padding-right:0.097cm;"| For Assignment:  
  
Extract the gene "'''HIV1gp3'''" at positions 4587 to 5165 from the genomic sequence of HIV.
+
Extract the gene "'''HIV1gp3'''" at positions 4587 to 5165 from the '''genomic''' sequence of HIV.
  
 
The file “'''HIV.gb'''” is included in code files of this tutorial.
 
The file “'''HIV.gb'''” is included in code files of this tutorial.

Revision as of 15:17, 28 September 2015

Visual Cue
Narration
Slide Number 1

Title Slide

Hello everyone.


Welcome to this tutorial on Writing Sequence Files.

Slide Number 2

Learning Objectives

In this tutorial, we will learn how to,


  • Create Sequence Record Objects.
  • Write sequences files.
  • Convert between file formats.
  • And sort records in a file by length.


Slide Number 3

Pre-requisites

To follow this tutorial you should be familiar with,


  • Undergraduate Biochemistry or Bioinformatics
  • And basic Python programming

Refer to the Python tutorials at the given link.

Slide Number 4

System Requirement

To record this tutorial I am using
  • Ubuntu OS version. 14.10
  • Python version 2.7.8
  • Ipython interpretor version 2.3.0
  • And Biopython 1.64


Slide Number 5

SeqIO functions

We have earlier learnt about

parse and read functions to read contents of a file.


In this tutorial we will learn how to use write function to write sequences to a file.


And use Convert function for interconversion between various file formats.


Let me now demonstrate how to use write function.

Navigate to the file, “example-insulin”.


Open text file “example-insulin” in a text editor.


Highlight sequence, GI accession number.


Here is a text file with a protein sequence.

The sequence shown here is insulin protein.


The file also has information such as GI accession number and also description.


We will now create a file for this sequence in FASTA format.


The first step is to create sequence record object.

Slide Number 6

Sequence Record Objects

More information about Sequence Record Objects:


It is the basic data type for the sequence input/output interface.


In sequence record object, a sequence is associated with higher level features:

such as identifiers and descriptions.

Press ctrl, alt ant t keys simultaneously on the keyboard. Open the terminal by pressing ctrl, alt and t keys simultaneously .
Type,


from Bio.Seq import Seq

from Bio.SeqRecord import SeqRecord

from Bio.Alphabet import generic_protein

At the prompt type ipython, press enter.


At the prompt type the following lines:


from Bio dot Seq module import Seq class.

from Bio dot SeqRecord module import Sequence Record class

from Bio dot Alphabet module import generic protein class



Type,


record1 = SeqRecord(Seq(“MALWMRLL


PLLALLALWGPDPAAAFVN


QHLCGSHLVEALYLVCGER


GFFYTPKTRREAEDLQVGQVELGG” \

+ “GPGAGSLQPLALEG


SLQKRGIVEQCCTSICSLY


QLENYCN”, generic_protein),


id= “gi|386828|gb|AAA59172.1”,

description= “insulin [Homo sapiens]”)


Next I will save the sequence record object in a variable record1.


Copy the sequence, id and description from the text file and paste in the respective lines on the terminal.


Press Enter.

Type, record1.

Press enter.


Highlight the output.

To view the output, type, record1.

Press enter.


The output shows the insulin protein sequence as sequence record object.

It shows the sequence along with id and description.

Type,


from Bio import SeqIO


SeqIO.write(record1, "example.fasta", "fasta")


Press enter

We will use write function to convert the above sequence object to a FASTA file.


Import SeqIO module from Bio package.


Next type the command line with a write function to convert the sequence object to FASTA file.


write function takes 3 arguments, first one is the variable storing the sequence record object.


Second is the file name to write the FASTA file.


The Third is the file format to write.


Press enter.

Highlight the output.


Cursor on the terminal.

Output shows “one”, that is we have converted one sequence record object to a FASTA file.


The file in FASTA format is saved in the home folder as "example.fasta".


Let me warn you:

The output will over-write any pre-existing file of the same name.

Navigate to the home folder and click on the file, “my_example.fasta”.


To view the file


Navigate to the file in the home folder.


Open this file in a text editor.


The protein sequence is now in FASTA format.


Close the text editor.

Cursor on the terminal. Many bioinformatics tools take different input file formats.


So sometimes there is a need to interconvert between sequence file formats.


We can do file conversions using convert function in SeqIO module.

Navigate to home folder and click on “HIV.gb” .


Open HIV.gb using text editor.


Scroll down the page.


Close the text editor.

For demonstration I will convert a GenBank file to a FASTA file.


I have a GenBank file in my home folder.


Let me open this in a text editor.


This file contains HIV genome in GenBank format.


This GenBank file has descriptions of all the genes in the genome in the first part of the file.


It is followed by a complete genome sequence.


Close the text editor.

Type the following lines on the terminal.


>>>from Bio import SeqIO

SeqIO.convert("HIV.gb", "genbank", "HIV.fasta", "fasta")


press enter


Type the following lines on the terminal.


Here the convert function converts the complete genome sequence present in the GenBank file to FASTA file.

Press enter


The new file in FASTA format is now saved as HIV.fasta in the home folder.

Navigate to the file and open my_example-2.fasta.

Close the text editor.

Navigate to the file and open in the text editor.


Close the text editor.

Slide Number 7

Limitations of convert function.

Even though we can convert the file formats easily using convert function, it has limitations.


Writing some formats requires information which other file formats don’t contain.


For example: We can convert a GenBank file to a FASTA file, we can't do the reverse.


Similarly we can turn a FASTQ file into a FASTA file, but can’t do the reverse.

Cursor on the terminal.

Type

>>> from Bio import SeqIO

>>> help(SeqIO.convert)

For more information regarding convert function, type the help command.


Press enter.

Press “q” on the key board. Press “q” on the key board to get back to the prompt.
Cursor on the terminal.


We can also extract individual genes from the HIV genome in GenBank format.


These individual genes can be saved in FASTA or any other formats.

Type the following at the prompt:


from Bio import SeqIO

f = open('HIV_gene.fasta', 'w')

for genome in SeqIO.parse('HIV.gb','genbank'):

for gene in genome.features:

if gene.type == "CDS":

gene_seq = gene.extract(genome.seq)

gi = str(gene.qualifiers['db_xref']).split(":")[1].split("'")[0]

f.write(">GeneId %s %s\n%s\n" % (gi, gene.qualifiers['product'], gene_seq))

f.close()


Press enter


For this type the following code at the prompt.


This code will write all individual CDS gene sequences , their ids and name of the gene in a file.


This file is saved as “HIV_geneseq.fasta” on your home folder.Press enter

Navigate to home folder and open “hemoglobin.fasta”. Using Biopython tools we can sort the records in a file by length.


Here I have opened a FASTA file “hemoglobin.fasta” which has six records.


Each record is of a different length.

At the prompt type,


from Bio import SeqIO

records = list(SeqIO.parse("hemoglobin.fasta","fasta"))

records.sort(cmp=lambda x,y: cmp(len(y),len(x)))

SeqIO.write(records, "sorted_hemoglobin.fasta", "fasta")


Type the following lines to arrange the longest record first.


The new file with the sorted sequences will be saved as "sorted_hemoglobin.fasta" in your home folder



Cursor on the terminal. For Short records first, reverse the arguments in the records.sort command line.
Slide Number 8

Summary

Lets Summarize,

In this tutorial we have learnt to,

  • Create Sequence Record Objects.
  • Write sequence files using write function of Sequence Input/Output module( Bio.SeqIO.
  • Convert between sequence file formats using convert function.
  • And sort records in a file by length.


Slide Number 9

Assignment


from Bio import SeqIO

record = SeqIO.read(“HIV.gb”, “genbank”)

record

sub_record = record [4587:5165] # GI = 19172951, ID 155459, “HIV1gp3”

SeqIO.write (sub_record, “sub_record-2.fasta”, “fasta”)

For Assignment:

Extract the gene "HIV1gp3" at positions 4587 to 5165 from the genomic sequence of HIV.

The file “HIV.gb” is included in code files of this tutorial.


Your completed assignment should have the following code.



Slide Number 10

Acknowledgement

The video at the following link summarizes the Spoken Tutorial project.

Please download and watch it.

Slide Number 11 The Spoken Tutorial Project Team conducts workshops and gives certificates for those who pass an online test.

For more details, please write to us.

Slide number 12 Spoken Tutorial Project is funded by NMEICT, MHRD, Government of India.

More information on this Mission is available at the link shown.

Slide number 12 This is Snehalatha from IIT Bombay signing off. Thank you for joining.

Contributors and Content Editors

PoojaMoolya, Snehalathak