Biopython/C2/Writing-Sequence-Files/English

From Script | Spoken-Tutorial
Revision as of 15:21, 24 September 2015 by Snehalathak (Talk | contribs)

Jump to: navigation, search
Visual Cue
Narration
Slide Number 1

Title Slide

Hello everyone.


Welcome to this tutorial on Writing Sequence Files.

Slide Number 2

Learning Objectives

In this tutorial, we will learn how to,


  • Create Sequence Record Objects.
  • Write sequences files.
  • Convert between file formats.
  • And sort records in a file by length.


Slide Number 3

Pre-requisites

To follow this tutorial you should be familiar with,


  • Undergraduate Biochemistry or Bioinformatics
  • And basic Python programming

Refer to the Python tutorials at the given link.

Slide Number 4

System Requirement

To record this tutorial I am using
  • Ubuntu OS version. 14.10
  • Python version 2.7.8
  • Ipython interpretor version 2.3.0
  • And Biopython 1.64


Slide Number 5

SeqIO functions

We have earlier learnt about

parse and read functions to read contents of a file.


In this tutorial we will learn how to use write function to write sequences to a file.


And use Convert function for interconversion between various file formats.


Let me now demonstrate how to use write function.

Navigate to the file, “example-insulin”.


Open text file “example-insulin” in a text editor.


Highlight sequence, GI accession number.


Here is a text file with a protein sequence.

The sequence shown here is insulin protein.


The file also has information such as GI accession number and also description.


We will now create a file for this sequence in FASTA format.


The first step is to create sequence record object.

Slide Number 6

Sequence Record Objects

More information about Sequence Record Objects:


It is the basic data type for the sequence input/output interface.


In sequence record object, a sequence is associated with higher level features:

such as identifiers and descriptions.

Press ctrl, alt ant t keys simultaneously on the keyboard. Open the terminal by pressing ctrl, alt and t keys simultaneously .
Type,


from Bio.Seq import Seq

from Bio.SeqRecord import SeqRecord

from Bio.Alphabet import generic_protein

At the prompt type ipython, press enter.


At the prompt type the following lines:


From Bio dot Seq module import Seq class.

From Bio dot SeqRecord module import Sequence Record class

From Bio dot Alphabet module import generic protein class



Type,


record1 = SeqRecord(Seq(“MALWMRLL


PLLALLALWGPDPAAAFVN


QHLCGSHLVEALYLVCGER


GFFYTPKTRREAEDLQVGQVELGG” \

+ “GPGAGSLQPLALEG


SLQKRGIVEQCCTSICSLY


QLENYCN”, generic_protein),


id= “gi|386828|gb|AAA59172.1”,

description= “insulin [Homo sapiens]”)


Next I will save the sequence record object in a variable record1.


Copy the sequence, id and description from the text file and paste in the respective lines on the terminal.


Press enter.

Type, record1.

Press enter.


Highlight the output.

To view the output, type, record1.

Press enter.


The output shows the insulin protein sequence as sequence record object.

It shows the sequence along with id and description.

Type,


from Bio import SeqIO


SeqIO.write(record1, "example.fasta", "fasta")


Press enter

We will use write function to convert the above sequence object to a FASTA file.


Import SeqIO module from Bio package.


Next type the command line with a write function to convert the sequence object to FASTA file.


write function takes 3 arguments, first one is the variable storing the sequence record object.


Second is the file name to write the FASTA file.


The Third is the file format to write.


Press enter.

Highlight the output.


Cursor on the terminal.

Output shows “one”, that is we have converted one sequence record object to a FASTA file.


The file in FASTA format is saved in the home folder as "example.fasta".


Let me warn you:

The output will over-write any pre-existing file of the same name.

Navigate to the home folder and click on the file, “my_example.fasta”.


To view the file


Navigate to the file in the home folder.


Open this file in a text editor.


The protein sequence is now in FASTA format.


Close the text editor.

Cursor on the terminal. Many bioinformatics tools take different input file formats.


So sometimes there is a need to interconvert between sequence file formats.


We can do file conversions using convert function in SeqIO module.

Navigate to home folder and click on “HIV.gb” .


Open HIV.gb using text editor.


Scroll down the page.


Close the text editor.

For demonstration I will convert a GenBank file to a FASTA file.


I have a GenBank file in my home folder.


Let me open this in a text editor.


This file contains HIV genome in GenBank format.


This GenBank file has descriptions of all the genes in the genome in the first part of the file.


It is followed by a complete genome sequence.


Close the text editor.

Type the following lines on the terminal.


>>>from Bio import SeqIO

SeqIO.convert("HIV.gb", "genbank", "HIV.fasta", "fasta")


press enter


Type the following lines on the terminal.


Here the convert function converts the complete genome sequence present in the GenBank file to FASTA file.

Press enter


The new file in FASTA format is now saved as HIV.fasta in the home folder.

Navigate to the file and open my_example-2.fasta.

Close the text editor.

Navigate to the file and open in the text editor.


Close the text editor.

Slide Number 7

Limitations of convert function.

Even though we can convert the file formats easily using convert function, it has limitations.


Writing some formats requires information which other file formats don’t contain.


For example: We can convert a GenBank file to a FASTA file, we can't do the reverse.


Similarly we can turn a FASTQ file into a FASTA file, but can’t do the reverse.

Cursor on the terminal.

Type

>>> from Bio import SeqIO

>>> help(SeqIO.convert)

For more information regarding convert function, type the help command.


Press enter.

Press “q” on the key board. Press “q” on the key board to get back to the prompt.
Cursor on the terminal.


We can also extract individual genes from the HIV genome in GenBank format.


These individual genes can be saved in FASTA or any other formats.

Type the following at the prompt:


from Bio import SeqIO

f = open('HIV_gene.fasta', 'w')

for genome in SeqIO.parse('HIV.gb','genbank'):

for gene in genome.features:

if gene.type == "CDS":

gene_seq = gene.extract(genome.seq)

gi = str(gene.qualifiers['db_xref']).split(":")[1].split("'")[0]

f.write(">GeneId %s %s\n%s\n" % (gi, gene.qualifiers['product'], gene_seq))

f.close()


Press enter


For this type the following code at the prompt.


This code will write all individual CDS gene sequences , their ids and name of the gene in a file.


This file is saved as “HIV_geneseq.fasta” on your home folder.

Navigate to home folder and open “hemoglobin.fasta”. Using Biopython tools we can sort the records in a file by length.


Here I have opened a FASTA file “hemoglobin.fasta” which has six records.


Each record is of a different length.

At the prompt type,


from Bio import SeqIO

records = list(SeqIO.parse("hemoglobin.fasta","fasta"))

records.sort(cmp=lambda x,y: cmp(len(y),len(x)))

SeqIO.write(records, "sorted_hemoglobin.fasta", "fasta")


Type the following lines to arrange the longest record first.


The new file with the sorted sequences will be saved as "sorted_hemoglobin.fasta" in your home folder.



Cursor on the terminal. For Short records first, reverse the arguments in the records.sort command line.
Slide Number 8

Summary

Lets Summarize,

In this tutorial we have learnt to,

  • Create Sequence Record Objects.
  • Write sequence files using write function of Sequence Input/Output module( Bio.SeqIO.
  • Convert between sequence file formats using convert function.
  • And sort records in a file by length.


Slide Number 9

Assignment


from Bio import SeqIO

record = SeqIO.read(“HIV.gb”, “genbank”)

record

sub_record = record [4587:5165] # GI = 19172951, ID 155459, “HIV1gp3”

SeqIO.write (sub_record, “sub_record-2.fasta”, “fasta”)

For Assignment:

Extract the gene "HIV1gp3" at positions 4587 to 5165 from the genomic sequence of HIV.

The file “HIV.gb” is included in code files of this tutorial.


Your completed assignment should have the following code.



Slide Number 10

Acknowledgement

The video at the following link summarizes the Spoken Tutorial project.

Please download and watch it.

Slide Number 11 The Spoken Tutorial Project Team conducts workshops and gives certificates for those who pass an online test.

For more details, please write to us.

Slide number 12 Spoken Tutorial Project is funded by NMEICT, MHRD, Government of India.

More information on this Mission is available at the link shown.

Slide number 12 This is Snehalatha from IIT Bombay signing off. Thank you for joining.

Contributors and Content Editors

PoojaMoolya, Snehalathak