Biopython/C2/Parsing-Data/Khasi
Time | Narration |
---|---|
00:01 | Hello everyone. Welcome to this tutorial on Parsing Data.
(Khublei baroh. Ngi pdiang sngewbha ia phi sha kane ka tutorial shaphang ka Parsing Data. |
00:06 | In this tutorial, we will learn to download FASTA and GenBank files from the NCBI database website.
(Ha kane ka tutorial, ngin sa nang kumno ban download ia ki FASTA bad GenBank files na ka database website. |
00:14 | And parse data files using functions in the Sequence Input/Output module.
(Bad ban Parse ia ki data files da kaba pyndonkam ia ki functions ha ka Sequence Input/Output module. |
00:19 | To follow this tutorial, you should be familiar with undergraduate biochemistry or bioinformatics
(Ban sngewthuh bha ia kane ka tutorial, phi dei ban long kiba shemphang ha ka undergraduate biochemistry lane bioinformatics. |
00:26 | and basic Python programming.
(bad ka basic Python programming. |
00:30 | Refer to the Python tutorials at the given link.
(Peit ia ka Python tutorials na ka link ba ai hapoh. |
00:34 | To record this tutorial, I am using: Ubuntu OS version 14.10
(Ban record ia kane ka tutorial, nga pyndonkam da ka: Ubuntu OS version 14.10 |
00:40 | Python version 2.7.8
(Python version 2.7.8 |
00:44 | IPython interpreter version 2.3.0
(IPython interpreter version 2.3.0 |
00:48 | Biopython version 1.64 and Mozilla Firefox browser 35.0.
(Biopython version 1.64 bad Mozilla Firefox browser 35.0. |
00:56 | Scientific data in biology is generally stored in text files such as FASTA, GenBank, EMBL, Swiss-Prot etc.
(Ki scientific data jong ka biology ju store barabor ia ki ha ka text file kum FASTA, GenBank, EMBL, Swiss-Prot kumta ter ter. |
01:07 | Data files can be downloaded from the database websites.
(Ia ki data files lah ban download na ka database websites. |
01:12 | Open the website link given below, in any web browser.
(Plie ia ka website link ba lah ai harum, da uno uno u web browser. |
01:17 | A web-page opens.
(Ka web-page kan sa plie. |
01:19 | Let us download FASTA and GenBank files for the human insulin gene.
(To ngin ia download FASTA bad GenBank files na ka bynta ka human insulin gene. |
01:25 | In the search box, type "human insulin" and click on the Search button.
(Ha ka search box, type: "human insulin", click ha Search button. |
01:31 | The web-page shows many files for the human insulin gene.
(Ka web-page kan pyni bun ki files na ka bynta ka human insulin gene. |
01:35 | For demonstration, I will select 4 files with the name “Homo sapiens Insulin mRNA”.
(Ban pyni nuksa, ngan jied 4 tylli ki files kiba kyrteng “Homo sapiens Insulin mRNA”. |
01:43 | I will choose files that have less than 500 base pairs.
(Ngan jied ia ki files ba duna ia ka 500 base pairs. |
01:48 | Click on the check-box to select a file for download.
(Click ha ka check-box ban select ia ka file ban download. |
01:56 | Bring the cursor to the “Send to” option, located at the top right corner of the page.
(Wanrah ia u cursor sha “Send to” option, kaba don ha jrong duh sha ka liang ka mon jong ka page. |
02:02 | Click on the small selection button with a down arrow, present next to the “Send to” button.
(Click ha i selection button barit ba don u khnam ba kdew shapoh, ba don hajan ka “Send to” button. |
02:09 | Under the heading “Choose destination”, click on File option.
(Hapoh ka heading “Choose destination”, click ha File option. |
02:13 | You can save this file in any file format, listed under format drop-down list box.
(Phi lah ban save ia kane ka file ha kano kano ka format, kiba don hapoh format drop-down list box. |
02:21 | Choose FASTA from the given options.
(Jied FASTA na ki options ba ai hapoh. |
02:25 | Then click on Create file option.
(Nangta sa click ha Create file option. |
02:29 | A dialog-box appears on the screen.
(Ka dialog-box kan sa mih ha ka screen. |
02:32 | Select Open with, click on OK.
(Jied Open with, click ha OK. |
02:36 | A file opens in a text editor.
(Ka file kan sa mih ha ka text editor. |
02:39 | The file shows 4 records, since we had selected four files to download.
(Kane ka file ka pyni 4 tylli ki records, namar ngi la jied ban plie 4 tylli ki files ban download. |
02:46 | The first line in each record is an identifier line.
(U line banyngkong ha kawei pa kawei ka record u dei u identifier line. |
02:50 | It starts with a “greater than (>)” symbol.
(Un sdang da u “greater than (>)” symbol. |
02:53 | This is followed by a sequence.
(Nangta sa bud sa da u sequence. |
02:56 | Save the file in your home folder as “sequence.fasta”.
(Save ia ka file ha home folder kum “sequence.fasta”. |
03:01 | Close the text editor.
(Khang ia ka text editor. |
03:03 | Follow the same steps as above, to download the files in GenBank format
(Leh kumjuh ka rukom kum haneng ban download ia ki files ha GenBank format |
03:08 | for the same files selected earlier.
(na ka bynta ki files ba jied nyngkong . |
03:12 | Select the file format as GenBank.
(Select ia ka file format kum GenBank. |
03:16 | Create a file. Open with a text editor.
(Shna ia ka file. Plie da u text editor. |
03:21 | Notice that the sequence file in GenBank format has more features than a FASTA file.
(Ha khmih ba ka sequence file ha GenBank format ka kham bun features ban ia ka FASTA file. |
03:27 | Save the file as "sequence.gb" in your home folder. Close the text editor.
(Save ia ka file kum "sequence.gb" ha ka home folder. Khang noh u text editor. |
03:34 | For demonstration purposes, we need a FASTA file with a single record.
(Ban peit nuksa, ngi donkam ia ka FASTA file ba don record tang iwei. |
03:39 | For this, clear the earlier selection by again clicking on the check boxes.
(Na ka bynta kane, pynkhuid ia ki selection ba nyngkong da kaba click biang ha ki check boxes. |
03:48 | Now, select the file “Human insulin gene complete cds”.
(Mynta, select ia ka file “Human insulin gene complete cds”. |
03:54 | Click on the check-box.
(Click ha ka check-box. |
03:57 | And follow the same steps shown earlier to save the file in the home folder.
(Bad sa bud ia ki rukom kumba la pyni nyngkong ban save ia ka file ha ka home folder. |
04:01 | Save the file as "insulin.fasta".
(Save ia ka file kum "insulin.fasta". |
04:08 | Biological data stored in these files can be extracted and modified using Biopython libraries.
(Ki Biological data ba lah store ha kine ki files lah ban sei bad pynkylla da kaba pyndonkam da ka Biopython libraries. |
04:16 | Close the text-editor.
(Khang ia u text-editor. |
04:19 | Extracting data from data files is called parsing.
(Ban sei ia ki data na data files ki ju khot Parsing. |
04:23 | Most file formats can be parsed using functions available in the SeqIO module.
(Bun ia kum kine jait file formats lah ban parsed da kaba pyndonkam functions kiba don ha SeqIO module. |
04:30 | The most commonly used functions of the SeqIO module are: parse, read, write and convert.
(Kiba kham paw ki functions jong ka SeqIO module dei : parse, read, write bad convert. |
04:38 | Open the terminal by pressing Ctrl, Alt and t keys simultaneously.
(Plie ia ka terminal da kaba nion sah ia Ctrl, Alt bad t keys. |
04:44 | Start IPython by typing "ipython" at the prompt. Press Enter.
(Plie ia ka IPython da kaba type "ipython" ha ka prompt. Nion Enter. |
04:51 | Next, import the "SeqIO" module from the Bio package.
(Nangta, import "SeqIO" module na Bio package. |
04:56 | At the prompt, type: from Bio import SeqIO. Press Enter.
(Ha ka prompt, type: from Bio import SeqIO. Nion Enter. |
05:04 | We will start with the most important function, “parse”.
(Ngin ia sdang da ka function kaba kham donkam ka “parse”. |
05:07 | For demonstration, I will use a FASTA file that has many records which we had downloaded earlier from the database.
(Na ka bynta ka nuksa, ngan pyndonkam ka FASTA file ka ba don bun records kaba ngi download shen na ka database. |
05:17 | For simple FASTA parsing, type the following at the prompt.
(Na ba bynta ka FASTA parsing ba kham suk, type kumne harum ha ka prompt. |
05:22 | Here, we are using the parse function to read the contents of the sequence.fasta file.
(Hangne, ngi pyndonkam ia ka parse function ban read ia ki contents jong ka sequence.fasta file. |
05:30 | For the output, print record id, sequence present in the record and also the length of the sequence.
(Na ka bynta ka output, print record id, sequence kaba don ha ka record bad ruh ia ka jingjrong jong ka sequence. |
05:41 | Also notice that the parse function is used to read sequence data as Sequence record objects.
(Peit bha ruh ba ia ka parse function shait pyndonkam ban read sequence data kum Sequence record objects. |
05:48 | It is generally used with a for loop.
(Shait pyndonkam barabor bad ka for loop. |
05:52 | It accepts two arguments: the first is the name of the file to read the data from.
(Ka lah ban shim ar tylli ki arguments, kaba nyngkong dei ka file name ban read ia ka data. |
05:59 | The second specifies the file format.
(Ka ba ar pat ka pyntikna ia ka format jong ka file. |
06:02 | Press Enter key twice to get the output.
(Nion ia u Enter key ar sien ban ioh ia ka output. |
06:07 | The output shows the identifier line, followed by the sequence contained in the file, also the length of the sequence for all the records in the file.
(Ka output ka pyni ia u identifier line, nangta bud sa u sequence uba don ha ka file, bad ruh ka jingjrong jong ka sequence na ka bynta baroh ki records ha ka file. |
06:21 | Notice that the FASTA format does not specify the alphabet.
(Phin shem ba ka FASTA format kan nym pyntikna ia u alphabet. |
06:26 | So, the output does not specify it as a DNA sequence.
(Te, ka output kan nym pyni ia ka kum ka DNA sequence. |
06:31 | The same steps can be repeated to parse a GenBank file.
(Ki juh ki rukom lah ban bud ban parsing ia ka GenBank file. |
06:36 | For demonstration, we will use the GenBank file which we downloaded earlier from the database.
(Ban pyni nuksa ngin pyndonkam ia ka GenBank file kaba ngi lah download mynne na ka database. |
06:43 | Press up-arrow key to get the lines of code which we had used earlier.
(Nion ia u up-arrow key ban ioh ia ki lines jong ki code kiba ngi lah pyndonkam nyngkong . |
06:49 | Change the file name to sequence.gb .
(Pynkylla ia ka kyrteng jong ka file sha ka sequence.gb . |
06:53 | Change the file format to genbank.
(Pynkylla ia ka file format sha ka genbank. |
06:56 | The rest of the code remains the same.
(Ki code ba sah kin neh kumjuh. |
06:58 | Press Enter key twice to get the output.
(Nion ia u Enter key arsien ban ioh ia ka output. |
07:03 | Here too the output shows the record id, sequence and the length of the sequence for all the records in the file.
(Hangne ruh ka output ka pyni ia ka record id, sequence bad ka jingjrong jong ka sequence na ka bynta baroh ki records ha ka file. |
07:12 | Notice that the GenBank format specifies the sequence as a DNA sequence.
(Sa khmih ba ka GenBank format ka pyntikna ia ka sequence kum ka DNA sequence. |
07:19 | Similarly, Swiss-Prot and EMBL files can be parsed using the same code as above.
(Kumjuh ruh, ia ki Swiss-Prot bad EMBL files lah ban parse da kaba pyndonkam ia u juh u code kum haneng. |
07:27 | If your file contains a single record, then type the following lines for parsing.
(Lada ka file jong phi ka don uwei u record phi hap type ia ki line harum ban leh parsing. |
07:34 | Here, we will use the previously saved FASTA file with a single record, that is, insulin.fasta as an example.
(Hangne, ngin pyndonkam ia FASTA file ba lah save mynne, bad uwei u record, uta u dei insulin.fasta kum ka nuksa. |
07:43 | Notice that we have used the read function instead of the parse function. Press Enter.
(Phin sa shem ba ngi lah dep pyndonkam ia ka read function ha jaka jong parse function. Nion Enter. |
07:50 | The output shows the contents of the file insulin.fasta.
(Ka output ka pyni ki contents na ka bynta ka file insulin.fasta. |
07:55 | It shows the sequence as a sequence record object.
(Ka pyni ia ka sequence kum sequence record object. |
07:59 | And other attributes such as GI, accession number and description.
(Bad kiwei ki attributes kum GI, accession number bad description. |
08:06 | We can also view the individual attributes of this record as follows.
(Ngi lah ruh ban peit ia ki individual attributes jong kane ka record kumne harum. |
08:11 | At the prompt, type: record dot seq. Press Enter.
(Ha ka prompt, type: record dot seq. Nion Enter. |
08:18 | The output shows the sequence present in the file.
(Ka output ka pyni ia ka sequence ba don ha ka file. |
08:22 | To view the identifiers for this record, type: record dot id. Press Enter.
(Ban peit ia ki identifiers jong kane ka record, type: record dot id. Nion Enter. |
08:29 | The output shows the GI number and accession number etc.
(Ka output ka pyni ia u GI number bad accession number bad kumta ter ter. |
08:34 | You can use the functions described above to parse the data files of your choice.
(Phi lah ban pyndonkam ia u function ba lah ong haneng ban parse ia ki data files ha ka rukom kaba ngi mon. |
08:40 | Now, let's summarize.
(To ngin ia khmih ia kiba ngi lah kdew haneng ) |
08:42 | In this tutorial, we have learnt: to download FASTA and GenBank files from the NCBI database website and to use the parse and read functions from the SeqIO module
(Ha kane ka jingbatai (tutorial), ngi lah nang ban: download FASTA bad GenBank files na ka NCBI database website bad pyndonkam ia parse bad read functions na ka SeqIO module. |
08:55 | to extract data such as record ids, description and sequences from FASTA and GenBank files.
(Ban sei ia ki data kum ki record ids, description bad sequences na FASTA bad GenBank files. |
09:03 | Now, for the assignment-
(Mynta, na ka bynta ka assignment- |
09:06 | Download FASTA files for nucleotide sequences of your choice from the NCBI database.
(Download ia ki FASTA files na ka bynta ka nucleotide sequence kiba phi sngewiahap na NCBI database. |
09:13 | Convert the file of sequences to their reverse complements.
(Pynkylla ia ki file jong ki sequences sha reverse complements jong ki. |
09:17 | Your completed assignment should have the following lines of code.
(Ka assignment ba lah dep jong phi ka dei ban don ki lines of code kumne harum. |
09:22 | Use the parse function to load nucleotide sequences from the FASTA file.
(Pyndonkam parse function ban load ia nucleotide sequences na ka FASTA file. |
09:28 | Next, print the reverse complements using the Sequence object’s built-in reverse complement method.
(Nangta, print ia ka reverse complements da kaba pyndonkam ia ka Sequence object’s built in reverse complement method. |
09:37 | The video at the following link summarizes the Spoken Tutorial project.
(Ka video ha ka link hapoh kan pynsngewthuh shuh shuh ia kane ka spoken-tutorial project. |
09:42 | Please download and watch it.
(Sngewbha download bad peit ia ka. |
09:44 | The Spoken Tutorial Project team conducts workshops and gives certificates to those who pass an on-line test.
(Ka Spoken Tutorial Project team ka ju pynlong ia ki workshops bad ai ruh ia ki certificates sha kito ba pass ia ka on-line test. |
09:51 | For more details, please write to us.
(Na ka bynta ka jingtip ba kham bniah, sngewbha thoh sha ngi. |
09:55 | The Spoken Tutorial Project is funded by NMEICT, MHRD, Government of India.
(Ia ka Spoken Tutorial Project la bei tyngka da ka NMEICT, MHRD, Government of India. |
10:01 | More information on this mission is available at the link shown.
(Ki jingtip ba kham pura ia kane ka mission lah ban ioh na ka link harum. |
10:06 | This is Snehalatha from IIT Bombay, signing off. Thank you for joining.
(Nga dei ka Snehalatha na IIT Bombay, signing off. Khublei shibun ) |
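
The code typed at the prompt between 05:17 and 08:22 is not reproduced in the transcript. What follows is only a minimal sketch of what those steps might look like, assuming Python 3 and a recent Biopython release rather than the Python 2.7.8 and Biopython 1.64 used in the recording. The FASTA text, the identifiers `demo1` to `demo3`, and the helper names `write_temp_fasta` and `summarize` are hypothetical stand-ins for the downloaded sequence.fasta and insulin.fasta files. Recent Biopython releases no longer attach an alphabet to sequences, so the printed output will not match the narration at 06:21 and 07:12 exactly. The last loop sketches the assignment described at 09:22.

```python
import os
import tempfile

from Bio import SeqIO

# Two made-up FASTA records standing in for the downloaded sequence.fasta
# (hypothetical identifiers and sequences, not real NCBI data).
FASTA_TEXT = """>demo1 hypothetical record one
ATGGCCCTGTGGATGCGC
>demo2 hypothetical record two
ATGAAATTTGGG
"""


def write_temp_fasta(text):
    """Write FASTA text to a temporary file and return its path."""
    handle = tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False)
    with handle:
        handle.write(text)
    return handle.name


def summarize(path, file_format="fasta"):
    """Return (id, sequence, length) for every record in the file."""
    return [(rec.id, str(rec.seq), len(rec.seq))
            for rec in SeqIO.parse(path, file_format)]


# SeqIO.parse takes a file name and a format, and is used in a for loop.
multi_path = write_temp_fasta(FASTA_TEXT)
summary = summarize(multi_path)
for rec_id, seq, length in summary:
    print(rec_id, seq, length)

# A file with exactly one record can be loaded with SeqIO.read instead.
single_path = write_temp_fasta(">demo3 hypothetical single record\nATGCCGTT\n")
record = SeqIO.read(single_path, "fasta")
print(record.seq)
print(record.id)

# Assignment sketch: reverse complements of every record loaded with parse.
reverse_complements = [str(rec.seq.reverse_complement())
                       for rec in SeqIO.parse(multi_path, "fasta")]
for rc in reverse_complements:
    print(rc)

# Clean up the temporary files created for this demonstration.
os.remove(multi_path)
os.remove(single_path)
```

To try this on the downloaded files, pass "sequence.fasta" to `summarize` in place of the temporary path, or pass "sequence.gb" together with the format "genbank".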