Difference between revisions of "Python/C3/Parsing-data/English-timed"

From Script | Spoken-Tutorial
Revision as of 13:03, 15 March 2013

Timing Narration
0:00 Hello friends and welcome to the tutorial on "Parsing Data".
0:06 At the end of this tutorial, you will be able to,
  1. Split a string using a delimiter.
  2. Remove the whitespace around the string.
  3. Convert the datatypes of variables from one type to another.
0:18 Before beginning this tutorial, we suggest that you complete the tutorial on "Getting started with Lists".
0:23 Now, invoke the ipython interpreter by typing ipython on your terminal.
0:32 As you can see, each record in the file sslc.txt consists of fields separated by a semi-colon.
0:51 The first field is the region code, followed by the roll number, name, marks of second language, first language, maths, science and social, and total marks.
1:06 Our job is to calculate the arithmetic mean of all the maths marks in the region "B".
1:14 Now let us understand what is meant by 'parsing data'.
1:19 From the input file, we can see that the data we have is in the form of text.
1:25 Parsing this data is all about reading it and converting it into a form which can be used for computations -- in our case, it will be a sequence of numbers.
1:40 We can clearly see that the problem involves reading files and tokenizing.
1:45 Let us learn about tokenizing strings.
1:48 We shall define a string first. Type
line = "parse this           string"
2:05 We are now going to split this string on white space. Type

line.split()

2:17 As you can see, we get a list of strings, which means, when split is called without any arguments, it splits on whitespace.
2:24 In simple words, all the spaces are treated as one big space.
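The behaviour just described can be sketched as follows. This is a minimal illustration that reuses the variable name line from the narration; the exact number of spaces inside the string is arbitrary.

```python
# A string with a long run of spaces between two of its words.
line = "parse this           string"

# With no argument, split() treats any run of whitespace as a single separator.
tokens = line.split()
print(tokens)  # ['parse', 'this', 'string']
```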
2:29 The function split can also split on a string of our choice.
2:34 This is achieved by passing that as an argument.
2:36 But first let's define a sample record from the file.

record = "A;015163;JOSEPH RAJ S;083;042;47;0;72;244"

Now type record.split(';')


3:12 We can see that the string is split on the semi-colon and we get each field separately.
3:18 Note that if a record had two semi-colons with nothing in between, an empty string would appear in the list at that position.
3:25 In short, split splits on white space if called without an argument and splits on the given argument if it is called with an argument.
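As a rough sketch of splitting on a chosen delimiter, the snippet below uses the sample record from the narration. The second record, with_gap, is a hypothetical variation (not taken from the file) that only serves to show what two adjacent semi-colons produce.

```python
# The sample record from the tutorial: fields separated by semi-colons.
record = "A;015163;JOSEPH RAJ S;083;042;47;0;72;244"
fields = record.split(';')
print(fields)
# ['A', '015163', 'JOSEPH RAJ S', '083', '042', '47', '0', '72', '244']

# Hypothetical record with nothing between two semi-colons:
# the missing field shows up as an empty string.
with_gap = "A;015163;JOSEPH RAJ S;083;042;47;;72;244"
gap_fields = with_gap.split(';')
print(repr(gap_fields[6]))  # ''
```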
3:33 Pause the video here, try out the following exercise and resume the video
3:39 Split the variable line using a space as argument.
3:43 Is it the same as splitting without an argument?

Type on terminal line.split(' ')

3:57 We see that when we split on space, multiple whitespaces are not clubbed as one and there is an empty string every time there are two consecutive spaces.
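The difference between the two calls can be seen side by side in the small sketch below. The string is an assumed stand-in for line, with exactly three spaces before the last word.

```python
line = "parse this   string"  # three spaces before "string"

no_arg = line.split()       # runs of whitespace collapse into one separator
on_space = line.split(' ')  # every single space is a separator

print(no_arg)    # ['parse', 'this', 'string']
print(on_space)  # ['parse', 'this', '', '', 'string']
```

Three consecutive spaces give two empty strings, one for each pair of adjacent separators.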
4:09 Now that we know how to split a string, we can split the record and retrieve each field separately.
4:16 But there is one problem.
4:17 The region code "B" and a "B" surrounded by whitespace are treated as two different regions.
4:23 Hence, we must find a way to remove all the whitespace around a string so that the region code "B" and a "B" with white space around it are treated as the same.
4:32 This is possible by using the strip method of strings.
4:36 Let us define a string by typing

unstripped = "     B     "

unstripped.strip()
5:01 We can see that strip removes all the white space around the sentence.
5:07 Pause the video here, try out the following exercise and resume the video
5:13 What happens to the white space inside the sentence when it is stripped?


5:19 We see that only the whitespace around the sentence is removed; the whitespace inside it remains unaffected.


5:54 type
a_str = "         white      space         "
a_str.strip()
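The strip calls above can be checked with a short sketch like this one. The strings are assumed examples with three spaces in each gap; repr is used only to make the remaining spaces visible.

```python
unstripped = "   B   "
print(repr(unstripped.strip()))  # 'B'

# Region codes compare equal only after stripping.
print(unstripped == "B")          # False
print(unstripped.strip() == "B")  # True

# Whitespace at the ends is removed; whitespace inside is kept.
a_str = "   white   space   "
stripped = a_str.strip()
print(repr(stripped))  # 'white   space'
```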
By now we know enough to separate fields from the record and to strip out any white space.
6:06 The only roadblock we now have is the conversion of a string to a float.
6:12 The splitting and stripping operations are done on a string and their result is also a string.
6:17 Hence the marks that we have, are still strings and mathematical operations are not possible on them.
6:21 We must convert them into numbers (integers or floats), before we can perform mathematical operations on them.
6:31 So, we shall now look at converting strings into floats.
6:33 We define a float string first.
6:36 So Type
mark_str = "1.25"
mark = float(mark_str)
type(mark_str)
type(mark)
7:22 We can see that the string is converted to a float.
7:26 We can perform mathematical operations on them now.
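A minimal sketch of the string-to-float conversion follows. The printed type names are those shown by Python 3; an older interpreter may display them differently.

```python
mark_str = "1.25"
mark = float(mark_str)  # parse the text into a floating point number

print(type(mark_str))  # <class 'str'>
print(type(mark))      # <class 'float'>

# Arithmetic now works on the converted value.
print(mark + 1)        # 2.25
# In Python 3, mark_str + 1 would raise a TypeError instead.
```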
7:29 Pause the video here, try out the following exercise and resume the video
7:39 What happens if you do int("1.25")?
7:46 So type int("1.25") in the terminal; it raises an error, because converting a float string directly to an integer is not possible.
8:00 The conversion needs an intermediate step through float; hence you have to convert in two stages.
8:08 So type dcml_str = "1.25"; then flt = float(dcml_str); then type flt
8:45 Then type number = int(flt); then type number.


9:00 Using int, it is also possible to convert floats into integers.
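The two-stage conversion can be sketched as below, reusing the names dcml_str, flt and number from the narration. The try/except is an addition for illustration, so that the failing call does not stop the script.

```python
dcml_str = "1.25"

try:
    number = int(dcml_str)  # a float string cannot be parsed as an integer
except ValueError as err:
    print("int() failed:", err)

flt = float(dcml_str)  # string -> float
number = int(flt)      # float -> int, truncating towards zero
print(flt, number)     # 1.25 1
print(int(-1.25))      # -1
```

Note that int truncates rather than rounds, so the fractional part is simply dropped.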
9:05 Now that we have all the machinery required to parse the file, let us solve the problem.
9:10 We first read the file line by line and parse each record.
9:14 We then see if the region code is B and store the marks accordingly.
9:26 So type math_marks_B = []

for line in open("/home/fossee/sslc.txt"):
    fields = line.split(";")                    # fields are separated by semi-colons
    region_code = fields[0]
    region_code_stripped = region_code.strip()  # drop white space around the code
    math_mark_str = fields[5]                   # maths marks are the sixth field
    math_mark = float(math_mark_str)
    if region_code_stripped == "B":             # compare the stripped code
        math_marks_B.append(math_mark)

Then hit Enter.
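Written out as a script, the loop might look like the sketch below. To keep it runnable without the data file, io.StringIO stands in for open("/home/fossee/sslc.txt"), and the two region "B" records are hypothetical, made up only to follow the same field layout.

```python
import io

# Hypothetical records in the sslc.txt layout; the first line is the
# sample record from the tutorial, the other two are invented.
sample = io.StringIO(
    "A;015163;JOSEPH RAJ S;083;042;47;0;72;244\n"
    " B ;015164;STUDENT ONE;050;060;70;55;65;300\n"
    "B;015165;STUDENT TWO;040;045;80;50;60;275\n"
)

math_marks_B = []
for line in sample:
    fields = line.split(';')                 # tokenize the record
    region_code_stripped = fields[0].strip() # " B " and "B" become the same
    math_mark = float(fields[5])             # maths marks are the sixth field
    if region_code_stripped == "B":
        math_marks_B.append(math_mark)

print(math_marks_B)  # [70.0, 80.0]
```

Comparing the stripped code rather than the raw field is what lets the record with " B " count towards region "B".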


12:37 Now we have all the math marks of region "B" in the list math_marks_B.
12:45 To get the mean, we just have to sum the marks and divide by the length.

type

math_marks_mean = sum(math_marks_B) / len(math_marks_B)
math_marks_mean
13:24 Hence we get our final output.
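The mean computation can be sketched on its own as follows. The marks are hypothetical values, and the check for an empty list is an added safeguard that the narration does not mention.

```python
math_marks_B = [70.0, 80.0, 45.0]  # hypothetical marks, not from the file

if math_marks_B:  # avoid dividing by zero when no record matched
    math_marks_mean = sum(math_marks_B) / len(math_marks_B)
    print(math_marks_mean)  # 65.0
```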
13:27 This is how we read and split such a large amount of data and perform computations on it.
13:32 So, this brings us to the end of the tutorial.
13:37 In this tutorial, we have learnt to,
13:38 1. Tokenize a string using various delimiters like semi-colons.
13:44 2. Split data separated by delimiters using the function split().
13:50 3. Get rid of extra white space around a string using the strip() function.
13:55 4. Convert data types of numbers from one type to another.
13:59 5. Parse input data and perform computations on it.
14:05 Here are some self assessment questions for you to solve
14:12 1. How do you split the string "Guido;Rossum;Python" to get the words?
14:26 2. How will you remove the extra whitespace in this sentence " Hello World "?
14:34 3. What does int("20.0") produce? The options are:
14:40 20
14:42 20.0
14:43 Error
14:44 "20"


14:47 And now, the answers.
14:50 1. We can split the string on the semi-colons by passing ';' as an argument to the split function, as line.split(';')
15:03 2. " Hello World ".strip() will remove the extra whitespace around the string.
15:11 3. int("20.0") will give an error, because converting a float string, "20.0", directly into an integer is not possible.
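The three answers can be verified with a short sketch such as this one. The variable names words and greeting are chosen here for illustration only.

```python
# 1. Split on semi-colons.
words = "Guido;Rossum;Python".split(';')
print(words)  # ['Guido', 'Rossum', 'Python']

# 2. Remove the whitespace around the sentence.
greeting = " Hello World ".strip()
print(repr(greeting))  # 'Hello World'

# 3. int() rejects a float string; go through float() first.
try:
    int("20.0")
except ValueError:
    print("Error")
print(int(float("20.0")))  # 20
```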


15:25 Hope you have enjoyed this tutorial and found it useful.
15:28 Thank you.

Contributors and Content Editors

Gaurav, Minal, PoojaMoolya, Ranjana, Sneha