Python/C3/Parsing-data/English
Pravin1389 (Talk | contribs) |
Latest revision as of 12:32, 2 December 2012
Visual Cue | Narration |
---|---|
Show Slide 1
Containing title, name of the production team along with the logo of MHRD |
Hello friends and welcome to the tutorial on "Parsing Data". |
Show Slide 2
Learning objectives |
At the end of this tutorial, you will be able to,
|
Show Slide 3
Pre-requisite slide |
Before beginning this tutorial, we suggest that you complete the tutorial on "Getting started with Lists". |
Open the terminal
ipython |
Invoke the ipython interpreter by typing ipython on your terminal. |
Open the file sslc.txt and show | Let us start this tutorial with the help of an exercise.
There is an input file containing a large number of records. Each record corresponds to a student. |
Show Slide 4
'Data set' |
As you can see, each record consists of fields separated by a semicolon. The first field is the region code, followed by the roll number, the name, the marks in second language, first language, maths, science and social, and finally the total marks.
Our job is to calculate the arithmetic mean of all the maths marks in the region "B". |
Open the file sslc.txt and show | Now let us understand what is meant by 'parsing data'. From the input file, we can see that the data we have is in the form of text. Parsing this data is all about reading it and converting it into a form which can be used for computations. In our case, that form will be a sequence of numbers.
We can clearly see that the problem involves reading files and tokenizing. |
Switch to the terminal
line = "parse this string" |
Let us learn about tokenizing strings. We shall define a string first. Type |
line.split() | We are now going to split this string on whitespace. |
As you can see, we get a list of strings, which means, when split is called without any arguments, it splits on whitespace. In simple words, all the spaces are treated as one big space. | |
record = "A;015163;JOSEPH RAJ S;083;042;47;0;72;244"
record.split(';') |
The function split can also split on a string of our choice. This is achieved by passing that string as an argument. But first, let us define a sample record from the file. |
We can see that the string is split on the semicolon and we get each field separately. If a record had two semicolons with nothing in between, an empty string would appear in the list at that position.
In short, split splits on whitespace if called without an argument and splits on the given argument if it is called with an argument. Pause the video here, try out the following exercise and resume the video | |
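The behaviour of split with a delimiter can be sketched as follows. The first record is the sample from this tutorial. The second record, with a missing field, is a made-up variant used only to show how an empty field appears in the result.

```python
# Sample record from the tutorial; fields are separated by semicolons.
record = "A;015163;JOSEPH RAJ S;083;042;47;0;72;244"

fields = record.split(";")
print(fields)       # a list of nine strings
print(fields[5])    # the maths mark, still a string: '47'

# Hypothetical record (not from the data file) with one field left empty:
# two adjacent semicolons produce an empty string in the list.
record_with_gap = "A;015163;JOSEPH RAJ S;083;042;47;;72;244"
fields_with_gap = record_with_gap.split(";")
print(fields_with_gap[6] == "")   # True
```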
Show Slide 5
Assignment 1 |
Split the variable line using a space as the argument. Is it the same as splitting without an argument? |
Continue from paused state Switch to the terminal
line.split(' ') |
Switch to terminal for the solution |
Show Slide 6
Solution 1 |
We see that when we split on a space, consecutive spaces are not merged into one, and there is an empty string every time there are two consecutive spaces. |
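The difference between the two calls is easiest to see on a string that contains a run of spaces. The string below is a hypothetical example with three spaces between two of the words. It is not necessarily the exact string typed in the video.

```python
# Hypothetical string: three spaces between "this" and "string".
spaced = "parse this   string"

no_arg = spaced.split()        # runs of whitespace act as one separator
with_space = spaced.split(" ") # every single space is a separator

print(no_arg)      # ['parse', 'this', 'string']
print(with_space)  # ['parse', 'this', '', '', 'string']
```

Three consecutive spaces give two empty strings, because an empty string sits between each pair of adjacent separators.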
Switch to the terminal | Now that we know how to split a string, we can split the record and retrieve each field separately. But there is one problem. The region code "B" and a "B" surrounded by whitespace are treated as two different regions. Hence, we must find a way to remove all the whitespace around a string, so that the region code "B" and a "B" with whitespace around it are treated as the same.
This is possible by using the strip method of strings. Let us define a string by typing |
unstripped = " B "
unstripped.strip() |
We can see that strip removes all the whitespace around the sentence.
Pause the video here, try out the following exercise and resume the video |
Show Slide 7
Assignment 2 |
What happens to the whitespace inside the sentence when it is stripped? |
Continue from paused state Switch to the terminal
a_str = " white space "
a_str.strip() |
Switch to the terminal for solution |
We see that only the whitespace around the sentence is removed; the whitespace inside the sentence remains unaffected. | |
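A small sketch of strip on both kinds of input. The variable names and the exact amount of padding are chosen for illustration only.

```python
# Padding on both sides and two spaces inside the sentence (illustrative).
a_str = "   white  space   "
stripped = a_str.strip()
print(repr(stripped))   # 'white  space' (the inner spaces are untouched)

# A region code read from a file may carry stray whitespace.
unstripped = " B "
print(unstripped == "B")           # False
print(unstripped.strip() == "B")   # True
```

Comparing the stripped value is what makes "B" and " B " count as the same region.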
By now we know enough to separate fields from the record and to strip out any whitespace. The only roadblock we now have is the conversion of a string to a float.
The splitting and stripping operations are done on a string and their result is also a string. Hence the marks that we have are still strings, and mathematical operations are not possible on them. We must convert them into numbers (integers or floats) before we can perform mathematical operations on them. | |
mark_str = "1.25"
mark = float(mark_str)
type(mark_str)
type(mark) |
We shall now look at converting strings into floats. We define a float string first. Type |
We can see that the string is converted to a float. We can now perform mathematical operations on it.
Pause the video here, try out the following exercise and resume the video | |
Show Slide 8
Assignment 3 |
What happens if you do int("1.25")? |
Continue from paused state Switch to the terminal
int("1.25") |
Switch to the terminal for solution |
dcml_str = "1.25"
flt = float(dcml_str)
flt
number = int(flt)
number |
|
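Calling int("1.25") raises an error, since a string containing a decimal point cannot be converted to an integer directly. The conversion needs an intermediate step: we first convert the string to a float, and then use int to convert the float into an integer.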
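The conversions discussed so far can be put together in one short sketch. The names whole and error_message are introduced here for illustration.

```python
mark_str = "1.25"

mark = float(mark_str)   # string -> float
whole = int(mark)        # float -> int, the fractional part is dropped

print(type(mark_str), type(mark), type(whole))
print(mark + 1)          # arithmetic now works: 2.25

# Converting the decimal string directly to int is not allowed.
error_message = None
try:
    int(mark_str)
except ValueError as err:
    error_message = str(err)
    print("int() failed:", error_message)
```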
Now that we have all the machinery required to parse the file, let us solve the problem. We first read the file line by line and parse each record. We then see if the region code is B and store the marks accordingly. | |
math_marks_B = [] # an empty list to store the marks
for line in open("/home/fossee/sslc.txt"):
    fields = line.split(";")
    region_code = fields[0]
    region_code_stripped = region_code.strip()
    math_mark_str = fields[5]
    math_mark = float(math_mark_str)
    if region_code_stripped == "B":
        math_marks_B.append(math_mark) |
|
math_marks_mean = sum(math_marks_B) / len(math_marks_B)
math_marks_mean |
Now we have all the math marks of region "B" in the list math_marks_B. To get the mean, we just have to sum the marks and divide by the length. |
Hence we get our final output. This is how we read and split such a large data set and perform computations on it. | |
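As a recap, the whole procedure can be wrapped in a function and tried on a few in-memory lines. This is only a sketch: the function name mean_math_mark, the student rows for region "B" and the non-numeric mark "AA" are invented for illustration, and skipping rows whose mark cannot be converted is an assumption about how bad data should be handled, not something the tutorial prescribes.

```python
def mean_math_mark(lines, region="B"):
    """Return the mean maths mark for one region, or None if no marks are found."""
    marks = []
    for line in lines:
        fields = line.split(";")
        if len(fields) < 6:
            continue                      # not a complete record
        if fields[0].strip() != region:
            continue                      # some other region
        try:
            marks.append(float(fields[5].strip()))
        except ValueError:
            continue                      # assumption: skip non-numeric marks
    if not marks:
        return None
    return sum(marks) / len(marks)


# Made-up records in the same semicolon-separated layout as sslc.txt.
sample_lines = [
    "A;015163;JOSEPH RAJ S;083;042;47;0;72;244\n",
    " B ;000001;STUDENT ONE;050;060;40;55;65;270\n",
    "B;000002;STUDENT TWO;050;060;60;55;65;290\n",
    "B;000003;STUDENT THREE;050;060;AA;55;65;230\n",
]

print(mean_math_mark(sample_lines))   # 50.0
```

Because the function accepts any iterable of lines, an open file object can be passed to it in the same way as the list above.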
Show Slide 9
Summary slide |
This brings us to the end of the tutorial. In this tutorial, we have learnt to,
|
Show Slide 10
Self assessment questions slide |
Here are some self assessment questions for you to solve
|
Show Slide 11
Solution of self assessment questions on slide |
And the answers,
|
Show Slide 12
Acknowledgment slide |
Hope you have enjoyed this tutorial and found it useful. Thank you. |