Python-3.4.3/C2/Parsing-data/English

Revision as of 11:27, 3 May 2018 by Priyacst (Talk | contribs)

Visual Cue
Narration
Show Slide


Welcome to the spoken tutorial on Parsing-data.
Show Slide

Objectives


In this tutorial, we will learn to-


  • Split a string using a delimiter.
  • Remove the leading and trailing whitespaces in a string and
  • Convert between different built-in datatypes


Show slide

System Specifications

To record this tutorial, I am using


  • Ubuntu Linux 16.04 operating system
  • Python 3.4.3 and
  • IPython 5.1.0


Show Slide

Prerequisite slide

To practice this tutorial, you should know how to use lists.


If not, see the relevant Python tutorials on this website.

Show Slide

Parsing Data


First, let us understand, what is meant by parsing data.


  • Parsing data means reading data that is in text form.
  • The data is then converted into a form which can be used for computations.


Show Slide

split() function


Next we will learn about split() function.


split() function breaks up a larger string into smaller strings using a defined separator.


If no argument is specified, then whitespace is used as default separator.


Syntax is: str dot split inside parentheses argument

Show Slide

split() function

The split function parses a string and returns a list of tokens.


This is called string tokenizing.

Press Ctrl+Alt+T keys Let us first open the terminal by pressing Ctrl+Alt+T keys simultaneously.
Type ipython3 Type, ipython3 and press Enter.
%pylab and press Enter. Let us initialize the pylab package.


Type, percentage sign pylab and press Enter.

str1 = "Welcome to Python tutorials"


Highlight whitespaces


From here onwards, please remember to press the Enter key after typing every command on the terminal.

Let us define a variable str1 as string data type.


Type, str1 is equal to inside double quotes Welcome to insert some whitespaces, then Python tutorials


We can have any number of whitespaces between to and Python tutorials. When we split without an argument, all these spaces are treated as one separator.

str1.split()


Highlight output

Now, we are going to split this string on whitespace.


Type, str1 dot split open and close parentheses.


As you can see, we get a list of strings.
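The split on whitespace can be sketched as follows. This is a minimal sketch, assuming str1 has three spaces between to and Python; the exact number of spaces is an assumption, since the tutorial only says to insert some whitespaces.

```python
# str1 has several spaces between "to" and "Python"
# (three here; the exact count is an assumption).
str1 = "Welcome to   Python tutorials"

# With no argument, split() treats any run of whitespace as one separator.
words = str1.split()
print(words)  # ['Welcome', 'to', 'Python', 'tutorials']
```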

Type

x = "08-26-2009;08-27-2009;08-29-2009"


Let us take another example for split() function with argument.


Type as shown.

Type x.split(';') Type, x dot split inside parentheses inside single quotes semicolon.
Point to the output We get a list of strings, one for each date that was separated by a semicolon.
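Splitting with an argument can be sketched as follows. The second step, which splits one date on the hyphen, is not part of the tutorial and is added here only as an illustration.

```python
x = "08-26-2009;08-27-2009;08-29-2009"

# Split on the semicolon to get one string per date.
dates = x.split(';')
print(dates)  # ['08-26-2009', '08-27-2009', '08-29-2009']

# Illustration only (not in the tutorial): split one date on the hyphen.
month, day, year = dates[0].split('-')
print(month, day, year)  # 08 26 2009
```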
Show Slide


Exercise 1

Pause the video.


Try this exercise and then resume the video.


Split x using space as argument.


Is it the same as splitting without an argument?

Switch to the terminal Switch to the terminal for the solution.
Type, b = x.split() Type, b is equal to x dot split open and close parentheses.



Type, c = x.split(' ') Type, c is equal to x dot split inside parentheses inside single quotes space.
Type, b Type, b
Type, c Type, c
Highlight the output We can see that, for this string, splitting without argument gives the same result as giving space as argument. This is because x does not contain any whitespace.
Show slide: Splitting the string without argument will split the string separated by any number of spaces.


And giving space as argument will split the sentence specifically on single whitespace.

Type str1 Let us recall the variable str1.
Type b= str1.split() Now, we will split this string without argument.


Type, b is equal to str1 dot split open and close parentheses.

Type c=str1.split(' ') Type, c is equal to str1 dot split inside parentheses inside single quotes space.
Type b Type, b
Type c Type, c
Highlight the output As you can see, here b is not equal to c since c has empty strings as entries, one for each extra space, whereas b has only words.
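The difference between the two calls can be sketched as follows. This is a minimal sketch, assuming str1 has three spaces between to and Python; with a different number of spaces, the number of empty strings in c changes.

```python
# Three spaces between "to" and "Python" (the count is an assumption).
str1 = "Welcome to   Python tutorials"

b = str1.split()     # any run of whitespace is one separator
c = str1.split(' ')  # every single space is a separator

print(b)       # ['Welcome', 'to', 'Python', 'tutorials']
print(c)       # ['Welcome', 'to', '', '', 'Python', 'tutorials']
print(b == c)  # False
```

Each extra space between two words produces one empty string in c.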


<<PAUSE>>

show slide

strip() function

Next we will learn about strip method.


strip function removes all leading and trailing whitespaces in a string.

Type unstripped = " Hello world " Let us define a string by typing

unstripped is equal to inside double quotes space Hello world space

Type unstripped.strip() Now to remove the whitespace,


Type, unstripped dot strip open and close parentheses.

Highlight output We can see that strip removes all the white spaces in the beginning and at the end of the string.

After splitting and stripping we get a list of strings with leading and trailing spaces stripped off.
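Stripping, and splitting followed by stripping, can be sketched as follows. The string record and the list comprehension are assumptions added for illustration; they are not typed in the tutorial.

```python
unstripped = " Hello world "

# strip() removes leading and trailing whitespace only;
# the space between the two words is kept.
stripped = unstripped.strip()
print(repr(stripped))  # 'Hello world'

# Hypothetical record with spaces around each field.
record = " A ; 015163 ; JOSEPH RAJ S "

# Split on the semicolon, then strip every field.
fields = [field.strip() for field in record.split(';')]
print(fields)  # ['A', '015163', 'JOSEPH RAJ S']
```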

<<PAUSE>>

Type mark_str = "1.25" Now we shall look at converting strings into floats and integers.

Type, mark underscore str is equal to inside double quotes 1.25


Note that 1.25 is a string and not a float as it is within double quotes.

Type mark = float(mark_str)


Type, mark is equal to float inside parentheses mark underscore str


Here we are converting string to float.

Type type(mark_str) Type, type inside parentheses mark underscore str


This tells you the datatype of mark_str i.e. string.



Type type(mark) Type, type inside parentheses mark

This shows mark is a float datatype.

Highlight the output We can see that string is converted to float.


Now we can perform mathematical operations on them.
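The conversion from string to float can be sketched as follows. The last two lines are added for illustration, to show why the conversion is needed before doing arithmetic.

```python
mark_str = "1.25"        # a string, because it is within quotes
mark = float(mark_str)   # convert the string to a float

print(type(mark_str))  # <class 'str'>
print(type(mark))      # <class 'float'>

# Arithmetic works on the float.
print(mark + 1)        # 2.25

# On the string, * repeats the text instead of multiplying the number.
print(mark_str * 2)    # 1.251.25
```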

Show Slide

Exercise 2


Pause the video. Try this exercise and then resume the video.


What happens if you type, int inside parentheses inside double quotes 1.25 in the terminal?

Switch to terminal Switch to the terminal for the solution.
Type int("1.25")


Highlight ValueError

Type, int inside parentheses inside double quotes 1.25


We can see a ValueError.


We cannot directly convert a string that represents a decimal number, such as "1.25", to an integer.

Type dcml_str = "1.25" Let us see the correct solution for this.


Type, dcml underscore str is equal to inside double quotes 1.25.

Type flt = float(dcml_str) Type, flt is equal to float inside parentheses dcml underscore str.


Here we are converting the string into float as we cannot directly convert it into integer.

Type flt Type, flt
Type number = int(flt) Type, number is equal to int inside parentheses flt


We are now converting float into integer.

Type number


Type, number

We got the output as an integer.

This is how we should convert strings into floats and integers.
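Both the error and the two-step conversion can be sketched as follows. The try and except lines are an assumption added so that the sketch runs to the end; in the tutorial the ValueError is simply shown on the terminal.

```python
dcml_str = "1.25"

# int() rejects a string that contains a decimal point.
try:
    int(dcml_str)
except ValueError as err:
    print("ValueError:", err)

# Convert to float first, then to int.
flt = float(dcml_str)
number = int(flt)
print(number)  # 1

# A string of digits without a decimal point converts directly.
print(int("20"))  # 20
```

Note that int() drops the fractional part; it does not round to the nearest integer.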

<<PAUSE>>

Open the file text editor.


Next, we will use a data file to parse the data.


Let me open the file student underscore record.txt in text editor.

Show text: student_record.txt is available in the Code files link.


A file student underscore record.txt is available in the Code files link of this tutorial.


Please download it in your Home directory and use it.



Scroll down and show the records


We will first read the file line by line and parse each record in this file.

It contains records of students and their marks in the State Secondary Board Examination.


It has 1 lakh 80 thousand lines of record.


We are going to read it and process this data.

Highlight A;015163;JOSEPH RAJ S;083;042;47;00;72;244


  • Highlight 'A'
  • Highlight 015163
  • Highlight JOSEPH RAJ S
  • Highlight 083;042;47;00;72
  • Highlight 244


Each line in the file is a set of fields separated by semicolons.


Consider a sample record from this file.


The following are the fields in any given line.

  • Region Code
  • Roll Number
  • Name
  • Marks of 5 subjects
  • Total marks
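The fields listed above can be pulled out of the sample record as sketched below. This is a minimal sketch; the variable names roll_number, name, subject_marks and total are assumptions chosen for illustration.

```python
# The sample record highlighted in the tutorial.
record = "A;015163;JOSEPH RAJ S;083;042;47;00;72;244"

fields = record.strip().split(";")

region_code = fields[0]
roll_number = fields[1]   # kept as a string so the leading zero is preserved
name = fields[2]
subject_marks = [float(m) for m in fields[3:8]]  # marks of 5 subjects
total = float(fields[8])

print(region_code, roll_number, name)  # A 015163 JOSEPH RAJ S
print(subject_marks)                   # [83.0, 42.0, 47.0, 0.0, 72.0]
print(sum(subject_marks) == total)     # True
```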


Open text editor Open a new text editor.
Copy paste the code from text editor Type the code as shown.
Highlight

for line in open("student_record.txt"):

    fields = line.split(";")

Let me explain this program.


We have already learnt about the for loop in an earlier tutorial.


The for loop will process the student record and split the fields of each record.

Highlight

math_mark = float(math_mark_str)


The math marks are then converted to float.
Highlight the code for this narration.


if region_code == "A": math_marks_A.append(math_mark)

Then it is appended and stored as a list in a variable math underscore marks underscore A for region code A.
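The highlighted lines fit together roughly as sketched below. This is a sketch under stated assumptions, not the exact marks.py: the position of the math mark (MATH_FIELD) is an assumption, and an in-memory file with made-up records stands in for student_record.txt so that the sketch runs without the data file.

```python
import io

# Hypothetical stand-in for open("student_record.txt").
# Only the first record comes from the tutorial; the others are made up.
sample_file = io.StringIO(
    "A;015163;JOSEPH RAJ S;083;042;47;00;72;244\n"
    "A;015164;STUDENT TWO;060;055;81;40;65;301\n"
    "B;015165;STUDENT THREE;070;066;90;50;58;334\n"
)

MATH_FIELD = 5  # assumption: index of the math mark in each record

math_marks_A = []
for line in sample_file:
    fields = line.split(";")
    region_code = fields[0].strip()
    math_mark_str = fields[MATH_FIELD].strip()
    math_mark = float(math_mark_str)
    if region_code == "A":
        math_marks_A.append(math_mark)

print(math_marks_A)  # [47.0, 81.0]
```

With the real data file, replace sample_file with open("student_record.txt").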
Save python file as marks.py Save the file as marks.py in the home directory.
Switch to terminal Switch to the terminal.
Type, %run marks.py Execute the file with percentage sign run space marks.py.
Switch to editor


Highlight math_marks_A

Switch back to the editor.


Now we have all the math marks for region A in the list math underscore marks underscore A.

Add in the marks.py file

math_marks_mean = sum(math_marks_A) / len(math_marks_A)


print(math_marks_mean)

Highlight len(math_marks_A)

Add the below lines to calculate the mean of math marks for region A.


For this, we just have to sum the math marks and divide by the length.


Note that the length will give the number of students in region ‘A’.
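The mean calculation can be sketched as follows. The list values are hypothetical, and the check for an empty list is an assumption added here; it avoids a ZeroDivisionError if no record matches region A.

```python
# Hypothetical math marks for region A.
math_marks_A = [47.0, 81.0, 64.0]

if math_marks_A:
    # Sum of the marks divided by the number of students.
    math_marks_mean = sum(math_marks_A) / len(math_marks_A)
    print(math_marks_mean)  # 64.0
else:
    print("No records found for region A")
```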

Press ctrl + s Let us save the file.
Switch to terminal Switch to the terminal.
Type, %run marks.py Execute the file again with percentage sign run space marks.py.
Highlight output Hence we get our final output.


Here the mean value for region A is calculated over roughly 1 lakh 67 thousand records.


This is how we read and split a large data file and perform computations on it.


<<PAUSE>>

Show Slide

Summary slide


This brings us to the end of this tutorial.


In this tutorial, we learnt to,

  1. Tokenize a string
  2. Split a string separated by delimiters with split() function


Show Slide

Summary slide

  3. Remove whitespaces using the strip() function.
  4. Convert datatypes of numbers from one type to another
  5. Parse input data and perform computations on it.


Show Slide

Evaluation


Here are some self assessment questions for you to solve


  1. How do you split the string "Guido;Rossum;Python" to get the words?


Show Slide

Evaluation

  2. What does int("20.0") produce?
Show Slide


Solutions


And the answers,
  1. line.split(';')
  2. int("20.0") will give an error, because a string that represents a decimal number cannot be converted directly into an integer.


Show Slide

Forum

Please post your timed queries in this forum.
Show Slide

Fossee Forum

Please post your general queries on Python in this forum.
Show Slide

Textbook Companion

FOSSEE team coordinates the TBC project.
Show Slide

Acknowledgment

http://spoken-tutorial.org

Spoken Tutorial Project is funded by NMEICT, MHRD, Govt. of India.


For more details, visit this website.

Show Slide

Thank You

This is Priya from IIT Bombay signing off.

Thanks for watching.

Contributors and Content Editors

Nancyvarkey, Nirmala Venkat, Priyacst