Python-3.4.3/C3/Parsing-data/English-timed

Revision as of 17:36, 31 May 2019 by Pratik kamble (Talk | contribs)

Time Narration
00:01 Welcome to the spoken tutorial on Parsing data.
00:06 In this tutorial, we will learn to:

Split a string using a delimiter,

Remove the leading and trailing whitespaces in a string and

Convert between different built-in datatypes.

00:22 To record this tutorial, I am using

Ubuntu Linux 16.04 operating system

Python 3.4.3 and

IPython 5.1.0

00:38 To practice this tutorial, you should know how to use lists.

If not, see the relevant Python tutorials on this website.

00:49 First, let us understand what is meant by parsing data.

Parsing data means reading data in text form and

converting it into a form which can be used for computations.

01:04 Next we will learn about the split() function.
01:08 The split() function breaks up a larger string into smaller strings using a defined separator.
01:15 If no argument is specified, then whitespace is used as the default separator.
01:22 Syntax is: str dot split inside parentheses argument
01:29 The split function parses a string and returns a list of tokens.

This is called string tokenizing.

01:38 Let us first open the terminal by pressing Ctrl+Alt+T keys simultaneously.
01:46 Type ipython3 and press Enter.
01:52 Let us initialize the pylab package. Type percentage sign pylab and press Enter.
02:02 From here onwards, please remember to press the Enter key after typing every command on the terminal.
02:09 Let us define a variable str1 as string data type.
02:14 Type, str1 is equal to inside double quotes Welcome to insert some whitespaces, then Python tutorials
02:24 We can have any number of whitespaces between to and Python tutorials.

But all the spaces are treated as one space.

02:34 Now, we are going to split this string on whitespace.
02:38 Type, str1 dot split open and close parentheses.
02:44 As we can see, we get a list of strings.
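The split on whitespace can be sketched as below. This is a minimal sketch: the number of spaces typed after "to" is an arbitrary choice, and the name tokens is used only for illustration.

```python
# str1 follows the narration; the run of spaces after "to" is arbitrary.
str1 = "Welcome to     Python tutorials"

# With no argument, split() treats any run of whitespace as one separator.
tokens = str1.split()
print(tokens)  # ['Welcome', 'to', 'Python', 'tutorials']
```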
02:48 Let us take another example of the split() function, this time with an argument. Type as shown.
02:57 Type, x dot split inside parentheses inside single quotes semicolon.
03:04 We get a list of strings, with the original string split at every semicolon.
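Splitting with an explicit separator can be sketched as below. The value of x here is a hypothetical stand-in, since the string shown on screen is not part of this script, and the name parts is used only for illustration.

```python
# Hypothetical value for x; the string used in the video is not reproduced here.
x = "alpha;beta;gamma"

# The separator itself is dropped from the resulting strings.
parts = x.split(';')
print(parts)  # ['alpha', 'beta', 'gamma']
```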
03:08 Pause the video. Try this exercise and then resume the video.
03:14 Split x using space as argument. Is it same as splitting without an argument?
03:22 Switch to the terminal for the solution.
03:26 Type b is equal to x dot split open and close parentheses.
03:32 Type c is equal to x dot split inside parentheses and inside single quotes space.
03:41 Type b
03:44 Type c
03:47 We can see that, for x, splitting without argument gives the same result as giving space as argument.
03:54 Splitting the string without argument will split it wherever there is whitespace, treating any number of consecutive spaces as one separator.
04:01 And giving space as argument will split the string specifically at every single space.
04:08 Let us recall the variable str1.
04:12 Now, we will split this string without argument. Type b is equal to str1 dot split open and close parentheses.
04:24 Type c is equal to str1 dot split inside parentheses and inside single quotes space.
04:33 Type b
04:36 Type c
04:38 As you can see, here b is not equal to c since c has empty strings as entries, one for each extra space, whereas b has only words.
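The difference between the two calls can be sketched as below. This is a minimal sketch that assumes three spaces after "to"; any number greater than one shows the same effect.

```python
# Three spaces between "to" and "Python" (an arbitrary choice).
str1 = "Welcome to   Python tutorials"

b = str1.split()     # any run of whitespace counts as one separator
c = str1.split(' ')  # every single space is a separator

print(b)       # ['Welcome', 'to', 'Python', 'tutorials']
print(c)       # ['Welcome', 'to', '', '', 'Python', 'tutorials']
print(b == c)  # False
```

Each extra space in a run produces one empty string in c.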
04:49 Next we will learn about strip method.
04:53 strip function removes all leading and trailing whitespaces in a string
04:59 Let us define a string by typing unstripped is equal to inside double quotes space Hello world space
05:09 Now to remove the whitespace, type, unstripped dot strip open and close parentheses.
05:18 We can see that strip removes all the whitespaces in the beginning and at the end of the string.
05:25 After splitting and stripping we get a list of strings with leading and trailing spaces stripped off.
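Stripping, and stripping combined with splitting, can be sketched as below. The variable record and its contents are hypothetical and are used only to show both steps together.

```python
unstripped = " Hello world "

# strip() returns a new string; the original is left unchanged.
stripped = unstripped.strip()
print(repr(stripped))  # 'Hello world'

# Hypothetical record with stray spaces around each field.
record = " A ; 001 ; Student One "
fields = [field.strip() for field in record.split(';')]
print(fields)  # ['A', '001', 'Student One']
```

Note that strip() does not touch the space inside "Hello world" or inside "Student One".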
05:32 Now we shall look at converting strings into floats and integers.
05:38 Type, mark underscore str is equal to inside double quotes 1.25
05:46 Note that 1.25 is a string and not a float as it is within double quotes.
05:53 Type, mark is equal to float inside parentheses mark underscore str. Here we are converting string to float.
06:05 Type type inside parentheses mark underscore str. This tells you the datatype of mark_str i.e. string.
06:17 Type type inside parentheses mark. This shows mark is a float datatype.
06:26 We can see that the string is converted to float. Now we can perform mathematical operations on it.
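The conversion from string to float can be sketched as below. The name doubled is used only for illustration.

```python
mark_str = "1.25"       # a string, because it is within quotes
mark = float(mark_str)  # the same value as a float

print(type(mark_str))  # <class 'str'>
print(type(mark))      # <class 'float'>

# Arithmetic now works on the numeric value.
doubled = mark * 2
print(doubled)  # 2.5
```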
06:34 Pause the video. Try this exercise and then resume the video.
06:40 What happens if you type, int inside parentheses inside double quotes 1.25 in the terminal?
06:48 Switch to the terminal for the solution.
06:52 Type, int inside parentheses inside double quotes 1.25
06:59 We can see a ValueError. We cannot directly convert a string that contains a decimal number into an integer.
07:06 Let us see the correct solution for this. Type dcml underscore str is equal to inside double quotes 1.25.
07:18 Type flt is equal to float inside parentheses dcml underscore str.
07:27 Here we are converting the string into float first, as a string containing a decimal point cannot be converted directly into an integer.
07:34 Type flt
07:37 Type, number is equal to int inside parentheses flt. We are now converting float into integer.
07:48 Type number. We got the output as an integer.
07:54 This is how we should convert strings into floats and integers.
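The two-step conversion can be sketched as below. The flag raised is used only for illustration, to record that the direct conversion failed.

```python
dcml_str = "1.25"

# Direct conversion fails because of the decimal point in the string.
raised = False
try:
    int(dcml_str)
except ValueError as err:
    raised = True
    print("ValueError:", err)

# Convert to float first, then to int.
flt = float(dcml_str)
number = int(flt)
print(number)  # 1
```

Note that int() drops the fractional part; it does not round to the nearest integer.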
07:59 Next, we will use a data file to parse the data.
08:04 Let me open the file student underscore record.txt in text editor.
08:10 The file student underscore record.txt is available in the Code files link of this tutorial.

Please download it in your Home directory and use it.

08:22 We will first read the file line by line and parse each record in this file.
08:28 It contains records of students and their marks in the State Secondary Board Examination.
08:35 It has 1 lakh 80 thousand lines of records. We are going to read it and process this data.
08:43 Each line in the file is a set of fields separated by semicolons.
08:49 Consider a sample record from this file.
08:53 The following are the fields in any given line.

Region Code Roll Number Name Marks of 5 subjects Total marks

09:08 Open a new text editor. Type the code as shown.
09:14 Let me explain this program.
09:17 We have already learnt the for loop in an earlier tutorial. The for loop will process each student record and split it into its fields.
09:28 The math marks are then converted to float.
09:32 Then it is appended and stored as a list in a variable math underscore marks underscore A for region code A.
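The steps described above can be sketched as below. This is not the code shown in the video. It is a minimal sketch under stated assumptions: the sample records are made up, an in-memory text stream stands in for student_record.txt so that the sketch runs on its own, and the position of the math marks (MATH_FIELD) is assumed.

```python
import io

# Hypothetical records in the layout described above:
# region code; roll number; name; marks of 5 subjects; total marks
sample = io.StringIO(
    "A;001;Student One;70;65;81;77;90;383\n"
    "B;002;Student Two;55;60;72;68;80;335\n"
    "A;003;Student Three;88;79;93;85;91;436\n"
)

MATH_FIELD = 5  # assumed position of the math marks; check the real file

math_marks_A = []
for line in sample:
    fields = line.split(';')
    region_code = fields[0].strip()
    if region_code == 'A':
        # Convert the math marks from string to float before storing.
        math_marks_A.append(float(fields[MATH_FIELD].strip()))

print(math_marks_A)  # [81.0, 93.0]
```

To read the real data, the in-memory stream would be replaced by the opened file, for example with open('student_record.txt').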
09:41 Save the file as marks.py in the Home directory.
09:48 Switch to the terminal.
09:51 Execute the file with percentage sign run space marks.py.
09:58 Switch back to the editor. Now we have all the math marks for region A in the list math underscore marks underscore A.
10:09 Add the below lines to calculate the mean of math marks for region A.
10:15 For this, we just have to sum the math marks and divide by the length.
10:21 Note that the length will give the number of students in region ‘A’.
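The mean calculation can be sketched as below. The list values are hypothetical, and the name math_marks_mean is used only for illustration.

```python
# Hypothetical marks; in marks.py this list is filled by the for loop.
math_marks_A = [81.0, 93.0]

# Sum of the marks divided by the number of students in region A.
math_marks_mean = sum(math_marks_A) / len(math_marks_A)
print(math_marks_mean)  # 87.0
```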
10:26 Let us save the file.
10:29 Switch to the terminal.
10:32 Execute the file again with percentage sign run space marks.py.
10:40 Hence we got our final output.
10:43 Here the mean value for region A is calculated after reading roughly 1 lakh 80 thousand records.
10:51 This is how we read and split the records of a large data file and perform computations on them.
10:57 This brings us to the end of this tutorial.
11:01 In this tutorial, we learnt to,

Tokenize a string

Split a string separated by delimiters with split() function

11:11 Remove whitespaces using the strip() function.

Convert datatypes of numbers from one type to another

Parse input data and perform computations on it.

11:25 Here are some self assessment questions for you to solve

1. How do you split the string "Guido;Rossum;Python" to get the words?

11:36 2. What does int inside parentheses inside double quotes 20.0 produce?
11:43 And the answers -

1. line.split inside parentheses inside single quotes semicolon

2. int inside parentheses inside double quotes 20.0 will give an error, because a string containing a decimal point cannot be converted directly into an integer.

12:03 Please post your timed queries in this forum.
12:07 Please post your general queries on Python in this forum.
12:12 FOSSEE team coordinates the TBC project.
12:16 Spoken Tutorial Project is funded by NMEICT, MHRD, Govt. of India. For more details, visit this website.
12:27 This is from IIT Bombay signing off. Thanks for watching.

Contributors and Content Editors

Pratik kamble