Python-3.4.3/C3/Parsing-data/English-timed
Time | Narration |
00:01 | Welcome to the spoken tutorial on Parsing data. |
00:06 | In this tutorial, we will learn to:
Split a string using a delimiter, remove the leading and trailing whitespaces in a string, and convert between different built-in datatypes. |
00:22 | To record this tutorial, I am using:
Ubuntu Linux 16.04 operating system, Python 3.4.3 and IPython 5.1.0 |
00:38 | To practice this tutorial, you should know how to use lists.
If not, see the relevant Python tutorials on this website. |
00:49 | First, let us understand, what is meant by parsing data. |
00:54 | Parsing data means reading data in text form and converting it into a form which can be used for computations. |
01:04 | Next we will learn about split() function. |
01:08 | split() function breaks up a larger string into smaller strings using a defined separator. |
01:15 | If no argument is specified, then whitespace is used as default separator. |
01:22 | Syntax is: str dot split inside parentheses argument |
01:29 | The split function parses a string and returns a list of tokens.
This is called string tokenizing. |
01:38 | Let us first open the terminal by pressing Ctrl+Alt+T keys simultaneously. |
01:46 | Type ipython3 and press Enter. |
01:52 | Let us initialize the pylab package. Type percentage sign pylab and press Enter. |
02:02 | From here onwards, please remember to press the Enter key after typing every command on the terminal. |
02:09 | Let us define a variable str1 as string data type. |
02:14 | Type, str1 is equal to inside double quotes Welcome to insert some whitespaces, then Python tutorials |
02:24 | We can have any number of whitespaces between to and Python tutorials.
When we split without an argument, a run of consecutive whitespaces is treated as a single separator. |
02:34 | Now, we are going to split this string on whitespace. |
02:38 | Type, str1 dot split open and close parentheses. |
02:44 | As we can see, we get a list of strings. |
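The step above can be sketched as follows. The exact number of spaces typed in the tutorial is not stated, so the value of str1 here is an assumption made for illustration.

```python
# Assumed value: the tutorial only says "insert some whitespaces"
# between "to" and "Python tutorials"; five spaces are used here.
str1 = "Welcome to" + " " * 5 + "Python tutorials"

# With no argument, split() separates on any run of whitespace.
words = str1.split()
print(words)  # ['Welcome', 'to', 'Python', 'tutorials']
```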
02:48 | Let us take another example for split() function with argument. Type as shown. |
02:57 | Type, x dot split inside parentheses inside single quotes semicolon. |
03:04 | We get a list of strings, split at each semicolon. |
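A minimal sketch of splitting with an argument. The transcript does not show the value of x, so the string below is a hypothetical stand-in.

```python
# Hypothetical value for x; the tutorial's actual string is not shown
# in the transcript.
x = "var1; var2; var3"

# Split at every semicolon. Whitespace next to the separator is kept.
parts = x.split(';')
print(parts)  # ['var1', ' var2', ' var3']
```

Note that the second and third entries keep their leading space, because only the semicolon is treated as the separator.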
03:08 | Pause the video. Try this exercise and then resume the video. |
03:14 | Split x using space as argument. Is it same as splitting without an argument? |
03:22 | Switch to the terminal for the solution. |
03:26 | Type b is equal to x dot split open and close parentheses. |
03:32 | Type c is equal to x dot split inside parentheses and inside single quotes space. |
03:41 | Type b |
03:44 | Type c |
03:47 | We can see that, for x, splitting without an argument gives the same result as giving a space as the argument. |
03:54 | In general, splitting without an argument splits the string on any run of whitespace, whatever the number of spaces. |
04:01 | Giving a space as the argument splits the string on every single space. |
04:08 | Let us recall the variable str1. |
04:12 | Now, we will split this string without argument. Type b is equal to str1 dot split open and close parentheses. |
04:24 | Type c is equal to str1 dot split inside parentheses and inside single quotes space. |
04:33 | Type b |
04:36 | Type c |
04:38 | As you can see, here b is not equal to c, since c has empty strings as entries wherever there were consecutive spaces, whereas b has only words. |
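The difference can be checked with a short sketch. The string is an assumed example with three spaces between to and Python.

```python
# Assumed example: three consecutive spaces between "to" and "Python".
str1 = "Welcome to" + " " * 3 + "Python tutorials"

b = str1.split()      # any run of whitespace is one separator
c = str1.split(' ')   # every single space is a separator

print(b)       # ['Welcome', 'to', 'Python', 'tutorials']
print(c)       # ['Welcome', 'to', '', '', 'Python', 'tutorials']
print(b == c)  # False
```

Three consecutive spaces produce two empty strings in c, one for each gap between neighbouring spaces.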
04:49 | Next we will learn about strip method. |
04:53 | The strip function removes all leading and trailing whitespaces in a string. |
04:59 | Let us define a string by typing unstripped is equal to inside double quotes space Hello world space |
05:09 | Now to remove the whitespace, type, unstripped dot strip open and close parentheses. |
05:18 | We can see that strip removes all the whitespaces in the beginning and at the end of the string. |
05:25 | After splitting and stripping we get a list of strings with leading and trailing spaces stripped off. |
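A small sketch of stripping, and of combining split and strip. The name record and its value are hypothetical and are used only for illustration.

```python
unstripped = " Hello world "
stripped = unstripped.strip()
print(repr(stripped))  # 'Hello world'

# Hypothetical record with a space after each semicolon.
record = "var1; var2; var3"

# Split on the semicolon, then strip each piece.
fields = [field.strip() for field in record.split(';')]
print(fields)  # ['var1', 'var2', 'var3']
```

The space inside Hello world is kept, since strip only touches the beginning and the end of the string.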
05:32 | Now we shall look at converting strings into floats and integers. |
05:38 | Type, mark underscore str is equal to inside double quotes 1.25 |
05:46 | Note that 1.25 is a string and not a float as it is within double quotes. |
05:53 | Type, mark is equal to float inside parentheses mark underscore str. Here we are converting string to float. |
06:05 | Type type inside parentheses mark underscore str. This tells you the datatype of mark_str i.e. string. |
06:17 | Type type inside parentheses mark . This shows mark is a float datatype. |
06:26 | We can see that the string is converted to float. Now we can perform mathematical operations on it. |
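The conversion can be sketched as below. The final multiplication is an added illustration and is not part of the tutorial.

```python
mark_str = "1.25"        # a string, because it is inside quotes
mark = float(mark_str)   # convert the string to a float

print(type(mark_str))    # <class 'str'>
print(type(mark))        # <class 'float'>

# Arithmetic now works on the converted value.
doubled = mark * 2
print(doubled)           # 2.5
```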
06:34 | Pause the video. Try this exercise and then resume the video. |
06:40 | What happens if you type, int inside parentheses inside double quotes 1.25 in the terminal? |
06:48 | Switch to the terminal for the solution. |
06:52 | Type, int inside parentheses inside double quotes 1.25 |
06:59 | We can see a ValueError. We cannot convert a string that contains a decimal number to an integer directly. |
07:06 | Let us see the correct solution for this. Type dcml underscore str is equal to inside double quotes 1.25. |
07:18 | Type flt is equal to float inside parentheses dcml underscore str. |
07:27 | Here we are first converting the string into float, as a string containing a decimal point cannot be converted directly into an integer. |
07:34 | Type flt |
07:37 | Type, number is equal to int inside parentheses flt. We are now converting float into integer. |
07:48 | Type number We got the output as integer. |
07:54 | This is how we should convert strings into floats and integers. |
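A sketch of the steps above, with the error caught so that the code runs to the end. The variable names caught and whole are additions for illustration.

```python
dcml_str = "1.25"

# int() rejects a string that contains a decimal point.
caught = False
try:
    int(dcml_str)
except ValueError as err:
    caught = True
    print("ValueError:", err)

# Convert to float first, then to int. int() drops the fractional part.
flt = float(dcml_str)
number = int(flt)
print(number)  # 1

# A string holding a whole number converts directly.
whole = int("12")
print(whole)   # 12
```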
07:59 | Next, we will use a data file to parse the data. |
08:04 | Let me open the file student underscore record.txt in text editor. |
08:10 | The file student underscore record.txt is available in the Code files link of this tutorial.
Please download it in your Home directory and use it. |
08:22 | We will first read the file line by line and parse each record in this file. |
08:28 | It contains records of students and their marks in the State Secondary Board Examination. |
08:35 | It has 1 lakh 80 thousand (180,000) lines of records. We are going to read it and process this data. |
08:43 | Each line in the file is a set of fields separated by semicolons. |
08:49 | Consider a sample record from this file. |
08:53 | The following are the fields in any given line:
Region Code, Roll Number, Name, Marks of 5 subjects, Total marks |
09:08 | Open a new text editor. Type the code as shown. |
09:14 | Let me explain this program. |
09:17 | We have already learnt about the for loop in an earlier tutorial. The for loop will process the student records and split the fields of each record. |
09:28 | The math marks are then converted to float. |
09:32 | Then it is appended and stored as a list in a variable math underscore marks underscore A for region code A. |
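The program itself is not reproduced in this transcript. The following is only a sketch of the logic described, under stated assumptions: the sample records are made up, and MATH_INDEX (the position of the math marks within a record) is a guess, since the order of the five subjects is not given here.

```python
# Hypothetical sample records in the layout described above:
# region code; roll number; name; marks of 5 subjects; total marks
records = [
    "A;001;STUDENT ONE;81;64;78;73;85;381",
    "B;002;STUDENT TWO;55;60;71;68;59;313",
    "A;003;STUDENT THREE;92;88;95;90;97;462",
]

# Assumption: math is the third subject, i.e. field index 5.
MATH_INDEX = 5

math_marks_A = []
for line in records:
    # Remove the trailing newline or spaces, then split the fields.
    fields = line.strip().split(';')
    region_code = fields[0].strip()
    if region_code == 'A':
        # Convert the math marks from string to float and store them.
        math_marks_A.append(float(fields[MATH_INDEX].strip()))

print(math_marks_A)  # [78.0, 95.0]
```

To work on the real data, the list records could be replaced by the open file, for example by looping over open('student_record.txt'), assuming the file is in the current directory.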
09:41 | Save the file as marks.py in the Home directory. |
09:48 | Switch to the terminal. |
09:51 | Execute the file with percentage sign run space marks.py. |
09:58 | Switch back to the editor. Now we have all the math marks for region A in the list math underscore marks underscore A. |
10:09 | Add the below lines to calculate the mean of math marks for region A. |
10:15 | For this, we just have to sum the math marks and divide by the length. |
10:21 | Note that the length will give the number of students in region ‘A’. |
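The mean calculation can be sketched as below. The marks in the list are hypothetical values, and the name mean_A is chosen for illustration.

```python
# Hypothetical math marks for region A.
math_marks_A = [70.0, 80.0, 90.0]

# Mean = sum of the marks divided by the number of students.
mean_A = sum(math_marks_A) / len(math_marks_A)
print(mean_A)  # 80.0
```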
10:26 | Let us save the file. |
10:29 | Switch to the terminal. |
10:32 | Execute the file again with percentage sign run space marks.py. |
10:40 | Hence we got our final output. |
10:43 | Here the mean value for region A is calculated from a file of roughly 1 lakh 80 thousand records. |
10:51 | This is how we read and split a large data file and perform computations on it. |
10:57 | This brings us to the end of this tutorial. |
11:01 | In this tutorial, we learnt to:
Tokenize a string. Split a string separated by delimiters with the split() function. |
11:11 | Remove whitespaces using the strip() function.
Convert datatypes of numbers from one type to another. Parse input data and perform computations on it. |
11:25 | Here are some self assessment questions for you to solve.
1. How do you split the string "Guido;Rossum;Python" to get the words? |
11:36 | 2. What does int inside parentheses inside double quotes 20.0 produce? |
11:43 | And the answers -
1. line.split inside parentheses inside single quotes semicolon 2. int inside parentheses inside double quotes 20.0 will give an error, because a string containing a decimal point cannot be converted directly into an integer. |
12:03 | Please post your timed queries in this forum. |
12:07 | Please post your general queries on Python in this forum. |
12:12 | FOSSEE team coordinates the TBC project. |
12:16 | Spoken Tutorial Project is funded by NMEICT, MHRD, Govt. of India. For more details, visit this website. |
12:27 | This is from IIT Bombay signing off. Thanks for watching. |