Python/C3/Statistics/English-timed

Revision as of 11:01, 18 March 2013 by Sneha (Talk | contribs)

Timing Narration
0:00 Hello friends and welcome to the tutorial on 'Statistics' using Python.
0:06 At the end of this tutorial, you will be able to:
  1. Do statistical operations in Python
  2. Sum a set of numbers
  3. Find their mean, median and standard deviation


0:17 Before beginning this tutorial, we suggest that you complete the tutorials on
0:21 "Loading Data from files", "Getting started with Lists" and "Accessing Pieces of Arrays".
0:29 Now, type in terminal ipython space hyphen pylab.
0:38 For this tutorial, we will use the data file at the path slash home slash fossee slash sslc2 dot txt.
0:47 It contains records of students and their performance in one of the State Secondary Board Examinations.
0:53 It has 180,000 lines of records.
0:57 We are going to read it and process this data.
1:02 We can see the contents of the file by double-clicking on it.
1:06 It might take some time to open since it is quite a large file.
1:11 Please don't edit the data since it has a particular structure.
1:15 To check the contents of the file, we use the cat command.
1:18 So type cat space slash home slash fossee slash sslc2 dot txt. Hit enter.
1:31 Each line in the file is a set of 11 fields separated by semi-colons.
1:38 Consider a sample line from this file.
1:43 A semicolon 015163 semicolon JOSEPH RAJ S semicolon 083 semicolon 042 semicolon 47 semicolon 00 semicolon 72 semicolon 244 and three semicolons in a row.
2:11 The following are the fields in any given line.
2:16 For the sample line, these are:
  * Region Code, which is 'A'
  * Roll Number 015163
  * Name JOSEPH RAJ S
  * Marks of 5 subjects:
    ** English 083
    ** Hindi 042
    ** Maths 47
    ** Science 00
    ** Social Science 72
  * Total marks 244

2:42 Let's load this data as an array and then run various functions on it.
2:48 To get the data as an array, we use the loadtxt command.
2:53 So type on the terminal L is equal to loadtxt within brackets, in single quotes slash home slash fossee slash sslc2 dot txt, comma usecols is equal to within brackets 3,4,5,6,7 comma delimiter is equal to within single quotes semicolon, and hit Enter.
3:45 The loadtxt function gives us our output in the form of an array.
3:57 Now we have an error.
3:58 We have to type loadtxt before the brackets.
4:09 delimiter specifies the character that separates the fields of data, and usecols specifies the columns to be used.
4:19 So within brackets 3,4,5,6,7 loads those columns.
4:26 The 'comma' is added because usecols is a sequence.
4:31 As we can see L is an array.
4:35 To get the shape of this array, we can type L dot shape in the terminal and hit Enter.
4:43 We get a tuple stating the numbers of rows and columns respectively.
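The loading step above can be sketched without the full data file. This is only an illustration: it reads from an in-memory string instead of /home/fossee/sslc2.txt, uses an explicit numpy import instead of the pylab session, and its second record is made up so that the array has more than one row.

```python
import io
import numpy as np

# The first line is the sample record quoted in the tutorial.
# The second line is a made-up record, added only for illustration.
sample = io.StringIO(
    "A;015163;JOSEPH RAJ S;083;042;47;00;72;244;;;\n"
    "B;000001;EXAMPLE STUDENT;050;060;70;80;90;350;;;\n"
)

# Columns are counted from 0, so 3 to 7 are the five subject marks.
L = np.loadtxt(sample, usecols=(3, 4, 5, 6, 7), delimiter=";")

print(L)        # one row per student, one column per subject
print(L.shape)  # (number of rows, number of columns)
```

With the real file, the first argument would be the file path and the shape would report one row per record.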
4:50 Let's start applying statistical operations on these.
4:55 We will start with the most basic, summing.
4:59 How do you find the sum of marks of all subjects for the first student?
5:04 As we know from accessing pieces of arrays, to access the first row, we type in the terminal L square brackets 0 comma colon.
5:19 Now to sum this we can say totalmarks is equal to sum within brackets L within square brackets 0 comma colon. Hit Enter. Then type totalmarks and hit Enter again.
5:47 Now to get the mean we can divide the totalmarks by the length.
5:52 So type totalmarks slash len within brackets L in square brackets 0 comma colon.
6:10 Or simply use the function mean.
6:13 For that type mean within brackets L and in square brackets 0 comma colon and hit Enter.
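The sum and mean of the first row can be tried on a small array. This is a sketch with an explicit numpy import rather than the pylab session; the first row holds the marks of the sample student from the tutorial and the second row is made up.

```python
import numpy as np

# Row 0: marks of the sample student. Row 1: made-up marks.
L = np.array([[83, 42, 47, 0, 72],
              [50, 60, 70, 80, 90]], dtype=float)

first = L[0, :]                         # all subjects of the first student
totalmarks = np.sum(first)              # sum of the five marks
mean_by_hand = totalmarks / len(first)  # total divided by number of subjects
mean_builtin = np.mean(first)           # same value from the mean function

print(totalmarks, mean_by_hand, mean_builtin)
```

The total agrees with the Total marks field of the sample line, which is 244.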
6:31 But we have such a large data set that calculating the mean for each student one by one is impractical.
6:38 Is there a way to reduce the work?
6:40 For this we will look into the documentation of mean.
6:42 So type mean question mark in the terminal.
6:49 As we know L is a two dimensional array.
6:52 We can calculate the mean along each of the axes of the array.
6:57 The axis of rows is referred to by the number 0 and the axis of columns by 1.
7:02 So to calculate the mean across all columns, we pass the extra parameter 1 for the axis.
7:07 So type mean within brackets L comma 1 and hit Enter.
7:17 L here is a two dimensional array.
7:20 Similarly, the average marks scored by all the students for each subject can be calculated using mean within brackets L comma 0.
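The effect of the axis parameter is easier to see on a small array. This is a sketch with an explicit numpy import; the first row is the sample student from the tutorial and the second row is made up.

```python
import numpy as np

# Row 0: marks of the sample student. Row 1: made-up marks.
L = np.array([[83, 42, 47, 0, 72],
              [50, 60, 70, 80, 90]], dtype=float)

per_student = np.mean(L, 1)  # one mean per row, taken across the columns
per_subject = np.mean(L, 0)  # one mean per column, taken across the rows

print(per_student.shape)  # as many values as there are students
print(per_subject.shape)  # as many values as there are subjects
```

Passing 1 collapses the columns and leaves one value per student; passing 0 collapses the rows and leaves one value per subject.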
7:36 Next, let us calculate the median of English marks for all the students.
7:41 We can access English marks of all students using L in square brackets colon comma zero and hit Enter.
7:53 To get the median we will simply use the function median.
7:57 So type median within brackets L square brackets colon comma 0 .
8:17 For all the subjects, we can use the same syntax as mean and calculate the median across all rows using median.
8:25 So type median in brackets L comma 0 and hit Enter.
8:35 Similarly, to calculate the standard deviation for English we will use the function std.
8:41 So type std, in brackets L and in square brackets colon comma 0 and hit Enter.
8:57 And for all rows, we do std within brackets L comma 0.
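The median and standard deviation calls can be sketched in the same way. The array below is only an illustration: the first row is the sample student from the tutorial and the other two rows are made up. Note that numpy's std computes the population standard deviation by default, that is, it divides by the number of values.

```python
import numpy as np

# Row 0: marks of the sample student. Rows 1 and 2: made-up marks.
L = np.array([[83, 42, 47, 0, 72],
              [50, 60, 70, 80, 90],
              [60, 70, 80, 90, 100]], dtype=float)

english = L[:, 0]                   # first column: English marks of all students

english_median = np.median(english)
english_std = np.std(english)

subject_medians = np.median(L, 0)   # one median per subject
subject_stds = np.std(L, 0)         # one standard deviation per subject

print(english_median, english_std)
print(subject_medians)
print(subject_stds)
```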
9:08 Pause the video here, try out the following exercise and resume the video.
9:13 In the given file football dot txt at path slash home slash fossee slash football dot txt, one column is player name, the second is goals at home and the third is goals away.
9:28 1. Find the total goals for each player
9:33 2. Mean of home and away goals
9:37 3. Standard deviation of home and away goals
9:46 This is the required data.
9:49 For that open the football dot txt file.
9:54 The solution is on your screen.
10:00 This brings us to the end of the tutorial.
10:03 In this tutorial, we have learnt to:
10:07 1. Do the standard statistical operations sum, mean, median and standard deviation in Python.
10:14 2. Combine text loading and the statistical operations to solve real world problems.
10:24 Here are some self assessment questions for you to solve.
10:27 1. Given a two dimensional list, two_dimensional_list is equal to within square brackets, within square brackets 3,5,8,2,1 comma within another square brackets 4,3,6,2,1, how do we calculate the mean of each row?
10:49 2. Calculate the median of the given list: student_marks is equal to within square brackets 74,78,56,87,91,82.
11:03 And the third question is: suppose there is a file with 6 columns but we wish to load text only in columns 2,3,4,5. How do we specify that?


11:16 And the answers,
11:20 1. To get the mean of each row, we just pass 1 as the second parameter to the function mean.
11:29 So we have to type mean within brackets two_dimensional_list comma 1
11:37 2. We use the function median to calculate the median of the list
11:42 by typing median within brackets student_marks.
11:47 And the final one: to specify the particular columns of a file, we use the parameter usecols is equal to within brackets 2,3,4,5.
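The three answers can be checked with a short sketch. It uses an explicit numpy import instead of the pylab session, and the six-column text for the third answer is made up and whitespace-separated purely for illustration; remember that usecols counts columns from 0.

```python
import io
import numpy as np

two_dimensional_list = [[3, 5, 8, 2, 1], [4, 3, 6, 2, 1]]
student_marks = [74, 78, 56, 87, 91, 82]

# Answer 1: passing 1 as the second parameter gives the mean of each row.
row_means = np.mean(two_dimensional_list, 1)

# Answer 2: median of the list of marks.
marks_median = np.median(student_marks)

# Answer 3: made-up six-column data; only columns 2, 3, 4, 5 are loaded.
six_columns = io.StringIO("0 1 2 3 4 5\n10 11 12 13 14 15\n")
selected = np.loadtxt(six_columns, usecols=(2, 3, 4, 5))

print(row_means)
print(marks_median)
print(selected)
```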
12:01 Hope you have enjoyed this tutorial and found it useful.
12:05 Thank you!

Contributors and Content Editors

Gaurav, Minal, PoojaMoolya, Sneha