Python-3.4.3/C2/Statistics/English

From Script | Spoken-Tutorial
Visual Cue
Narration
Show Slide Hello Friends. Welcome to the tutorial on "Statistics" using Python.
Show Slide

Objectives


At the end of this tutorial, you will be able to -
  • Do statistical operations in Python
  • Sum a set of numbers
  • Find their mean, median and standard deviation
Show Slide

System Specifications

To record this tutorial, I am using
  • Ubuntu Linux 16.04 operating system
  • Python 3.4.3 and
  • IPython 5.1.0
Show Slide:

Pre-requisites

  • Load data from files
  • Use Lists
  • Access parts of Arrays


To practise this tutorial, you should know how to -
  • load data from files
  • use Lists and
  • access parts of Arrays


If not, see the pre-requisite Python tutorials on this website.

[File Browser]

open and Show the file student_record.txt


1:08 - text box

For this tutorial, we will use the data file student_record.txt which we used in the earlier tutorial.


You can also find this file in the Code Files link of this tutorial.


Please download it in Home directory and use it.

[File Browser]

Show the file student_record.txt

We will use mathematical and logical operations on this array structured file.


For this, we need to install Numpy.

Numpy(Numerical Python)

slide:


NumPy stands for Numerical Python.


It is a library consisting of pre-compiled functions for mathematical and numerical routines.


NumPy has to be installed separately.

Open terminal by pressing Ctrl+Alt+T keys simultaneously Let us first open the Terminal by pressing Ctrl+Alt+T keys simultaneously.
[Terminal] Install latest pip

type sudo apt-get install python3-pip

Let us install the latest pip.


The pip command is used to install Python libraries.


Type, sudo apt-get install python3 hyphen pip and press Enter.


You need to have root access for installation as it asks for admin password.

Install numpy

type

sudo pip3 install numpy==1.13.3

Next, we need to install the numpy library, as we will be using it throughout the tutorial.


Type, sudo pip3 install numpy is equal to is equal to 1.13.3 and press Enter.

Highlight prompt after installation The installation is completed successfully.


We can see the terminal prompt without any error.

Slide:loadtxt()


Next we will learn about loadtxt() function.


To get the data as an array, we use the loadtxt() function.

For loadtxt() function, we need to import numpy library first.

[Terminal] type ipython3 Switch back to the terminal.

Now, type ipython3 and press Enter.

[IPython Terminal]

Type

import numpy as np

Type import numpy as np and press Enter.

Here np is an alias for numpy; it can be any name.

Type

L=np.loadtxt('student_record.txt', usecols=(3,4,5,6,7), delimiter=';')


Type L and press enter

Let us load the data from the file student_record.txt as an array.


Type, L is equal to np dot loadtxt inside parentheses inside quotes student_record.txt comma usecols is equal to inside parentheses 3 comma 4 comma 5 comma 6 comma 7 comma delimiter is equal to inside quotes semicolon. Press Enter.


Type L and press Enter.

Highlight the output We get the output in the form of an array.
Highlight command one by one loadtxt loads data from an external file.


delimiter specifies the character by which the fields of data are separated.


usecols specifies the columns to be used.


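loadtxt is a function; delimiter and usecols are its keyword arguments.

Highlight command one by one So columns 3,4,5,6,7 from student_record.txt are loaded here. Column numbers start from 0, so these are the fourth to eighth fields of each line.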


The 'comma' between column numbers is added because usecols is a sequence.
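
As a minimal sketch of how loadtxt, delimiter and usecols work together, the snippet below reads three made-up records from an in-memory text buffer instead of student_record.txt. The record layout and the values are assumptions for illustration only; they are not taken from the actual data file.

```python
import io
import numpy as np

# Hypothetical records: region; roll number; name; marks in five subjects.
sample = io.StringIO(
    "A;001;STUDENT ONE;53;36;28;16;44\n"
    "A;002;STUDENT TWO;58;37;42;35;40\n"
    "B;003;STUDENT THREE;72;56;71;54;66\n"
)

# usecols counts from 0, so (3, 4, 5, 6, 7) picks the five marks fields
# and skips the three text fields in front of them.
L = np.loadtxt(sample, usecols=(3, 4, 5, 6, 7), delimiter=';')

print(L)
print(L.shape)   # one row per record, one column per selected field
```

Only the selected fields are converted to numbers, which is why the text fields at the start of each record do not cause an error.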

[IPython Terminal]

Type L.shape

As we can see L is an array.


We can get the shape of this array using shape.

Type L.shape Type, L dot shape and press Enter.
[IPython Terminal]

4:45


Highlight (185667, 5)

We get a tuple giving the number of rows and columns respectively.


In this example, the array L has one lakh eighty five thousand six hundred and sixty seven rows and 5 columns.
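
To see how shape reads on a small case, here is a sketch with a hypothetical 3 by 5 array of marks (the values are made up, not taken from student_record.txt).

```python
import numpy as np

# Hypothetical marks: 3 students (rows) x 5 subjects (columns).
marks = np.array([[53, 36, 28, 16, 44],
                  [58, 37, 42, 35, 40],
                  [72, 56, 71, 54, 66]], dtype=float)

# shape is a tuple, so it can be unpacked into rows and columns.
rows, columns = marks.shape
print(rows, columns)
```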

Let us switch back to the student_record.txt file.
Highlight record Let us start applying statistical operations on these.


How do you find the sum of marks of all subjects for the first student?

[IPython Terminal]

Type

L[0]

Switch back to the terminal.


To access the first row in an array, we will type L inside square brackets 0 and press Enter.

[IPython Terminal]

Type

totalmarks=sum(L[0])

Now to sum this, type,

totalmarks is equal to sum inside parentheses L inside square brackets 0


Press Enter.

Type totalmarks

Highlight 177.0

Type totalmarks and press Enter.


We get the sum of marks of all subjects for the first student.

[IPython Terminal]

Type

totalmarks/len(L[0])

Highlight 35.399999999999999

Now to get the mean we can divide the totalmarks by the length of the array.


Type, totalmarks divided by len inside parentheses L inside square brackets 0 and press Enter.

[IPython Terminal]

Type

np.mean(L[0])

Or simply use the function mean.

Type np dot mean inside parentheses L inside square brackets 0 and press Enter.
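
The two ways of getting the mean can be compared side by side. This is a small sketch on a hypothetical row of five marks, chosen so that the total matches the 177.0 seen above; the individual marks are assumptions.

```python
import numpy as np

# Hypothetical marks of one student in five subjects.
first_row = np.array([53.0, 36.0, 28.0, 16.0, 44.0])

totalmarks = sum(first_row)                  # built-in sum over a 1-D array
manual_mean = totalmarks / len(first_row)    # total divided by number of subjects
numpy_mean = np.mean(first_row)              # same value in a single call

print(totalmarks, manual_mean, numpy_mean)
```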

[IPython Terminal]

Type

np.mean?

But we have such a large data set.


And calculating the mean for each student one by one is time consuming.


Is there a way to reduce the work?


For this, we will look into the documentation of mean.


Type, np dot mean questionmark and press Enter.

Read the text for more information.

Type q and press enter Type q to exit the documentation.
show slide

Two-Dimensional array

In the above example, L is a two-dimensional array, like a matrix.


We can calculate the mean along each axis of the array.


The axis of rows is referred to by 0 and that of columns by 1.


To calculate the mean across all columns, that is, one mean per row, we have to pass the extra parameter 1 for the axis. Passing 0 gives the mean across all rows, that is, one mean per column.

[IPython Terminal]

Type

np.mean(L,0)

Switch back to the terminal.


Let us calculate the mean of the marks scored by all the students for each subject.


Type np dot mean inside parentheses L comma 0 and press Enter.
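
The effect of the axis parameter is easier to see on a small array. The sketch below uses hypothetical marks for three students in five subjects; the values are assumptions for illustration.

```python
import numpy as np

# Hypothetical marks: one row per student, one column per subject.
marks = np.array([[53, 36, 28, 16, 44],
                  [58, 37, 42, 35, 40],
                  [72, 56, 71, 54, 66]], dtype=float)

per_subject = np.mean(marks, 0)   # axis 0: rows are collapsed, one mean per column
per_student = np.mean(marks, 1)   # axis 1: columns are collapsed, one mean per row

print(per_subject)   # five values, one for each subject
print(per_student)   # three values, one for each student
```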

[IPython Terminal]

Type

L[:,0]

Highlight output array([ 53., 58., 72., ..., 49., 33., 17.])

Next, we will calculate the median of English marks for all the students.


Type L inside square brackets colon comma 0 and press Enter.


Note that colon comma zero selects the first column of the array, that is, the English marks.
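
Row and column selection can be sketched as follows, again on hypothetical marks where column 0 is assumed to hold the English marks.

```python
import numpy as np

# Hypothetical marks: one row per student, column 0 is English.
marks = np.array([[53, 36, 28, 16, 44],
                  [58, 37, 42, 35, 40],
                  [72, 56, 71, 54, 66]], dtype=float)

first_student = marks[0]     # a single index selects a whole row
english = marks[:, 0]        # colon selects every row, 0 selects the first column
first_english = marks[0, 0]  # one element: first row, first column

print(first_student)
print(english)
print(first_english)
```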

[IPython Terminal]

Type

np.median(L[:,0])

To get the median we will simply use the function median.

Type np dot median inside parentheses L inside square brackets colon comma 0


Press Enter.

[IPython Terminal]

Type

np.median(L,0)

For all the subjects, we can calculate the median across all rows using the median function as shown here.


Type np dot median inside parentheses L comma 0


Press Enter.
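
A small sketch of both median calls, using hypothetical marks for three students; the values are assumptions and not taken from the data file.

```python
import numpy as np

# Hypothetical marks: one row per student, one column per subject.
marks = np.array([[53, 36, 28, 16, 44],
                  [58, 37, 42, 35, 40],
                  [72, 56, 71, 54, 66]], dtype=float)

english_median = np.median(marks[:, 0])   # median of the first column only
subject_medians = np.median(marks, 0)     # one median per column

print(english_median)
print(subject_medians)
```

With an odd number of rows, each median is the middle value of the sorted column.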

[IPython Terminal]

Type

np.std(L[:,0])

Similarly to calculate standard deviation we will use the function std


Standard deviation for English subject can be found by typing np dot std inside parentheses L inside square brackets colon comma 0


Press Enter.

[IPython Terminal]Type

np.std(L,0)

And for all the subjects, across all rows, we type np dot std inside parentheses L comma 0 and press Enter.
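
It helps to know which formula std applies. By default np.std divides by the number of values (population standard deviation); passing ddof=1 divides by one less (sample standard deviation). The sketch below checks this on three hypothetical English marks.

```python
import math
import numpy as np

# Hypothetical English marks of three students.
english = np.array([53.0, 58.0, 72.0])

# Manual population standard deviation: divide the squared deviations by N.
mean = english.sum() / len(english)
manual_std = math.sqrt(sum((x - mean) ** 2 for x in english) / len(english))

default_std = np.std(english)          # same as the manual value above
sample_std = np.std(english, ddof=1)   # divides by N - 1 instead

print(manual_std, default_std, sample_std)
```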
Pause the video here, try out the following exercise and resume the video.
Show Slide

Exercise 1

Refer to the file football.txt, that is available in the Code Files link of this tutorial.


Download and save the file in the present working directory.


Currently the present working directory is the Home directory.

highlight In football.txt,
  • the first column is player name,
  • second is goals at home and
  • third column is goals away.
Show Slide

Exercise 1

  1. Find the total goals for each player
  2. Mean of goals at home and goals away
  3. Standard deviation of goals at home and goals away
[IPython Terminal]

Type

L=np.loadtxt('football.txt',usecols=(1,2), delimiter=',')


np.sum(L,1)


Switch to the terminal.


The solution is, first, type,

L is equal to np dot loadtxt inside parentheses inside quotes football.txt comma usecols is equal to inside parentheses 1 comma 2 comma delimiter is equal to inside quotes comma.


Press Enter.


Then type np dot sum inside parentheses L comma 1 and press Enter.

[IPython Terminal]

Type np.mean(L,0)

Answer for the second, np dot mean inside parentheses L comma 0 and press Enter.
[IPython Terminal]

Type np.std(L,0)

Third, np dot std inside parentheses L comma 0 and press Enter.
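
The three answers can be tried without the actual file. The sketch below uses a made-up stand-in for football.txt; the player names and goal counts are assumptions for illustration.

```python
import io
import numpy as np

# Hypothetical contents: player name, goals at home, goals away.
sample = io.StringIO(
    "Player A,10,7\n"
    "Player B,4,5\n"
    "Player C,6,3\n"
)
goals = np.loadtxt(sample, usecols=(1, 2), delimiter=',')

totals = np.sum(goals, 1)   # axis 1: one total per player
means = np.mean(goals, 0)   # axis 0: mean of home goals, mean of away goals
stds = np.std(goals, 0)     # axis 0: standard deviation of each column

# The built-in sum takes a start value, not an axis, as its second argument.
# sum(goals, 1) starts from 1 and adds each row to it, so it returns the
# column totals plus 1 rather than the per-player totals.
builtin_result = sum(goals, 1)

print(totals)
print(means)
print(stds)
print(builtin_result)
```

This is why np.sum, and not the built-in sum, is used when an axis has to be given.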
Show Slide

Summary


This brings us to the end of the tutorial.


In this tutorial, we have learnt to do the standard statistical operations like:

sum

mean

median and

standard deviation in Python.

Show Slide

Assignment


Here are some self assessment questions for you to solve.
  1. Given a two dimensional list as shown, how do you calculate the mean of each row?
  2. Calculate the median of the given list.
Show Slide

Assignment

  3. There is a file with 6 columns. But we want to load text only from columns 2,3,4,5.

How do we specify that?

Show Slide


Solution

And the answers,

1. To get the mean of each row, we just pass 1 as the second parameter to the function mean

np.mean inside parentheses two_dimensional_list comma 1

2. We use the function median to calculate the median of the list

np.median inside parentheses student_marks

3. To specify the particular columns of a file, we use the parameter usecols is equal to inside parentheses 2, 3, 4, 5
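
The three answers can be sketched in code as follows. The names two_dimensional_list and student_marks come from the answers above; their contents and the six-column sample text are assumptions made up for illustration.

```python
import io
import numpy as np

# 1. Mean of each row of a two dimensional list (hypothetical values).
two_dimensional_list = [[1, 2, 3], [4, 5, 6]]
row_means = np.mean(two_dimensional_list, 1)

# 2. Median of a list (hypothetical values).
student_marks = [74, 69, 52, 54, 31]
marks_median = np.median(student_marks)

# 3. Load only columns 2, 3, 4, 5 from data with 6 columns.
#    Fields here are separated by spaces, which is the default for loadtxt.
sample = io.StringIO("0 1 2 3 4 5\n10 11 12 13 14 15\n")
selected = np.loadtxt(sample, usecols=(2, 3, 4, 5))

print(row_means)
print(marks_median)
print(selected)
```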

Show Slide Forum Please post your timed queries in this forum.
Show Slide

Fossee Forum

Please post your general queries on Python in this forum.
Show Slide Textbook Companion FOSSEE team coordinates the TBC project.
Show Slide

Acknowledgment http://spoken-tutorial.org

Spoken Tutorial Project is funded by NMEICT, MHRD, Govt. of India.

For more details, visit this website.

Previous slide That's it for the tutorial.


This is Trupti Kini from IIT Bombay signing off. Thank you.

Contributors and Content Editors

Nancyvarkey, Nirmala Venkat, Priyacst