PhET-Simulations-for-Mathematics/C3/Least-Squares-Regression/English


Title of the script: Least Squares Regression

Author: Shraddha Kodavade

Keywords: PhET simulation, Least-Squares Regression, best-fit line, residuals, slope, intercept, squared residuals, spoken tutorial, video tutorial.


Visual Cue Narration
Slide Number 1

Title Slide

Welcome to this spoken tutorial on Least Squares Regression.
Slide Number 2

Learning Objectives

In this tutorial, we will learn about:
  • Linear Regression
  • Correlation coefficient
  • Residuals and best fit line
  • Sum of squared residuals
Slide Number 3

System Requirements

This tutorial is recorded using:

Windows 10 64-bit operating system

Chrome version 101.0.49

Slide Number 4

Pre-requisites

https://spoken-tutorial.org

To follow this tutorial the learner should be familiar with topics in basic mathematics.

Please use the link below to access the tutorials on PhET Simulations.

Slide Number 5

Link for PhET Simulations

https://phet.colorado.edu/en/simulations/least-squares-regression

Please use the given link to download the PhET simulation.
Slide Number 6

Least Squares Regression

Least Squares Regression

Least squares regression finds the best fit line by minimising the sum of the squared residuals.
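
In symbols, a rough sketch of the quantity being minimised is given below. It assumes the line is written as y = ax + b, as it is later in this tutorial, and that the data points are labelled (x_i, y_i) for i = 1 to n. The labels r_i and S are introduced here only for illustration.

```latex
r_i = y_i - (a x_i + b), \qquad S(a, b) = \sum_{i=1}^{n} r_i^{2}
```

Here r_i is the residual of the i-th point, and the best fit line is the pair (a, b) for which S is smallest.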

Slide Number 7

PhET Simulations

In this tutorial, we will use Least-Squares Regression PhET Simulation.
Show the Downloads folder. I have already downloaded the Least-Squares Regression simulation to my Downloads folder.
Cursor on the interface. Let us begin.
Cursor on the interface. This is the interface of Least-Squares Regression.
Point to the white region In the middle of the screen we can see a white plotting region.

This is the Cartesian coordinate system.

Point to the left bottom side.

Point to the data points bucket.

A data points bucket is seen at the bottom left.

It consists of orange points.

The data points can be pulled out or put in the bucket.

Check the grid checkbox. Let us check the grid checkbox to show the grid.

This will help to place the points accurately.

Place one point at (15,15)

Place one point at (5,5)

Drag and place a point at (15, 15).

Next drag and place another point at (5, 5).

Point to the left hand side.

Click the green plus sign to show the Best-Fit Line panel.

Click the green plus sign to show the Correlation Coefficient panel.

Two panels can be seen on the left hand side.

Click the green plus sign to show the Best-Fit Line panel.

Click the green plus sign to show the Correlation Coefficient panel.

Click the Best-Fit Line. Point to the line.

Point to equation.

Click the ‘Best-Fit line’ checkbox.

A line passing through the two points is seen on the screen.

Note the equation written in the white box.

Point to the two points. At least two points are needed to plot a line in 2-D space.
Click the Residuals.

Point to the line.

Click on the Residuals check box.

No residuals are observed.

This is because a line drawn through exactly two points passes through both of them, so every residual is zero.

Click the Squared Residuals.

Point to the line.

Click on the Squared Residuals check box.

No squared residuals are observed.
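
The observation above can be checked with a minimal sketch. It uses the two points placed earlier, (5, 5) and (15, 15). The helper names `line_through` and `residuals` are illustrative only and are not part of the simulation.

```python
def line_through(p, q):
    """Return slope a and intercept b of the line y = a*x + b through p and q."""
    (x1, y1), (x2, y2) = p, q
    a = (y2 - y1) / (x2 - x1)  # assumes the two points have different x values
    b = y1 - a * x1
    return a, b


def residuals(points, a, b):
    """Vertical gap between each point and the line y = a*x + b."""
    return [y - (a * x + b) for x, y in points]


points = [(5, 5), (15, 15)]
a, b = line_through(points[0], points[1])
res = residuals(points, a, b)
squared = [r ** 2 for r in res]

print(a, b)     # 1.0 0.0
print(res)      # [0.0, 0.0]
print(squared)  # [0.0, 0.0]
```

With only two points, the line passes through both, so there is nothing left over to draw as a residual or a squared residual.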

Point to Correlation Coefficient.

Point to the value.

The correlation coefficient shows the degree of linear association between two variables.

It lies between -1 and +1.

Point to 1. Here the value is +1.

This means the two variables have a positive relationship.

That is, both variables move in the same direction.

Move the point from (5,5) to (5,15).

Point to -1.

Move the point from (5, 5) to (5, 15).

The value changes to -1.

This means the two variables have a negative relationship.

That is, the two variables move in opposite directions.

Place the third point at (10,3). Place a third point at (10, 3).

The correlation coefficient comes close to 0.

This means there is no linear association between the x and y values of the three points.
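
As a rough illustration of how such a value can be computed, the sketch below applies the standard Pearson formula to the points used in this tutorial. The function name `pearson_r` is an assumption made for this example, and the simulation's own calculation is not shown in the tutorial.

```python
import math


def pearson_r(points):
    """Pearson correlation coefficient of a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    # Assumes neither variable is constant, otherwise the denominator is zero.
    return sxy / math.sqrt(sxx * syy)


r_two = pearson_r([(5, 5), (15, 15)])               # the first two points
r_three = pearson_r([(15, 15), (5, 15), (10, 3)])   # the three points on the plot

print(r_two)    # 1.0
print(r_three)  # 0.0
```

For the three points, the positive and negative contributions to the numerator cancel, which is why the value sits at zero.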

Point to the residuals. Here we can see three red vertical lines.

The vertical distance from each data point to the best fit line shows how far the point deviates from the line.

These distances are called residuals.

The best fit line is chosen to keep these error lines as short as possible overall.

Point to the square boxes. Three square boxes can be seen on the screen.

They denote the square of the residuals.

The best fit line is the line that minimises the sum of these squares.
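
The idea can be sketched in a few lines for the three points on the plot, (15, 15), (5, 15) and (10, 3). The code uses the usual closed-form least-squares formulas. The function name `least_squares` is illustrative, and this is not necessarily how the simulation computes its line.

```python
def least_squares(points):
    """Return slope a and intercept b of the least-squares line y = a*x + b."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    a = sxy / sxx            # assumes the x values are not all equal
    b = mean_y - a * mean_x
    return a, b


points = [(15, 15), (5, 15), (10, 3)]
a, b = least_squares(points)
res = [y - (a * x + b) for x, y in points]   # the red vertical lines
squares = [r ** 2 for r in res]              # the areas of the square boxes
ssr = sum(squares)                           # sum of squared residuals

print(a, b)  # 0.0 11.0
print(res)   # [4.0, 4.0, -8.0]
print(ssr)   # 96.0
```

Each residual is the length of one vertical line, and each squared residual is the area of the matching box.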

Point to the top right corner. The My Line option has been ticked by default.
Point to the equation y = ax + b. The equation of the line is in the form y = ax + b.
Point to a.

Point to b.

Here a stands for slope.

It is the rate of change of y with respect to x.

b stands for the intercept, which shows where the line intersects the y-axis.
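
A small sketch can make the roles of a and b concrete. The values a = 2 and b = 3 are hypothetical and chosen only for illustration. They are not slider settings taken from the tutorial.

```python
def line(a, b, x):
    """Value of y on the line y = a*x + b."""
    return a * x + b


a, b = 2.0, 3.0  # hypothetical slope and intercept

y_at_zero = line(a, b, 0)                   # where the line meets the y-axis
change_per_unit = line(a, b, 5) - line(a, b, 4)  # change in y when x grows by 1

print(y_at_zero)        # 3.0, equal to b
print(change_per_unit)  # 2.0, equal to a
```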

Click the Residuals and Squared Residuals check boxes. Click the Residuals and Squared Residuals check boxes.
Move the sliders for a and b.


Point to the lines.

The same concept that applies to the best fit line applies here.

Move the sliders for a and b.

Observe the difference between Best Fit and My Line.
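
The difference can also be expressed as a number. The sketch below compares the sum of squared residuals for two lines over the three points placed earlier. The best fit values a = 0 and b = 11 were worked out by hand for these three points, and the My Line values a = 1 and b = 0 are a hypothetical slider setting.

```python
def sum_squared_residuals(points, a, b):
    """Sum of squared vertical gaps between the points and y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in points)


points = [(15, 15), (5, 15), (10, 3)]

best_fit = sum_squared_residuals(points, 0.0, 11.0)  # least-squares line for these points
my_line = sum_squared_residuals(points, 1.0, 0.0)    # hypothetical slider setting

print(best_fit)  # 96.0
print(my_line)   # 149.0
```

Any other choice of a and b gives a total at least as large as the best fit total, which is what the larger squares on the screen show.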

Click the Reset option.

Click the green plus sign to show the Best-Fit Line panel.

Click the green plus sign to show the Correlation Coefficient panel.

Click on the Reset option.

Click the green plus sign to show the Best-Fit Line panel.

Click the green plus sign to show the Correlation Coefficient panel.

Click the drop down list.


Point to the Custom option.

Click the Custom drop down box.

There are a total of 15 options.

We have now explored the Custom option.

Point to the curve.

Click the Internet users vs. Time option.

Let us explore the Internet users vs. Time option.

We can see an exponential-shaped arrangement of data points.

Point to the x-axis.

Point to the y axis.

The x axis denotes the Years since 1990.

The y axis denotes Total Internet users (in billions).

Click the question mark.

Point to the information in the box.

Click the cross button.

Click the question mark button.

This data is from the World Bank.

It denotes the global Internet users from 1990 to 2010. Click the cross button.

Point to Correlation Coefficient. The correlation coefficient is +0.94.

It means that as the years increase, the number of Internet users rises sharply.

Click the Best-Fit line checkbox.

Point to the line.

Point to the equation.

Click the Best-Fit line checkbox.

A red line appears.

Its equation is y = 0.10x - 0.38.

Click the Residuals check box.

Point to the vertical lines.

Click the Residuals check box.

Vertical lines appear.

This line gives the smallest sum of squared residuals for this data.

Drag slider a to 0.10. Click the Residuals. Drag slider a to 0.10.

Click the Residuals checkbox.

Point to blue residuals.

Drag slider b to -0.36.

For b = 0, the blue residuals are larger than the red residuals.

For b = -0.36, the residuals shrink and the sum of squares comes close to its least value.

Click the Reset button.

Click the green plus sign to show the Best-Fit Line panel.

Click the green plus sign to show the Correlation Coefficient panel.

Click the Reset button.

Click the green plus sign to show the Best-Fit Line panel.

Click the green plus sign to show the Correlation Coefficient panel.

Click the Life Expectancy vs. TVs. Let us explore another option.

Click the Life Expectancy vs. TV option.

Point to the curve. The data points show no particular trend.

The majority of the data points lie at the higher end of the y-axis.

Point to the x-axis.

Point to the y axis.

The x axis denotes the average number of people per TV.

The y axis denotes Life Expectancy in years.

Click the question mark button.

Click the cross button.

Click the question mark button.

This data is from The World Almanac and Book of Facts.

It denotes the life expectancy in 40 countries.

It is compared to the average number of people per TV in that country.

Click the cross button to close the information box.

Point to Correlation Coefficient. The correlation coefficient is -0.61.

Hence, the lower the average number of people per TV, the higher the life expectancy.

Point to the higher end of the y-axis.

Point to other data points.

Observe the cluster formed near the y-axis.

For values close to 0 people per TV, there is higher life expectancy.

For the majority of countries, life expectancy is 70 to 80 years.

The extreme points here are outliers.

Click the Best-Fit Line check box.

Point to the line.

Point to the equation.

Click the Best-Fit Line checkbox.

A red line appears.

Its equation is y = -0.04x + 69.6.

Click the Residuals checkbox.

Point to the cluster.

Click the Residuals checkbox.

The vertical lines appear.

Because the clustered values lie close together, not all the lines are visible.

Click the Reset option.

Click the green plus signs to show the Best-Fit Line and Correlation Coefficient panels.

Click the Reset option.

Click the green plus signs to show the Best-Fit Line and Correlation Coefficient panels.

Click the Temperature (F) vs. Longitude option. Let us explore another option.

Click the Temperature in Fahrenheit (F) vs. Longitude option.

Point to the points. The data points are scattered towards the higher end of the x-axis.
Point to the x-axis.

Point to the y axis.

The x axis denotes the Longitude.

The y axis denotes Average January Temperature in Fahrenheit (F).

Click the question mark button.

Point to the information in the box.

Click the cross button.

Click the question mark button.

This data shows the average January temperature in 50 major US cities.

The temperatures are compared with the longitudes of the cities.

Click the cross button.

Point to Correlation Coefficient. The correlation coefficient is +0.02.

This value is close to 0.

This means there is no strong linear relationship between the two variables.

Click the Best-Fit line.

Point to the line.

Point to the equation.

Click the Best-Fit Line checkbox.

A red line appears.

Its equation is y = 0.02x + 24.6.

Click the Residuals check box. Click the Residuals check box.

The vertical lines appear.

Many of these lines are long, because the data points lie far from the best fit line.

Click the Squared Residuals check box. Click the Squared Residuals check box.

These squares denote the squared errors.

Since the residuals are large, the squared residuals are large as well.
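
The effect of squaring can be seen in a short sketch. The residual values below are hypothetical and are not read from the temperature data set.

```python
residuals = [1, -2, 5, -10]             # hypothetical residuals
squared = [r ** 2 for r in residuals]   # squaring removes the sign
total = sum(squared)

print(squared)  # [1, 4, 25, 100]
print(total)    # 130
```

A residual that is ten times longer produces a square that is a hundred times larger, so points far from the line dominate the sum.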

Only Narration. With this we have come to the end of this tutorial.

Let us summarise.

Slide Number 8

Summary

In this tutorial, we have learnt about:
  • Linear Regression
  • Correlation coefficient
  • Residuals and best fit line
  • Sum of squared residuals
Slide Number 9

Assignment

As an assignment,

Explore the various data combinations given in this simulation.

Slide Number 10

About the Spoken Tutorial Project

The video at the following link summarises the Spoken Tutorial project.

Please download and watch it.

Slide Number 11

Spoken Tutorial workshops

The Spoken Tutorial Project team:

conducts workshops using spoken tutorials and

gives certificates on passing online tests.

For more details, please write to us.

Slide Number 12

Forum for specific questions:

Do you have questions about THIS Spoken Tutorial?

Please visit this site.

Choose the minute and second where you have the question

Explain your question briefly

Someone from our team will answer them

Please post your timed queries in this forum.
Slide Number 13

Acknowledgements

The Spoken Tutorial project is funded by the Ministry of Education, Govt. of India
Slide Number 14

Thank you

This is Shraddha Kodavade, a FOSSEE Summer Fellow 2022 from IIT Bombay, signing off.

Thanks for joining.

Contributors and Content Editors

Madhurig