
Title of the script: Introduction to Machine Learning in R

Author: Debatosh Chakraborty

Keywords: R, RStudio, machine learning, supervised, unsupervised, video tutorial.


Visual Cue | Narration
Show slide

Opening Slide

Welcome to this spoken tutorial on Introduction to Machine Learning in R.
Show slide

Learning Objectives

In this tutorial, we will learn about:
  • Machine Learning
  • Supervised and Unsupervised Learning
  • Workflow of an ML Classifier Algorithm
  • Visualizing Feature Space
  • Constructing a dummy classifier
  • Evaluation of the chosen dummy classifier
Show slide

System Specifications

This tutorial is recorded using,
  • Windows 11
  • R version 4.3.0
  • RStudio version 2023.06.1

It is recommended to install R version 4.2.0 or higher.

Show slide

Prerequisites

https://spoken-tutorial.org

To follow this tutorial, the learner should know
  • Basic programming in R.
  • How to use the ggplot2 and dplyr packages.

If not, please access the relevant tutorials on this website.

Show slide

Machine Learning

About machine learning
  • ML enables computers to learn from data.
  • ML algorithms automatically learn patterns from data.
  • Their primary role is prediction, classification, or clustering of data.
  • ML is applied in many areas, for example natural language processing and image and speech recognition.
Show Slide

Types of Machine Learning

ML algorithms include the following types and tasks:
  • Supervised learning: Prediction and Classification
  • Unsupervised learning: Clustering
  • Semi-supervised learning
  • Reinforcement learning.

In this series, we will focus on Supervised and Unsupervised learning algorithms.

Show Slide

Supervised and Unsupervised Learning


Supervised learning: Labeled data
  • ML algorithms learn from the given features and their labels.
  • They then predict labels for unseen features.

Unsupervised learning: Unlabeled data

  • ML algorithms develop a mechanism to group similar observations into clusters.
  • They then label the clusters for future analysis.
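
To make the contrast concrete, here is a minimal sketch using R's built-in iris data (an assumption for illustration only; it is not this tutorial's dataset). An unsupervised method groups rows purely by feature similarity, without ever seeing the labels:

# Unsupervised: k-means groups rows by feature similarity alone (no labels used)
set.seed(2)
km <- kmeans(iris[, 1:4], centers = 3)
# Compare the discovered clusters with the true labels, which k-means never saw
table(km$cluster, iris$Species)

In supervised learning, by contrast, the labels (iris$Species) would be given to the algorithm during training.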
Show Slide

Classification and Regression

  • Supervised learning consists of Regression and Classification.
  • Regression is applied to predict and learn continuous-valued responses from features.
  • Regression techniques include Linear, Spline, Ridge, Lasso, and others.
  • Classification is applied to predict the class of a discrete (labeled) response from features.
  • Classification techniques include Logistic Regression, Decision Tree, SVM, and others.
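
As a minimal illustration of the two tasks, here is a sketch using R's built-in mtcars data (an assumption for illustration only; it is not part of this tutorial):

# Regression: predict a continuous response (miles per gallon) from a feature
reg <- lm(mpg ~ wt, data = mtcars)
# Classification: predict a discrete two-level response (transmission type) from a feature
clf <- glm(factor(am) ~ wt, data = mtcars, family = binomial)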
Show Slide

Workflow of an ML Classifier algorithm

The workflow of an ML Classifier algorithm includes:
  • Feature Space: Collection of all possible values of the features.
  • A classification algorithm partitions the feature space into regions corresponding to the classes.
  • Data is split into training and testing sets to learn and evaluate the algorithm.
  • The model learns from the training data to create partitions of feature space.
  • The model is evaluated on the test dataset through performance metrics.
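
This workflow can be sketched end to end in a few lines. The following is a hypothetical sketch on the built-in iris data, with a hand-made rule standing in for a learned model; the rest of this tutorial carries out these steps on the Raisin dataset:

# 1. Split the data into training and testing sets
set.seed(1)
idx <- sample(seq_len(nrow(iris)), size = 0.7 * nrow(iris))
train <- iris[idx, ]
test <- iris[-idx, ]
# 2. "Learn" a partition of the feature space (here, a fixed hand-made rule)
predict_rule <- function(d) {
  factor(ifelse(d$Petal.Length < 2.5, "setosa", "virginica"),
         levels = levels(iris$Species))
}
# 3. Evaluate the model on the test set with a performance metric (accuracy)
mean(predict_rule(test) == test$Species)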
Show Slide

Dataset

Let’s use the Raisin dataset with two chosen variables, or features, to understand a classification problem.

For more information on the Raisin dataset, please refer to the Additional Reading Material on this tutorial page.

Show slide

Download Files

We will use the script file Intro.R and the Raisin dataset file Raisin.xlsx.

Please download these files from the Code files link of this tutorial.

Make a copy and then use them while practicing.

[Computer screen]

point to Intro.R and the folder Introduction.

Point to the MLProject folder on the Desktop.

I have downloaded and moved these files to the Introduction folder.

This folder is located in the MLProject folder on my Desktop.

I have also set the Introduction folder as my working directory.

In this tutorial, we will introduce classification on the raisin dataset.

Let us switch to RStudio.
Click Intro.R in RStudio

Point to Intro.R in RStudio.

Let us open the script Intro.R in RStudio.

Script Intro.R opens in RStudio.

[RStudio]

Highlight the command library(readxl)

Highlight the command library(caret)

Highlight the command library(ggplot2)

#install.packages("package_name")


Point to the command.

Select and run these commands to import the packages.


We will use the readxl package to load the Excel file of our Raisin dataset.


We will use the caret package to create the confusion matrix.

The ggplot2 package will be used to create the decision boundary plot.

Please ensure that all the packages are installed correctly.

As I have already installed the packages, I have imported them directly.
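
If any of these packages are not installed, they can be installed first from the Console, for example:

install.packages(c("readxl", "caret", "ggplot2"))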

[RStudio]

Highlight the command

data<- read_xlsx("Raisin.xlsx")

Run this command to load the Raisin dataset.

Drag boundary to see the Environment tab clearly.

In the Environment tab below Data, you will see the data variable.

Click on data to load the dataset in the Source window.

Click on Intro.R in the Source window and close the tab.

Highlight the command.

data <- data[c("minorAL", "ecc", "class")]

data$class <- factor(data$class)

Select the commands and click the Run button

We now select three columns from data.

Two columns, minorAL and ecc, are chosen as features.

The class column is chosen as a target variable.

We convert the target variable data$class to a factor.

Select and run the commands.

Click on the Environment tab.

Click on data.

Click on data to load the modified data in the Source window.
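
If factors are new to you, a tiny standalone example shows what the conversion does (the values here are just the two class names from this dataset):

factor(c("Kecimen", "Besni", "Kecimen"))
# [1] Kecimen Besni   Kecimen
# Levels: Besni Kecimen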
We will now understand the feature space of this data.
range_minor_al <- range(data$minorAL)

range_ecc <- range(data$ecc)

In the Source window type these commands
Highlight the command

range_minor_al <- range(data$minorAL)

Highlight the command

range_ecc <- range(data$ecc)

These commands show the range of the feature variables minorAL and ecc.

Select and run the commands.


Drag boundary to see the environment tab clearly.


The minimum and maximum values of minorAL and ecc are stored in their respective range variables.

X <- seq(min(data$minorAL), max(data$minorAL), length.out = 100)

Y <- seq(min(data$ecc), max(data$ecc), length.out = 100)

feature <- expand.grid(minorAL = X, ecc = Y)

We will now use the range to generate grid points to construct the feature space.

In the Source window type these commands

Highlight

X <- seq(min(data$minorAL), max(data$minorAL), length.out = 100)

Y <- seq(min(data$ecc), max(data$ecc), length.out = 100)

Highlight

feature <- expand.grid(minorAL = X, ecc = Y)

These commands generate sequences of 100 evenly spaced points spanning the ranges of minorAL and ecc.

This command creates the Cartesian product of the two sequences to form a grid over the feature space.

Select and run the commands.
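
If expand.grid is unfamiliar, a tiny standalone example shows the idea (the names a and b are arbitrary):

expand.grid(a = 1:2, b = c("x", "y"))
#   a b
# 1 1 x
# 2 2 x
# 3 1 y
# 4 2 y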

ggplot(data = data, aes(x = minorAL, y = ecc)) +

geom_point(aes(color = class), size = 2) +

scale_fill_manual(values = c("#ffff46", "#FF46e9")) +

scale_color_manual(values = c("red", "blue")) +

labs(title = "Feature Space") +

theme_minimal()

We will now plot the data in the feature space we created.


In the Source window type these commands

ggplot(data = data, aes(x = minorAL, y = ecc)) +

geom_point(aes(color = class), size = 2) +

scale_fill_manual(values = c("#ffff46", "#FF46e9")) +

scale_color_manual(values = c("red", "blue")) +

labs(title = "Feature Space") +

theme_minimal()

These commands plot the data points in the feature space.

Select and run the commands.

Drag boundaries to see the plot window clearly.
Point to the data.

Now let us split our data into training and testing sets.
[RStudio]

set.seed(1)


index_split<- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE)

Click on Intro.R in the Source window, and type these commands.

Highlight the command

set.seed(1)


Highlight the command

index_split<- sample(1:nrow(data),size=0.7*nrow(data),replace=FALSE)

Select the commands and run them.

set.seed(1) makes the random split reproducible. sample picks 70% of the row indices without replacement.

[RStudio]

train_data <- data[index_split, ]


test_data <- data[-c(index_split), ]

In the Source window type these commands
Highlight the command

train_data <- data[index_split, ]

Highlight the command

test_data <- data[-c(index_split), ]

This creates the training data, consisting of 630 rows.


This creates the testing data, consisting of the remaining 270 rows.

Select the commands and click the Run button.


Point to the sets in the Environment tab.

Click test_data and train_data.

The data sets are shown in the Environment tab.

Drag boundary to see the Environment tab clearly


Click on test_data and train_data to load them in the Source window.
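
You can also verify the split sizes in the Console. This assumes, as stated above, that the full dataset has 900 rows, so a 70% split gives 630 and 270 rows:

nrow(train_data) # 630
nrow(test_data)  # 270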

Here we try to partition the feature space to construct the classifier.

To begin with, one might construct a heuristic line as a classifier.

[Rstudio]

fit <- function(x)((x * (-0.0021)) + 1.445)

model_predict <- function(x){

factor(ifelse(x$ecc < fit(x$minorAL), "Kecimen", "Besni"))

}

In the Source window type these commands.
Highlight the command

fit <- function(x)((x * (-0.0021)) + 1.445)

Highlight the command

model_predict <- function(x){

factor(ifelse(x$ecc < fit(x$minorAL), "Kecimen", "Besni"))

}

Click the Save and Run buttons.

Let us describe the steps of the classification algorithm.

For that, we define a line that partitions the data, as a dummy classifier.

It is not learned from the training data, so its performance may be poor.

We define a function that assigns each data point to a class depending on which side of the line it falls.

Click Save.

Select and run the commands.
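
As a quick sanity check, you can apply the classifier to a single made-up point (the values below are hypothetical, for illustration only):

fit(250) # height of the line at minorAL = 250, i.e. 0.92
model_predict(data.frame(minorAL = 250, ecc = 0.70)) # "Kecimen", since 0.70 < 0.92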

feature$class <- model_predict(feature)

feature$classnum <- as.numeric(feature$class)

Let’s use the line to classify the feature space and draw the decision boundary.

In the Source window type these commands

Highlight

feature$class <- model_predict(feature)

Highlight

feature$classnum <- as.numeric(feature$class)

This command uses the line to predict the class of every grid point in the feature space.

This command encodes the class string labels as numbers, suitable for plotting.

Select and run the commands.

Click on feature in the Environment tab.

Point to the data in the Source window.

Drag boundary to see the Environment window.

Click on feature in the Environment tab.

The feature set with the predicted classes loads in the Source window.

ggplot() +

geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +

geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +

geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+

scale_fill_manual(values = c("#ffff46", "#FF46e9")) +

scale_color_manual(values = c("red", "blue")) +

labs(title = "Data Boundary") +

theme_minimal()

In the Source window type these commands
Highlight the command

ggplot() +

geom_raster(data= feature, aes(x=minorAL, y=ecc, fill = class),alpha=0.3) +

geom_point(data = data, aes(x = minorAL, y = ecc, color = class), size = 2) +

geom_abline(slope = -0.0021, intercept = 1.445, size = 1.2)+

scale_fill_manual(values = c("#ffff46", "#FF46e9")) +

scale_color_manual(values = c("red", "blue")) +

labs(title = "Data Boundary") +

theme_minimal()

We visualise the feature space and the partition line using ggplot2.

Select and run the commands.

Drag boundary to see the plot window clearly.

The plot shows that the chosen line approximately separates the two classes.

prediction_test = model_predict(test_data)

Let us see how well the partition performs on the testing dataset.

In the Source window type this command

Highlight the command

prediction_test = model_predict(test_data)

We predict the classes of the testing data and store them in the prediction_test variable.

Select and run the command.

Let us now measure the performance of the classification.
[RStudio]

test_confusion_matrix <- confusionMatrix(prediction_test, test_data$class)

In the Source window, type the command

Highlight the command

test_confusion_matrix <- confusionMatrix(prediction_test, test_data$class)

Click on Save and Run buttons.

We use the confusionMatrix function from the caret package to compute the confusion matrix and performance metrics. Its first argument is the predicted classes and its second is the true classes.

Select and run the command.

test_confusion_matrix$overall["Accuracy"]

In the Source window, type this command
Highlight

test_confusion_matrix$overall["Accuracy"]

This fetches the accuracy metric from the list created by confusionMatrix.

Select and run the command

Drag boundary to see the console window clearly
Highlight

Accuracy

0.6962963

The accuracy on the testing dataset is about 69.6%.
Drag boundary to see the Source window clearly.

Let us now view the confusion matrix of the testing dataset.

[RStudio]

test_confusion_matrix$table

In the Source window type this command
Highlight the command

test_confusion_matrix$table

Click on Save and Run buttons.

Select and run the command.

The output is seen in the console window.

Point to the output in the console window.

Reference

Prediction Besni Kecimen

Besni 50 0

Kecimen 82 138

Drag boundary to see the console window clearly

Observe that:

82 samples of class Besni have been incorrectly classified as Kecimen.

0 samples of class Kecimen have been incorrectly classified.
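
The accuracy reported earlier follows directly from this table, since the correctly classified samples lie on the diagonal:

(50 + 138) / (50 + 0 + 82 + 138) # = 188 / 270, approximately 0.6963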

We can see that our partition line is skewed.

Many different partitions can be drawn for the same problem.

We could choose a complicated partition to reduce the training misclassification error.

But that gives no control over the error on the test data.

We should instead aim for a classifier that is simple and has a small test misclassification error.

With this, we come to the end of this tutorial.

Let us summarize.

Show Slide

Summary

In this tutorial we have learned about:
  • Machine Learning
  • Classification and Regression Problems
  • Workflow of an ML Classifier Algorithm
  • Visualizing Feature Space
  • Constructing a dummy classifier
  • Evaluation of an ML algorithm
Here is an assignment for you.
Show Slide

Assignment

  • Use a vertical line as a classifier to partition the feature space.
  • Plot the decision boundary for the same.
  • Evaluate the classifier on the test dataset.
Show slide

About the Spoken Tutorial Project

The video at the following link summarizes the Spoken Tutorial project.

Please download and watch it.

Show slide

Spoken Tutorial Workshops

We conduct workshops using Spoken Tutorials and give certificates.


Please contact us.

Show Slide

Spoken Tutorial Forum to answer questions

Do you have questions in THIS Spoken Tutorial?

Choose the minute and second where you have the question.

Explain your question briefly.

Someone from our team will answer them.

Please visit this site.

Please post your timed queries in this forum.


Show Slide

Forum to answer questions

Do you have any general/technical questions?

Please visit the forum given in the link.

Show Slide

R Activities

The FOSSEE team coordinates the Textbook Companion, Lab Migration and the Case Study Projects.

We give certificates to those who do this.

For more details, please visit the website.

Show Slide

Acknowledgment

The Spoken Tutorial project was established by the Ministry of Education, Government of India.
Show Slide

Thank You

This tutorial is contributed by Debatosh Chakraborty from IIT Bombay.

Thank you for joining.
