Difference between revisions of "Python-for-Automation/C3/Web-Scraping/English"
Revision as of 15:30, 15 October 2024
Visual Cue | Narration |
Show Slide:
Welcome |
Hello and welcome to the Spoken Tutorial on Web Scraping |
Show Slide:
Learning Objectives |
In this tutorial, we will learn to
|
Show Slide:
System Requirements
|
To record this tutorial, I am using
|
Show Slide:Pre-requisites | To follow this tutorial
|
Show Slide:Code Files |
|
Show Slide:
Web Scraping |
Web Scraping is the automated process of extracting data from websites using software.
We will automate the extraction of data from web pages by parsing their HTML content. |
Show Slide: Web Scraping - Libraries | To automate the process of extracting data from a website, we need:
|
Show Slide:
Web Scraping - Example |
For this tutorial, we will extract data from the Spoken Tutorial statistics webpage.
We will analyze workshops conducted between 2022 and 2023 on selected software. The fields handled include State, City, Institution, Department, Organizer, Date and Participants. |
Point to the webscraping.py in downloads folder
Open the Text Editor with the source file |
I have created the source file webscraping.py for demonstration.
Now, we will go through the source code in the text editor. |
Looking at the code | This source code will extract the necessary data, analyze it and plot graphs. |
Highlight:import requests
from bs4 import BeautifulSoup import pandas as pd from datetime import datetime import matplotlib.pyplot as plt from mpl_toolkits.mplot3d import Axes3D |
First we need to import the necessary modules for web scraping in Python. |
Highlight: | We fetch the HTML of the page with requests.get to the URL.
Then, we parse it with BeautifulSoup to return a soup object for further analysis. |
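The fetch-and-parse step described above could be sketched roughly as follows. The helper names `parse_html` and `fetch_soup`, the timeout value and the sample HTML are assumptions made for illustration; they are not taken from webscraping.py.

```python
import requests
from bs4 import BeautifulSoup


def parse_html(html):
    """Parse an HTML string into a BeautifulSoup object."""
    return BeautifulSoup(html, "html.parser")


def fetch_soup(url):
    """Download a page and return its parsed soup (hypothetical helper)."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early on HTTP errors such as 404
    return parse_html(response.text)


# Offline demonstration on a small, made-up HTML snippet.
sample_soup = parse_html("<html><body><h1>Statistics</h1></body></html>")
heading_text = sample_soup.h1.get_text()
```

Calling `fetch_soup` with a real URL needs network access, so the demonstration parses a local string instead.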
Highlight: | The extract_table_data function extracts relevant data from an HTML table in the soup object.
It first finds the table with the provided class name in the HTML. An empty list is returned if no table is found. |
Highlight: | We find all the rows of the table except the first header row.
Then, an empty list is initialized to store the extracted data. For each row in the table, we extract all cells and strip the text content from each cell. Then, we store the values in a list. |
Highlight: | We finally check if the FOSS type column matches any of the values in the foss filter.
Then, we convert the date column to a datetime object. This is to verify if it falls within the specified time range. If both conditions are met, the row's data is added to the list. |
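The extraction and filtering logic described above might look like the sketch below. The column positions (`FOSS_COL`, `DATE_COL`), the date format and the sample table are assumptions for illustration; the real page layout may differ.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Assumed column positions and date format; the real page may differ.
FOSS_COL, DATE_COL = 0, 1
DATE_FORMAT = "%Y-%m-%d"


def extract_table_data(soup, table_class, foss_filter, start_date, end_date):
    """Return the rows that match the FOSS filter and the date range."""
    table = soup.find("table", class_=table_class)
    if table is None:
        return []  # no matching table on this page

    data = []
    for row in table.find_all("tr")[1:]:  # skip the header row
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        if len(cells) <= max(FOSS_COL, DATE_COL):
            continue  # skip malformed or empty rows
        if cells[FOSS_COL] not in foss_filter:
            continue
        try:
            row_date = datetime.strptime(cells[DATE_COL], DATE_FORMAT)
        except ValueError:
            continue  # skip rows whose date cannot be parsed
        if start_date <= row_date <= end_date:
            data.append(cells)
    return data


# Made-up table used only to exercise the function offline.
sample_html = """
<table class="table">
  <tr><th>FOSS</th><th>Date</th><th>City</th></tr>
  <tr><td> Python </td><td>2022-03-10</td><td>Pune</td></tr>
  <tr><td>Java</td><td>2022-05-01</td><td>Delhi</td></tr>
  <tr><td>Python</td><td>2021-12-31</td><td>Mumbai</td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")
rows = extract_table_data(
    soup, "table", {"Python"}, datetime(2022, 1, 1), datetime(2023, 1, 1)
)
```

Only the first data row survives: the second fails the FOSS filter and the third falls outside the date range.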
Highlight: | This function locates the pagination element in the HTML and extracts the page numbers from it.
Pagination is the process of dividing content into discrete pages. If page numbers are found, it returns the last one as the total number of pages. If no pagination is found, it returns 1, assuming there's only one page. |
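A minimal sketch of this page-count logic is shown below. The function name `get_total_pages`, the `pagination` class name and the sample markup are assumptions for illustration.

```python
from bs4 import BeautifulSoup


def get_total_pages(soup, pagination_class="pagination"):
    """Return the last page number found, or 1 if there is no pagination."""
    pagination = soup.find(class_=pagination_class)
    if pagination is None:
        return 1
    # Keep only the links whose text is a number (ignores "Next", "Prev").
    page_numbers = [
        int(link.get_text(strip=True))
        for link in pagination.find_all("a")
        if link.get_text(strip=True).isdigit()
    ]
    return page_numbers[-1] if page_numbers else 1


# Made-up pagination markup for an offline check.
paged = BeautifulSoup(
    '<ul class="pagination">'
    '<li><a href="?page=1">1</a></li>'
    '<li><a href="?page=2">2</a></li>'
    '<li><a href="?page=7">7</a></li>'
    '<li><a href="?page=2">Next</a></li>'
    "</ul>",
    "html.parser",
)
unpaged = BeautifulSoup("<p>No pagination here</p>", "html.parser")
total_paged = get_total_pages(paged)
total_unpaged = get_total_pages(unpaged)
```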
Highlight: | The scrape_all_pages function scrapes data from all available pages.
First, it fetches the initial page’s content and then determines the number of pages. Finally, it extracts the relevant data based on the filters. |
Highlight: | Now, we iterate over all remaining pages starting from page 2.
First, we modify the URL to request each specific page, then fetch its content and extract the data. The data from each page is appended to the overall dataset, which is returned at the end. |
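The page loop described above can be sketched without any network access by passing the fetching, counting and extraction steps in as functions. The three callables, the `page` query parameter and the example.org URLs are assumptions for illustration only.

```python
def scrape_all_pages(base_url, fetch_page, count_pages, extract_rows):
    """Collect rows from every page.

    fetch_page, count_pages and extract_rows are hypothetical callables
    standing in for the fetching, pagination and extraction steps.
    """
    first_page = fetch_page(base_url)
    total_pages = count_pages(first_page)
    all_rows = list(extract_rows(first_page))

    for page_number in range(2, total_pages + 1):
        # Assumes the site accepts a "page" query parameter.
        separator = "&" if "?" in base_url else "?"
        page_url = f"{base_url}{separator}page={page_number}"
        all_rows.extend(extract_rows(fetch_page(page_url)))
    return all_rows


# Offline demonstration with fake pages keyed by URL.
fake_site = {
    "https://example.org/stats": {"pages": 3, "rows": [["a"]]},
    "https://example.org/stats?page=2": {"pages": 3, "rows": [["b"]]},
    "https://example.org/stats?page=3": {"pages": 3, "rows": [["c"]]},
}
result = scrape_all_pages(
    "https://example.org/stats",
    fetch_page=fake_site.__getitem__,
    count_pages=lambda page: page["pages"],
    extract_rows=lambda page: page["rows"],
)
```

Injecting the three steps keeps the loop testable; in a real script they would wrap the request, pagination and table-extraction code.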
Highlight: | Next, we will define functions for data analysis and data visualization.
The piechart_visualization function generates a pie chart showing the distribution of FOSS categories. Additionally, the FOSS counts are saved to an Excel sheet with the specified sheet name. |
Highlight: | The barchart_visualization function creates a bar chart showing the number of workshops per city.
It then writes the data to an Excel sheet with the specified sheet name. |
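The counting and plotting behind these two charts might be sketched as follows. The column names `FOSS` and `City` and the sample rows are assumptions, and the sketch draws both charts in one figure rather than in separate windows; writing to Excel is left out.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up workshop rows; the column names are assumptions.
df = pd.DataFrame(
    {
        "FOSS": ["Python", "Java", "Python", "Scilab"],
        "City": ["Pune", "Pune", "Delhi", "Pune"],
    }
)

# Count workshops per FOSS category and per city.
foss_counts = df["FOSS"].value_counts()
city_counts = df["City"].value_counts()

fig, (pie_ax, bar_ax) = plt.subplots(1, 2, figsize=(10, 4))

# Pie chart: share of each FOSS category as a percentage.
pie_ax.pie(foss_counts.values, labels=list(foss_counts.index), autopct="%1.1f%%")
pie_ax.set_title("Workshops by FOSS category")

# Bar chart: number of workshops in each city.
bar_ax.bar(list(city_counts.index), city_counts.values)
bar_ax.set_title("Workshops per city")
bar_ax.set_ylabel("Number of workshops")
```

Calling `plt.show()` afterwards would display the figure.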
Highlight: | We now define a function that filters the DataFrame to find workshops held at a specified college.
It then retrieves the unique FOSS types, departments, and organizers for that college. If no data is found for the given college, it returns None. Otherwise, it returns the unique FOSS categories, departments, and organizers. |
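A sketch of such a college filter is shown below. The function name `analyze_college`, the column names and the sample rows are assumptions for illustration.

```python
import pandas as pd


def analyze_college(df, college_name):
    """Return unique FOSS types, departments and organizers, or None."""
    college_df = df[df["Institution"] == college_name]
    if college_df.empty:
        return None  # no workshops recorded for this college
    return (
        college_df["FOSS"].unique().tolist(),
        college_df["Department"].unique().tolist(),
        college_df["Organizer"].unique().tolist(),
    )


# Made-up rows; the column names are assumptions.
workshops = pd.DataFrame(
    {
        "Institution": ["College A", "College A", "College B"],
        "FOSS": ["Python", "Java", "Python"],
        "Department": ["CSE", "CSE", "ECE"],
        "Organizer": ["Asha", "Ravi", "Asha"],
    }
)
found = analyze_college(workshops, "College A")
missing = analyze_college(workshops, "College C")
```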
Highlight: | We first convert the Participants column to numeric values, filling any invalid entries with 0.
Then we group the data by city and FOSS type and calculate the total number of participants. Unique city names and FOSS types are extracted, and numeric mappings are created to represent them. |
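The conversion, grouping and index mapping described above could look like this sketch. The column names and sample values are assumptions for illustration.

```python
import pandas as pd

# Made-up rows; "n/a" stands in for an invalid participant count.
df = pd.DataFrame(
    {
        "City": ["Pune", "Pune", "Delhi", "Delhi"],
        "FOSS": ["Python", "Python", "Java", "Python"],
        "Participants": ["40", "n/a", "25", "10"],
    }
)

# Invalid entries become NaN with errors="coerce", then are replaced by 0.
df["Participants"] = pd.to_numeric(df["Participants"], errors="coerce").fillna(0)

# Total participants for each (city, FOSS) pair.
grouped = df.groupby(["City", "FOSS"])["Participants"].sum().reset_index()

# Numeric positions for cities and FOSS types, for use as axis coordinates.
city_to_index = {city: i for i, city in enumerate(grouped["City"].unique())}
foss_to_index = {foss: i for i, foss in enumerate(grouped["FOSS"].unique())}
```

The numeric mappings are needed because a 3D bar chart takes numeric coordinates, not category names.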
Highlight: | We can now create a 3D bar chart visualizing the participants in workshops by city and FOSS type.
The axes are labeled, and ticks are mapped to cities and FOSS values with a title. |
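A 3D bar chart of this kind might be drawn as in the sketch below. The city names, FOSS types, participant totals and bar widths are made-up values for illustration.

```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (needed on older Matplotlib)

cities = ["Delhi", "Pune"]
foss_types = ["Java", "Python"]
# Each tuple: (city index, FOSS index, total participants); made-up values.
bars = [(0, 0, 25), (0, 1, 10), (1, 1, 40)]

xs = [bar[0] for bar in bars]
ys = [bar[1] for bar in bars]
heights = [bar[2] for bar in bars]

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection="3d")

# Bars start at z = 0, are 0.5 wide and deep, and rise to the participant count.
ax.bar3d(xs, ys, [0] * len(bars), 0.5, 0.5, heights)

# Map the numeric tick positions back to city and FOSS names.
ax.set_xticks(range(len(cities)))
ax.set_xticklabels(cities)
ax.set_yticks(range(len(foss_types)))
ax.set_yticklabels(foss_types)

ax.set_xlabel("City")
ax.set_ylabel("FOSS")
ax.set_zlabel("Participants")
ax.set_title("Participants by city and FOSS type")
```

Calling `plt.show()` afterwards would open the chart window.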
Highlight: | We define this function to ensure that an Excel sheet name does not exceed the 31-character limit.
If the name is longer than 31 characters, it truncates the name to 28 characters and appends an ellipsis of three dots. |
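The truncation rule can be sketched in a few lines. The function name `truncate_sheet_name` and the sample institute name are hypothetical.

```python
MAX_SHEET_NAME_LENGTH = 31  # Excel's limit for worksheet names


def truncate_sheet_name(name):
    """Shorten a sheet name to fit Excel's 31-character limit."""
    if len(name) <= MAX_SHEET_NAME_LENGTH:
        return name
    # 28 characters plus a three-dot ellipsis gives exactly 31.
    return name[:28] + "..."


short_name = truncate_sheet_name("City Visualization")
long_name = truncate_sheet_name("Example Institute of Engineering and Technology")
```

Here `long_name` becomes "Example Institute of Enginee...", which is exactly 31 characters long.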
Highlight: | We now define the base URL for scraping the data and set filters to focus on specific FOSS types.
We also set a date range from January 1, 2022 to January 1, 2023. We then convert the start and end dates from string format to datetime objects for comparison. |
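The date-range setup might look like the following sketch. The `YYYY-MM-DD` string format and the helper `in_range` are assumptions; the actual script may store its dates differently.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d"  # assumed format of the date strings

# Convert the range boundaries from strings to datetime objects.
start_date = datetime.strptime("2022-01-01", DATE_FORMAT)
end_date = datetime.strptime("2023-01-01", DATE_FORMAT)


def in_range(date_text):
    """Return True if the date string falls within the range (inclusive)."""
    return start_date <= datetime.strptime(date_text, DATE_FORMAT) <= end_date


inside = in_range("2022-06-15")
outside = in_range("2023-02-01")
```

Comparing datetime objects avoids the mistakes that plain string comparison can cause when formats differ.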
Highlight: | We now scrape all pages of data using the defined filters.
The scraped rows are then stored in a DataFrame with the specified columns, and duplicates are removed. |
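Building the DataFrame and dropping duplicates could be sketched as below. The column names are an assumed subset and the rows are made up.

```python
import pandas as pd

# Assumed subset of columns; the scraped table has more.
columns = ["FOSS", "City", "Date", "Participants"]

# Made-up scraped rows; the second row repeats the first.
scraped_rows = [
    ["Python", "Pune", "2022-03-10", "40"],
    ["Python", "Pune", "2022-03-10", "40"],
    ["Java", "Delhi", "2022-05-01", "25"],
]

df = pd.DataFrame(scraped_rows, columns=columns)
# Remove exact duplicate rows and renumber the index.
df = df.drop_duplicates().reset_index(drop=True)
```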
Highlight: | We can now write the DataFrame data to an Excel file and generate visualizations.
We also analyze and save workshop data for a specific college if available. |
Highlight: | A 3D visualization is generated from the DataFrame.
The data is saved and a confirmation message is printed. |
Save the Code in the Downloads Folder | Save the code as webscraping.py in the Downloads folder. |
Open the terminal (Ctrl + Alt + T)
Start Virtual Environment Type > source Automation/bin/activate |
Open the terminal by pressing Control + Alt + T keys simultaneously.
We will activate the virtual environment we created for the Automation series. Type source space Automation forward slash bin forward slash activate. Then press Enter. |
Running the Code
Type > cd Downloads > python3 webscraping.py |
Now type cd Downloads.
Then type python3 webscraping.py and press Enter. |
Observing the graphs: Pie Chart
|
As soon as we execute the code, matplotlib will display the graphs.
The pie chart displays the distribution of workshops by FOSS category. This shows the proportion of each FOSS type as a percentage of the total workshops. Close the window to see the next graph. |
Observing the graphs: Bar Chart
|
The next graph we get is a Bar chart.
This shows the number of workshops conducted in each city. Close the window to go to the next graph. |
Observing the graphs: 3D Bar Chart
|
Finally, we see the 3D bar chart.
This graph displays cities, FOSS types and participant counts on the three axes. Close the window.
|
Let us check the data in the Excel sheet. | |
Navigating to Downloads
Files App > Downloads > st_data.xlsx |
Go to the Downloads folder and double-click the st_data.xlsx file to open it. |
Observing the Excel sheet | We can see in the bottom left corner that we have created four sheets. |
Sheet 1 - Workshops Data
Zoom and show the data |
The first sheet has the raw data that we extracted from the Spoken Tutorial website.
We can see the 10 columns here which contain all the data the website had. |
Sheet 2 - FOSS Visualization | In the second sheet named FOSS Visualization, the count of workshops for each FOSS category is shown.
This is the data of the pie chart we have seen earlier. |
Sheet 3 - City Visualization | The third sheet City Visualization shows the number of workshops conducted per city.
|
Sheet 4 - Shri Phanishwar Nath Renu En... | The last sheet shows the unique data - FOSS type, organizers, workshop dates and participants. |
Closing the virtual environment
Type > deactivate |
Switch back to the terminal to close the virtual environment.
Type deactivate. |
Show Slide:Applications of Web Scraping | Web Scraping has lots of applications across various fields.
|
Show Slide:Applications of Web Scraping |
|
Show Slide:Summary | This brings us to the end of this tutorial. Let us summarise.
In this tutorial, we have learnt to
|
Show Slide:
Assignment |
As an assignment, please do the following:
|
Show Slide: About the Spoken Tutorial Project | The video at the following link summarises the Spoken Tutorial Project. Please download and watch it. |
Show Slide:
Spoken Tutorial Workshops |
The Spoken Tutorial Project team conducts workshops and gives certificates.
For more details, please write to us. |
Show Slide:Answers for THIS Spoken Tutorial | Please post your timed queries in this forum. |
Show Slide:
FOSSEE Forum |
For any general or technical questions on Python for Automation, visit the FOSSEE forum and post your question. |
Show Slide:Acknowledgement | The Spoken Tutorial Project was established by the Ministry of Education, Government of India. |
Show Slide: Thank You | This is Sai Sathwik, a FOSSEE Semester Long Intern 2024, IIT Bombay, signing off.
Thanks for joining. |