Difference between revisions of "Python-for-Automation/C3/Web-Scraping/English"

(Created page with " {| border="1" |- || '''Visual Cue''' || '''Narration''' |- |- style="border:1pt solid #000000;padding:0.176cm;" || Show Slide: '''Welcome''' || Hello and welcome to the Spok...")
 
 
(One intermediate revision by the same user not shown)
Line 5: Line 5:
 
|| '''Narration'''
 
|| '''Narration'''
 
|-
 
|-
|- style="border:1pt solid #000000;padding:0.176cm;"
+
 
 
|| Show Slide:
 
|| Show Slide:
  
 
'''Welcome'''
 
'''Welcome'''
 
|| Hello and welcome to the Spoken Tutorial on '''Web Scraping'''  
 
|| Hello and welcome to the Spoken Tutorial on '''Web Scraping'''  
|- style="border:1pt solid #000000;padding:0.176cm;"
+
|-  
 
|| Show Slide:
 
|| Show Slide:
  
 
'''Learning Objectives'''
 
'''Learning Objectives'''
 
|| In this tutorial, we will learn to  
 
|| In this tutorial, we will learn to  
* <div style="margin-left:1.27cm;margin-right:0cm;">Scrape data from any website</div>
+
* Scrape data from any website
* <div style="margin-left:1.27cm;margin-right:0cm;">Extract it to a CSV file</div>
+
* Extract it to a '''CSV''' file
* <div style="margin-left:1.27cm;margin-right:0cm;">Perform basic data analysis</div>
+
* Perform basic data analysis and
* <div style="margin-left:1.27cm;margin-right:0cm;">Generate visualizations</div>
+
* Generate visualizations
  
|- style="border:1pt solid #000000;padding:0.176cm;"
+
|-  
 
|| Show Slide:
 
|| Show Slide:
  
 
'''System Requirements'''
 
'''System Requirements'''
* <div style="margin-left:1.27cm;margin-right:0cm;">'''Ubuntu''' '''Linux OS 22.04'''</div>
+
* '''Ubuntu''' '''Linux OS 22.04'''
* <div style="margin-left:1.27cm;margin-right:0cm;">'''Python 3.12.3'''</div>
+
* '''Python 3.12.3'''
  
 
|| To record this tutorial, I am using
 
|| To record this tutorial, I am using
* <div style="margin-left:1.27cm;margin-right:0cm;">'''Ubuntu''' '''Linux OS version 22.04'''</div>
+
* '''Ubuntu''' '''Linux OS version 22.04''' and
* <div style="margin-left:1.27cm;margin-right:0cm;">'''Python 3.12.3'''</div>
+
* '''Python 3.12.3'''
  
|- style="border:1pt solid #000000;padding:0.176cm;"
+
|-  
 
|| Show Slide:'''Pre-requisites'''
 
|| Show Slide:'''Pre-requisites'''
  
 
[https://www.spoken-tutorial.org/ https://www.spoken-tutorial.org]
 
[https://www.spoken-tutorial.org/ https://www.spoken-tutorial.org]
 
|| To follow this tutorial  
 
|| To follow this tutorial  
* <div style="margin-left:1.27cm;margin-right:0cm;">You must have basic knowledge of using Linux Terminal and Python</div>
+
* You must have basic knowledge of using Linux Terminal and Python
* <div style="margin-left:1.27cm;margin-right:0cm;">For pre-requisite Linux and Python Tutorials, please visit this website</div>
+
* For pre-requisite Linux and Python Tutorials, please visit this website
* <div style="margin-left:1.27cm;margin-right:0cm;">Python libraries required for automation must be installed</div>
+
* Python libraries required for automation must be installed
  
|- style="border:1pt solid #000000;padding:0.176cm;"
+
|-  
 
|| Show Slide:'''Code Files'''
 
|| Show Slide:'''Code Files'''
 
||
 
||
* <div style="margin-left:1.27cm;margin-right:0cm;">The files used in this tutorial are provided in the Code files''' '''link.</div>
+
* The files used in this tutorial are provided in the '''Code files''' link.
* <div style="margin-left:1.27cm;margin-right:0cm;">Please download and extract the files.</div>
+
* Please download and extract the files.
* <div style="margin-left:1.27cm;margin-right:0cm;">Make a copy and then use them while practicing.</div>
+
* Make a copy and then use them while practicing.
  
|- style="border:1pt solid #000000;padding:0.176cm;"
+
|-  
 
|| Show Slide:
 
|| Show Slide:
  
 
'''Web Scraping'''
 
'''Web Scraping'''
|| '''Web Scraping''' is the''' '''automated process of extracting data from websites with software.
+
|| '''Web Scraping''' is the automated process of extracting data from websites with software.
  
 
We will automate extracting data and information from web pages and parsing '''HTML '''content.
 
We will automate extracting data and information from web pages and parsing '''HTML '''content.
 
|-
 
|-
| style="border-top:1pt solid #000000;border-bottom:0.5pt solid #000000;border-left:0.5pt solid #000000;border-right:1pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.092cm;padding-right:0.191cm;" | Show Slide:
+
|| Show Slide:
  
'''Web Scraping - Libraries'''<div style="margin-left:1.27cm;margin-right:0cm;"></div>
+
'''Web Scraping - Libraries'''
| style="border-top:1pt solid #000000;border-bottom:0.5pt solid #000000;border-left:0.5pt solid #000000;border-right:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.092cm;padding-right:0.191cm;" | To automate the process of extracting multimedia from a website, we need:
+
|| To automate the process of extracting data from a website, we need:
* <div style="margin-left:1.27cm;margin-right:0cm;">'''Requests '''library to fetch HTML content from a web page </div>
+
* '''Requests '''library to fetch '''HTML''' content from a web page  
* <div style="margin-left:1.27cm;margin-right:0cm;">'''BeautifulSoup '''library to parse and extract information from the HTML content</div>
+
* '''BeautifulSoup '''library to parse and extract information from the HTML content
* <div style="margin-left:1.27cm;margin-right:0cm;">'''Matplotlib''' library to create static, animated, and interactive visualizations</div>
+
* '''Matplotlib''' library to create static, animated, and interactive visualizations
* <div style="margin-left:1.27cm;margin-right:0cm;">'''Pandas''' library to provide data structures and data analysis tools </div>
+
* '''Pandas''' library to provide data structures and data analysis tools  
* <div style="margin-left:1.27cm;margin-right:0cm;">'''Xlsxwriter''' library is used for creating and formatting Excel files</div>
+
* '''XlsxWriter''' library to create and format '''Excel''' files
* <div style="margin-left:1.27cm;margin-right:0cm;">'''Datetime '''library handles date operations like parsing strings into date objects</div>
+
* '''Datetime''' library to handle date operations such as parsing strings into date objects
|- style="border:1pt solid #000000;padding:0.176cm;"
+
|-  
 
|| Show Slide:  
 
|| Show Slide:  
  
 
'''Web Scraping - Example'''
 
'''Web Scraping - Example'''
|| For this tutorial, we will extract data from the spoken Tutorial '''statistics '''webpage.
+
|| For this tutorial, we will extract data from the Spoken Tutorial '''statistics''' webpage.
 +
 
 
We analyze workshops on certain '''software''' conducted between 2022 and 2023.
 
We analyze workshops on certain '''software''' conducted between 2022 and 2023.
  
 
Data such as State, City, Institution, Department, Organizer, Date and Participants are handled.
 
Data such as State, City, Institution, Department, Organizer, Date and Participants are handled.
  
|- style="border:1pt solid #000000;padding:0.176cm;"
+
|-  
 
|| Point to the '''webscraping.py''' in downloads folder
 
|| Point to the '''webscraping.py''' in downloads folder
  
Line 81: Line 82:
  
 
Now, we will go through the source code in the text editor.  
 
Now, we will go through the source code in the text editor.  
|- style="border:1pt solid #000000;padding:0.176cm;"
+
|-  
 
|| Looking at the code
 
|| Looking at the code
 
|| This source code will extract the necessary data, analyze it and plot graphs.
 
|| This source code will extract the necessary data, analyze it and plot graphs.
|- style="border:1pt solid #000000;padding:0.176cm;"
+
|-  
 
|| Highlight:'''import requests'''
 
|| Highlight:'''import requests'''
  
Line 97: Line 98:
 
'''from mpl_toolkits.mplot3d import Axes3D'''
 
'''from mpl_toolkits.mplot3d import Axes3D'''
 
|| First we need to import the necessary modules for '''web scraping''' in '''Python'''.  
 
|| First we need to import the necessary modules for '''web scraping''' in '''Python'''.  
|- style="border:1pt solid #000000;padding:0.176cm;"
+
|-  
 
|| Highlight:
 
|| Highlight:
 
|| We fetch the '''HTML '''of the page with''' requests.get''' to the '''URL'''.
 
|| We fetch the '''HTML '''of the page with''' requests.get''' to the '''URL'''.
  
 
Then, we parse it with '''BeautifulSoup '''to return a '''soup object''' for further analysis.
 
Then, we parse it with '''BeautifulSoup '''to return a '''soup object''' for further analysis.
|- style="border:1pt solid #000000;padding:0.176cm;"
+
|-  
 
|| Highlight:
 
|| Highlight:
 
 
|| '''extract_table_data''' '''function '''extracts relevant data from an '''HTML '''table in the''' soup object'''.
 
|| '''extract_table_data''' '''function '''extracts relevant data from an '''HTML '''table in the''' soup object'''.
  
Line 110: Line 110:
  
 
An empty '''list '''is returned if no table is found.
 
An empty '''list '''is returned if no table is found.
|- style="border:1pt solid #000000;padding:0.176cm;"
+
|-  
 
|| Highlight:
 
|| Highlight:
 
 
|| We find all the '''rows '''of the table except the first header row.
 
|| We find all the '''rows '''of the table except the first header row.
  
Line 120: Line 119:
  
 
Then, we store the values in a '''list'''.
 
Then, we store the values in a '''list'''.
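
The table extraction described above can be sketched as follows. This is a minimal illustration and not the code from '''webscraping.py''': the sample HTML, its column names and the function body are assumptions, and only the name '''extract_table_data''' comes from the narration. The built-in '''html.parser''' backend is used so that the sketch runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the statistics page.
SAMPLE_HTML = """
<table>
  <tr><th>State</th><th>City</th><th>FOSS</th></tr>
  <tr><td>Bihar</td><td>Patna</td><td>Python</td></tr>
  <tr><td>Goa</td><td>Panaji</td><td>Java</td></tr>
</table>
"""


def extract_table_data(soup):
    """Return one list of cell texts per data row, or [] if no table exists."""
    table = soup.find("table")
    if table is None:
        return []
    rows = table.find_all("tr")[1:]  # skip the first (header) row
    data = []
    for row in rows:
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        if cells:
            data.append(cells)
    return data


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = extract_table_data(soup)
empty = extract_table_data(BeautifulSoup("<p>no table</p>", "html.parser"))
```

Here '''rows''' holds one list per data row, and '''empty''' is an empty list because the second document has no table.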
|- style="border:1pt solid #000000;padding:0.176cm;"
+
|-  
 
|| Highlight:
 
|| Highlight:
|| We finally check if the '''FOSS '''type column matches any of the values in the '''foss filter.'''
+
|| We finally check if the '''FOSS ''' type column matches any of the values in the '''foss filter.'''
  
 
Then, we convert the '''date '''column to a '''datetime object.'''
 
Then, we convert the '''date '''column to a '''datetime object.'''
Line 129: Line 128:
  
 
If both conditions are met, the '''row's '''data is added to the '''list'''.
 
If both conditions are met, the '''row's '''data is added to the '''list'''.
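
A small sketch of the two checks, assuming the second condition is that the workshop date falls between the start and end dates. The filter values, the date format and the helper name '''keep_row''' are hypothetical and are not taken from the tutorial's code.

```python
from datetime import datetime

# Hypothetical filter values; the tutorial's actual filters are not shown here.
FOSS_FILTER = {"Python", "Java"}
START_DATE = datetime(2022, 1, 1)
END_DATE = datetime(2023, 12, 31)
DATE_FORMAT = "%d-%m-%Y"  # assumed layout of the date column


def keep_row(foss, date_text):
    """True when the FOSS matches the filter and the date lies in the range."""
    if foss not in FOSS_FILTER:
        return False
    try:
        workshop_date = datetime.strptime(date_text, DATE_FORMAT)
    except ValueError:
        return False  # rows with unparseable dates are skipped
    return START_DATE <= workshop_date <= END_DATE


rows = [
    ("Python", "15-03-2022"),
    ("Java", "01-02-2024"),
    ("Scilab", "10-10-2022"),
    ("Python", "not a date"),
]
kept = [row for row in rows if keep_row(*row)]
```

Only the first row passes both checks. The others fail on the date range, the FOSS filter or the date format.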
|- style="border:1pt solid #000000;padding:0.176cm;"
+
|-  
 
|| Highlight:
 
|| Highlight:
 
|| This function locates the '''pagination element''' in the '''HTML''' and extracts the page numbers from it.
 
|| This function locates the '''pagination element''' in the '''HTML''' and extracts the page numbers from it.
Line 138: Line 137:
  
 
If no '''pagination '''is found, it returns 1, assuming there's only one page.  
 
If no '''pagination '''is found, it returns 1, assuming there's only one page.  
|- style="border:1pt solid #000000;padding:0.176cm;"
+
|-  
 
|| Highlight:
 
|| Highlight:
 
|| '''scrape_all_pages''' '''function '''scrapes data from all available pages.
 
|| '''scrape_all_pages''' '''function '''scrapes data from all available pages.
Line 145: Line 144:
  
 
Finally we extract the relevant data based on the '''filters'''.
 
Finally we extract the relevant data based on the '''filters'''.
|- style="border:1pt solid #000000;padding:0.176cm;"
+
|-  
 
|| Highlight:
 
|| Highlight:
 
|| Now, we iterate over all remaining pages starting from page 2.
 
|| Now, we iterate over all remaining pages starting from page 2.
Line 152: Line 151:
  
 
The data from each page is then appended to the overall '''dataset '''and returned.
 
The data from each page is then appended to the overall '''dataset '''and returned.
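
The page loop can be sketched as below. The signature is an assumption: here '''scrape_all_pages''' receives a hypothetical '''fetch_page''' callable and the page count, so the sketch runs without network access. The tutorial's own function may take different arguments.

```python
def scrape_all_pages(fetch_page, total_pages):
    """Collect rows from page 1, then from pages 2 to total_pages."""
    all_data = list(fetch_page(1))
    for page in range(2, total_pages + 1):
        all_data.extend(fetch_page(page))  # append each page's rows
    return all_data


# Stand-in for a real HTTP fetch, so the sketch runs offline.
FAKE_SITE = {1: [["Patna", "Python"]], 2: [["Panaji", "Java"]], 3: []}


def fake_fetch(page):
    return FAKE_SITE.get(page, [])


dataset = scrape_all_pages(fake_fetch, total_pages=3)
```

Passing the fetch step in as a callable keeps the loop easy to test. A real fetch would request each page and run the table extraction on it.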
|- style="border:1pt solid #000000;padding:0.176cm;"
+
|-  
 
|| Highlight:
 
|| Highlight:
 
|| Next, we will define functions for''' data analysis and data visualization'''.  
 
|| Next, we will define functions for''' data analysis and data visualization'''.  
  
'''piechart_visualization''' '''function '''generates a pie chart showing the '''FOSS categories'''.
+
'''piechart_visualization function '''generates a pie chart showing the '''FOSS categories'''.
  
 
Additionally, the '''FOSS '''counts are saved to an '''Excel sheet '''with the specified sheet name.
 
Additionally, the '''FOSS '''counts are saved to an '''Excel sheet '''with the specified sheet name.
|- style="border:1pt solid #000000;padding:0.176cm;"
+
|-  
 
|| Highlight:
 
|| Highlight:
|| '''barchart_visualization''' '''function '''creates a '''bar chart''' showing the number of '''workshops '''per city.
+
|| '''barchart_visualization function '''creates a '''bar chart''' showing the number of '''workshops '''per city.
  
 
It then writes the data to an '''excel sheet''' with the specified sheet name.
 
It then writes the data to an '''excel sheet''' with the specified sheet name.
|- style="border:1pt solid #000000;padding:0.176cm;"
+
|-  
 
|| Highlight:
 
|| Highlight:
  
|| We now define a '''function '''to''' '''filter the '''data frame''' to find '''workshops '''held at a specified college.
+
|| We now define a '''function '''to filter the '''data frame''' to find '''workshops '''held at a specified college.
  
 
It then counts and retrieves the unique '''FOSS types''', departments, and organizers.
 
It then counts and retrieves the unique '''FOSS types''', departments, and organizers.
  
If no data is found for the given college, it returns None.
+
If no data is found for the given college, it returns '''None'''.
  
 
Otherwise, it returns the unique '''FOSS categories''', departments, and organizers.
 
Otherwise, it returns the unique '''FOSS categories''', departments, and organizers.
|- style="border:1pt solid #000000;padding:0.176cm;"
+
|-  
 
|| Highlight:
 
|| Highlight:
 
|| We first convert the '''Participants''' column to numeric values, replacing any invalid entries with 0.
 
|| We first convert the '''Participants''' column to numeric values, replacing any invalid entries with 0.
  
Then we group the '''data '''by city and '''FOSS type''' and calculate the total number of participants.
+
Then we group the '''data '''by city and '''FOSS type''' and calculate the number of participants.
  
 
Unique city names and''' FOSS types''' are extracted, and numeric mappings are created to represent them.
 
Unique city names and''' FOSS types''' are extracted, and numeric mappings are created to represent them.
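
A minimal '''pandas''' sketch of these steps, using made-up sample rows. The column names '''City''', '''FOSS''' and '''Participants''' and the variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up sample data; "n/a" stands for an invalid participant count.
df = pd.DataFrame({
    "City": ["Patna", "Patna", "Panaji", "Patna"],
    "FOSS": ["Python", "Python", "Java", "Java"],
    "Participants": ["30", "n/a", "25", "10"],
})

# Invalid entries become NaN and are then replaced with 0.
df["Participants"] = pd.to_numeric(df["Participants"], errors="coerce").fillna(0)

# Total participants for each (city, FOSS type) pair.
grouped = df.groupby(["City", "FOSS"])["Participants"].sum().reset_index()

# Numeric positions for the axes of a 3D chart.
cities = sorted(grouped["City"].unique())
foss_types = sorted(grouped["FOSS"].unique())
city_index = {city: i for i, city in enumerate(cities)}
foss_index = {foss: i for i, foss in enumerate(foss_types)}

totals = {(r.City, r.FOSS): r.Participants for r in grouped.itertuples()}
```

The two dictionaries turn text labels into axis positions, which a 3D bar chart needs because it plots numbers rather than strings.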
|- style="border:1pt solid #000000;padding:0.176cm;"
+
|-  
 
|| Highlight:
 
|| Highlight:
 
|| We can now create a '''3D bar chart''' visualizing the participants in workshops by city and '''FOSS type'''.
 
|| We can now create a '''3D bar chart''' visualizing the participants in workshops by city and '''FOSS type'''.
Line 187: Line 186:
 
The axes are labeled, and ticks are mapped to cities and '''FOSS '''values with a title.
 
The axes are labeled, and ticks are mapped to cities and '''FOSS '''values with a title.
  
|- style="border:1pt solid #000000;padding:0.176cm;"
+
|-  
 
|| Highlight:
 
|| Highlight:
|| We define this '''function to '''ensure that an excel sheet name does not exceed the 31 character limit.
+
|| We define this '''function''' to ensure that an '''Excel''' sheet name does not exceed the 31-character limit.
  
 
If the name is longer than 31 characters, it '''truncates '''the name to 28 characters with '''ellipsis'''.
 
If the name is longer than 31 characters, it '''truncates '''the name to 28 characters with '''ellipsis'''.
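
The truncation rule can be sketched as follows. The function name '''safe_sheet_name''' and the sample names are hypothetical. Keeping 28 characters and adding a three-character ellipsis gives exactly 31 characters.

```python
MAX_SHEET_NAME = 31  # Excel's limit on worksheet name length


def safe_sheet_name(name):
    """Shorten names over 31 characters to 28 characters plus '...'."""
    if len(name) > MAX_SHEET_NAME:
        return name[:28] + "..."
    return name


short = safe_sheet_name("City Visualization")
long_name = safe_sheet_name("Example Institute of Engineering and Technology")
```

A short name is returned unchanged, while a long one is cut so that it still fits the limit.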
|- style="border:1pt solid #000000;padding:0.176cm;"
+
|-  
 
|| Highlight:
 
|| Highlight:
 
|| We now define the base '''url '''for scraping the data and set filters to focus on specific '''FOSS types.'''
 
|| We now define the base '''url '''for scraping the data and set filters to focus on specific '''FOSS types.'''
Line 199: Line 198:
  
 
We then convert the start and end dates from string format to '''datetime objects''' for comparison.
 
We then convert the start and end dates from string format to '''datetime objects''' for comparison.
|- style="border:1pt solid #000000;padding:0.176cm;"
+
|-  
 
|| Highlight:  
 
|| Highlight:  
 
|| We now scrape all pages of data using the defined filters and store it in a '''DataFrame. '''
 
|| We now scrape all pages of data using the defined filters and store it in a '''DataFrame. '''
  
 
The '''DataFrame '''is then created with specified '''columns''', and duplicates are removed.
 
The '''DataFrame '''is then created with specified '''columns''', and duplicates are removed.
|- style="border:1pt solid #000000;padding:0.176cm;"
+
|-  
 
|| Highlight:
 
|| Highlight:
 
|| We can now write the '''DataFrame '''data to an''' Excel file''' and generate visualizations.
 
|| We can now write the '''DataFrame '''data to an''' Excel file''' and generate visualizations.
  
 
We also analyze and save '''workshop '''data for a specific college if available.
 
We also analyze and save '''workshop '''data for a specific college if available.
|- style="border:1pt solid #000000;padding:0.176cm;"
+
|-  
 
|| Highlight:
 
|| Highlight:
 
|| A '''3D visualization''' is generated from the '''DataFrame'''.  
 
|| A '''3D visualization''' is generated from the '''DataFrame'''.  
Line 215: Line 214:
 
The data is saved and a confirmation message is printed.
 
The data is saved and a confirmation message is printed.
 
|-
 
|-
| style="border-top:1pt solid #000000;border-bottom:0.5pt solid #000000;border-left:0.5pt solid #000000;border-right:1pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.092cm;padding-right:0.191cm;" | Save the Code in the '''Downloads '''Folder
+
|| Save the Code in the '''Downloads '''Folder
| style="border-top:1pt solid #000000;border-bottom:0.5pt solid #000000;border-left:0.5pt solid #000000;border-right:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.092cm;padding-right:0.191cm;" | Save the code as '''webscraping.py '''in the '''Downloads '''folder.
+
|| Save the code as '''webscraping.py '''in the '''Downloads '''folder.
 
|-
 
|-
| style="border-top:1pt solid #000000;border-bottom:0.5pt solid #000000;border-left:0.5pt solid #000000;border-right:1pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.092cm;padding-right:0.191cm;" | Open the terminal ('''Ctrl + Alt + T''')
+
||Open the terminal ('''Ctrl + Alt + T''')
  
 
Start Virtual Environment
 
Start Virtual Environment
Line 225: Line 224:
  
 
'''> source Automation/bin/activate'''
 
'''> source Automation/bin/activate'''
| style="border-top:1pt solid #000000;border-bottom:0.5pt solid #000000;border-left:0.5pt solid #000000;border-right:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.092cm;padding-right:0.191cm;" | Open the '''terminal''' by pressing '''Control + Alt + T '''keys simultaneously.
+
|| Open the '''terminal''' by pressing '''Control + Alt + T '''keys simultaneously.
  
 
We will open the virtual environment we created for the '''Automation''' series.
 
We will open the virtual environment we created for the '''Automation''' series.
Line 233: Line 232:
 
Then press enter.
 
Then press enter.
 
|-
 
|-
| style="border-top:1pt solid #000000;border-bottom:0.5pt solid #000000;border-left:0.5pt solid #000000;border-right:1pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.092cm;padding-right:0.191cm;" | Running the Code
+
|| Running the Code
  
 
Type  
 
Type  
  
'''> <span style="background-color:#ffffff;">cd Downloads'''</span>
+
''' cd Downloads'''
  
 
'''> python3 webscraping.py'''
 
'''> python3 webscraping.py'''
|| <span style="background-color:#ffffff;color:#252525;">Now type, </span><span style="background-color:#ffffff;color:#252525;">'''cd Downloads'''</span><span style="background-color:#ffffff;color:#252525;">.</span>
+
|| Now type, '''cd Downloads'''
  
<span style="color:#252525;">Then type</span><span style="color:#252525;">''' python3 </span>webscraping<span style="color:#252525;">.py'''</span><span style="background-color:#ffffff;color:#252525;"> and press </span><span style="color:#252525;">'''Enter'''</span><span style="background-color:#ffffff;color:#252525;"> </span>
+
Then type''' python3 webscraping.py''' and press '''Enter'''
 
|-
 
|-
|| Observing the graphs# <div style="margin-left:1.27cm;margin-right:0cm;">'''Pie Chart'''</div>
+
|| Observing the graphs# '''Pie Chart'''
 +
|| As soon as we execute the code, '''matplotlib '''will display the graphs.
  
 +
The '''pie chart''' displays the distribution of '''workshops ''' by '''FOSS category'''
  
|| <span style="background-color:#ffffff;color:#252525;">As soon as we execute the code, </span><span style="background-color:#ffffff;color:#252525;">'''matplotlib '''</span><span style="background-color:#ffffff;color:#252525;">will display the graphs.</span>
+
This shows the proportion of each '''FOSS type''' as a percentage of the total '''workshops'''.
 
+
<span style="background-color:#ffffff;">The </span><span style="background-color:#ffffff;">'''pie chart'''</span><span style="background-color:#ffffff;"> displays the distribution of </span><span style="background-color:#ffffff;">'''workshops '''</span><span style="background-color:#ffffff;">by </span><span style="background-color:#ffffff;">'''FOSS category'''</span><span style="background-color:#ffffff;">.</span>
+
 
+
<span style="background-color:#ffffff;">This shows the proportion of each </span><span style="background-color:#ffffff;">'''FOSS type'''</span><span style="background-color:#ffffff;"> as a percentage of the total </span><span style="background-color:#ffffff;">'''workshops'''</span><span style="background-color:#ffffff;">.</span>
+
  
 
Close the window to see the next graph.
 
Close the window to see the next graph.
 
|-
 
|-
| style="border-top:1pt solid #000000;border-bottom:0.5pt solid #000000;border-left:0.5pt solid #000000;border-right:1pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.092cm;padding-right:0.191cm;" | Observing the graphs# <div style="margin-left:1.27cm;margin-right:0cm;">'''Bar Chart'''</div>
+
|| Observing the graphs# '''Bar Chart'''
 +
|| The next '''graph ''' we get is a '''Bar chart'''
  
|| <span style="background-color:#ffffff;color:#252525;">The next </span><span style="background-color:#ffffff;color:#252525;">'''graph '''</span><span style="background-color:#ffffff;color:#252525;">we get is a </span><span style="background-color:#ffffff;color:#252525;">'''Bar chart'''</span><span style="background-color:#ffffff;color:#252525;">. </span>
+
This shows the number of '''workshops '''conducted in each city.  
 
+
<span style="background-color:#ffffff;color:#252525;">This shows the number of </span><span style="background-color:#ffffff;color:#252525;">'''workshops '''</span><span style="background-color:#ffffff;color:#252525;">conducted in each city. </span>
+
  
 
Close the window to go to the next graph.
 
Close the window to go to the next graph.
 
|-
 
|-
|| Observing the graphs# <div style="margin-left:1.27cm;margin-right:0cm;">'''3D Bar Chart'''</div>
+
|| Observing the graphs# '''3D Bar Chart'''
 
+
|| Finally, we see the 3D bar chart.
|| <div style="color:#252525;">Finally, we see the 3D bar chart.</div>
+
  
<span style="background-color:#ffffff;color:#252525;">This graph displays cities, </span><span style="background-color:#ffffff;color:#252525;">'''FOSS types'''</span><span style="background-color:#ffffff;color:#252525;"> and participant counts on the three axes.</span>
+
This graph displays cities, '''FOSS types''' and participant counts on the three axes.
  
<div style="color:#252525;">Close the window.</div>
+
Close the window.
 
|-
 
|-
 
||  
 
||  
|| <span style="background-color:#ffffff;color:#252525;">Let us check the data in the </span><span style="background-color:#ffffff;color:#252525;">'''excel sheet'''</span><span style="background-color:#ffffff;color:#252525;">.</span>
+
|| Let us check the data in the '''excel sheet'''.
 
|-
 
|-
 
|| Navigating to Downloads
 
|| Navigating to Downloads
  
 
'''Files App > Downloads > st_data.xlsx'''
 
'''Files App > Downloads > st_data.xlsx'''
|| Go to the '''Downloads folder '''and double click to open the<span style="color:#ff0000;"> </span>'''st_data.xlsx''' file.
+
|| Go to the '''Downloads folder '''and double click to open the '''st_data.xlsx''' file.
 
|-
 
|-
 
|| Observing the Excel sheet
 
|| Observing the Excel sheet
Line 286: Line 281:
 
Zoom and show the data
 
Zoom and show the data
  
|| <span style="background-color:#ffffff;color:#252525;">The first </span><span style="background-color:#ffffff;color:#252525;">'''sheet '''</span><span style="background-color:#ffffff;color:#252525;">has the </span><span style="background-color:#ffffff;color:#252525;">'''Raw data'''</span><span style="background-color:#ffffff;color:#252525;"> that we extracted from the </span><span style="background-color:#ffffff;">'''Spoken Tutorial '''</span><span style="background-color:#ffffff;color:#252525;">website.</span>
+
|| The first '''sheet '''has the '''Raw data''' that we extracted from the '''Spoken Tutorial '''website.
  
<span style="background-color:#ffffff;color:#252525;">We can see the 10 </span><span style="background-color:#ffffff;color:#252525;">'''columns '''</span><span style="background-color:#ffffff;color:#252525;">here which contain all the data the website had.</span>
+
We can see the 10 '''columns '''here which contain all the data the website had.
 
|-
 
|-
 
|| Sheet 2 - FOSS Visualization
 
|| Sheet 2 - FOSS Visualization
| style="border-top:1pt solid #000000;border-bottom:0.5pt solid #000000;border-left:0.5pt solid #000000;border-right:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.092cm;padding-right:0.191cm;" | <span style="background-color:#ffffff;color:#252525;">In the second sheet named </span><span style="background-color:#ffffff;color:#252525;">'''FOSS Visualization'''</span><span style="background-color:#ffffff;color:#252525;">, the count of workshops for each FOSS category is shown.</span>
+
|| In the second sheet named '''FOSS Visualization''', the count of workshops for each FOSS category is shown.
  
<span style="background-color:#ffffff;color:#252525;">This is the data of the </span><span style="background-color:#ffffff;color:#252525;">'''pie chart'''</span><span style="background-color:#ffffff;color:#252525;"> we have seen earlier. </span>
+
This is the data of the '''pie chart''' we have seen earlier.  
 
|-
 
|-
 
|| Sheet 3 - City Visualization
 
|| Sheet 3 - City Visualization
|| <span style="background-color:#ffffff;color:#252525;">The third sheet </span><span style="background-color:#ffffff;">'''City Visualization</span><span style="background-color:#ffffff;color:#252525;"> '''</span><span style="background-color:#ffffff;color:#252525;">shows the number of workshops conducted per city.</span>
+
|| The third sheet '''City Visualization '''shows the number of workshops conducted per city.
  
  
<span style="background-color:#ffffff;color:#252525;">This is the data of the </span><span style="background-color:#ffffff;color:#252525;">'''bar graph '''</span><span style="background-color:#ffffff;color:#252525;">we have seen earlier.</span>
+
This is the data of the '''bar graph '''we have seen earlier.
 
|-
 
|-
 
|| Sheet 4 - Shri Phanishwar Nath Renu En...
 
|| Sheet 4 - Shri Phanishwar Nath Renu En...
|| <span style="background-color:#ffffff;color:#252525;">The last sheet shows the unique data - </span><span style="background-color:#ffffff;color:#252525;">'''FOSS type'''</span><span style="background-color:#ffffff;color:#252525;">, organizers, workshop dates and participants.</span>
+
|| The last sheet shows the unique data - '''FOSS type''', organizers, workshop dates and participants.
 
|-
 
|-
 
|| Closing the virtual environment
 
|| Closing the virtual environment
Line 312: Line 307:
 
Type '''deactivate'''.
 
Type '''deactivate'''.
 
|-
 
|-
|| Show Slide:'''Applications of Web Scraping'''<div style="margin-left:1.27cm;margin-right:0cm;"></div>
+
|| Show Slide:'''Applications of Web Scraping'''
 
|| '''Web Scraping''' has lots of applications across various fields.
 
|| '''Web Scraping''' has lots of applications across various fields.
  
* <div style="margin-left:1.27cm;margin-right:0cm;">'''Price Monitoring''' - E-commerce websites scrape their <span style="background-color:#ffffff;">Competitor</span> websites.</div>
+
* '''Price Monitoring''' - E-commerce websites scrape their Competitor websites.
* <div style="margin-left:1.27cm;margin-right:0cm;">They monitor their prices and adjust theirs accordingly.</div>
+
* They monitor their prices and adjust theirs accordingly.
  
 
|-
 
|-
|| Show Slide:'''Applications of Web Scraping'''<div style="margin-left:1.27cm;margin-right:0cm;"></div>
+
|| Show Slide:'''Applications of Web Scraping'''
 
||
 
||
* <div style="margin-left:1.27cm;margin-right:0cm;">'''Academic Research''' - Researchers can scrape data from academic journals and websites.</div>
+
* '''Academic Research''' - Researchers can scrape data from academic journals and websites.
* <div style="margin-left:1.27cm;margin-right:0cm;">They collect information for studies and research.</div>
+
* They collect information for studies and research.
* <div style="margin-left:1.27cm;margin-right:0cm;">'''Financial Data Analysis''' - Analysts use web scraping to collect data from financial websites.</div>
+
* '''Financial Data Analysis''' - Analysts use web scraping to collect data from financial websites.
* <div style="margin-left:1.27cm;margin-right:0cm;">They analyze stock prices, market trends etc.</div>
+
* They analyze stock prices, market trends etc.
  
 
|-
 
|-
Line 331: Line 326:
  
 
In this tutorial, we have learnt to
 
In this tutorial, we have learnt to
* <div style="margin-left:1.27cm;margin-right:0cm;">Extract data from websites</div>
+
* Extract data from websites
* <div style="margin-left:1.27cm;margin-right:0cm;">Save data to a CSV file</div>
+
* Save data to a CSV file
* <div style="margin-left:1.27cm;margin-right:0cm;">Perform basic data analysis</div>
+
* Perform basic data analysis and
* <div style="margin-left:1.27cm;margin-right:0cm;">Generate visualizations</div>
+
* Generate visualizations
 
|-
 
|-
| style="border-top:1pt solid #000000;border-bottom:0.5pt solid #000000;border-left:0.5pt solid #000000;border-right:1pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.092cm;padding-right:0.191cm;" | Show Slide:
+
||Show Slide:
  
 
'''Assignment'''
 
'''Assignment'''
| style="border-top:1pt solid #000000;border-bottom:0.5pt solid #000000;border-left:0.5pt solid #000000;border-right:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.092cm;padding-right:0.191cm;" | As an assignment, please do the following:
+
||As an assignment, please do the following:
  
* <div style="margin-left:1.27cm;margin-right:0cm;">Extract the '''workshop '''data using different '''foss filters''', start and end date. </div>
+
* Extract the '''workshop '''data using different '''foss filters''', start and end date.  
* <div style="margin-left:1.27cm;margin-right:0cm;">Write the data to an '''Excel sheet'''.</div>
+
* Write the data to an '''Excel sheet'''.
  
 
|-
 
|-
 
|| Show Slide:'''About the Spoken Tutorial Project'''
 
|| Show Slide:'''About the Spoken Tutorial Project'''
| style="border-top:1pt solid #000000;border-bottom:0.5pt solid #000000;border-left:0.5pt solid #000000;border-right:0.5pt solid #000000;padding-top:0cm;padding-bottom:0cm;padding-left:0.092cm;padding-right:0.191cm;" | The video at the following link summarises the '''Spoken Tutorial Project.'''Please download and watch it.
+
||The video at the following link summarises the '''Spoken Tutorial Project'''.
 +
 
 +
Please download and watch it.
 
|-
 
|-
 
|| Show Slide:
 
|| Show Slide:
Line 367: Line 364:
 
|-
 
|-
 
|| Show Slide:'''Thank You'''
 
|| Show Slide:'''Thank You'''
|| This is '''Sai''' '''Sathwik''', a FOSSEE Semester Long Intern 2024, IIT Bombay signing off.
+
|| This is '''Sai Sathwik''', a FOSSEE Semester Long Intern 2024, IIT Bombay signing off.
  
 
Thanks for joining.
 
Thanks for joining.
 
|-
 
|-
 
|}
 
|}

Latest revision as of 12:35, 5 November 2024

Visual Cue Narration
Show Slide:

Welcome

Hello and welcome to the Spoken Tutorial on Web Scraping
Show Slide:

Learning Objectives

In this tutorial, we will learn to
  • Scrape data from any website
  • Extract it to a CSV file
  • Perform basic data analysis and
  • Generate visualizations
Show Slide:

System Requirements

  • Ubuntu Linux OS 22.04
  • Python 3.12.3
To record this tutorial, I am using
  • Ubuntu Linux OS version 22.04 and
  • Python 3.12.3
Show Slide:Pre-requisites

https://www.spoken-tutorial.org

To follow this tutorial
  • You must have basic knowledge of using Linux Terminal and Python
  • For pre-requisite Linux and Python Tutorials, please visit this website
  • Python libraries required for automation must be installed
Show Slide:Code Files
  • The files used in this tutorial are provided in the Code files link.
  • Please download and extract the files.
  • Make a copy and then use them while practicing.
Show Slide:

Web Scraping

Web Scraping is the automated process of extracting data from websites with software.

We will automate extracting data and information from web pages and parsing HTML content.

Show Slide:

Web Scraping - Libraries

To automate the process of extracting data from a website, we need:
  • Requests library to fetch HTML content from a web page
  • BeautifulSoup library to parse and extract information from the HTML content
  • Matplotlib library to create static, animated, and interactive visualizations
  • Pandas library to provide data structures and data analysis tools
  • XlsxWriter library to create and format Excel files
  • datetime library to handle date operations such as parsing strings into date objects
Show Slide:

Web Scraping - Example

For this tutorial, we will extract data from the Spoken Tutorial statistics webpage.

We will analyze the workshops conducted between 2022 and 2023 on certain software.

Data such as State, City, Institution, Department, Organizer, Date and Participants are handled.

Point to the webscraping.py in downloads folder

Open the Text Editor with the source file

I have created the source file webscraping.py for demonstration.

Now, we will go through the source code in the text editor.

Looking at the code

This source code will extract the necessary data, analyze it and plot graphs.
Highlight:import requests

from bs4 import BeautifulSoup

import pandas as pd

from datetime import datetime

import matplotlib.pyplot as plt

from mpl_toolkits.mplot3d import Axes3D

First, we import the necessary modules for web scraping in Python.
Highlight: We fetch the HTML of the page by sending a requests.get call to the URL.

Then, we parse it with BeautifulSoup to return a soup object for further analysis.
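
As a minimal sketch of this step, and not the actual code in webscraping.py, the fetch and parse could look like the following. The function names, the timeout value and the choice of the built-in "html.parser" are assumptions made for illustration.

```python
import requests
from bs4 import BeautifulSoup


def parse_html(html):
    """Turn raw HTML text into a BeautifulSoup object."""
    # "html.parser" is Python's built-in parser; the parser used in the
    # tutorial's source file is not stated, so this is an assumption.
    return BeautifulSoup(html, "html.parser")


def fetch_soup(url, timeout=30):
    """Download a page and return it as a parsed soup object."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    return parse_html(response.text)
```

Keeping the parsing separate from the download makes it possible to try the parsing on a saved HTML string without a network connection.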

Highlight: extract_table_data function extracts relevant data from an HTML table in the soup object.

Then, we find the table with the provided class name in the HTML.

An empty list is returned if no table is found.

Highlight: We find all the rows of the table except the first header row.

Then, an empty list is initialized to store the extracted data.

For each row in the table, we extract all cells and strip the text content from each cell.

Then, we store the values in a list.

Highlight: Next, we check whether the FOSS type column matches any of the values in the foss filter.

Then, we convert the date column to a datetime object.

This is to verify if it falls within the specified time range.

If both conditions are met, the row's data is added to the list.
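
The extraction and filtering steps described above can be sketched as follows. The column positions, the date format and the parameter names are assumptions, since the layout of the statistics table is not given here; adjust them to match the real page.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Assumed layout: which cell holds the FOSS name and which holds the date.
FOSS_COL = 0
DATE_COL = 1
DATE_FORMAT = "%Y-%m-%d"  # assumed date format


def extract_table_data(soup, table_class, foss_filter, start_date, end_date):
    """Return rows whose FOSS is in the filter and whose date is in range."""
    table = soup.find("table", class_=table_class)
    if table is None:
        return []  # no table with that class on this page

    data = []
    for row in table.find_all("tr")[1:]:  # skip the header row
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        if len(cells) <= max(FOSS_COL, DATE_COL):
            continue  # ignore rows that are too short to check
        if cells[FOSS_COL] not in foss_filter:
            continue
        try:
            row_date = datetime.strptime(cells[DATE_COL], DATE_FORMAT)
        except ValueError:
            continue  # skip rows whose date cannot be parsed
        if start_date <= row_date <= end_date:
            data.append(cells)
    return data
```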

Highlight: This function locates the pagination element in the HTML and extracts the page numbers from it.

Pagination is the process of dividing content into discrete pages.

If page numbers are found, it returns the last one as the total number of pages.

If no pagination is found, it returns 1, assuming there's only one page.
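
A small sketch of this logic is shown below. The class name "pagination" and the assumption that page numbers appear as link text are illustrative guesses about the page's HTML, not details taken from the source file.

```python
from bs4 import BeautifulSoup


def get_total_pages(soup, pagination_class="pagination"):
    """Return the last page number in the pagination element, or 1."""
    pagination = soup.find(class_=pagination_class)
    if pagination is None:
        return 1  # no pagination element: assume a single page

    # Keep only link texts that are plain numbers (skips "Next" and similar).
    page_numbers = [
        int(link.get_text(strip=True))
        for link in pagination.find_all("a")
        if link.get_text(strip=True).isdecimal()
    ]
    return page_numbers[-1] if page_numbers else 1
```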

Highlight: scrape_all_pages function scrapes data from all available pages.

First, it fetches the initial page’s content and then determines the number of pages.

Finally, we extract the relevant data based on the filters.

Highlight: Now, we iterate over all remaining pages starting from page 2.

First, we modify the URL to request each specific page, fetch its content and extract the data.

The data from each page is then appended to the overall dataset and returned.
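
The page loop can be sketched independently of the network by passing the fetch, extract and page-count steps in as functions. The "page" query parameter and all names below are assumptions for illustration; the real site may build its page URLs differently.

```python
from urllib.parse import urlencode


def build_page_url(base_url, page):
    """Append an assumed 'page' query parameter to the base URL."""
    separator = "&" if "?" in base_url else "?"
    return f"{base_url}{separator}{urlencode({'page': page})}"


def scrape_all_pages(base_url, fetch, extract, count_pages):
    """Collect rows from the first page and then from every remaining page."""
    first_page = fetch(base_url)
    total_pages = count_pages(first_page)
    all_rows = list(extract(first_page))

    # Pages are numbered from 1, so the remaining pages start at 2.
    for page in range(2, total_pages + 1):
        page_content = fetch(build_page_url(base_url, page))
        all_rows.extend(extract(page_content))
    return all_rows
```

Passing the helper functions as arguments is only a design choice for this sketch; it lets the loop be tried with stand-in functions before pointing it at a live website.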

Highlight: Next, we will define functions for data analysis and data visualization.

piechart_visualization function generates a pie chart showing the FOSS categories.

Additionally, the FOSS counts are saved to an Excel sheet with the specified sheet name.

Highlight: barchart_visualization function creates a bar chart showing the number of workshops per city.

It then writes the data to an Excel sheet with the specified sheet name.
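
Both charts rely on the same counting step: the number of workshops per FOSS category or per city. A minimal sketch of that counting step with pandas is given below; plotting and Excel output are left out, and the function name and the "Workshops" column label are assumptions.

```python
import pandas as pd


def count_by_column(df, column):
    """Count how many workshops fall under each value of a column."""
    counts = df[column].value_counts().reset_index()
    # Name the columns explicitly so the result is the same across
    # pandas versions, which label this output differently.
    counts.columns = [column, "Workshops"]
    return counts
```

For example, count_by_column(df, "City") gives one row per city, with the most frequent city first.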

Highlight: We now define a function to filter the data frame to find workshops held at a specified college.

It then counts and retrieves the unique FOSS types, departments, and organizers.

If no data is found for the given college, it returns None.

Otherwise, it returns the unique FOSS categories, departments, and organizers.
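
A possible shape for such a function is sketched below. The column names "Institution", "FOSS", "Department" and "Organizer" and the dictionary return value are assumptions; the source file may name and return these differently.

```python
import pandas as pd


def analyze_college(df, college_name):
    """Return the unique FOSS types, departments and organizers of a college."""
    college_df = df[df["Institution"] == college_name]
    if college_df.empty:
        return None  # no workshops recorded for this college

    return {
        "FOSS": college_df["FOSS"].unique().tolist(),
        "Department": college_df["Department"].unique().tolist(),
        "Organizer": college_df["Organizer"].unique().tolist(),
    }
```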

Highlight: We first convert the Participants column to numeric values, filling any invalid entries with 0.

Then we group the data by city and FOSS type and calculate the number of participants.

Unique city names and FOSS types are extracted, and numeric mappings are created to represent them.
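
These preparation steps can be sketched with pandas as follows. Summing the participants within each group and sorting the names before numbering them are assumptions, as are the column names "City" and "FOSS".

```python
import pandas as pd


def prepare_3d_data(df):
    """Group participants by city and FOSS and build numeric axis mappings."""
    df = df.copy()  # leave the caller's DataFrame unchanged

    # Invalid entries become NaN with errors="coerce" and are then set to 0.
    df["Participants"] = pd.to_numeric(df["Participants"], errors="coerce").fillna(0)

    # Assumed aggregation: total participants per (city, FOSS) pair.
    grouped = df.groupby(["City", "FOSS"])["Participants"].sum().reset_index()

    # Map each name to a position on its axis.
    city_index = {city: i for i, city in enumerate(sorted(grouped["City"].unique()))}
    foss_index = {foss: i for i, foss in enumerate(sorted(grouped["FOSS"].unique()))}
    return grouped, city_index, foss_index
```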

Highlight: We can now create a 3D bar chart visualizing the participants in workshops by city and FOSS type.

The axes are labeled, and ticks are mapped to cities and FOSS values with a title.
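
A rough sketch of such a chart using matplotlib's bar3d is shown below. The function name, the bar width of 0.5, the axis labels and the title text are assumptions; the figure is returned so that the caller can decide when to call plt.show().

```python
import matplotlib.pyplot as plt


def plot_3d_bars(rows, cities, foss_types):
    """Draw one bar per (city, FOSS, participants) tuple in rows."""
    fig = plt.figure()
    ax = fig.add_subplot(111, projection="3d")

    # Convert names to numeric positions on the x and y axes.
    xs = [cities.index(city) for city, _, _ in rows]
    ys = [foss_types.index(foss) for _, foss, _ in rows]
    heights = [participants for _, _, participants in rows]
    bases = [0] * len(rows)  # every bar starts at z = 0

    # bar3d(x, y, z, dx, dy, dz): position, then width, depth and height.
    ax.bar3d(xs, ys, bases, 0.5, 0.5, heights)

    ax.set_xlabel("City")
    ax.set_ylabel("FOSS")
    ax.set_zlabel("Participants")
    ax.set_xticks(range(len(cities)))
    ax.set_xticklabels(cities)
    ax.set_yticks(range(len(foss_types)))
    ax.set_yticklabels(foss_types)
    ax.set_title("Participants by City and FOSS")
    return fig, ax
```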

Highlight: We define this function to ensure that an Excel sheet name does not exceed the 31 character limit.

If the name is longer than 31 characters, it truncates the name to 28 characters and appends an ellipsis of three dots.
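
The rule is simple enough to sketch in a few lines. The function and constant names below are assumptions; only the 31 and 28 character figures come from the narration.

```python
MAX_SHEET_NAME_LENGTH = 31  # Excel's limit on worksheet names
ELLIPSIS = "..."


def safe_sheet_name(name):
    """Shorten a sheet name so that it fits within Excel's 31 characters."""
    if len(name) <= MAX_SHEET_NAME_LENGTH:
        return name
    # Keep the first 28 characters and add three dots: 28 + 3 = 31.
    return name[:MAX_SHEET_NAME_LENGTH - len(ELLIPSIS)] + ELLIPSIS
```

The fourth sheet in the output workbook, "Shri Phanishwar Nath Renu En...", has this shape: 28 characters followed by three dots.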

Highlight: We now define the base URL for scraping the data and set filters to focus on specific FOSS types.

We also set a date range from January 1, 2022 to January 1, 2023.

We then convert the start and end dates from string format to datetime objects for comparison.
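
The date conversion can be sketched with datetime.strptime. The "%Y-%m-%d" format and the helper name in_range are assumptions; the format string must match however the dates are actually written.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d"  # assumed format of the date strings

# Convert the string dates to datetime objects so they can be compared.
start_date = datetime.strptime("2022-01-01", DATE_FORMAT)
end_date = datetime.strptime("2023-01-01", DATE_FORMAT)


def in_range(date_text):
    """Check whether a date string falls within the configured range."""
    return start_date <= datetime.strptime(date_text, DATE_FORMAT) <= end_date
```

Comparing datetime objects avoids the mistakes that plain string comparison makes when dates are written in day-first or month-first order.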

Highlight: We now scrape all pages of data using the defined filters.

A DataFrame is then created from this data with the specified columns, and duplicates are removed.
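
A minimal sketch of turning the scraped rows into a DataFrame is given below. The function name is an assumption, and the columns in the usage note are only some of those named in this tutorial.

```python
import pandas as pd


def build_dataframe(rows, columns):
    """Create a DataFrame from scraped rows and drop exact duplicate rows."""
    df = pd.DataFrame(rows, columns=columns)
    # reset_index gives the remaining rows a clean 0..n-1 index.
    return df.drop_duplicates().reset_index(drop=True)
```

For example, build_dataframe(rows, ["City", "Date", "Participants"]) expects every row to be a list with exactly three values.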

Highlight: We can now write the DataFrame data to an Excel file and generate visualizations.

We also analyze and save workshop data for a specific college if available.
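
Writing several DataFrames to one workbook can be sketched with pandas.ExcelWriter and the XlsxWriter engine. The function name and the idea of passing the sheets as a dictionary are assumptions for illustration.

```python
import pandas as pd


def write_sheets(target, sheets):
    """Write each DataFrame in sheets to its own worksheet in one workbook.

    target is a file path or a binary file-like object.
    sheets maps a sheet name (31 characters at most) to a DataFrame.
    """
    with pd.ExcelWriter(target, engine="xlsxwriter") as writer:
        for sheet_name, df in sheets.items():
            df.to_excel(writer, sheet_name=sheet_name, index=False)
```

A call such as write_sheets("st_data.xlsx", {"Workshops Data": df}) would create the workbook with one sheet; more entries in the dictionary add more sheets.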

Highlight: A 3D visualization is generated from the DataFrame.

The data is saved and a confirmation message is printed.

Save the Code in the Downloads Folder

Save the code as webscraping.py in the Downloads folder.
Open the terminal (Ctrl + Alt + T)

Start Virtual Environment

Type

> source Automation/bin/activate

Open the terminal by pressing Control + Alt + T keys simultaneously.

We will open the virtual environment we created for the Automation series.

Type source space Automation forward slash bin forward slash activate.

Then press Enter.

Running the Code

Type

cd Downloads

> python3 webscraping.py

Now type cd Downloads and press Enter.

Then type python3 webscraping.py and press Enter.

Observing the graphs: Pie Chart

As soon as we execute the code, matplotlib will display the graphs.

The pie chart displays the distribution of workshops by FOSS category.

This shows the proportion of each FOSS type as a percentage of the total workshops.

Close the window to see the next graph.

Observing the graphs: Bar Chart

The next graph we get is a bar chart.

This shows the number of workshops conducted in each city.

Close the window to go to the next graph.

Observing the graphs: 3D Bar Chart

Finally, we see the 3D bar chart.

This graph displays cities, FOSS types and participant counts on the three axes.

Close the window.

Let us check the data in the Excel sheet.
Navigating to Downloads

Files App > Downloads > st_data.xlsx

Go to the Downloads folder and double click to open the st_data.xlsx file.
Observing the Excel sheet

We can see in the bottom left corner that we have created four sheets.
Sheet 1 - Workshops Data

Zoom and show the data

The first sheet has the raw data that we extracted from the Spoken Tutorial website.

We can see the 10 columns here, which contain all the data from the website.

Sheet 2 - FOSS Visualization

In the second sheet named FOSS Visualization, the count of workshops for each FOSS category is shown.

This is the data of the pie chart we have seen earlier.

Sheet 3 - City Visualization

The third sheet, City Visualization, shows the number of workshops conducted per city.

This is the data of the bar graph we have seen earlier.

Sheet 4 - Shri Phanishwar Nath Renu En...

The last sheet shows the unique data: FOSS type, organizers, workshop dates and participants.
Closing the virtual environment

Type

> deactivate

Switch back to the terminal to close the virtual environment.

Type deactivate.

Show Slide: Applications of Web Scraping

Web Scraping has many applications across various fields.
  • Price Monitoring - E-commerce websites scrape their competitors' websites.
  • They monitor competitor prices and adjust their own accordingly.
Show Slide:Applications of Web Scraping
  • Academic Research - Researchers can scrape data from academic journals and websites.
  • They collect information for studies and research.
  • Financial Data Analysis - Analysts use web scraping to collect data from financial websites.
  • They analyze stock prices, market trends etc.
Show Slide: Summary

This brings us to the end of this tutorial. Let us summarise.

In this tutorial, we have learnt to

  • Extract data from websites
  • Save data to a CSV file
  • Perform basic data analysis and
  • Generate visualizations
Show Slide:

Assignment

As an assignment, please do the following:
  • Extract the workshop data using different foss filters, start and end date.
  • Write the data to an Excel sheet.
Show Slide: About the Spoken Tutorial Project

The video at the following link summarises the Spoken Tutorial Project.

Please download and watch it.

Show Slide:

Spoken Tutorial Workshops

The Spoken Tutorial Project team conducts workshops and gives certificates.

For more details, please write to us.

Show Slide: Answers for THIS Spoken Tutorial

Please post your timed queries in this forum.
Show Slide:

FOSSEE Forum

For any general or technical questions on Python for Automation, visit the FOSSEE forum and post your question.
Show Slide: Acknowledgement

The Spoken Tutorial Project was established by the Ministry of Education, Government of India.
Show Slide: Thank You

This is Sai Sathwik, a FOSSEE Semester Long Intern 2024, IIT Bombay, signing off.

Thanks for joining.

Contributors and Content Editors

Madhurig, Nirmala Venkat