Web Scraper with Python and BeautifulSoup

Vishalmendekarhere
6 min read · Aug 1, 2020

Understanding web scraping, why it is needed, and how to create a scraper.

source: https://dimensionless.in/wp-content/uploads/2019/03/scraping_cover.jpeg

Artificial Intelligence and Machine Learning are booming today. There are many Machine Learning Engineers, Deep Learning Engineers, and Data Scientists around the world. No matter who they are or where they come from, all of them have one thing in common.

[Q]. What’s that?
[A]. Data.

What AI and ML have got to do with Web Scraping?

The core of all AI, ML, and DL systems is data. The lifecycle of an MLE or Data Scientist includes the steps below.

1. Understanding the Business Objective.
2. Data Acquisition
3. Data Visualization
4. Data Preprocessing
5. Feature Engineering
6. Model Building etc.

No matter how good someone is with maths, statistics, algorithms, etc., without a dataset no Machine Learning or Deep Learning system can be developed. So the soul of all of these lies in data. Usually we already have data in hand to work with, so we don't need to worry about acquiring it.

[Q]. What if we don’t have the data?
[A]. We can scrape the data from a relevant source and go ahead.

What is Web Scraping?

Web Scraping is the process of gathering information from the Web. Even downloading images from a website or copy-pasting content from Wikipedia is a kind of scraping, done manually. However, the term "Web Scraping" is used when we automate this process by writing code or scripts.

Why Web Scraping?

There can be multiple reasons for it.
1. Suppose I own a stock named "XYZ". I want to stay updated on its value every day, but I don't want to visit the stock market website and search for it manually. A web scraper can be handy in this case.

2. Similar to the above example: suppose I am looking for a Machine Learning job in Mumbai and have an account on Naukri.com. I don't like visiting the website, scrolling, and seeing irrelevant posts. In that case, I can have a web scraper keep me updated about the openings relevant to me.

3. When working on an ML or DL problem we are usually given the data already. But gathering your own data and working on it adds more value to your portfolio, as it shows you are capable of working from scratch by creating your own dataset.

Prerequisites For Web Scraping
1. Basics of HTML (https://www.w3schools.com/html/)
2. Python Basics
3. Python Libraries (urllib, requests, BeautifulSoup)

Scraping Rules

  1. One should check a website's Terms and Conditions before scraping it. The scraped data shouldn't be used for commercial purposes.
  2. While scraping, do not bombard the site with multiple requests; that can get you blocked or crash the website. One request per second is fine.
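The one-request-per-second rule can be enforced with a small helper. This is a minimal sketch; `fetch_politely` is a name invented here, and you would pass something like `urllib.request.urlopen` as the `fetch` argument:

```python
import time

def fetch_politely(urls, fetch, delay=1.0):
    """Fetch each URL in turn, sleeping between requests so the
    server sees at most one request per `delay` seconds."""
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(delay)  # be polite: pause before the next request
    return results
```
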

Different types of webpages available:
1. Static WebPages.
2. Dynamic WebPages.
3. Clever WebPages (I call them this).

source: https://www.webhostingsecretrevealed.net/wp-content/uploads/2020/01/how-hosting-works.jpg
  1. Static Webpage
    In a static webpage, a request is sent to the web server over the internet and, as a response, we get back an HTML page. This HTML page is then loaded in our web browser. It is the easiest type of webpage to scrape.

Obtain the webpage

# Import libraries
import urllib.request
from bs4 import BeautifulSoup

# Specify the URL
web_page = 'https://www.naukri.com/machine-learning-jobs-in-mumbai'

Understanding the URL
Base URL: https://www.naukri.com
Search Query: machine-learning-jobs-in-mumbai

By changing this Search Query we can directly change the results without even visiting the webpage.
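For example, the results page for a different search can be built by swapping only the query part. The helper below just illustrates the URL pattern, and the second query slug is a hypothetical example:

```python
base_url = "https://www.naukri.com"

def search_url(query):
    # The query slug is simply appended to the base URL
    return f"{base_url}/{query}"

ml_mumbai = search_url("machine-learning-jobs-in-mumbai")
# A different query, same pattern (hypothetical slug):
ds_pune = search_url("data-science-jobs-in-pune")
```
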

Using the developer tools in your browser, you can inspect the data you want to scrape.

Use Developer Tools to get the pointer of your required data

By inspecting the Job Title we learn its exact position in the HTML. Using these values we can scrape the Job Title.

Read the webpage

# Reading the webpage content
with urllib.request.urlopen(web_page) as url:
    s = url.read()

Parse the webpage in BeautifulSoup

# The soup variable holds the HTML code we can work with
soup = BeautifulSoup(s, 'html.parser')

Getting Job Role

# Look for the h1 heading whose class attribute is "jd-header-title"
job_role = soup.find("h1", attrs={"class": "jd-header-title"})

# Remove extra spaces
job = job_role.text.strip()

The gathered data can easily be stored in Excel, JSON, XML, or CSV format.

from datetime import datetime
import csv

with open('abc.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['Job_Title', job, datetime.now()])

In the same way, we can get any data by using the developer tools on the webpage.
Things were quite easy here, as it was a static webpage where we had only the HTML code.
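The whole flow above can be checked offline against a small HTML snippet, without hitting the live site. The sample markup below is made up, but the class name matches the one we inspected:

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML that urllib would return for a job page
sample_html = """
<html><body>
  <h1 class="jd-header-title">  Machine Learning Engineer  </h1>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")
job_role = soup.find("h1", attrs={"class": "jd-header-title"})
job = job_role.text.strip()
print(job)  # Machine Learning Engineer
```
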

2. Dynamic WebPage
A dynamic webpage is one which, on sending a request to the web server, returns JavaScript code along with the HTML code.
This JavaScript code then executes in our browser to render the view.
Earlier we used the URL to get the content and easily scraped it with BeautifulSoup. But now the code we see while inspecting the webpage and the one we get in our script will be different.

Most e-commerce sites are dynamic in nature and can't be scraped using urllib and BeautifulSoup alone.

[Q]. So can't we scrape them?
[A]. Obviously we can, and for this we are going to use the well-known framework "Selenium".

Selenium is an open-source framework mostly used by testers to test webpages. Since a dynamic page contains JavaScript code that needs to be executed, loading the webpage with Selenium gives us the option to execute that JavaScript, and only then do we get the final HTML code.

Loading Selenium and the webpage

# Loading selenium components
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# Establish the Chrome driver and give it the URL
url = "https://xyz.com"
driver = webdriver.Chrome()
driver.get(url)

# Wait until the page body is present, then read the rendered HTML
try:
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "body"))
    )
    html = driver.page_source
except TimeoutException:
    print("Page took too long to load")
finally:
    driver.quit()

For more information about Selenium scraping, visit: https://www.scrapingbee.com/blog/selenium-python/

3. Clever WebPages
Many large-scale companies have legal policies against unauthorized scraping. Some companies go further and won't even let us scrape their website at all. But how? There are webpages whose content keeps changing, perhaps with some random number prefixed or suffixed, or something similar.

One such example is:

The price is inside a class whose value keeps changing every second, making it tough to scrape the webpage.
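One common workaround, when the class name is randomized, is to anchor on something stable, such as a parent element's id and the tag structure, instead of the volatile class. The markup and the random-looking class below are invented for illustration:

```python
from bs4 import BeautifulSoup

# The span's class "x9f3k2" stands in for an obfuscated, ever-changing one
sample = '<div id="price-box"><span class="x9f3k2">1,234.56</span></div>'
soup = BeautifulSoup(sample, "html.parser")

# Select via the stable parent id and tag name, ignoring the class
price = soup.select_one("#price-box span").text
print(price)  # 1,234.56
```
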

Also, we need to keep checking our code against the website's structure. Sometimes websites get updated, changing their internal structure and rendering our scraper useless.

Conclusion

This was my first blog, aiming to provide a basic understanding of web scraping and to build the simplest web scraper from scratch.

References
For an in-depth understanding, you can check out these links.
https://hackernoon.com/building-a-web-scraper-from-start-to-finish-bb6b95388184

https://www.scrapingbee.com/blog/selenium-python/

https://realpython.com/beautiful-soup-web-scraper-python/
